Enhancing SVM-Based Classification Performance on Indonesian Sentences through TF-IDF and Directional Augmentation

Rianto Rianto; Eko Setyo Humanika; Iwan Hartadi Tri Untoro

doi:10.29407/intensif.v10i1.25179

Authors

Rianto, Universitas Teknologi Yogyakarta, https://orcid.org/0000-0002-5058-4580
Eko Setyo Humanika, Universitas Teknologi Yogyakarta, https://orcid.org/0000-0001-9789-0764
Iwan Hartadi Tri Untoro, Universitas Teknologi Yogyakarta, https://orcid.org/0009-0000-3041-3381

Keywords:

Directional Augmentation, Indonesian Sentences, SVM, Text Classification, TF-IDF

Abstract

Background: The distinction between standard and non-standard Indonesian sentences is traditionally well-defined, yet the ubiquity of digital communication has increasingly blurred these boundaries. This convergence introduces significant lexical ambiguity in formal contexts, complicating automated text classification. Objective: This study aims to make Support Vector Machine (SVM) classification more robust to these linguistic irregularities by combining TF-IDF vectorization with a targeted directional augmentation strategy. Methods: A corpus of 5,394 labeled sentences was partitioned under a strict anti-leak grouping strategy to prevent semantic leakage between the training, validation, and test sets. To resolve decision-boundary overlaps missed by the baseline model, manual directional augmentation was applied to ambiguous sentence structures, enriching the training distribution and its linguistic diversity. Results: Directional augmentation refined the model's decision margins and improved generalization to unseen groups: validation accuracy rose from 96.11% to 97.39%, and test accuracy rose from the baseline's 94.39% to 96.16%. Conclusion: These findings indicate that structurally enriching the dataset mitigates overfitting and improves sensitivity. However, because manual intervention scales poorly, future research should prioritize automated augmentation techniques and contextual embeddings to better capture deep linguistic nuances.
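
The abstract does not spell out how the anti-leak grouping or the TF-IDF step was implemented, so the following is only a minimal sketch under stated assumptions. It assumes each sentence carries a hypothetical group ID (for example, one ID per set of near-duplicate variants), that whole groups are assigned to a single split, and that IDF weights are fitted on the training split alone. The toy sentences, the function names, and the smoothed IDF formula are illustrative choices, not the authors' data or code. The SVM itself is omitted; the resulting vectors are the kind of input a linear SVM would consume.

```python
import math
import random
from collections import Counter

# Hypothetical toy data: (sentence, label, group_id). Sentences that share a
# group_id are treated as variants of one another and must stay together.
SENTENCES = [
    ("saya tidak tahu jawabannya", "standard", 0),
    ("saya tidak tahu jawaban itu", "standard", 0),
    ("gue nggak tau jawabannya", "non-standard", 1),
    ("gue gak tau jawabannya", "non-standard", 1),
    ("mereka sedang belajar di perpustakaan", "standard", 2),
    ("mereka lagi belajar di perpus", "non-standard", 3),
    ("apakah anda sudah makan", "standard", 4),
    ("lo udah makan belum", "non-standard", 5),
]
TEXTS = [s for s, _, _ in SENTENCES]
GROUPS = [g for _, _, g in SENTENCES]


def group_split(group_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Assign whole groups to train/val/test so no group spans two splits."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    n_val = max(1, round(len(groups) * val_frac))
    test_groups = set(groups[:n_test])
    val_groups = set(groups[n_test:n_test + n_val])
    split = {"train": [], "val": [], "test": []}
    for idx, g in enumerate(group_ids):
        if g in test_groups:
            split["test"].append(idx)
        elif g in val_groups:
            split["val"].append(idx)
        else:
            split["train"].append(idx)
    return split


def fit_idf(tokenized_docs):
    """Smoothed inverse document frequency, fitted on the given docs only."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf(tokens, idf):
    """L2-normalized sparse TF-IDF vector; terms unseen during fitting are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


split = group_split(GROUPS)
tokenized = [text.split() for text in TEXTS]

# Fit IDF on training sentences only, then reuse it for validation and test.
idf = fit_idf([tokenized[i] for i in split["train"]])
vectors = {name: [tfidf(tokenized[i], idf) for i in idxs]
           for name, idxs in split.items()}
```

Fitting the IDF weights on the training split only keeps vocabulary statistics from the validation and test groups out of the model, which is in the same spirit as the grouping strategy. In practice a library implementation such as scikit-learn's `GroupShuffleSplit` and `TfidfVectorizer` would likely replace these hand-written helpers.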
