Enhancing SVM-Based Classification Performance on Indonesian Sentences through TF-IDF and Directional Augmentation
DOI: https://doi.org/10.29407/intensif.v10i1.25179
Keywords: Directional Augmentation, Indonesian Sentences, SVM, Text Classification, TF-IDF
Abstract
Background: The distinction between standard and non-standard Indonesian sentences is traditionally well defined, yet the ubiquity of digital communication has increasingly blurred the boundary. This convergence introduces significant lexical ambiguity in formal contexts and degrades the performance of automated text classification systems.

Objective: This study aims to make Support Vector Machine (SVM) classification more robust to these linguistic irregularities by combining TF-IDF vectorization with a targeted directional augmentation strategy.

Methods: A corpus of 5,394 labeled sentences was processed under a strict anti-leak grouping strategy to prevent semantic leakage between the training, validation, and test sets. To resolve decision boundary overlaps that the baseline model often missed, manual directional augmentation was applied to ambiguous sentence structures, enriching the training distribution and its linguistic diversity.

Results: Directional augmentation refined the model's decision margins and improved generalization to unseen groups. Validation accuracy rose from 96.11% to 97.39%, and test accuracy rose from the baseline's 94.39% to 96.16%.

Conclusion: These findings indicate that structurally enriching the dataset mitigates overfitting and improves sensitivity. However, because manual intervention does not scale, future research should prioritize automated augmentation techniques and contextual embeddings to capture deeper linguistic nuances.
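The anti-leak grouping described in the Methods can be illustrated with a minimal sketch. The abstract does not say how groups were defined or what split ratio was used, so the group ids, the 80/10/10 ratio, the seed, and the `group_split` helper below are all assumptions made for illustration, not the authors' implementation. The idea is that sentences sharing a group id (for example, near-duplicates or variants of one source sentence) are always assigned to the same split.

```python
import random
from collections import defaultdict


def group_split(group_ids, fractions=(0.8, 0.1, 0.1), seed=42):
    """Assign sentence indices to train/val/test by whole groups.

    Hypothetical helper: every sentence that shares a group id lands in
    the same split, so related sentences cannot leak across splits.
    The fractions apply to the number of groups, not of sentences.
    """
    by_group = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        by_group[gid].append(idx)

    groups = sorted(by_group)              # fixed order before shuffling
    random.Random(seed).shuffle(groups)    # reproducible shuffle

    n_train = int(len(groups) * fractions[0])
    n_val = int(len(groups) * fractions[1])
    parts = {
        "train": groups[:n_train],
        "val": groups[n_train:n_train + n_val],
        "test": groups[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: sorted(i for g in gs for i in by_group[g])
        for name, gs in parts.items()
    }


# Toy data (assumed): 20 sentences in 10 groups of 2 related sentences each.
group_ids = [i // 2 for i in range(20)]
splits = group_split(group_ids)

for name, indices in splits.items():
    print(name, len(indices), "sentences")
```

In a setup like this, augmented sentences would be added to the training split only, after the split is made, so that the validation and test sets remain untouched by augmentation.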
Copyright (c) 2026 Rianto, Eko Setyo Humanika, Iwan Hartadi Tri Untoro

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


