Implementation of SMOTE to Improve the Performance of Random Forest Classification in Credit Risk Assessment in Banking

Nafa Nur Adifia Nanda; Yuniar Farida; Wika Dianita Utami

doi:10.29407/intensif.v9i2.23930

Authors

Nafa Nur Adifia Nanda UIN Sunan Ampel Surabaya https://orcid.org/0009-0008-7652-8027
Yuniar Farida UIN Sunan Ampel Surabaya https://orcid.org/0000-0001-8666-4980
Wika Dianita Utami UIN Sunan Ampel Surabaya https://orcid.org/0009-0004-9214-9845

Keywords:

Classification, Credit Assessment, Credit Risk, Imbalance Data, Random Forest, SMOTE

Abstract

Background: Credit is essential to banking operations because it facilitates investment, business expansion, and the fulfillment of financial needs. Credit risk arises when a borrower defaults on payment commitments. Objective: This study aims to evaluate individuals' creditworthiness by classifying their eligibility for credit. Methods: This study uses the Random Forest algorithm to classify credit risk. Random Forest is an ensemble method that combines many decision trees and is recognized for its high classification accuracy. Before classification, the data often cannot be processed directly because of class imbalance. To address this, the study employs SMOTE (Synthetic Minority Over-sampling Technique), an oversampling method that augments the minority class by generating synthetic samples consistent with the existing minority-class data. Results: The best-performing ratio for partitioning training and testing data was 80:20, and applying SMOTE before Random Forest improved the performance metrics. Classification accuracy rose from 91.54% to 94.41%, precision rose from 90.83% to 97.03%, and recall rose from 60.26% to 91.55%. This research contributes to more accurate credit risk classification with Random Forest, which handles complex data effectively, supported by SMOTE to overcome class imbalance. Conclusion: This method helps banks identify high-risk debtors more objectively and efficiently and supports appropriate credit decision-making.
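The abstract does not include code, so the following is only a minimal sketch of the two ideas it describes: SMOTE-style oversampling of the minority class, and the accuracy, precision, and recall metrics used to compare results. It uses the Python standard library only. The function names (`smote`, `balance_training_set`, `classification_metrics`), the toy data, and the choice of `k` are all hypothetical and are not taken from the study. In practice one would more likely use an established implementation such as `SMOTE` from imbalanced-learn together with `RandomForestClassifier` from scikit-learn.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance_training_set(X_train, y_train, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size.
    Assumption: binary labels, applied to the training portion only."""
    minority = [x for x, y in zip(X_train, y_train) if y == minority_label]
    n_majority = len(y_train) - len(minority)
    n_needed = max(0, n_majority - len(minority))
    new_points = smote(minority, n_needed, k=k, seed=seed)
    return X_train + new_points, y_train + [minority_label] * len(new_points)


def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy training data (hypothetical): six majority points, three minority points.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0],
           [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = balance_training_set(X_train, y_train, minority_label=1, k=2)
print("class counts after balancing:", y_bal.count(0), y_bal.count(1))
```

In this sketch the oversampling is applied only to the training portion of a split such as 80:20, so that the held-out test data contains no synthetic samples. Whether the study followed the same order of operations is an assumption here. A low recall on the minority class, like the 60.26% reported before SMOTE, is the kind of gap that `classification_metrics` would expose even when accuracy looks high.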

Author Biographies

Nafa Nur Adifia Nanda, UIN Sunan Ampel Surabaya
Mathematics Department, UIN Sunan Ampel Surabaya

Yuniar Farida, UIN Sunan Ampel Surabaya
Mathematics Department, UIN Sunan Ampel Surabaya

Wika Dianita Utami, UIN Sunan Ampel Surabaya
Mathematics Department, UIN Sunan Ampel Surabaya
