Analysis of CART and Random Forest on Statistics Student Status at Universitas Terbuka

Abstract: CART and Random Forest (RF) are machine learning methods that are central to this research. CART is used to determine the indicators of student status, and Random Forest is used to improve classification accuracy. Based on the CART results, three parameters affect student status: the year of initial registration, the number of registrations, and credits. Based on the classification accuracy results, RF improves accuracy on the student status data compared with CART, by 1.44% on the training data and by 2.24% on the testing data.


I. INTRODUCTION
Universitas Terbuka (UT), the Open University of Indonesia, has several faculties, one of which is the Faculty of Science and Technology (FST), previously known as the Faculty of Mathematics and Natural Sciences (FMIPA). FST offers a Bachelor of Statistics study program, which was established in 1994 and has produced Statistics graduates from students with diverse characteristics. UT uses an open and distance learning system, supported by 39 service offices spread throughout Indonesia, which is one of its advantages over other universities in Indonesia. UT does not limit the period of study completion and does not apply a dropout system. There is also no restriction on the year of diploma graduation or on age.
Registration is open throughout the year [1] [2]. Despite these advantages and the convenience of studying at UT, one problem must be faced, namely the active status of students. Because students can register at any time, many UT students are inactive in a particular semester, and it is not known when an inactive student will become active again [1].
This study aims to determine which indicators affect student status in order to classify students as active or inactive. The results are also intended to inform a policy for students who are at risk of becoming inactive in the next semester, so as to reduce the number of inactive students in the UT Statistics study program. The classification methods used are CART and Random Forest (RF). Classification and Regression Trees (CART) is a data exploration method based on decision tree techniques. A regression tree is generated when the response variable is numeric, while a classification tree is generated when the response variable is categorical. The tree is formed by binary recursive partitioning of the data, which makes the response variable values within each resulting group more homogeneous. CART is sensitive to new data, so a slight change in the data set can result in a significant change in the decision tree [3]. One way to solve this problem is to use an Ensemble method, that is, a set of classifiers that are trained individually and whose predictions are combined when classifying new data [4] [5]. Ensemble techniques include Bagging [6] [7], Boosting [7] [8], and Random Forest [9] [10] [11]. Random Forest (RF) is a development of the CART method that applies bootstrap aggregating (bagging) and random feature selection [9]. In RF, many trees are grown to form a forest. The analysis is carried out by growing these trees on a data set consisting of n observations and p explanatory variables.
Several studies have shown that RF performs better than other machine learning methods. An empirical study using residential apartment data indicates that RF works better than Boosted Trees [12]. In a study forecasting solar radiation one to six days ahead using MARS, CART, M5, and RF models, RF gave the best accuracy while CART gave the lowest [11]. In research on predicting the delay or early arrival of ships using three approaches, namely ANN Backpropagation (BP), CART, and RF, RF performed better than BP and CART [13]. In research on early cancer detection, where the large number of attributes made dimension reduction necessary, the CART and RF methods were applied, and RF produced the better performance of the two [14].

A. Data and Variables
The research data come from 1493 students of the UT Statistics Study Program in 2019. The data are divided into two parts, 70% training data and 30% testing data, with nine independent variables and one response variable. These variables were taken from the data collection results that are generally used to describe student characteristics.
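The 70%/30% division described above can be sketched as follows. This is a minimal illustration only: the function name `train_test_split_indices`, the seed, and the rounding rule are assumptions, since the paper does not state how the split was drawn.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.7, seed=0):
    """Shuffle row indices and split them into training and testing parts.

    The seed and the use of round() are assumptions made for this sketch.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


# 1493 is the number of students reported in the paper.
train_idx, test_idx = train_test_split_indices(1493)
print(len(train_idx), len(test_idx))
```

Under these assumptions, the 1493 observations divide into 1045 training rows and 448 testing rows.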
The characteristics of UT Statistics Study Program students consist of gender, age, education, marital status, job, initial registration year, number of registrations, credits, and GPA.
The CART classification tree is formed in the following stages:
a. Node splitting
Each split depends only on the value of one independent variable.
b. Terminal node determination
Node t can be used as a terminal node if splitting gives no significant decrease in heterogeneity, if there is only one observation in each child node or the minimum number of observations n is reached, or if the maximum tree depth is reached.

c. Class label marking
Each terminal node is labeled with the class that has the largest number of members in that node.
The formation of the classification tree stops if there is only one observation in each child node or the minimum number of observations n is reached, if all observations in each child node are identical, or if the maximum tree depth is reached. After the maximum tree is formed, tree pruning is carried out to prevent an overly large and complex classification tree, so that an appropriate tree size is obtained based on cost complexity pruning.
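The splitting and labeling rules above can be illustrated with a small sketch. It assumes that heterogeneity is measured with the Gini index, which is a common choice for CART but is not stated in the text. The function names and the toy values of `credits` and `status` are hypothetical and are not taken from the study data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means fully homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on ONE numeric variable with the largest impurity decrease.

    Returns (threshold, decrease); threshold is None if no split reduces impurity.
    """
    n = len(labels)
    parent = gini(labels)
    best_threshold, best_decrease = None, 0.0
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2  # candidate cut between adjacent values
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        decrease = parent - weighted
        if decrease > best_decrease:
            best_threshold, best_decrease = threshold, decrease
    return best_threshold, best_decrease


def majority_label(labels):
    """Class label of a terminal node: the class with the most members."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy data, NOT values from the study.
credits = [6, 9, 12, 30, 45, 60]
status = ["inactive", "inactive", "inactive", "active", "active", "active"]
threshold, decrease = best_split(credits, status)
print(threshold, decrease, majority_label(status[:3]))
```

A node would become terminal when `best_split` returns no threshold or when the decrease falls below a chosen limit. Cost complexity pruning is not covered by this sketch.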
RF is carried out as follows [17]:
a. Draw a random sample of size n with replacement from the data set. This is the bootstrap stage.
b. Using the bootstrap sample, construct a tree until it reaches its maximum size (without pruning). At each node, the split is made by selecting m explanatory variables at random, where m < p. The best split is chosen from the m explanatory variables. This is the random feature selection stage.
c. Repeat steps a and b k times so that a forest of k trees is formed.
The response of an observation is predicted by aggregating the predictions of the k trees, and the classification is based on the majority vote.
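The three RF steps can be sketched in plain Python as below. This is an illustrative sketch, not the implementation used in the study (which used the Orange application). It assumes numeric explanatory variables and Gini impurity as the split criterion, and every function name and toy value is hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(rows, labels, m, rng):
    """Step b: grow an unpruned tree; each node considers m random variables."""
    if len(set(labels)) == 1:
        return labels[0]  # pure node becomes a leaf
    n = len(labels)
    best = None  # (weighted impurity, variable index, threshold)
    for j in rng.sample(range(len(rows[0])), m):  # random feature selection
        distinct = sorted({row[j] for row in rows})
        for low, high in zip(distinct, distinct[1:]):
            t = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:
        # Simplification: if the selected variables are constant, stop here.
        return Counter(labels).most_common(1)[0][0]
    _, j, t = best
    left_idx = [i for i, row in enumerate(rows) if row[j] <= t]
    right_idx = [i for i, row in enumerate(rows) if row[j] > t]
    return (
        j,
        t,
        grow_tree([rows[i] for i in left_idx], [labels[i] for i in left_idx], m, rng),
        grow_tree([rows[i] for i in right_idx], [labels[i] for i in right_idx], m, rng),
    )


def predict_tree(tree, row):
    """Follow the splits until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def random_forest(rows, labels, k, m, seed=0):
    """Steps a to c: k bootstrap samples, one unpruned tree per sample."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(k):  # step c: repeat k times
        sample = [rng.randrange(n) for _ in range(n)]  # step a: bootstrap
        forest.append(
            grow_tree([rows[i] for i in sample], [labels[i] for i in sample], m, rng)
        )
    return forest


def predict_forest(forest, row):
    """Aggregate the k tree predictions by majority vote."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data with p = 2 explanatory variables, NOT the study data.
rows = [(1, 10), (2, 12), (3, 11), (8, 40), (9, 42), (10, 41)]
labels = ["inactive", "inactive", "inactive", "active", "active", "active"]
forest = random_forest(rows, labels, k=25, m=1)
print(predict_forest(forest, (2, 11)), predict_forest(forest, (9, 41)))
```

Here m = 1 and p = 2, so each node sees only one randomly chosen variable. This randomness is what makes the trees differ from one another beyond the bootstrap sampling alone.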

III. RESULT AND DISCUSSION
The following descriptive analysis describes student status based on the characteristics of the data. The table above shows that 70.9% of female students are active, which is higher than the 64.1% of male students. So it can be concluded that women are more dominant than men in taking the Statistics study program. By education, 71% of students with a bachelor's degree (S1) and 100% of students with a master's degree (S2) are active. This suggests that students with S1 or S2 education find it easier to attend lectures because they studied Statistics in their previous studies and take Statistics to improve their competence in data analysis and processing. Students with doctoral (S3) education were 100% inactive, but further inspection showed that there was only one doctoral student, who was not active.
CART can produce a model that is simple and easy to interpret. The resulting CART model is based on the variables that affect the response and act as markers for forming nodes. Based on this research data, it can also be concluded that the saturation point of students studying in the Statistics Study Program is reached by students who have taken lectures for more than seven years. After the analysis using the CART method, the authors built the Ensemble method RF, a development of the CART method, to improve the accuracy of the classification of student status.

Figure 2. ORANGE APP ON CART AND RF
The picture above shows that the application used in this study is Orange, an open-source data mining software. Figure 2 illustrates how this study applies the CART and RF methods and compares them. Both methods have good accuracy in classifying the active status of Statistics students, based on cross-validation results on the training data and testing data.
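When two classifiers are compared on accuracy, it helps to compute the accuracy separately for active and inactive students as well as overall. The sketch below shows one way to do this. It assumes that per-class accuracy means the share of students of a given actual status that the model classifies correctly, and the function names and toy labels are hypothetical, not results from the study.

```python
def class_accuracy(actual, predicted, target):
    """Share of observations with actual class `target` predicted as `target`."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == target and p == target)
    total = sum(1 for a in actual if a == target)
    return hits / total


def overall_accuracy(actual, predicted):
    """Share of all observations whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy labels, NOT results from the study.
actual = ["active", "active", "active", "inactive", "inactive"]
predicted = ["active", "active", "inactive", "inactive", "inactive"]

active_acc = class_accuracy(actual, predicted, "active")
inactive_acc = class_accuracy(actual, predicted, "inactive")
total_acc = overall_accuracy(actual, predicted)
print(active_acc, inactive_acc, total_acc)
```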

IV. CONCLUSION
The CART classification of the status of Statistics Study Program students uses nine parameters: gender, age, education, marital status, job, initial registration year, number of registrations, credits, and GPA. Based on the CART analysis, the parameters that affect student status are the year of initial registration, the number of registrations, and credits. On the training data, RF classifies inactive students more accurately than CART (97.99% versus 93.97%) and also classifies active students more accurately (96.28% versus 96.13%). Likewise, on the testing data, RF classifies inactive students more accurately than CART (94.67% versus 90.67%) and active students more accurately (97.31% versus 95.96%). Based on these results, RF improves accuracy on the student status data compared with CART, by 1.44% on the training data and by 2.24% on the testing data.