Classifiers ensemble and synthetic minority oversampling techniques for academic performance prediction

The increasing need for data-driven decision making has resulted in the application of data mining in various fields, including the educational sector, where it is referred to as educational data mining. The need to improve the performance of data mining models has also been identified as a gap for future researchers. In Nigeria, higher educational institutions collect various students' data, but these data are rarely used in decision or policy making to improve the academic performance of students. This research work attempts to improve the performance of data mining models for predicting students' academic performance using a stacking classifiers ensemble and the synthetic minority over-sampling technique. The research was conducted by adopting and evaluating the performance of the J48, IBK and SMO classifiers. The individual classifier models, the standard stacking classifier ensemble model and the modified stacking classifiers ensemble model were trained and tested on a data set of 206 students from the Faculty of Science, Federal University Dutse. Students' previous academic performance records at the Unified Tertiary Matriculation Examination and the Senior Secondary Certificate Examination, together with their first-year Cumulative Grade Point Average, are used as data inputs in the WEKA 3.9.1 data mining tool to predict students' graduation classes of degrees at the undergraduate level. The results show that applying the synthetic minority over-sampling technique for class balancing improves the performance of all the models, with the proposed modified stacking classifiers ensemble model outperforming the other classifier models in both performance accuracy and RMSE, making it the best model.


INTRODUCTION
Decision making has gradually become data-driven in recent years, owing to the large amount of data made available by advances in information and communication technology (ICT). Data mining has been applied in various fields such as medicine, marketing, machine learning, artificial intelligence and customer relations. Recently, data mining has been widely applied to educational data sets, where it is referred to as educational data mining (EDM), and it has become a very active research area [1]. This emerging field [2] is concerned with developing methods that discover knowledge in data originating from educational environments, using different data mining techniques and machine learning algorithms. The study in [3] indicates that some problems related to students' success in a course are hard to solve simply because the usual statistical methods are not deep enough to discover the hidden patterns and knowledge useful for planning and organizing educational processes. There is therefore a need to adopt data mining techniques to address problems related to students' success using data originating from educational environments. Various data mining techniques have been implemented in educational data mining studies. The research conducted in [4] categorized the methods used in educational data mining into the following categories: classification, clustering, relationship mining, discovery with models and distillation of data for human judgment. These data mining methods have been applied in many research works and were reported to perform better than other methods. A cross-validation test result in [5] indicates that data mining techniques predicted significantly better than their statistical counterparts. Therefore, [6] suggested that, with the increasing need for data mining and analysis, there is a need to improve the performance of data mining models and machine learning algorithms.

RELATED STUDIES
In recent times, various research studies have been conducted on predicting students' academic performance using data mining techniques and machine learning algorithms. The study in [7] adopted only two classifier algorithms in predicting the dropout features of students. The research was carried out on three different data sets containing various student attributes, such as nationality, sex, city of residence, high school grades, programme enrolled, number of credits earned in the first year of study and average grade in the first year of study; the results indicate that the J48 decision tree algorithm is more accurate than the Naïve Bayes classifier, with an accuracy of 81.1679%. The research considered only two classifier algorithms. Meanwhile, [8] used more classifier algorithms, namely Neural Network (NN), Decision Tree, Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Naïve Bayes and Rule Based, to predict learners' progression in tertiary education. The findings indicate that SVM has the highest performance accuracy of 73.33%, while the lowest performance was recorded by Logistic Regression with an accuracy of 60.05%. Only psychometric factors were considered in conducting that research, yet the academic performance of a student [2] is not the result of a single deciding factor; it hinges on various personal, socio-economic, psychological and other environmental variables.
The work of [9] made use of three classifiers: Naïve Bayes, Decision Tree and Neural Network. In the study, continuous attributes were discretized using optimal equal width binning, and the Synthetic Minority Over-sampling Technique (SMOTE) was used to increase the volume of data, because the acquired data contained limited instances. Neural Network and Naïve Bayes were reported to be more accurate than Decision Tree when optimal equal width binning and SMOTE were applied to the data. Both classifiers achieved an accuracy of 71.6%; however, the Neural Network algorithm was found to be slower than Naïve Bayes, making the Naïve Bayes model the better of the two. The study in [2] focused on identifying slow learners among students and presenting them through a predictive data mining model using classification-based algorithms (Multilayer Perceptron, Naïve Bayes, SMO, J48 and REPTree). The work shows that Multilayer Perceptron has the highest accuracy of 75% and REPTree the lowest with 67.76%. A comparative analysis of three selected classification algorithms, Decision Tree (DT), Naïve Bayes (NB) and Rule Based (RB), was conducted by [10] to predict students' academic performance. The analysis was carried out to discover the best technique for developing a predictive model of first-semester performance for first-year Bachelor of Computer Science students at Universiti Sultan Zainal Abidin. The Rule Based classifier was found to be the best model, achieving the highest performance accuracy of 71.3%. The model in that study does not provide detailed information about students' performance: it predicted only the first-year performance of students, not their graduation performance, and classified performance into poor, average and good.
The research on early prediction of students' Grade Point Average (GPA) by [11] also showed that the support vector machine (SVM) classifier predicts more accurately than the extreme learning machine and neural network methods, with a performance accuracy of 93.06% when students' second-year GPA is considered and 97.98% when their third-year GPA is considered. The performance of three supervised machine learning algorithms was evaluated on students' assessment data by [14] to predict success in a course (either passed or failed); the results indicate that, based on prediction accuracy, ease of learning and user-friendliness, the Naïve Bayes classifier outperforms the decision tree and neural network classifiers.
The research conducted by [13] to predict students' performance shows that Random Forest is a more accurate and faster algorithm than the Decision Tree, K-Nearest Neighbour (IBK) and Multilayer Perceptron algorithms, with an accuracy of 89.23%. Predicting the academic performance of students is challenging [14], since it depends on diverse factors such as personal, socio-economic, psychological and other environmental variables. The study also identified ensemble methods as the most influential development in data mining and machine learning of the past decade. An approach for predicting students' academic performance using an ensemble model was presented in the study of [15]. Stacking was used in [16] for predicting the academic achievement of students: the performance of three classifier algorithms was evaluated, the stacking ensemble technique was used to combine them, and a better root mean square error (RMSE) value of 0.1291 was obtained, compared with 0.1898 for the back-propagation neural network, 0.1314 for M5P and 0.1343 for the support vector machine.
Stacking is one of the ensemble techniques used by researchers with the aim of improving model performance. According to [17], the stacking ensemble technique can combine heterogeneous base classifiers, with a meta-classifier trained for the final prediction. The prediction outputs of the base classifiers are fed directly as input data into the meta-classifier for training and final prediction.
Therefore, this research work considers improving the performance of the stacking classifiers ensemble model so that only instances that are correctly predicted by the base classifiers are fed to the meta-classifier.
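The data flow of standard versus modified stacking can be illustrated with a minimal, self-contained sketch. The threshold "classifiers", the toy CGPA values and the majority-vote filtering criterion below are hypothetical stand-ins introduced only for illustration; the actual study uses J48, IBK and SMO in WEKA, and majority agreement is one plausible reading of "correctly predicted by the base classifiers".

```python
from collections import Counter

# Hypothetical base "classifiers": simple threshold rules on a CGPA feature.
def clf_a(cgpa): return "good" if cgpa >= 3.0 else "poor"
def clf_b(cgpa): return "good" if cgpa >= 2.8 else "poor"
def clf_c(cgpa): return "good" if cgpa >= 3.5 else "poor"

base_classifiers = [clf_a, clf_b, clf_c]

# Toy training instances: (first-year CGPA, true class label).
train = [(3.6, "good"), (3.1, "good"), (2.9, "poor"),
         (2.0, "poor"), (2.85, "good")]

# Standard stacking: every instance's base-level predictions become a
# meta-level training row.
meta_rows = [([clf(x) for clf in base_classifiers], y) for x, y in train]

def majority(preds):
    """Most common label among the base classifiers' predictions."""
    return Counter(preds).most_common(1)[0][0]

# Modified stacking (as proposed here): keep only instances whose base
# predictions agree with the true label, so the meta-classifier trains
# on reliable rows only.
filtered_rows = [(preds, y) for preds, y in meta_rows if majority(preds) == y]

print(len(meta_rows), len(filtered_rows))  # 5 rows before filtering, 4 after
```

The instance with CGPA 2.85 is misclassified by the base majority and is therefore excluded from the meta-level training data, which is the essential difference from standard stacking.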

RESEARCH METHOD
The methodology of educational data mining has not yet been clearly defined, and there are no clear standards about which data mining methods or algorithms are preferable in this context; various methods have been used by different researchers [7]. In general, however, [18] states that data mining processes follow a set of steps that must be executed regardless of the algorithms or methodology implemented. In this study, the Cross Industry Standard Process for Data Mining (CRISP-DM) was adopted.

Data collection
A total of 206 students' records from the Faculty of Science, Federal University Dutse was collected. The data set was divided into two subsets for model training and testing: 164 students' records, representing 80% of the data set, were used for training, while the remaining 42 records, representing 20%, were used for testing.

Data preparation and cleaning
The data preparation phase covers all activities required to construct, from the initial raw data, the final data set that is fed into the WEKA 3.9.1 data mining tool. Real-world data tend to be incomplete, inconsistent and noisy; before they can be utilized by the data mining tool, they have to be pre-processed. The attribute filter in WEKA 3.9.1 was used to remove noisy and incomplete data. The final summary of the attributes used in the experiments after data cleaning is presented in Table 1.
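The removal of incomplete records can be pictured with a small, hypothetical sketch; the field names and values below are invented for illustration, and the study itself relies on WEKA's filters, which offer further options such as imputation.

```python
# Toy records: (UTME score, SSCE grade, first-year CGPA); None marks a
# missing value, mirroring incomplete real-world records.
raw = [
    (230, "B2", 3.4),
    (None, "C4", 2.8),   # missing UTME score -> dropped
    (250, "A1", None),   # missing CGPA -> dropped
    (210, "B3", 2.9),
]

# Keep only fully populated records.
clean = [record for record in raw if all(v is not None for v in record)]
print(len(clean))  # 2 complete records remain
```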

Modeling
The model this research work attempts to develop is a stacking classifiers ensemble model. Because the data set for the study is small and imbalanced, machine learning algorithms likely to perform well on this type of data, based on previous studies, were adopted. The class distribution of the data set before balancing is shown in Figure 1. SMOTE was used to balance the classes in the data set, increasing the volume of the training data from 164 instances to 312 instances so that each of the four classes has an equal 78 instances, as illustrated in Figure 2.
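SMOTE generates synthetic minority instances by interpolating between a minority instance and one of its nearest minority-class neighbours. The simplified pure-Python sketch below, with hypothetical feature values, illustrates the idea; the study itself uses WEKA's SMOTE filter.

```python
import random

def smote(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic minority samples (simplified SMOTE).

    For each synthetic sample, pick a random minority instance, take one
    of its k nearest minority neighbours, and interpolate between them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority instances sorted by squared distance to base.
        others = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic

# Toy minority class: three students' (UTME score, first-year CGPA) pairs.
minority = [(210.0, 2.1), (230.0, 2.4), (250.0, 2.0)]
new_samples = smote(minority, n_new=5)
print(len(minority) + len(new_samples))  # 8 instances after oversampling
```

Each synthetic point lies on the line segment between two real minority instances, so the oversampled class stays inside the region the minority data already occupies.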
The various models were trained and tested using 10-fold cross-validation to avoid over-fitting. The proposed model framework is shown in Figure 3.

Model training and testing
In this research, a series of training and testing runs was carried out on the various models. The data set was divided into two subsets: 80% of the data set was used for training and the remaining 20% for testing. 10-fold cross-validation was used throughout model training and testing to avoid over-fitting the models. Since the data set is small and imbalanced, the SMOTE technique was used to balance the classes and increase the volume of the training data. The WEKA 3.9.1 data mining tool provides a training and testing option for training and testing on the same data set.
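The 80/20 split and 10-fold cross-validation described above can be sketched in a few lines of plain Python; the record contents are placeholders, and WEKA performs these steps internally.

```python
import random

def split_80_20(data, seed=1):
    """Shuffle and split into 80% training / 20% test subsets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

def k_folds(data, k=10):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    for i in range(k):
        validation = data[i::k]  # every k-th instance, starting at offset i
        train_part = [d for j, d in enumerate(data) if j % k != i]
        yield train_part, validation

records = list(range(206))   # stand-in for the 206 student records
train, test = split_80_20(records)
print(len(train), len(test))  # 164 training records, 42 test records
```

With 206 records the 80/20 split reproduces the paper's 164/42 partition, and each cross-validation fold holds out roughly a tenth of the training subset.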

Model evaluation
To evaluate the performance of the various models, performance accuracy and root mean square error (RMSE) were used as metrics; the results are presented in tabular form.
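Both metrics are straightforward to compute. The sketch below shows accuracy over class labels and RMSE over per-class probability estimates (the convention WEKA uses when reporting RMSE for classifiers); the labels and probabilities are illustrative only.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(true_onehot, pred_probs):
    """Root mean square error over per-class probability estimates,
    comparing predicted probabilities with one-hot encoded true labels."""
    squared = [
        (t - p) ** 2
        for row_t, row_p in zip(true_onehot, pred_probs)
        for t, p in zip(row_t, row_p)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative labels for four test instances.
y_true = ["first", "second", "second", "third"]
y_pred = ["first", "second", "third", "third"]
print(accuracy(y_true, y_pred))  # 0.75
```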

RESULTS AND DISCUSSION
The results obtained on training the various models using the training data set before class balancing are presented in Table 2, while the models' performance after class balancing with SMOTE is presented in Table 3. The results in Table 3 indicate that class balancing using SMOTE improves the performance of all the models. Although every model recorded an improvement, the proposed modified stacking classifiers ensemble model outperformed the other models in both performance accuracy and RMSE, making it the best of the classifier models. The performance accuracy results obtained on testing the various classifier models likewise indicate that the modified stacking ensemble model outperformed the others, with an accuracy of 97.7564% and an RMSE of 0.1060.

CONCLUSION
Data mining can be applied to the students' data available to higher educational institutions to develop models that predict students' graduation classes of degrees early, using students' first-year CGPA, UTME subject scores and their corresponding SSCE grades. Resolving the class imbalance problem in the data sets used to develop such models with the synthetic minority over-sampling technique (SMOTE) improves model performance.