Educational data mining in Moodle data

ABSTRACT


INTRODUCTION
Moodle (Modular Object-Oriented Dynamic Learning Environment) is an open-source Content Management System (CMS) that allows instructors to provide and share course documents, assignments, quizzes, video materials, etc., giving students a quality online learning environment. It is a Virtual Learning Environment (VLE), a virtual platform for users to communicate, learn, and participate [1]. It is also called a Learning Management System (LMS): it enables course instructors to keep track of students and assess their learning performance, evaluate learning material such as assignments and quizzes, and host text-based and video lectures. Moodle not only hosts video and text lectures, assignments, and quizzes but also supports collaborative learning through online discussion forums and keeps track of student activity via its logging system [2]. As a result, the use of Moodle for online learning is growing. Students interact with the Moodle system to access course contents and materials, and in doing so they produce a huge amount of data [3]. These varieties of data can be analyzed to extract meaningful information, so it is important to apply Educational Data Mining (EDM) to extract useful knowledge from raw data. EDM is the process of converting raw data collected from the educational system into meaningful information [4], and it is one of the emerging research areas. It is useful in many application areas, because many problems in the educational environment can be addressed with EDM, such as identifying weak and strong learners, identifying learning needs, reducing dropout rates, enhancing academic achievement by improving learners' performance, and improving teaching and learning processes.
So, the analysis of users' online (OL) activities can be very useful for identifying hidden patterns and extracting meaningful information to build models for prediction. The rapid growth of educational data makes it clear that distilling such a huge amount of data requires an appropriate data mining algorithm [5]. Data Mining (DM) is the extraction of knowledge from large stores of data. Hence it is important to use EDM to utilize the reports that Moodle keeps and to analyze and build predictive models of the activity of the students using the system [4]. EDM uses computational approaches to analyze educational data in order to study educational questions [6]. It is concerned with developing methods that discover useful knowledge from data originating from educational environments. It utilizes DM methods to better understand students' performance in an educational system [7,8]. Several popular EDM methods can be applied to educational data, such as classification, clustering, and regression. Classification is a procedure in which individual items are placed into groups based on quantitative information regarding one or more characteristics inherent in the items and based on a training set of previously labeled items [6,9]. Prediction of student performance, dropout, and retention is the most popular application of classification algorithms in EDM. K-Nearest Neighbors, Decision Tree, Naïve Bayes, Random Forest, etc. are the most used classification algorithms. Clustering is a technique for grouping students according to their learning and interaction patterns [10]. Recommending resources and understanding and preventing academic failure (exam failure) among university students are common applications of clustering in EDM. Hierarchical clustering and K-means are the most used clustering algorithms.
Regression is a data mining technique used to predict numeric values (also called continuous values) from a given dataset. In EDM, regression analysis has been used to predict a student's knowledge. Regression has also been applied to predict whether a student will answer a question correctly enough and to create models that illustrate users' learning behavior [6]. In conclusion, EDM can be described as the process that deals with the automatic extraction and analysis of data from large datasets to explore previously unknown patterns [11].
This research aimed to apply EDM tools and techniques to analyze students' OL data and develop a prediction model of student performance. Classification, one of the EDM methods, is mainly used for the analysis and prediction of students' performance. It is a supervised learning technique that can be used to predict categorical class labels [12]. To build a predictive model, different classification techniques, namely K-Nearest Neighbour (KNN), Naïve Bayes (NB), Support Vector Machine (SVM), CART decision tree, and Random Forest (RF), are applied in this research. The research also includes a comparative analysis of these models to find the best classifier, which helps in the early identification of strong and weak students in the course. In addition, this research aims to find the parameters affecting the performance of students.

RELATED WORKS
In [13], the authors applied three widely used decision tree learning algorithms, ID3, C4.5, and CART, to the classification task. The main goal of this research was to predict student performance in the final exam. For this study, data from 90 engineering students were collected from the Institute of Engineering and Technology at VBS Purvanchal University, Jaunpur (Uttar Pradesh). The study carried out a comparative analysis of ID3, C4.5, and CART in the Weka tool. The evaluation used 10-fold cross-validation and was based on accuracy and execution time. C4.5 had the highest accuracy (67.7778%), while ID3 and CART both reached 62.2222%. However, the time to build the model was lowest for ID3 (0.00 sec), compared with 0.03 sec for C4.5 and 0.09 sec for CART.
The random forest method was used in another study to examine the important variables [14]. The data were collected from two blended learning classes: the first used online discussion-based learning and the second used lecture-based learning. The study compared the prediction models built on these data. The experimental results revealed that in discussion-based learning, learners' active participation in the online forum affects their achievement, while in lecture-based learning, the main online activities of submitting tasks or downloading material do not affect achievement and only log frequency does. In other words, the two cases indicated different important features for the prediction model. Similarly, in another study, classification techniques such as Decision Tree, Naïve Bayes, and Rule-Based were applied in the data mining process [15]. The main goal of this study was to predict students' academic performance using classification. Data from 497 students were collected, including their demographics, previous academic records, and family background information. The experimental results showed Rule-Based to be the best prediction model with the highest accuracy (71.3%), compared with Decision Tree (68.8%) and Naïve Bayes (67.0%) [15].
In [16], four classification techniques, Random Forest, Naïve Bayes, K-Nearest Neighbour, and Decision Tree, were applied to predict students' performance. The study also conducted a comparative analysis of the four classifier models based on Accuracy, Precision, Recall, and F-Measure. Data from 26 students were collected with a total of 11 features: CourseView, AssginView, Assign_submit_update, ResourceView, ForumView, PT I (overall score in programming technique I), PT II (overall score in programming technique II), Assignment (total score in all assignments), LabTotal (total score in all lab work), Midterm, and Performance (students' overall grade as low, medium, or high). The experimental results revealed that Random Forest, with an accuracy of 76.9%, performed better than KNN (69.2%) and Decision Tree (61.5%), but it was outperformed by Naïve Bayes with an accuracy of 93.3%.

METHODS
This section discusses the research methods followed in this research.

Data collection
This study collected the data from students enrolled in the course COMP 341 called Human Computer Interaction (HCI) from the Moodle system of Kathmandu University, Nepal. The types of data used for the analysis were from students' OL activities such as System Log and Quiz Grades.

System log
The system log consists of data of each click made in the system by a user. This dataset consists of 14839 observations and 9 attributes of 128 active users in the system. Figure 1 shows the sample of the system log dataset.

Quiz grades
The grade dataset related to assessments taken during the term contains information about the scores of each student. There are 104 observations and 10 variables in the dataset. Figure 2 shows the sample of the quiz grades dataset.

Data preprocessing
In this step, only the required features related to OL activities in the log system were collected from the Moodle system for the data mining process. These features are the number of clicks made by students in log components such as Assignment.Click.
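The click-count features described above can be sketched as follows. This is a minimal illustration and not the preprocessing used in the study: the log rows, the column layout, and the helper names `count_clicks` and `to_feature_table` are hypothetical, and a real Moodle log export has more columns than assumed here.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (user id, Moodle component, event name).
# Only these three fields are assumed; a real export contains more.
log_rows = [
    ("s01", "File", "Course module viewed"),
    ("s01", "File", "Course module viewed"),
    ("s01", "Assignment", "Submission status viewed"),
    ("s02", "System", "Course viewed"),
    ("s02", "File", "Course module viewed"),
]


def count_clicks(rows):
    """Return {user: Counter({'<Component>.Click': n, ...})}."""
    counts = defaultdict(Counter)
    for user, component, _event in rows:
        counts[user][f"{component}.Click"] += 1
    return counts


def to_feature_table(counts, feature_names):
    """Build one row per user; a Counter returns 0 for features never seen."""
    return {
        user: [per_user[name] for name in feature_names]
        for user, per_user in counts.items()
    }


features = ["Assignment.Click", "File.Click", "System.Click"]
table = to_feature_table(count_clicks(log_rows), features)
```

Each row of `table` can then be joined with the student's grade to form one observation for the later steps.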

Data visualization
Visualizing the data gives a quick review of the facts behind the analysis, which helps the researcher recognize patterns in the students' behavior [17]. So, in this research, different charts are plotted to visualize data related to online learners: a pie chart to visualize the activities in which users participate most in online learning, a correlation plot to visualize the relationship between the independent features (users' online activities in the system) and the dependent feature (users' grade in the course), and a Boruta plot to visualize the features that most influence students' performance in the course.

Feature selection
Feature selection is the process of selecting the relevant feature subset from a full collection of features that contains several irrelevant ones. So, in this step, irrelevant features were removed, which helped in building a prediction model with high accuracy. There are other benefits as well, such as reducing overfitting, reducing the complexity of computation, and increasing model classification accuracy [18]. Feature selection is also very useful for analyzing the relationship between the independent features and the dependent feature, i.e., it helps to identify the independent features that most influence the dependent feature [19]. This research focuses on analyzing the student OL features that have a greater impact on students' performance. In this study, a wrapper-based feature selection method called Boruta was applied for the feature selection task. Boruta is a wrapper method built around the random forest classification algorithm [20]. The main advantage of Boruta is that it decides whether a feature is important or not, i.e., it selects the statistically significant features. So, it helps to obtain all the features in the dataset that are important with respect to the target feature.
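The shadow-attribute idea behind Boruta can be illustrated with a small sketch. This is not the Boruta algorithm itself: Boruta relies on random forest importance scores and a statistical test, whereas the sketch below swaps in the absolute Pearson correlation as a stand-in importance measure so that it needs only the standard library. The feature names, the values, and the helper `shadow_hits` are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def shadow_hits(columns, target, trials=20, seed=0):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(trials):
        # A shadow feature is a shuffled copy, so any link to the target is chance.
        shadow_scores = []
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs(pearson(shadow, target)))
        best_shadow = max(shadow_scores)
        for name, values in columns.items():
            if abs(pearson(values, target)) > best_shadow:
                hits[name] += 1
    return hits


# Hypothetical data: one feature tied to the grade, one constant feature.
target = [float(g) for g in range(30)]
columns = {
    "File.Click": [2 * g + 1 for g in target],
    "Constant.Click": [5.0] * 30,
}
hits = shadow_hits(columns, target)
```

A feature that beats the best shadow in most trials is a candidate for confirmation, and one that rarely does is a candidate for rejection. Boruta makes that decision with a significance test rather than a raw count.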

Applying classification technique
Classification is a supervised learning technique used for the prediction of a predefined class or group. It is widely applied in students' performance prediction. There are many techniques such as Decision Tree, Naïve Bayes, SVM, Neural Network, K Nearest Neighbour, and Random Forest that can be applied to the students' dataset for the classification task [21]. In this study, K Nearest Neighbour, Naïve Bayes, SVM, Random Forest, and CART decision tree were applied for the classification task.

K-nearest neighbour
K Nearest Neighbor (KNN) is a classification technique that classifies unknown objects based on the closest neighbors whose class is already known (i.e., it is an instance-based classifier that operates on unknown instances) [22]. This classifier retains the entire training set during learning and assigns a new unknown instance the class chosen by the majority vote of its nearest neighbors' labels in the training set. KNN is a simple algorithm that is easy to understand and implement for the classification task [23,24].
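The majority-vote rule can be sketched in a few lines. This is an illustrative sketch, assuming Euclidean distance and k = 3, neither of which is stated in the text; the click counts and labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance is an assumption; other distance measures are possible.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical rows of [File.Click, System.Click] with a performance label.
train_X = [[5, 10], [8, 12], [6, 9], [40, 60], [45, 55], [50, 70]]
train_y = ["low", "low", "low", "high", "high", "high"]

prediction = knn_predict(train_X, train_y, [42, 58])
```

Because the whole training set is kept and scanned at prediction time, the method has no separate training step, which is what makes it instance-based.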

Naïve bayes
Naïve Bayes (NB) is a simple probabilistic classifier that finds a probabilistic relationship between classes and their attributes [19]. The NB algorithm is based on Bayes' theorem and computes the probability of the target given the predictor or attribute values. It is a probabilistic classifier that computes the most probable output for a given input and has proven to work satisfactorily in many application domains [21]. It is used when the dimensionality of the input is high [12].
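As a rough illustration of how the class probability is computed from attribute values, the sketch below implements a Gaussian variant of Naïve Bayes. The Gaussian assumption for numeric features is ours; the text does not say which NB variant was used, and the data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # The small constant keeps the variance positive for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(cols, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a shared constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        # Naive assumption: features are independent given the class,
        # so their log densities simply add up.
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Hypothetical rows of [File.Click, System.Click] with a performance label.
X = [[5, 10], [8, 12], [6, 9], [40, 60], [45, 55], [50, 70]]
y = ["low", "low", "low", "high", "high", "high"]
nb_model = fit_gaussian_nb(X, y)
```

Working with log probabilities avoids numerical underflow when many attribute likelihoods are multiplied together.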

Support vector machine (SVM)
SVM is a supervised learning algorithm used for classification and regression. It can be applied to both linear and nonlinear data [23]. It is one of the most popular classification techniques and gives accurate results for many classification and prediction problems [12].

CART decision tree
The CART (Classification and Regression Trees) algorithm is a decision tree algorithm that can build both classification and regression trees. It can handle both categorical and numerical attributes. CART builds a classification tree when it is used to assign observations to classes (for example, two classes), and a regression tree when it is used to predict a numerical variable. Its advantages include handling missing values and using cost complexity pruning to remove unreliable branches from the decision tree, which increases accuracy [13].
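To make the tree-building step concrete, the sketch below finds a single split on one numeric feature. It assumes the Gini impurity criterion, which is a common choice for CART classification trees but is not named in the text, and it covers neither pruning nor missing values. The data and the helper names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split on one numeric feature."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical File.Click counts and pass/fail labels.
file_clicks = [5, 8, 6, 40, 45, 50]
grades = ["fail", "fail", "fail", "pass", "pass", "pass"]
threshold, impurity = best_threshold(file_clicks, grades)
```

A full tree repeats this search over every feature at every node and then recurses into the two resulting subsets.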

Random forest
Random Forest (RF) is a supervised machine learning algorithm used for classification, regression, and other tasks. A decision tree splits each node using the best among all attributes, while RF splits each node using the best among a randomly chosen subset of predictors at that node. RF has been reported to perform well compared with other classification techniques such as SVM and neural networks. Besides this, it is robust against overfitting [16].

Result evaluation
After the classification techniques were applied to the students' dataset, the results of the classifier models were evaluated and reviewed to interpret the results and find the best prediction model. This study evaluated the models using four commonly used performance metrics, Accuracy, Recall, Precision, and F1 (F-measure), which were calculated from the confusion matrix [17]. In classification problems, good classification accuracy is the primary concern [18]. So, the confusion matrix, shown in Table 2, is a suitable method for determining the accuracy. "A confusion matrix of size n x n associated with a classifier shows the predicted and actual classification, where n is the number of different classes" [18].
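The four metrics can be derived from a confusion matrix as in the following sketch. The labels and predictions are made up for illustration and are not results from this study. Rows are taken to be actual classes and columns predicted classes, which is one common convention.

```python
def confusion_matrix(actual, predicted, classes):
    """Build an n x n matrix: rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of all observations that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def per_class_metrics(matrix, i):
    """Precision, recall and F1 for class i, treating it as the positive class."""
    tp = matrix[i][i]
    fp = sum(row[i] for row in matrix) - tp  # predicted as i but actually another class
    fn = sum(matrix[i]) - tp                 # actually i but predicted as another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for eight students.
actual = ["high", "high", "high", "low", "low", "low", "low", "low"]
predicted = ["high", "high", "low", "low", "low", "low", "high", "low"]
matrix = confusion_matrix(actual, predicted, ["high", "low"])
precision, recall, f1 = per_class_metrics(matrix, 0)
```

In a multi-class setting the per-class values are usually averaged, and the averaging scheme should be reported alongside the scores.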

RESULTS AND DISCUSSION
The outcome of this research is divided into two parts: first, the visualization of the data, and second, the implementation of the classification techniques. Figure 3 shows a sample of the preprocessed dataset, which covers 81 students and was collected from the Moodle system for the course Human Computer Interaction (COMP 341), offered by the Department of Computer Science and Engineering of Kathmandu University, Nepal.

Visualization of data
This section visualizes the Moodle data related to students' logs and quiz grades. Figure 4 shows the online activities such as File.Click. Table 3 shows the frequency (count) of users' online activities in Moodle. Figure 6 shows the Boruta result plot for variable importance. Blue box plots represent the minimum, average, and maximum Z scores of the shadow attributes. These are not actual attributes; the Boruta algorithm uses them to decide whether a variable is important, based on the default P-value threshold of 0.01 for the significance test. Red box plots represent the Z scores of rejected variables, and green box plots represent the Z scores of confirmed variables, which are good predictors to include in a classification model. The yellow box plot represents a tentative variable, one that Boruta is unable to classify as important or unimportant [21]. Table 4 shows the features with their mean importance (meanImp) and the decision of confirmed or rejected, obtained after fixing the tentative feature (i.e. Url.Click). Figure 6 and Table 4 indicate that File.Click and System.Click are the most influential features.

Implementation of classification techniques
Moodle data was collected from the course Human Computer Interaction (COMP 341). The dataset is divided into two parts: 60% for the training set and 40% for the testing set. For the classification task, the classification algorithms learn from the training data to construct a model. The testing data are then used to estimate the accuracy of the classifier model. The 5-fold Cross-Validation (CV) method is applied. Models are evaluated on the commonly used performance metrics Accuracy, Recall, Precision, and F1. Figure 7 shows the performance of the five classifier models without feature selection: on the testing dataset, SVM has the highest accuracy (93.94%), ahead of KNN (87.88%), RF (84.85%), CART (84.85%), and NB (69.70%). Figure 8 shows the performance of the five models with the Boruta wrapper-based feature selection method. From Figure 6 and Table 4, the two features File.Click and System.Click are found to be the most influencing features for the prediction model. So, for the further experiment, File.Click and System.Click are considered as independent features and Grade as the dependent feature. The result with the selected features shows that SVM and CART have the same accuracy as before feature selection. However, the other classifiers, KNN, NB, and RF, show different accuracy results.
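The 60/40 split and the 5-fold cross-validation described above can be sketched as follows. This is one possible arrangement and not necessarily the one used in the study: it assumes a simple random (non-stratified) split and assumes that cross-validation is run inside the training portion. The seed and the helper names are arbitrary.

```python
import random


def train_test_split(n_items, train_fraction=0.6, seed=42):
    """Shuffle the indices 0..n_items-1 and cut them into train and test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 81 is the number of students in the preprocessed dataset.
train_idx, test_idx = train_test_split(81)
folds = list(k_fold(train_idx, k=5))
```

Each fold serves once as validation data while the other four are used for fitting, and the held-out test indices are touched only for the final accuracy estimate.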

CONCLUSION
The findings of this research suggest that selecting the most influencing features increases the accuracy of the students' performance prediction model, i.e., there is a strong relationship between users' online activities and their performance. The study helps in the early identification of students' performance and course-related problems, so that instructors can plan early and take appropriate decisions such as improving teaching and learning processes and counseling weak learners. This in turn can improve students' performance and reduce the number of dropouts in the course, which helps the academic achievement of the institution. Future work can implement the concept of learning analytics (LA), which focuses on informing and empowering instructors and learners. Both EDM and LA can be applied to educational data to obtain meaningful information for improving the teaching-learning process. To achieve this, a user model can be developed that fits the instructional design strategies. Different types of learner data related to background, motivation, learning styles, and cognition can be linked with log data. The focus can be on explanatory models that highlight the relationships between these data.