Improved ICHI square feature selection method for Arabic classifiers

Received Jan 23, 2020. Revised Mar 3, 2020. Accepted Mar 20, 2020.

Feature selection is one of the most important problems in the text and data mining domain. This paper presents a comparative study of feature selection methods for Arabic text classification. Five feature selection methods were selected: ICHI square, CHI square, Information Gain, Mutual Information and Wrapper. They were tested with five classification algorithms: Bayes Net, Naive Bayes, Random Forest, Decision Tree and Artificial Neural Networks. An Arabic data collection consisting of 9055 documents was used, and the results were compared using four criteria: Precision, Recall, F-measure and Time to build the model. The results showed that the improved ICHI square feature selection method achieved almost all of the best results compared with the other methods.


INTRODUCTION
Research and development in information retrieval, and in the technology it uses, has grown across many applications and data types such as text, images and sound. Much of this data is textual and exists in different languages; here we focus on Arabic, which has flourished in the field of information retrieval and specifically in classification [1]. This growth has led researchers to develop techniques that improve both results and efficiency. Information Retrieval (IR) is a field of computer science of great importance today because of the increasing volume of information [2]. This information may need to be arranged and classified so that it can be easily retrieved. Text classification (TC) is a process that has emerged as important in various fields, especially on the Internet [3].
Text mining is the analysis of data in natural language text and seeks to extract useful information from textual data [4]. It helps organizations extract valuable insights from document content. Text mining is used to increase the efficiency of text retrieval by discovering patterns in the text and the relationships between them, helping texts to be retrieved correctly [5]. The most important applications of text mining are: IR, Information Extraction, Classification and Natural Language Processing (NLP). In short, it extracts useful information from a large amount of data to address data and information search problems [6].
Many processes are applied to the data collection, from pre-processing through to classification, where pre-processing itself contains several steps [7, 8], including: Tokenization, Normalization, Stopword removal and Stemming. There are several algorithms in these

Pre-processing

Overview
Pre-processing is a very important step in data mining, as it supports the results and increases their efficiency, so it is important to complete this step before any further processing. In this research, we use four pre-processing steps to improve the classification process: Tokenization, Normalization, Stopword removal and Stemming. Figure 1 shows the steps of this phase from data collection to classification.

Tokenization
At this phase, sentences are split into a series of words, depending on the white space between them, to facilitate grouping words that belong to the same category. Text is segmented into a series of keywords, phrases, etc., while some characters and symbols such as punctuation are ignored. This series becomes the input to other processes such as data analysis and data mining. Tokenization is language-dependent, and this process is very important in data mining tools [11].
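As a sketch, whitespace tokenization with punctuation removal can be written as follows; the `tokenize` helper and its regular expression are illustrative, not the tool used in this study.

```python
import re

def tokenize(text):
    # Replace punctuation and symbols with spaces, then split on whitespace.
    # \w matches Arabic letters as well, since Python 3 regexes are Unicode-aware.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return cleaned.split()
```

For example, `tokenize("hello, world!")` yields `["hello", "world"]`.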

Normalization
This phase converts words and sentences into a more consistent, canonical form, so that subsequent processing and data handling become easier. Normalization improves text matching and retrieval by unifying variant spellings of the same word, producing better results and higher efficiency. For example, in this process the un-dotted Arabic letter (ى) is replaced by the dotted letter (ي) [12].
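The letter substitution described above can be sketched as a small mapping; the extra alef and ta-marbuta mappings are common Arabic normalizations added here for illustration, not necessarily the exact set used in this study.

```python
def normalize(text):
    # Canonical letter mappings; the example in the text is ى -> ي.
    replacements = {
        "ى": "ي",                      # un-dotted ya -> dotted ya
        "أ": "ا", "إ": "ا", "آ": "ا",  # alef variants -> bare alef (assumed)
        "ة": "ه",                      # ta marbuta -> ha (assumed)
    }
    for src, dst in replacements.items():
        text = text.replace(src, dst)
    return text
```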

Stemming
The stemming process finds the root of the word, reducing the size of the document by removing prefixes and suffixes as well as added letters such as (أ، و، ى). Stemming is performed as a pre-processing step for information retrieval; it also facilitates searching in search engines and thus improves the results [14].
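A minimal light-stemming sketch, stripping one common prefix and one common suffix, could look like the following; the affix lists are illustrative and far smaller than those of real Arabic stemmers.

```python
PREFIXES = ("ال", "و", "ف", "ب", "ك", "ل")           # assumed common prefixes
SUFFIXES = ("ات", "ون", "ين", "ها", "ية", "ه", "ة")  # assumed common suffixes

def light_stem(word):
    # Strip at most one prefix and one suffix, keeping a minimum stem length of 3.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```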

Feature selection methods

Overview
The feature selection process is important for reducing the amount of unwanted data while keeping the most important data. Features are selected according to predefined criteria; feature selection is a pre-processing stage and is important in improving the classification process. Figure 2 shows the stages of the classification process using feature selection methods.

CHI square
CHI square is a statistical test that measures the dependence between two variables drawn from the data sample, here a term and a category, and is widely used as a feature selection method. CHI square is an important pre-processing step in the classification process and is used in classification systems [15].
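For a single term and category, the CHI square statistic can be computed from a 2x2 contingency table of document counts; this is a generic textbook formulation, not the exact implementation used in the experiments.

```python
def chi_square(a, b, c, d):
    # a: docs in the category containing the term
    # b: docs outside the category containing the term
    # c: docs in the category without the term
    # d: docs outside the category without the term
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator if denominator else 0.0
```

A term perfectly aligned with the category, e.g. `chi_square(10, 0, 0, 10)`, scores 20.0, while an independent term, e.g. `chi_square(5, 5, 5, 5)`, scores 0.0.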

Improved CHI square (ICHI)
ICHI square is a development of the CHI square method, obtained by improving its basic properties; it is an extension of the original CHI square method. ICHI square has been applied to Chinese, where the results proved its effectiveness [16]. In addition, ICHI square was used with Arabic using several algorithms: Bayes Net, Naïve Bayes, Naïve Bayes Multinomial, Random Forest, Decision Tree and Artificial Neural Networks, and the results proved effective for Arabic as well [17].

Information gain (IG)
Information Gain measures the number of bits of information obtained for category prediction by knowing whether a word is present or absent in a document [18].
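Using the same 2x2 counts of term presence versus category membership, information gain is the drop in category entropy once the term's presence is known; this sketch follows the standard definition, not code from this study.

```python
import math

def entropy(probs):
    # Shannon entropy in bits, ignoring zero-probability outcomes.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(n11, n10, n01, n00):
    # n11: term present & in category,  n10: term present & not in category
    # n01: term absent & in category,   n00: term absent & not in category
    n = n11 + n10 + n01 + n00
    p_cat = (n11 + n01) / n
    h_cat = entropy([p_cat, 1 - p_cat])
    p_term = (n11 + n10) / n
    h_cond = 0.0
    if p_term > 0:
        h_cond += p_term * entropy([n11 / (n11 + n10), n10 / (n11 + n10)])
    if p_term < 1:
        h_cond += (1 - p_term) * entropy([n01 / (n01 + n00), n00 / (n01 + n00)])
    return h_cat - h_cond
```

`information_gain(10, 0, 0, 10)` gives 1.0 bit (the term fully determines the category), while `information_gain(5, 5, 5, 5)` gives 0.0.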

Mutual information (MI)
Mutual information measures the dependence between (possibly multidimensional) random variables: the amount of information obtained about one random variable by observing another [19].
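A common feature-ranking form of mutual information scores a term by the (pointwise) dependence between its presence and the category; again a textbook sketch, not this study's code.

```python
import math

def mutual_information(n11, n10, n01, n00):
    # Pointwise mutual information (base 2) between term presence and category,
    # using the same 2x2 document counts as above.
    n = n11 + n10 + n01 + n00
    p_term = (n11 + n10) / n
    p_cat = (n11 + n01) / n
    p_joint = n11 / n
    if p_joint == 0:
        return 0.0
    return math.log2(p_joint / (p_term * p_cat))
```

`mutual_information(10, 0, 0, 10)` is 1.0 bit, while an independent term, `mutual_information(5, 5, 5, 5)`, scores 0.0.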

Wrapper
The Wrapper approach is based on the use of learning algorithms: it selects relevant features according to the performance of a learning algorithm. The model is trained with different feature subsets, and the subset that leads to the best model performance is selected; the result can be a complex model of several inter-related features [20].
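Greedy forward selection is one standard way to realize a wrapper; the `evaluate` callback stands in for training and scoring a classifier on a candidate subset, and the whole helper is a sketch rather than the study's implementation.

```python
def wrapper_forward_select(features, evaluate, max_features=None):
    # Repeatedly add the feature whose inclusion most improves the score
    # returned by evaluate(subset); stop when no remaining feature helps.
    selected, best_score = [], float("-inf")
    limit = max_features if max_features is not None else len(features)
    while len(selected) < limit:
        best_feature, best_candidate_score = None, best_score
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best_candidate_score:
                best_feature, best_candidate_score = f, score
        if best_feature is None:
            break  # no remaining feature improves the score
        selected.append(best_feature)
        best_score = best_candidate_score
    return selected
```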

Arabic text classifiers

Overview
To carry out classification, we need algorithms that learn from a categorized training data set and then, in the testing process, assign a new text to the category it belongs to. Classification is supervised learning over a given input. Figure 3 shows the text classification system.

Bayes net (BN)
A Bayes Net classifier is a model that represents states of the world and how they relate to each other. The model can represent entities in a domain and the dependencies between them, capturing how often one entity occurs when another is present. This model is very important because it helps us predict outcomes, and entities are easy to represent with it [21].

Naïve bayes (NB)
The Naive Bayes classifier is based on Bayes' theorem; NB is easy to build and very useful with large data sets. Despite the ease of building the NB model, it outperforms many other classifiers [11]. In the NB analysis process, the final classification is produced by combining all sources of information. The model is created from training data to classify a document into a category [21].
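The combine-all-evidence idea can be sketched as a tiny multinomial Naive Bayes with Laplace smoothing; `TinyNaiveBayes` is an illustrative toy, not the classifier implementation used in the experiments.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace-tokenized documents."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for word in doc.split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best, best_logprob = None, float("-inf")
        for c in self.class_counts:
            logprob = math.log(self.class_counts[c] / total_docs)  # class prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for word in doc.split():
                # Laplace (+1) smoothing avoids zero probabilities.
                logprob += math.log((self.word_counts[c][word] + 1) / denom)
            if logprob > best_logprob:
                best, best_logprob = c, logprob
        return best
```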

Random forest (RF)
Random Forest is a widely used ensemble classification algorithm, particularly effective with large data sets due to its characteristic properties. The Random Forest algorithm is used in many applications, including: network intrusion detection, email spam detection, gene classification, credit card fraud detection, and text classification [22].

Decision tree (DT)
Decision Tree is a classification algorithm used for extracting data in various fields on the Internet and in data mining generally; it has also been used for Arabic text classification [23]. The Decision Tree with CHI square as a feature selection method was applied to an Arabic data collection, and Decision Tree was also used with ICHI square as a feature selection method; the results showed that ICHI square surpassed the regular CHI square [24].

Artificial neural networks (ANNs)
Artificial Neural Networks is an algorithm used for classification; it is a branch of artificial intelligence in which different sets of functions are studied, from training and learning through to the testing process [25]. ANNs have been successfully applied to problems in pattern classification, function approximation, optimization, and pattern matching [26].

Evaluation measures

Overview
The classification process produces many results, and to read and compare them there must be criteria for the comparison. The results are compared using four criteria: Precision, Recall, F-measure and Time to build the model.

Precision
Precision is the positive predictive value: the fraction of retrieved cases that are relevant among all cases retrieved by the process [26-28]. Precision is given by the following equation:

Precision = TP / (TP + FP) (1)

where TP is the number of true positives and FP the number of false positives.

Recall
Recall is described as the sensitivity measure: the fraction of relevant cases that are retrieved out of the total number of relevant cases:

Recall = TP / (TP + FN) (2)

where FN is the number of false negatives.

F-measure
The F-measure summarizes the accuracy of the classifier being tested as the harmonic mean of precision and recall:

F-measure = 2 × Precision × Recall / (Precision + Recall) (3)
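Given true-positive, false-positive and false-negative counts, the three quality measures above can be computed directly; this small helper is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    # F-measure = harmonic mean of the two; guard against empty denominators.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

For example, `precision_recall_f1(8, 2, 2)` gives precision 0.8, recall 0.8 and F-measure 0.8.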

Time
It is the time taken to build the data model, which includes the time used to analyze the data and compute the operations performed by the model.

RESULTS AND DISCUSSION
Table 2 shows the Bayes Net classifier results without pre-processing using several feature selection methods. ICHI square gave the highest Precision, Recall and F-measure, while Information Gain gave the shortest time to build the model. Figure 4 shows the Bayes Net results without pre-processing.
Table 3 shows the Naïve Bayes classifier results without pre-processing. ICHI square gave the highest Precision, Recall and F-measure, while the shortest time to build the model was obtained without feature selection. Figure 5 shows the Naïve Bayes results without pre-processing.
Table 4 shows the Random Forest classifier results without pre-processing. ICHI square gave the highest Precision, Recall and F-measure, while Wrapper gave the shortest time to build the model. Figure 6 shows the Random Forest results without pre-processing.
Table 5 shows the Decision Tree classifier results without pre-processing. ICHI square gave the highest Precision and F-measure, CHI square gave the highest Recall, and Wrapper gave the shortest time to build the model. Figure 7 shows the Decision Tree results without pre-processing.
Table 6 shows the Artificial Neural Networks classifier results without pre-processing. ICHI square gave the highest Precision, CHI square gave the highest Recall and F-measure, and the shortest time to build the model was obtained without feature selection. Figure 8 shows the ANNs results without pre-processing.
Table 7 shows the Bayes Net classifier results with pre-processing. ICHI square gave the highest Precision, Recall and F-measure, while Wrapper gave the shortest time to build the model. Figure 9 shows the Bayes Net results with pre-processing.
Table 8 shows the Naïve Bayes classifier results with pre-processing. CHI square gave the highest Precision, ICHI square gave the highest Recall and F-measure, and the shortest time to build the model was obtained without feature selection. Figure 10 shows the Naïve Bayes results with pre-processing.
Table 9 shows the Random Forest classifier results with pre-processing. ICHI square gave the highest Precision, Recall and F-measure, while the shortest time to build the model was obtained without feature selection. Figure 11 shows the Random Forest results with pre-processing.
Table 10 shows the Decision Tree classifier results with pre-processing. ICHI square gave the highest Precision, Recall and F-measure, while the shortest time to build the model was obtained without feature selection. Figure 12 shows the Decision Tree results with pre-processing.
Table 11 shows the Artificial Neural Networks classifier results with pre-processing. ICHI square gave the highest Precision, Recall and F-measure, while Mutual Information gave the shortest time to build the model. Figure 13 shows the ANNs results with pre-processing.
Table 12 and Figure 14 show the results based on avg. precision without pre-processing. Comparing Precision without pre-processing, the best results are as follows:

Results based on avg. precision
• Bayes Net Classifier when used with ICHI square as a feature selection method.
• Naïve Bayes Classifier when used with ICHI square as a feature selection method.
• Random Forest Classifier when used with ICHI square as a feature selection method.
• Decision Tree Classifier when used with ICHI square as a feature selection method.
• Artificial Neural Networks Classifier when used with ICHI square as a feature selection method.

Figure 14 shows the results based on avg. precision without pre-processing. Table 13 and Figure 15 show the results based on avg. recall without pre-processing. Comparing Recall without pre-processing, the best results are as follows:

Results based on avg. recall
• Bayes Net Classifier when used with ICHI square as a feature selection method.
• Naïve Bayes Classifier when used with ICHI square as a feature selection method.
• Random Forest Classifier when used with ICHI square as a feature selection method.
• Decision Tree Classifier when used with CHI square as a feature selection method.
• Artificial Neural Networks Classifier when used with CHI square as a feature selection method.

Figure 15 shows the results based on avg. recall without pre-processing. Table 14 and Figure 16 show the results based on avg. f-measure without pre-processing; the best results when comparing F-measure without pre-processing are shown in Figure 16. Table 15 and Figure 17 show the results based on avg. time without pre-processing. Comparing Time without pre-processing, the best results (shortest time) are as follows:

Results based on avg. time
• Bayes Net Classifier when used with Information Gain as a feature selection method.

Table 16 and Figure 17 show the results based on avg. precision with pre-processing, comparing Precision with pre-processing. Table 17 and Figure 19 show the results based on avg. recall with pre-processing, comparing Recall with pre-processing. Table 18 and Figure 20 show the results based on avg. f-measure with pre-processing, comparing F-measure with pre-processing. Table 19 and Figure 21 show the results based on avg. time with pre-processing, comparing Time with pre-processing, where the best results are those with the shortest time.

CONCLUSION
Several feature selection methods were tested: ICHI square, CHI square, Information Gain, Mutual Information and Wrapper, with five classification algorithms: Bayes Net, Naïve Bayes, Random Forest, Decision Tree and Artificial Neural Networks. The testing was done both without and with pre-processing, and the results were compared using four performance measures: Precision, Recall, F-measure and Time to build the model. We found that, without pre-processing, ICHI square was the best feature selection method for Precision, and likewise for Recall and F-measure, while the shortest time to build the model was obtained without feature selection. Similarly, with pre-processing, ICHI square gave the best results for Precision, Recall and F-measure, while the shortest time to build the model was again obtained without feature selection. We conclude that ICHI square as a feature selection method for Arabic is superior to the other feature selection methods tested in the same environment.