Memetic algorithm for short messaging service spam filter using text normalization and semantic approach

ABSTRACT


SMS, SPAMS AND FILTERS
The tremendous rise in SMS usage is attributed to its ease of use, ubiquity, portability, high open rates, low transaction cost and the inherent trust users place in the channel. This growth has equally attracted spamming to the channel. Spammers are well-organized businesses seeking to make money through email, mobile (SMS), instant messaging, Usenet newsgroups, social networks and Internet telephony without the consent of the subscriber (user). Their merchandise, generally called spam, includes unsolicited advertising, inappropriate or adult-themed content, premium-rate fraud, smishing and even the distribution of malware. SMS spam is thus any unsolicited and unwanted message sent to mobile phone users. Spam is on the rise, and its toll on subscribers and even mobile network operators (MNOs) is intensifying and has proven to be of great concern to all [18-20].

Spams: sources and consequents
SMS spam is generated from various sources. A typical source is number harvesting, which is carried out by Internet sites offering "free" services. End users can also receive mobile spam from the following sources [12]:
− Organizations and individuals that pay the MNO to deliver SMS to subscribers: they are responsible for the highest number of spam messages received on subscribers' mobile phones, although MNOs have adopted and enforced opt-out, or even opt-in, processes that let the user stop receiving promos or ads.
− Organizations that do not pay for the SMS delivered to subscribers: these are usually worse and are considered fraud because they damage MNO brands.
− Individually originated messages that disturb recipients.
Apart from the distracting and annoying effects of spam, there are other serious consequences. Millions of illegitimate messages compete with legitimate ones for transmission, consuming network resources that the MNO could otherwise have allocated to legitimate services [15]. Spamming imposes extra costs on mobile operators, who must maintain and service their communication infrastructure to deliver services effectively. Flooding MNO infrastructure with illegitimate messages can also cause legitimate users to suffer denial of service. The huge volume of spam messages also concerns cellular carriers because the messages traverse the network, cause congestion and hence degrade network performance [16]. The mobile communication industry also faces threats from viruses, Trojan horses, worms and other malware propagated by spam SMS [15]. Fraudulent messaging activities such as phishing, identity theft and other fraud, which were prominent in email services, have migrated to the SMS platform [17,18]. Financial loss and damage to the reputation of both mobile users and the MNO are further issues to be considered [19].

Spam filters
SMS spam filters share similar features and challenges with email spam filters. Both must filter efficiently in real time and must choose between client-side and server-side filtering. The mobile space also faces the challenges of reducing misclassification cost, eliminating false positives (genuine SMS incorrectly classified as spam by the filter) and coping with concept drift, as spammers change their messages in order to evade filters. Most existing approaches to combating SMS spam are imported from successful email solutions [21,22]. However, not all email solutions are applicable to SMS, because the performance of established email spam filters degrades seriously when they are used to filter SMS spam. This is attributed to the limited message size of 160 characters (140 bytes). In addition, these messages are rife with slang, symbols, emoticons and abbreviations that inhibit proper classification [23][24]. To overcome the shortfall of email filters in handling SMS spam, the focus of this research is a combined filtering technique that reduces noise in SMS and expands the message size [25,26]. Spam filters can be divided into broad categories based on the method used to filter spam. They include [27]: list-based, challenge/response, content-based, collaborative and heuristic-based filters.

Challenge-response filters
A challenge-response filter forces the sender of a message to prove they are human by performing a task before the message is delivered. If the task succeeds, the message (and future messages) is delivered to the recipient, while failure to complete the challenge within a certain time period leads to rejection of the message [24]. The most common challenge consists of distorted images and text; to pass it, a user must type the text or arrange the images correctly. Challenge/response can reduce false positives to the barest minimum. Another merit of this approach is its low system resource requirement, since no CPU-intensive pattern matching is needed. However, the approach can cause more problems than it solves. For inexperienced or visually impaired users, the challenges may be unsolvable. Regular users are irritated by the challenges and often choose not to complete them. Also, automated email that a user would want to receive (travel confirmations, online purchase receipts, etc.) is trapped by this approach and never delivered [28][29][30].

List-based filters
− Blacklist: This earliest spam-filtering method blocks unwanted messages from an already created list of senders. Blacklists are records of email addresses, Internet Protocol (IP) addresses and phone numbers that have previously been used to send spam. When a message arrives, the spam filter checks whether its IP address, email address or phone number is on a blacklist. If so, the message is considered spam and rejected. Blacklists ensure that known spammers cannot reach users' inboxes. Their demerit is that they can also misidentify legitimate senders as spammers [24,29].
− Whitelist: Rather than specifying the senders to block, a whitelist specifies the senders to allow messages from. These addresses are stored in a trusted-users list. Most spam filters use a whitelist alongside other techniques to cut down on the number of genuine SMS that are accidentally flagged as spam. A filter that uses only a whitelist automatically blocks anyone who is not approved. Some anti-spam systems use a variation called an automatic whitelist. Here, an unknown sender's address is checked against a database; if the sender has no history of spamming, the message is delivered to the recipient's inbox and the address is added to the whitelist [24,29].
− Greylist: This filter works on the assumption that most spammers send a batch of messages only once. When a message from an unknown address is received, the filter blocks it and returns a delivery failure to the sending server. If the message is resent, which most legitimate servers do, the filter accepts it and adds the address or phone number to the list. Although the overhead of the filter is low, its demerit is the undue delay that genuine messages experience before reaching their recipients [24,29].
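The three list types described here can be combined in a single lookup. The following is a minimal sketch only, assuming senders are identified by a phone-number string; the class and method names (`ListFilter`, `check`, `Verdict`) are hypothetical and are not part of the system described in this paper.

```python
from enum import Enum


class Verdict(Enum):
    DELIVER = "deliver"
    REJECT = "reject"
    RETRY_LATER = "retry_later"


class ListFilter:
    """Toy blacklist/whitelist/greylist check keyed on the sender's number."""

    def __init__(self, blacklist=(), whitelist=()):
        self.blacklist = set(blacklist)
        self.whitelist = set(whitelist)
        self.greylist = set()  # unknown senders seen once, awaiting a resend

    def check(self, sender: str) -> Verdict:
        if sender in self.blacklist:
            return Verdict.REJECT
        if sender in self.whitelist:
            return Verdict.DELIVER
        if sender in self.greylist:
            # The sender retried, so it is treated as legitimate from now on.
            self.greylist.discard(sender)
            self.whitelist.add(sender)
            return Verdict.DELIVER
        # First contact from an unknown sender: refuse temporarily.
        self.greylist.add(sender)
        return Verdict.RETRY_LATER
```

In this sketch the blacklist is checked first, so a blacklisted sender is rejected even if it also appears on the whitelist; a real deployment might order the checks differently.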

Content-based filters
Content-based filtering methods evaluate individual words or phrases found in the mail/message to determine whether the message is spam. This method analyzes the message header, subject and body to discover any distinctive characteristic [30]. Content-based filters are further classified into word-based and heuristic filters. Word-based filters, also known as rule-based filters, use rules about the actual word(s) or phrase(s) in a message to classify it as genuine or spam. Rule features include word type, frequency of occurrence, structure of the text (e.g. font size, colour, etc.), the presence of many periods between letters (e.g. F.R.E.E), the existence of an image, etc. Rules are filter-dependent and can vary from simple to very complex. The demerits of rule-based filters are that: (a) they are knowledge intensive, (b) reviewing spam messages to determine the rules is time consuming, and (c) the rules need regular updates as spammers change their tactics [31][32][33][34].
Conversely, a heuristic-based filter examines message content through various algorithms and resources, and assigns points to words or phrases. Words commonly found in spam, such as "FREE" or "SEX", receive higher scores. Terms commonly found in normal messages receive lower scores. The filter then adds up the scores. If the message receives a certain score or higher (determined by the administrator of the anti-spam application), the filter identifies it as spam and blocks it. Messages with scores lower than the target number are delivered to the user [35]. Examples include the Bayesian filter, KNN classifier, AdaBoost classifier, Gary Robinson technique, Support Vector Machine and Neural Network [36]. A heuristic filter allows many spam filtering methods to be used together, which can result in better performance than any single method by itself.
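The scoring scheme just described can be illustrated with a few lines of code. This is a sketch under stated assumptions: the word scores and the threshold below are invented for illustration and do not come from any filter discussed in this paper.

```python
import re

# Hypothetical per-token scores; a real filter would learn or curate these.
SPAM_SCORES = {"free": 3.0, "win": 2.5, "prize": 2.5, "claim": 2.0, "urgent": 1.5}
HAM_SCORES = {"meeting": -1.0, "tomorrow": -0.5, "thanks": -1.0}
THRESHOLD = 4.0  # assumed cut-off, normally set by the administrator


def score_message(text: str) -> float:
    """Sum the scores of all known tokens; unknown tokens contribute 0."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return sum(SPAM_SCORES.get(t, 0.0) + HAM_SCORES.get(t, 0.0) for t in tokens)


def is_spam(text: str, threshold: float = THRESHOLD) -> bool:
    """Flag the message when its total score reaches the threshold."""
    return score_message(text) >= threshold
```

For example, `is_spam("FREE prize, claim now")` is true with these assumed scores, while a message such as "Thanks, see you at the meeting tomorrow" scores below zero and is delivered.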

SOFT-COMPUTING FRAMEWORK
Bayesian networks
Bayesian networks are based on Bayes' theorem of conditional probability. They have been successfully applied to many domains such as medicine, machine learning, speech recognition, signal processing, natural language processing and cellular networks. They are an attractive machine learning technique that represents domain knowledge and data in an elegant mathematical structure with a simplified visual representation. A Bayesian network shows graphically the probabilistic relationships between a set of variables under uncertainty. It is usually structured as a directed acyclic graph with conditional probability tables (CPTs). A CPT gives the probability of a random variable given the occurrence of its parent nodes. We can apply the same conceptual strategy to spam filters [37].
Bayesian network classifiers are built from the training data. The building process includes structure learning, parameter learning, and building the probability distribution table for each node in the network. There are two major learning processes: (a) structure learning, or causal discovery, in which the network learns its structure and parameters from the provided input data, using one of K2, hill climbing or Tabu search; and (b) probability distribution learning, which is achieved with algorithms such as the Bayes Net estimator, BMA estimator and multinomial estimator. Once structure learning is complete, parameter learning completes the CPT for each feature in the Bayesian network. The network design in Figure 1 is for detecting texts in SMS and helping the model and algorithm to classify these SMS as either genuine/legitimate or spam. Bayesian network design needs to consider the attributes, the search algorithm and the estimation algorithm. Thus, we use the hill-climber search algorithm with five parents for this network, with the simple estimator as the estimation algorithm and a threshold value of 0.5 [38].
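As an illustration of how Bayes' theorem applies to a spam filter, consider a single word w observed in a message. The notation below is introduced here for illustration and is not taken from the original derivation: the posterior probability that the message is spam follows from the class priors and the per-class likelihood of the word.

```latex
P(\text{spam} \mid w) =
  \frac{P(w \mid \text{spam})\, P(\text{spam})}
       {P(w \mid \text{spam})\, P(\text{spam}) + P(w \mid \text{ham})\, P(\text{ham})}
```

In a Bayesian network, the likelihood terms are read from the CPT of the node for w, whose parent is the class node.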

Genetic algorithm (GA)
Inspired by the Darwinian principle of survival of the fittest, a GA consists of a chosen population of potential solutions to a specific task. Each potential solution is an individual, and the optimal one is found using four operators, namely: initialize, select, crossover and mutation [39]. An individual whose genes are close to the optimal is said to be fit. The fitness function determines how close an individual is to the optimal solution [40][41][42]. The basic operators for GA include:
− Initialize: Individual data are encoded into forms suitable for selection. Each encoding type has its merit. Binary encodings are computationally more expensive. Decimal encoding gives greater diversity in the chromosomes and greater variance in the pools generated; floating-point encoding, or its combination, is more efficient than binary. Thus, individuals are encoded as fixed-length vectors for one or more pools of different types. The fitness function evaluates how close a solution is to the optimal, after which individuals are chosen for reproduction. If a solution is found, the individual is considered good and is selected for crossover. The fitness function is the only part with knowledge of the task. The more solutions are found, the higher the fitness value.

− Selection: The best-fit individuals, those closest to the optimal, are chosen to mate. The larger the number selected, the better the chances of yielding fitter individuals. This continues until one is chosen, from the last two or three remaining solutions, to become a parent of new offspring. Selection ensures that the fittest individuals are chosen for mating, but it also allows less fit individuals from the pool to be selected alongside the fittest. A selection that mates only the fittest is elitist and often leads to convergence at a local optimum.
− Crossover: The genes of the best-fit individuals are exchanged to yield a new, fitter pool. There are two crossover types, depending on the encoding used: (a) simple crossover, for a binary-encoded pool, allows single-point or multi-point crossing with all genes from a parent, and (b) arithmetic crossover creates a new pool by adding a percentage of one individual to another.
− Mutation: Chromosomes are altered by changing their genes or their sequence, so that the new pool converges to the global optimum (instead of a local optimum). Genes may change based on the probability given by the mutation rate. Mutation improves the much-needed diversity in reproduction. The algorithm stops if the optimal is found, after a set number of runs in which new pools are created (though this is computationally expensive), or when no better solution is found.
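The operators above can be sketched on a toy problem. The example below is illustrative only and is not the configuration used in this study: it assumes a binary encoding, a stand-in fitness function (the count of 1-bits, often called OneMax), tournament selection, single-point crossover and bit-flip mutation. All function names and parameter values are hypothetical.

```python
import random


def fitness(chromosome):
    """Toy fitness (OneMax): number of 1-bits; stands in for a task-specific measure."""
    return sum(chromosome)


def tournament(pool, rng, k=3):
    """Selection: pick the fittest of k randomly drawn individuals."""
    return max(rng.sample(pool, k), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover on two equal-length binary chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome, rng, rate=0.02):
    """Flip each gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def run_ga(n_genes=20, pool_size=30, generations=50, seed=0):
    rng = random.Random(seed)
    # Initialize: random fixed-length binary vectors.
    pool = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pool_size)]
    best = max(pool, key=fitness)
    for _ in range(generations):
        if fitness(best) == n_genes:  # optimum found, stop early
            break
        pool = [
            mutate(crossover(tournament(pool, rng), tournament(pool, rng), rng), rng)
            for _ in range(pool_size)
        ]
        best = max(pool + [best], key=fitness)  # remember the best seen so far
    return best
```

Tournament selection is used here because it lets less fit individuals occasionally win a draw, which matches the point above that purely elitist selection tends to converge on a local optimum.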
Cultural GA (CGA) is a variant of GA with a belief space defined as follows: (a) Normative (has specific value ranges to which an individual is bound), (b) Domain (has data about the task domain), (c) Temporal (has data about the available event space), and (d) Spatial (has topographical data). In addition, an influence function mediates between the belief space and the pool, altering individuals in the pool so that they conform to the belief space. CGA is chosen to yield a pool that does not violate its belief space, and it helps reduce the number of possible individuals that the GA generates until an optimum is found [43,44].

Motivation / statement of problem
a. Spam has continued to soar with the advent of SMS. The alarming growth of spam alongside the popularity of SMS has created a propitious environment for spammers to exploit subscribers, causing financial loss and emotional distress to users, corporate organizations and mobile network operator(s).
b. Academic researchers and companies are today faced with the challenge of dealing with SMS spam. A major issue has been that existing approaches to resolving SMS spam are imported from successful email anti-spam solutions (Wang et al., 2010). These approaches are unable to tackle SMS spam effectively and efficiently, as their performance is seriously hampered and degraded by the parameters and features used to filter spam.
c. The formulation and design of an effective SMS filter has continued to suffer setbacks because SMS filters are, by design, not as simple as email filters, owing to the limited message size of 160 characters (140 bytes). This, amongst other constraints, limits the number of features that can be selected for training and consequently contributes to poor learning and classification by the learning algorithm.
d. Furthermore, SMS messages are riddled with slang, abbreviations, symbols and emoticons that inhibit proper classification of words or texts [45].
To overcome these and other shortfalls inherent in adapting email filters to SMS spam, a hybrid filtering technique that reduces noise in the form of slang, emoticons and abbreviations in SMS, and that expands the message size, must be employed to enable adequate classification. Thus, our research goal is to propose a hybrid deep learning neural network model for text normalization and semantic expansion in SMS spam filtering.
The proposed model's properties and goals include:
− It performs repetitive tasks without emotional defects.
− It embodies the knowledge of human experts with the help of special software tools, and manipulates data to solve problems and make decisions in that domain.
− Processes are better formalized and defined on machines.
− Knowledge base update is automatic.

MEMETIC BAYESIAN NETWORK EXPERIMENTAL FRAMEWORK
SMS spam filters can be given the capability to transcribe emoticons, abbreviations and slang into standard terms, and to expand message size so that classification algorithms can extract better features. The study will also serve to reduce the orthographic errors found in SMS, chat groups and other social network communication media, which impede machine learning algorithms. This is because, of the various approaches adopted for SMS spam filters, content-based models with text pre-processing have been shown to perform better. Machine translation (MT) performs better when applied to normalized text messages [46]. It can combine multiple approaches to text normalization on noisy data to create a better output. Moreover, extracting only the relevant features and/or parameters to train the classifier has been reported to contribute to the efficiency of SMS spam filters [47][48][49][50]. Thus, we propose a text pre-processing SMS spam filter model with the capability of normalizing and expanding text messages and extracting suitable features as dataset input parameters for training the adopted classification algorithm and model. The study uses the KDD-CUP '99 dataset.

Figure 1. Proposed genetic algorithm trained Bayesian network
The model represented in Figure 1 is explained as follows:
a. Raw text: the original text from the sender, which is passed on for normalization and expansion.
b. Text normalization: employs two dictionaries: (a) an English dictionary, to check whether the text is English so that it can be normalized to its root form, and (b) a slang dictionary, to translate slang into English text. The basic operation of this stage is to replace slang and abbreviations with standard English words from these dictionaries. The Freeling English dictionary and the NoSlang dictionary are proposed.
c. Concepts generation: the already normalized text is semantically analyzed to deduce its concepts. The concepts are provided by the Language Data Base (LDB) BabelNet repository.
d. Word sense disambiguation (WSD): from the variety of concepts generated for a given word, this stage finds the concept that is most relevant to the context of the original message. It also relies on concepts provided by the LDB BabelNet repository.
e. Tokenization unit: tokenization is the process of breaking down a text corpus into individual elements that serve as input for various natural language processing algorithms. Normalized texts are broken into individual words, and stop words and punctuation characters are removed in this unit.
f. Merging rule: employs parameters that define how the results of pre-processing (original text, normalization and disambiguation stages) are combined. For each stage, the merging rule answers the following questions: (a) should it keep the original token(s)?, (b) should text normalization be performed?, (c) should concepts generation be performed?, and (d) should word sense disambiguation be performed?
g. Normalized and expanded text: the combination of the text obtained from the outputs of the preferred stages of the pre-processing model.
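The flow from raw text to normalized and expanded tokens can be sketched as follows. This is a simplified illustration under several assumptions: the small `SLANG`, `STOP_WORDS` and `CONCEPTS` tables are hypothetical stand-ins for the slang dictionary, stop-word list and BabelNet lookup; word sense disambiguation is omitted; and tokenization is done first for simplicity, whereas the model applies it after normalization. The keyword flags play the role of the merging-rule parameters.

```python
import re

# Hypothetical stand-ins for the dictionaries and repository named in the text.
SLANG = {"u": "you", "r": "are", "2": "to", "gr8": "great", "txt": "text", "pls": "please"}
STOP_WORDS = {"a", "an", "the", "to", "is", "are", "and", "of"}
CONCEPTS = {"prize": ["award", "reward"], "great": ["excellent"]}


def tokenize(text):
    """Lower-case the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def normalize(tokens):
    """Replace slang and abbreviations with standard words where known."""
    return [SLANG.get(t, t) for t in tokens]


def expand(tokens):
    """Return related concepts for each token (toy semantic expansion)."""
    out = []
    for t in tokens:
        out.extend(CONCEPTS.get(t, []))
    return out


def preprocess(text, keep_original=False, do_normalize=True, do_concepts=True):
    """Merge the outputs of the selected stages, then remove stop words."""
    original = tokenize(text)
    normalized = normalize(original)
    merged = []
    if keep_original:
        merged += original
    if do_normalize:
        merged += normalized
    if do_concepts:
        merged += expand(normalized)
    return [t for t in merged if t not in STOP_WORDS]
```

With these toy tables, `preprocess("U r gr8, claim the prize!")` yields a longer, cleaner token list than the raw message, which is the effect the normalization and expansion stages aim for.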

Feature selection, training and rationale for choice of model
There is a need to minimize the number of features used as input parameters for classification, since an increase in the number of features adds to the computational complexity of the system. Thus, the CGA algorithm is used to select features from those obtained in the text pre-processing section. The input is the dataset (tokens obtained via tokenization of the normalized and expanded text from the text pre-processing section). The model is made up of the following sections:

− GA Unit: yields a rule-based, genetic representation of the normalized and expanded text. The algorithm initializes the model with a random population, which is then subjected to repeated application of the recombination, mutation, inversion and selection operators to improve the population generated from the original dataset.
− Evaluation Unit: contains a fitness function that measures the quality of a represented solution. It computes the optimality of a solution by comparing its chromosome against all other chromosomes using a predefined function.
− Training Unit: trains the filter based on Bayes' probability theorem. It uses a known SMS corpus of spam and genuine messages/texts. The tokens appearing in each corpus and their total occurrences (scores) are maintained in the database, so that, based on these occurrences, each token in the spam and genuine sets is assigned a probability score reflecting its capacity to mark a text or message as either spam or genuine.

Classification section
Based on the probability with which each word (token) occurs in spam or legitimate messages, each incoming, unseen normalized message is processed and classified as either legitimate or spam by the Bayesian classifier. In the event of misclassification, users can rectify the classification by reading the message and re-adding it to the inbox. This automatically corrects and updates the database for future classification, which makes Bayesian filters quite adaptive.
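The token-count database of the training unit and the classification step can be sketched together. The following is a minimal naive Bayes illustration, not the Bayesian network trained in this study; the class name `TokenBayes` is hypothetical, and add-one (Laplace) smoothing is an assumption introduced here so that unseen tokens do not produce zero probabilities.

```python
import math
from collections import Counter


class TokenBayes:
    """Toy naive Bayes over token counts with add-one smoothing (an assumption)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # token occurrences
        self.docs = {"spam": 0, "ham": 0}                    # messages per class

    def train(self, tokens, label):
        """Add one labelled message; also used to record a user's correction."""
        self.counts[label].update(tokens)
        self.docs[label] += 1

    def _log_score(self, tokens, label):
        # Assumes at least one training message exists for each class.
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for t in tokens:
            score += math.log((self.counts[label][t] + 1) / (total + len(vocab)))
        return score

    def classify(self, tokens):
        spam = self._log_score(tokens, "spam")
        ham = self._log_score(tokens, "ham")
        return "spam" if spam > ham else "ham"
```

A user correction, as described above, amounts to calling `train` again with the message's tokens and the correct label, which shifts the stored counts for future classifications.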

Output section
The expected output of this unit is the filter's classification of the message as either spam or ham.

Experimental model operations
Ojugo [42] described a genetic algorithm trained neural network (GANN) employed in early diabetes detection. GANN is initialized with (n-r!) individual if-then fuzzy rules (i.e. 6-4!). Individual fitness is computed, and 30 individuals are selected via the tournament method to determine the new pool and the selection for mating. Crossover and mutation are applied to help the network learn the dynamic, non-linear underlying features of interest, with multipoint crossover yielding new parents. The new parents contribute to yield new individuals. Mutation is reapplied, and individuals are allotted new random values that still conform to the belief space. The mutation applied depends on how far CGA has progressed on the network and how fit the fittest individual in the pool is (i.e. the fitness of the fittest individual divided by 2). New individuals replace old ones with low fitness so as to create a new pool. The process continues until an individual with a fitness of 0 (i.e. a solution) is found. The rule-based encoding of spam is shown in Table 1, and the generation of the population from parents is shown in Table 2. Initialization/selection via ANN ensures that the first three beliefs are met; mutation ensures that the fourth belief is met. The influence function determines how many mutations take place, and the knowledge of the solution (how close a solution is) has a direct impact on how the algorithm proceeds. The algorithm stops when the best individual has a fitness of 0.3. The model stops if the stop criterion is met. GANN uses the number of epochs to determine the stop criterion.

FINDINGS AND DISCUSSION
Using Naïve Bayes and GA (as standalone models) to benchmark the intelligent system and ascertain how well our hybrid GABN algorithm performed, we obtain the results shown in Figure 2 and Figure 3 respectively. The hybrid GABN (memetic) algorithm outperforms the standalone Naïve Bayes and GA models. However, in terms of the mean processing time required to converge, GABN performed worst. This can be attributed to the fact that: (a) the hybrid model must first use GA as a pre-processor to train the Bayesian network, and (b) for such hybrids, there are always structural dependencies between the underlying heuristics that are merged, as well as conflicts in the data encoding required. These must be resolved in order for the model to perform appropriately.

Model Evaluation
In this study, accuracy, recall, error rate (ER) and specificity are used to evaluate the performance of the detection models. These criteria are defined in terms of the following counts. A true positive (TP) is a case (rule) in which a spam text is correctly classified as spam. A true negative (TN) is a normal text correctly classified as normal. A false negative (FN) is a case in which a spam text is classified as normal data, and a false positive (FP) is a case in which a normal text is classified as spam. The accuracy rate is the overall rate of correct detection on the dataset. ER is the proportion of misclassified cases and reflects the robustness of the classifier. Recall is the proportion of actual spam cases that are correctly detected, while specificity is the percentage of normal texts that are correctly classified as normal. Higher accuracy and recall, and a lower ER, indicate good performance.
To further measure effectiveness and accuracy, we measure the rate of misclassification and the corresponding improvement percentages on both the training and test data sets, as summarized in Tables 3 and 4 respectively. The misclassification rate, and its improvement percentage for the unsupervised (B) model against the supervised (A) model, are computed for each model. Tables 3 and 4 show misclassification error rates for Naïve Bayes, GA and GABN of 23.2%, 4.7% and 1.02% respectively (i.e. the error rate in false-positive and true-negative); consequently, they promise improvement rates of 3.6%, 4.02% and 0.12% respectively.
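The four criteria can be computed directly from the confusion-matrix counts. The sketch below uses the standard textbook definitions, which are assumed here to be the ones intended; the function name `evaluate` is hypothetical, and any counts passed to it in an example are illustrative rather than results from this study.

```python
def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard confusion-matrix metrics, with spam as the positive class."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,    # share of all messages classified correctly
        "error_rate": (fp + fn) / total,  # share of all messages misclassified
        "recall": tp / (tp + fn),         # share of actual spam that was caught
        "specificity": tn / (tn + fp),    # share of actual ham that was kept
    }
```

Accuracy and error rate always sum to 1 under these definitions, so reporting both mainly serves readability.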

CONCLUSION
Given the consequences of spam for users, several concerted efforts to detect spam intrusion in various communication media have paid off, especially in combating email spam. Spam filters work by first receiving part (or all) of the message and then analyzing it in some way to decide whether it is ham (i.e. a legitimate message) or spam. The performance of a spam filter can be measured by the number of false positives (messages incorrectly marked as spam) and false negatives (unidentified spam) that it generates. An ideal spam filter will correctly classify all SMS with almost zero false-positive and false-negative error rates, through tradeoffs between the number of false positives and false negatives.