Correcting optical character recognition result via a novel approach

ABSTRACT


INTRODUCTION
With advances in technology and processing speed, a growing number of sophisticated algorithms for optical character recognition (OCR) systems, including machine learning and neural networks, have been proposed [1]. OCR is the process of converting an image representation of text into an editable text format. It is a technique for digitizing printed and handwritten text [2]. Many applications, including number plate recognition, book scanning, and real-time conversion of handwritten text, benefit from OCR [3]. Unfortunately, the results given by OCR are not always satisfactory: they contain errors that affect the meaning of sentences. These errors fall into three types: i) Missing characters: the OCR result contains fewer characters than the image to be recognized. ii) Added characters: the OCR result contains more characters than appear in the image to be recognized. iii) Modified characters: the OCR result contains the same number of characters as the image to be recognized, but some of them have been replaced by characters different from the original.
To solve this problem, a post-processing stage must be added. At this level, we propose a new approach that works in two stages: it begins with the detection of errors, then carries out the syntactic and semantic correction of the OCR output. The approach is based on the frequency of two successive correct words in the sentence and on a recursive technique. It starts by calculating the frequency of each pair of successive words in the corpus; the pair with the greatest frequency forms a correction center, from which all the words are corrected using the recursive technique described in the section on the proposed approach. This approach belongs to the natural language processing (NLP) domain, a branch of artificial intelligence that analyzes, understands, and generates the natural languages that humans use, so that people can interact with computers in written and spoken contexts using natural human languages rather than computer languages [4].
Int J Inf & Commun Technol, ISSN: 2252-8776
In the literature, NLP has been applied to several languages, such as English [5], French [6], and Arabic [7]. By contrast, the Amazigh language, which is written in Tifinagh characters, has not yet benefited from the advantages offered by this domain. This gap motivated us to work in this area in order to improve OCR results for Tifinagh characters.
Tifinagh [8] is the alphabet used by the Amazigh population. The Royal Institute of Amazigh Culture (IRCAM) has standardized a Tifinagh alphabet of thirty-three characters, as shown in Figure 1.

Figure 1. Tifinagh characters (IRCAM)
The remainder of the paper is organized as follows: section 2 describes the architecture of the OCR system used for the recognition of documents written in Tifinagh, section 3 discusses the proposed NLP approach adopted to improve the results given by the OCR, section 4 presents the experimental results obtained to assess the performance of the proposed approach, and finally a conclusion summarizes the purpose of the work and reports the conclusions drawn.

OPTICAL CHARACTER RECOGNITION SYSTEM
Progress in pattern recognition has accelerated recently owing to many emerging applications that are not only challenging but also computationally demanding, such as OCR [9], document classification, computer vision [10], data mining, shape recognition, and biometric authentication. OCR is becoming an integral part of document scanners and is used in many applications such as postal processing, script recognition, banking, security (for example visa verification), and language identification. Research in this area has been ongoing for over 50 years and the results have been impressive, with recognition rates for printed characters approaching 100% and significant improvements for handwritten cursive character recognition, where recognition rates have exceeded 90%. Nowadays, many organizations depend on OCR systems to eliminate human intervention for better performance and efficiency [11].
In the present work, a character recognition system is presented for recognizing Tifinagh characters extracted from images of documents with embedded text and graphics, for example business card images [12]. Figure 2 describes the steps of our OCR system: it takes an image as input, starts with text extraction, then proceeds through the binarization, segmentation, and recognition phases, and finally outputs a text file.

Text extraction
Through the scanning process, a digital image of the original document is captured. In OCR, optical scanners are used, which generally consist of a transport mechanism plus a sensing device that converts light intensity into grey levels [16]. Printed documents usually consist of black print on a white background. Hence, when performing OCR, it is common practice to convert the multilevel image into a bi-level black-and-white image. Often this process, known as thresholding, is performed on the scanner to save memory space and computational effort. Issues in thresholding can be seen in Figure 3.
The thresholding process is important, as the results of the subsequent recognition depend entirely on the quality of the bi-level image. Nevertheless, the thresholding performed on the scanner is usually very simple [17]. A fixed threshold is used, where grey levels below this threshold are said to be black and levels above are said to be white. For a high-contrast document with a uniform background, a prechosen fixed threshold can be sufficient. However, many documents encountered in practice have a rather large range of contrast. In these cases, more sophisticated thresholding methods are required to obtain a good result [18].
The best thresholding methods are usually those that can vary the threshold over the document, adapting to local properties such as contrast and brightness. However, such methods usually depend on multilevel scanning of the document, which requires more memory and computational capacity. Therefore, such techniques are seldom used in connection with OCR systems, even though they result in better images [11].

Binarization
A skew-corrected text region is binarized, before being segmented, using a simple yet efficient binarization technique that we developed previously [19]. Essentially, this is an improved version of Bernsen's binarization method. In his method, the arithmetic mean of the maximum (Gmax) and the minimum (Gmin) grey levels around a pixel is taken as the threshold for binarizing the pixel. In the present algorithm, the eight immediate neighbors around the pixel being binarized are also taken into account as deciding factors. This kind of approach is especially useful for connecting the disconnected foreground pixels of a character [12].
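As an illustration of the Bernsen rule described above, the following minimal sketch thresholds each pixel at the midpoint of the local maximum and minimum grey levels. It covers only the classic rule, not the improved variant of [19]. The function name, the window radius, the `min_contrast` test, and the global `fallback` threshold for flat regions are assumptions made for this example.

```python
def bernsen_binarize(image, radius=1, min_contrast=15, fallback=128):
    """Binarize a grey-level image given as a list of rows of ints (0..255).

    Returns rows of 0 (dark) and 1 (light). `min_contrast` and `fallback`
    are illustrative assumptions for windows with almost no contrast.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Grey levels in the window centred on (y, x), clipped at the border.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            g_max, g_min = max(window), min(window)
            if g_max - g_min < min_contrast:
                threshold = fallback  # flat region: assumed global threshold
            else:
                threshold = (g_max + g_min) / 2  # Bernsen midpoint rule
            row.append(1 if image[y][x] > threshold else 0)
        result.append(row)
    return result
```

With `radius=1` the window is the pixel plus its eight immediate neighbors. A larger radius adapts more slowly to local changes in contrast.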

Segmentation
Segmentation is the separation of characters or words. Most optical character recognition algorithms segment words into isolated characters, which are recognized individually [11]. Usually, this segmentation is performed by isolating each connected component, that is, each connected black area. This technique is easy to implement, but problems occur if characters touch or if characters are fragmented and consist of several parts. The main problems in segmentation may be divided into four groups:
-Extraction of touching and fragmented characters.
-Distinguishing noise from the text.
-Mistaking graphics or geometry for text.
-Mistaking text for graphics.
In our case, the segmentation process is based on histogram-based thresholding. We denote the histogram of pixel values by $h_0, h_1, \ldots, h_N$, where $h_k$ specifies the number of pixels in an image with greyscale value $k$ and $N$ is the maximum pixel value (typically 255). Ridler and Calvard (1978) and Trussell (1979) proposed a simple algorithm for choosing a single threshold. We will refer to it as the intermeans algorithm.
Initially, a guess has to be made at a possible value for the threshold. From this, the mean values of the pixels in the two classes produced by this threshold are calculated. The threshold is then repositioned to lie exactly halfway between the two means. The mean values are calculated again, and a new threshold is obtained, and so on until the threshold stops changing value [20].
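The iteration described above can be sketched as follows. This is a minimal illustration operating directly on a histogram. The function name, the choice of the global mean as the default initial guess, and the `max_iter` safety limit are assumptions made for the example.

```python
def intermeans_threshold(histogram, initial=None, max_iter=100):
    """Return the intermeans threshold for a histogram h[0..N] of pixel counts."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram is empty")
    levels = range(len(histogram))
    if initial is None:
        # Assumed initial guess: the global mean grey level.
        initial = int(sum(k * histogram[k] for k in levels) / total)
    threshold = initial
    for _ in range(max_iter):
        low_count = sum(histogram[: threshold + 1])
        high_count = total - low_count
        if low_count == 0 or high_count == 0:
            break  # one class is empty, the threshold cannot be refined
        mean_low = sum(k * histogram[k] for k in levels if k <= threshold) / low_count
        mean_high = sum(k * histogram[k] for k in levels if k > threshold) / high_count
        # Reposition the threshold halfway between the two class means.
        new_threshold = int((mean_low + mean_high) / 2)
        if new_threshold == threshold:
            break  # the threshold has stopped changing
        threshold = new_threshold
    return threshold
```

Pixels with a grey level less than or equal to the returned value form the dark class, and the remaining pixels form the light class.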

Recognition
Recognition is the last phase of the OCR system and is used to identify the segmented content. In this step, correlation coefficients are used for classification. The correlation coefficient, computed from the sample data, measures the strength and direction of the relationship between two variables. It lies between -1 and 1, so its magnitude lies between 0 and 1. If there is no relationship between the predicted values and the actual values, the correlation coefficient is 0 or very low (the predicted values are no better than random numbers). As the strength of the relationship between the predicted values and the actual values increases, so does the correlation coefficient. A perfect fit gives a coefficient of 1.0. Thus, the higher the correlation coefficient, the better the match [21]. Corr2 computes the correlation coefficient using:
$$r = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}(A_{ij}-\bar{A})(B_{ij}-\bar{B})}{\sqrt{\left(\sum_{i=1}^{m}\sum_{j=1}^{n}(A_{ij}-\bar{A})^{2}\right)\left(\sum_{i=1}^{m}\sum_{j=1}^{n}(B_{ij}-\bar{B})^{2}\right)}} \quad (1)$$
where
-A and B are two matrices of the same size, with mean values $\bar{A}$ and $\bar{B}$
-m: number of rows
-n: number of columns
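The correlation-based classification can be illustrated with a short sketch. It assumes that each segmented character and each reference template is a small matrix of the same size. The helper `best_match`, the template dictionary, and the convention of returning 0 for a constant matrix are assumptions made for this example, not part of the system described here.

```python
import math


def corr2(a, b):
    """2-D correlation coefficient between two matrices given as lists of rows."""
    flat_a = [value for row in a for value in row]
    flat_b = [value for row in b for value in row]
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("matrices must be non-empty with the same number of elements")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in flat_a) * sum((y - mean_b) ** 2 for y in flat_b)
    )
    # Assumption: a constant matrix carries no shape information, so return 0.
    return numerator / denominator if denominator else 0.0


def best_match(glyph, templates):
    """Return the label of the template most correlated with the glyph."""
    return max(templates, key=lambda label: corr2(glyph, templates[label]))
```

For example, `best_match(glyph, {"label_1": template_1, "label_2": template_2})` returns the label whose template gives the highest coefficient. The labels and templates here are placeholders.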

THE NOISY CHANNEL MODEL
In this section, we present the noisy channel model and show how to apply it to the task of detecting and correcting spelling errors. The noisy channel model was applied to the spelling correction task at about the same time by researchers at AT&T Bell Laboratories [22] and IBM Watson Research [23].
The intuition of the noisy channel model in Figure 4 is to treat the misspelled word as if a correctly spelled word had been "distorted" by being passed through a noisy communication channel. This channel introduces "noise" in the form of substitutions or other changes to the letters, making it hard to recognize the "true" word. The goal, then, is to build a model of the channel. Given this model, we find the true word by passing every word of the language through the model of the noisy channel and seeing which one comes closest to the misspelled word [24].
In the noisy channel model, we imagine that the surface form we see is actually a "distorted" form of an original word passed through a noisy channel. The decoder passes each hypothesis through a model of this channel and picks the word that best matches the noisy surface word [25].

Extraction of candidates
The noisy channel model is a kind of Bayesian inference. We see an observation $x$ (a misspelled word) and our job is to find the word $w$ that generated this misspelled word. Out of all possible words in the vocabulary $V$, we want to find the word $w$ such that $P(w|x)$ is highest. We use the hat notation $\hat{w}$ to mean "our estimate of the correct word":
$$\hat{w} = \underset{w \in V}{\operatorname{argmax}}\ P(w|x) \quad (2)$$
The function $\operatorname{argmax}_{x} f(x)$ means "the $x$ such that $f(x)$ is maximized". Equation (2) thus means that, out of all words in the vocabulary, we want the particular word that maximizes the right-hand side $P(w|x)$.

Bayesian classification
The intuition of Bayesian classification is to use Bayes' rule to transform (2) into a set of other probabilities. Bayes' rule is presented in (3); it gives us a way to break down any conditional probability $P(a|b)$ into three other probabilities:
$$P(a|b) = \frac{P(b|a)\,P(a)}{P(b)} \quad (3)$$
We can then substitute (3) into (2) to get (4):
$$\hat{w} = \underset{w \in V}{\operatorname{argmax}}\ \frac{P(x|w)\,P(w)}{P(x)} \quad (4)$$
We can conveniently simplify (4) by dropping the denominator $P(x)$. Why is that? Since we are choosing a potential correction word out of all words, we will be computing $P(x|w)P(w)/P(x)$ for each word.
However, $P(x)$ does not change for each word; we are always asking about the most likely word for the same observed error $x$, which must have the same probability $P(x)$. Thus, we can choose the word that maximizes this simpler formula [24]:
$$\hat{w} = \underset{w \in V}{\operatorname{argmax}}\ P(x|w)\,P(w) \quad (5)$$
To summarize, the noisy channel model says that we have some true underlying word $w$, and we have a noisy channel that modifies the word into some possible misspelled observed surface form. The likelihood or channel model of the noisy channel producing any particular observation sequence $x$ is modeled by $P(x|w)$. The prior probability of a hidden word is modeled by $P(w)$. We can compute the most probable word $\hat{w}$, given that we have seen some observed misspelling $x$, by multiplying the prior $P(w)$ and the likelihood $P(x|w)$ and choosing the word for which this product is greatest.
We apply the noisy channel approach to correcting non-word spelling errors by taking any word not in the spelling dictionary, generating a list of candidate words, ranking them according to (5), and picking the highest-ranked one. We can modify (5) to refer to this list of candidate words $C$ instead of the full vocabulary $V$ as follows [24]:
$$\hat{w} = \underset{w \in C}{\operatorname{argmax}}\ P(x|w)\,P(w) \quad (6)$$
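The ranking step can be sketched in a few lines: each candidate is scored by the product of its channel probability and its prior, and the highest-scoring one is returned. The function name and the dictionary-based representation of the prior and the channel model are assumptions made for this example, and the numbers in the usage note are invented for illustration.

```python
def noisy_channel_best(observed, candidates, prior, channel):
    """Return the candidate w that maximizes P(x|w) * P(w).

    prior:   dict mapping word w -> P(w)
    channel: dict mapping (observed x, word w) -> P(x|w)
    Missing entries are treated as probability 0 (an assumption of this sketch).
    """
    if not candidates:
        return None

    def score(word):
        return channel.get((observed, word), 0.0) * prior.get(word, 0.0)

    return max(candidates, key=score)
```

For instance, with the hypothetical values `prior = {"aa": 0.01, "ab": 0.001}` and `channel = {("ax", "aa"): 0.001, ("ax", "ab"): 0.1}`, the call `noisy_channel_best("ax", ["aa", "ab"], prior, channel)` selects `"ab"`: its smaller prior is outweighed by a much larger channel probability.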

THE PROPOSED APPROACH
NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By using NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Correction of words
Correcting an erroneous word requires a list of candidates from which to select. In this part, we treat word correction based on a dictionary of Tifinagh words: we calculate the distance between words in order to build the list of candidates, keeping the words that have the maximum distance. The distance between two words is the number of letters that the two words have in common at the same position (a higher value therefore means a closer match), and it is calculated by (7).
The terms of (7) refer to the character of index i in word 1 and to the position k in word 1.
In OCR output, we can find several types of erroneous words: by deletion, insertion, transposition, or substitution.
Table 1 shows that the two types of erroneous words (by transposition and by substitution) have the same number of characters as the correct word. On the other hand, words that are erroneous by deletion have fewer characters than the correct word, and words that are erroneous by insertion have more characters than the correct word. To correct an erroneous word, we calculate its distance to all the dictionary words that have the same size or a size of ±1, after which we return the maximum of the distance by (8), taken over the dictionary words whose size equals the size of the word to be corrected or differs from it by 1. We can then retrieve all the words that have the maximum distance to the wrong word, using the maximum of the distance indicated in (7).
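The candidate selection described in this subsection can be sketched as follows. The code follows the prose definition only (number of identical letters at the same position, dictionary words of the same size ±1, keep the words with the maximum score). It is not a transcription of equations (7) and (8), and the function names are assumptions made for the example.

```python
def positional_matches(word_a, word_b):
    """Number of positions at which the two words carry the same letter."""
    return sum(1 for char_a, char_b in zip(word_a, word_b) if char_a == char_b)


def candidate_words(wrong_word, dictionary):
    """Dictionary words of similar length that share the most letters in place."""
    pool = [w for w in dictionary if abs(len(w) - len(wrong_word)) <= 1]
    if not pool:
        return []
    best = max(positional_matches(wrong_word, w) for w in pool)
    return [w for w in pool if positional_matches(wrong_word, w) == best]


# Illustrative mini-dictionary built from words quoted in this paper.
dictionary = ["ⵉⵏⵙⵉ", "ⵉⵏⵥⵉ", "ⵉⵏⵣⵉ", "ⵉⵏⴳⵉ", "ⵢⴰⵏ", "ⵓⵔⴱⴰ"]
print(candidate_words("ⵉⵏⵍⵉ", dictionary))
```

On this small dictionary the four words that differ from "ⵉⵏⵍⵉ" only in the third character are returned. Note that this literal reading does not realign characters after an inserted or deleted letter, so every position after such an error counts as a mismatch.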

Words frequency
To correct a sentence, we need to start the correction from correct words. Therefore, we determine the positions of the words that belong to the dictionary and whose following word also belongs to the dictionary, by (10), where $w_i$ is the word at position $i$ in the sentence. If the resulting set is empty, it is recalculated by (11).
Using (11), we calculate the frequency in the corpus of the words $w_i$ that are immediately followed by $w_{i+1}$, by (12), where:
- $N_{w_i/w_{i+1}}$ is the number of occurrences of $w_i$ in the corpus followed by $w_{i+1}$
- $N$ is the total number of words in the corpus
If we used (6), we calculate the frequency of the words by (13), with:
- $N_{w_i}$: the number of occurrences of the word in the corpus
- $N$: the total number of words in the corpus
After calculating the frequencies of the words, we seek the maximum frequency by (14).
Then we return the position of the word that has the maximum frequency by (15). The correction process is therefore centered on this position (called the correction center).
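A possible reading of the search for the correction center is sketched below: among the pairs of successive sentence words that both belong to the dictionary, the pair with the highest corpus frequency is kept and the position of its first word is returned. Taking the frequency as the pair count divided by the corpus size, using 0-based positions, and the function name are all assumptions made for this example.

```python
from collections import Counter


def correction_center(sentence, dictionary, corpus):
    """Return (position, frequency) of the most frequent pair of correct words.

    sentence:   list of words produced by the OCR
    dictionary: set of known correct words
    corpus:     list of words used to count successive pairs
    """
    if len(corpus) < 2:
        raise ValueError("corpus must contain at least two words")
    pair_counts = Counter(zip(corpus, corpus[1:]))
    total_words = len(corpus)
    best_position, best_frequency = None, 0.0
    for i in range(len(sentence) - 1):
        first, second = sentence[i], sentence[i + 1]
        if first in dictionary and second in dictionary:
            frequency = pair_counts[(first, second)] / total_words
            if best_position is None or frequency > best_frequency:
                best_position, best_frequency = i, frequency
    return best_position, best_frequency
```

If no two successive words of the sentence are both in the dictionary, the sketch returns `(None, 0.0)`, leaving the fallback case to the caller.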

Correction of the sentence
The word frequencies calculated in the previous section allow us to correct each word of a sentence, based on the words that exist in the dictionary. We now compute the frequency of the words that are close to the erroneous word, using the results of (14) and (15), by a calculation that takes into consideration n correct words after the erroneous word or before the erroneous word, by (16) and (17).
i) Left formula, where:
- the count is the number of occurrences of the word $w_i$ in the corpus knowing that it is followed by the correct words to its right
- $N$: the total number of words in the corpus
- $\bar{w}$: the word to correct
ii) Right formula, with:
- the count is the number of occurrences of the word $w_i$ in the corpus knowing that it is preceded by the correct words to its left
- $N$: the total number of words in the corpus
- $\bar{w}$: the word to correct
For each of (11) and (12), we calculate the maximum of the frequencies on the left or on the right by (18) and (19), where $\bar{w}$ is the wrong word. Finally, we can correct a sentence, using the word correction on the left or on the right together with the recursive technique, by (22), where:
- the subscript < denotes the words of the sentence located before the word at the correction center
- the subscript > denotes the words of the sentence located after the word at the correction center
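The left and right corrections can be illustrated with a simplified sketch. Each candidate is scored by how often it appears in the corpus immediately before (left correction) or immediately after (right correction) a run of already correct context words. When no candidate is ever seen with the full context, the sketch shortens the context by one word and retries. This back-off stands in for the recursive technique and is an assumption of the example, as are the function names. Dividing the counts by the corpus size would not change which candidate wins, so raw counts are used.

```python
def count_sequence(corpus, sequence):
    """Number of times the word sequence occurs contiguously in the corpus."""
    n = len(sequence)
    return sum(1 for i in range(len(corpus) - n + 1) if corpus[i:i + n] == sequence)


def correct_left(candidates, following, corpus):
    """Pick the candidate most often followed by the words in `following`."""
    context = list(following)
    while candidates and context:
        counts = {c: count_sequence(corpus, [c] + context) for c in candidates}
        if max(counts.values()) > 0:
            return max(candidates, key=counts.get)
        context.pop()  # assumed back-off: drop the farthest context word
    return None


def correct_right(candidates, preceding, corpus):
    """Pick the candidate most often preceded by the words in `preceding`."""
    context = list(preceding)
    while candidates and context:
        counts = {c: count_sequence(corpus, context + [c]) for c in candidates}
        if max(counts.values()) > 0:
            return max(candidates, key=counts.get)
        context.pop(0)  # assumed back-off: drop the farthest context word
    return None
```

Both functions return `None` when no candidate is ever observed next to the context, so the caller can decide how to handle a word that cannot be corrected.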

TEST AND MEASURE
The evaluation of our system requires experiments. For this purpose, we divided the dataset, which contains 5000 images, into two parts: 80% for training and 20% for testing the performance of the system.

Experimental results of OCR
The implemented OCR system starts by reading an image containing text written in Tifinagh, then performs recognition of the content, and finally generates a file containing the recognition result. Figure 5 is an example of an image, containing a sentence written in Tifinagh, used to test our OCR. The OCR output obtained is illustrated in Figure 6. This result contains two errors, which are reported in Table 2. In the first erroneous word, the error is in the third character, whereas in the second erroneous word, the error is in the first character.

Experimental results of noisy channel
The first step is to find words that have a similar spelling to the input word. Analysis of spelling error data has shown that the majority of spelling errors consist of a single-letter change, so we often make the simplifying assumption that these candidates have an edit distance of 1 from the error word. To find this list of candidates we use the minimum edit distance algorithm, but extended so that in addition to insertions, deletions, and substitutions, we add a fourth type of edit, transpositions, in which two letters are swapped. The version of edit distance with transposition is called Damerau-Levenshtein edit distance. Applying all such single transformations to "ⵉⵏⵣⵉ" yields the list of candidate words shown in Table 3.
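Generating the candidates at Damerau-Levenshtein distance 1 can be sketched by applying every single deletion, transposition, substitution, and insertion to the word and keeping the results found in the dictionary. The function names are illustrative, and deriving the alphabet from the letters that occur in the dictionary is an assumption made to avoid hard-coding the character set.

```python
def single_edits(word, alphabet):
    """All strings one deletion, transposition, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutes = {
        left + char + right[1:] for left, right in splits if right for char in alphabet
    }
    inserts = {left + char + right for left, right in splits for char in alphabet}
    return (deletes | transposes | substitutes | inserts) - {word}


def known_candidates(word, dictionary):
    """Dictionary words at edit distance 1 (with transposition) from the word."""
    alphabet = set("".join(dictionary))  # assumed: letters seen in the dictionary
    return sorted(single_edits(word, alphabet) & set(dictionary))
```

For a word of length n and an alphabet of k letters, this produces on the order of 2nk + 2n strings, which stays small for short words and an alphabet of a few dozen characters.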
Once we have a set of candidates, scoring each one using (6) requires that we compute the prior and the channel model. The prior probability of each correction $P(w)$ is the language model probability of the word $w$ in context, which can be computed using any language model, from unigram to trigram or 4-gram. For this example, let us start in Table 4 by assuming a unigram language model.
Table 3. Candidate corrections for the misspelling "ⵉⵏⵣⵉ" and the transformations that would have produced the error (after [22]; "-" represents a null letter)
How can we estimate the likelihood $P(x|w)$, also called the channel model or error model? A perfect model of the probability that a word will be mistyped would condition on all sorts of factors: who the typist was, whether the typist was left-handed or right-handed, and so on. Fortunately, we can get a reasonable estimate of $P(x|w)$ just by looking at local context: the identity of the correct letter itself, the misspelling, and the surrounding letters.
Once we have the confusion matrices, we can estimate $P(x|w)$ as follows (where $w_i$ is the $i$th character of the correct word $w$ and $x_i$ is the $i$th character of the typo $x$). Using the counts from [17] results in the error model probabilities for ⵉⵏⵣⵉ shown in Table 5. Table 6 shows the final results for each of the corrections; the unigram prior is calculated with (23) and the confusion matrices. The computations in Table 6 show that the noisy channel model chooses ⵉⵏⵥⵉ as the best correction and ⵉⵏⵣⵉⵏ as the second most likely word. For this reason, it is important to use larger language models than unigrams. For example, if we use a corpus to compute bigram probabilities for the words ⵉⵏⵥⵉ and ⵉⵏⵣⵉⵏ in their context using add-one smoothing, we get the following probabilities:
P(ⵢⴰⵏ|ⵉⵏⵥⵉ) = .000051
P(ⵢⴰⵏ|ⵉⵏⵣⵉⵏ) = .000002
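The add-one smoothed bigram probability used in this comparison can be computed as in the sketch below, where the probability of a word given the previous word is (count of the pair + 1) / (count of the previous word + vocabulary size). The function name and the representation of the corpus as a list of words are assumptions made for the example. The probabilities quoted above come from the authors' corpus and are not reproduced by this code.

```python
from collections import Counter


def add_one_bigram_prob(corpus, previous, word):
    """P(word | previous) with add-one (Laplace) smoothing over the corpus."""
    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    vocabulary_size = len(unigram_counts)
    return (bigram_counts[(previous, word)] + 1) / (
        unigram_counts[previous] + vocabulary_size
    )
```

Because of the added count, a pair never seen in the corpus still receives a small non-zero probability, which keeps the product in (6) from collapsing to zero for a plausible but unseen context.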

Experimental results of proposed approach
Error correction is carried out using the proposed approach (NLP). Using (14), we found the following results for the correction of "ⵉⵏⵍⵉ": the candidate words are ⵉⵏⵙⵉ, ⵉⵏⵥⵉ, ⵉⵏⵣⵉ, and ⵉⵏⴳⵉ. For the correction of "ⵄⵔ", we have three candidate words: ⴳⵔ, ⵖⵔ, and ⵎⵔ. We now look for the position of the most frequently used words in Figure 7. According to the results in Table 7, "ⵢⴰⵏ ⵓⵔⴱⴰ" is the most common combination, so the position returned by (15) is "1". This is the correction center and the position of the word "ⵢⴰⵏ".
Using the correction on the left (20), we correct the words that are before the correction center. The word "ⵉⵏⵍⵉ" does not exist in the dictionary, so we calculate the frequency of all the words that are close to it using (20) with pos=1 (i.e., the frequency of the words close to "ⵉⵏⵍⵉ" followed by "ⵢⴰⵏ ⵓⵔⴱⴰ"), as shown in Table 8. According to the results in Table 8, we conclude that the correct word is "ⵉⵏⵣⵉ", so we replace "ⵉⵏⵍⵉ" with "ⵉⵏⵣⵉ".
We now move on to the correction on the right. The word "ⵄⵔ" does not exist in the dictionary, so we compute the frequency of all the words that are close to it using (21) with pos=1 (i.e., the frequency of the words close to "ⵄⵔ" preceded by "ⵉⵏⵣⵉ ⵢⴰⵏ ⵓⵔⴱⴰ"), as can be seen in Table 9. From the results in Table 9, we conclude that the correct word is "ⵖⵔ", so we replace "ⵄⵔ" with "ⵖⵔ". The word "ⵜⵉⵏⵎⵍ" exists in the dictionary, and its frequency knowing that it is preceded by "ⵉⵏⵣⵉ ⵢⴰⵏ ⵓⵔⴱⴰ ⵖⵔ" is 1.3558880795743596E-6, which is not null, so the word "ⵜⵉⵏⵎⵍ" is correct. Similarly, the word "ⵏⵙ" exists in the dictionary, and its frequency knowing that it is preceded by "ⵉⵏⵣⵉ ⵢⴰⵏ ⵓⵔⴱⴰ ⵖⵔ ⵜⵉⵏⵎⵍ" is 1.3558880795743596E-6, which is not null, so the word "ⵏⵙ" is correct.
After correcting the OCR results, the global process can be summarized as shown in Figure 7.
After several tests, we find that the proposed approach improves the results given by the OCR. The results in Table 10 show that the recognition rate of the OCR alone is 86%; after applying the proposed approach, the recognition rate rises to 98%. This is better than the noisy channel model, despite a slight increase in execution time.

CONCLUSION
The results of an OCR system are not always correct, which calls for post-processing that detects and corrects errors. In this paper, we have developed two systems: (i) the first is an OCR system used to recognize documents written in Tifinagh characters; (ii) the second contains two parts: OCR and post-processing. In the post-processing part, we adopted the proposed NLP-based approach to correct the output of the OCR. The effectiveness of the proposed approach is supported by the experimental results obtained: 86% for OCR without post-processing and 98% with post-processing. As future work, the results can be improved in several ways, such as enlarging the corpus and refining the proposed approach.

Figure 2. Block diagram of the present system

Figure 3. Issues in thresholding: (a) original grey-level image, (b) image thresholded with a global method, and (c) image thresholded with an adaptive method

Figure 4. Noisy channel model

Int J Inf & Commun Technol, Vol. 11, No. 1, April 2022: 8-19

If the count is zero, we apply the recursive technique: pos is incremented (pos=pos+1) until the count is greater than 0. Similarly, in the other direction, if the count is zero, we apply the recursive technique: pos is decremented (pos=pos-1) until the count is greater than 0. To correct an erroneous word, we use two formulas depending on its position in relation to the correction center. If the wrong word $\bar{w}$ is located before the correction center, its correction is the set of candidate words of $\bar{w}$ whose left frequency equals the maximum left frequency over these candidates (20). If the wrong word $\bar{w}$ is located after the correction center, its correction is the set of candidate words of $\bar{w}$ whose right frequency equals the maximum right frequency over these candidates (21).

Figure 5. Example of image presented at the OCR input

Table 1. Types of erroneous words

Table 4. The probability to correct the misspelling "ⵉⵏⵣⵉ"

Table 5. Channel model for ⵉⵏⵣⵉ

Table 6. Calculated ranking for each word correction

Combining the language model with the error model in Table 6, the bigram noisy channel model now chooses the correct word ⵉⵏⵥⵉ.

Table 7. Frequency of two successive correct words

Table 8. Frequency of each word among the candidate words of "ⵉⵏⵍⵉ"

Table 9. Frequency of each word among the candidate words of "ⵄⵔ"

Table 10. Correction rate and execution time