Estimating the Effects of Text Genre, Image Resolution and Algorithmic Complexity needed for Sinhala Optical Character Recognition

Abstract: While optical character recognition for Latin-based scripts has seen near-human performance, the accuracy for the rounded scripts of South Asia still lags behind. Work on Sinhala OCR has mainly reported performance on constrained classes of font faces and has therefore been inconclusive. This paper presents a comprehensive series of experiments using conventional machine learning as well as deep learning on texts and font faces of diverse types and resolutions, in order to give a realistic estimate of the complexity of recognizing the rounded Sinhala script. While texts from both old and contemporary books can be recognized with over 87% accuracy, those in old newspapers are much harder to recognize owing to poor print quality and resolution.


I. INTRODUCTION
Optical Character Recognition (OCR) technology is designed to convert printed text into machine-readable text. OCR is a sequence of steps such as scanning, preprocessing, segmentation, feature extraction, classification, recognition and post-processing. In recent literature, many OCR systems have been developed for recognizing Latin characters [1]. With the advancement of Natural Language Processing during the past few years, researchers have integrated machine learning and deep learning techniques for analysing the textual content of digital documents. Template Matching, Neural Networks (NN) and Recurrent Neural Networks (RNN) are popular and widely used algorithms for character recognition. These techniques perform better for other character sets, since large volumes of printed data are available for many languages. The proposed Sinhala OCR is discussed in this paper with special focus on the text genre, image resolution and algorithmic complexity needed for training an OCR system for the Sinhala character set.
Tesseract, as the state-of-the-art OCR technology, is currently used to train OCR systems for many character sets. Further, Tesseract has moved from conventional machine learning to deep learning with an LSTM architecture and provides relatively better recognition [2]. However, algorithmic complexity alone is not enough for training an OCR model, as text genre and image quality also affect the accuracy of the trained model. Large volumes of the available data are in print media and were printed before the computer era, using different techniques such as offset printing and screen printing. Therefore, the common typefaces used throughout the history of printing should also be included in training so that the model can recognize such text. Further, the types and sizes of the fonts and the size of the training text are also significant. In this paper we discuss the OCR system developed for Sinhala by estimating the effects of text genre, image resolution and algorithmic complexity.
The rest of this paper is structured as follows: Section II gives a brief overview of the related work in this area. Section III discusses some properties and characteristics of the Sinhala script, as it is important to review the complexities of the particular script. The algorithmic complexity involved in OCR is discussed in Section IV. Further, Section V gives the motivation and rationale for the experimental setup and a systematic description of the training data, word lists, and training regime adopted to develop the Sinhala OCR. Section VI presents experimental results for the OCR methods, together with an analysis comparing their performance. Finally, the paper concludes with a discussion of future work.

II. RELATED WORKS
Despite decades of research on the engineering aspects, Sinhala character recognition remains a challenging problem in the OCR field. In the past few years, some studies have been conducted to identify widely used font types in Sri Lanka [3]. The first Sinhala OCR, based on the K-Nearest Neighbour (KNN) algorithm, was developed by the Language Technology Research Laboratory, University of Colombo School of Computing [3]. In that study, commercially used font types were employed at varying font sizes to obtain an average accuracy of 94%.
Neural network based Sinhala OCR systems have been developed in recent years [4], [5], [6]. In 2013, the Sri Lanka Institute of Information Technology conducted research on applying neural networks to Sinhala optical character recognition [4]. That study focused on only 36 characters of the alphabet. Another Sinhala OCR application integrating neural networks was developed by a local research group [5]. These studies mainly focused on character-level accuracy and not on word-level accuracy.
In addition, the Software Development Unit of University of Colombo School of Computing has trained a Sinhala OCR model using Tesseract 3 [7]. This system shows relatively good results only for high-resolution images. Also, the Language Technology and Research laboratory at University of Colombo is experimenting with integrating machine learning concepts into Sinhala OCR applications [8]. Further, Manisha et al. [9] have also tried to apply the Tesseract OCR engine to Sinhala characters and report 97% accuracy. However, the performance has not been well documented. It is well known that Indic languages have many complexities and character variations, which make OCR systems hard to develop. Nevertheless, in the past few years multiple studies have applied the Tesseract OCR engine to character recognition for different low-resource languages such as Tamil [10], Hindi [11], Bengali [12] and Urdu [13].

III. SINHALA SCRIPT
The Sinhala script is an abugida or alphasyllabary script in which consonant-vowel sequences are written as units and thereby it is called a segmental writing system. The script has evolved from the Brahmi script. The letters in Sinhala are circular-shaped and are written from left to right [14]. The Sinhala script is used primarily to write the Sinhala language, which is one of the official languages of Sri Lanka spoken by about 16 million people in the country. In addition, it is also used in Sri Lanka for writing Pali, the canonical language of Theravada Buddhism, and sometimes Sanskrit, the Old Indo-Aryan language [15].
There are 20 vowels and 41 consonants in the Sinhala script. Since Sinhala is a segmental writing system, vowels take two representations: independent vowels, which occur in the initial position of a word (and infrequently in the middle of a word, e.g. නුවරඑළිය, ජාඇල), and dependent vowels, also known as vowel modifiers, which occur after a consonant. Figures 1 and 2 illustrate the vowels with their modifiers and the consonants of the Sinhala script respectively. Among the vowel modifiers in figure 1, ං (anusvara) and ඃ (visarga) are two specific modifiers: they occur not only with consonants but also with vowels. E.g. අ , ඉ , උ , අ , ඕ .

Fig. 2 Consonant characters in Sinhala script (legend: consonants included in the training data; consonants not included in the training data: ඞ ඦ)
The two vowels ඏ and ඐ with their corresponding vowel modifiers in figure 1, and ඦ in figure 2, were not included in the training data as they do not occur in old or contemporary Sinhala books. ඞ in figure 2 does occur in a limited number of words in old Sinhala books, but it was not included because its shape would cause misrecognition with similar characters in the Sinhala script.
Sinhala consonants imply the inherent vowel /a/ (අ) when they occur with no modifiers. The absence of the inherent vowel is marked by adding a symbol called hal lakuna or halkirima to the top of the consonant, e.g. ක්, ව්. Hal lakuna also occurs with two vowels and their modifiers, and it has two shapes, as illustrated in figure 3. As Sinhala is a segmental writing system, vowel modifiers appear above, below, or to the right or left of the basic consonant. Among all the consonant-vowel sequences in the Sinhala script, ළු is a special character, as it appears as a separate symbol representing the ළ+උ sequence. As an example, figure 4 illustrates all the consonant-vowel sequences for the consonant 'ක'.
Three consonant modifiers occur in the Sinhala script, known as rakaranshaya, yanshaya and rephaya. Rakaranshaya represents 'ර' (ra) and yanshaya represents 'ය' (ya) when they appear after a consonant from which the inherent vowel has been removed. As symbols, rakaranshaya appears below (e.g. ක්රම, ආශ්රය, වක්ර) and yanshaya to the right (e.g. වයසන, සත්ය, සංඛ්යාව) of the basic consonant. Rephaya is used to denote 'ර්' when it occurs before a consonant, and the symbol appears on top of the basic consonant (e.g. ධර්ම, සර්ව, ත්ර්ක). Using rephaya is an optional alternative in the Sinhala writing system, while rakaranshaya and yanshaya are essential. All the vowel modifiers surround the consonant-rakaranshaya (e.g. ක්ක්රෝ ), consonant-yanshaya (e.g. ක්කයෝ ) or rephaya-consonant (e.g. ක්ර්කෝ ) units. Figure 5 illustrates how vowel modifiers occur with rakaranshaya.
One other significant characteristic of the Sinhala writing system is the use of compound consonants, which occurred frequently in old Sinhala books. In contemporary Sinhala, however, this style of writing is infrequent, and therefore only the first set of compound consonants in figure 6 (those which occur rarely in contemporary Sinhala books) has been considered for the training data in this research.
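The way modifiers attach to a base consonant can be made concrete at the Unicode level. The following is a minimal sketch, not part of the system described in this paper: it groups each base letter with the combining marks that follow it. The function name split_units is hypothetical, and conjoined forms that rely on the zero width joiner (such as rakaranshaya and yanshaya) would need additional handling.

```python
import unicodedata

def split_units(text):
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in text:
        # Unicode general categories starting with "M" are combining marks
        # (hal lakuna and the vowel modifiers fall into this group).
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# ka + hal lakuna, followed by a plain va (written with escapes for clarity)
sample = "\u0D9A\u0DCA\u0DC0"
for unit in split_units(sample):
    print([f"U+{ord(c):04X}" for c in unit])
```

Such a grouping gives a rough idea of how many distinct written units a character-level training set has to cover.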

Compound consonants that rarely occur in contemporary Sinhala books

In our study, different Sinhala text genres yielded different accuracy results. From a variety of genres, explanatory and descriptive writing, narrative writing, and news reportage were selected for our purpose. When selecting documents, we considered a variety of documents and Unicode font types from different printing eras. Low image resolution degrades not only the accuracy but also the speed of OCR, since uncertainty in the character images produces more recognition variants. With the Tesseract engine, too, high-resolution images gave high accuracy, with punctuation marks, modifiers and complex letters all being identified. In the Tesseract engine, image processing is a combination of several steps such as rescaling, binarization, and dilation/erosion.
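To illustrate what the binarization step does, the sketch below applies global Otsu thresholding to a tiny grey-scale image in plain Python. Otsu's method is used here only as a familiar example; it is not claimed to be the exact procedure applied in our experiments, and the function names and the sample pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of their grey levels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor)
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(image, threshold):
    """Map every pixel to 0 (ink) or 255 (paper)."""
    return [[0 if p <= threshold else 255 for p in row] for row in image]

image = [[20, 30, 200], [25, 210, 220]]          # hypothetical 2x3 scan
t = otsu_threshold([p for row in image for p in row])
print(t, binarize(image, t))
```

On low-resolution scans the dark and light populations overlap more, which is one reason a single threshold separates strokes from background less reliably.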

Compound consonants occurred in old books
For the training process, we adapted and experimented with both the Tesseract 3.0 (legacy) and Tesseract 4.0 (deep learning) OCR engines. Tesseract provides a baseline level of accuracy in its engine. To work on Sinhala input, the OCR engine needs a library file called 'traineddata', which is a concatenation of multiple files. The accuracy and richness of this library file determine how close the OCR engine can come to its full potential. The Sinhala script is complicated and has various types of letters including vowels, consonants, compound characters and other special types. Therefore, for Tesseract 3.0, we developed a large character set for Sinhala. It is important to mention that for Tesseract 3.0 each and every character needs to be uniquely identified. Owing to the complexity of the character set, the OCR may sometimes fail to detect a character correctly even if the character is included in the training files.

V. TRAINING PROCESS
The preparation of data and the training process adopted for developing the Sinhala OCR model for both Tesseract 3.0 and 4.0 versions are described in the following subsections.

• Preparation Tesseract 3.0 version
A. Setting up the OCR Engine
We installed Tesseract version 3.0 on the Windows operating system. Since Tesseract 3.0 has no graphical user interface, we ran it through commands on the command line.

B. Preparing training data
The process followed in preparing the training data is described below. Based on the above criteria we prepared two sets of training images. The first set consisted of computer-generated images (screenshots). As part of the iterative training process, the second set of training images was prepared from scanned images of the same text data. Figure 8 shows a sample of such training images.

C. Training the model
The training was performed as an iterative process until better results were obtained. Firstly, models were trained on individual data sets of computer-generated images for given font types and sizes. Secondly, we combined the training data sets for multiple fonts and multiple sizes and trained further models. Thirdly, multiple models were trained on the scanned images for the given font types and sizes. Finally, all the data sets of computer-generated and scanned images were combined in several ways to train multiple models.

• Preparation Tesseract 4.0 version
A. Setting up the OCR Engine
To set up Tesseract version 4.0 we selected an Ubuntu environment. Since Tesseract 4.0 relies on deep learning techniques such as Long Short-Term Memory (LSTM), we chose the Ubuntu operating system for its full compatibility with the OCR engine. All the tasks were carried out in the terminal, with instructions given as commands.

B. Preparation of training data sets
Training data plays an important role in Tesseract version 4.0. With the integration of deep learning techniques, more training data generally leads to better outcomes. For our experiment, we considered three datasets which are available for the Sinhala language: 1) the UCSC 10 million word Sinhala corpus, 2) the Sinhala common crawl dataset and 3) the Google dataset. Further details of each are given below.

1) UCSC 10 Million Word Sinhala Corpus: The UCSC 10M Word Sinhala Corpus has been compiled by the Language Technology Research Laboratory, University of Colombo School of Computing (UCSC), in Sri Lanka. This text corpus contains a wide variety of Sinhala books including novels, short stories, translations and critiques written by renowned Sinhala writers, as well as the Sinhala newspapers Silumina, Dinamina, Lankadeepa and Lakbima. The corpus includes texts which belong to different eras in Sri Lanka and which come from various sources, so it is rich in different styles of writing. Noisy data and text in other languages have been removed from this dataset in order to minimize errors.
2) 5 million+ sentences in Sinhala common crawl: In 2019, Guzman [16] presented two monolingual corpora for Sinhala, comprising 155k+ sentences of filtered Sinhala Wikipedia and 5178k+ sentences of Sinhala common crawl. Since that study considered only textual data available online, the diversity of textual representation is considerably low. Furthermore, this dataset has a high noise rate along with other common issues such as the zero width joiner problem and text in other languages mixed with the Sinhala text. These issues affect the overall accuracy of the system.
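A simple filter gives an idea of how lines containing mostly non-Sinhala text could be detected in such a crawl. This is a hedged sketch rather than the cleaning procedure actually used for any of the datasets above: the function names and the 0.9 threshold are hypothetical, and the only fact relied on is that the Sinhala Unicode block spans U+0D80 to U+0DFF.

```python
ZWJ = "\u200d"    # zero width joiner
ZWNJ = "\u200c"   # zero width non-joiner

def sinhala_ratio(line):
    """Fraction of visible characters that fall in the Sinhala Unicode block."""
    chars = [c for c in line if not c.isspace() and c not in (ZWJ, ZWNJ)]
    if not chars:
        return 0.0
    sinhala = sum(1 for c in chars if "\u0d80" <= c <= "\u0dff")
    return sinhala / len(chars)

def keep_line(line, min_ratio=0.9):
    """Keep a line only if it is predominantly Sinhala (threshold is an assumption)."""
    return sinhala_ratio(line) >= min_ratio

lines = ["\u0d9a\u0dcf\u0dc0 \u0d9a\u0dcf\u0dc0", "mixed \u0d9a text", "hello world"]
print([keep_line(line) for line in lines])
```

Digits and punctuation count against a line in this sketch, so a real filter would likely whitelist them before applying the threshold.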

3) Google dataset: The Google dataset for Sinhala was built specifically for use with Tesseract. It includes a variety of textual representations gathered in recent years.

4) UCSC 400K distinct wordlist: This monolingual vocabulary list was developed from the UCSC 10 million word Sinhala corpus by the Language Technology Research Lab of UCSC. The list includes 440,021 distinct entries and is available on the web.
After comparing the three datasets, the UCSC 10 million word Sinhala dataset [17] was selected owing to its rich coverage of text from different eras and its low level of noise. The UCSC 400K Distinct Word List [18] was also combined with the existing Tesseract word list.
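Combining two word lists amounts to a de-duplicating merge. The sketch below shows one way this could be done while preserving the order of first occurrence; the function name and the toy entries are hypothetical, and the actual file formats of the two lists are not assumed.

```python
def merge_wordlists(*wordlists):
    """Merge word lists, dropping blanks and duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for words in wordlists:
        for word in words:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    return merged

existing = ["alpha", "beta"]             # stands in for the existing Tesseract list
additional = ["beta ", "gamma", ""]      # stands in for the UCSC distinct word list
print(merge_wordlists(existing, additional))
```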
As a special feature, Tesseract version 4.0 generates the tiff file and box file automatically. Additionally, each image and its corresponding UTF-8 text transcription are stored in an lstmf file during font training. Also, in Tesseract 4.0 the clustering steps (mftraining, cntraining, shape clustering) are replaced with a single, slow LSTM training step.
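The data flow just described (text rendered to tiff and box files, converted to lstmf files, then passed to LSTM training) can be sketched as a sequence of command lines. The sketch below only builds and prints the commands; it does not run them. The tool and flag names follow the Tesseract 4 training tools as commonly documented and should be checked against the installed version, while every path, the font name and the iteration count are hypothetical placeholders rather than the settings used in our experiments.

```python
import shlex

def training_commands(text_file, font, fonts_dir, base, start_lstm, traineddata, out_prefix):
    """Return the three command lines of a fine-tuning run as argument lists."""
    # 1. Render the training text to <base>.tif and <base>.box
    render = [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]
    # 2. Convert the image/box pair into <base>.lstmf
    make_lstmf = ["tesseract", f"{base}.tif", base, "--psm", "6", "lstm.train"]
    # 3. Fine-tune from an existing LSTM model on the listed lstmf files
    fine_tune = [
        "lstmtraining",
        f"--continue_from={start_lstm}",
        f"--traineddata={traineddata}",
        f"--train_listfile={base}.training_files.txt",  # hypothetical list file
        f"--model_output={out_prefix}",
        "--max_iterations=400",                          # hypothetical value
    ]
    return [render, make_lstmf, fine_tune]

for cmd in training_commands("corpus.txt", "Noto Sans Sinhala", "/usr/share/fonts",
                             "train/sin.sample", "sin.lstm", "sin.traineddata", "out/sin"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```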

C. Selection of font types and sizes
Since typefaces are significant in training an OCR system, we investigated the commonly used Sinhala fonts to train the OCR model in Tesseract 4.0. Though there are hundreds of non-Unicode fonts available for the Sinhala script, they have no unique character code point for identification. Unicode, in contrast, assigns every character a unique code point; its 16-bit Basic Multilingual Plane alone can hold over 65,000 code points [19]. We therefore selected 9 Unicode fonts from the limited number of Sinhala Unicode fonts available. They include the Unicode fonts which are most commonly used in printed and digital media [20]. The font types involved in the research are given below.

D. Training the model
As pre-processing steps, noise removal, adaptive thresholding, page layout analysis and connected component analysis were performed by the Tesseract OCR engine. The following steps were followed to train the model. Initially, the generated training data was provided as input to the engine and the resulting model was extracted. The model was then fine-tuned to decrease the error rate, and finally the fine-tuned model was combined with the initially trained model. We combined multiple fonts for model creation: single-font, double-font and triple-font models were used for analysis.
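The space of single-, double- and triple-font models can be enumerated as follows. This sketch only counts the possible font sets; it does not imply that every combination was trained. The list contains just the four font names mentioned elsewhere in this paper, and the function name is hypothetical.

```python
from itertools import combinations

FONTS = ["iskolapotha", "Malithi", "LKLug", "Noto Sans"]  # four of the nine fonts

def font_sets(fonts, max_size=3):
    """All font subsets of size 1..max_size, smallest first."""
    fonts = list(fonts)
    return [combo for k in range(1, max_size + 1) for combo in combinations(fonts, k)]

print(len(font_sets(FONTS)))      # subsets available from the four fonts listed
print(len(font_sets(range(9))))   # subsets available from nine fonts
```

With nine fonts there are 9 single, 36 double and 84 triple sets, which is why only selected combined models are practical to train and evaluate.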

VI. EVALUATION AND RESULTS
For the evaluation process, we considered both Tesseract 3.0 and Tesseract 4.0. In the first phase, Tesseract version 3.0 was evaluated at character level, while Tesseract 4.0 was evaluated at both character and word level. The developed OCR models were tested with 30 images selected from three different categories (10 for each category). When selecting images for testing we chose non-identical images with different typefaces and different image qualities.

2) Old Sinhala books:
Testing images for this category were selected from old Sinhala books which were printed using letterpress printing. The old books selected include: බුත්සරණ (buthsarana), පූජාවලිය (pujawaliya) and සද්ධර්මරත්නාවලිය (saddharmarathnawaliya). The images in this category are at 72 DPI.

3) Contemporary Sinhala books:
Books printed with computerized fonts were selected for this category. 10 images of randomly selected pages from 10 books were taken, scanned at 300 DPI.
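The practical difference between 72 DPI and 300 DPI can be seen from how many pixels a printed glyph occupies. The arithmetic below uses the standard definition of a typographic point (1/72 inch); the 12-point type size is a hypothetical value chosen for illustration, not a measurement from the test images.

```python
def glyph_height_px(point_size, dpi):
    """Height in pixels of type set at `point_size` points when scanned at `dpi`."""
    return point_size * dpi / 72.0   # one point is 1/72 of an inch

for dpi in (72, 300):
    print(dpi, glyph_height_px(12, dpi))
```

Under this assumption a 12-point letter spans about 12 pixels at 72 DPI but 50 pixels at 300 DPI, so small modifiers above and below the consonant are represented by only a few pixels in the low-resolution scans.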
To calculate the accuracy of the systems, we compared the original text with the OCR output and counted the characters they have in common and those that differ.
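One way such a comparison could be computed is sketched below using Python's difflib, which aligns the two texts and reports the matching runs. This is an illustrative assumption about the metric, not the exact scoring procedure used for the tables in this section, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

def _matched(reference, output):
    """Number of reference items that align with the OCR output."""
    matcher = SequenceMatcher(None, reference, output, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())

def char_accuracy(reference, ocr_output):
    """Share of reference characters recovered by the OCR output."""
    return _matched(reference, ocr_output) / len(reference) if reference else 0.0

def word_accuracy(reference, ocr_output):
    """Share of reference words recovered, after splitting on whitespace."""
    ref_words, out_words = reference.split(), ocr_output.split()
    return _matched(ref_words, out_words) / len(ref_words) if ref_words else 0.0

print(char_accuracy("abcd", "abxd"))
print(word_accuracy("one two three", "one twa three"))
```

A single wrong character costs one character at character level but a whole word at word level, which is why word-level accuracy is normally the lower of the two figures.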

A. Evaluation of the models from Tesseract 3.0
The evaluation of Tesseract version 3.0 was conducted only for the third category of testing images, for two reasons. Firstly, the results for the other two categories were not at a satisfactory level, and secondly, we gave priority to the evaluation of Tesseract version 4.0. Therefore, we selected the most accurate model (the Scanned-iskolapotha model) out of the 18 models created by varying font types and sizes. The original data of the testing samples consists of 2592 words and 16380 characters. Testing results are given in Table I.
The models generated from the Tesseract 4.0 OCR engine were evaluated for the three categories of testing samples explained above. From the generated models, all the individual font models and three selected Combined Models (CM) were evaluated. The same set of testing images was used throughout the evaluation process.
For the first category of evaluation, we selected 10 images from old newspapers, consisting of 1557 words and 9821 characters. Some of the texts in these images are hard to read even for a human. The results for the first category of images are shown in Table II.

contemporary writing does not follow this style, the training data is not rich with these sequences. This has resulted in not recognizing touching letters.
o Inability to recognize compound consonants. The compound consonants given in figure 6 hardly occur in contemporary Sinhala and therefore they are not well recognized.
Some English characters were also present in the testing images of all three categories. As we focused on developing a better recognition model for Sinhala characters, we did not include enough English text data in the training process. This resulted in some recognition errors and affected the overall accuracy of the system. The above limitations will be addressed in the next stage as a future enhancement.

VII. CONCLUSION AND FUTURE WORKS
In this paper we presented the process of developing an Optical Character Recognition system for Sinhala. In this research we identified the characteristics of the Sinhala script along with the properties of writing styles in Sinhala. The training of the OCR model was initiated with Tesseract 3.0 and later moved to Tesseract 4.0, as it represented the state of the art in deep learning based OCR.
The evaluation was carried out by comparing results across different Sinhala fonts and by adapting the systems to recognize a variety of test data gathered from different sources. Although we tested some samples with a model built from the Sinhala common crawl dataset, its overall accuracy was lower than that of the other models and it was unable to identify some characters.
According to the results, our model trained with the font iskolapotha gave an accuracy of 87.63% on contemporary Sinhala books. In the old Sinhala book category, the models developed using the fonts Malithi and LKLug and the combined font model using Noto Sans, LKLug and Malithi gave accuracies of 87.07%, 87.15% and 87.52% respectively. Meanwhile, in the old Sinhala newspaper category an accuracy of 67.02% was obtained from the model developed with the font iskolapotha.
Developing OCR systems for low-resource languages needs a considerable amount of effort from both the linguistics and computer science domains. Analysing linguistic rules and mapping them to computational models is quite challenging for low-resource languages like Sinhala and Tamil. At this stage of the research, we focused solely on the recognition of the Sinhala script. As mentioned in the above sections, the Sinhala script is also used to write the Pali and Sanskrit languages in Sri Lanka. As a future enhancement we will work on identifying touching and conjoining letters, which occur frequently in Pali. We also plan to apply n-gram or word embedding based post-processing techniques to enhance the accuracy. In real-world settings, OCR can also be categorized as a sequence learning task, in which the sequence of labels must be predicted from noisy, unsegmented input data. As future work, we therefore plan to combine connectionist temporal classification (CTC) with deep learning algorithms to train a Recurrent Neural Network (RNN) to label unsegmented sequences directly. Moreover, neural network compression and conventional neural machine translation approaches for Sinhala OCR will be studied in the future.
International Journal on Advances in ICT for Emerging Regions July 2021

APPENDIX A
The following are the images of the three categories used for testing each OCR model.

APPENDIX B
The interface of the developed OCR system is shown in Figure B.