Abstract
An optical character recognition (OCR) engine is a technological solution for preserving books and manuscripts that may soon be lost to deterioration. In digital form, documents are editable, searchable, and shareable. To save them from destruction, documents need to be scanned and passed to an OCR engine, which generates a digital text file. For large volumes of data, manual typing and conversion is nearly impossible. In this paper, the authors analyze the working of the Tesseract OCR engine on images that contain Gujarati text.
1 Introduction
Gujarati is an Indo-Aryan language that evolved from Sanskrit more than 1000 years ago. According to historical evidence, the evolved Gujarati language is classified as old Gujarati, middle Gujarati, and modern Gujarati. Old Gujarati and its literature are thought to have existed between 1000 and 1450 AD, also known as the Pre-Narsinh yuga. Middle Gujarati, which lasted from 1450 to 1850 AD, saw many phonetic and grammatical changes in language and literature. This period is known as Narsinh yuga or Bhakti yuga. The modern Gujarati language dates from 1850 AD to the present. Aside from this, there are many regional dialects such as Standard Gujarati, Amdavadi, Kachhi, Surati, and Kathiawari [1].
As shown in Figs. 1, 2, 3, and 4, modern Gujarati has 10 digits, 13 vowels, 37 consonants, and a set of special symbols. In addition, diacritics and conjuncts are combined with these vowels and consonants to form words, which increases the complexity of the script.
The Tesseract engine was originally developed as proprietary software at Hewlett Packard laboratories in Bristol, England, and Greeley, Colorado between 1985 and 1994, with further changes made in 1996 to port it to Windows and a partial migration from C to C++ in 1998. Much of the code was written in C, and some more was later written in C++. Since then, all the code has been converted to compile with a C++ compiler. It was released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006.
Tesseract OCR has had 5 major versions to date. Versions 1 to 3 work almost identically. Version 4 adds an LSTM-based OCR engine and models for many additional languages and scripts. The current version 5 was launched in 2021 after two years of testing and development [2].
2 Architecture of Tesseract
Since 2005, numerous changes and improvements have been made to Tesseract OCR; in 2016, a major accuracy improvement was introduced through the long short-term memory (LSTM) model, at the cost of greater computational requirements. Tesseract provides three types of LSTM-based models for most languages, targeting accuracy, speed, and a combination of the legacy engine with the new LSTM model, respectively [3, 4]; the engine to use can be selected with the command-line argument ‘--oem’. Figure 5 is a high-level diagram of the working of the LSTM model used by the Tesseract OCR [4].
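As a rough illustration of the engine-mode switch mentioned above, the values accepted by Tesseract's `--oem` option can be written as a small Python enumeration. The class and helper names below are hypothetical and chosen only for this sketch; the numeric values and their meanings follow Tesseract's command-line help.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Values of Tesseract's --oem option (member names are illustrative)."""
    LEGACY_ONLY = 0      # legacy engine only
    LSTM_ONLY = 1        # neural-net LSTM engine only
    LEGACY_AND_LSTM = 2  # legacy and LSTM engines combined
    DEFAULT = 3          # based on what the installed traineddata supports


def oem_args(mode):
    """Return the command-line fragment that selects the given engine mode."""
    return ["--oem", str(int(mode))]
```

For instance, `oem_args(OcrEngineMode.LSTM_ONLY)` yields the two arguments `--oem 1`, which can be appended to a `tesseract` invocation.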
When an image is passed to the Tesseract engine as input, the engine performs page segmentation, layout analysis, line detection, and thresholding on it, and then passes the resulting image data through the LSTM model to extract the text. The recognition relies on supporting files packed into ‘language’.traineddata [4]. The files required for extracting text are listed here along with their purpose:
1. ‘Language’.lstm: The LSTM recognition model file generated by the training process. This file is required for the recognition process.
2. ‘Language’.lstm-unicharset: The Unicode character set that Tesseract recognizes, with properties. The same unicharset must be used to train the LSTM.
3. ‘Language’.lstm-recoder: The unicharcompress (recoder), which maps the unicharset further to the codes used by the neural network recognizer. It is created as part of the starter traineddata by combine_lang_model.
4. ‘Language’.lstm-punc-dawg/‘Language’.lstm-word-dawg/‘Language’.lstm-number-dawg: Directed acyclic word graphs (DAWGs) that act as supporting (optional) files. They are built using ‘Language’.lstm-unicharset and hold, respectively, punctuation patterns, dictionary words, and number tokens (in which each digit is replaced by a space).
The Gujarati language model in Tesseract OCR has 98,456 words, 403 punctuations, and 148 numbers in relevant files [5].
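To make the file list above concrete, the following minimal Python sketch builds the component file names for a given language code, following the naming pattern in the list. The function and constant names are hypothetical. Only the DAWG files are marked optional, since those are the ones the list above describes as optional.

```python
# Suffixes of the LSTM-related components listed above, with an "optional" flag.
LSTM_COMPONENTS = [
    ("lstm", False),              # recognition model (required)
    ("lstm-unicharset", False),   # character set used by the LSTM
    ("lstm-recoder", False),      # maps the unicharset to network codes
    ("lstm-punc-dawg", True),     # punctuation patterns
    ("lstm-word-dawg", True),     # dictionary words
    ("lstm-number-dawg", True),   # number tokens
]


def lstm_component_files(lang):
    """Return (file_name, is_optional) pairs for one language code."""
    return [(f"{lang}.{suffix}", optional) for suffix, optional in LSTM_COMPONENTS]


if __name__ == "__main__":
    for name, optional in lstm_component_files("guj"):
        print(name, "(optional)" if optional else "(required)")
```

Such a list can be compared against the files obtained when a traineddata archive is unpacked, for example with Tesseract's `combine_tessdata` tool.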
3 Working of Tesseract OCR
As per [6, 7], LSTM is suitable for problems such as classification, processing, and prediction, and it has shown good results in natural language processing tasks such as speech recognition and OCR [8, 9]. To test Tesseract OCR on the Gujarati language, we used a dataset of 15 images covering 2 font styles; 2 of these images are shown below. This dataset was chosen to see how Tesseract OCR handles different font styles within the same language. The sample image in Fig. 6 uses the old text font face, while the sample image in Fig. 7 uses the modern text font face. To generate the output text files, we passed this dataset of images through Tesseract OCR (version 4.1.1). For testing, we used a machine with a 7th-generation Intel i5 processor, 8 GB of memory, and a 1 TB hard disk, running Ubuntu 20.04 LTS.
To run the Tesseract OCR for the Gujarati language, we passed each image using the following command:
tesseract -l guj [imagePath/imageName] [FileName]
Figure 8 shows the actual command used for extracting contents shown in Fig. 6, and the output is shown in Fig. 9. Similarly, the output extracted from the image in Fig. 7 is given in Fig. 10.
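The same invocation can be scripted when many images have to be processed. The sketch below is only an illustration: the function names and the example file names are hypothetical, and it assumes that the `tesseract` executable and the Gujarati language data are installed. Tesseract writes its plain-text result to the output base name with a `.txt` extension.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="guj"):
    """Build the argument list for: tesseract -l <lang> <image> <output base>."""
    return ["tesseract", "-l", lang, image_path, output_base]


def run_ocr(image_path, output_base, lang="guj"):
    """Run Tesseract on one image and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # "old_font_page.png" is a placeholder file name, not part of the dataset.
    print(" ".join(build_tesseract_command("old_font_page.png", "old_font_page")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.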
4 Observation and Analysis
We can see in Fig. 9 that Tesseract has difficulty detecting text in the old font style, whereas its accuracy on the modern font style is considerably better. We compared some common words in the dataset, counted how many times Tesseract failed to recognize them, and generated an error table. Table 1 compares the actual words, the words identified, and the words completely identified, showing how sensitive the OCR engine is to the font face for two different images. The images with the old-type font face were taken from the novel ‘Sorath na Baharvatiya’ by Zaverchand Meghani, and the images with the modern-type font face were taken from the book ‘Sadionu Shaanpan’, which is the Gujarati translation of ‘Calendar of Wisdom’ by Leo Tolstoy. For a better understanding of the OCR engine, we checked its efficiency separately for words of one, two, three, and more than three letters (Fig. 11).
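A comparison of this kind can be sketched in a few lines of Python. The helper names are hypothetical, and the letter count is an approximation introduced here: it counts Unicode code points that are not combining marks, so vowel signs do not count as letters. The exact counting rule used for the tables is not stated, so the word-length buckets may differ, for example for conjuncts.

```python
import unicodedata
from collections import Counter


def approx_letter_count(word):
    """Count code points that are not combining marks (vowel signs, virama)."""
    return sum(1 for ch in word if not unicodedata.category(ch).startswith("M"))


def compare_word_counts(actual_text, ocr_text):
    """Map each ground-truth word to (letters, actual count, identified count)."""
    actual = Counter(actual_text.split())
    found = Counter(ocr_text.split())
    rows = {}
    for word, n_actual in actual.items():
        # An OCR output cannot "identify" a word more often than it occurs.
        n_identified = min(n_actual, found[word])
        rows[word] = (approx_letter_count(word), n_actual, n_identified)
    return rows


if __name__ == "__main__":
    # Toy strings, not taken from the dataset.
    truth = "\u0a9b\u0ac7 \u0a85\u0aa8\u0ac7 \u0a9b\u0ac7"   # "છે અને છે"
    output = "\u0a9b\u0ac7 \u0a85\u0aa8 \u0a9b\u0ac7"        # second word misread
    for word, row in compare_word_counts(truth, output).items():
        print(word, row)
```

Grouping the resulting rows by their first element gives per-length totals comparable to the one-, two-, and three-letter breakdown described above.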
As can be observed in Fig. 12, the recognition of single-letter words does not seem to be an issue, as all 18 single-letter words have been correctly identified as per their actual frequency.
As can be observed in Fig. 13, the recognition of two-letter words is not much of a problem either, as 60 out of 62 two-letter words have been correctly identified.
As can be observed in Figs. 14 and 15, recognition problems start to appear as the number of letters per word increases. In the above case, 4 words could not be identified at all. This introduces a loss in the dataset which might become problematic in real-time applications. The percentage error introduced by the non-identified characters is calculated in Table 2 (Fig. 16).
From Table 2, we can conclude that the Tesseract OCR is unable to recognize some words which have a fine ending curve in their design such as , , , , , and . The percentage error has been calculated using the mean absolute percentage error (MAPE) method.
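In its standard form, MAPE over n rows is defined as MAPE = (100 / n) · Σ |A_i − F_i| / |A_i|. The Python sketch below implements this standard definition. Treating A_i as the actual word count and F_i as the correctly identified count for each row is an assumption made for illustration, and the function name is hypothetical.

```python
def mape(actual_counts, identified_counts):
    """Mean absolute percentage error between two equal-length sequences."""
    if len(actual_counts) != len(identified_counts) or not actual_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for actual, identified in zip(actual_counts, identified_counts):
        if actual == 0:
            raise ValueError("actual counts must be non-zero")
        total += abs(actual - identified) / abs(actual)
    return 100.0 * total / len(actual_counts)


if __name__ == "__main__":
    # Single- and two-letter word counts reported in the text: 18/18 and 60/62.
    print(round(mape([18, 62], [18, 60]), 2))
```

For the two word-length groups quoted above (18 of 18 and 60 of 62 identified), this gives roughly 1.61%; it does not attempt to reproduce the figures in Table 2.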
With the modern-type font, as can be observed in Figs. 17, 18, 19, and 20, the recognition of letters was not an issue, and single-letter and multiple-letter words were identified with 100% accuracy. Hence, at this juncture it can be said that the Tesseract engine works accurately with the modern type of font face.
5 Conclusion
From the analysis shown above, we can conclude that the Tesseract OCR is able to generate text for the Gujarati language. The results show that for the modern font style there does not seem to be a margin of error, whereas for old fonts given as input to Tesseract an error rate of 93.93% has been observed. As the dataset grows, the error rate is likely to increase. Hence, to work with the old Gujarati fonts, we might need to create an entirely new language model compatible with both new and old fonts.
References
1. Wikipedia (2017) Gujarati language: Wikipedia. https://en.wikipedia.org/wiki/Gujarati_language
2. Smith R (2007) An overview of the Tesseract OCR engine. In: Ninth international conference on document analysis and recognition, 2007. ICDAR 2007, vol 2, pp 629–633
3. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
4. GitHub (2017) Tesseract OCR GitHub. https://github.com/tesseract-ocr
5. Verstraeten C (2017) How to train Tesseract 3.01: Cédric Verstraeten. https://blog.cedric.ws/how-to-train-tesseract-301
6. Patel C, Patel A, Patel D (2012) Optical character recognition by open-source OCR tool tesseract: a case study. Int J Comput Appl 55:10
7. Audichya MK, Saini JKR (2022) A study to recognize printed Gujarati characters using Tesseract OCR. Computer Science, Gujarat Technological University
8. Patel C, Desai A (2013) Gujarati handwritten character recognition using hybrid method based on binary tree-classifier and K-nearest neighbour. Int J Eng Res Technol (IJERT) 2(6):2337–2345
9. Chaudhari SA, Gulati RM (2013) An OCR for separation and identification of mixed English: Gujarati digits using kNN classifier. Int Confer Intell Syst Sig Process (ISSP) 13:190–193
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Joshi, K., Arolkar, H. (2024). Working of the Tesseract OCR on Different Fonts of Gujarati Language. In: Joshi, A., Mahmud, M., Ragel, R.G., Kartik, S. (eds) ICT: Cyber Security and Applications. ICTCS 2022. Lecture Notes in Networks and Systems, vol 916. Springer, Singapore. https://doi.org/10.1007/978-981-97-0744-7_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0743-0
Online ISBN: 978-981-97-0744-7