
1 Introduction

Gujarati is an Indo-Aryan language that evolved from Sanskrit more than 1000 years ago. According to historical evidence, the evolution of the Gujarati language is classified into old Gujarati, middle Gujarati, and modern Gujarati. Old Gujarati and its literature are thought to have existed between 1000 and 1450 AD, a period also known as the Pre-Narsinh yuga. Middle Gujarati, which lasted from 1450 to 1850 AD, saw many phonetic and grammatical changes in language and literature; this period is known as the Narsinh yuga or Bhakti yuga. The modern Gujarati language dates from 1850 AD to the present. Aside from this, there are many regional dialects such as Standard Gujarati, Amdavadi, Kachhi, Surati, and Kathiawari [1].

As shown in Figs. 1, 2, 3, and 4, modern Gujarati has 10 digits, 13 vowels, 37 consonants, and a number of special symbols. In addition, diacritics and conjuncts are combined with these vowels and consonants to create new words, which increases the complexity of the script.

Fig. 1 Numbers in the Gujarati language

Fig. 2 Vowels in the Gujarati language

Fig. 3 Consonants in the Gujarati language

Fig. 4 Special symbols in the Gujarati language

The Tesseract engine was originally developed as proprietary software at Hewlett-Packard laboratories in Bristol, England, and Greeley, Colorado between 1985 and 1994, with additional changes made in 1996 to port it to Windows and a migration from C to C++ in 1998. Much of the code was originally written in C, and more was later written in C++; since then, the entire codebase has been converted to compile with a C++ compiler. It was released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006.

Tesseract OCR has had five major versions to date. Versions 1 to 3 are almost identical in their working. Version 4 adds an LSTM-based OCR engine and models for most additional languages and scripts. The current version 5 was launched in 2021 after two years of testing and development [2].

2 Architecture of Tesseract

Since around 2005, there have been numerous changes and improvements in Tesseract OCR; however, in 2016 a major accuracy improvement was introduced with the long short-term memory (LSTM) model, at the cost of higher computational requirements. Tesseract provides LSTM-based models for most languages and lets the user run the legacy neural-network (NN) engine, the new LSTM engine, or a combination of the two, depending on the desired trade-off between accuracy and speed [3, 4]; the engine mode is selected with the command-line argument 'oem'. Figure 5 is a high-level diagram of the working of the LSTM model as followed by Tesseract OCR [4].

Fig. 5 High-level diagram of the working of the LSTM model
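As mentioned above, the engine mode is chosen with the 'oem' command-line argument. The snippet below is a minimal sketch (not the authors' setup) of invoking the LSTM engine for Gujarati through the pytesseract wrapper; it assumes that pytesseract, Pillow, and the Gujarati traineddata are installed, and the image file name is hypothetical.

    # Minimal sketch: selecting the Tesseract OCR engine mode (OEM) for Gujarati.
    # --oem 0: legacy engine only, 1: LSTM only, 2: legacy + LSTM, 3: default.
    from PIL import Image
    import pytesseract

    image = Image.open("sample_gujarati.png")  # hypothetical input image
    text = pytesseract.image_to_string(image, lang="guj", config="--oem 1 --psm 3")
    print(text)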

Here, when an image is passed to the Tesseract engine as input, the OCR engine performs page segmentation, layout analysis, line detection, and thresholding on the image, and then passes the image data through the LSTM model to extract the text and generate the output, with the help of supporting files packed into 'language'.traineddata [4]. The files required for extracting text are listed here along with their purpose:

  1. 'Language'.lstm: The LSTM recognition model file generated by the training process; this is the file required for the recognition process.

  2. 'Language'.lstm-unicharset: The Unicode character set that Tesseract recognizes, with properties. The same unicharset must be used to train the LSTM.

  3. 'Language'.lstm-recoder: The unicharset compressor (recoder), which maps the unicharset further to the codes used by the neural-network recognizer. It is created as part of the starter traineddata by combine_lang_model.

  4. 'Language'.lstm-punc-dawg / 'Language'.lstm-word-dawg / 'Language'.lstm-number-dawg: Directed acyclic word graphs (DAWGs) that act as supporting (optional) files. They are built using 'Language'.lstm-unicharset and contain, respectively, punctuation patterns, dictionary words, and number patterns.

The Gujarati language model in Tesseract OCR contains 98,456 dictionary words, 403 punctuation patterns, and 148 number patterns in the respective files [5].
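These components are packed together in a single guj.traineddata file. As a minimal sketch (not part of the original workflow, and assuming the combine_tessdata utility from Tesseract's training tools is installed), the packed components can be listed and unpacked for inspection as follows; file names and paths are hypothetical.

    # Sketch: inspect the components packed inside guj.traineddata.
    # Assumes combine_tessdata (from Tesseract's training tools) is on PATH
    # and guj.traineddata is in the current directory.
    import subprocess

    # Print the directory of packed components (lstm, lstm-unicharset, dawgs, ...).
    subprocess.run(["combine_tessdata", "-d", "guj.traineddata"], check=True)

    # Unpack all components into files prefixed with "guj." for inspection.
    subprocess.run(["combine_tessdata", "-u", "guj.traineddata", "guj."], check=True)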

3 Working of Tesseract OCR

As per [6, 7], LSTM is suitable for problems such as classification, processing, and prediction, and it has shown good results in natural language processing tasks such as speech recognition and OCR [8, 9]. To test Tesseract OCR for the Gujarati language, we prepared a dataset of 15 images covering two font styles, two of which are shown below. The reason for choosing such a dataset is to see how Tesseract OCR works with different font styles in the same language. The sample image in Fig. 6 uses the old text font face, while the sample image in Fig. 7 uses the modern text font face. To generate the output text files, we passed this dataset of images through Tesseract OCR (version 4.1.1). For testing purposes, we used a machine with an Intel i5 7th-generation processor, 8 GB of memory, and a 1 TB hard disk running Ubuntu 20.04 LTS.

Fig. 6 Sample input image in the old text font face

Fig. 7 Sample input image in the modern text font face

To run Tesseract OCR for the Gujarati language, we passed the images using the command shown below:

tesseract -l guj [imagePath/imageName] [FileName]

Figure 8 shows the actual command used for extracting contents shown in Fig. 6, and the output is shown in Fig. 9. Similarly, the output extracted from the image in Fig. 7 is given in Fig. 10.
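For the full dataset, the same command is simply repeated per image. The snippet below is a minimal sketch (not the authors' exact script) of running Tesseract over a folder of images and producing one text file per image; the folder name is hypothetical, and the tesseract binary with the Gujarati traineddata is assumed to be installed.

    # Sketch: run Tesseract OCR over every PNG image in a dataset folder and
    # produce one UTF-8 text file per image (Tesseract appends ".txt" itself).
    import subprocess
    from pathlib import Path

    dataset_dir = Path("gujarati_dataset")  # hypothetical folder with the 15 images

    for image_path in sorted(dataset_dir.glob("*.png")):
        output_base = image_path.with_suffix("")  # e.g. gujarati_dataset/page01
        subprocess.run(
            ["tesseract", str(image_path), str(output_base), "-l", "guj"],
            check=True,
        )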

Fig. 8 Execution of the Tesseract command (the console also shows a warning about an invalid resolution, estimated as 553)

Fig. 9 Text output of the sample image in the old text font face (blank lines appear at positions 2, 5, 8, 11, 14, 17, 28, and 31)

Fig. 10 Text output of the sample image in the modern-type font face (blank lines appear at positions 1, 2, 4, 12, and 22)

4 Observation and Analysis

We can see in Fig. 9 that Tesseract faces difficulty in detecting text in the old font style, whereas with the modern style it achieves good accuracy in comparison. We compared some common words in the dataset, counted how many times Tesseract failed at recognition, and generated an error table. Table 1 compares the actual words, the words identified, and the completely identified words, to show how sensitive the OCR engine is to the type of font face across the two images. The images with the old-type font face have been taken from the novel 'Sorath na Baharvatiya' by Zaverchand Meghani, and the images with the modern-type font face have been taken from the book 'Sadionu Shaanpan', which is the Gujarati translation of 'Calendar of Wisdom' by Leo Tolstoy. For a better understanding of the OCR engine, we checked the efficiency for one-letter, two-letter, three-letter, and more-than-three-letter words individually (Fig. 11).
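The word-level comparison behind Table 1 can be sketched as follows (an illustration, not the authors' script): the ground-truth text and the OCR output are read, the occurrences of each word are counted, and the words are grouped by length. File names are hypothetical, and word length is approximated here by the number of Unicode code points, which may differ from the letter count used in the tables.

    # Sketch of the comparison behind Table 1: for each ground-truth word,
    # count how many of its occurrences are also present in the OCR output.
    from collections import Counter

    def word_counts(path):
        with open(path, encoding="utf-8") as f:
            return Counter(f.read().split())

    actual = word_counts("ground_truth_old_font.txt")           # hypothetical file
    identified = word_counts("tesseract_output_old_font.txt")   # hypothetical file

    for word, actual_count in sorted(actual.items()):
        found = min(identified.get(word, 0), actual_count)
        bucket = len(word) if len(word) <= 3 else "3+"           # rough length grouping
        print(f"{word}\tletters={bucket}\tactual={actual_count}\tidentified={found}")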

Table 1 Statistics of word identification for both images
Fig. 11 a–d Single-letter, two-letter, three-letter, and more-than-three-letter words identified in the old font style

As can be observed in Fig. 12, the recognition of single-letter words does not seem to be an issue, as all 18 single-letter words have been correctly identified as per their actual frequency.

Fig. 12 Actual versus identified occurrence of single letters in the old font style

As can be observed in Fig. 13, the recognition of two-letter words does not pose much of a problem either, as 60 out of 62 two-letter words have been correctly identified.

Fig. 13 Actual versus identified occurrence of two letters in the old font style

As can be observed in Figs. 14 and 15, recognition problems start cropping up when words contain a greater number of letters. In this case, we can see that 4 words could not be identified at all. This introduces a loss in the dataset, which might become problematic when used in real-time applications. The percentage error introduced by the non-identified characters is calculated in Table 2 (Fig. 16).

Fig. 14 Actual versus identified occurrence of three letters in the old font style

Fig. 15 Actual versus identified occurrence of more than three letters in the old font style

Table 2 Analysis of errors for non-identified characters in old font face
Fig. 16 a–d Single-letter, two-letter, three-letter, and more-than-three-letter words identified in the modern font style

From Table 2, we can conclude that Tesseract OCR is unable to recognize some words that have a fine ending curve in their design, such as , , , , , and . The percentage error has been calculated using the mean absolute percentage error (MAPE) method, where the per-word absolute percentage error (APE) is given by Eq. (1):

$$ \mathrm{APE} = \frac{\text{ground truth count} - \text{OCR text output count}}{\text{ground truth count}} \times 100 $$
(1)
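As a hypothetical worked instance of Eq. (1): if a word occurs 9 times in the ground truth and only 2 of those occurrences appear in the OCR output, then APE = ((9 − 2)/9) × 100 ≈ 77.8%. A one-line helper makes the computation explicit (the counts are illustrative, not taken from the dataset).

    # Absolute percentage error as per Eq. (1); counts are hypothetical examples.
    def absolute_percentage_error(ground_truth_count, ocr_output_count):
        return (ground_truth_count - ocr_output_count) / ground_truth_count * 100

    print(absolute_percentage_error(9, 2))  # 77.77... %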

When working with the modern-type font, as can be observed in Figs. 17, 18, 19, and 20, the recognition of letters was not an issue, and there was 100% accuracy in identifying both single-letter and multi-letter words. Hence, at this juncture it can be said that the Tesseract engine works accurately with the modern type of font face.

Fig. 17 Actual versus identified occurrence of single letters in the modern font style

Fig. 18 Actual versus identified occurrence of two letters in the modern font style

Fig. 19 Actual versus identified occurrence of three letters in the modern font style

Fig. 20 Actual versus identified occurrence of more than three letters in the modern font style

5 Conclusion

From the analysis shown above, we can conclude that Tesseract OCR is able to generate text for the Gujarati language. It can be observed from the results that for the modern font style there does not seem to be any margin of error, while for old fonts given as input to Tesseract, an error rate of 93.93% has been observed. As the dataset grows, chances are that the error rate will increase. Hence, to work with old Gujarati fonts, we might need to create an entirely new language model compatible with both new and old fonts.