
1 Introduction

Gujarati is an Indo-Aryan language that evolved from Sanskrit more than 1000 years ago. According to historical evidence, the evolution of the Gujarati language is classified into old Gujarati, middle Gujarati, and modern Gujarati. Old Gujarati and its literature are thought to have existed between 1000 and 1450 AD, a period also known as the Pre-Narsinh yuga. Middle Gujarati, which lasted from 1450 to 1850 AD, saw many phonetic and grammatical changes in language and literature; this period is known as the Narsinh yuga or Bhakti yuga. The modern Gujarati language dates from 1850 AD to the present. Aside from this, there are many regional dialects such as Standard Gujarati, Amdavadi, Kachhi, Surati, and Kathiawari [1].

As shown in Figs. 1, 2, 3, and 4, modern Gujarati has 10 digits, 13 vowels, 37 consonants, and a number of special symbols. In addition, diacritics and conjuncts are combined with these vowels and consonants to create new words, which increases the complexity of the script.

Fig. 1 Numbers in the Gujarati language

Fig. 2 Vowels in the Gujarati language

Fig. 3 Consonants in the Gujarati language

Fig. 4 Special symbols in the Gujarati language

The Tesseract engine was originally developed as proprietary software at Hewlett-Packard laboratories in Bristol, England, and Greeley, Colorado between 1985 and 1994, with additional changes made in 1996 to port it to Windows and a migration from C to C++ in 1998. Much of the code was originally written in C, and more was later written in C++; since then, the entire codebase has been converted to compile with a C++ compiler. It was released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006.

Tesseract OCR has had five major versions to date. Versions 1 to 3 are almost identical in their working. Version 4 adds an LSTM-based OCR engine and models for most additional languages and scripts. The current version 5 was launched in 2021 after two years of testing and development [2].

2 Architecture of Tesseract

Since around 2005, there have been numerous changes and improvements in Tesseract OCR; however, in 2016 a major accuracy improvement was introduced with the long short-term memory (LSTM) model, at the cost of higher computational requirements. Tesseract provides LSTM-based models for most languages and lets the user run the legacy neural-network (NN) engine, the new LSTM engine, or a combination of the two, depending on the desired trade-off between accuracy and speed [3, 4]; the engine mode is selected with the command-line argument 'oem'. Figure 5 is a high-level diagram of the working of the LSTM model as followed by Tesseract OCR [4].

Fig. 5 High-level diagram of the working of the LSTM model
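As mentioned above, the engine mode is chosen with the 'oem' command-line argument. The snippet below is a minimal sketch (not the authors' setup) of invoking the LSTM engine for Gujarati through the pytesseract wrapper; it assumes that pytesseract, Pillow, and the Gujarati traineddata are installed, and the image file name is hypothetical.

    # Minimal sketch: selecting the Tesseract OCR engine mode (OEM) for Gujarati.
    # --oem 0: legacy engine only, 1: LSTM only, 2: legacy + LSTM, 3: default.
    from PIL import Image
    import pytesseract

    image = Image.open("sample_gujarati.png")  # hypothetical input image
    text = pytesseract.image_to_string(image, lang="guj", config="--oem 1 --psm 3")
    print(text)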

Here, when an image is passed to the Tesseract engine as input, the OCR engine performs page segmentation, layout analysis, line detection, and thresholding on the image, and then passes the image data through the LSTM model to extract the text and generate the output, with the help of supporting files packed into 'language'.traineddata [4]. The files required for extracting text are listed here along with their purpose:

  1. 'Language'.lstm: The LSTM recognition model file generated by the training process; this is the file required for the recognition process.

  2. 'Language'.lstm-unicharset: The Unicode character set that Tesseract recognizes, with properties. The same unicharset must be used to train the LSTM.

  3. 'Language'.lstm-recoder: The unicharset compressor (recoder), which maps the unicharset further to the codes used by the neural-network recognizer. It is created as part of the starter traineddata by combine_lang_model.

  4. 'Language'.lstm-punc-dawg / 'Language'.lstm-word-dawg / 'Language'.lstm-number-dawg: Directed acyclic word graphs (DAWGs) that act as supporting (optional) files. They are built using 'Language'.lstm-unicharset and contain, respectively, punctuation patterns, dictionary words, and number patterns.

The Gujarati language model in Tesseract OCR contains 98,456 dictionary words, 403 punctuation patterns, and 148 number patterns in the respective files [5].
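These components are packed together in a single guj.traineddata file. As a minimal sketch (not part of the original workflow, and assuming the combine_tessdata utility from Tesseract's training tools is installed), the packed components can be listed and unpacked for inspection as follows; file names and paths are hypothetical.

    # Sketch: inspect the components packed inside guj.traineddata.
    # Assumes combine_tessdata (from Tesseract's training tools) is on PATH
    # and guj.traineddata is in the current directory.
    import subprocess

    # Print the directory of packed components (lstm, lstm-unicharset, dawgs, ...).
    subprocess.run(["combine_tessdata", "-d", "guj.traineddata"], check=True)

    # Unpack all components into files prefixed with "guj." for inspection.
    subprocess.run(["combine_tessdata", "-u", "guj.traineddata", "guj."], check=True)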

3 Working of Tesseract OCR

As per [6, 7], LSTM is suitable for problems such as classification, processing, and prediction, and it has shown good results in natural language processing tasks such as speech recognition and OCR [8, 9]. To test Tesseract OCR for the Gujarati language, we prepared a dataset of 15 images covering two font styles, two of which are shown below. The reason for choosing such a dataset is to see how Tesseract OCR works with different font styles in the same language. The sample image in Fig. 6 uses the old text font face, while the sample image in Fig. 7 uses the modern text font face. To generate the output text files, we passed this dataset of images through Tesseract OCR (version 4.1.1). For testing purposes, we used a machine with an Intel i5 7th-generation processor, 8 GB of memory, and a 1 TB hard disk running Ubuntu 20.04 LTS.

Fig. 6 Sample input image in the old text font face

Fig. 7 Sample input image in the modern text font face

To run Tesseract OCR for the Gujarati language, we passed the images using the command shown below:

tesseract -l guj [imagePath/imageName] [FileName]

Figure 8 shows the actual command used for extracting contents shown in Fig. 6, and the output is shown in Fig. 9. Similarly, the output extracted from the image in Fig. 7 is given in Fig. 10.
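For the full dataset, the same command is simply repeated per image. The snippet below is a minimal sketch (not the authors' exact script) of running Tesseract over a folder of images and producing one text file per image; the folder name is hypothetical, and the tesseract binary with the Gujarati traineddata is assumed to be installed.

    # Sketch: run Tesseract OCR over every PNG image in a dataset folder and
    # produce one UTF-8 text file per image (Tesseract appends ".txt" itself).
    import subprocess
    from pathlib import Path

    dataset_dir = Path("gujarati_dataset")  # hypothetical folder with the 15 images

    for image_path in sorted(dataset_dir.glob("*.png")):
        output_base = image_path.with_suffix("")  # e.g. gujarati_dataset/page01
        subprocess.run(
            ["tesseract", str(image_path), str(output_base), "-l", "guj"],
            check=True,
        )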

Fig. 8 Execution of the Tesseract command (the console also shows a warning about an invalid resolution, estimated as 553)

Fig. 9 Text output of the sample image in the old text font face (blank lines appear at positions 2, 5, 8, 11, 14, 17, 28, and 31)

Fig. 10 Text output of the sample image in the modern-type font face (blank lines appear at positions 1, 2, 4, 12, and 22)

4 Observation and Analysis

We can see in Fig. 9 that Tesseract faces difficulty in detecting text in the old font style, whereas with the modern style it achieves good accuracy in comparison. We compared some common words in the dataset, counted how many times Tesseract failed at recognition, and generated an error table. Table 1 compares the actual words, the words identified, and the completely identified words, to show how sensitive the OCR engine is to the type of font face across the two images. The images with the old-type font face have been taken from the novel 'Sorath na Baharvatiya' by Zaverchand Meghani, and the images with the modern-type font face have been taken from the book 'Sadionu Shaanpan', which is the Gujarati translation of 'Calendar of Wisdom' by Leo Tolstoy. For a better understanding of the OCR engine, we checked the efficiency for one-letter, two-letter, three-letter, and more-than-three-letter words individually (Fig. 11).
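The word-level comparison behind Table 1 can be sketched as follows (an illustration, not the authors' script): the ground-truth text and the OCR output are read, the occurrences of each word are counted, and the words are grouped by length. File names are hypothetical, and word length is approximated here by the number of Unicode code points, which may differ from the letter count used in the tables.

    # Sketch of the comparison behind Table 1: for each ground-truth word,
    # count how many of its occurrences are also present in the OCR output.
    from collections import Counter

    def word_counts(path):
        with open(path, encoding="utf-8") as f:
            return Counter(f.read().split())

    actual = word_counts("ground_truth_old_font.txt")           # hypothetical file
    identified = word_counts("tesseract_output_old_font.txt")   # hypothetical file

    for word, actual_count in sorted(actual.items()):
        found = min(identified.get(word, 0), actual_count)
        bucket = len(word) if len(word) <= 3 else "3+"           # rough length grouping
        print(f"{word}\tletters={bucket}\tactual={actual_count}\tidentified={found}")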

Table 1 Statistics of word identification for both images
Fig. 11 a–d Single-letter, two-letter, three-letter, and more-than-three-letter words identified in the old font style

As can be observed in Fig. 12, the recognition of single-letter words does not seem to be an issue, as all 18 single-letter words have been correctly identified as per their actual frequency.

Fig. 12 Actual versus identified occurrence of single letters in the old font style

As can be observed in Fig. 13, the recognition of two-letter words does not pose much of a problem either, as 60 out of 62 two-letter words have been correctly identified.

Fig. 13 Actual versus identified occurrence of two letters in the old font style

As can be observed in Figs. 14 and 15, recognition problems start cropping up when words contain a greater number of letters. In this case, we can see that 4 words could not be identified at all. This introduces a loss in the dataset, which might become problematic when used in real-time applications. The percentage error introduced by the non-identified characters is calculated in Table 2 (Fig. 16).

Fig. 14 Actual versus identified occurrence of three letters in the old font style

Fig. 15 Actual versus identified occurrence of more than three letters in the old font style

Table 2 Analysis of errors for non-identified characters in old font face
Fig. 16 a–d Single-letter, two-letter, three-letter, and more-than-three-letter words identified in the modern font style

From Table 2, we can conclude that Tesseract OCR is unable to recognize some words that have a fine ending curve in their design, such as , , , , , and . The percentage error has been calculated using the mean absolute percentage error (MAPE) method, where the per-word absolute percentage error (APE) is given by Eq. (1):

$$ \mathrm{APE} = \frac{\text{ground truth count} - \text{OCR text output count}}{\text{ground truth count}} \times 100 $$
(1)
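As a hypothetical worked instance of Eq. (1): if a word occurs 9 times in the ground truth and only 2 of those occurrences appear in the OCR output, then APE = ((9 − 2)/9) × 100 ≈ 77.8%. A one-line helper makes the computation explicit (the counts are illustrative, not taken from the dataset).

    # Absolute percentage error as per Eq. (1); counts are hypothetical examples.
    def absolute_percentage_error(ground_truth_count, ocr_output_count):
        return (ground_truth_count - ocr_output_count) / ground_truth_count * 100

    print(absolute_percentage_error(9, 2))  # 77.77... %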

When working with the modern-type font, as can be observed in Figs. 17, 18, 19, and 20, the recognition of letters was not an issue, and there was 100% accuracy in identifying both single-letter and multi-letter words. Hence, at this juncture it can be said that the Tesseract engine works accurately with the modern type of font face.

Fig. 17 Actual versus identified occurrence of single letters in the modern font style

Fig. 18 Actual versus identified occurrence of two letters in the modern font style

Fig. 19 Actual versus identified occurrence of three letters in the modern font style

Fig. 20 Actual versus identified occurrence of more than three letters in the modern font style

5 Conclusion

From the analysis shown above, we can conclude that Tesseract OCR is able to generate text for the Gujarati language. It can be observed from the results that for the modern font style there does not seem to be any margin of error, while for old fonts given as input to Tesseract, an error rate of 93.93% has been observed. As the dataset grows, chances are that the error rate will increase. Hence, to work with old Gujarati fonts, we might need to create an entirely new language model compatible with both new and old fonts.