1 Introduction

Sign language is a mode of communication that uses the visual-spatial modality, through which people with hearing and speech disabilities convey messages. American Sign Language (ASL) is used by hundreds of thousands of individuals, who find themselves at a disadvantage in delicate or sensitive circumstances when no interpreter is present. In earlier work, an Indian Sign Language (ISL) dataset of 320 hand gesture images covering the alphabets a–z, the digits 0–9, and a few common words was used; the images were flipped along the vertical axis to ensure high accuracy for both left- and right-handed users, and a self-coded 7-layered convolutional neural network was trained on them [1]. Sign language communication is multifaceted, using hand gestures, body movements, and facial expressions; however, this paper focuses mainly on accurately reproducing, recognizing, and representing hand gestures in sign language. In the usual methods, a convolutional neural network (CNN) model was developed to convert sign language to text, with an architecture comprising multiple dense and convolutional layers, ReLU activation, and a dropout of 0.2 to avoid overfitting [2]. In our project, the landmark detection technique performed better in terms of memory consumption when classifying signs, with the help of Google's open-source library MediaPipe. A landmark model akin to this concept has been used to interpret signed American Sign Language (ASL) into regular text. That approach uses machine learning (ML) to extract 21 3D hand landmarks from a single frame, and the detector returns high-resolution 3D landmarks. It leverages key Python libraries such as OpenCV and scikit-learn's KNN algorithm, and the authors built a website that takes at least 25 to 30 images per sign and later predicts the sign with the help of a suitable model [3].

It is a daunting task for a deaf person to communicate with a hearing person who is unfamiliar with sign language. The kiosk serves as an essential tool to bridge this communication barrier, particularly in delicate situations where no interpreter is present. This is where the NLP model implementations come into play. From vision to NLP to symbolic computation, Explainable AI (XAI) is key to the adoption of the current wave of enterprise AI technologies. Static and dynamic visualizations have been created to teach and clarify BERT and the Transformers. If multiple models need to be tested, Interpret is used, as it is considered the best of the many model-specific BERT visualizations. Future visualizations are expected to focus more on problem-specific Transformer components, while composing larger architectures from common components reduces the reliance of current visualizations on the underlying models [4]. Once the text has been examined and corrected, an automated response containing the answer to the query posed to the kiosk is generated.

To convert sign language to speech, Outfit-7 and Video Relay Service (VRS) technologies were used to enable audible language translation on smartphones through gesture technology. This system makes use of JavaScript Object Notation (JSON) to handle the translation of voice into text and an ASL video. The deaf person can read the text and also view the ASL video of the SMS stored in the inbox [5]. Drawing a cue from this method, our approach uses phonemes, prosody, and the Mel-spectrogram to convert the text response into speech.

A predetermined sign vocabulary was used to create the final avatar movements. Sign language translation (SLT) would greatly lower the hurdle for many deaf and mute individuals, enabling them to communicate better with others in day-to-day interactions.

2 Methodology

2.1 Assumptions and Problems

The dataset must be collected from people with different hand shapes and must cover different rotations, angles, and scales. Earlier literature dealt only with methods to convert sign to text and vice versa. Our solution adds an automated, context-aware response and delivers it as text, speech, or sign, so the user does not have to wait for another person to clarify a query; the machine replies, essentially acting as a bot for signs. The proposed solution is divided into three phases:

Phase 1: Conversion of sign to text

Phase 2: Generation of response to the text

Phase 3: Conversion of text-to-sign or speech (Fig. 1).

Fig. 1
Block diagram of the sign kiosk: the query passes through data collection and cleaning, training and testing, sign-to-text conversion, spelling correction, and automatic response generation to produce the answer

2.2 Conversion of Sign to Text

Earlier approaches to converting sign language to text consumed a large amount of memory, because operations such as region extraction, shape detection, and skin color detection were performed on images in order to classify each sign. Instead of these methods, we use landmark detection to classify signs, i.e., we use only the coordinates of the hands. We used MediaPipe's holistic model to extract the coordinates of the hand joints, where each landmark contains x, y, z, and visibility values spread across three-dimensional space, to train our custom machine learning model.
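As an illustrative sketch (not the exact project code), the per-frame feature extraction with MediaPipe's holistic model might look as follows; the webcam index and confidence thresholds are assumptions:

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

cap = cv2.VideoCapture(0)  # default webcam (assumed)
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    ok, frame = cap.read()
    if ok:
        # MediaPipe expects RGB input; OpenCV captures frames in BGR
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.right_hand_landmarks:
            # 21 landmarks x (x, y, z, visibility) = 84 features per frame
            row = [v for lm in results.right_hand_landmarks.landmark
                     for v in (lm.x, lm.y, lm.z, lm.visibility)]
cap.release()
```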

The goal is to train the model to classify from class 'A' to 'Z', a total of 27 classes, using 84 features that comprise the x, y, z, and visibility values of all 21 hand landmarks. Users perform signs with their right hand, and the training dataset was collected at different angles and with different hand shapes in order to reduce error. The full dataset contained 55,642 rows and 84 features and was split into a 70% training set and a 30% testing set for evaluating the trained model. The model was trained with KNN, SVM, and random forest classifiers [7], obtaining accuracies of 97.6%, 95.82%, and 98.68%, respectively. This method has certain advantages for sign language classification: unlike previous approaches, bulky images are not used to train the model, so very little memory is consumed during execution, and because only coordinates are stored, data privacy concerns are also reduced. The main disadvantage is that the model may not produce accurate results under poor illumination (Figs. 2, 3, and 4).
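A minimal sketch of the training and evaluation described above, assuming the collected landmarks are stored in a CSV file (hypothetically named landmarks.csv) with 84 feature columns and a label column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical file: 84 landmark feature columns plus a 'label' column
df = pd.read_csv("landmarks.csv")
X, y = df.drop(columns=["label"]), df["label"]

# 70/30 train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

for name, clf in [("KNN", KNeighborsClassifier()),
                  ("SVM", SVC()),
                  ("Random forest", RandomForestClassifier())]:
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```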

Fig. 2
Hand model of MediaPipe, showing the 21 landmarks: the wrist; the CMC, MCP, IP, and tip of the thumb; and the MCP, PIP, DIP, and tip of the index, middle, ring, and pinky fingers

Fig. 3
Plot of KNN: two categories of data points in the X1–X2 plane, with a new data point to be classified between them

Fig. 4
Plot of SVM: two classes in the X1–X2 plane separated by the hyperplane w·x − b = 0, with margins w·x − b = ±1 and margin width 2/‖w‖

2.3 Generation of Response to Text

For phase 2, we use natural language processing (NLP) to ensure the accuracy of the generated text and to avoid grammatical errors. Our NLP module consists of two models: the Spello model and the Transformers model. Spello is a spelling correction library that uses phoneme, Symspell, and context models in the backend to obtain the best possible corrections for misspelled words. The Phoneme model is based on the Soundex algorithm, which suggests corrections by searching for similar-sounding words. The Symspell model uses edit distance to correct misspellings caused by keyboard typing errors. Finally, the context model decides which of the words suggested by the two models should be kept, using an n-gram probabilistic model [7] to find the best possible word for the particular sentence. For a better understanding of how Spello works, consider a sentence with misspelled words: 'I dnt knw Cachine Learning'.

The mathematical principle underlying this concept is that the system pre-calculates word probabilities along with their n-word contexts. It estimates the probability of a fragment by multiplying the probabilities of all n-grams of size n within it.

$$ P\left( {w_{1} , \ldots ,w_{m} } \right) = \mathop \prod \limits_{i = 1}^{m} P\left( {w_{i} \mid w_{1} , \ldots ,w_{i - 1} } \right) = \mathop \prod \limits_{i = 1}^{m} P\left( {w_{i} \mid w_{i - \left( {n - 1} \right)} , \ldots ,w_{i - 1} } \right) $$
(1)

To simplify, the system calculates the probability of an n-gram by multiplying the probabilities of all lower-order grams. Smoothing techniques like Kneser–Ney (2) are used to enhance the model's accuracy.

$$ P_{s} \left( {w_{i} \mid w_{i - 2} ,w_{i - 1} } \right) = P\left( {w_{i} \mid w_{i - 2} ,w_{i - 1} } \right) \cdot P\left( {w_{i} \mid w_{i - 1} } \right) \cdot P\left( {w_{i} } \right) $$
(2)

To get the probability of an n-gram from appearance frequencies, we need to normalize the frequencies (e.g., divide the count of each 3-gram by the count of the corresponding 2-gram, and so on):

$$ P\left( {w_{i} \mid w_{i - \left( {n - 1} \right)} , \ldots ,w_{i - 1} } \right) = \frac{{{\text{count}}\left( {w_{i - \left( {n - 1} \right)} , \ldots ,w_{i - 1} ,w_{i} } \right)}}{{{\text{count}}\left( {w_{i - \left( {n - 1} \right)} , \ldots ,w_{i - 1} } \right)}} $$
(3)
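To make Eqs. (1) and (3) concrete, a toy bigram (n = 2) estimator over a tiny corpus can be sketched as follows; the corpus is purely illustrative:

```python
from collections import Counter

# Toy corpus (illustrative only)
corpus = "i do not know machine learning i know deep learning".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    """P(word | prev) = count(prev, word) / count(prev), as in Eq. (3) with n = 2."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# Probability of the fragment "know machine learning" as a product of bigrams, per Eq. (1)
fragment = ["know", "machine", "learning"]
prob = 1.0
for prev, word in zip(fragment, fragment[1:]):
    prob *= p_bigram(prev, word)
print(prob)  # 0.5 for this toy corpus
```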

Now we can use our extended language model to estimate candidates with context. In this example (as shown in Fig. 6), there are spelling mistakes: 'dnt', 'knw', and 'Cachine'. With the help of the Symspell model, the spelling mistakes in 'dnt' and 'knw' are detected. With the help of the Phoneme model, the word 'Cachine' is found to sound closely similar to the word 'Machine'. These models suggest corrected words, and the context model uses them to predict the most probable correct sentence for the given context. Finally, the most probable sentence, 'I don't know Machine Learning', is selected and given as output. The latency of the model was 10 ms (Fig. 5).
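A minimal sketch of this correction step, assuming the spello package's SpellCorrectionModel interface and a small domain corpus for training (both are assumptions, not the exact project setup):

```python
from spello.model import SpellCorrectionModel

# Assumed usage of spello: train on a small domain corpus, then correct a query
sp = SpellCorrectionModel(language="en")
sp.train(["I don't know Machine Learning",
          "Machine Learning classes are held every semester"])

result = sp.spell_correct("I dnt knw Cachine Learning")
print(result)  # expected to contain the corrected text and a correction mapping
```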

Fig. 5
Block diagram of finding misspelled words in a sentence using the Symspell and Phoneme models

Fig. 6
Block diagram of the context model: suggestions from both models pass through the context model and the n-gram model to give the corrected sentence as output

The next step is to generate a response to the corrected sentence. This is done using a question-answering module built on the RoBERTa model (a modified version of BERT), which extracts answers from a given context. For a given query, tokens are generated and passed through a series of neural network layers; those tokens are then used to extract a sentence, which is the response to the query. The input includes segment embeddings, which are used to distinguish between the question and the reference material. The separator token is SEP, and the class token is CLS. The CLS token marks the beginning of the question, which extends until the SEP token that keeps the question and the reference text separate from each other. The segment embeddings also play a role in separating the question from the text (Fig. 7).
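As a hedged illustration of how the question and reference text are joined with special tokens, using a HuggingFace tokenizer (the checkpoint name deepset/roberta-base-squad2 is an example choice, not necessarily the one used in the project):

```python
from transformers import AutoTokenizer

# Example public RoBERTa QA checkpoint (assumption, for illustration)
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

question = "Where can I learn Machine Learning?"
context = "Machine Learning classes are held in Room 101 every semester."

enc = tokenizer(question, context)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# RoBERTa uses <s> and </s> in the roles played by CLS and SEP in BERT,
# marking the boundary between the question and the reference text.
```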

Fig. 7
Start token classifier: the input, beginning with the CLS token, passes through Transformer layers 1–12 to produce a start score for each token

Pipelines combine a pretrained model with the preprocessing that was applied while training that model. We use question-answering, the model, and the tokenizer as parameters in our project. About 160 GB of data, including Wikipedia, were used to train the RoBERTa model. Using a tokenizer from the HuggingFace packages, we can operate directly on the dataset and tokenize the text. RoBERTa uses the byte-level byte-pair encoding technique derived from GPT-2, with a vocabulary of 50,000 tokens. Using the fast tokenizer's from_pretrained method from the Transformers library, we can apply the tokenizer to the train and test subsets. The part of the reference text containing the answer is divided into tokens, and the answer is found by simply identifying which tokens start and end it. The start token classifier receives the final embedding of every token in the text. The start token classifier has a constant weight vector that is applied to every token: we take the dot product between this weight vector and the output embedding of each token, followed by the softmax activation.
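A minimal sketch of the question-answering pipeline described above; the checkpoint name and the context string are assumptions for illustration:

```python
from transformers import pipeline

# Assumed public checkpoint; the project's own fine-tuned RoBERTa model could be substituted
qa = pipeline("question-answering",
              model="deepset/roberta-base-squad2",
              tokenizer="deepset/roberta-base-squad2")

answer = qa(question="Where can I learn Machine Learning?",
            context="Machine Learning classes are held in Room 101 every semester.")
print(answer["answer"], answer["score"])
```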

$$ \sigma \left( {\vec{z}} \right)_{i} = \frac{{{\text{e}}^{{z_{i} }} }}{{\mathop \sum \nolimits_{j = 1}^{K} {\text{e}}^{{z_{j} }} }} $$
(4)

where σ is the softmax activation function (4) and z ranges over the components z1 to zK; the zi values are the components of the input vector for the softmax function, and K is the total number of classes.

The softmax function is used in multi-class classification models since it outputs a probability for every class, and the target class is expected to have a high probability. The start token is chosen as the token with the highest probability. A constant weight vector is likewise associated with the end token classifier; the same process performed for the start token is repeated, and the end token is identified. The answer to the given question is therefore the span of text from the start token to the end token (Fig. 8).
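The start/end selection can be sketched with stand-in numbers as follows; the embeddings and weight vectors below are random placeholders, not trained values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()  # Eq. (4)

rng = np.random.default_rng(0)
num_tokens, hidden = 12, 768                                  # passage length and hidden size (assumed)
token_embeddings = rng.standard_normal((num_tokens, hidden))  # final-layer embeddings (placeholders)
w_start = rng.standard_normal(hidden)                         # constant start-token weight vector
w_end = rng.standard_normal(hidden)                           # constant end-token weight vector

p_start = softmax(token_embeddings @ w_start)                 # dot product per token, then softmax
p_end = softmax(token_embeddings @ w_end)

start, end = int(p_start.argmax()), int(p_end.argmax())
print("answer span: tokens", start, "to", end)                # tokens between these indices form the answer
```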

Fig. 8
End token classifier: the input, beginning with the CLS token, passes through Transformer layers 1–12 to produce an end score for each token

2.4 Conversion of Text-to-Sign and Speech

In this stage of implementation, the sign kiosk system aims to cater to users who may have intellectual disabilities, auditory impairments, or both. The system transforms text into real-world signed videos [9] and additionally provides voice output for people with normal hearing. To achieve this, the system uses the Moviepy library to merge individual letters into sentences, creating a VideoFileClip() instance for each .mp4 file. These instances are concatenated using the concatenate_videoclips() method, and the final output is saved by writing to a file with write_videofile(). Moviepy's strength lies in its ability to handle videos of different sizes, making it suitable for video editing and concatenation. The system's database uses a hashmap to map letters to their corresponding .mp4 files, facilitating easy retrieval and concatenation of the sign videos (Fig. 9).
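A minimal sketch of the text-to-sign concatenation, assuming a hypothetical signs/ directory with one pre-recorded clip per letter:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Hypothetical hashmap from letters to pre-recorded sign clips
sign_clips = {"h": "signs/h.mp4", "i": "signs/i.mp4"}

def text_to_sign_video(text, out_path="response_sign.mp4"):
    # Look up one clip per letter, skipping characters with no clip (e.g., spaces)
    clips = [VideoFileClip(sign_clips[ch]) for ch in text.lower() if ch in sign_clips]
    final = concatenate_videoclips(clips, method="compose")  # "compose" handles differing clip sizes
    final.write_videofile(out_path)

text_to_sign_video("hi")
```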

Fig. 9
Block diagram of text-to-speech conversion (gTTS): the text passes through the preprocessor, encoder, and decoder before reaching the vocoder

The system uses the Moviepy library to concatenate pre-recorded video clips for each individual sign. The database is pre-populated as a hashmap with a collection of pre-recorded video clips, where each clip corresponds to a specific sign or letter in sign language. When a user inputs text, the system retrieves the corresponding video clip for each letter or sign from the database and merges them with the Moviepy library to form a complete sentence or response in sign language. This concatenated video is then presented to the user through the sign kiosk system.

The primary purpose of implementing the text-to-speech [10] model is to enable both intellectually disabled individuals and those with normal hearing to hear the model's responses. This process is assisted by phonemes (units of sound that distinguish one word's pronunciation from another's), prosody (the pattern of rhythm and intonation in speech), and the Mel-spectrogram (used to remove noise from the required audio). The block diagram consists of a preprocessor, an encoder, a decoder, and a vocoder. The preprocessor converts the sentence into words, and the phoneme model converts the words into phonemes (their pronunciation); each phoneme is associated with a duration, pitch, and energy. The encoder converts the linguistic features into latent features, the decoder converts the phonemes into acoustic features, and the vocoder converts the acoustic features into the waveform. This waveform is the speech output (Fig. 10).
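For the voice output of Fig. 9, a minimal gTTS call could look like the following; the reply string is only an example:

```python
from gtts import gTTS

# Convert the generated text reply to speech and save it as an audio file
reply = "Machine Learning classes are held in Room 101 every semester."
gTTS(text=reply, lang="en").save("reply.mp3")
```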

Fig. 10
UI of our project: the welcome screen of the kiosk interface, with a user performing a sign to the camera

3 Results

Our proposal is centered on a kiosk that operates in real time; real time in this context means that the kiosk produces a response within a short period of time. We use MediaPipe to determine the alphabets given as input during the sign-to-text conversion phase, i.e., the input phase. Additionally, the model shows the likelihood of each input alphabet in real time. Since our model has a 96.87% accuracy rate, the input alphabets may occasionally be misinterpreted for a variety of reasons, including poor illumination and variations in sign size and form. However, from the captured hand coordinates, the model determines the most likely alphabet and then concatenates these most probable alphabets into a sentence, which the kiosk interprets as a query. After a response to the query is generated using NLP, this text-based response is converted back into sign and shown alphabet by alphabet in the project's user interface, ultimately conveying the complete sentence response (Figs. 11 and 12).
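A sketch of how the most probable alphabet per frame could be assembled into the query, assuming the trained classifier clf from Sect. 2.2 and one 84-value feature row per captured frame (both are assumptions):

```python
import numpy as np

def frames_to_query(clf, frames):
    """Pick the most probable letter for each frame and concatenate them into the query."""
    letters = []
    for row in frames:                       # row: 84 landmark features for one frame
        probs = clf.predict_proba([row])[0]  # class probabilities for this frame
        letters.append(clf.classes_[int(np.argmax(probs))])
    return "".join(letters)
```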

Fig. 11
Probabilities of the alphabets displayed in the program output; the most probable alphabet is taken as input

Fig. 12
Person showing the alphabet 'E' to the interface using ASL, with the detected hand landmarks overlaid

4 Conclusion

Communication is an essential part of human social life, which some people are denied because of a speaking or hearing disability, or both. The world is taking steps to discover new technologies to solve this problem. This project concludes that the problem of sign language communication can be solved using computer vision and natural language processing without any human intervention. The model can be deployed as a kiosk and installed in places where a person with a hearing or speaking disability, or both, can gather the information they need with the help of our technology. American Sign Language was used for communication throughout the project, as it is the accepted standard for signing in English.

In the future, we can increase the size of the dataset, which will enhance accuracy, especially with respect to variables such as background, lighting, and diverse hand shapes, allowing us to effectively address issues related to these factors. A natural extension of the model would be to implement the same approach for regional sign languages, so that hearing- and speech-impaired people can communicate in their regional language.