Development of a PC-based sign language translator

ABSTRACT

Hearing-impaired persons communicate mainly through sign language (SL), which includes the use of facial gestures, certain body part movements, and lip-reading. SL is country-specific, which makes its translation a difficult task [3]. The most important thing is getting the attention of the hearing-impaired person, which can be done in several ways, such as touching the arm or shoulder, waving, stamping on the floor, and switching the light on and off [4]. The choice and effectiveness of a method for catching the attention of the hearing-impaired person depend on how familiar the two parties in the communication process are with each other.
Given the number of hearing-impaired people around us, their inability to convey their thoughts constitutes a major problem, especially in areas where they are in the minority. Effective communication is required for peaceful co-existence and the general well-being of society at large. In view of this, SLs were developed to facilitate effective communication among hearing-impaired persons. To a large extent, this measure has addressed the problem of communication among educated hearing-impaired individuals. However, it does not address communication between hearing-impaired and non-hearing-impaired persons. Hence, there is a need for an arbiter between these two sets of people in order to forestall avoidable crises that may arise from a communication gap. Essentially, two key issues are involved in arriving at such an arbiter: sign language recognition (SLR) and sign language translation (SLT).
Research into recognizing and interpreting SL gestures started a few years back. SLR schemes can be grouped into three categories, depending on the form they take: schemes that rely on the recognition of finger-spellings, schemes based on the recognition of isolated words, and schemes based on the recognition of continuous sentence construction [5]. Many of the earlier research efforts on SLR adopted traditional recognition approaches, which include the hidden Markov model [6] for word recognition, the support vector machine [7], [8] for classification of both isolated words and continuous SL alphabets, and trajectory matching for grouping isolated words. Recently, different varieties of deep-learning methods, such as the convolutional neural network (CNN) [9], [10] and long short-term memory (LSTM) [11]-[14], have been utilized singly and in hybrid configurations [15], [16] to address the problem of SLR, especially in applications involving the recognition of continuous sentence structure.
Unlike the situation with SLR, reports on SLT are scanty in the literature. However, knowledge gained from research into SLR provides ample leverage for SLT development to facilitate effective communication between hearing-impaired and non-hearing-impaired persons in our society. The few proposals on SLT that can be found in the literature include deep learning model-based SLT [11], [15], [17]-[20] and sensor-based SLT [21]-[35]. The majority of the neural SLT models adopt a multimodal structure in which a CNN is connected sequentially to a neural machine translation (NMT) module. While the NMT module is essentially the kernel for the translation of SL gestures into target sentences, the CNN is used to extract the image-level features that serve as NMT input. The critical problem with all deep learning-based SLT models is the requirement for a large dataset, which is not readily available. This requirement hinders the performance of the resulting SLT models. Sensor-based SLTs are usually built around a glove fitted with a microtouch switch or other electronic devices for gesture recognition and translation into text and speech equivalents.
Other researchers developed different SLTs, such as a software-based platform [36], a MobilenetV2-based gesture recognition system [37], and a tablet-based hearing aid [38]. The open-source software framework developed in [36] presents a development environment for building augmentative and alternative communication models, which include communication aids for the disabled community. A MobilenetV2-based gesture recognition system was developed in [37]. Although the system is specifically meant for smart home applications, it can also be deployed for gesture translation. Cameras of mobile devices are used to detect and capture data from objects, with each detected object marked in the frame by a bounding box. The focus of the device developed in [38] was a tablet-based hearing aid in which sign language gestures are digitally processed by the tablet before being wirelessly relayed to standard earphones for better output.
This work proposes a PC-based communication aid to facilitate effective communication between hearing-impaired and non-hearing-impaired people. American sign language (ASL) gestures are employed in the development: a database of ASL hand gestures is created using Python scripts, while the pipeline configuration model for machine learning on the annotated gesture images in the database is realized using TensorFlow (TF). The developed SLT is implemented in a Python software environment and runs on a PC equipped with a web camera that captures real-time gestures for comparison and interpretation. The outputs of the developed SLT are translations of ASL gestures into written text and corresponding audio renderings, produced in about one second on average. The novelty of this paper lies in not introducing any new device, which reduces cost in the long run. The personal computer is now a common tool available to different individuals (hearing-impaired and non-hearing-impaired alike) who mingle and interact. While sensor-based SLTs [21], [23]-[33], [35] tend to be costlier, the PC-based SLT proposed in this paper is relatively cheaper, as no separate new gadgets are required. In addition, there is no need for additional training, as the device can be operated by anyone who can handle a computer system. Specifically, a database of ASL gestures is created using a Python script, which enables faster further signal processing, and a TF pipeline configuration model is developed to interface with the created database for machine learning. The rest of the paper is structured as follows. Section 2 presents the method, while section 3 presents the results and discussion. The paper is concluded in section 4.

METHOD
The developed communication aid between hearing-impaired and non-hearing-impaired individuals works by converting ASL gestures made by the hearing-impaired person into corresponding text and audio signals for the non-hearing-impaired person to interpret. The development involves three stages: i) creation of a database of images for each of the selected ASL gestures using a Python script; ii) creation of a pipeline configuration model using TF to interface with the created database for machine learning; and iii) deployment of the TF pipeline configuration model in a Python software environment for matching, comparison, and decision making with real-time images of ASL gestures, and for production of the corresponding text and audio words. In order to realize these steps, a few materials (hardware and software) are required. They are briefly highlighted in what follows, beginning with the hardware materials.

Hardware materials
Hardware materials used in this study include: i) a laptop with 4.00 gigabytes (GB) of random access memory (RAM) and a 64-bit, 2.60 gigahertz (GHz) processor, running the Windows 10 operating system: the laptop is used as the workbench for the developed communication aid and houses its software component for deployment and utilization; ii) a 1080p video-recording, 12.0-megapixel high-definition (HD) webcam: this is used to capture real-time images of the sign language gestures made in front of it. The webcam is connected to the laptop through its universal serial bus (USB) port, and images captured by the webcam are saved to the laptop; iii) a SanDisk 8 GB micro secure digital (SD) memory card: the memory card is used to compile the selected images of the gestures. It is slotted into the SD port of the laptop to extract the images; iv) other hardware, such as speakers, provided as peripherals to the computer to enhance the audio rendering of the gesture images; and v) sign language gestures: specifically, ASL gesture images are employed in the development of the communication aid reported in this paper.

Software components
Software components used in this study include: i) Python v3.8, a popular open-source programming language. It is used as the model building and deployment environment for the developed communication aid. The choice is informed by its versatility and ease of use for different scripting applications in a wide variety of domains; ii) LabelImg, a graphical image annotation tool, written in Python, for labeling object bounding boxes in images. It is used to create bounding box annotations of the gesture images. The created annotations are saved in CreateML format; iii) Pyttsx3, adopted as the text-to-speech conversion library for this work. It is a Python library and is chosen because it works offline, unlike the available alternative libraries, which work mainly online; iv) TF, a software library or framework, designed by the Google team, for easier and faster implementation of machine learning and deep learning concepts. The core functionalities of TF that favour its choice for this work are: augmented tensor operations with seamless interfaces to existing programs; automatic differentiation, which lies at the very core of optimization-based algorithms; and parallel and distributed (multi-machine) computing. TF is used to create a pipeline configuration model (an application interface) to facilitate easier detection of objects. The pipeline configuration file is split into five parts: model configuration, train configuration, evaluation configuration, train input configuration, and evaluation input configuration; v) Mobilenet_v2_SSD, an object detector that can be used on real-time images for location finding. Detected points/locations are described by bounding boxes, with each bounding box assigned a class; vi) Jupyter server, an extension in the Python software environment that extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results. The Jupyter server combines two components: a web application and notebook documents. For this work, the web application component is used; and vii) DeepStack server, an artificial intelligence (AI) server that enables faster development of AI systems both on premises and in the cloud. DeepStack runs on the Docker platform but can be used from any programming language.
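As a rough illustration of the five-part layout of the pipeline configuration file, the sketch below checks a configuration text for its top-level blocks. The block names (`model`, `train_config`, `train_input_reader`, `eval_config`, `eval_input_reader`) follow the usual TensorFlow Object Detection API naming and are an assumption here, since the paper names the five parts only in prose; the sample text and every value in it are hypothetical.

```python
import re

# Top-level blocks conventionally found in a TensorFlow Object Detection API
# pipeline file (assumed naming; the paper names the five parts only in prose).
EXPECTED_SECTIONS = (
    "model",
    "train_config",
    "train_input_reader",
    "eval_config",
    "eval_input_reader",
)


def find_sections(config_text):
    """Return the expected block names that open at the start of a line."""
    found = re.findall(r"^(\w+)\s*:?\s*\{", config_text, flags=re.MULTILINE)
    return [name for name in EXPECTED_SECTIONS if name in found]


def missing_sections(config_text):
    """Return the expected blocks that are absent from the config text."""
    present = find_sections(config_text)
    return [name for name in EXPECTED_SECTIONS if name not in present]


# Hypothetical, heavily abbreviated example of such a file.
SAMPLE_CONFIG = """\
model {
  ssd { num_classes: 9 }
}
train_config {
  batch_size: 4
}
train_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
}
eval_config {
}
eval_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
}
"""

if __name__ == "__main__":
    print("present:", find_sections(SAMPLE_CONFIG))
    print("missing:", missing_sections(SAMPLE_CONFIG))
```

A check of this kind is only a convenience before training; it does not validate the contents of each block.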

Procedure
Having highlighted the needed materials (both hardware and software), the next undertaking is to describe the procedure followed in developing the communication aid. A Python script is written to compile and create the database of ASL gestures. The script uses OpenCV, a Python library for computer vision and imaging, to facilitate interaction between the laptop and the webcam. This interaction allows snapshots of ASL gestures taken by the webcam to be stored for further processing. The images of some of the ASL gestures used and their corresponding meanings are shown in Figure 1, while Figure 2 depicts a snapshot of the ASL database created. Figure 1 shows the typical ASL gestures used in the creation of the database with their corresponding meanings: (a) hello, (b) i love you, (c) nice to meet you, (d) no, (e) please, and (f) sorry.
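The paper does not reproduce its database-creation script, so the following is only a minimal sketch of how such a script might look with OpenCV. The folder layout, file naming, number of snapshots per gesture, and pause between snapshots are all assumptions made for illustration; only the gesture labels are taken from Figure 1.

```python
import os
import time
import uuid

# Gesture labels named in Figure 1; the folder layout below is an assumption.
LABELS = ("hello", "i love you", "nice to meet you", "no", "please", "sorry")


def image_path(root, label, token=None):
    """Build a unique file path such as <root>/<label>/<label>.<token>.jpg."""
    token = token or uuid.uuid4().hex
    folder = label.replace(" ", "_")
    return os.path.join(root, folder, f"{folder}.{token}.jpg")


def collect_images(root, labels=LABELS, per_label=15, pause_s=2.0, camera=0):
    """Grab `per_label` webcam snapshots for every label and save them."""
    import cv2  # imported lazily so the path helper works without OpenCV

    cap = cv2.VideoCapture(camera)
    if not cap.isOpened():
        raise RuntimeError("webcam could not be opened")
    try:
        for label in labels:
            for _ in range(per_label):
                time.sleep(pause_s)  # give the gesturer time to pose
                ok, frame = cap.read()
                if not ok:
                    continue  # skip frames the camera failed to deliver
                path = image_path(root, label)
                os.makedirs(os.path.dirname(path), exist_ok=True)
                cv2.imwrite(path, frame)
    finally:
        cap.release()  # always free the camera, even after an error


if __name__ == "__main__":
    collect_images("asl_database")
```

Storing one folder per gesture keeps the later annotation step simple, since each image's intended label is evident from its path.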
As described earlier, LabelImg is then used to annotate objects in the created ASL database. Figure 3 illustrates a typical LabelImg annotation, specifically for the gesture image "hello". The pipeline configuration model for machine learning was built using TF. The pipeline is used to convert the annotations and the created ASL database into TF record format for machine learning. For object detection, Mobilenet_v2_SSD is used in conjunction with the created TF pipeline. Figure 4 shows an extract from the TF pipeline created. The TF pipeline configuration model is trained on the created database and its annotations in the Python software environment. The model is first deployed on the Jupyter server, then on the DeepStack server. The deployments on Jupyter and DeepStack run on port 80 and simply point to the saved model's directory. The model compares, matches, and makes final decisions on the real-time images of ASL gestures displayed by the gesturer. It also displays the bounding box coordinates together with a precision value that ranges from 0 to 1. An evaluation script that uses OpenCV facilitates interaction with the webcam, while Pyttsx3 converts the text equivalents of the translated sign gestures into the corresponding audio in real time.
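To make the annotation-to-record conversion more concrete, the sketch below turns CreateML-style annotations into the normalized corner coordinates typically stored in TF record examples. It assumes the common CreateML layout, in which each box is given by its centre point plus width and height in pixels; the file name, image size, and box values are hypothetical, and the actual conversion script used in this work is not shown in the paper.

```python
import json


def createml_to_normalized_boxes(createml_json, image_width, image_height):
    """Convert CreateML-style annotations into normalized corner boxes.

    CreateML stores each box as a centre point (x, y) plus width and height
    in pixels; TF record examples usually store xmin/xmax/ymin/ymax in [0, 1].
    """
    records = []
    for entry in json.loads(createml_json):
        for ann in entry.get("annotations", []):
            c = ann["coordinates"]
            half_w, half_h = c["width"] / 2.0, c["height"] / 2.0
            records.append({
                "image": entry["image"],
                "label": ann["label"],
                # Clamp to [0, 1] in case a box slightly overruns the frame.
                "xmin": max(0.0, (c["x"] - half_w) / image_width),
                "xmax": min(1.0, (c["x"] + half_w) / image_width),
                "ymin": max(0.0, (c["y"] - half_h) / image_height),
                "ymax": min(1.0, (c["y"] + half_h) / image_height),
            })
    return records


# Hypothetical annotation for one 640x480 snapshot of the "hello" gesture.
SAMPLE = json.dumps([{
    "image": "hello.0001.jpg",
    "annotations": [{
        "label": "hello",
        "coordinates": {"x": 320, "y": 240, "width": 160, "height": 120},
    }],
}])

if __name__ == "__main__":
    for box in createml_to_normalized_boxes(SAMPLE, 640, 480):
        print(box)
```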

RESULTS AND DISCUSSION
Findings from the deployment of the developed SLT are presented here, together with its response time when real-time gestures are made in front of a PC running the developed communication aid. The discussion is grouped into two parts, beginning with the process of engaging the developed SLT.

Engaging the developed SLT
A gesturer is positioned facing the webcam of the PC on which the developed virtual communication aid is running. The following instruction is entered at the command prompt on the PC: deepstack --MODELSTORE-DETECTION "C:\Users\USER PC\Desktop\My project\fine_tuned_model (docker)\ObjectDetection\models" --PORT 80. To start the webcam video stream, this instruction is followed by: cd "C:\Users\USER PC\Desktop\My project\fine_tuned_model (docker)\ObjectDetection" and then python livefeed_detection.py.
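The contents of livefeed_detection.py are not listed in the paper. The sketch below is one plausible outline of such a script, under several assumptions: that the server exposes the custom model at the hypothetical URL `http://localhost:80/v1/vision/custom/asl`, that its reply carries a `predictions` list with `label` and `confidence` fields, and that each new label is spoken once with Pyttsx3. The sample reply is invented for illustration.

```python
def best_prediction(response_json):
    """Return (label, confidence) of the most confident detection, or None."""
    predictions = response_json.get("predictions") or []
    if not response_json.get("success", False) or not predictions:
        return None
    top = max(predictions, key=lambda p: p["confidence"])
    return top["label"], float(top["confidence"])


def detect_frame(jpeg_bytes, url="http://localhost:80/v1/vision/custom/asl"):
    """POST one JPEG-encoded frame to the server and return its JSON reply."""
    import requests  # third-party; imported lazily

    reply = requests.post(url, files={"image": jpeg_bytes}, timeout=10)
    reply.raise_for_status()
    return reply.json()


def run_live_feed(camera=0):
    """Stream webcam frames, print and speak each newly detected gesture."""
    import cv2
    import pyttsx3

    engine = pyttsx3.init()
    cap = cv2.VideoCapture(camera)
    last_spoken = None
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            encoded_ok, buffer = cv2.imencode(".jpg", frame)
            if not encoded_ok:
                continue
            best = best_prediction(detect_frame(buffer.tobytes()))
            if best and best[0] != last_spoken:
                label, confidence = best
                print(f"{label} ({confidence:.2f})")
                engine.say(label)      # queue the text for speech
                engine.runAndWait()    # block until it has been spoken
                last_spoken = label
            cv2.imshow("live feed", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):  # "Q" halts the feed
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()


# Hypothetical server reply, used only to exercise best_prediction().
SAMPLE_REPLY = {
    "success": True,
    "predictions": [
        {"label": "no", "confidence": 0.52},
        {"label": "hello", "confidence": 0.91},
    ],
}

if __name__ == "__main__":
    print(best_prediction(SAMPLE_REPLY))
```

Speaking a label only when it changes avoids repeating the same word on every frame while a gesture is held.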
When the correct instructions are entered, the properties of the real-time image in front of the webcam are displayed, and the image is matched with its equivalent in the database to give the appropriate outputs (displayed text and audio rendering). In addition, a text is displayed on the PC screen to indicate the accuracy of the matching process. Pressing the letter "Q" on the keyboard halts the live feed. Figure 5 shows a snapshot of the first and second instructions as entered at the command prompt. Once the two instructions have been entered, the real-time image of the gesturer in front of the camera is processed and the sign language gesture is translated into the appropriate outputs (displayed text and audio). Figure 6 shows the results obtained when nine gestures are presented in front of the PC running the developed SLT. As can be observed from Figure 6, in addition to the displayed text rendering of the ASL gestures, certain numerical values ranging between 0 and 1 are shown. These values indicate the precision associated with each translated ASL gesture.
Table 1 presents the time taken by the developed PC-based SLT to render selected ASL gestures, presented in front of the PC on which it is deployed, into the corresponding text and audio interpretations. The translation precision values captured for each of the selected ASL gestures are also presented. The results in Table 1 show that the developed PC-based sign language translator achieves its aim: each of the nine ASL gestures used for the evaluation is successfully translated into its word equivalent within a reasonable time. This indicates that the developed sign language translator is suitable as a communication aid between hearing-impaired and non-hearing-impaired individuals.
The processing time for each of the ASL gestures used in the evaluation is generally about one second, which shows that the translation is done in real time, without the kind of delay that could introduce frustration into the communication process. The precision value attached to each gesture indicates how accurately the real-time image is matched with the images in the annotated database. The lowest precision in the set of ASL gestures used for the evaluation is about 44% (corresponding to "i love you"), while the highest is approximately 80% (corresponding to "yes"). These results show the robustness of the developed sign language translator, for it detects and correctly translates gestures into the required outputs (text and audio rendering) even when the match between the real-time image and the database image is only about 44%.
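The role of the precision value in the decision step can be illustrated with a small sketch. The acceptance threshold used here is purely an assumption, as the paper does not state one; the two precision values are the lowest and highest figures quoted above, rounded.

```python
def translate(detections, threshold=0.40):
    """Return the labels whose precision reaches the threshold, best first.

    `detections` is an iterable of (label, precision) pairs with precision
    in [0, 1]; the default threshold of 0.40 is a hypothetical choice.
    """
    kept = [(label, p) for label, p in detections if p >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in kept]


# Lowest and highest precision values quoted in the text (rounded).
REPORTED = [("i love you", 0.44), ("yes", 0.80)]

if __name__ == "__main__":
    print(translate(REPORTED))                  # both gestures accepted
    print(translate(REPORTED, threshold=0.50))  # "i love you" is dropped
```

With a threshold of 0.40 both gestures are translated, whereas a stricter threshold of 0.50 would reject the "i love you" detection, so the choice of threshold directly trades missed gestures against false translations.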

Time elapsed for inference making
It is worth pointing out that the PC-based SLT proposed in this paper takes less time to respond than most sensor-based SLTs, which need a few seconds for recognition. For instance, it is reported in [28] that the detection of hand motions took a few seconds, as the user had to hold a formed sign for two seconds to ensure recognition. This is in addition to the lower cost of implementation, acquisition, and operation of the proposed SLT, since it is simply an 'add-in' to a typical PC. Table 2 further summarizes the comparison of the PC-based SLT proposed in this paper with two others found in the literature.
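The paper does not say how the response time was measured. One simple way, sketched below, is to time each translation call with a monotonic clock and average over the gestures; `fake_translate` is a hypothetical stand-in for the real capture, detection, and speech step.

```python
import time
from statistics import mean


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def average_latency(func, inputs):
    """Mean wall-clock time of func over a sequence of inputs."""
    return mean(time_call(func, item)[1] for item in inputs)


def fake_translate(gesture):
    """Stand-in for the real capture-detect-speak step (hypothetical)."""
    time.sleep(0.01)
    return gesture.upper()


if __name__ == "__main__":
    gestures = ["hello", "yes", "no"]
    print(f"average latency: {average_latency(fake_translate, gestures):.3f} s")
```

Using `time.perf_counter` rather than the wall clock avoids errors from system clock adjustments during the measurement.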

CONCLUSION
A PC-based sign language translator has been developed in this paper. The developed SLT is shown to successfully translate nine different ASL gestures into text and audio equivalents. When deployed, these written and audio renderings of sign language gestures will go a long way in facilitating effective communication between hearing-impaired and non-hearing-impaired individuals in our society. Hence, the translator provides a method of addressing the problem of communication between these individuals and aids their mutual interactions in daily activities.