Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network

Sci Rep. 2023; 13: 16975.

Published online 2023 Oct 9. doi:10.1038/s41598-023-43852-x

PMCID: PMC10562485

PMID: 37813932

Refat Khan Pathan,1 Munmun Biswas,2 Suraiya Yasmin,3 Mayeen Uddin Khandaker,4,5 Mohammad Salman,6 and Ahmed A. F. Youssef6



Abstract

Sign Language Recognition is a breakthrough for communication among the deaf-mute society and has been a critical research topic for years. Although some previous studies have successfully recognized sign language, they require many costly instruments, including sensors, devices, and high-end processing power. However, such drawbacks can be easily overcome by employing artificial intelligence-based techniques. Since, in this modern era of advanced mobile technology, using a camera to take video or images is much easier, this study demonstrates a cost-effective technique to detect American Sign Language (ASL) using an image dataset. Here, the “Finger Spelling, A” dataset has been used, with 24 letters (excluding j and z, as they contain motion). The main reason for using this dataset is that these images have complex backgrounds with different environments and scene colors. Two layers of image processing have been used: in the first layer, images are processed as a whole for training, and in the second layer, the hand landmarks are extracted. A multi-headed convolutional neural network (CNN) model has been proposed to train these two layers and tested with 30% of the dataset. To avoid the overfitting problem, data augmentation and dynamic learning rate reduction have been used. With the proposed model, a test accuracy of 98.981% has been achieved. It is expected that this study may help to develop an efficient human–machine communication system for the deaf-mute society.

Subject terms: Computational science, Image processing

Introduction

Spoken language is the medium of communication for the majority of the population, but a section of the population cannot use it to communicate with the rest. Deafness is a disability that impairs hearing and makes a person unable to hear, while muteness is a disability that impairs speech and makes a person unable to talk. People affected by these disabilities are limited only in hearing and/or speaking and can still do many other things; communication is the main barrier that isolates them from the rest of society1. As there are so many spoken languages in the world, a dedicated language is needed for them to express their thoughts and opinions in a way that is understandable to other people, and such a language is called sign language. Understanding sign language is an arduous task, an ability that must be learned with training.

Many methods are available that use different tools, such as images (2D, 3D), sensor data (hand glove2, Kinect sensor3, neuromorphic sensor4), videos, etc. In real settings the captured images are excessively noisy, so a high level of pre-processing is required. The available online datasets are already processed or captured in a lab environment, where it becomes easy for recent advanced AI models to train and evaluate, which makes them prone to errors in real-life applications with different kinds of noise. Accordingly, there is a basic need for a model that can deal with noisy images and still deliver positive results. Different sorts of machine learning methods can be utilized to perform the classification and recognition of images. Apart from recognizing static images, work has also been done on depth-camera detection and video processing5–7. Various processes embedded in such systems were created using different programming languages to execute the procedural strategies for maximum effectiveness of the final system. The problem can be addressed and deliberately organized into three comparable methodologies: first, using static image recognition techniques and pre-processing procedures; second, using deep learning models; and third, using Hidden Markov Models.

Sign language serves this part of the community and enables smooth communication among people who have trouble talking and hearing (deaf and mute). They use hand signals along with facial expressions and body movements to interact. Yet, as a global language, not many people become familiar with sign language gestures8. Hand gestures comprise a significant part of sign language vocabulary, while facial expressions and body movements play the role of emphasizing the words and phrases expressed by the hand gestures. Hand gestures can be static or dynamic9,10. There are methodologies for motion detection utilizing the dynamic vision sensor (DVS), a similar technique to the one used in the framework introduced in this work. For example, Amir et al.11 presented an event-based gesture recognition system, which processes the event stream on TrueNorth, a natively event-based processor from International Business Machines (IBM). They use a temporal filter cascade to create spatio-temporal frames that a CNN executes on the event-based processor, and they reported an accuracy of 96.46%. But in a real-life scenario, the corresponding background situations are not static, so the stated power-saving process might not work properly. Jun Haeng Lee et al.12 proposed a motion classification method with two DVSs to obtain a stereo-vision system. They used spiking neurons to handle the incoming events, with the same real-life issue. Static hand signals, also called hand postures, are formed by different shapes and orientations of the hands without conveying any movement information, whereas dynamic hand gestures comprise a sequence of hand postures with associated movement information13. Using facial expressions, static hand images, and hand signals, sign language provides the tools to communicate much as spoken languages do; there are also different kinds of sign languages14.

In this work, we have applied a fusion of traditional image processing with extracted hand landmarks, trained on a multi-headed CNN so that the two branches can complement each other’s weights at the concatenation layer. The main objective is to achieve a better detection rate without relying on a traditional single-channel CNN. This method has been proven to work well with less computational power and fewer epochs on medical image datasets15. The rest of the paper is organized as follows: the literature review in "Literature review" section; materials and methods in "Materials and methods" section, with three subsections: dataset description in "Dataset description", image pre-processing in "Pre-processing of image dataset", and working procedure in "Working procedure"; result analysis in "Result analysis" section; and the conclusion in "Conclusion" section.

Literature review

State-of-the-art techniques have centered on utilizing deep learning models to achieve good accuracy and low execution time. CNNs have shown huge improvements in visual object recognition16, natural language processing17, scene labeling18, medical image processing15, and so on. Despite these accomplishments, there is little work on applying CNNs to video classification. This is partly because of the difficulty in adapting CNNs to incorporate both spatial and temporal information. A model using special hardware components such as a depth camera has been used to obtain data on the depth variation in the image as an extra feature for correlation, with a CNN built on top to produce the results19, but it still has low accuracy. An innovative technique that does not need a pre-trained model was created using a capsule network and adaptive pooling11.

Furthermore, it was revealed that lowering the number of CNN layers in a greedy way and developing a deep belief network produced superior outcomes compared to other fundamental methodologies20. Feature extraction using the scale-invariant feature transform (SIFT) and classification using neural networks were developed to obtain ideal results21. In one of the methods, the images were converted into an RGB scheme, the data was built using the motion depth channel, and finally 3D recurrent convolutional neural networks (3DRCNN) were used to build a working system5,22, where Canny edge detection oriented FAST and Rotated BRIEF (ORB) was used. The ORB feature detection technique and the K-means clustering algorithm were used to create a bag-of-features model for all descriptors; however, with a plain background the easy-to-detect edges make the model entirely dependent on edges, and if the edges give wrong information the accuracy may fall, which becomes the main problem to solve.

In recent years, utilizing deep learning approaches has become standard for improving the recognition accuracy of sign language models. Using the Faster Region-based Convolutional Neural Network (Faster-RCNN)23, a CNN model is applied for hand recognition in the data image. Rastgoo et al.24 proposed a method where they cropped the image properly, used fusion between RGB and depth images (RBM), added two noise types (Gaussian noise + salt-and-pepper noise), and prepared the data for training. As a biologically inspired deep learning model, CNNs accomplish all three phases with a single framework that is trained from raw pixel values to classifier outputs, but extreme computation power is needed. The authors in ref.25 proposed 3D CNNs in which the third dimension joins both spatial and temporal stamps. It accepts several neighboring frames as input and performs 3D convolution in the convolutional layers. Along the same lines, the study reported in26 followed similar ideas and proposed regularizing the outputs with high-level features, combining the predictions of a wide range of models. They applied the developed models to recognize human actions and accomplished better performance in comparison with benchmark methods. But it is not certain that this works with hand gestures, as they detected the face first and then the body movement27.

On the other hand, Microsoft and Leap Motion have developed distinct approaches to identify and track a user’s hand and body movement by introducing the Kinect and the Leap Motion Controller (LMC), respectively. Kinect recognizes the body skeleton and tracks the hands, whereas the LMC detects and tracks hands with its built-in cameras and infrared sensors3,28. Using the provided framework, Sykora et al.7 utilized the Kinect system to capture the depth data of 10 hand motions and classify them using a speeded-up robust features (SURF) technique, which reached an accuracy of 82.8%; however, it was not tested on a more extensive database, and the modified feature extraction methods (SIFT, SURF) can become non-invariant to the orientation of gestures. Likewise, Huang et al.29 proposed a 10-word-based ASL recognition system utilizing Kinect, which, using tenfold cross-validation with an SVM, accomplished an accuracy rate of 97% with a set of frame-independent features, but the most significant problem in this method is segmentation.

The literature shows that most of the models used in this application either depend on a single variable or require high computational power. Also, the datasets chosen for training and validating these models have plain backgrounds, which are easier to detect. Our main aim is to show how to reduce the computational power needed for training and the dependency of model training on a single layer.

Materials and methods

Dataset description

Using a generalized single-color background to classify sign language is very common. We intended to avoid that single-color background and use complex backgrounds with many users’ hand images to increase the detection complexity. That is why we have used the “ASL Finger Spelling” dataset30, which has images of different sizes and orientations with complex backgrounds, with over 500 images per sign (24 signs in total) from 4 users (non-native to sign language). This dataset contains separate RGB and depth images; we have worked with the RGB images in this research. The photos were taken in 5 sessions with the same background and lighting. The dataset details are shown in Table 1, and some sample images are shown in Fig. 1.

Table 1

Details of the dataset used.

Session | Total images per session
A | 12,547
B | 13,872
C | 13,393
D | 13,154
E | 12,782

Depth: 0.49 pixel; Intensity: 0.35 pixel; Total images (all sessions): 65,748


Pre-processing of image dataset

Images were pre-processed for two operations: preparing the original image training set and extracting the hand landmarks. Traditional CNN has one input data channel and one output channel. We are using two input data channels and one output channel, so data needs to be prepared for both inputs individually.

Raw image processing

In raw image processing, we have converted the images from RGB to grayscale to reduce color complexity. Then we used a 2D kernel matrix for sharpening the images, as shown in Fig. 2. After that, we resized the images to 50 × 50 pixels for evaluation through the CNN. Finally, we have normalized the grayscale values (0–255) by dividing the pixel values by 255, so the new pixel array contains values in the range (0–1). The primary advantage of this normalization is that the CNN trains faster on inputs in the (0–1) range than on other ranges.
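As a rough illustration of this pipeline, the following Python/OpenCV sketch performs the grayscale conversion, sharpening, resizing, and normalization described above. The exact sharpening kernel values are not given in the paper, so the 3 × 3 kernel below is an assumed, commonly used one, and the function name is ours.

```python
import cv2
import numpy as np

# A common 3x3 sharpening kernel; the paper does not list its exact values,
# so this matrix is an assumption used only for illustration.
SHARPEN_KERNEL = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]], dtype=np.float32)

def preprocess_raw_image(path: str) -> np.ndarray:
    """Grayscale -> sharpen -> resize to 50x50 -> scale to [0, 1]."""
    bgr = cv2.imread(path)                          # OpenCV loads images as BGR
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)    # drop color complexity
    sharp = cv2.filter2D(gray, -1, SHARPEN_KERNEL)  # 2D sharpening convolution
    small = cv2.resize(sharp, (50, 50))             # match the CNN input size
    return small.astype(np.float32) / 255.0         # normalize pixel values to (0-1)

# Example: x = preprocess_raw_image("hand_sign.png")  ->  x.shape == (50, 50)
```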


Figure 2

Raw image pre-processing with (a) sharpening kernel.

Hand landmark detection

Google’s hand landmark model takes an RGB input with an image size of (224 × 224 × 3). So, we have taken the RGB images, converted the pixel values into float32, and resized all the images to (256 × 256 × 3). After applying the model, it gives 21 three-dimensional coordinate points. The landmark detection process is shown in Fig. 3.
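Google’s hand landmark model is commonly accessed through the MediaPipe Hands Python API; the sketch below is an assumed version of that workflow (the function name and the zero-filled fallback when no hand is detected are ours), returning the 21 three-dimensional points per image.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_hand_landmarks(path: str) -> np.ndarray:
    """Return a (21, 3) array of (x, y, z) landmarks, or zeros if no hand is found."""
    bgr = cv2.imread(path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)      # MediaPipe expects RGB input
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return np.zeros((21, 3), dtype=np.float32)  # fallback when detection fails
    lm = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)
```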


Figure 3

Hand landmarks detection and extraction of 21 coordinates.

Working procedure

The whole work is divided into two main parts: one is the raw image processing, and the other is the hand landmark extraction. After both individual processing steps are completed, a custom lightweight multi-headed CNN model is built to train on both data streams. Before processing through a fully connected layer for classification, we merged both channels’ features so that the model could choose the best weights from either branch. This working procedure is illustrated in Fig. 4.


Figure 4

Flow diagram of working procedure.

Model building

In this research, we have used a multi-headed CNN, meaning our model has two input data channels. Before this, we trained the processed images and hand landmarks with two separate models for comparison. Google’s model is not the best for “in the wild” situations, so we needed the original images to compensate for the occasional faults in Google’s model. In the first head of the model, we have used the processed images as input, and the hand landmark data is the second head’s input. On the hand landmark side, 2D convolutional layers with 50 and 25 filters, kernel (3, 3) with ReLU, and stride 1; MaxPooling2D with pool size (2, 2); batch normalization; and dropout layers have been used. On the image side, 2D convolutional layers with 32, 64, 128, and 512 filters, kernel (3, 3) with ReLU; MaxPooling2D with pool size (2, 2); batch normalization; and dropout layers have been used. After both flatten layers, the two heads are concatenated and passed through dense and dropout layers. Finally, the output dense layer has 24 units with Softmax activation. This model has been compiled with the Adam optimizer and MSE loss for 50 epochs. Figure 5 illustrates the proposed CNN architecture, and Table 2 shows the model details.
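A minimal Keras sketch of such a two-headed network is given below, following the filter counts and layer order described in the text. Dropout rates, padding, and pooling strides are not specified in the paper (and the text differs slightly from the layer-by-layer summary in Table 2), so those values are placeholder assumptions rather than the authors’ exact configuration.

```python
from tensorflow.keras import layers, models, optimizers

def conv_block(x, filters):
    # Conv -> BatchNorm -> MaxPool -> Dropout, as described in the text.
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)
    return layers.Dropout(0.2)(x)          # dropout rate not given in the paper; assumed

# Head 1: processed grayscale images (50 x 50 x 1)
img_in = layers.Input(shape=(50, 50, 1))
x = img_in
for f in (32, 64, 128, 512):
    x = conv_block(x, f)
x = layers.Flatten()(x)

# Head 2: hand landmarks (21 points x 3 coordinates, treated as a 2D map)
lm_in = layers.Input(shape=(21, 3, 1))
y = lm_in
for f in (50, 25):
    y = conv_block(y, f)
y = layers.Flatten()(y)

# Fusion: concatenate both feature vectors, then classify into 24 letters.
z = layers.concatenate([x, y])
z = layers.Dense(512, activation="relu")(z)
z = layers.Dropout(0.5)(z)                 # assumed rate
out = layers.Dense(24, activation="softmax")(z)

model = models.Model(inputs=[img_in, lm_in], outputs=out)
model.compile(optimizer=optimizers.Adam(), loss="mse", metrics=["accuracy"])
```

The design choice reflected here is that each branch keeps its own convolutional stack, and only the flattened feature vectors are fused, so a weak prediction from one branch can be compensated by the other at the concatenation layer.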


Figure 5

Proposed multi-headed CNN architecture. Bottom values are the number of filters and top values are output shapes.

Table 2

Details of model architecture.

Layer (type) | Output shape | Param # | Connected to
input_2 (InputLayer) | (None, 50, 50, 1) | 0 | –
conv2d_2 (Conv2D) | (None, 50, 50, 64) | 640 | input_2
batch_normalization_2 (BatchNormalization) | (None, 50, 50, 64) | 256 | conv2d_2
max_pooling2d_2 (MaxPooling2D) | (None, 50, 50, 64) | 0 | batch_normalization_2
conv2d_3 (Conv2D) | (None, 50, 50, 128) | 73,856 | max_pooling2d_2
dropout_1 (Dropout) | (None, 50, 50, 128) | 0 | conv2d_3
batch_normalization_3 (BatchNormalization) | (None, 50, 50, 128) | 512 | dropout_1
max_pooling2d_3 (MaxPooling2D) | (None, 50, 50, 128) | 0 | batch_normalization_3
conv2d_4 (Conv2D) | (None, 50, 50, 512) | 590,336 | max_pooling2d_3
dropout_2 (Dropout) | (None, 50, 50, 512) | 0 | conv2d_4
batch_normalization_4 (BatchNormalization) | (None, 50, 50, 512) | 2,048 | dropout_2
max_pooling2d_4 (MaxPooling2D) | (None, 25, 25, 512) | 0 | batch_normalization_4
conv2d_5 (Conv2D) | (None, 13, 13, 64) | 294,976 | max_pooling2d_4
input_1 (InputLayer) | (None, 21, 3, 1) | 0 | –
dropout_3 (Dropout) | (None, 13, 13, 64) | 0 | conv2d_5
conv2d (Conv2D) | (None, 21, 3, 50) | 500 | input_1
batch_normalization_5 (BatchNormalization) | (None, 13, 13, 64) | 256 | dropout_3
batch_normalization (BatchNormalization) | (None, 21, 3, 50) | 200 | conv2d
max_pooling2d_5 (MaxPooling2D) | (None, 7, 7, 64) | 0 | batch_normalization_5
max_pooling2d (MaxPooling2D) | (None, 21, 3, 50) | 0 | batch_normalization
conv2d_6 (Conv2D) | (None, 4, 4, 32) | 18,464 | max_pooling2d_5
conv2d_1 (Conv2D) | (None, 21, 3, 25) | 11,275 | max_pooling2d
dropout_4 (Dropout) | (None, 4, 4, 32) | 0 | conv2d_6
batch_normalization_1 (BatchNormalization) | (None, 21, 3, 25) | 100 | conv2d_1
batch_normalization_6 (BatchNormalization) | (None, 4, 4, 32) | 128 | dropout_4
max_pooling2d_1 (MaxPooling2D) | (None, 21, 3, 25) | 0 | batch_normalization_1
max_pooling2d_6 (MaxPooling2D) | (None, 2, 2, 32) | 0 | batch_normalization_6
dropout (Dropout) | (None, 21, 3, 25) | 0 | max_pooling2d_1
flatten_1 (Flatten) | (None, 128) | 0 | max_pooling2d_6
flatten (Flatten) | (None, 1575) | 0 | dropout
concatenate (Concatenate) | (None, 1703) | 0 | flatten_1, flatten
dense (Dense) | (None, 512) | 872,448 | concatenate
dropout_5 (Dropout) | (None, 512) | 0 | dense
dense_1 (Dense) | (None, 24) | 12,312 | dropout_5

Total params: 1,878,307

Trainable params: 1,876,557

Non-trainable params: 1,750


Training and testing

The input images were augmented to make training harder so that the model would not overfit. Image augmentation was done with an ImageDataGenerator using 10° rotation, 0.1 zoom range, 0.1 width and height shift ranges, and horizontal flips. To be more cautious about overfitting, we have used a dynamic learning rate, monitoring the validation accuracy with patience 5, factor 0.5, and a minimum learning rate of 0.00001. For training, we have used 46,023 images, and for testing, 19,725 images. The training versus testing accuracy and loss over 50 epochs are shown in Fig. 6.
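A hedged sketch of this training setup is shown below. It assumes the model object from the earlier architecture sketch and pre-processed arrays X_img_train, X_lm_train, y_train (and their test counterparts), all of which are hypothetical names; the batch size and the exact way the augmenter is wired into a two-input pipeline are not described in the paper.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Augmentation settings as stated above; they apply to the raw-image head.
augmenter = ImageDataGenerator(rotation_range=10,
                               zoom_range=0.1,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)

# Dynamic learning-rate reduction: halve the learning rate when validation
# accuracy stops improving for 5 epochs, down to a floor of 1e-5.
reduce_lr = ReduceLROnPlateau(monitor="val_accuracy",
                              factor=0.5,
                              patience=5,
                              min_lr=1e-5)

# Feeding augmented images together with the landmark head needs a custom
# generator, which is omitted here for brevity; this call trains on the
# prepared arrays directly.
history = model.fit([X_img_train, X_lm_train], y_train,
                    validation_data=([X_img_test, X_lm_test], y_test),
                    epochs=50,
                    batch_size=64,          # batch size is not stated in the paper; assumed
                    callbacks=[reduce_lr])
```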


Figure 6

Training versus testing accuracy and loss for 50 epochs.

For further evaluation, we have calculated the precision, recall, and F1 score of the proposed multi-headed CNN model, which shows excellent performance. To compute these values, we first calculated the confusion matrix (shown in Fig. 7). When a class is positive and also classified as positive, it is called a true positive (TP). When a class is negative and classified as negative, it is called a true negative (TN). If a class is negative but classified as positive, it is called a false positive (FP). And when a class is positive but classified as negative, it is called a false negative (FN). From these, we can define precision, recall, and F1 score as below:


Figure 7

Confusion matrix of the testing dataset. Numerical values on the X and Y axes denote the letters in sequence from A = 0 to Y = 24; the numbers 9 and 25 are missing because the dataset does not contain the letters J and Z.

Precision: Precision is the ratio of TP and total predicted positive observation.

Precision = TP / (TP + FP)    (1)

Recall: It is the ratio of TP and total positive observations in the actual class.

Recall = TP / (TP + FN)    (2)

F1 score: F1 score is the weighted average of precision and recall.

F1 score = (2 × Precision × Recall) / (Precision + Recall)    (3)
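For reference, per-class values of this kind can be reproduced from predicted and true labels with scikit-learn; y_true and y_pred below are hypothetical arrays of class indices (0–23), with y_pred obtained, for example, by taking the argmax of the model’s softmax outputs.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# y_true: ground-truth class indices; y_pred: predicted indices, e.g.
# y_pred = np.argmax(model.predict([X_img_test, X_lm_test]), axis=1)
def evaluate_predictions(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    print(confusion_matrix(y_true, y_pred))                 # counts per (true, predicted) class pair
    print(classification_report(y_true, y_pred, digits=2))  # per-class precision, recall, F1, support
```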

The precision, recall, and F1 score for the 24 classes are shown in Table 3.

Table 3

Precision, recall, and F1 score for the testing set.

Class | Precision | Recall | F1 score | Support
A | 0.99 | 0.99 | 0.99 | 819
B | 1.00 | 0.99 | 1.00 | 785
C | 1.00 | 0.99 | 0.99 | 862
D | 1.00 | 0.99 | 0.99 | 788
E | 1.00 | 1.00 | 1.00 | 812
F | 0.99 | 0.99 | 0.99 | 762
G | 0.98 | 0.99 | 0.99 | 783
H | 0.99 | 0.99 | 0.99 | 799
I | 1.00 | 1.00 | 1.00 | 804
K | 1.00 | 0.99 | 0.99 | 873
L | 1.00 | 1.00 | 1.00 | 825
M | 1.00 | 0.99 | 0.99 | 797
N | 0.98 | 0.98 | 0.98 | 801
O | 0.99 | 1.00 | 0.99 | 811
P | 0.95 | 0.98 | 0.96 | 870
Q | 1.00 | 0.95 | 0.97 | 807
R | 0.97 | 1.00 | 0.99 | 898
S | 1.00 | 1.00 | 1.00 | 886
T | 0.98 | 0.99 | 0.98 | 759
U | 0.99 | 0.98 | 0.99 | 813
V | 0.99 | 0.97 | 0.98 | 825
W | 0.99 | 0.99 | 0.99 | 970
X | 0.99 | 1.00 | 0.99 | 820
Y | 1.00 | 1.00 | 1.00 | 756
Accuracy |  |  | 0.99 | 19,725
Macro average | 0.99 | 0.99 | 0.99 | 19,725
Weighted average | 0.99 | 0.99 | 0.99 | 19,725


Result analysis

In human action recognition tasks, sign language has an extra advantage, as it can be used to communicate efficiently. Many techniques have been developed using image processing, sensor data processing, and motion detection by applying different dynamic algorithms and methods like machine learning and deep learning. Depending on their methodologies, researchers have proposed their own ways of classifying sign languages. As technologies develop, we can explore the limitations of previous works and improve accuracy. Ref.13 proposes a technique for recognizing hand gestures, which form an essential part of sign language vocabulary, based on an efficient deep convolutional neural network (CNN) architecture. The proposed CNN design eliminates the need for detection and segmentation of hands from the captured images, decreasing the computational burden faced during hand pose recognition with classical approaches. In our method, we used two input channels for the images and hand landmarks to get more robust data, making the process more efficient with dynamic learning rate adjustment. In ref.14, the presented results were acquired by retraining and testing a sign language gesture dataset on a convolutional neural network model utilizing Inception v3. The model comprises multiple convolution filter inputs that are trained on a portion of the same data. A capsule-based deep neural network sign posture translator for American Sign Language (ASL) fingerspelling (posture)20 has been introduced, where the concepts of capsules and pooling are used simultaneously in the network. This research confirms that utilizing pooling and capsule routing on the same network can improve the network’s accuracy and convergence speed. In our method, we have used the pre-trained model from Google to extract the hand landmarks, which is almost like transfer learning. We have shown that utilizing two input channels can also improve accuracy.

Moreover, ref.5 proposed a 3DRCNN model integrating a 3D convolutional neural network (3DCNN) and an enhanced fully connected recurrent neural network (FC-RNN), where the 3DCNN learns multi-modality features from RGB, motion, and depth channels, and the FC-RNN captures the temporal information among short video clips segmented from the original video. Consecutive clips with a similar semantic meaning are singled out by applying a sliding window approach over the whole video sequence. A combination of a CNN and traditional feature extractors, capable of accurate and real-time hand posture recognition, is presented in26, where the architecture is assessed on three particular benchmark datasets and contrasted with state-of-the-art convolutional neural networks. Extensive experimentation is conducted utilizing binary, grayscale, and depth data and two different validation techniques. The proposed feature fusion-based CNN31 is shown to perform better across combinations of validation procedures and image representations. Similarly, a fusion-based CNN is demonstrated to improve the recognition rate in our study.

After global motion analysis, the hand gesture image sequence was analyzed for keyframe selection. The video sequences of a given gesture were segmented in the RGB color space before feature extraction. This step benefited from the colored gloves worn by the signers. Samples of pixel vectors representative of the glove’s color were used to estimate the mean and covariance matrix of the color to be segmented, so the segmentation process was automated with no user intervention. In the color object tracking method, the video frames were converted into the HSV (Hue-Saturation-Value) color space. Then the pixels with the corresponding color were identified and labeled, and the resultant images were converted to binary (grayscale) images. The system identifies image regions corresponding to human skin by binarizing the input image with a proper threshold value. Then, small regions were eliminated from the binarized image by applying a morphological operator, and the remaining regions were selected to obtain candidate hand regions.

In the proposed method, we have used a two-headed CNN to train on the processed input images. Though a single image input stream is widely used, two input streams have an advantage over it. In the classification layer of the CNN, if one branch gives a false result, it can be complemented by the other branch’s weights, and it is possible that combining both results provides a correct outcome. We used this idea and successfully improved the final validation and test results. Before combining the image and hand landmark inputs, we tested both individually and acquired a test accuracy of 96.29% for the images and 98.42% for the hand landmarks. We did not use binarization, as it would affect the background of an image when the skin color matches the hand color. This method is also suitable for in-the-wild situations, as it is not entirely dependent on the hand position in an image frame. A comparison of the literature and our work is shown in Table 4, which shows that our method surpasses most of the existing approaches in accuracy.

Table 4

Results of reviewed works for static image approaches.

Year (Ref.) | Features | Database | Accuracy (%)
2011 (29) | American sign language with Kinect | American sign language | 97
2014 (7) | SURF and SIFT | – | 82.8
2016 (6) | CNN | American sign languages | 80.34
2018 (14) | Modified inception model | American sign languages | Average validation: 90; greatest: 98
2018 (24) | Fusion between RGB and depth image (RBM) | Massey, Fingerspelling A, NYU, ASL fingerspelling of the Surrey University | ASL Fingerspelling A: 98.13
2018 (2) | IMU-based glove | Inertial Measurement Units (IMUs), French Sign Language (LSF) | 92.95
2019 (31) | YCbCr + SkinMask fusion | Custom: 1800 images, 20 gestures | Softmax: 96.29; SVM: 97.28
2020 (22) | Random forest, naïve Bayes, SVM, logistic regression, KNN, MLP | ASL, Kaggle32 | KNN: 95.81; ORB & MLP: 96.96
Proposed method | Multi-headed CNN | American sign language | 98.98


Table 5 illustrates that the combined model, while having a larger number of parameters and consuming more memory, achieves the highest accuracy of 98.98%. This suggests that the combined approach, which incorporates both image and hand landmark information, is effective for the task when accuracy is the priority. On the other hand, the hand landmarks model, despite having fewer parameters and lower memory consumption, also performs impressively with an accuracy of 98.42%, but it carries the error and memory overhead of Google’s landmark model during training. The image model, while consuming less memory, has a slightly lower accuracy of 96.29%. The choice between these models would depend on the specific application requirements, the trade-offs between accuracy and resource utilization, and the importance of execution time.

Table 5

Complexity analysis of proposed model.

Model | Total parameters | Execution time for 50 epochs (s) | Memory used for 50 epochs (MB) | Accuracy (%)
Combined model | 1,878,307 (7.17 MB) | 8230.36 | 3030.80 | 98.98
Image model | 984,568 (3.76 MB) | 8123.96 | 2759.36 | 96.29
Hand landmarks model | 49,899 (194.92 KB) | 191.47 | 3404.91 | 98.42


Conclusion

This work proposes a methodology for sign language recognition. Sign language is the core medium of communication between deaf-mute and other people. It is highly applicable in real-world scenarios like communication, human–computer interaction, security, advanced AI, and much more. For a long time, researchers have been working in this field to make a reliable, low-cost, and publicly available SLR system using different sensors, images, videos, and many other techniques. Many datasets have been used, including numeric sensory, motion, and image datasets. Most datasets are prepared under good lab conditions for experiments, which may not be practical in the real world. That is why, looking at the real-world situation, the Fingerspelling dataset has been used, which contains real-world scenarios like complex backgrounds, uneven image shapes, and varying conditions. First, the raw images are processed and resized to 50 × 50 pixels. Then, the hand landmark points are detected and extracted from these hand images. Passing the images through these two processing techniques yields two data channels. A multi-headed CNN architecture has been proposed for these two data channels. All data has been augmented to avoid overfitting, and dynamic learning rate adjustment has been applied. From the prepared data, a 70–30% train-test split has been performed. With the 30% test set, an accuracy of 98.98% has been achieved. For a dataset of this size, this accuracy is highly reliable.

There are some limitations in the proposed method compared with the literature. Some methods may work with a small number of images, but since we use a simple CNN model, our method requires a good number of images for training. Also, the proposed method depends on the hand landmark extraction model; a different hand landmark model may produce different results. In raw image processing, it is possible to detect the hand region and crop it to reduce the image size, which may increase the recognition rate and reduce the model training time; we may try this in future work. Currently, raw image processing takes a considerable amount of training time, as we considered the whole image for training.

Author contributions

R.K.P. and M.B., conceptualization; R.K.P., methodology; R.K.P., software and coding; M.B. and R.K.P., validation; R.K.P. and M.B., formal analysis; R.K.P., S.Y., and M.B., investigation; S.Y. and R.K.P., resources; R.K.P. and M.B., data curation; S.Y., R.K.P., and M.B., writing—original draft preparation; S.Y., R.K.P., M.B., M.U.K., M.S., and A.A.F.Y., writing—review and editing; R.K.P. and M.U.K., visualization; M.U.K. and M.B., supervision; M.B., M.S., and A.A.F.Y., project administration; M.S. and A.A.F.Y., funding acquisition.

Funding

Funding was provided by the American University of the Middle East, Egaila, Kuwait.

Data availability

The dataset used in this paper (ASL Fingerspelling Images (RGB & Depth)) is publicly available at Kaggle on this URL: https://www.kaggle.com/datasets/mrgeislinger/asl-rgb-depth-fingerspelling-spelling-it-out.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Anderson R, Wiryana F, Ariesta MC, Kusuma GP. Sign language recognition application systems for deaf-mute people: A review based on input-process-output. Proced. Comput. Sci. 2017;116:441–448. doi:10.1016/j.procs.2017.10.028.

2. Mummadi C, et al. Real-time and embedded detection of hand gestures with an IMU-based glove. Informatics. 2018;5(2):28. doi:10.3390/informatics5020028.

3. Hickeys. Kinect for Windows - Windows apps. (2022). Accessed 01 January 2023. https://learn.microsoft.com/en-us/windows/apps/design/devices/kinect-for-windows

4. Rivera-Acosta M, Ortega-Cisneros S, Rivera J, Sandoval-Ibarra F. American sign language alphabet recognition using a neuromorphic sensor and an artificial neural network. Sensors. 2017;17(10):2176. doi:10.3390/s17102176.

5. Ye, Y., Tian, Y., Huenerfauth, M., & Liu, J. Recognizing American Sign Language gestures from within continuous videos. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2145–214509 (IEEE, 2018). doi:10.1109/CVPRW.2018.00280.

6. Ameen S, Vadera S. A convolutional neural network to classify American Sign Language fingerspelling from depth and colour images. Expert Syst. 2017;34(3):e12197. doi:10.1111/exsy.12197.

7. Sykora P, Kamencay P, Hudec R. Comparison of SIFT and SURF methods for use on hand gesture recognition based on depth map. AASRI Proc. 2014;9:19–24. doi:10.1016/j.aasri.2014.09.005.

8. Sahoo AK, Mishra GS, Ravulakollu KK. Sign language recognition: State of the art. ARPN J. Eng. Appl. Sci. 2014;9(2):116–134.

9. Mitra S, Acharya T. Gesture recognition: A survey. IEEE Trans. Syst. Man Cybern. Part C. 2007;37(3):311–324. doi:10.1109/TSMCC.2007.893280.

10. Rautaray SS, Agrawal A. Vision based hand gesture recognition for human computer interaction: A survey. Artif. Intell. Rev. 2015;43(1):1–54. doi:10.1007/s10462-012-9356-9.

11. Amir A, et al. A low power, fully event-based gesture recognition system. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7388–7397 (IEEE, 2017). doi:10.1109/CVPR.2017.781.

12. Lee JH, et al. Real-time gesture interface based on event-driven processing from stereo silicon retinas. IEEE Trans. Neural Netw. Learn Syst. 2014;25(12):2250–2263. doi:10.1109/TNNLS.2014.2308551.

13. Adithya V, Rajesh R. A deep convolutional neural network approach for static hand gesture recognition. Proc. Comput. Sci. 2020;171:2353–2361. doi:10.1016/j.procs.2020.04.255.

14. Das, A., Gawde, S., Suratwala, K., & Kalbande, D. Sign language recognition using deep learning on custom processed static gesture images. In 2018 International Conference on Smart City and Emerging Technology (ICSCET), 1–6 (IEEE, 2018). doi:10.1109/ICSCET.2018.8537248.

15. Pathan RK, et al. Breast cancer classification by using multi-headed convolutional neural network modeling. Healthcare. 2022;10(12):2367. doi:10.3390/healthcare10122367.

16. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc. IEEE. 1998;86(11):2278–2324. doi:10.1109/5.726791.

17. Collobert, R., & Weston, J. A unified architecture for natural language processing. In Proceedings of the 25th International Conference on Machine Learning—ICML '08, 160–167 (ACM Press, 2008). doi:10.1145/1390156.1390177.

18. Farabet C, Couprie C, Najman L, LeCun Y. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 2013;35(8):1915–1929. doi:10.1109/TPAMI.2012.231.

19. Xie B, He X, Li Y. RGB-D static gesture recognition based on convolutional neural network. J. Eng. 2018;2018(16):1515–1520. doi:10.1049/joe.2018.8327.

20. Jalal, M. A., Chen, R., Moore, R. K., & Mihaylova, L. American sign language posture understanding with deep neural networks. In 2018 21st International Conference on Information Fusion (FUSION), 573–579 (IEEE, 2018).

21. Shanta, S. S., Anwar, S. T., & Kabir, M. R. Bangla sign language detection using SIFT and CNN. In 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 1–6 (IEEE, 2018). doi:10.1109/ICCCNT.2018.8493915.

22. Sharma A, Mittal A, Singh S, Awatramani V. Hand gesture recognition using image processing and feature extraction techniques. Proc. Comput. Sci. 2020;173:181–190. doi:10.1016/j.procs.2020.06.022.

23. Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst., 28 (2015).

24. Rastgoo R, Kiani K, Escalera S. Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy. 2018;20(11):809. doi:10.3390/e20110809.

25. Jhuang, H., Serre, T., Wolf, L., & Poggio, T. A biologically inspired system for action recognition. In 2007 IEEE 11th International Conference on Computer Vision, 1–8 (IEEE, 2007). doi:10.1109/ICCV.2007.4408988.

26. Ji S, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013;35(1):221–231. doi:10.1109/TPAMI.2012.59.

27. Huang, J., Zhou, W., Li, H., & Li, W. Sign language recognition using 3D convolutional neural networks. In 2015 IEEE International Conference on Multimedia and Expo (ICME), 1–6 (IEEE, 2015). doi:10.1109/ICME.2015.7177428.

28. Digital worlds that feel human. Ultraleap. Accessed 01 January 2023. https://www.leapmotion.com/

29. Huang, F., & Huang, S. Interpreting American Sign Language with Kinect. Journal of Deaf Studies and Deaf Education (Oxford University Press, 2011).

30. Pugeault, N., & Bowden, R. Spelling it out: Real-time ASL fingerspelling recognition. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 1114–1119 (IEEE, 2011). doi:10.1109/ICCVW.2011.6130290.

31. Rahim MA, Islam MR, Shin J. Non-touch sign word recognition based on dynamic hand gesture using hybrid segmentation and CNN feature fusion. Appl. Sci. 2019;9(18):3790. doi:10.3390/app9183790.

32. ASL Alphabet. Accessed 01 January 2023. https://www.kaggle.com/grassknoted/asl-alphabet
