Abstract: The invention proposes a new method for recognizing sign language gestures using video data. It focuses on generating meaningful patterns from video frames, specifically key-frames, which capture the most important moments of the gesture. The patterns are generated from these key-frames through human body segmentation and contour extraction, focusing on the upper body to emphasize the critical components of sign language communication. The resulting pattern sequences are then utilized to train a three-dimensional convolutional neural network (3D CNN), enabling the model to recognize sign language gestures based on their temporal and spatial characteristics. This innovative approach not only facilitates more accurate gesture recognition but also provides a framework for converting pre-existing patterns into class-specific representations, thereby advancing the field of sign language recognition technology. Figure 5
Description: FIELD OF THE INVENTION
This invention presents a method for generating a pattern sequence to represent sign language gesture classes. More particularly, it provides a method for directly analyzing video data to extract meaningful patterns associated with sign language gestures. Additionally, it describes a method for converting pre-existing patterns or sequences into class-specific sign language representations.
BACKGROUND OF THE INVENTION
Sign language stands out as a unique and expressive mode of communication, predominantly used by the Deaf and hard-of-hearing communities, who rely on facial expressions, hand gestures, and body postures to express their emotions and feelings and to convey their messages to others. Communication is one of the vital activities of human beings for expressing feelings, maintaining social bonds, sharing ideas, and working together in society. However, the gap in understanding and interpreting sign language poses significant challenges, hindering effective communication between sign language users and the broader society. This communication barrier limits educational and employment opportunities, emphasizing the critical need for an advanced Sign Language Recognition System (SLRS).
Traditional machine learning algorithms have been used to classify sequences of images and frames that reflect specific sign words or gestures by extracting temporal and spatial information from a given dataset. For this purpose, many researchers have utilized various traditional techniques such as image pre-processing, hand detection, image segmentation, hand shape detection, contour detection, and hand tracking, along with feature extraction and classification.
Dutta et al. used Principal Component Analysis (PCA) to extract features from the images and K-Nearest Neighbor (KNN) for classification. The results of the paper showed that the authors obtained an accuracy of 95.84% for the recognition of alphabets used in Indian Sign Language (ISL). ISL comprises various types of single- and double-hand gestures.
Joshi et al. addressed single- and double-hand sign language recognition with a fusion-based technique, in which the authors used Histogram of Oriented Gradients (HOG) and Scale Invariant Feature Transform (SIFT) along with K-Nearest Neighbor for the recognition of ISL alphabets. The proposed method achieved an accuracy of 90%. Multiple types of vision-based solutions based on deep learning algorithms have also been used in the past.
Das et al. proposed a vision-based SLRS named Hybrid CNN-BiLSTM SLR (HCBSLR) to overcome the problems related to excessive pre-processing. The authors followed a Histogram Difference (HD) approach for extracting features from the selected keyframes in order to improve efficiency, reliability, and accuracy, and utilized Bidirectional Long Short-Term Memory (BiLSTM) for the extraction of temporal features. The proposed model obtained an average accuracy of 87.67%.
Vyavahare et al. proposed a deep-learning-based Sign Language Detection system for ISL. In the proposed solution, the authors utilized Long Short-Term Memory (LSTM) networks for detecting and recognizing the actions of signers through dynamic gestures captured in the keyframes of the selected dataset videos. The proposed system achieved an accuracy of 96% in the training phase and 87% in the testing phase for ISL recognition.
Several publications, including patents and non-patent documents, exist in this domain.
Reference is made to patent application no. US201715584361A, titled as “Automated sign language recognition”. This prior art is configured to detect interest points in an extracted sign language feature, wherein the interest points are localized in space and time in each image acquired from a plurality of frames of a sign language video; apply a filter to determine one or more extrema of a central region of the interest points; associate features with each interest point using a neighboring pixel function; cluster a group of extracted sign language features from the images based on a similarity between the extracted sign language features; represent each image by a histogram of visual words corresponding to the respective image to generate a code book; train a classifier to classify each extracted sign language feature using the code book; detect a posture in each frame of the sign language video using the trained classifier; and construct a sign gesture based on the detected postures.
Another reference is made to patent application no. CN202211007786A, titled as “Sign language letter spelling recognition method based on convolutional neural network”. This prior art describes a method that extracts features of a hand depth map using a convolutional neural network and conducts sign language letter spelling recognition. After acquiring the sign language picture and the depth picture, the depth camera sends them to a target detection network to extract an accurate hand target picture and an accurate depth picture. After the hand target is extracted, the depth picture is divided into accurate sign language gesture targets through gray-value-based pseudo-color linear transformation and a color gamut division algorithm, and lost sign language gesture information is supplemented through a color fusion algorithm. After the segmentation is finished, the picture is subjected to pixel processing through graying and local area binarization to form a single-channel binary image so as to reduce the number of network input parameters, and the picture pre-processing is finished. Finally, the pre-processed sign language gesture pictures are sent to a convolutional neural network for feature extraction, the extracted features are connected to a fully connected layer, and classification is performed through a SoftMax classifier. After training, the network model is saved and used for sign language letter spelling recognition.
Another reference is made to patent application no. IN202111028446, titled as “Sign language and gesture capture and detection”. This prior art discloses systems and methods for sign language and gesture identification and capture. A method may include capturing a series of images of a user and determining whether there are regions of interest in an image of the series of images. In response to determining that there is a region of interest in a particular image of the series of images, the method may include determining whether the region of interest includes a gesture movement by the user and determining whether the region of interest includes a sign language sign movement by the user. An image representation of a gesture or a sign may be generated.
Another reference is made to the non-patent document by Geethu G Nath and Arun C S, titled “Real Time Sign Language Interpreter”. This prior art describes sign language as a medium of communication for many disabled people. Sign language recognition has many applications, including gesture-controlled activities such as human-computer interaction, gesture-controlled home appliances and other electronic devices, and many applications that use gestures as the trigger input. The most important application is that it provides a communication aid for deaf and dumb people. The system for sign language recognition for deaf and dumb people is implemented on an ARM CORTEX A8 processor board using a convex hull algorithm and a template matching algorithm. The image is obtained using a webcam. This hand sign image is converted to text so as to develop communication between normal and deaf and dumb people. OpenCV is a software tool that provides support for image processing techniques. The system converts sign language to text for deaf and dumb people to communicate with normal people. Moreover, the system is used to control devices such as robots, car audio systems, home appliances, etc.
Another reference is made to non-patented document by "Malladi Sai Phani Kumar ATDC, IIT, Kharagpur, West Bengal; Veerapalli Lathasree; S. N. Karishma titled as “Novel contour based detection and GrabCut segmentation for sign language recognition”. This prior art discloses an automatic computer aided hand gesture to voice conversion system for people suffering from Aphonia, a medical term for speech impairment. This art assumed the input gesture images given to the system as simple and complex depending on the background of the image. A contour based image segmentation algorithm is proposed in this paper to detect the boundary of the foreground from images with simple dark background. The traditional GrabCut algorithm is employed for segmentation of foreground from images with complex background. This algorithm iteratively segments the image to extract the foreground accurately. The American Sign Language (ASL) 26 finger-spelled alphabet images are taken as the dataset for the two above mentioned algorithms. For the dataset that this prior art generated, it is observed from the results that contour based segmentation algorithm provides absolutely perfect results. The number of iterations required for GrabCut algorithm to segment the foreground may vary from image to image depending on the background. Out of all the 26 gesture of alphabets, Q, R and S need 6 number of iterations at maximum. A minimum of 1 iteration is required for alphabets E, J and O. On average, 3 iterations of GrabCut algorithm is required to completely segment the foreground from images with complex background.
The existing prior art relies on traditional image processing techniques, such as contour-based detection and GrabCut segmentation, to analyze video data and extract patterns associated with sign language gestures.
In view of the drawbacks associated with the above existing state of the art, the present invention utilizes advanced machine learning algorithms for analysis, which enhances accuracy, efficiency, scalability, and adaptability, making it a significant improvement over the prior art.
OBJECT OF THE INVENTION
In order to overcome the shortcomings in the existing state of the art, the present invention provides novel methods for generating a sequence of patterns to represent sign language classes.
Yet another objective of the invention is to provide novel methods for classification, facilitating a more comprehensive and accurate analysis of the video content.
Yet another object of the invention is to provide a method for identifying key frames from a video sequence that represent the most important moments of a sign language gesture.
Yet another object of the invention is to provide a method for generating patterns from the key frames by segmenting the human body gestures and extracting contours.
Yet another object of the present invention is to provide a method for training a 3D CNN model to recognize sign language gestures based on the generated pattern sequences.
Yet another objective of the present invention is to provide a method for enhancing the accuracy and robustness of the recognition process compared to methods solely based on static image segmentation.
Yet another objective of the invention is to provide different methods that enhance accuracy, efficiency, scalability, and adaptability for generating a pattern sequence to represent a sign language class.
SUMMARY OF THE INVENTION:
The present invention provides a method for generating a pattern sequence to represent sign language gesture classes. More particularly, it provides a method for directly analyzing video data to extract meaningful patterns associated with sign language gestures. Additionally, it describes a method for converting pre-existing patterns or sequences into class-specific sign language representations.
This inventive method involves preparing video frames, identifying key-frames, and generating pattern sequences. Key-frames are extracted from the video and used to generate patterns through human body segmentation and contour extraction. These patterns are then used to train a 3D CNN model for sign language recognition. In this method the intelligent pattern generation for sign language (IPGSL) focuses on generating patterns to represent sign language gestures, providing a more intuitive and potentially more accurate approach compared to traditional methods. This method identifies key-frames from the video, which capture the most important moments of the sign language gesture. Patterns are generated from the key-frames by segmenting the human body and extracting contours. The generated pattern sequences are used to train a 3D CNN model, which is capable of recognizing sign language gestures based on their temporal and spatial characteristics.
BRIEF DESCRIPTION OF DRAWINGS
Figure 1 depicts sign language recognition using IPGSL.
Figure 2 depicts a sign language representing the gesture class ’Curved’
Figure 3 depicts a block diagram of IPGSL framework.
Figure 4 depicts a sign language representing the gesture class ‘Cheap’.
Figure 5 depicts proposed model architecture.
Figure 6 depicts face extracted from a frame for class ’Curved.’
Figure 7 depicts an accuracy vs number of key-frames plot on three distinct datasets.
Figure 8 depicts comparison Result of WLASL-100 for IPGSL model vs SOTA Models.
Figure 9 depicts comparison Result of Accuracy of 5-Fold Cross Validation of WLASL-100 for IPGSL model vs SOTA Models.
Figure 10 depicts comparison Result of DSGS for IPGSL model vs SOTA Models.
Figure 11 depicts comparison Result of Accuracy of 5-Fold Cross Validation of DSGS for IPGSL model vs SOTA Models.
Figure 12 depicts Comparison Result of Include-50 for IPGSL model vs SOTA Models.
Figure 13 depicts Comparison Result of Accuracy of 5-Fold Cross Validation of Include-50 for IPGSL model vs SOTA Models.
DETAILED DESCRIPTION OF THE INVENTION WITH ILLUSTRATIONS AND EXAMPLES
While the invention has been disclosed with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made, and equivalents may be substituted, without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from its scope.
Throughout the specification and claims, the following terms take the meanings explicitly associated herein unless the context clearly dictates otherwise. The meaning of “a”, “an”, and “the” include plural references. Additionally, a reference to the singular includes a reference to the plural unless otherwise stated or inconsistent with the disclosure herein.
The present invention discloses a pattern recognition-based method named Intelligent Pattern Generation for Sign Language (IPGSL) for sign language recognition. The method generates a sequence of patterns to represent each class in sign language. This invention generates a sequence of 5 patterns to represent a single action/class as shown in Fig. 2, where Fig. 2a represents the pattern sequence for the class ’Fan’.
This generated sequence of patterns is used as input for training a simple 3D CNN model. Fig. 3 illustrates the working of IPGSL, and the method involves the following stages:
• Preparation of video frames
• Identification of key-frames (IKF)
• Generation of Pattern sequence (GPS)
Preparation of video frames
The initial step is to extract individual frames {f0, f1, . . . , fm} from the input sign language video. In order to prevent the loss of essential frames, the frames are extracted at regular 50 ms intervals, regardless of the video’s duration. The background is then removed from the extracted frames, which are converted for further processing.
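By way of non-limiting illustration, the following Python sketch shows how frames may be extracted at 50 ms intervals using OpenCV, as described above; the function name and the omission of the background-removal step are illustrative assumptions rather than part of the claimed method.

```python
# Illustrative sketch of the frame-preparation step: extract frames every
# 50 ms from a sign language video using OpenCV. Background removal is
# assumed to be applied afterwards and is not shown here.
import cv2

def extract_frames(video_path, interval_ms=50):
    cap = cv2.VideoCapture(video_path)
    frames = []
    t = 0.0
    while True:
        # Seek to the next 50 ms timestamp and read the frame at that position.
        cap.set(cv2.CAP_PROP_POS_MSEC, t)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        t += interval_ms
    cap.release()
    return frames  # corresponds to {f0, f1, ..., fm}
```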
Identification of key frames
The frames extracted from the previous stage are then subjected to identify key-frames (KF), which represent a sign language gesture class. Generally, after completion of a gesture, a person retracts the hand back to the normal position.
Thus, when choosing key frames, it’s essential to exclude frames from the segment where hand retraction initiates. The objective of this stage is to identify the most distinguishable frames {KF0, KF1, . . . , KFk} that can better represent a sign language gesture class from the available set of frames {f0, f1, . . . , fm}, where k < m.
As an initial step, the starting frame {f0} is considered as a key-frame and it is also set as the key-frame under consideration (KFC).

{KF} = {KF} ∪ {f0}   (2)

KFC = f0   (3)
where {KF} is the set of key-frames. Thereafter, every consecutive frame {f1, f2, . . . , fm} is compared with the KFC to compute the inter-frame difference (D).

D = KFC − fi   (4)
If the inter-frame difference (D) for a specific frame fi exceeds the predefined threshold value θ, that frame will be compared to all the key-frames in {KF}. The frame fi will be classified as a key-frame if the individual difference between that frame and all the key-frames in {KF} exceeds the specified threshold value θ.
Di = KFi − fi, ∀ KFi ∈ {KF}   (5)

if ∀ Di > θ ⟹ {KF} = {KF} ∪ {fi}   (6)
The number of key-frames k in KF depends on the value of θ. k can be defined as a linear function of the number of frames m and the threshold θ, k = f(θ, m). This invention derives the function using a system of linear equations and validates its correctness through line fitting. The expression for k is defined as follows:
k = aθ + bm + c (7)
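By way of non-limiting illustration, the following NumPy sketch fits the coefficients a, b, and c of equation (7) from observed (θ, m, k) triples by least-squares line fitting and then inverts the relation to obtain θ for a desired key-frame count; the observation values in the sketch are placeholders, not measurements from the invention.

```python
# Hypothetical sketch: fit k = a*theta + b*m + c by least squares from
# observed (theta, m, k) triples, then solve for theta given a target k.
import numpy as np

# Placeholder observations: each row is (theta, m); y holds the measured k.
X = np.array([[10, 60], [20, 60], [10, 90], [20, 90]], dtype=float)
y = np.array([9, 6, 12, 8], dtype=float)

A = np.column_stack([X, np.ones(len(X))])          # columns: theta, m, 1
(a, b, c), *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit

def theta_for(k, m):
    """Invert k = a*theta + b*m + c for theta (assumes a != 0)."""
    return (k - b * m - c) / a

print(theta_for(k=5, m=60))
```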
The sequence of steps for the identification of key-frames (IKF) from the set of available frames {f0, f1, . . . , fm} is presented in Algorithm 1.
The algorithm produces a set of key-frames {KF0, KF1, . . . , KFk} as its output.
Algorithm 1 Identification of Key-frame (IKF)
Input: Set of video frames F = {f0, f1, . . . , fm} and Threshold θ
Output: Set of key frames KF = {KF0, KF1, . . . , KFk}
KF = {f0}
KF C = f0
for i = 1 to m do
D = KF C − fi
if D > θ then
for ∀KFj ∈{KF} do
Compute Dj = fi − KFj
if ∀Dj > θ then
KF= {KF} ∪{fi}
KFC= fi
end if
end for
end if
end for
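By way of non-limiting illustration, the following Python sketch implements Algorithm 1; since the specification does not fix a numerical definition of the inter-frame difference D, the sketch assumes a mean absolute grayscale pixel difference, which is only one possible choice.

```python
# Sketch of Algorithm 1 (IKF). The inter-frame difference D is assumed here
# to be the mean absolute grayscale pixel difference between two frames.
import cv2
import numpy as np

def frame_diff(a, b):
    ga = cv2.cvtColor(a, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gb = cv2.cvtColor(b, cv2.COLOR_BGR2GRAY).astype(np.float32)
    return float(np.mean(np.abs(ga - gb)))

def identify_key_frames(frames, theta):
    key_frames = [frames[0]]          # KF = {f0}
    kfc = frames[0]                   # key-frame under consideration (KFC)
    for fi in frames[1:]:
        if frame_diff(kfc, fi) > theta:
            # fi must differ from every key-frame already selected
            if all(frame_diff(kf, fi) > theta for kf in key_frames):
                key_frames.append(fi)
                kfc = fi
        # otherwise fi is too similar to KFC and is skipped
    return key_frames
```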
Generation of Pattern sequence (GPS)
In this stage, a sequence of patterns P = {p0, p1, . . . , pk} will be generated from the key frames {KF0, KF1, . . . , KFk} extracted in the previous stage. It includes the following steps:
Segmentation of Human Body
Generation of Overlay (Patterns)
Retention of Facial Landmark
Retention of Hand Landmark
Segmentation of Human Body:
As an initial step towards pattern generation, the human body is extracted from the key-frames image using a segmentation algorithm. Generally, sign language gestures are formed through a combination of facial expressions and hand movements, which predominantly focus on the superior portion of the human body. Hence, the lower part of the human body is cropped from the key-frames after human body segmentation. Thereafter, a binary mask that highlights all the contours of the segmented image is generated through an edge detection algorithm.
Fig 4 shows the sequence of steps for the segmentation of a human body, wherein Fig 4a shows the sequence of key-frames representing Class ’cheap’. The resulting image after cropping the lower part of the image is represented in Fig 4b and the output after segmenting the human body is represented in Fig 4c. Finally, Fig 4d represents binary images generated from the segmented human body through the canny edge detection algorithm.
Generation of Overlay (Patterns):
Typically, the largest contour in the binary mask will be the outline of the human body. As an initial step, the invention detects all contours C = {C1, C2, . . . , Ce} within the binary mask corresponding to a key-frame KFi. Thereafter, the contours that are irrelevant or too small relative to the main body are filtered out by setting a minimum contour area threshold Amin. Contours with an enclosed area below the specified threshold Amin are discarded, while larger contours are saved as efficient contours EC.
Ci ∈ EC iff A(Ci) > Amin (8)
Hereafter, the area of the region enclosed by each preserved contour is calculated to detect the largest contour in the binary mask, which becomes the pattern pi. Finally, a sequence of patterns P = {p0, p1, . . . , pk} is obtained. The sequence of steps for the generation of the pattern sequence (GPS) P = {p0, p1, . . . , pk} from the set of key-frames {KF0, KF1, . . . , KFk} is presented in Algorithm 2.
The generated k patterns are then supplied as input to a CNN that contains two convolutional layers, two max-pooling layers, a flatten layer, a fully connected layer, and an output layer. The CNN finally predicts the correct class of each sign language video.
Fig. 4: A sign language representing the gesture class Cheap a) ’Key-frames’ b) ’Upper part of Human body’ c) ’Segmentation of Human Body’ d) ’Binary Image of Segmented Human Body’.
Algorithm 2 Generation of patterns (GPS)
Input: Set of key-frames KF = {KF0, KF1, . . . , KFk}
Output: Set of Patterns P = {p0, p1, . . . , pk}
P = { }
EC = { }
for ∀KFi ∈{KF} do
Crop the image in KFi to upper body
Segment the human body from KFi
Generate binary image of KFi
Identify set of contours C = {C1, C2, . . . , Ce} in KFi
for ∀Cj ∈{C} do
Compute the area enclosed by contour Cj as A(Cj)
if A(Cj ) > Amin then
EC = {EC} ∪ {Cj}
end if
end for
Find the Ck ∈ EC with largest Area
P= {P} ∪{Ck}
end for
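By way of non-limiting illustration, the following OpenCV sketch follows the steps of Algorithm 2 for a set of key-frames; the upper-body crop (taken here as the top half of the frame), the Canny thresholds, the minimum area Amin, and the 256 × 256 output size are illustrative assumptions, and human body segmentation (e.g., GrabCut, as noted in the experimental setup) is assumed to have been applied beforehand.

```python
# Sketch of Algorithm 2 (GPS): crop each key-frame to the upper body, build a
# Canny binary mask, filter small contours, and keep the largest contour as
# the pattern. Threshold and size values are illustrative placeholders.
import cv2
import numpy as np

def generate_patterns(key_frames, a_min=500.0, out_size=(256, 256)):
    patterns = []
    for kf in key_frames:
        upper = kf[: kf.shape[0] // 2, :]                    # crop to upper body
        gray = cv2.cvtColor(upper, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)                    # binary mask
        contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        efficient = [c for c in contours if cv2.contourArea(c) > a_min]
        if not efficient:
            continue
        largest = max(efficient, key=cv2.contourArea)        # the pattern p_i
        canvas = np.zeros(upper.shape[:2], dtype=np.uint8)
        cv2.drawContours(canvas, [largest], -1, 255, 2)
        patterns.append(cv2.resize(canvas, out_size))
    return patterns
```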
Retention of Face and Hand Landmarks
Facial features are pivotal in sign language recognition (SLR) as they provide essential cues that convey emotional context and enhance the meaning of hand signs. Non-manual signals, such as eyebrow positioning and facial expressions, are particularly valuable for disambiguating signs that could have multiple meanings. The invention’s method identifies key points corresponding to major facial landmarks, such as the eyes, eyebrows, nose, and mouth, and represents these landmarks through piecewise linear functions or curves.
Retention of Eyes
In our approach, the eyes are modeled as closed shapes, defined by several key points around the contour of each eye. This invention approximates the boundary of the eye by connecting key points sequentially to form a polygon, represented as a closed curve using piecewise linear functions to capture the eye's natural shape. This closed curve for the eye's boundary is defined by the equation:
feye(t) = Σ_{i=1}^{n−1} [(1 − t)pi + t pi+1], 0 ≤ t ≤ 1,   (9)
where pi and pi+1 are adjacent key points on the eye contour, and t is a parameter ranging from 0 to 1. When t is closer to 0, the result is nearer to pi, and as t approaches 1, the result shifts closer to pi+1. This piecewise interpolation allows for smooth transitions between points, forming a continuous closed curve that accurately represents the eye’s shape.
Retention of Eyebrows
The eyebrows are modeled as open curves, connecting key points along the eyebrow’s natural shape without looping back to the starting point. The eyebrow shape is modeled using a piecewise linear function to accurately capture its natural contour and expressive movements. By interpolating between consecutive key points along the eyebrow, the invention represents its upward or downward motions, which are essential for conveying emotional and grammatical cues. The eyebrow curve is defined as:
feyebrow(t) = Σ_{i=1}^{n−1} [(1 − t)pi + t pi+1], 0 ≤ t ≤ 1,   (10)
where pi and pi+1 are adjacent points on the eyebrow, and t is a parameter ranging from 0 to 1. As t varies, the function interpolates smoothly between points, forming an open curve that represents the eyebrow’s shape and expressive movements.
Retention of Nose
The nose is represented as a simple linear structure, connecting the top and bottom key points of the nose bridge. The line representing the nose effectively captures its orientation and alignment and is defined by the parametric equation:
fnose(t) = (1 − t)ptop + t pbottom, 0 ≤ t ≤ 1,   (11)
where ptop and pbottom are the coordinates of the top and bottom points of the nose, respectively. The parameter t ranges from 0 to 1, allowing the function to interpolate between these points.
Retention of Lips
The lips are modeled using two sets of key points: one for the outer boundary and another for the inner boundary of the mouth. Both contours are represented
as closed shapes, with points connected sequentially to form polygons approximating the lip boundaries. Each contour is defined as a piecewise linear curve:
flips(t) = Σ_{i=1}^{n−1} [(1 − t)pi + t pi+1], 0 ≤ t ≤ 1,
where pi and pi+1 are adjacent points on the lip boundary, and t is a parameter that interpolates between points, forming a smooth closed curve that captures the natural shape of the lips.
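By way of non-limiting illustration, the following Python sketch samples the piecewise linear landmark curves defined in equations (9) to (11), treating eyes and lips as closed curves and eyebrows as open curves; the key-point coordinates shown are made-up examples and would in practice come from a landmark detector.

```python
# Sample the piecewise linear landmark curves: each segment interpolates
# (1 - t)*p_i + t*p_{i+1} for t in [0, 1]. Key points are (x, y) pairs.
import numpy as np

def sample_landmark_curve(points, closed=False, samples_per_segment=10):
    pts = [np.asarray(p, dtype=float) for p in points]
    if closed:                       # eyes and lips loop back to the start
        pts = pts + [pts[0]]
    curve = []
    for p_i, p_next in zip(pts[:-1], pts[1:]):
        for t in np.linspace(0.0, 1.0, samples_per_segment, endpoint=False):
            curve.append((1.0 - t) * p_i + t * p_next)
    curve.append(pts[-1])
    return np.array(curve)

# Example usage with made-up coordinates: an open eyebrow curve and a closed eye contour.
eyebrow = sample_landmark_curve([(10, 5), (20, 3), (30, 4)], closed=False)
eye = sample_landmark_curve([(12, 10), (18, 8), (24, 10), (18, 12)], closed=True)
```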
Retention of Hand Landmarks
Hand landmark detection captures essential points on the hand as 2D coordinates (xi, yi), which are crucial for accurately representing hand
gestures. Generally, two primary landmarks are defined for each finger: the fingertip and the base (metacarpophalangeal joint, MCP). To effectively capture the finger’s alignment, lines are drawn by connecting these base and tip points using the parametric line equation:
ffinger(t) = (1 − t)hbasei + t htipi, t ∈ [0, 1]   (12)
where hbasei and htipi denote the coordinates of the finger base and tip, respectively. The parameter t varies from 0 to 1.
Thus, the invention retains facial and hand landmarks from the generated k patterns. The generated k patterns are then supplied as input to the Convolutional Neural Network (CNN), which contains two convolutional layers, two max-pooling layers, a flatten layer, a fully connected layer, and an output layer. The CNN finally predicts the correct class of each sign language video.
EXPERIMENTAL SETUP
Dataset Description
This study specifically focuses on three distinct sign language datasets: (i) WLASL-100 - American Sign Language (ASL) [6], (ii) SMILE DSGS- Swiss German Sign Language [28], and (iii) INCLUDE-50 - Indian Sign Language (ISL) [26] dataset.
The WLASL-100 dataset represents a subset of glosses with a vocabulary size of 100. This dataset contains videos with lengths ranging from 0.36 to 8.12 seconds, with a frame size of 656 × 370. The average video length in WLASL-100 is 2 seconds, and the average frame rate is 30 frames per second. The WLASL-100 dataset contains videos of a person demonstrating sign language in both standing and sitting positions.
The DSGS dataset was derived from a vocabulary production test consisting of 100 distinct stimuli. This dataset contains videos with lengths ranging from 0.6 to 4.2 seconds, with a frame size of 1920 × 1072. The average video length in DSGS is 3 seconds, and the average frame rate is 26 frames per second. The DSGS dataset contains videos of a person demonstrating sign language only in the sitting position.
The INCLUDE-50 dataset is a subset of the Indian Lexicon Sign Language Dataset, consisting of 50 word signs across 15 different word categories. The dataset contains a total of 4,790 videos with a frame size of 1920 × 1080. The average video length in INCLUDE-50 is 2 seconds, and the average frame rate is 25 frames per second. The INCLUDE-50 dataset contains videos of a person demonstrating sign language only in the standing position.
Selection of the Threshold θ
To decide the best value for the threshold θ, this invention conducted experiments to find the recognition accuracy of IPGSL while varying the number of key-frames (k) across the datasets Include-50, SMILE-DSGS, and WLASL-100. Figure 7 presents the plot describing the relationship between the number of key-frames k and the sign language recognition accuracy. The recognition accuracy improves significantly when the number of key-frames changes from 3 to 7 for all these datasets. When the number of key-frames is 5 (k = 5), accuracy stabilizes around 96% for Include-50, 92% for SMILE-DSGS, and 93% for WLASL-100. There is no significant improvement in accuracy when the number of key-frames is greater than 5. For Include-50, accuracy increases by only 0.3% when moving from 5 to 13 key-frames. The same trend is observed for SMILE-DSGS and WLASL-100, with minimal accuracy improvements beyond 5 key-frames. The IPGSL algorithm produces lower accuracy when the number of key-frames is less than 5, which indicates the loss of significant information. Thus, the invention sets the number of key-frames to 5 (k = 5) for the experiments. The value of θ is calculated using equation 7, where the invention sets the value of k to 5 and m is the total number of frames in the sign language video.
CNN Classifier.
The IPGSL model is implemented in Google Colab using Python version 3.10.12 with the help of the Keras and TensorFlow deep learning libraries. Table I provides details about the different libraries used to implement the proposed model. In this experiment, the Mediapipe library is used for hand detection and hand landmark extraction. The GrabCut algorithm is used for background extraction, and the Canny edge detection algorithm is used for binary mask generation from the key-frames.
In this invention, a CNN deep learning model is used for the classification of sign languages. The number of patterns generated is set to 5, determined automatically using equation 7. The input layer of the model comprises 5 neurons that receive 5 patterns of size 256 × 256 generated by the proposed method. The first convolutional layer contains 16 filters and the second convolutional layer contains 32 filters of size 2 × 2 with a stride of 2, producing feature maps of size 128 × 128 and 64 × 64, respectively. Max-pooling layers use a filter size of 2 × 2, and the rectified linear unit (ReLU) is used as the activation function. The Adam optimizer is used to reduce the loss during model training. The proposed model is trained for 20 epochs with a learning rate of 10^-4. Tables I and II present the Python libraries and the parameter settings for training the IPGSL model, respectively. Table III lists the parameters employed in the state-of-the-art (SOTA) models used for comparing the performance of the proposed model.
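By way of non-limiting illustration, the following Keras sketch builds a classifier along the lines described above, with two convolutional layers of 16 and 32 filters (2 × 2 kernels, stride 2), two max-pooling layers, a flatten layer, a fully connected layer, and a softmax output, trained with the Adam optimizer at a learning rate of 10^-4; stacking the five 256 × 256 patterns as input channels and the width of the fully connected layer are interpretive assumptions rather than the exact claimed architecture.

```python
# Minimal Keras sketch of a classifier consistent with the description above.
# The five 256x256 patterns are stacked as input channels here; the exact
# input arrangement and dense width are interpretations, not the claimed model.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_ipgsl_like_model(num_classes, num_patterns=5, size=256):
    model = models.Sequential([
        layers.Input(shape=(size, size, num_patterns)),
        layers.Conv2D(16, (2, 2), strides=2, activation="relu"),  # -> 128x128x16
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (2, 2), strides=2, activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),      # dense width is illustrative
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```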
Evaluation Metrics
In this invention, accuracy, categorical cross entropy (CCE) Loss, macro average precision (MAP), macro average recall (MAR), and macro average F1-score (MF1) are used as evaluation metrics for the performance analysis of the model.
The metrics are computed using the following equations:
Accuracy = (Correctly Predicted Images/Total Number of Images) × 100 (13)
MAP = (1/N) Σ_{i=1}^{N} TPi / (TPi + FPi)   (14)

MAR = (1/N) Σ_{i=1}^{N} TPi / (TPi + FNi)   (15)

MF1 = 2 × (MAP × MAR) / (MAP + MAR)   (16)
Where TP = the number of true positive values, TN = the number of true negatives, FN = the number of false negatives and FP = number of false positives
CCE Loss = − Σ_i ytrue,i · log(ypred,i)   (17)
Where ytrue,i is the true probability of class i and ypred,i is the predicted probability of class i
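By way of non-limiting illustration, the following sketch computes the metrics of equations (13) to (17) using scikit-learn and NumPy; the label arrays are placeholders.

```python
# Sketch of the evaluation metrics in equations (13)-(17).
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0, 1, 2, 1, 0])        # placeholder class labels
y_pred = np.array([0, 1, 1, 1, 0])

accuracy = accuracy_score(y_true, y_pred) * 100                               # eq. (13)
map_ = precision_score(y_true, y_pred, average="macro", zero_division=0) * 100  # eq. (14)
mar = recall_score(y_true, y_pred, average="macro", zero_division=0) * 100      # eq. (15)
mf1 = f1_score(y_true, y_pred, average="macro", zero_division=0) * 100          # eq. (16)

def cce_loss(y_true_onehot, y_pred_prob, eps=1e-12):
    """Categorical cross-entropy averaged over samples (equation 17)."""
    return float(-np.mean(np.sum(y_true_onehot * np.log(y_pred_prob + eps), axis=1)))
```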
Table 1. Python libraries used for implementation of IPGSL model [10]
Python Library Purpose
OpenCV To extract frames from input videos and perform image processing
Keras For importing the CNN model and the sequential model
Mediapipe For hand tracking and extracting hand features
Tensorflow For importing convolution layer, pooling layer, dropout layer, callback function, etc.
Os For directory related operation
Sklearn For dataset split, confusion matrix and accuracy
Numpy For operating multidimensional array, matrices
Glob For importing input videos from hierarchical directory
Table 2. Parameters for training the IPGSL model.
Parameters Measures
Batch size 5
Learning rate 10^-4
Optimizer Adam
Epoch 20
Loss function Categorical Cross Entropy
Activation function ReLU & Softmax
RESULTS
As part of the experimental investigation, the algorithms are evaluated on three distinct multilingual datasets: American Sign Language (WLASL-100), German Sign Language (DSGS-SMILE), and Indian Sign Language (INCLUDE-50). To compare the efficiency of the proposed model, two state-of-the-art (SOTA) architectures are implemented:
MobileNetV2 + BiLSTM
Pose-Temporal Graph Convolutional Network (Pose-TGCN).
The accuracy and loss charts illustrate the comparison of experimental results between these models and our IPGSL model.
Table 3. Parameters used in SOTA models

MobileNet-V2 (38)
Architecture: Inverted Residual Blocks, Depthwise Separable Convolutions
Learning rate: 0.001
Epochs: 50
Batch size: 64
Loss function: Categorical cross entropy
Activation function: ReLU
Optimizer: Adam

Pose-TGCN (25)
Architecture: Pose-TGCN, VGG-16
Learning rate: 0.001
Epochs: 30
Batch size: 32
Loss function: Categorical cross entropy
Activation function: ReLU
Optimizer: SGD
American Sign Language Dataset (WLASL-100)
The classification results of our IPGSL model and the SOTA models on the WLASL-100 dataset are evaluated. The MobileNet-V2 + BiLSTM model attained an accuracy of 84.63%, MAP of 83.48%, MAR of 83.09%, and MF1 of 82.19%. The Pose-TGCN model achieved an accuracy of 88.92%, MAP of 87.65%, MAR of 87.02%, and MF1 of 86.84%. The IPGSL model exhibited superior performance with an accuracy of 92.87%, MAP of 92.76%, MAR of 92.04%, and MF1 of 91.86%. The proposed model generates superior results when compared to the other models, and the results are listed in Table IV.
Fig. 8 shows the comparison of experimental results through different models on the WLASL-100 dataset, where Fig. 8a shows an iterative accuracy plot over each epoch and Fig. 8b shows an iterative loss plot over each epoch. In Fig. 8, the orange plot indicates the Pose-TGCN model, the blue plot indicates the MobileNet-V2 + BiLSTM model, and the green plot indicates the IPGSL model. The plot shows that our IPGSL model produces higher accuracy and lower CCE Loss than the other models in each epoch.
Figure 9 illustrates the accuracy trends of different models in a 5-fold cross-validation. Notably, the IPGSL Model, denoted by the green line, consistently outshines other SOTA models, including the TGCN Model, as indicated by the orange line, and surpasses the performance of the MobilenetV2 Model, as indicated by the blue line. This visual analysis shows the superior performance of the IPGSL Model across various folds in the experiments.
German Sign Language Dataset (DSGS SMILE)
The invention assessed the classification outcomes of the IPGSL and SOTA models on the DSGS-SMILE dataset. The MobileNet-V2 + BiLSTM model obtains an accuracy of 82.42%, MAP of 81.68%, MAR of 80.19%, and MF1 of 80.02%. The Pose-TGCN model attains an accuracy of 84.46%, MAP of 84.1%, MAR of 83.26%, and MF1 of 82.89%. The IPGSL model has an improved accuracy of 91.84%, MAP of 91.44%, MAR of 91.10%, and MF1 of 90.74%. The proposed model yields improved outcomes compared to the SOTA models, which are shown in Table V. Fig. 10 illustrates the comparison of experimental outcomes across various models on the DSGS dataset. Specifically, Fig. 10a displays the accuracy plot for each epoch, while Fig. 10b presents the loss plot for each epoch. In Fig. 10, the orange plot represents the Pose-TGCN model, the blue plot represents the MobileNet-V2 + BiLSTM model, and the green plot represents the IPGSL model. The figure demonstrates that our IPGSL model achieves superior accuracy and reduced CCE Loss compared to the other models at each epoch.
Table 5. Performance evaluation for DSGS SMILE dataset
Model Accuracy (%) MAP(%) MAR(%) MF1 (%)
MobileNet-V2 82.42 81.68 80.19 80.02
Pose-TGCN 84.46 84.1 83.26 82.89
IPGSL model 91.84 91.44 91.10 90.74
Table 6. Performance evaluation for Include-50 dataset
Model Accuracy (%) MAP(%) MAR(%) MF1(%)
MobileNet-V2 94.35 92.16 92.09 91.82
Pose-TGCN 92.36 91.84 91.06 90.82
IPGSL model 96.24 97.46 97.18 96.87
The 5-fold cross-validation depicted in Figure 11 reveals distinctive accuracy trends among the various models. It shows the consistent superiority of the IPGSL model, represented by the green line, surpassing other SOTA models such as the TGCN model, represented by the orange line, and outperforming the MobileNetV2 model, illustrated by the blue line. This visual examination accentuates the exceptional performance of the IPGSL model across diverse folds in the experimental context.
Indian Sign Language Dataset (INCLUDE 50)
The classification outcomes of both our IPGSL model and the state-of-the-art (SOTA) models were assessed using the INCLUDE-50 dataset. The MobileNet-V2 model attained an accuracy of 94.35%, MAP of 92.16%, MAR of 92.09%, and MF1 of 91.82%. Pose-TGCN achieved an accuracy of 92.36%, MAP of 91.84%, MAR of 91.06%, and MF1 of 90.82%. The IPGSL model demonstrates exceptional performance, achieving an accuracy of 96.24%, MAP of 97.46%, MAR of 97.18%, and MF1 of 96.87%. The IPGSL model demonstrates superior performance compared to the SOTA models, as seen from the results presented in Table VI. Fig. 12 depicts the comparison of empirical results across different models using the Include-50 dataset. More precisely, Fig. 12a exhibits the accuracy plot for each epoch, while Fig. 12b showcases the loss plot for each epoch. In Fig. 12, the orange plot corresponds to the Pose-TGCN model, the blue plot corresponds to the MobileNet-V2 + BiLSTM model, and the green plot corresponds to the IPGSL model. The information shown in the figure demonstrates that our IPGSL model surpasses the other models in terms of accuracy and CCE Loss at each epoch.
Figure 13 presents the results of a 5-fold cross-validation, unveiling discernible accuracy trends across different models. Notably, the green line, signifying the IPGSL Model, consistently excels, surpassing other cutting-edge models such as the TGCN Model and exhibiting superior performance compared to the MobilenetV2 Model, represented by the blue line. This visual scrutiny emphasizes the remarkable proficiency of the IPGSL Model across a range of folds in the experiments.
Comparison with other SOTA models with Friedman Test
Table 7 summarizes the performance of our IPGSL model against other state-of-the-art (SOTA) models on three different sign language datasets: WLASL-100, SMILE-DSGS, and Include-50. This invention compares IPGSL with SOTA models such as Inflated 3D ConvNets (I3D), Spatial-Temporal Graph Convolution Networks (ST-GCN), Transferring Cross-Domain Knowledge (TCK), Hand-Model-Aware (HMA), BERT Pre-Training for SLR (BEST), Pose-Based Temporal Graph Convolution Neural Network (Pose-TGCN), Pre-Training of Hand-Model-Aware Representation for SLR (SignBERT), Natural Language Assisted SLR (NLA-SLR), end-to-end hand shape and continuous SLR (SubUNets), BiLSTM, and CNN+BiLSTM. The IPGSL model achieves high Macro Average Accuracy (MAA) percentages across all three datasets, indicating its effectiveness in accurately recognizing sign language gestures.
Table 7. Performance Comparison with other SOTA models with Friedman Test
CONCLUSION
This inventive method, named Intelligent Pattern Generation for Sign Language (IPGSL), performs sign language recognition based on pattern recognition principles. The key contribution of this work is the formation of a sequence of patterns to represent a sign language class. Initially, key-frames are detected from the input video frames of a sign language class using the Identification of Key-frame (IKF) method, and subsequently, patterns are generated from the identified key-frames. The generated patterns are then supplied as input to a Convolutional Neural Network (CNN) model designed for classification and tested on three distinct multilingual sign language datasets:
(i)American Sign Language (WLASL-100),
(ii)German Sign Language (DSGS-SMILE),
(iii)Indian Sign Language (INCLUDE-50).
Through the various experiments, it is observed that the IPGSL method outperforms other existing methods in terms of performance. The IPGSL method will facilitate enhanced accessibility and communication for sign-language users.
Accordingly, the present invention provides the following novel aspects of the IPGSL method.
Key-frame Identification:
A novel method is proposed to identify key-frames from input sign language videos. This method aims to select frames that are most representative of the sign's meaning and motion, ensuring that the essential information is captured.
Pattern Generation:
A novel method is proposed to generate patterns from the identified key-frames. These patterns represent the visual and temporal characteristics of the sign, such as hand shapes, movements, and spatial configurations.
Pattern Sequence Identification:
The invention identifies the optimal sequence of patterns that best represent a specific sign language class. This sequence selection is crucial for accurate classification, as it captures the dynamic nature of sign language.
Classifier Enhancement:
The generated patterns of present invention are used to enhance the performance of sign language classifiers. By incorporating these informative patterns, the classifiers can better distinguish between different signs, leading to improved accuracy and robustness.
Therefore, the present invention contributes to the advancement of sign language recognition by providing a novel approach for key-frame identification, pattern generation, and sequence selection. The proposed methods are expected to improve the overall performance of sign language recognition systems, making them more accessible and efficient for individuals with communication impairments.
Claims:
1. A method for generating a pattern sequence for sign language gesture recognition, comprising:
- preparing a set of video frames from an input sign language video, wherein the frames are extracted at regular intervals to maintain essential gesture information;
- identifying key-frames from the prepared video frames, wherein said key-frames are selected based on their representational significance to the sign language gesture class, and excluding frames that depict hand retraction;
- generating a sequence of patterns from the identified key-frames through the following steps:
a. segmenting the human body from each key-frame using a segmentation algorithm;
b. generating binary images of the segmented human body;
c. identifying contours within the binary images and filtering out irrelevant contours based on a minimum area threshold;
d. selecting the largest contour from each key-frame as a representative pattern for that frame;
- compiling the generated patterns into a sequence that represents the sign language gesture class;
- training a three-dimensional convolutional neural network (3D CNN) using the generated pattern sequences to recognize sign language gestures based on their temporal and spatial characteristics.
2. The method as claimed in claim 1, wherein the video frames are extracted at intervals of 50 milliseconds to ensure comprehensive gesture representation.
3. The method as claimed in claim 1, wherein the key-frames are determined by calculating inter-frame differences and comparing them against a predefined threshold value to ensure significant representation of gesture motion.
4. The method as claimed in claim 1, wherein the segmentation algorithm includes edge detection techniques to enhance contour identification for pattern generation.
5. The method as claimed in claim 1, further comprising evaluating the performance of the trained 3D CNN using metrics including accuracy, precision, recall, and F1-score against multiple sign language datasets.
6. The method as claimed in claim 1, wherein said sign language gesture recognition includes face gestures, hand gestures, eye gestures, eyebrow gestures, nose gestures, and lip gestures.