ABSTRACT
SYSTEM AND METHOD FOR SIGN LANGUAGE MULTIMODAL VIRTUAL ASSISTANT
A system for accessing various applications on a service aggregation platform with a chatbot is disclosed. The system (100) includes an adaptive user interface (102) or chatbot for receiving real-time user query/requests, a wearable sensor module (104) with inertial measurement units (IMUs) for enhanced hand movement tracking, and a chat engine (106) for managing user interactions by retaining received user query/requests in a queue for sequential processing, providing real-time feedback and interactive prompts to guide the user during interaction, performing validation and calibration of the received user query/requests, and selecting recognition modes dynamically based on the context and quality of the received user query/requests. The system (100) also includes an inference engine (108) for recognizing Indian Sign Language (ISL) in the user query/requests and a response mapping module (116) for aligning predicted user query/requests with the most appropriate and contextually relevant pre-recorded response videos or AI-generated synthesized videos.
FIG. 1
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
COMPLETE SPECIFICATION
(See section 10 and rule 13)
TITLE
SYSTEM AND METHOD FOR SIGN LANGUAGE MULTIMODAL VIRTUAL ASSISTANT
INVENTORS:
NEDUNGADI, Prema - US Citizen
MADATHILKULANGARA, Geetha - Indian Citizen
RAMAN, Raghu – US Citizen
MA Math
Amritapuri PO
Kollam, Kerala 690546
APPLICANT
AMRITA VISHWA VIDYAPEETHAM
Clappana P.O. Amritapuri, Vallikavu, Kerala 690525, India
THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED:
SYSTEM AND METHOD FOR SIGN LANGUAGE MULTIMODAL VIRTUAL ASSISTANT
CROSS-REFERENCES TO RELATED APPLICATION
[1] This application claims priority to Provisional Patent Application No. 202341066041, titled “SYSTEM AND METHOD FOR SIGN LANGUAGE VIRTUAL ASSISTANT”, filed on October 2, 2023.
FIELD OF INVENTION
[2] The present disclosure relates to virtual assistants that interact using sign language, and more particularly to Indian Sign Language (ISL) based interaction with users.
DESCRIPTION OF THE RELATED ART
[3] A virtual assistance system or a chatbot is a system developed to engage in conversations with users, answer questions, provide information, and perform various tasks autonomously or with minimal human intervention. Chatbots use NLP technology to understand and interpret human language. They are equipped with a knowledge base or database of information that they can draw upon to answer questions accurately. Chatbots interact with users through a chat-like interface, which can be text based or voice based. A Sign Language Chatbot is a specialized type of chatbot designed to enhance digital accessibility for the Deaf and Hard of Hearing (DHH) community and for people with low literacy. It can incorporate various technological components, including wearable sensors for capturing hand and body movements, haptic feedback devices for tactile interaction, voice recognition modules for users who can speak (their speech may not be coherent) but not hear, text entry via keyboard or touch, and facial recognition cameras to capture the nuances of facial expressions integral to sign language.
[4] Emerging trends showcase the increasing integration of chatbots in the public domain, especially in e-governance. Sign language-centric chatbots offer a transformative communication avenue for the DHH population, promoting more autonomous service access. Sign languages, with their distinct vocabulary and grammar, vary worldwide, influenced by unique linguistic structures and often reflecting iconic forms where the sign mirrors its meaning. The Deaf and Hard of Hearing (DHH) community sees itself anchored in a distinct culture and language, rather than viewing deafness as a disability. However, challenges persist in accessing services and digital content for this community. Specifically, Indian Sign Language (ISL) is characterized by gestures, facial cues, body movements, and, in certain vocabularies, mouth movements. Developing effective chatbots requires careful consideration of the specific sign language, target user demographics, and available resources. Furthermore, individual signing styles and video quality can impact chatbot accuracy. Additionally, the security and privacy of chatbot interactions, especially when processing sensitive data, are of concern.
[5] Various publications have tried to address the aforementioned problems encountered when developing a chatbot for sign language identification. KR102104294B1 discloses a chatbot application that receives keyboard or touchscreen inputs and generates sign language video to guide a user. Wulandari et al. (2019) introduced a chatbot to connect DHH individuals in Indonesia with governmental services. Feedback indicated a preference for this system over conventional communication methods like emails or text messages. Concurrently, advancements by Yagishita et al. (2018) employed artificial intelligence (AI) and machine learning to translate Japanese sign language into text. However, user preferences are paramount. Research by Appuzo et al. (2022) highlights a preference among the DHH community for video responses featuring human signing over those generated by avatars. Despite these insights, there is a gap in systems that proficiently handle sign language inquiries. An exception is Sign Guide, a mobile service developed by Kosmopoulos et al. (2022), which caters to museum-related queries from DHH users using avatars.
[6] Presently, there is a requirement for a system for providing secure, accurate, and seamless interaction to access the various applications available on the service aggregation platform, especially for the Deaf and Hard of Hearing (DHH) community.
SUMMARY OF THE INVENTION
[7] The present subject matter relates to a system for accessing various applications on a service aggregation platform with a chatbot.
[8] In one embodiment of the present subject matter, the system comprises an adaptive user interface configured for receiving real-time user query/requests, wherein the interface may handle partial text and less coherent speech inputs, a wearable sensor module 104 with inertial measurement units (IMUs) for enhanced hand movement tracking, and a chatbot engine for managing user interactions. The chatbot engine is configured to retain received user query/requests in a queue for sequential processing, provide real-time feedback and interactive prompts to guide the user during interaction, wherein the feedback may include haptic feedback, perform validation and calibration of the received user query/requests, and select recognition modes dynamically based on the context and quality of the received user query/requests. The system also includes an inference engine for recognizing Indian Sign Language (ISL) in the user query/requests, wherein the engine comprises a multimodal feature extractor with a plurality of modality-specific encoders to process and combine the received user query/requests, an adaptive weighting module based on factors including clarity, completeness, accuracy, and consistency of inputs from each modality, a sequential learning module that includes a multimodal transformer, the module comprising modality-specific encoders for sign language, facial expressions, lip movements, and speech, incorporating regional adaptation layers and attention mechanisms to handle regional variations, cross-modal transformer layers with cross-modal attention mechanisms for aligning and integrating modality-specific information, a temporal dependency modeling module, and recurrent transformer blocks with enhanced memory capabilities to effectively capture long-term dependencies in sequential data. The system further includes a response mapping module for aligning predicted user query/requests with the most appropriate and contextually relevant pre-recorded response videos or AI-generated synthesized videos.
[9] In various embodiments, the adaptive weighting module utilizes quality and completeness of inputs from each modality, adjusting their weights to improve sign language recognition.
[10] In various embodiments, the chat engine is configured to perform validation and calibration of the user query/requests received by ensuring that the user’s face and hands are fully visible within the video frame with detection algorithms, providing real-time feedback to prompt the user to adjust their position when necessary.
[11] In various embodiments, the validation may be integrated with adaptive weighting by adjusting the influence of visual data based on the completeness of the inputs.
[12] In various embodiments, the system includes a pre-processing module for performing alignment correction and downsampling the validated and calibrated user query/requests.
[13] In various embodiments, the chat engine and the inference engine are configured to operate as multiple instances to perform parallel processing of user query/requests.
[14] In various embodiments, the adaptive weighting module enhances the accuracy and reliability of the system by prioritizing higher-quality inputs and reducing the influence of lower-quality or incomplete inputs.
[15] In various embodiments, the recognition modes are static and dynamic modes, allowing the system to dynamically select the most appropriate mode based on the complexity and type of input received.
[16] In various embodiments, the query/requests are real-time sign language videos/images, speech inputs, text inputs, facial expressions, body language cues, and optionally, sensor inputs.
[17] In various embodiments, the wearable sensor module performs signal processing to improve tracking accuracy and user interaction.
[18] In various embodiments, the context and quality of received user query/requests include the coherence level of speech inputs, the completeness of text, and the complexity of sign language videos.
[19] In one embodiment, a method for accessing various applications on a service aggregation platform with a multimodal chatbot is disclosed. The method includes receiving user queries/requests as real-time sign language videos/images, along with other modalities such as partial text, speech inputs, facial expressions, body language cues, and sensor inputs, determining the recognition mode based on the received user queries/requests, dynamically adjusting the weight of each modality based on its quality and completeness, and providing the weighted inputs to an inference engine. The method further includes recognizing Indian Sign Language (ISL) in the user query/requests with the inference engine, extracting spatial embedding from the user query/requests with a multimodal feature extractor, dynamically adjusting the modality weights by evaluating the quality of each modality, and performing sequential learning with a transformer encoder provided with the extracted spatial embedding and temporal patterns in sign language. The method also includes performing model training and gesture recognition, computing the expectation over all possible frame-to-gloss alignments, and aligning predicted glosses with the most appropriate and contextually relevant pre-recorded response videos or AI-generated synthesized responses.
[20] These and other aspects are described herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[21] The invention has other advantages and features, which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:
[22] FIG. 1 illustrates the system for accessing various applications on a service aggregation platform with a chatbot.
[23] FIG. 2A and 2B illustrate the inference engine and sequential learning module components.
[24] FIG. 3 illustrates the method for accessing various applications on a government service aggregation platform with a chatbot.
[25] FIG. 4A and 4B illustrate implementation of the system for accessing various applications on a service aggregation platform with a chatbot.
[26] FIG. 5 illustrates the accuracy of the system across various scenarios.
[27] FIG. 6 illustrates the word error rate of the system.
[28] FIG. 7A and 7B illustrate the contextual response mapping of the system.
[29] FIG. 7C and 7D illustrate comparison of contextual response mapping accuracy of the system.
[30] FIG. 8 illustrates Spearman’s Rank Correlation indicating usability of the system.
[31] FIG. 9A and 9B illustrate concurrency and load balancing of the system.
[32] FIG. 10 illustrates response time performance of the system.
[33] FIG. 11 illustrates implementation of the system for eRaktkosh.
[34] Referring to the figures, like numbers indicate like parts throughout the various views.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[35] While the invention has been disclosed with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the invention. In addition, many modifications may be made to adapt to a particular situation or material to the teachings of the invention without departing from its scope.
[36] Throughout the specification and claims, the following terms take the meanings explicitly associated herein unless the context clearly dictates otherwise. The meaning of “a”, “an”, and “the” include plural references. The meaning of “in” includes “in” and “on.” Referring to the drawings, like numbers indicate like parts throughout the views. Additionally, a reference to the singular includes a reference to the plural unless otherwise stated or inconsistent with the disclosure herein.
[37] The present subject matter describes a system for facilitating access to a service aggregation platform, particularly for the DHH community, using a chatbot. The system includes a versatile Sign Language Recognizer Model that operates in both static and continuous sign recognition modes for accurate interpretation. The system also includes a chat engine that manages user interactions and dynamically selects recognition modes as needed. The user is provided with pre-recorded or synthetically generated sign language videos in response to the user queries.
[38] A system for accessing various applications on a service aggregation platform with a chatbot is illustrated in FIG. 1, in various embodiments of the subject matter. The system 100 includes an adaptive user interface 102 or chatbot for receiving real-time user query/requests, a wearable sensor module 104 for enhanced hand movement tracking, and a chat engine 106 for managing user interactions by retaining received user query/requests in a queue for sequential processing, providing real-time feedback and interactive prompts to guide the user during interaction, wherein the feedback may include haptic feedback, performing validation and calibration of the received user query/requests, and selecting recognition modes dynamically based on the context and quality of the received user query/requests. The context and quality of received user query/requests include the coherence level of speech inputs, the completeness of text, and the complexity of sign language videos. The system 100 also includes an inference engine 108 for recognizing Indian Sign Language (ISL) in the user query/requests and a response mapping module 116 for aligning predicted user query/requests with the most appropriate and contextually relevant pre-recorded response videos or AI-generated synthesized videos. The wearable sensor module 104 also performs signal processing to improve tracking accuracy and user interaction.
[39] In various embodiments, the adaptive user interface 102 may be a chatbot that supports two-way communication with the users. The users may belong to the DHH community or may be general users who prefer video interaction. The user queries/requests or inputs are received in real-time and may be in the form of sign language videos/images, speech inputs, text inputs, facial expressions, body language cues, as well as sensor inputs. The adaptive user interface 102 is capable of handling queries/requests that may be in partial textual format or less coherent speech inputs. Inputs from the wearable sensor module 104 for enhanced hand movement tracking are also provided by the user. The adaptive user interface 102 is connected to a chat engine 106 that receives the user queries/requests for further processing and manages the user interactions.
[40] In various embodiments, the chat engine 106 manages the chat flow of the adaptive user interface 102 using a hierarchical structure that guides the conversation, ensuring that it progresses logically based on user inputs by providing real-time feedback and interactive prompts to guide the user during interaction, wherein the feedback may include haptic feedback. The chat engine 106 receives the user queries/requests and holds them in a queue; first-come, first-served (sequential) queuing is used for user interactions. The queue ensures that user interactions are handled progressively and prevents the system from being overloaded with concurrent requests. The received user queries/requests are validated and calibrated by ensuring that the user's face and hands are fully visible within the video frame with sufficient lighting, using detection algorithms and providing real-time feedback to prompt the user to adjust their position when necessary. The system may dynamically switch between input modalities by prioritizing the most effective input in real-time, such that if the user's spoken input is incoherent or less coherent, the system may prompt the user to switch to sign language or text input to ensure continued, accurate communication.
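As an illustrative, non-limiting sketch (assuming a Python-based backend, which the specification does not prescribe), the first-come, first-served queuing and real-time visibility feedback described above may be organized as follows; the detection, feedback, and response helpers are placeholders for the mechanisms described in this disclosure:

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class UserRequest:
    user_id: str
    video_frames: list            # frames of the signed query
    text: str = ""                # optional partial text input
    speech: bytes = b""           # optional (possibly less coherent) speech input

class ChatEngine:
    """Holds incoming queries in a FIFO queue and processes them one at a time."""

    def __init__(self, inference_engine):
        self.inference_engine = inference_engine
        self.requests = queue.Queue()          # first-come, first-served
        threading.Thread(target=self._process_loop, daemon=True).start()

    def submit(self, request: UserRequest) -> None:
        # Queuing prevents the system from being overloaded by concurrent requests.
        self.requests.put(request)

    def _process_loop(self) -> None:
        while True:
            request = self.requests.get()      # blocks until a request is available
            if not self._face_and_hands_visible(request.video_frames):
                # Real-time feedback prompting the user to adjust their position.
                self._send_feedback(request.user_id,
                                    "Please keep your face and hands fully visible.")
                continue
            glosses = self.inference_engine.recognize(request)
            self._send_response(request.user_id, glosses)

    # Placeholders for the detection, feedback and response delivery described above.
    def _face_and_hands_visible(self, frames) -> bool:
        return bool(frames)                    # stand-in for face/hand detection

    def _send_feedback(self, user_id: str, message: str) -> None:
        print(f"[feedback to {user_id}] {message}")

    def _send_response(self, user_id: str, glosses) -> None:
        print(f"[response to {user_id}] {glosses}")
```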
[41] In various embodiments, the validated and calibrated user queries/requests are provided to the inference engine 108 for sign language recognition. Based on the quality, complexity, and context of the received user query/requests, recognition modes are selected dynamically. The recognition modes may be static and dynamic. The static mode enhances user interaction by enabling the recognition of static sign language gestures, including alphabets and numbers. This is useful for tasks like form filling, where the users may provide input data using sign language gestures instead of typing. The mode is context-aware, automatically switching between alphabets and numbers based on the input field, such as phone numbers or names. Auto-correction and predictive text further enhance usability, minimizing errors and streamlining the input process, contributing to a more intuitive and user-friendly experience. A pre-processing module 118 performs alignment correction and downsampling of the validated and calibrated user query/requests.
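A minimal sketch of the context-aware switching in the static mode described above; the form-field identifiers are hypothetical and used only for illustration:

```python
def select_static_recognizer(field_type: str) -> str:
    """Context-aware static mode: pick alphabet or number gesture recognition
    based on the form field currently being filled (illustrative mapping)."""
    numeric_fields = {"phone_number", "pin_code", "age"}   # hypothetical field names
    if field_type in numeric_fields:
        return "number_gestures"
    return "alphabet_gestures"

# Example: while filling a phone-number field, digit gestures are expected.
assert select_static_recognizer("phone_number") == "number_gestures"
assert select_static_recognizer("name") == "alphabet_gestures"
```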
[42] In various embodiments, the inference engine 108 as illustrated in FIG. 2A receives the validated and calibrated data from the chat engine 106 and performs recognition of Indian Sign Language (ISL) in the user query/requests. The inference engine 108 includes a multimodal feature extractor 110 with a plurality of modality-specific encoders to process and combine the received user query/requests, wherein the modalities may include hand gestures, facial expressions, etc. An adaptive weighting module 112 is used for assigning weights to the received user query/requests based on factors including clarity, completeness, accuracy, and consistency of inputs from each modality. The adaptive weighting module 112 uses the quality and completeness of inputs from each modality for weight assignment and adjusts their weights to improve sign language recognition. It enhances accuracy and reliability by prioritizing higher-quality inputs and reducing the influence of lower-quality or incomplete inputs. Further, the validation of received inputs may be integrated with adaptive weighting by adjusting the influence of visual data based on the completeness of the inputs.
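One possible realization of the adaptive weighting module 112, shown only as a sketch: per-modality quality factors (clarity, completeness, accuracy, consistency) are averaged and normalized into fusion weights. The exact scoring and normalization scheme is not fixed by the specification and is assumed here for illustration:

```python
import numpy as np

def adaptive_weights(quality_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Combine per-modality quality factors into normalized fusion weights.

    quality_scores maps each modality (e.g. 'sign', 'speech', 'text') to scores
    in [0, 1] for clarity, completeness, accuracy and consistency.
    """
    raw = {m: float(np.mean(list(factors.values())))
           for m, factors in quality_scores.items()}
    total = sum(raw.values()) or 1.0
    return {m: v / total for m, v in raw.items()}

# A less coherent speech input receives a lower weight than a clear ISL video.
weights = adaptive_weights({
    "sign":   {"clarity": 0.9, "completeness": 0.95, "accuracy": 0.9, "consistency": 0.9},
    "speech": {"clarity": 0.3, "completeness": 0.5,  "accuracy": 0.4, "consistency": 0.4},
    "text":   {"clarity": 0.8, "completeness": 0.6,  "accuracy": 0.8, "consistency": 0.7},
})
print(weights)
```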
[43] In various embodiments, the inference engine 108 includes a sequential learning module 114 which receives processed inputs from the multimodal feature extractor 110. The sequential learning module 114, as illustrated in FIG. 2B, has a multimodal transformer for learning the relative information between frames in a video. The sequential learning module 114 includes modality-specific encoders 302 for interpreting sign language, facial expressions, lip movements, and speech, and incorporates regional adaptation layers and attention mechanisms to handle regional variations. Cross-modal transformer layers 304 with cross-modal attention are used for aligning and integrating modality-specific information to determine the complex relations across different modalities. This captures both local and global dependencies in sign language gestures, thereby significantly improving recognition accuracy. The sequential learning module 114 further includes a temporal dependency modeling module 306 with recurrent transformer blocks having enhanced memory capabilities for effectively capturing long-term dependencies in the sequential data. It models the probability of the gloss, which is a written representation of a sign language word or phrase in English or another spoken language that indicates its meaning, given the input, by computing the expectation over all possible frame-to-gloss alignments.
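The expectation over all possible frame-to-gloss alignments is commonly computed with a CTC-style objective; the sketch below uses PyTorch's CTCLoss purely as an illustrative assumption (the specification does not name a particular loss implementation), with illustrative tensor dimensions and gloss vocabulary size:

```python
import torch
import torch.nn as nn

T, N, C = 120, 1, 227   # frames per video, batch size, gloss vocabulary size incl. blank
# Illustrative output of the sequential learning module 114 (random stand-in here).
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
glosses = torch.tensor([[12, 45, 7, 88]])       # target gloss indices for one query
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

# CTC marginalizes (computes the expectation) over all frame-to-gloss alignments.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, glosses, input_lengths, target_lengths)
loss.backward()   # gradients flow back into the transformer during model training
```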
[44] In various embodiments, the processed data from the sequential learning module 114 is provided to a response mapping module 116 for aligning predicted user query/requests with the most appropriate and contextually relevant pre-recorded response videos or AI-generated synthesized videos. After the inference engine 108 processes the user queries/requests and generates a sequence of glosses, the response mapping module 116 correlates these glosses with the most relevant response from the database. The response is determined by analyzing keywords and contextual information within the gloss sequence. This helps ensure that users receive precise and contextually relevant responses to their queries/requests.
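A minimal sketch of the keyword-based correlation performed by the response mapping module 116; the response database entries, keywords, and video file names are hypothetical:

```python
# Illustrative response database: each entry lists the keywords it covers
# and the pre-recorded (or AI-generated) response video to play.
RESPONSES = [
    {"keywords": {"blood", "bank", "status"}, "video": "bloodbank_status_isl.mp4"},
    {"keywords": {"ticket", "book"},          "video": "ticket_booking_isl.mp4"},
    {"keywords": {"parcel", "track"},         "video": "parcel_tracking_isl.mp4"},
]

def map_response(predicted_glosses: list[str]) -> str:
    """Pick the response whose keywords best overlap the predicted gloss sequence."""
    glosses = {g.lower() for g in predicted_glosses}
    best = max(RESPONSES, key=lambda r: len(r["keywords"] & glosses))
    if not best["keywords"] & glosses:
        return "suggestive_prompt_isl.mp4"   # fall back to a suggestive chat flow
    return best["video"]

print(map_response(["BLOOD", "BANK", "STATUS"]))   # -> bloodbank_status_isl.mp4
```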
[45] In some embodiments, the system may integrate AI-generated educational feedback for users learning sign language. The avatars may highlight errors in hand placement, gesture speed, or facial expressions and offer real-time suggestions for improvement.
[46] In some embodiments, the chat engine 106 and the inference engine 108 are configured to operate as multiple instances to perform parallel processing of user query/requests in a cloud server environment. The chat engine 106 manages Answer Video Indexing, which identifies and retrieves pre-recorded videos, or may generate AI videos, that correspond to the recognized sign language glosses based on user queries/inputs. The most relevant videos are displayed to the user, providing immediate feedback and enhancing the user experience with real-time responses. The chat engine 106 also facilitates real-time data retrieval from external APIs, such as government service aggregator platforms. By integrating these APIs, the users receive up-to-date information, particularly useful for queries requiring current data like service status or personalized information from government platforms. The system enhances security through biometric authentication using facial recognition, voiceprints, and hand gesture patterns. This ensures that users are securely authenticated in the virtual environment, particularly when accessing sensitive information, such as personal medical records or confidential documents.
[47] In some embodiments, the system may be capable of recognizing multiple sign languages and performing real-time translation between them. This facilitates cross-linguistic communication between users from different regions. As an example, the system may translate queries presented in American Sign Language (ASL) into Indian Sign Language (ISL), allowing seamless interaction between users with different linguistic backgrounds.
[48] In some embodiments, the system may be deployed in a Virtual Reality (VR) or metaverse environment. The users may interact with AI-generated avatars that respond in real-time to their sign language gestures, speech, and text within a fully immersive 3D space. This enables the users to flexibly communicate through any combination of sign language, speech, and text, allowing for multimodal interactions based on their preferences and capabilities. The system processes these combined inputs simultaneously, ensuring that the AI-generated avatars respond appropriately to the chosen combination of modalities. The system further enhances accessibility by allowing users to explore virtual environments while receiving sign language guidance, ensuring seamless communication and interaction in VR-based platforms, such as virtual classrooms, healthcare settings, or service portals. Additionally, the system may also provide real-time feedback through visual, auditory, or haptic prompts, thus helping the users refine their inputs for clear and accurate communication.
[49] In various embodiments, a method 200 as illustrated in FIG. 3 for accessing various applications on a service aggregation platform with a multimodal chatbot is disclosed. In step 202, user queries/requests are received as real-time sign language videos/images, along with other modalities such as partial text, speech inputs, facial expressions, body language cues, and sensor inputs. The recognition mode is determined based on the received user queries/requests in step 204. For further processing, the weight of each modality is dynamically adjusted based on its quality and completeness in step 206, and in step 208 the determined weights are provided to an inference engine 108. Further, Indian Sign Language (ISL) is recognized in the user query/requests with the inference engine 108 in step 210. Within the inference engine 108, spatial embedding is extracted from the user query/requests with a multimodal feature extractor in step 212, and the modality weights are adjusted by evaluating the quality of each modality in step 214. In step 216, sequential learning is performed with a transformer encoder provided with the extracted spatial embedding and temporal patterns in sign language, and model training with gesture recognition is done in step 218. The expectation over all possible frame-to-gloss alignments is computed in step 220, and the predicted glosses are aligned with the most appropriate and contextually relevant pre-recorded response videos or AI-generated synthesized responses in step 222.
[50] The system for accessing various applications on a service aggregation platform with a chatbot has several advantages over the prior art. The system is deployed in a hybrid cloud-server hosting environment to optimize performance, resource utilization, and cost management. To manage the demands of real-time sign language interpretation, the system employs dynamic instance allocation based on available memory, supplemented by queue management. Multi-threading in the inference engine further enhances resource usage, maintaining consistent response times and a smooth user experience during peak usage periods. Real-time video calibration, user feedback collection, and initial pre-processing are performed directly in the user's browser, which reduces server load and leverages user device resources, improving the system's scalability and responsiveness. The system is highly scalable, as it can be enhanced by adding servers or adjusting cloud resources as user demand increases. Further, stress tests indicate the system's capability to handle increased loads without degradation in performance.
[51] Although the detailed description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. It should be appreciated that the scope of the invention includes other embodiments not discussed herein. Various other modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the system and method of the present invention disclosed herein without departing from the spirit and scope of the invention as described here, and as delineated in the claims appended hereto.
EXAMPLES
[52] EXAMPLE 1: Dataset creation for Indian Sign Language (ISL) virtual Assistant
[53] In the implementation shown in FIG. 4A and 4B, a comprehensive ISL (Indian Sign Language) video dataset featuring a diverse range of signs and expressions was compiled. The dataset includes recordings from multiple signers, enabling the model to learn from the variability in signing styles. The ISL Virtual Assistant was designed for the Indian Railways and e-RaktKosh domains, necessitating a dataset that included a diverse vocabulary, simplified sentence structures, and relevant question-and-answer content. Indian Railways data covered train schedules, ticket bookings, and real-time station status, while e-RaktKosh focused on blood bank management, including donor and inventory statuses. The dataset creation followed a strict protocol involving certified ISL interpreters and DHH individuals. The background, device, format, lighting, signer's attire, posture, and facial expressions were considered during the recording to ensure clarity of the signed language, avoiding obstructions like matching backgrounds or skin tones. Vocabulary not present in the ISLRTC dictionary was developed through consultations with experts. Each video was reviewed by at least two expert signers, and discrepancies were resolved through consensus. TABLE 1 shows the dataset for the system.
TABLE 1: Data Set Created for ISL-Virtual Assistant

Alphabets and Numbers
| Capturing environment | Alphabet (Studio) | Alphabet (Home) | Numbers (Studio) | Numbers (Home) |
|---|---|---|---|---|
| Video Dataset | 5200 | 5200 | 2000 | 2000 |
| Image Dataset (Cropped from Video) | 9100 | 9100 | 3500 | 3500 |

Words and Sentences (Training Set)
| | eRaktkosh | Indian Railways: Passenger Services | Indian Railways: Parcel Services | Indian Railways: UTS | Indian Railways: Rail Madad |
|---|---|---|---|---|---|
| Dictionary count | 163 | 102 | 43 | 95 | 77 |
| Gloss count | 226 | 118 | 101 | 222 | 166 |
| No. of signers | 13 | 15 | 8 | 8 | 8 |
| Isolated word videos | 5,408 | 6,120 | 43 | 95 | 77 |
| Videos in studio environment | 2,938 | 1,770 | 1,296 | 1,776 | 1,096 |
| Videos in home environment | 8,814 | 5,310 | 1,515 | 5,328 | 3,365 |
| Total videos in each service (Train) | 17,160 | 8,610 | 2,854 | 7,199 | 4,538 |

Words and Sentences (Test Set)
| Trained | 590 | 505 | 222 | 830 |
| In-vocabulary | 165 | 80 | 42 | 135 |
| Out of vocabulary (1 word) | 165 | 64 | 42 | 135 |
| Out of vocabulary (2 words) | 150 | 64 | 42 | 135 |
| Out of vocabulary (No answer) | 100 | 40 | 10 | 50 |
| Total videos in each service (Test) | 1,170 | 753 | 358 | 1,285 |

Total Test Set Videos: 3,566
[54] Video annotation is a crucial component in the effective training of multimedia systems leveraging video recognition. To preserve the integrity and accuracy of the gestures and signs, at least two expert signers reviewed each video in the dataset. Any discrepancies encountered in the annotations were resolved through consensus-driven discussions.
[55] EXAMPLE 2: Testing Scenarios
[56] To evaluate the robustness and accuracy of the ISL-virtual assistant system, 174 test samples were selected across various testing scenarios. These scenarios were designed to simulate real-world conditions by introducing different combinations of signers and glosses. The following scenarios were considered:
[57] Best-case scenario: In this scenario, signers from the training set (known signers) used glosses that were part of the system’s trained vocabulary but included variations in the order or structure of the glosses. This scenario aimed to assess the system’s performance when faced with familiar signers and trained vocabulary under slightly modified conditions.
[58] Average-case scenario: This scenario involved known signers using test glosses from the trained vocabulary (in-vocabulary words). The objective was to evaluate how well the system could generalize to new gloss combinations while maintaining a familiar vocabulary. This represents a practical, everyday communication environment with moderate challenges for the system.
[59] Worst-case scenario: In the worst-case scenario, both known and new signers were tasked with using glosses that were out-of-vocabulary (OOV) for the system. This scenario was designed to simulate highly challenging conditions, where signers may introduce novel, incorrect, or less common signs not previously encountered by the system during training. The objective was to test the system's ability to generalize and adapt to unfamiliar vocabulary and signing styles.
[60] Word Error Rate (WER) was calculated for each of the above scenarios as a complementary evaluation metric to accuracy. WER is defined as the number of substitutions, deletions, and insertions divided by the total number of words in the reference. WER helps quantify the system's ability to recognize sign language accurately in terms of both word-level and gloss-level errors, providing a more nuanced measure of system’s performance.
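Expressed as a formula, with S, D, and I denoting the numbers of substitutions, deletions, and insertions, and N the total number of words in the reference gloss sequence:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]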
[61] EXAMPLE 3: Accuracy of ISL Virtual Assistant
[62] TABLE 2 summarizes four experiments conducted to evaluate the robustness and effectiveness of the system in different scenarios, testing the system’s ability to accurately recognize signs. Each row represents a specific test condition, varying by the signers involved and whether the ISL glosses were in-vocabulary or out-of-vocabulary.
[63] The system was evaluated using various scenarios involving 174 test data samples. The testing comprised six signers: three signers were part of the training set (Known Signers), and three signers were unseen by the system model (New Signers). Notably, the test glosses encompassed words intentionally chosen to be within and outside the system’s vocabulary, simulating real-world scenarios where signers may use unfamiliar or non-standard signs.
TABLE 2: ISL-Virtual Assistant Accuracy
| ISL Signer | ISL Sentences | Accuracy |
|---|---|---|
| Known Signers | In-vocabulary | 90% |
| New Signers | In-vocabulary | 83% |
| Known Signers | Out-of-vocabulary glosses | 76.6% |
| New Signers | Out-of-vocabulary glosses | 72.7% |
[64] The system achieved the highest accuracy (90%) with known signers using in-vocabulary glosses. When tested with new signers using in-vocabulary glosses, the accuracy dropped to 83%, indicating the system’s generalization capability. Known signers using out-of-vocabulary glosses led to a further reduction in accuracy (76.6%), and the lowest accuracy (72.7%) was observed with new signers using out-of-vocabulary glosses. The accuracy of the system is shown in FIG. 5.
[65] EXAMPLE 4: Word Error Rate (WER)
[66] The Word Error Rate (WER) is a critical metric for evaluating the accuracy of the ISL Conversational Agent. WER is calculated as the number of substitutions, deletions, and insertions divided by the total number of words in the reference. The WER across different scenarios is shown in TABLE 3, where the system performs best with known signers using in-vocabulary glosses and worst with new signers using out-of-vocabulary glosses. FIG. 6 shows the accuracy of the system according to WER metric.
TABLE 3: WER Across Different Scenarios
| Scenario | WER (%) |
|---|---|
| Known Signers, In-Vocabulary | 10.0 |
| New Signers, In-Vocabulary | 17.0 |
| Known Signers, Out-of-Vocabulary | 23.4 |
| New Signers, Out-of-Vocabulary | 27.3 |
[67] EXAMPLE 5: Contextual response mapping
[68] Three approaches to contextual response mapping were evaluated to optimize the system’s accuracy and user satisfaction:
[69] Context-based response mapping: Keywords within user queries were used to determine conversational context, with confidence levels adjusting as additional relevant or conflicting keywords were detected.
[70] Generalized response mapping for categories: Responses were grouped into broader categories, reducing context granularity but improving overall accuracy.
[71] Generalized response mapping with suggestive chatflow: This method provided multiple suggestive responses when exact keyword matches were not found, enhancing relevance and user satisfaction.
[72] The keyword-context-based response mapping provided precise, context-specific responses but was highly dependent on accurate keyword identification, which could lead to context misinterpretation in some cases. The generalized response mapping for categories improved accuracy by simplifying the context but often lacked the specificity that users preferred. The generalized response mapping with suggestive chatflow achieved the highest accuracy (minimum 95%) and user satisfaction by offering multiple suggestive responses, even when keyword identification was incomplete. TABLE 4 shows the accuracy of contextual response mapping.
TABLE 4: Accuracy of contextual response mapping
| Method | Accuracy |
|---|---|
| Context-based response mapping | 85% |
| Generalized response mapping | 89% |
| Generalized response mapping with suggestive chatflow | 95% |
[73] The Venn diagram and radar chart shown in FIG. 7A and 7B illustrate extensive overlaps among keywords used in the primary service categories. These overlaps demonstrate the complexities involved in distinguishing queries that could apply to multiple services, which is crucial for accurate response mapping. The keywords for each service are as follows:
[74] Passenger Services: 'ticket', 'book', 'status', 'reservation', 'schedule', 'track'
[75] Parcel Services: 'parcel', 'track', 'freight', 'waybill', 'schedule', 'book'
[76] Rail Madad: 'complaints', 'actions', 'ticket', 'track', 'status'
[77] The Venn diagram highlights the significant overlap among keywords used in Passenger Services, Parcel Services, and Rail Madad. The following observations can be made:
[78] Common Keywords: Keywords like 'ticket', 'track', 'reservation', and 'schedule' appear in multiple services. This overlap indicates that users might use similar terms across different contexts, making it challenging to distinguish the intended service based solely on these keywords.
[79] Overlapping Areas: The intersection areas of the Venn diagram show the overlapping keywords among the three services. These common keywords contribute to the complexity of accurately identifying the appropriate category or providing precise answers.
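The overlaps can be reproduced directly from the keyword lists given above; the short sketch below uses only the keywords listed in this example:

```python
from itertools import combinations

SERVICE_KEYWORDS = {
    "Passenger Services": {"ticket", "book", "status", "reservation", "schedule", "track"},
    "Parcel Services":    {"parcel", "track", "freight", "waybill", "schedule", "book"},
    "Rail Madad":         {"complaints", "actions", "ticket", "track", "status"},
}

# Pairwise intersections mirror the overlapping regions of the Venn diagram in FIG. 7A.
for (a, kw_a), (b, kw_b) in combinations(SERVICE_KEYWORDS.items(), 2):
    print(f"{a} & {b}: {sorted(kw_a & kw_b)}")

# Keywords shared by all three services (e.g. 'track') are the hardest to disambiguate.
print("All three services:", set.intersection(*SERVICE_KEYWORDS.values()))
```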
[80] The bar chart as shown in FIG. 7C quantifies the vocabulary used across different services, revealing significant variations in the number of terms utilized. Services with more extensive vocabularies pose greater challenges in keyword-based mapping due to the higher probability of term overlaps.
[81] EXAMPLE 6: Comparison of Answer Mapping Accuracies by Service
[82] The line chart as shown in FIG. 7D compares the performance of different mapping strategies across services. The results indicate that the generalized suggestive method outperforms others, particularly in services with high term overlaps and vocabulary complexity, suggesting its robustness in handling complex query contexts.
[83] EXAMPLE 7: Usability
[84] During workshops, participants (including DHH and others) provided feedback on the system's usability, accuracy, and overall experience. The system was well received, with participants noting its effectiveness in accessing e-Governance services and expressing a preference for portrait video capture. Survey data indicated high overall satisfaction with the system's navigation and the accuracy of sign language recognition. A System Usability Scale (SUS) study, with questions in ISL, was filled in by participants skilled in ISL. The SUS, comprising ten Likert scale questions, assessed the system's usability. The analysis commenced with descriptive statistics for a dataset containing 15 users' responses to 10 questions. The System Usability Scale (SUS) scores for the system ranged from 57.5 to 97.5, averaging 71.83, suggesting above-average usability as shown in FIG. 8.
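For reference, individual SUS scores follow the standard scoring rule (odd-numbered items contribute the rating minus one, even-numbered items contribute five minus the rating, and the sum is multiplied by 2.5); a minimal sketch with an illustrative set of ratings:

```python
def sus_score(ratings: list[int]) -> float:
    """Standard System Usability Scale scoring for ten 1-5 Likert ratings."""
    assert len(ratings) == 10
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)   # items 1,3,5,7,9 vs items 2,4,6,8,10
        for i, r in enumerate(ratings)
    ]
    return sum(contributions) * 2.5

# Illustrative respondent: a fairly positive set of ratings yields a score of 77.5.
print(sus_score([4, 2, 4, 2, 4, 2, 5, 2, 4, 2]))
```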
[85] EXAMPLE 8: Concurrency and load balancing
[86] Effective management of concurrency and load balancing is critical to ensure the system can handle a high volume of queries without compromising performance. Two approaches were examined: one without optimization and another with backend optimization. Two diagrams as shown in FIG. 9A and FIG. 9B are utilized to illustrate these approaches, emphasizing response time, the number of concurrent users, concurrent ISL queries, and GPU memory usage.
[87] In the first diagram, the following parameters were tracked:
[88] Response Time: The time taken to respond to a query.
[89] Number of Concurrent Users: The number of users interacting with the system simultaneously.
[90] Concurrent ISL Queries: The number of ISL queries being processed concurrently.
[91] GPU Memory Usage: The amount of GPU memory being utilized.
[92] Initially, the system did not have any optimization measures implemented. The reference GPU utilized was an NVIDIA A6000 with 48GB of total memory. Of this, 6GB was already consumed by other programs, leaving 42GB available. To handle complex queries, 10GB was reserved as a buffer, allowing the system to process up to 8 concurrent queries simultaneously.However, when the number of concurrent queries exceeds 8, the response time increases significantly due to queries being queued and processed one by one. The mean response time in this setup is approximately 10 seconds, but it spikes when the system is overloaded.
[93] The second diagram illustrates the incorporation of a second GPU to manage the increased load when the number of concurrent queries exceeds 8. As the concurrent queries surpass 8, the additional GPU is activated to distribute the processing, thereby maintaining a controlled response time. With the deployment of the second GPU, the mean response time is sustained at approximately 10 seconds.
[94] The response time remains regulated due to the load distribution across the two GPUs. This approach of utilizing a second GPU helps to handle the increased workload and maintain a consistent response time for the users. When the number of concurrent queries exceeds the capacity of a single GPU, the system automatically activates the second GPU to share the processing load. This allows the system to continue providing timely responses without experiencing significant delays.
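A simplified sketch of the spill-over rule described above, assuming a per-GPU limit of 8 concurrent sessions as in the reference setup; the dispatcher class and counters are illustrative and not the actual backend implementation:

```python
class GpuDispatcher:
    """Route concurrent ISL queries across GPUs, activating GPU 1 only when
    GPU 0 reaches its concurrency limit (8 sessions in the reference setup)."""

    def __init__(self, per_gpu_limit: int = 8, num_gpus: int = 2):
        self.per_gpu_limit = per_gpu_limit
        self.active = [0] * num_gpus          # active sessions per GPU

    def acquire(self) -> int:
        for gpu_id, sessions in enumerate(self.active):
            if sessions < self.per_gpu_limit:
                self.active[gpu_id] += 1
                return gpu_id                 # process this query on gpu_id
        raise RuntimeError("All GPUs saturated: the query must wait in the queue")

    def release(self, gpu_id: int) -> None:
        self.active[gpu_id] -= 1

dispatcher = GpuDispatcher()
# The 9th concurrent query spills over to the second GPU instead of queuing.
assigned = [dispatcher.acquire() for _ in range(9)]
print(assigned)   # [0, 0, 0, 0, 0, 0, 0, 0, 1]
```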
[95] Through the implementation of backend optimizations, GPU memory consumption was reduced considerably while a substantial improvement in response time performance was observed, as shown in FIG. 10. Each GPU session utilized only 2.5GB of memory, a significant decrease from the previous 10GB per session, and the average response time decreased to approximately 7 seconds, down from the previous average of around 10 seconds.
[96] EXAMPLE 9: Performance Analysis & Time Complexity
[97] The system demonstrates good accuracy in sign language recognition, including with new signers, suggesting robustness. It also achieves reasonable accuracy with ISL glosses containing out-of-vocabulary (OOV) words.
[98] Time Complexity of Data Upload (T_upload): Version 1’s data upload process was encumbered by a 1-second overhead for uploading a 7MB video due to less efficient cloud storage methods and lower compression rates. In Version 2, an 81% compression rate was achieved, and direct GPU node data uploads were introduced, removing the overhead entirely and maximizing cloud bandwidth usage. These advancements have led to a substantial decrease in T_upload, enhancing the data upload efficiency greatly compared to the previous version.
[99] Time Complexity of Prediction Time: The prediction time was initially hampered by suboptimal model loading and redundancy management, resulting in higher T_predict. The improvements in Version 2 include further model optimization and reduced module loading times, bolstered by enhanced redundancy. It led to minimizing T_predict, allowing for faster and more reliable predictions, thereby optimizing ISL-CA’s performance and responsiveness.
[100] Time Complexity of Response Video Transmission Time (T_transmit): The original version saw longer transmission times due to basic video compression techniques and reliance on cloud storage hosting. With the introduction of advanced video compression tailored to the content and strategic use of CDNs in Version 2, there has been a significant cut in T_transmit. These enhancements have reduced the video file size and decreased video delivery latency, and the content is retrieved from the nearest possible data centers.
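Taken together, and only as an approximation (the specification treats these components separately), the end-to-end response time may be viewed as:

\[ T_{\text{response}} \approx T_{\text{upload}} + T_{\text{predict}} + T_{\text{transmit}} \]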
[101] EXAMPLE 10: Stress Testing
[102] In addition to accuracy testing, the system was evaluated under high-load conditions to assess its performance during peak usage. Stress tests were conducted in two phases:
[103] Stress Test 1: Involved 50 parallel connections over a 30-minute period.
[104] Stress Test 2: Involved 78 parallel connections over a 30-minute period.
[105] These tests were designed to simulate high-traffic scenarios and evaluate the system’s effectiveness in maintaining real-time performance and response times under load. A well-balanced range of devices, operating systems, and browsers were included in the user setup, ensuring compatibility assessment across different platforms.
[106] EXAMPLE 11: ISL Virtual Assistant for eRaktkosh
[107] The system for eRaktkosh uses a tree-based system and relies on JSON to manage its conversation paths as shown in FIG. 11. This design promotes flexibility and adaptability, allowing transitions between diverse services. The chatbot finds key terms in a user's message and then, understanding the question, provides the service, such as finding the closest blood bank using the user's PIN code or geolocation.
[108] Next, the chatbot initiates a sequence of dialogues to obtain all necessary details to fulfill the user's request or provide the desired information. This dynamic chat interaction continues until the chatbot arrives at an endpoint, facilitating tasks such as checking statuses, submitting applications, updating bloodstock, initiating registrations, processing withdrawals, and lodging complaints, among other services.
[109] Frequently asked questions are incorporated into the chatbot's knowledge base, ensuring readily available information for common queries. Overall, the chatbot's tree-based chat flow, powered by JSON and tailored to eRaktkosh, enables streamlined user interactions and facilitates prompt and accurate delivery of services and information within the blood bank domain.
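An illustrative fragment, assuming hypothetical node names, key terms, and prompt videos, of how the JSON-encoded, tree-based chat flow could be represented and traversed; this is a sketch and not the actual eRaktkosh configuration:

```python
import json

# Illustrative fragment of a JSON-encoded, tree-based chat flow for eRaktkosh.
# Each node carries the ISL prompt video to play and the branches that follow.
CHAT_TREE = json.loads("""
{
  "node": "root",
  "prompt_video": "welcome_isl.mp4",
  "branches": {
    "nearest blood bank": {"node": "locate_bank", "prompt_video": "ask_pincode_isl.mp4", "branches": {}},
    "blood availability": {"node": "check_stock", "prompt_video": "ask_blood_group_isl.mp4", "branches": {}}
  }
}
""")

def next_node(tree: dict, recognized_glosses: str) -> dict:
    """Descend one level of the tree by matching key terms in the user's query."""
    for key_terms, child in tree["branches"].items():
        if all(term in recognized_glosses.lower() for term in key_terms.split()):
            return child
    return tree   # endpoint reached or no match: stay at this node and re-prompt

print(next_node(CHAT_TREE, "NEAREST BLOOD BANK")["node"])   # -> locate_bank
```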
[110] EXAMPLE 12: Operation of Multimodal Virtual Assistant
[111] The system processes real-time multimodal inputs from users, which include sign language videos (such as ISL), partially less-coherent spoken queries, and partial or complete textual inputs. The inputs can either complement each other, by providing different parts of the intended message, or reinforce each other when similar content is expressed across modalities.
[112] For example, a user asking about blood bank inventory may sign part of the query in ISL, such as "blood bank status," while simultaneously speaking less coherently, saying "available...blood type." The system captures these inputs through its multimodal feature extractor, where each modality (sign language, speech, and text) is analyzed and combined.
[113] When inputs from different modalities complement one another, such as the ISL query providing the subject and the spoken input adding detail, the system integrates the information to form a complete understanding of the user’s intent. On the other hand, when the same content is expressed across multiple modalities (e.g., both the spoken query and ISL say “blood bank status”), the inputs reinforce each other, increasing the system's confidence in the interpretation.
[114] The system employs adaptive weighting, adjusting the importance given to each modality based on factors like clarity, completeness, and accuracy. In cases where the spoken query is less coherent, the system might rely more heavily on the clear ISL gestures and text input. Cross-modal attention mechanisms ensure that even facial expressions and lip movements contribute to refining the system's interpretation. By using this complementary and reinforcing approach, the system can interpret incomplete or less-coherent inputs while still delivering accurate, contextually relevant responses. The output can be in the form of pre-recorded or AI-generated sign language videos, as well as synthesized speech with correct mouthing and text caption responses, depending on the user's input preferences.
[115] EXAMPLE 13: Operation of Multimodal Virtual Assistant in a Virtual Reality (VR) Environment with AI-Generated Sign Language Avatars
[116] The multimodal virtual assistant can operate within a fully immersive Virtual Reality (VR) environment, where AI-generated sign language avatars enhance interaction. Users can engage with the system through a combination of real-time sign language videos (such as ISL), partially less-coherent spoken queries, and partial or complete textual inputs, all integrated into the VR setting.
[117] In this VR environment, users navigate a virtual space—such as a government service portal, healthcare center, or educational facility—using VR hand controllers and sensors that capture sign language gestures, facial expressions, and body movements. The assistant uses AI-generated avatars to respond in sign language, providing a dynamic, lifelike communication experience for users. These avatars deliver pre-recorded or AI-generated responses, using sign language that is syntactically accurate and visually natural within the 3D VR space.
[118] As users interact with the system, inputs from various modalities (ISL, speech, and text) complement or reinforce each other. For instance, a user might sign “blood bank status” while saying “available...blood type.” The system uses multimodal feature extraction to interpret these inputs, combining them to either complete the user's query or increase the confidence in its accuracy. When multiple inputs express similar content, the system reinforces the understanding through its cross-modal attention mechanism, which also processes facial expressions and lip movements in the VR space.
[119] The VR environment further enhances accessibility by providing real-time feedback. If the system detects incomplete or unclear inputs, visual prompts may appear within the VR space, guiding the user on how to improve gesture clarity or offering suggestions for spoken or typed queries. The user can adjust their position or movements, and the system responds with updated visual cues, ensuring smooth interaction.
[120] The AI-generated sign language avatars offer a fluid, responsive method of communication within the virtual environment. These avatars can convey complex messages using ISL and other sign languages, adapting to the user's input in real time. The avatars' signing is synchronized with the user's query, creating a seamless and natural flow of communication that mirrors interactions with human interpreters, but within the VR space. Additionally, synthesized speech or text-based responses can accompany the avatar’s signing, offering multimodal outputs tailored to each user’s preferences.
[121] This allows the creation of an inclusive and immersive experience for users in a VR environment. It enables the Deaf and Hard of Hearing (DHH) community to interact with AI-generated sign language avatars, while also supporting users who rely on speech or text.
CLAIMS:
WE CLAIM:
1. A system (100) for accessing various applications on a service aggregation platform with a chatbot, the system comprising:
an adaptive user interface (102) configured for receiving user query/requests that are real-time, wherein the interface may handle partial text and less coherent speech inputs;
a wearable sensor module (104) with inertial measurement units (IMUs) for enhanced hand movement tracking;
a chatbot engine (106) for managing user interactions, wherein the chatbot engine is configured to:
retain received user query/requests in a queue for sequential processing;
provide real-time feedback and interactive prompts to guide the user during interaction, wherein the feedback may include haptic feedback;
perform validation and calibration of the received user query/requests; and
select recognition modes dynamically based on context and quality of the received user query/requests;
an inference engine (108) for recognizing Indian Sign Language (ISL) in the user query/requests, wherein the engine comprises:
a multimodal feature extractor (110) with a plurality of modality-specific encoders to process and combine the received user query/requests;
an adaptive weighting module (112) based on factors including clarity, completeness, accuracy, and consistency of inputs from each modality;
a sequential learning module (114) including a multimodal transformer, the module comprising:
modality-specific encoders (302) for sign language, facial expressions, lip movements, and speech, incorporating regional adaptation layers and attention mechanisms to handle regional variations;
cross-modal transformer layers (304) with cross-modal attention mechanisms for aligning and integrating modality-specific information; and
a temporal dependency modeling module (306) and recurrent transformer blocks with enhanced memory capabilities to effectively capture long-term dependencies in sequential data; and
a response mapping module (116) for aligning predicted user query/requests with the most appropriate and contextually relevant pre-recorded response videos or AI-generated synthesized videos.
2. The system (100) as claimed in claim 1, wherein the adaptive weighting module (112) utilizes quality and completeness of inputs from each modality, adjusting their weights to improve sign language recognition.
3. The system (100) as claimed in claim 1, wherein the chat engine (106) is configured to perform validation and calibration of the user query/requests received by ensuring that the user’s face and hands are fully visible within the video frame with detection algorithms, providing real-time feedback to prompt the user to adjust their position when necessary.
4. The system (100) as claimed in claim 3, wherein the validation may be integrated with adaptive weighting by adjusting the influence of visual data based on the completeness of the inputs.
5. The system (100) as claimed in claim 1, comprising a pre-processing module (118) for performing alignment correction and downsampling the validated and calibrated user query/requests.
6. The system (100) as claimed in claim 1, wherein the chat engine (106) and the inference engine (108) are configured to operate as multiple instances to perform parallel processing of user query/requests.
7. The system (100) as claimed in claim 1, wherein the adaptive weighting module (112) enhances the accuracy and reliability of the system by prioritizing higher-quality inputs and reducing the influence of lower-quality or incomplete inputs.
8. The system (100) as claimed in claim 1, wherein the recognition modes are static and dynamic modes, allowing the system to dynamically select the most appropriate mode based on the complexity and type of input received.
9. The system (100) as claimed in claim 1, wherein the query/requests are real-time sign language videos/images, speech inputs, text inputs, facial expressions, body language cues, and optionally, sensor inputs.
10. The system (100) as claimed in claim 1, wherein the wearable sensor module (104) performs signal processing to improve tracking accuracy and user interaction.
11. The system (100) as claimed in claim 1, wherein the context and quality of received user query/requests include the coherence level of speech inputs, the completeness of text, and the complexity of sign language videos.
12. A method (200) for accessing various applications on a service aggregation platform with a multimodal chatbot, wherein the method comprises:
receiving (202) user queries/requests as real-time sign language videos/images, along with other modalities such as partial text, speech inputs, facial expressions, body language cues, and sensor inputs;
determining (204) the recognition mode based on the received user queries/requests;
dynamically adjusting (206) the weight of each modality based on the quality and completeness;
determining (208) the weighted inputs with an inference engine (108);
recognizing (210) Indian Sign Language (ISL) in the user query/requests with the inference engine (108);
extracting (212) spatial embedding from the user query/requests with a multimodal feature extractor;
dynamically adjusting (214) the modality weights by evaluating the quality of each modality;
performing (216) sequential learning with a transformer encoder provided with the extracted spatial embedding and temporal patterns in sign language;
performing (218) model training and gesture recognition;
computing (220) expectation over all possible frame-to-gloss alignments; and
aligning (222) predicted glosses with the most appropriate and contextually relevant pre-recorded response videos or AI-generated synthesized responses.
Dr V. SHANKAR
IN/PA-1733
For and on behalf of the Applicants
| # | Name | Date |
|---|---|---|
| 1 | 202341066041-STATEMENT OF UNDERTAKING (FORM 3) [02-10-2023(online)].pdf | 2023-10-02 |
| 2 | 202341066041-PROVISIONAL SPECIFICATION [02-10-2023(online)].pdf | 2023-10-02 |
| 3 | 202341066041-OTHERS [02-10-2023(online)].pdf | 2023-10-02 |
| 4 | 202341066041-FORM FOR SMALL ENTITY(FORM-28) [02-10-2023(online)].pdf | 2023-10-02 |
| 5 | 202341066041-FORM 1 [02-10-2023(online)].pdf | 2023-10-02 |
| 6 | 202341066041-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [02-10-2023(online)].pdf | 2023-10-02 |
| 7 | 202341066041-EDUCATIONAL INSTITUTION(S) [02-10-2023(online)].pdf | 2023-10-02 |
| 8 | 202341066041-Proof of Right [06-04-2024(online)].pdf | 2024-04-06 |
| 9 | 202341066041-FORM-26 [06-04-2024(online)].pdf | 2024-04-06 |
| 10 | 202341066041-FORM-9 [03-10-2024(online)].pdf | 2024-10-03 |
| 11 | 202341066041-FORM-8 [03-10-2024(online)].pdf | 2024-10-03 |
| 12 | 202341066041-FORM 18 [03-10-2024(online)].pdf | 2024-10-03 |
| 13 | 202341066041-FORM 13 [03-10-2024(online)].pdf | 2024-10-03 |
| 14 | 202341066041-DRAWING [03-10-2024(online)].pdf | 2024-10-03 |
| 15 | 202341066041-CORRESPONDENCE-OTHERS [03-10-2024(online)].pdf | 2024-10-03 |
| 16 | 202341066041-COMPLETE SPECIFICATION [03-10-2024(online)].pdf | 2024-10-03 |
| 17 | 202341066041-RELEVANT DOCUMENTS [24-03-2025(online)].pdf | 2025-03-24 |
| 18 | 202341066041-POA [24-03-2025(online)].pdf | 2025-03-24 |
| 19 | 202341066041-FORM 13 [24-03-2025(online)].pdf | 2025-03-24 |
| 20 | 202341066041-OTHERS [06-05-2025(online)].pdf | 2025-05-06 |
| 21 | 202341066041-EDUCATIONAL INSTITUTION(S) [06-05-2025(online)].pdf | 2025-05-06 |
| 22 | 202341066041-FER.pdf | 2025-10-10 |
| 1 | 202341066041_SearchStrategyNew_E_SearchHistoryE_29-09-2025.pdf |