Abstract: A sign to speech conversion system, the system (100) comprising an input module (102), wherein the input module (102) comprises a camera and a microphone configured to capture video of the user and voice commands of the user, respectively. The system (100) further comprises at least one processor (104) operationally coupled with the input module (102), wherein the at least one processor (104) is configured to receive a dataset of Indian Sign Language (ISL) signs and gestures stored in a database, train a machine learning model on the dataset to identify specific ISL signs from the captured video input, translate recognized ISL signs into intermediate text data, convert the intermediate text data into audible speech output to allow communication with individuals who do not understand ISL, and convert spoken language into text, enabling bidirectional communication between ISL users and non-ISL users; and a user interface (106) installed within a computing unit (108).
Description: FIELD OF THE DISCLOSURE
[0001] This invention generally relates to the field of assistive communication technologies and, in particular, to a sign to speech conversion system.
BACKGROUND
[0002] The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.
[0003] In India, there are millions of individuals who are deaf or hard of hearing, many of whom use Indian Sign Language (ISL) as their primary mode of communication. However, a major challenge they face is the lack of widespread understanding of ISL among the general population. This communication barrier often results in social isolation and limits access to essential services, including education, healthcare, and employment. The ability of ISL users to interact effectively with non-sign language users is often constrained, which further alienates them from mainstream society and hinders their integration.
[0004] Previous efforts to address this communication gap include the use of human interpreters and manual translation services. While these solutions have been helpful, they are often expensive, impractical for real-time interactions, and unavailable in many regions. Additionally, traditional interpreters do not offer the flexibility needed for everyday personal or professional interactions, such as spontaneous conversations in public spaces, meetings, or educational settings. Moreover, existing technological solutions for sign language translation are typically not tailored for Indian Sign Language and lack support for multiple regional Indian languages.
[0005] Patent application AU2021105337A4, titled "A device for conversion of visual gestures into handwritten of deaf/physically challenged people", discloses a device for converting the visual gestures of deaf or physically challenged people into handwriting, comprising a camera coupled with a body for capturing visual gestures and signs; a conversion module for converting visual gestures and signs into text; a classification module for identifying the task of a person; and a central processing unit configured to augment the converted text by transliterating it into the identified person's own handwriting style.
[0006] Therefore, there is a need for a sign to speech conversion system that overcomes the drawbacks of the prior art.
OBJECTIVES OF THE INVENTION
[0007] The objective of the present invention is to provide a sign to speech conversion system.
[0008] Furthermore, the objective of the present invention is to implement machine learning techniques such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) for accurate recognition of ISL gestures and contextual understanding of sequences.
[0009] Furthermore, the objective of the present invention is to offer multi-language support by translating recognized ISL gestures into multiple regional Indian languages, making the system more accessible to a diverse population.
[0010] Furthermore, the objective of the present invention is to provide bidirectional communication by converting spoken language into text for ISL users, enabling them to interact with non-ISL users without the need for human interpreters.
SUMMARY
[0011] According to an aspect, the present embodiments disclose a sign to speech conversion system, the system comprising an input module, wherein the input module comprises a camera and a microphone configured to capture video of the user and voice commands of the user, respectively; and at least one processor operationally coupled with the input module, wherein the at least one processor is configured to receive a dataset of Indian Sign Language (ISL) signs and gestures stored in a database, train a machine learning model on the dataset to identify specific ISL signs from the captured video input, translate recognized ISL signs into intermediate text data, convert the intermediate text data into audible speech output to allow communication with individuals who do not understand ISL, and convert spoken language into text, enabling bidirectional communication between ISL users and non-ISL users; and a user interface installed within a computing unit, wherein the at least one processor is configured to provide the final processed data as spoken words, written text, and gesture feedback, enhancing accessibility and understanding.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The accompanying drawings illustrate various embodiments of systems, methods, and embodiments of various other aspects of the disclosure. Any person with ordinary skills in the art will appreciate that the illustrated element boundaries (e.g. boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. It may be that in some examples one element may be designed as multiple elements or that multiple elements may be designed as one element. In some examples, an element shown as an internal component of one element may be implemented as an external component in another, and vice versa. Furthermore, elements may not be drawn to scale. Non-limiting and non-exhaustive descriptions are described with reference to the following drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating principles.
[0013] FIG. 1 illustrates a block diagram of a sign to speech conversion system, according to an embodiment of the present invention.
[0014] FIG. 2 illustrates a flow chart of a method of operating a sign to speech conversion system.
DETAILED DESCRIPTION
[0015] Some embodiments of this disclosure, illustrating all its features, will now be discussed in detail. The words “comprising,” “having,” “containing,” and “including,” and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0016] Although any systems and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the preferred, systems and methods are now described. Embodiments of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings in which like numerals represent like elements throughout the several figures, and in which example embodiments are shown. Embodiments of the claims may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The examples set forth herein are non-limiting examples and are merely examples among other possible examples.
[0017] The present invention discloses a sign-to-speech conversion system (100) that leverages advanced machine learning and computer vision techniques to translate Indian Sign Language gestures into speech and text.
[0018] FIG. 1 illustrates a block diagram of a sign to speech conversion system, according to an embodiment of the present invention.
[0019] In some embodiments, the system (100) comprises an input module (102), at least one processor (104), a database of ISL gestures, and a user interface (106) installed on a computing unit (108).
[0020] In some embodiments, the input module (102) of the system (100) consists of a camera and a microphone, responsible for capturing video of the user's ISL gestures and receiving voice commands from non-ISL users, respectively. The camera is configured to track hand movements, facial expressions, and body posture to accurately recognize ISL signs. The microphone collects spoken language data from hearing users, which the system may convert into text for ISL users. By enabling both visual and auditory inputs, the system facilitates bidirectional communication between ISL and non-ISL users.
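The following is a minimal, illustrative sketch of how the input module (102) could be realised in software, assuming OpenCV for video capture and the sounddevice library for audio; the camera index, clip length, and sample rate are assumptions for illustration and are not part of the disclosure.

```python
import cv2                # video capture from the camera (assumed OpenCV backend)
import sounddevice as sd  # audio capture from the microphone (assumption)

AUDIO_RATE = 16000        # assumed sample rate for speech audio (Hz)

def capture_frame(cap):
    """Grab a single video frame of the signer; returns None if the read fails."""
    ok, frame = cap.read()
    return frame if ok else None

def capture_audio(seconds: float = 2.0):
    """Record a short mono audio clip of a spoken command from a non-ISL user."""
    clip = sd.rec(int(seconds * AUDIO_RATE), samplerate=AUDIO_RATE, channels=1)
    sd.wait()  # block until the recording completes
    return clip

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)   # default camera index (assumption)
    frame = capture_frame(cap)
    audio = capture_audio()
    cap.release()
```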
[0021] In one embodiment, the at least one processor (104) may be communicatively coupled to a memory. The at least one processor (104) may include suitable logic, input/output circuitry, and communication circuitry that are operable to execute one or more instructions stored in the memory to perform predetermined operations. In one embodiment, the at least one processor (104) may be configured to decode and execute any instructions received from one or more other electronic devices or server(s). The at least one processor (104) may be configured to execute one or more computer-readable program instructions, such as program instructions to carry out any of the functions described in this description. Further, the at least one processor (104) may be implemented using one or more processor technologies known in the art. Examples of the at least one processor (104) include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors.
[0022] In one embodiment, the memory may be configured to store a set of instructions and data executed by the at least one processor (104). Further, the memory may include one or more instructions that are executable by the at least one processor (104) to perform specific operations.
[0023] The at least one processor (104) is operationally coupled with the input module (102) and is configured to process the input data using advanced algorithms. The processor first retrieves the dataset of Indian Sign Language gestures stored in the database. The dataset includes a comprehensive collection of ISL signs, categorized based on hand shapes, movements, orientation, and facial expressions.
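As a purely hypothetical illustration of how one entry in such a dataset might be organised, the sketch below uses a Python dataclass; the field names and example values are assumptions, not the system's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ISLSignRecord:
    """One dataset entry, categorised by the features described above (illustrative)."""
    label: str                    # e.g. "thank you"
    hand_shape: str               # e.g. "flat palm"
    movement: str                 # e.g. "forward from the chin"
    orientation: str              # e.g. "palm up"
    facial_expression: str = ""   # optional accompanying expression
    video_clips: list = field(default_factory=list)  # paths to example recordings

example = ISLSignRecord(label="thank you", hand_shape="flat palm",
                        movement="forward from the chin", orientation="palm up")
```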
[0024] Once the dataset is retrieved, the processor employs machine learning models, specifically a Convolutional Neural Network (CNN) for gesture recognition and a Recurrent Neural Network (RNN) for understanding the sequence and context of gestures. The CNN processes the visual input (the captured video), recognizing individual gestures with high accuracy by analyzing the spatial structure of hand shapes and movements. The RNN, in turn, helps in understanding the sequence of gestures, allowing the system to capture the context in which the gestures are made, which is critical for interpreting longer sentences or phrases in ISL.
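One way such a CNN-plus-RNN pipeline could be sketched is shown below, assuming TensorFlow/Keras; the layer sizes, number of sign classes, sequence length, and image size are illustrative assumptions rather than values taken from the disclosure.

```python
from tensorflow.keras import layers, models

NUM_SIGNS = 100   # assumed number of ISL sign classes
SEQ_LEN   = 16    # assumed number of frames per gesture sequence
IMG_SIZE  = 64    # assumed frame height/width after preprocessing

def build_sign_model():
    """CNN extracts per-frame spatial features; an LSTM (a type of RNN) models the sequence."""
    frame_cnn = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(IMG_SIZE, IMG_SIZE, 3)),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
    ])
    model = models.Sequential([
        # Apply the CNN independently to every frame in the gesture sequence.
        layers.TimeDistributed(frame_cnn, input_shape=(SEQ_LEN, IMG_SIZE, IMG_SIZE, 3)),
        # The recurrent layer captures temporal context across consecutive gestures.
        layers.LSTM(128),
        layers.Dense(NUM_SIGNS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```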
[0025] After recognizing the ISL gestures from the video input, the processor translates the recognized signs into intermediate text data. This intermediate text is a direct translation of the gestures, providing the system with a textual representation of the user's input.
[0026] For example, if a user signs the phrase "How are you?" in ISL, the system will convert this into the corresponding text format.
[0027] Next, the at least one processor (104) converts the intermediate text data into audible speech output, allowing the ISL user to communicate with individuals who do not understand ISL. The system (100) supports multiple Indian languages, meaning that the speech output can be tailored to the language preference of the non-ISL user. This feature ensures that communication is not limited to one language but can accommodate the linguistic diversity present in India.
[0028] The system (100) may be designed to support bidirectional communication. In this mode, spoken language input from a non-ISL user may be converted into text, enabling ISL users to read and understand what is being said to them. For example, a teacher speaking in Hindi may have their speech converted into text, which is displayed on the system’s user interface (106) for the student who relies on ISL for communication. This feature ensures that communication between ISL users and non-ISL users is not one-sided but allows for full interaction between the two groups.
[0029] The system (100) also provides real-time feedback during gesture input and communication. As users input gestures or spoken commands, the system (100) analyzes the data in real-time and provides suggestions for corrections if any mistakes are detected. This ensures that communication is as accurate as possible, minimizing the risk of misunderstandings. For example, if the ISL user's gesture is ambiguous or unclear, the system may prompt the user to clarify or repeat the gesture, ensuring the translation remains accurate.
[0030] The user interface (106) is installed within a computing unit (108), such as a smartphone, tablet, or desktop computer, allowing users to interact with the system easily. The interface provides visual feedback in the form of written text and audible speech, displaying both the user’s ISL gestures and the spoken language translations. It is designed to be intuitive and accessible, ensuring that both ISL users and non-ISL users can navigate the system without difficulty. The interface also offers customization options, allowing users to choose the output language and adjust settings to suit their specific needs.
[0031] The system has wide-ranging applications across various sectors, including education, healthcare, public services, and the workplace.
[0032] For example, the system (100) can be integrated into classrooms to help deaf or hard-of-hearing students communicate effectively with teachers and peers. It also allows educators to interact with ISL-using students without requiring an interpreter.
[0033] For example, in healthcare settings, the system can be used to facilitate communication between patients who use ISL and medical staff who do not understand ISL, ensuring that patients receive accurate medical care and advice.
[0034] FIG. 2 illustrates a flow chart of a method of operating a sign to speech conversion system.
[0035] At step 202, the process begins with the input module (102), where the system captures both visual and auditory data from the user. The input module consists of a camera and a microphone. The camera captures the video input of the user's hand gestures, body movements, and facial expressions, which are essential for interpreting ISL signs accurately. The microphone captures voice commands or spoken language from non-ISL users, which the system may process for bidirectional communication.
[0036] At step 204, the camera continuously tracks the user’s gestures, which are processed by the system to detect the exact ISL signs being performed. The input received is crucial for initiating the subsequent steps of recognition and translation.
[0037] At step 206, once the visual input is captured, the system proceeds to retrieve relevant data from a pre-existing dataset stored in the database. The database includes a vast collection of Indian Sign Language signs, categorized based on various features such as hand orientation, movement, facial expression, and body posture. The retrieved data serves as a reference point for comparing the captured gestures with known ISL signs.
[0038] Before gesture recognition, the input data is preprocessed. Preprocessing may involve cleaning the video data, removing any unnecessary noise or irrelevant background elements, and ensuring that the gesture's key features are extracted correctly. This stage ensures that the system operates efficiently and can accurately identify the user’s gestures.
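A possible form of this preprocessing is sketched below with OpenCV, under the assumption that background subtraction, resizing, and pixel normalisation are used; the disclosure does not fix the preprocessing at this level of detail.

```python
import cv2
import numpy as np

IMG_SIZE = 64  # assumed resolution fed to the recognition model
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200)

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Suppress the static background, resize the frame, and scale pixels to [0, 1]."""
    mask = bg_subtractor.apply(frame)                  # mask out unchanging background
    cleaned = cv2.bitwise_and(frame, frame, mask=mask)
    resized = cv2.resize(cleaned, (IMG_SIZE, IMG_SIZE))
    return resized.astype("float32") / 255.0           # normalise pixel values
```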
[0039] At step 208, after data preprocessing, the system utilizes machine learning models to analyze and interpret the user's gestures. The Convolutional Neural Network (CNN) is employed for gesture recognition. The CNN processes the visual input, analyzing the spatial structure of the user's hand movements and facial expressions. The system matches the captured gestures with the corresponding ISL signs stored in the database.
[0040] The CNN plays a crucial role in identifying specific gestures by detecting visual patterns in the input video. It learns from the dataset by recognizing various hand shapes, motions, and positions. This learning process allows the system to handle variations in the way different users perform the same ISL signs. Once a gesture is identified, the system moves to the next stage.
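How the model might be made tolerant of such user-to-user variation is illustrated by the short data-augmentation sketch below, assuming TensorFlow image operations; the particular transformations are assumptions, not part of the disclosure.

```python
import tensorflow as tf

def augment_frame(frame: tf.Tensor) -> tf.Tensor:
    """Randomly perturb a training frame so the CNN tolerates variation between signers."""
    frame = tf.image.random_brightness(frame, max_delta=0.2)  # lighting differences
    frame = tf.image.random_contrast(frame, 0.8, 1.2)         # camera differences
    frame = tf.image.random_flip_left_right(frame)            # left- vs right-handed signers
    return frame
```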
[0041] For interpreting longer phrases or sequences of gestures, the system employs a Recurrent Neural Network (RNN). RNN is designed to understand the temporal and contextual relationships between consecutive gestures, which is essential for interpreting ISL sentences or complex expressions.
[0042] ISL, like other sign languages, relies heavily on context, and understanding the sequence of gestures is necessary for accurate translation. The RNN model helps the system capture the meaning of entire sentences rather than treating each gesture as an isolated event. For example, a sequence of gestures representing “How are you?” requires context-based interpretation, which the RNN facilitates.
[0043] At step 210, once the gestures are recognized and the context is understood, the system converts the identified ISL signs into intermediate text data. This translation step converts the visual input into written language. The intermediate text represents the direct meaning of the signed gestures, providing a readable format that the system can further process.
[0044] The intermediate text data is critical because it allows the system to maintain a textual representation of the user’s communication, which is necessary for subsequent conversion to audible speech and for record-keeping purposes.
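A minimal illustration of this translation step is given below, assuming the recognition model outputs class indices and a hypothetical lookup table maps them to words or short phrases; the vocabulary entries are examples only.

```python
# Hypothetical lookup from predicted class index to an ISL gloss/word.
SIGN_VOCAB = {0: "hello", 1: "how", 2: "you", 3: "thank you"}

def signs_to_text(predicted_indices):
    """Join recognised sign labels into an intermediate text string."""
    words = [SIGN_VOCAB.get(i, "<unknown>") for i in predicted_indices]
    return " ".join(words)

# e.g. signs_to_text([1, 2]) -> "how you", later rendered as "How are you?"
```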
[0045] Conversion to Speech Output: After translating the gestures into text, the system then converts the intermediate text data into speech output. This conversion is accomplished using a text-to-speech (TTS) engine, which reads the text and produces an audible output. The TTS engine is capable of producing speech in multiple Indian languages, depending on the preferences of the non-ISL user.
[0046] This feature ensures that the output can cater to a diverse linguistic audience, making communication more inclusive. For example, the system may convert an ISL phrase into Hindi, Tamil, or English, depending on the user’s selection. The speech output is delivered in real-time, facilitating seamless communication between ISL users and hearing individuals.
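A sketch of this text-to-speech stage is shown below, assuming the gTTS library as the TTS engine; the language codes shown ("hi", "ta", "en") simply illustrate the multi-language output described above.

```python
from gtts import gTTS  # assumed TTS engine; any engine with Indian-language support would do

def text_to_speech(text: str, lang: str = "hi", out_path: str = "speech.mp3") -> str:
    """Convert intermediate text into an audio file in the selected language."""
    gTTS(text=text, lang=lang).save(out_path)
    return out_path

# e.g. text_to_speech("Aap kaise hain?", lang="hi")   # Hindi output
#      text_to_speech("How are you?", lang="en")      # English output
```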
[0047] At step 212, the microphone captures spoken language input from the non-ISL user, which the system then converts into text using speech-to-text technology. The resulting text is displayed on the screen, allowing the ISL user to read and respond.
[0048] This two-way interaction ensures that both parties can communicate effectively without needing an interpreter. For instance, a doctor and patient in a healthcare setting can communicate easily through this system without requiring a human translator.
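For the reverse direction, a minimal speech-to-text sketch is given below, assuming the SpeechRecognition package and its Google Web Speech backend; the language tag "hi-IN" is an illustrative choice.

```python
import speech_recognition as sr

def speech_to_text(language: str = "hi-IN") -> str:
    """Listen on the microphone and return recognised text for display to the ISL user."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)   # capture the spoken utterance
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""                           # speech could not be recognised
```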
[0049] Throughout the entire process, the system provides real-time feedback to the user. If a gesture is unclear or ambiguous, the system may prompt the user to clarify or correct the gesture. This real-time feedback ensures that the translation is as accurate as possible, reducing the risk of miscommunication.
[0050] For example, if the ISL user performs a gesture that is not fully recognized, the system may display a suggestion asking the user to repeat the gesture or provide a more precise sign. This feature is particularly useful in fast-paced conversations, ensuring that both parties remain on the same page.
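One plausible way to generate this feedback is sketched below, assuming the recognition model exposes a confidence score with each prediction; the 0.6 threshold is an assumption for illustration.

```python
CONFIDENCE_THRESHOLD = 0.6  # assumed cut-off below which a gesture is treated as ambiguous

def feedback_for_prediction(label: str, confidence: float) -> str:
    """Return the recognised label, or a prompt asking the user to repeat the sign."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "Gesture unclear - please repeat the sign or sign more slowly."
    return label
```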
[0051] In the final step, the processed data, including spoken words, written text, and gesture feedback, is displayed via a user interface (106). This interface is installed on a computing unit (108) such as a smartphone, tablet, or desktop computer. The interface is designed to be user-friendly, ensuring that both ISL and non-ISL users can easily navigate it.
[0052] The output may be presented in multiple formats: audible speech for non-ISL users, written text for ISL users, and visual feedback for gesture correction. The flexibility of the output format ensures that the system is accessible and can be adapted to the needs of various users in different communication contexts.
[0053] It should be noted that the sign to speech conversion system could in any case undergo numerous modifications and variants, all of which are covered by the same innovative concept; moreover, all of the details can be replaced by technically equivalent elements. In practice, the components used, as well as the numbers, shapes, and sizes of the components, can be of any kind according to the technical requirements. The scope of protection of the invention is therefore defined by the attached claims.
Claims: WE CLAIM:
1. A sign to speech conversion system, the system (100) comprising:
an input module (102), wherein the input module (102) comprises a camera and a microphone configured to capture video of the user and voice commands of the user, respectively;
at least one processor (104) operationally coupled with the input module (102), wherein the at least one processor (104) is configured to:
receive a dataset of Indian Sign Language (ISL) signs and gestures stored in a database;
train a machine learning model on the dataset to identify specific ISL signs from the captured video input;
translate recognized ISL signs into intermediate text data;
convert the intermediate text data into audible speech output to allow communication with individuals who do not understand ISL;
convert spoken language into text, enabling bidirectional communication between ISL users and non-ISL users; and
a user interface (106) installed within a computing unit (108), wherein the at least one processor (104) is configured to provide the final processed data as spoken words, written text, and gesture feedback, enhancing accessibility and understanding.
2. The system as claimed in claim 1, wherein the machine learning model comprises a Convolutional Neural Network (CNN) for gesture recognition and a Recurrent Neural Network (RNN) for sequence context analysis.