Abstract: SYSTEM FOR GENERATING CONTEXT-AWARE SCENE DESCRIPTIONS FOR VISUALLY IMPAIRED USERS ABSTRACT A system (100) for generating context-aware scene descriptions for visually impaired users is disclosed. The system (100) comprises a computer-readable media (102) adapted to store instructions and a processing unit (104) configured to: receive a digital input from a predefined source (106); analyze the received digital input using a computer vision model; detect objects based on the analyzed digital input, wherein the detected objects are selected from faces, texts, emotional expressions, articles, or a combination thereof; identify contextual relationships among the detected objects; generate a natural language description using a language generation engine (122) based on the identified contextual relationships; and generate an audio output by transmitting the generated natural language description to a text-to-speech engine (124). The system (100) conveys emotional cues and facial expressions, offering richer, more human-like narration than mere object labels.
Description: BACKGROUND
Field of Invention
[001] Embodiments of the present invention generally relate to a context-aware scene descriptor and particularly to a system for generating context-aware scene descriptions for visually impaired users.
Description of Related Art
[002] Technological advancements in artificial intelligence have contributed to the development of various tools aimed at assisting visually impaired individuals. These tools typically focus on object detection, text recognition, and conversion of visual elements into audio cues. Several commercial products now offer limited levels of support by identifying physical objects, reading text aloud, and executing predefined tasks through smart devices. While these systems serve as valuable aids, they primarily address basic recognition functions without offering deeper interaction or understanding of the surrounding environment.
[003] Existing solutions rely heavily on static visual inputs and predefined label databases, which restrict the scope of interpretation. The lack of adaptability to different scenes or user intent limits the effectiveness of these tools in dynamic or socially nuanced settings. Tools available in the market often neglect to consider contextual relationships between objects, emotional expressions of individuals, or moment-specific relevance, which results in a fragmented and superficial user experience. Furthermore, real-time user engagement through natural dialogue remains absent in most current systems.
[004] Prior efforts to combine image recognition with audio narration have demonstrated usefulness but fall short in offering enriched and personalized outputs. Limitations in natural language generation, scene understanding, and social awareness restrict the level of autonomy available to end users. The gap between basic object recognition and fully context-aware scene interpretation continues to pose challenges in achieving a comprehensive assistive system for visually impaired individuals.
[005] There is thus a need for an improved and advanced system for generating context-aware scene descriptions for visually impaired users that can address the aforementioned limitations in a more efficient manner.
SUMMARY
[006] Embodiments in accordance with the present invention provide a system for generating context-aware scene descriptions for visually impaired users. The system comprises a computer-readable media adapted to store instructions. The system further comprises a processing unit adapted to execute the instructions stored in the computer-readable media. The processing unit is configured to receive a digital input from a predefined source; analyse the received digital input using a computer vision model; and detect objects based on the analysed digital input. The detected objects are selected from faces, texts, emotional expressions, articles, or a combination thereof. The processing unit is further configured to identify contextual relationships among the detected objects; generate a natural language description using a language generation engine based on the identified contextual relationships; and generate an audio output by transmitting the generated natural language description to a text-to-speech engine.
[007] Embodiments in accordance with the present invention further provide a method for generating context-aware scene descriptions for visually impaired users. The method comprises the steps of receiving a digital input from a predefined source; analyzing the received digital input using a computer vision model; detecting objects based on the analyzed digital input, wherein the detected objects are selected from faces, texts, emotional expressions, articles, or a combination thereof; identifying contextual relationships among the detected objects; generating a natural language description using a language generation engine based on the identified contextual relationships; and generating an audio output by transmitting the generated natural language description to a text-to-speech engine.
[008] Embodiments of the present invention may provide a number of advantages depending on their particular configuration. First, embodiments of the present application may provide a system for generating context-aware scene descriptions for visually impaired users.
[009] Next, embodiments of the present application may provide a descriptor for visually impaired users that conveys emotional cues and facial expressions, offering richer, more human-like narration than mere object labels.
[0010] Next, embodiments of the present application may provide a descriptor for visually impaired users that supports natural-language queries after initial description, so users can ask context-specific questions and receive tailored answers in real time.
[0011] Next, embodiments of the present application may provide a descriptor for visually impaired users that interprets relationships among objects and people within a scene, providing holistic, context-sensitive insights rather than isolated identifications.
[0012] Next, embodiments of the present application may provide a descriptor for visually impaired users that delivers hands-free, natural-voice output via Text-to-Speech (TTS), Speech-to-Text (STT), and screen readers, enabling unobtrusive use in everyday tasks and mobile scenarios.
[0013] Next, embodiments of the present application may provide a descriptor for visually impaired users that leverages cloud-based infrastructure for fast, on-demand processing and plugs directly into social platforms and smart devices for ubiquitous accessibility.
[0014] These and other advantages will be apparent from the present application of the embodiments described herein.
[0015] The preceding is a simplified summary to provide an understanding of some embodiments of the present invention. This summary is neither an extensive nor exhaustive overview of the present invention and its various embodiments. The summary presents selected concepts of the embodiments of the present invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] The above and still further features and advantages of embodiments of the present invention will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings, and wherein:
[0017] FIG. 1 illustrates a system for generating context-aware scene descriptions for visually impaired users, according to an embodiment of the present invention;
[0018] FIG. 2 illustrates a block diagram of a processing unit, according to an embodiment of the present invention; and
[0019] FIG. 3 depicts a flowchart of a method for generating context-aware scene descriptions for visually impaired users, according to an embodiment of the present invention.
[0020] The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including but not limited to. To facilitate understanding, like reference numerals have been used, where possible, to designate like elements common to the figures. Optional portions of the figures may be illustrated using dashed or dotted lines, unless the context of usage indicates otherwise.
DETAILED DESCRIPTION
[0021] The following description includes the preferred best mode of one embodiment of the present invention. It will be clear from this description of the invention that the invention is not limited to these illustrated embodiments but that the invention also includes a variety of modifications and embodiments thereto. Therefore, the present description should be seen as illustrative and not limiting. While the invention is susceptible to various modifications and alternative constructions, it should be understood, that there is no intention to limit the invention to the specific form disclosed, but, on the contrary, the invention is to cover all modifications, alternative constructions, and equivalents falling within the scope of the invention as defined in the claims.
[0022] In any embodiment described herein, the open-ended terms "comprising", "comprises", and the like (which are synonymous with "including", "having", and "characterized by") may be replaced by the respective partially closed phrases "consisting essentially of", "consists essentially of", and the like, or the respective closed phrases "consisting of", "consists of", and the like.
[0023] As used herein, the singular forms “a”, “an”, and “the” designate both the singular and the plural, unless expressly stated to designate the singular only.
[0024] FIG. 1 illustrates a system 100 for generating context-aware scene descriptions for visually impaired users, according to an embodiment of the present invention. The system 100 may be adapted to generate a textual description and/or a vocal description of a scene and/or a visualization. The system 100 may conduct artificially computed storytelling, emotion-sensitive descriptions of scenes, and real-time conversations to offer visually impaired individuals an immersive experience of their surroundings and/or a specific environment. The system 100 may operate on multimodal image recognition that may be superior to traditional object recognition systems.
[0025] According to the embodiments of the present invention, the system 100 may incorporate non-limiting hardware components to enhance processing speed and efficiency. For example, the system 100 may comprise a computer-readable media 102, a processing unit 104, a predefined source 106, a third-party platform 108, a social media application 110, an entertainment platform 112, an infotainment platform 114, an education platform 116, a user device 118, a storage medium 120, a language generation engine 122, and a text-to-speech engine 124. In an embodiment of the present invention, the hardware components of the system 100 may be integrated with computer-executable instructions for overcoming the challenges and the limitations of the existing systems.
[0026] In an embodiment of the present invention, the computer-readable media 102 may be adapted to store instructions. The computer-readable media 102 may be, but not limited to, a Random-Access Memory (RAM), a Static Random-Access Memory (SRAM), a Dynamic Random-Access Memory (DRAM), a Read-Only Memory (ROM), an Erasable Programmable Read-only Memory (EPROM), an Electrically Erasable Programmable Read-only Memory (EEPROM), a NAND Flash, a Secure Digital (SD) memory, a cache memory, a Hard Disk Drive (HDD), a Solid-State Drive (SSD), and so forth. In a preferred embodiment of the present invention, the computer-readable media 102 may be a cloud-based server such as, but not limited to, Amazon Web Services (AWS), Azure, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the computer-readable media 102, including known, related art, and/or later developed technologies.
[0027] In an embodiment of the present invention, the processing unit 104 may be adapted to execute the instructions stored in the computer-readable media 102. The processing unit 104 may further be configured to execute computer-executable instructions to generate an output relating to the system 100. The processing unit 104 may be, but not limited to, a Programmable Logic Control (PLC) unit, a microprocessor, a development board, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the processing unit 104 including known, related art, and/or later developed technologies. In an embodiment of the present invention, the processing unit 104 may further be explained in conjunction with FIG. 2.
[0028] In an embodiment of the present invention, the predefined source 106 may provide a digital input to the processing unit 104. The digital input may be, but not limited to, an image, a video frame, a real-time video stream, a Graphics Interchange Format (GIF), a compressed digital file, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the digital input, including known, related art, and/or later developed technologies. The digital input may be the scene and/or the visualization for which the processing unit 104 may generate a natural language description and/or an audio output. The predefined source 106 may be, but not limited to, the third-party platform 108, the social media application 110, the entertainment platform 112, the infotainment platform 114, the education platform 116, the storage medium 120 of the user device 118, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the predefined source 106, including known, related art, and/or later developed technologies.
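By way of a non-limiting illustration, a minimal sketch of obtaining one such digital input (a single frame from a real-time video stream) is shown below, assuming OpenCV is available; the camera index and the JPEG encoding step are assumptions made only for clarity and are not required by the embodiments.

```python
# Minimal sketch, assuming OpenCV: grab one frame from a live camera source
# (one example of the predefined source 106) and return it as JPEG bytes.
import cv2

def capture_frame(camera_index: int = 0) -> bytes:
    """Read a single frame from a live video stream and encode it as JPEG."""
    capture = cv2.VideoCapture(camera_index)  # assumed camera index
    try:
        ok, frame = capture.read()
        if not ok:
            raise RuntimeError("No frame could be read from the source")
        ok, encoded = cv2.imencode(".jpg", frame)
        if not ok:
            raise RuntimeError("Frame could not be encoded as JPEG")
        return encoded.tobytes()
    finally:
        capture.release()
```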
[0029] In an embodiment of the present invention, the user device 118 may be an electronic device that may be used by the visually impaired users. The user device 118 may enable the visually impaired users to receive the natural language description and/or the audio output. Further, the user device 118 may enable the visually impaired users to generate the natural language description and/or the audio output from the digital input stored in the storage medium 120 of the user device 118. The user device 118 may be, but not limited to, a laptop, a smartphone, a camera, a webcam, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the user device 118, including known, related art, and/or later developed technologies.
[0030] Further, the natural language description and/or the audio output may be routed to a smart home device and/or a voice assistant (not shown). The smart home device and/or a voice assistant may be, but not limited to, Apple HomePod, Siri, Amazon Alexa, Google Home, Bixby, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the smart home device and/or a voice assistant, including known, related art, and/or later developed technologies.
[0031] The storage medium 120 may be, but not limited to, a Random-Access Memory (RAM), a Static Random-Access Memory (SRAM), a Dynamic Random-Access Memory (DRAM), a Read-Only Memory (ROM), an Erasable Programmable Read-only Memory (EPROM), an Electrically Erasable Programmable Read-only Memory (EEPROM), a NAND Flash, a Secure Digital (SD) memory, a cache memory, a Hard Disk Drive (HDD), a Solid-State Drive (SSD), and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the storage medium 120, including known, related art, and/or later developed technologies.
[0032] In an embodiment of the present invention, the language generation engine 122 may be adapted to generate the natural language description. The language generation engine 122 may comprise a multimodal large language model trained on paired image-text datasets. In an embodiment of the present invention, the text-to-speech engine 124 may be adapted to generate the audio output from the generated natural language description.
[0033] FIG. 2 illustrates a block diagram of the processing unit 104, according to an embodiment of the present invention. The processing unit 104 may comprise the computer-executable instructions in the form of programming modules, such as a data receiving module 200, a data analysis module 202, a data detection module 204, a data identification module 206, and a data generation module 208.
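By way of a non-limiting illustration, a minimal sketch of how these programming modules may be chained as a sequential pipeline is shown below; the function names are hypothetical and stand in for the modules 200 to 208, and the actual implementation may differ.

```python
# Illustrative sketch only: hypothetical function names mirroring the data
# receiving (200), analysis (202), detection (204), identification (206),
# and generation (208) modules shown in FIG. 2.
from typing import Any, Callable

def receive_input(source: Any) -> bytes:
    """Data receiving module 200: obtain the raw digital input."""
    return source.read()  # e.g., a file-like object or camera wrapper

def describe_scene(source: Any,
                   analyse: Callable[[bytes], Any],
                   detect: Callable[[Any], Any],
                   identify: Callable[[Any], Any],
                   generate: Callable[[Any], bytes]) -> bytes:
    """Chain the modules in the order shown in FIG. 2 and return audio bytes."""
    digital_input = receive_input(source)   # module 200
    analysed = analyse(digital_input)       # module 202
    objects = detect(analysed)              # module 204
    relations = identify(objects)           # module 206
    return generate(relations)              # module 208
```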
[0034] In an embodiment of the present invention, the data receiving module 200 may be configured to receive the digital input from the predefined source 106. The data receiving module 200 may be configured to transmit the received digital input to the data analysis module 202.
[0035] The data analysis module 202 may be activated upon receipt of the digital input from the data receiving module 200. In an embodiment of the present invention, the data analysis module 202 may be configured to convert the received digital input into Base64. The data analysis module 202 may be configured to analyse the received digital input and the Base64 version of the digital input using a computer vision model. The computer vision model may be configured to detect and classify the emotional expressions based on facial feature analysis. The computer vision model may be, but not limited to, a You Only Look Once (YOLO) model, a Convoluted Computer Vision (CCV) algorithm, and so forth. In a preferred embodiment of the present invention, the computer vision model may be an Application Programming Interface (API) of Generative Pre-trained Transformer (GPT) version 4. Embodiments of the present invention are intended to include or otherwise cover any type of the computer vision model, including known, related art, and/or later developed technologies.
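By way of a non-limiting illustration, a minimal sketch of the Base64 conversion and the call to a vision-capable model API is shown below, assuming the OpenAI Python client; the model identifier, prompt wording, and message format are assumptions for illustration only and do not limit the choice of computer vision model.

```python
# Minimal sketch, assuming the OpenAI Python client and a vision-capable
# model identifier; the data analysis module 202 converts the digital input
# into Base64 and submits it for analysis.
import base64
from openai import OpenAI

def analyse_image(image_path: str) -> str:
    """Encode the digital input as Base64 and request a scene analysis."""
    with open(image_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")

    client = OpenAI()  # API key read from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the objects, faces, text, and emotional "
                         "expressions visible in this scene."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```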
[0036] The data analysis module 202 may be configured to transmit the analyzed digital input to the data detection module 204.
[0037] The data detection module 204 may be activated upon receipt of the analyzed digital input from the data analysis module 202. In an embodiment of the present invention, the data detection module 204 may be configured to detect objects based on the analysed digital input. The detected objects may be, but not limited to, faces, texts, emotional expressions, articles, and so forth. Further, the detected objects may include previously known individuals identified using facial recognition trained on a user-provided image dataset. Embodiments of the present invention are intended to include or otherwise cover any type of the detected objects, including known, related art, and/or later developed technologies. The data detection module 204 may be configured to transmit the detected objects to the data identification module 206.
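By way of a non-limiting illustration, a minimal sketch of matching detected faces against previously known individuals is shown below; the embedding function `embed_face` is a hypothetical stand-in for whatever facial-recognition model is trained on the user-provided image dataset, and the similarity threshold is an assumption.

```python
# Minimal sketch: identify a previously known individual by comparing a face
# embedding against enrolled embeddings; `embed_face` is hypothetical.
import numpy as np
from typing import Callable, Dict, Optional

def identify_person(face_image: np.ndarray,
                    known_embeddings: Dict[str, np.ndarray],
                    embed_face: Callable[[np.ndarray], np.ndarray],
                    threshold: float = 0.6) -> Optional[str]:
    """Return the name of the closest known person, or None if no match."""
    query = embed_face(face_image)  # e.g., a fixed-length feature vector
    best_name, best_score = None, -1.0
    for name, ref in known_embeddings.items():
        # Cosine similarity between the query face and each enrolled face.
        score = float(np.dot(query, ref) /
                      (np.linalg.norm(query) * np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```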
[0038] The data identification module 206 may be activated upon receipt of the detected objects from the data detection module 204. In an embodiment of the present invention, the data identification module 206 may be configured to identify contextual relationships among the detected objects. The contextual relationships may comprise a spatial positioning, a proximity, interaction cues, and so forth between the detected objects. The data identification module 206 may be configured to transmit the identified contextual relationships to the data generation module 208.
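By way of a non-limiting illustration, a minimal sketch of deriving simple spatial-positioning and proximity relationships from bounding boxes is shown below; the box format and the distance threshold are assumptions made only to show the idea.

```python
# Minimal sketch: derive simple contextual relationships (proximity and
# left/right positioning) from detected-object bounding boxes.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def centre(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def spatial_relationships(objects: Dict[str, Box],
                          near_threshold: float = 100.0) -> List[str]:
    """Produce human-readable relations such as 'dog is near child'."""
    relations = []
    names = list(objects)
    for i, a in enumerate(names):
        ax, ay = centre(objects[a])
        for b in names[i + 1:]:
            bx, by = centre(objects[b])
            distance = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
            if distance < near_threshold:       # assumed pixel threshold
                relations.append(f"{a} is near {b}")
            side = "left of" if ax < bx else "right of"
            relations.append(f"{a} is to the {side} {b}")
    return relations
```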
[0039] The data generation module 208 may be activated upon receipt of the identified contextual relationships from the data identification module 206. In an embodiment of the present invention, the data generation module 208 may be configured to deploy the language generation engine 122. The language generation engine 122 may be configured to generate the natural language description based on the identified contextual relationships.
[0040] The data generation module 208 may be configured to deploy the text-to-speech engine 124. The text-to-speech engine 124 may receive the natural language description generated from the language generation engine 122. Further, the text-to-speech engine 124 may be configured to generate the audio output of the natural language description. The audio output of the natural language description may be generated by integrating an emotive modulation into the generated natural language description to reflect a mood, a tone, a pitch, a speaking rate, a volume level, an intonation pattern, an emotional expression, and so forth. Furthermore, the generated audio output may be modulated to reflect an emotional tone detected from the digital input.
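By way of a non-limiting illustration, a minimal sketch of emotive modulation using SSML prosody markup, which many text-to-speech engines accept, is shown below; the mapping from a detected emotional tone to pitch, speaking rate, and volume values is an assumption for illustration only.

```python
# Minimal sketch: wrap the generated description in SSML prosody tags so the
# text-to-speech engine 124 can reflect the detected emotional tone; the
# emotion-to-prosody mapping below is an illustrative assumption.
EMOTION_PROSODY = {
    # emotion -> (pitch, speaking rate, volume) expressed as SSML values
    "joyful": ("+15%", "110%", "medium"),
    "calm":   ("+0%",  "95%",  "soft"),
    "tense":  ("-5%",  "105%", "loud"),
}

def to_ssml(description: str, emotion: str = "calm") -> str:
    pitch, rate, volume = EMOTION_PROSODY.get(emotion, EMOTION_PROSODY["calm"])
    return (
        "<speak>"
        f'<prosody pitch="{pitch}" rate="{rate}" volume="{volume}">'
        f"{description}"
        "</prosody>"
        "</speak>"
    )

# Example: to_ssml("Children are laughing near a fountain.", "joyful")
```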
[0041] The data generation module 208 may be configured to enable the user to ask follow-up questions relating to the generated audio output. For example, if the generated audio output mentions a tree, the user may ask a follow-up question such as 'What kind of tree is it?', 'What is the colour of the tree?', or 'What fruit does the tree bear?'. The data generation module 208 may further be configured to enable the user to enquire about explanations relating to the generated audio output. For example, if the generated audio output mentions a smartphone with an AMOLED display, the user may enquire, 'What is an AMOLED display?'.
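By way of a non-limiting illustration, a minimal sketch of the follow-up question flow is shown below; the previously generated description is retained as conversational context, and the client library and model identifier are assumptions made only for clarity.

```python
# Minimal sketch, assuming the OpenAI Python client: answer a follow-up
# question by supplying the earlier scene description as context.
from openai import OpenAI

def answer_follow_up(scene_description: str, question: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": "You answer questions about a scene for a visually "
                        "impaired user, briefly and plainly."},
            {"role": "assistant", "content": scene_description},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Example: answer_follow_up(description, "What kind of tree is it?")
```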
[0042] In an exemplary scenario of the present invention, the system 100 may be configured to assist a blind person in interpreting a digital scene. The user, who is blind, may initiate a scan using a wearable smart camera connected to the system. The camera captures a live scene in a public park. The data receiving module 200 receives the digital input comprising the image stream from the smart camera and forwards it to the data analysis module 202. The data analysis module 202 converts the input into a Base64 format and processes the image data using a pre-trained computer vision model (e.g., YOLO or the GPT-4 API). The model detects various objects and individuals in the scene, including children playing, a dog running, a man reading a newspaper on a bench, and a vendor selling ice cream. Emotional cues, such as the children smiling and a frustrated expression on the face of the dog's owner, are detected.
[0043] These detected objects and expressions are passed to the data identification module 206, which identifies contextual relationships such as the proximity of the dog to the children and the man’s interaction with the newspaper. The data generation module 208 receives this contextual information and uses the language generation engine 122 to create a coherent natural language description: “You are in a lively park. Children are laughing and playing near a fountain. A dog is running toward them, looking excited. Nearby, a man sits on a bench reading a newspaper, and another person is selling ice cream from a cart.”
[0044] This description may be passed to the text-to-speech engine 124, which produces an emotively modulated audio output with expressive cues, such as a cheerful tone for the children's laughter and a curious tone for the dog's behavior. The audio is played through the user's bone-conduction earphones. The user then asks, "What flavor of ice cream is the vendor selling?" The system interprets the scene again to detect label information on the ice cream cart and responds, "The vendor appears to be selling chocolate, vanilla, and strawberry ice creams."
[0045] FIG. 3 depicts a flowchart of a method 300 for generating the real-time, context-aware scene descriptions for the visually impaired users using the system 100, according to an embodiment of the present invention.
[0046] At step 302, the system 100 may receive the digital input from the predefined source 106.
[0047] At step 304, the system 100 may analyse the received digital input using the computer vision model.
[0048] At step 306, the system 100 may detect the objects based on the analysed digital input.
[0049] At step 308, the system 100 may identify the contextual relationships among the detected objects.
[0050] At step 310, the system 100 may generate the natural language description using the language generation engine 122 based on the identified contextual relationships.
[0051] At step 312, the system 100 may generate the audio output by transmitting the generated natural language description to the text-to-speech engine 124.
[0052] While the invention has been described in connection with what is presently considered to be the most practical and various embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims.
[0053] This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.
Claims: CLAIMS
I/We Claim:
1. A system (100) for generating context-aware scene descriptions for visually impaired users, the system (100) comprising:
a computer-readable media (102) adapted to store instructions; and
a processing unit (104) adapted to execute the instructions stored in the computer-readable media (102), characterized in that the processing unit (104) is configured to:
receive a digital input from a predefined source (106);
analyse the received digital input using a computer vision model;
detect objects based on the analysed digital input, wherein the detected objects are selected from faces, texts, emotional expressions, articles, or a combination thereof;
identify contextual relationships among the detected objects;
generate a natural language description using a language generation engine (122) based on the identified contextual relationships; and
generate an audio output by transmitting the generated natural language description to a text-to-speech engine (124).
2. The system (100) as claimed in claim 1, wherein the computer vision model is configured to detect and classify the emotional expressions based on facial feature analysis.
3. The system (100) as claimed in claim 1, wherein the contextual relationships comprise a spatial positioning, a proximity, interaction cues between the detected objects, or a combination thereof.
4. The system (100) as claimed in claim 1, wherein the language generation engine (122) comprises a multimodal large language model trained on paired image-text datasets.
5. The system (100) as claimed in claim 1, wherein the generated audio output is modulated to reflect an emotional tone detected from the digital input.
6. The system (100) as claimed in claim 1, wherein the detected objects include previously known individuals identified using facial recognition trained on a user-provided image dataset.
7. The system (100) as claimed in claim 1, wherein the digital input comprises an image, a video frame, a real-time video stream, a Graphics Interchange Format (GIF), a compressed digital file, or a combination thereof.
8. The system (100) as claimed in claim 1, wherein the audio output is generated by integrating an emotive modulation into the generated natural language description to reflect a mood, a tone, a pitch, a speaking rate, a volume level, an intonation pattern, an emotional expression, or a combination thereof.
9. The system (100) as claimed in claim 1, wherein the predefined source (106) is selected from a third-party platform (108), a social media application (110), an entertainment platform (112), an infotainment platform (114), an education platform (116), a storage medium (120) of a user device (118), or a combination thereof.
10. A method (300) for generating context-aware scene descriptions for visually impaired users, the method (300) comprising:
receiving a digital input from a predefined source (106);
analyzing the received digital input using a computer vision model;
detecting objects based on the analyzed digital input, wherein the detected objects are selected from faces, texts, emotional expressions, articles, or a combination thereof;
identifying contextual relationships among the detected objects;
generating a natural language description using a language generation engine (122) based on the identified contextual relationships; and
generating an audio output by transmitting the generated natural language description to a text-to-speech engine (124).
Date: May 26, 2025
Place: Noida
Nainsi Rastogi
Patent Agent (IN/PA-2372)
Agent for the Applicant