
System And Method For Enhancing Accessibility Of Visual Content Via Text And Voice

Abstract: A system (100) for enhancing accessibility of visual content is disclosed. The system (100) comprises: an input device (102); and a processor (114) configured to: receive the visual content from the input device (102); extract visual features from the received visual content using a pre-trained visual geometry group (VGG) 16 model; process the extracted visual features using a long short-term memory (LSTM) network to generate a textual description of the visual content; generate the textual description based on a database (110) of caption information associated with the visual content; present the generated textual description to a user through a text-based interface (106); and convert the generated textual description into an audible speech output using a text-to-speech (TTS) converter (108). The system (100) operates in real time and provides advanced and high-level image recognition. Claims: 10; Figures: 3


Patent Information

Application #
Filing Date
24 May 2024
Publication Number
22/2024
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

SR University
SR University, Ananthasagar, Warangal, Telangana-506371, India, patent@sru.edu.in, 08702818333

Inventors

1. Dr. P Praveen
SR University, Ananthasagar, Warangal, Telangana-506371, India (IN)
2. Dr. Mohammed Ali Shaik
SR University, Ananthasagar, Warangal, Telangana-506371, India (IN)
3. Dr. T Sampath Kumar
SR University, Ananthasagar, Warangal, Telangana-506371, India (IN)
4. Dr. R Ravi Kumar
SR University, Ananthasagar, Warangal, Telangana-506371, India (IN)

Specification

Description:
BACKGROUND
Field of Invention
[001] Embodiments of the present invention generally relate to image processing and particularly to a system for enhancing accessibility of visual content.
Description of Related Art
[002] Image captioning, the task of generating a natural language description of an image, presents a significant challenge in artificial intelligence and computer vision. The process requires not only understanding the content of an image but also generating a coherent and informative description.
[003] Existing image captioning systems face several challenges that hinder their effectiveness in generating accurate and contextually relevant descriptions. One prominent issue is the difficulty in capturing the nuanced relationships between various elements within an image, leading to descriptions that may lack depth or fail to convey the true essence of the visual content. Additionally, these systems often struggle with handling complex scenes or ambiguous imagery, resulting in descriptions that may be inaccurate or misleading. Furthermore, variations in lighting conditions, image quality, and object occlusions pose significant challenges to the robustness and generalization capability of existing systems, limiting their applicability across diverse real-world scenarios. Addressing these challenges is crucial to enhancing the performance and usability of image captioning systems for a wide range of applications, including accessibility tools for visually impaired individuals, automated content tagging, and multimedia content generation.
[004] There is thus a need for an improved and advanced system for enhancing accessibility of visual content that can overcome the aforementioned limitations in a more efficient manner.
SUMMARY
[005] Embodiments in accordance with the present invention provide a system for enhancing accessibility of visual content. The system comprises: an input device adapted to receive visual content as an input; and a processor in communication with the input device. The processor is configured to: receive the visual content from the input device; extract visual features from the received visual content using a pre-trained visual geometry group (VGG) 16 model; process the extracted visual features using a long short-term memory (LSTM) network to generate a textual description of the visual content; generate the textual description based on a database of caption information associated with the visual content; present the generated textual description to a user through a text-based interface; and convert the generated textual description into an audible speech output using a text-to-speech (TTS) converter.
[006] Embodiments in accordance with the present invention further provide a method for enhancing accessibility of visual content. The method comprises the steps of: receiving visual content as input through an input device; extracting visual features from the received visual content using a pre-trained visual geometry group (VGG) 16 model; processing the extracted visual features using a long short-term memory (LSTM) network to generate a textual description of the visual content; generating the textual description based on a database of caption information associated with the visual content; presenting the generated textual description to a user through a text-based interface; and converting the generated textual description into an audible speech output using a text-to-speech (TTS) converter.
[007] Embodiments of the present invention may provide a number of advantages depending on their particular configuration. First, embodiments of the present application may provide a system for enhancing accessibility of visual content.
[008] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that provides advanced and high-level image recognition.
[009] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that announces a description of the recognized image in a natural and human sound tone.
[0010] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that supports multiple languages for an announcement of the description of the recognized image.
[0011] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that operates on a wide variety of computing platforms and operating systems.
[0012] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content such as generating captions for images.
[0013] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that integrates with other applications.
[0014] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that is accurate and highly precise.
[0015] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that is intuitive and user-friendly.
[0016] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that offers a free trial period.
[0017] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that is targeted toward visually impaired and visually disabled users.
[0018] Next, embodiments of the present application may provide a system for enhancing accessibility of visual content that emphasizes data security and data privacy.
[0019] These and other advantages will be apparent from the present application of the embodiments described herein.
[0020] The preceding is a simplified summary to provide an understanding of some embodiments of the present invention. This summary is neither an extensive nor exhaustive overview of the present invention and its various embodiments. The summary presents selected concepts of the embodiments of the present invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] The above and still further features and advantages of embodiments of the present invention will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings, and wherein:
[0022] FIG. 1A illustrates a block diagram of a system for enhancing accessibility of visual content, according to an embodiment of the present invention;
[0023] FIG. 1B illustrates a block diagram of a processor of the system for enhancing accessibility of visual content, according to an embodiment of the present invention; and
[0024] FIG. 2 depicts a flowchart of a method for enhancing an accessibility of visual content using the system for enhancing accessibility of visual content, according to an embodiment of the present invention.
[0025] The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including but not limited to. To facilitate understanding, like reference numerals have been used, where possible, to designate like elements common to the figures. Optional portions of the figures may be illustrated using dashed or dotted lines, unless the context of usage indicates otherwise.
DETAILED DESCRIPTION
[0026] The following description includes the preferred best mode of one embodiment of the present invention. It will be clear from this description of the invention that the invention is not limited to these illustrated embodiments but that the invention also includes a variety of modifications and embodiments thereto. Therefore, the present description should be seen as illustrative and not limiting. While the invention is susceptible to various modifications and alternative constructions, it should be understood, that there is no intention to limit the invention to the specific form disclosed, but, on the contrary, the invention is to cover all modifications, alternative constructions, and equivalents falling within the scope of the invention as defined in the claims.
[0027] In any embodiment described herein, the open-ended terms "comprising", "comprises", and the like (which are synonymous with "including", "having", and "characterized by") may be replaced by the respective partially closed phrases "consisting essentially of", "consists essentially of", and the like, or by the respective closed phrases "consisting of", "consists of", and the like.
[0028] As used herein, the singular forms “a”, “an”, and “the” designate both the singular and the plural, unless expressly stated to designate the singular only.
[0029] FIG. 1A illustrates a block diagram of a system 100 for enhancing accessibility of visual content, according to an embodiment of the present invention. In an embodiment of the present invention, the system 100 may enable a user to upload a visual content to the system 100. The system 100 may further recognize visual features (contents) of the visual content and may further provide a textual description of the visual features recognized in the uploaded visual content, in an embodiment of the present invention. In another embodiment of the present invention, the system 100 may further announce the textual description of the visual features recognized in the uploaded visual content.
[0030] According to embodiments of the present invention, the visual content may be, but not limited to, a video content, a document content, an audio content, and so forth. In a preferred embodiment of the present invention, the visual content may be an image content. Embodiments of the present invention are intended to include or otherwise cover any type of the visual content that may be uploaded to the system 100, including known, related art, and/or later developed technologies.
[0031] According to an embodiment of the present invention, the system 100 may comprise an input device 102, a computer application 104, a database 110, an application server 112, and a processor 114.
[0032] In an embodiment of the present invention, the input device 102 may be a device used by a user to upload the visual content to the system 100. In an embodiment of the present invention, the user may upload the visual content from a storage medium (not shown) of the input device 102. The user may further capture a real-time visual content using an image-capturing unit, such as a camera (not shown), that may further be uploaded to the system 100, in an embodiment of the present invention.
[0033] The input device 102 may be, but not limited to, a personal computer, a consumer device, and the like. Embodiments of the present invention are intended to include or otherwise cover any type of the input device 102 including known, related art, and/or later developed technologies. In an embodiment of the present invention, the personal computer may be, but not limited to, a desktop, a server, a laptop, and the like. Embodiments of the present invention are intended to include or otherwise cover any type of the personal computer including known, related art, and/or later developed technologies.
[0034] Further, in an embodiment of the present invention, the consumer device may be, but not limited to, a tablet, a mobile phone, a notebook, a netbook, a smartphone, a wearable device, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the consumer device including known, related art, and/or later developed technologies.
[0035] According to an embodiment of the present invention, the input device 102 may comprise software applications such as, but not limited to, an Internet browsing application, a carrier connectivity application, a text-to-speech (TTS) application, and the like. In a preferred embodiment of the present invention, the input device 102 may comprise the computer application 104 which may be a computer-readable program installed in the input device 102 for executing functions associated with the system 100. The computer application 104 may further comprise a text-based interface 106 and a text-to-speech (TTS) converter 108.
[0036] In an embodiment of the present invention, the database 110 may be adapted to store preset caption information that may be associated with the visual content uploaded to the system 100. The database 110 may be trained using a Flickr_8K dataset, in an embodiment of the present invention. According to embodiments of the present invention, the Flickr_8K dataset may comprise contents such as, but not limited to, image names, captions, photographs, and so forth. Embodiments of the present invention are intended to include or otherwise cover any contents that may be stored in the Flickr_8K dataset, including known, related art, and/or later developed technologies.
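The Flickr_8K caption annotations are commonly distributed as tab-separated lines of the form "image_name.jpg#index<TAB>caption". As an illustrative sketch (not part of the specification) of how such caption information might be loaded into a mapping keyed by image identifier, assuming that common file layout:

```python
# Minimal sketch of loading Flickr_8K-style caption annotations into a
# dictionary keyed by image identifier. The tab-separated "name#idx\tcaption"
# layout is an assumption based on the common Flickr8k.token.txt format.
def load_captions(text):
    mapping = {}
    for line in text.strip().split("\n"):
        image_ref, caption = line.split("\t", 1)
        # "1000268201_693b08cb0e.jpg#0" -> "1000268201_693b08cb0e"
        image_id = image_ref.split("#")[0].rsplit(".", 1)[0]
        mapping.setdefault(image_id, []).append(caption.strip())
    return mapping
```

Each image in Flickr_8K carries several reference captions, so the mapping stores a list of captions per image identifier.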
[0037] According to embodiments of the present invention, the database 110 may be for example, but not limited to, a distributed database, a personal database, an end-user database, a commercial database, a Structured Query Language (SQL) database, a non-SQL database, an operational database, a relational database, an object-oriented database, a graph database, a cloud server database, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the database 110 including known, related art, and/or later developed technologies.
[0038] Further, the database 110 may be a cloud server database, in an embodiment of the present invention. In an embodiment of the present invention, the cloud server may be remotely located. In an exemplary embodiment of the present invention, the cloud server may be a public cloud server. In another exemplary embodiment of the present invention, the cloud server may be a private cloud server. In yet another embodiment of the present invention, the cloud server may be a dedicated cloud server. According to embodiments of the present invention, the cloud server may be, but not limited to, a Microsoft Azure cloud server, an Amazon AWS cloud server, a Google Compute Engine (GCE) cloud server, an Amazon Elastic Compute Cloud (EC2) cloud server, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the cloud server including known, related art, and/or later developed technologies.
[0039] In an embodiment of the present invention, the application server 112 may be hardware on which the processor 114 may be installed. According to embodiments of the present invention, the application server 112 may be, but not limited to, a motherboard, a wired board, a mainframe, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the application server 112, including known, related art, and/or later developed technologies.
[0040] In an embodiment of the present invention, the processor 114 may be located on the application server 112. The processor 114 may be adapted to be in communication with the input device 102. The processor 114 may be configured to execute the computer-readable instructions to generate an output relating to the system 100. According to embodiments of the present invention, the processor 114 may be, but not limited to, a Programmable Logic Control (PLC) unit, a microprocessor, a development board, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the processor 114 including known, related art, and/or later developed technologies. In an embodiment of the present invention, the processor 114 may further be explained in conjunction with FIG. 1B.
[0041] FIG. 1B illustrates a block diagram of the processor 114 of the system 100, according to an embodiment of the present invention. The processor 114 may comprise the computer-readable instructions in the form of programming modules such as a data receiving module 116, a data extraction module 118, a data processing module 120, a description generation module 122, a data presentation module 124, and an announcement module 126.
[0042] In an embodiment of the present invention, the data receiving module 116 may be configured to receive the visual content from the input device 102. The received visual content may further be transmitted to the data extraction module 118, in an embodiment of the present invention.
[0043] In an embodiment of the present invention, the data extraction module 118 may be activated upon receipt of the visual content from the data receiving module 116. The data extraction module 118 may be configured to extract visual features from the received visual content using a pre-trained visual geometry group (VGG) 16 model, in an embodiment of the present invention. In an embodiment of the present invention, the visual features extracted using the pre-trained visual geometry group (VGG) 16 model may further be stored in a dictionary (which may be stored in the database 110) with image identifiers as keys and saved to a pickle file for later use.
[0044] In another embodiment of the present invention, a last classification layer of the pre-trained visual geometry group (VGG) 16 model may be removed to extract the features instead of class predictions. The extracted visual features from the visual content may further be transmitted to the data processing module 120, in an embodiment of the present invention.
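The feature-extraction step of paragraphs [0043]-[0044] can be sketched with the standard Keras VGG16 application, re-wired so its final classification layer is dropped. This is a minimal illustration, not the claimed implementation; file paths are illustrative, and the TensorFlow import is deferred into the function so the sketch stays self-contained:

```python
import pickle

def image_id_from_path(path):
    # "images/dog.jpg" -> "dog"
    return path.rsplit("/", 1)[-1].rsplit(".", 1)[0]

def extract_features(image_paths, out_file="features.pkl"):
    # Deferred imports keep the heavy TensorFlow dependency optional.
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.models import Model
    from tensorflow.keras.preprocessing.image import load_img, img_to_array

    # Re-wire the model to emit the penultimate (fc2, 4096-d) features
    # instead of the 1000-way class predictions of the removed last layer.
    base = VGG16()
    model = Model(inputs=base.inputs, outputs=base.layers[-2].output)

    features = {}
    for path in image_paths:
        img = img_to_array(load_img(path, target_size=(224, 224)))
        img = preprocess_input(img.reshape((1, 224, 224, 3)))
        # Dictionary keyed by image identifier, as described in [0043].
        features[image_id_from_path(path)] = model.predict(img, verbose=0)

    # Persist the dictionary to a pickle file for later use.
    with open(out_file, "wb") as f:
        pickle.dump(features, f)
    return features
```

The 224x224 input size and fc2 output are properties of the standard VGG16 architecture; the pickle file name is an assumption.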
[0045] In an embodiment of the present invention, the data processing module 120 may be activated upon receipt of the extracted visual features from the data extraction module 118. The data processing module 120 may be configured to process the extracted visual features using a long short-term memory (LSTM) network to generate the textual description of the visual content, in an embodiment of the present invention. In an embodiment of the present invention, the long short-term memory (LSTM) network may be integrated with an attention mechanism that focuses on relevant visual features from the extracted visual features and on the linguistic context during generation of the textual description.
[0046] Further, the long short-term memory (LSTM) network may be configured to import libraries from TensorFlow and Keras for handling and processing the extracted visual features from the visual content, in an embodiment of the present invention. In an embodiment of the present invention, the generated textual description may further be transmitted to the description generation module 122.
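As a hedged sketch of the decoder described in paragraphs [0045]-[0046], the following shows a common Keras "merge" architecture: projected image features and an LSTM encoding of the partial caption are combined to predict the next word. The attention variant mentioned above would additionally weight the feature map at each step; this simplified version omits it. The vocabulary size, sequence length, and layer widths are illustrative assumptions, and the TensorFlow import is deferred:

```python
def build_caption_model(vocab_size=5000, max_len=35, feat_dim=4096):
    # Deferred import so the sketch does not require TensorFlow at load time.
    from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                         LSTM, add)
    from tensorflow.keras.models import Model

    # Image branch: project VGG16 fc2 features into the decoder space.
    img_in = Input(shape=(feat_dim,))
    img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embed the partial caption and run it through an LSTM.
    txt_in = Input(shape=(max_len,))
    txt_vec = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(txt_in))

    # Merge both branches and predict the next word of the caption.
    merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
    out = Dense(vocab_size, activation="softmax")(merged)
    return Model(inputs=[img_in, txt_in], outputs=out)

def argmax(probs):
    # Greedy decoding picks the highest-probability next word at each step.
    return max(range(len(probs)), key=probs.__getitem__)
```

At inference time the model is called repeatedly, appending the greedily selected word to the partial caption until an end token is produced or `max_len` is reached.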
[0047] In an embodiment of the present invention, the description generation module 122 may be activated upon receipt of the generated textual description from the data processing module 120. The description generation module 122 may be configured to generate the textual description based on the database 110 of the caption information associated with the visual content, in an embodiment of the present invention. In an embodiment of the present invention, the generated textual description may further be preprocessed. According to embodiments of the present invention, the preprocessing may involve steps such as, but not limited to, converting into a lowercase, removing non-alphabetical characters, adding start and end tokens, and so forth. Embodiments of the present invention are intended to include or otherwise cover any steps that may be involved in the preprocessing of the generated textual description, including known, related art, and/or later developed technologies. In an embodiment of the present invention, the description generation module 122 may be configured to convert words into numerical identifiers for further processing by the system 100.
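The preprocessing and tokenization described in paragraph [0047] can be sketched in plain Python; the `startseq`/`endseq` token names are conventional choices assumed here, not mandated by the specification:

```python
import re

def clean_caption(caption):
    # Lowercase, drop non-alphabetical characters, collapse whitespace,
    # then wrap the caption in start and end tokens.
    caption = re.sub(r"[^a-z ]", "", caption.lower())
    return "startseq " + " ".join(caption.split()) + " endseq"

def build_vocab(captions):
    # Assign each word a numerical identifier; 0 is reserved for padding.
    vocab = {}
    for cap in captions:
        for word in cap.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(caption, vocab):
    # Convert words into their numerical identifiers.
    return [vocab[w] for w in caption.split() if w in vocab]
```

In a Keras-based implementation this word-to-identifier mapping would typically be handled by a `Tokenizer`, but the logic is the same.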
[0048] In another embodiment of the present invention, the description generation module 122 may further be configured to evaluate a quality of the generated textual description using BiLingual Evaluation Understudy (BLEU) scores. If the BiLingual Evaluation Understudy (BLEU) score of the generated textual description is less than a predefined threshold, the description generation module 122 may refine and reprocess the generated textual description, and the refined textual description may then be transmitted to the data presentation module 124. Otherwise, the generated textual description may be transmitted to the data presentation module 124 directly, in an embodiment of the present invention.
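BLEU scoring is usually done with a library such as NLTK's `corpus_bleu`; the following is a deliberately simplified unigram (BLEU-1) sketch with a brevity penalty, included only to illustrate the thresholding logic of paragraph [0048] (the 0.5 threshold is an illustrative assumption):

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    # Unigram precision with a brevity penalty; real evaluations usually
    # combine 1- to 4-gram precisions over multiple reference captions.
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

def needs_refinement(candidate, reference, threshold=0.5):
    # Mirrors the thresholding step: refine when the score is too low.
    return bleu1(candidate, reference) < threshold
```

A candidate identical to its reference scores 1.0; short or off-topic candidates are penalized and trigger refinement.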
[0049] In an embodiment of the present invention, the data presentation module 124 may be configured to be activated upon receipt of the generated and/or refined textual description from the description generation module 122. The data presentation module 124 may be configured to present the generated textual description to the user through the text-based interface 106, in an embodiment of the present invention. The text-based interface 106 may be an interface on the computer application 104 installed in the input device 102 and the user may read the generated textual description using the computer application 104 installed in the input device 102, in an embodiment of the present invention. After the presentation of the generated textual description to the user through the text-based interface 106, the data presentation module 124 may transmit an activation signal to the announcement module 126.
[0050] In an embodiment of the present invention, the announcement module 126 may be activated upon receipt of the activation signal from the data presentation module 124. The announcement module 126 may be configured to convert the generated textual description into an audible speech output using the text-to-speech (TTS) converter 108, in an embodiment of the present invention. In an embodiment of the present invention, the audible speech output may be announced using a sound unit, such as speakers (not shown) installed in the input device 102.
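One possible realization of the text-to-speech conversion in paragraph [0050] uses the `pyttsx3` library as an illustrative offline backend; the specification does not name a particular TTS engine, and the import is deferred so the sketch stays self-contained:

```python
def announce(text, rate=150):
    # Deferred import: pyttsx3 is one possible offline TTS backend,
    # chosen here purely for illustration.
    import pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", rate)  # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()  # blocks until the speech output finishes
```

Cloud TTS services would serve equally well here; the announcement module only needs a callable that turns the generated textual description into audible output on the input device's speakers.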
[0051] FIG. 2 depicts a flowchart of a method 200 for enhancing the accessibility of the visual content using the system 100, according to an embodiment of the present invention.
[0052] At step 202, the system 100 may receive the visual content from the input device 102.
[0053] At step 204, the system 100 may extract the visual features from the received visual content using the pre-trained visual geometry group (VGG) 16 model.
[0054] At step 206, the system 100 may process the extracted visual features using the long short-term memory (LSTM) network to generate the textual description of the visual content.
[0055] At step 208, the system 100 may generate the textual description based on the database 110 of the caption information associated with the visual content.
[0056] At step 210, the system 100 may present the generated textual description to the user through the text-based interface 106.
[0057] At step 212, the system 100 may convert the generated textual description into the audible speech output using the text-to-speech (TTS) converter 108.
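Steps 202 through 212 can be summarized as a simple pipeline; the callables below are placeholders standing in for the modules of FIG. 1B, not the claimed implementation:

```python
def caption_pipeline(image_path, extract, generate, present, speak):
    # Steps 202-204: receive the visual content and extract visual features.
    features = extract(image_path)
    # Steps 206-208: generate the textual description from the features.
    text = generate(features)
    # Step 210: present the description through the text-based interface.
    present(text)
    # Step 212: convert the description into an audible speech output.
    speak(text)
    return text
```

Wiring in the feature extractor, caption generator, interface, and TTS converter sketched earlier yields the end-to-end flow of method 200.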
[0058] While the invention has been described in connection with what is presently considered to be the most practical and various embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims.
[0059] This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.

CLAIMS
I/We Claim:
1. A system (100) for enhancing accessibility of visual content, the system (100) comprising:
an input device (102) adapted to receive visual content as an input; and
a processor (114) in communication with the input device (102), characterized in that the processor (114) is configured to:
receive the visual content from the input device (102);
extract visual features from the received visual content using a pre-trained visual geometry group (VGG) 16 model;
process the extracted visual features using a long short-term memory (LSTM) network to generate a textual description of the visual content;
generate the textual description based on a database (110) of caption information associated with the visual content;
present the generated textual description to a user through a text-based interface (106); and
convert the generated textual description into an audible speech output using a text-to-speech (TTS) converter (108).
2. The system (100) as claimed in claim 1, wherein the database (110) is trained using a Flickr_8K dataset that comprises image names, captions, photographs, or a combination thereof.
3. The system (100) as claimed in claim 1, wherein the processor (114) is configured to import libraries from TensorFlow and Keras for handling and processing the visual content.
4. The system (100) as claimed in claim 1, wherein a last classification layer of the pre-trained visual geometry group (VGG) 16 model is removed to extract the features instead of class predictions.
5. The system (100) as claimed in claim 1, wherein the processor (114) is configured to evaluate a quality of the generated textual description using BiLingual Evaluation Understudy (BLEU) scores.
6. The system (100) as claimed in claim 1, wherein the visual features extracted using the pre-trained visual geometry group (VGG) 16 model are stored in a dictionary with image identifiers as keys and saved to a pickle file for later use.
7. The system (100) as claimed in claim 1, wherein the processor (114) preprocesses the generated textual descriptions by converting into a lowercase, removing non-alphabetical characters, and adding start and end tokens before presentation to the user through a text-based interface (106).
8. The system (100) as claimed in claim 1, wherein the processor (114) tokenizes the textual descriptions to convert words into numerical identifiers for further processing by the system (100).
9. The system (100) as claimed in claim 1, wherein the processor (114) is configured to integrate an attention mechanism within the long short-term memory (LSTM) network to focus on relevant visual features from the extracted visual features and a linguistic context during the generation of the textual description.
10. A method (200) for enhancing accessibility of visual content, the method (200) characterized by the steps of:
receiving visual content as input through an input device (102);
extracting visual features from the received visual content using a pre-trained visual geometry group (VGG) 16 model;
processing the extracted visual features using a long short-term memory (LSTM) network to generate a textual description of the visual content;
generating the textual description based on a database (110) of caption information associated with the visual content;
presenting the generated textual description to a user through a text-based interface (106); and
converting the generated textual description into an audible speech output using a text-to-speech (TTS) converter (108).
Date: May 22, 2024
Place: Noida

Dr. Keerti Gupta
Agent for the Applicant
(IN/PA-1529)

Documents

Application Documents

# Name Date
1 202441040418-STATEMENT OF UNDERTAKING (FORM 3) [24-05-2024(online)].pdf 2024-05-24
2 202441040418-REQUEST FOR EARLY PUBLICATION(FORM-9) [24-05-2024(online)].pdf 2024-05-24
3 202441040418-POWER OF AUTHORITY [24-05-2024(online)].pdf 2024-05-24
4 202441040418-OTHERS [24-05-2024(online)].pdf 2024-05-24
5 202441040418-FORM-9 [24-05-2024(online)].pdf 2024-05-24
6 202441040418-FORM FOR SMALL ENTITY(FORM-28) [24-05-2024(online)].pdf 2024-05-24
7 202441040418-FORM 1 [24-05-2024(online)].pdf 2024-05-24
8 202441040418-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [24-05-2024(online)].pdf 2024-05-24
9 202441040418-EDUCATIONAL INSTITUTION(S) [24-05-2024(online)].pdf 2024-05-24
10 202441040418-DRAWINGS [24-05-2024(online)].pdf 2024-05-24
11 202441040418-DECLARATION OF INVENTORSHIP (FORM 5) [24-05-2024(online)].pdf 2024-05-24
12 202441040418-COMPLETE SPECIFICATION [24-05-2024(online)].pdf 2024-05-24
13 202441040418-FORM-26 [11-07-2024(online)].pdf 2024-07-11