Abstract: SYSTEM FOR TEXT-TO-AUDIO CONVERSION FOR VISUALLY IMPAIRED USING OPTICAL CHARACTER RECOGNITION ABSTRACT A system (100) for text-to-audio conversion for visually impaired using optical character recognition is disclosed. The system (100) comprises: an input unit (102) adapted to upload digital files; and a processor (104) configured to: receive the uploaded digital files from the input unit (102); enhance input quality of the received files by performing operations selected from a resizing, a noise reduction, a contrast adjustment, or a combination thereof; activate a text extraction engine to extract text in the received files, wherein the text extraction engine is adapted to utilize Tesseract Optical Character Recognition (OCR) to extract text from non-character readable digital files; sanitize the extracted text to remove non-XML-compliant characters; parse the sanitized text through a Text-to-Speech (TTS) engine to generate corresponding audio files; and play back the generated audio files using an output unit (106). Claims: 10, Figures: 3
Description: BACKGROUND
Field of Invention
[001] Embodiments of the present invention generally relate to a system for text-to-audio conversion and particularly to a system for text-to-audio conversion for visually impaired using optical character recognition.
Description of Related Art
[002] The advent of digital technology has revolutionized how information is processed, stored, and accessed. Optical Character Recognition (OCR) has played a crucial role in extracting text from various sources such as printed documents, images, and scanned materials. OCR technology enables digitization, making it possible to edit, search, and store textual information more efficiently. In parallel, Text-to-Speech (TTS) systems have evolved to convert written text into audible speech, improving accessibility for individuals with visual impairments and those who require hands-free interaction with digital content.
[003] Despite advancements in OCR and TTS technologies, challenges persist in seamlessly integrating these systems to create a user-friendly, automated solution. Many existing OCR tools primarily focus on text extraction but lack the capability to process and organize the extracted content for audio conversion. Similarly, TTS applications often rely on structured input, limiting their effectiveness when dealing with raw, unformatted text obtained from various document types. The inconsistency in text extraction accuracy, varying font styles, and document structures further complicate the automation process.
[004] Current commercial solutions for text-to-audio conversion require multiple manual steps, such as text extraction, formatting, and audio synthesis, reducing efficiency and usability. Moreover, traditional OCR systems struggle with complex layouts, embedded images, and handwritten text, leading to incomplete or inaccurate results. These limitations highlight the need for a comprehensive, automated approach that integrates OCR and TTS technologies to provide a seamless solution for converting diverse textual content into accessible audio formats.
[005] There is thus a need for an improved and advanced system for text-to-audio conversion for visually impaired using optical character recognition that can address the aforementioned limitations in a more efficient manner.
SUMMARY
[006] Embodiments in accordance with the present invention provide a system for text-to-audio conversion for visually impaired using optical character recognition. The system comprises an input unit adapted to upload digital files. The system further comprises a processor communicatively connected to the input unit. The processor is configured to: receive the uploaded digital files from the input unit; check a presence of text in the received files; enhance input quality of the received files by performing operations selected from a resizing, a noise reduction, a contrast adjustment, or a combination thereof; and activate a text extraction engine to extract text in the received files, wherein the text extraction engine is adapted to utilize Tesseract Optical Character Recognition (OCR) to extract text from non-character readable digital files. The processor is further configured to: sanitize the extracted text to remove non-XML-compliant characters; parse the sanitized text through a Text-to-Speech (TTS) engine to generate corresponding audio files; and play back the generated audio files using an output unit.
[007] Embodiments in accordance with the present invention further provide a method for text-to-audio conversion for visually impaired using optical character recognition. The method comprises the steps of: receiving uploaded digital files from an input unit; checking a presence of text in the received files; enhancing input quality of the received files by performing operations selected from a resizing, a noise reduction, a contrast adjustment, or a combination thereof; and activating a text extraction engine to extract text in the received files, wherein the text extraction engine is adapted to utilize Tesseract Optical Character Recognition (OCR) to extract text from non-character readable digital files. The method further comprises the steps of: sanitizing the extracted text to remove non-XML-compliant characters; parsing the sanitized text through a Text-to-Speech (TTS) engine to generate corresponding audio files; and playing back the generated audio files using an output unit.
[008] Embodiments of the present invention may provide a number of advantages depending on their particular configuration. First, embodiments of the present application may provide a system for text-to-audio conversion for visually impaired using optical character recognition.
[009] Next, embodiments of the present application may provide a system for text-to-audio conversion that integrates Optical Character Recognition (OCR) and Text-to-Speech (TTS) technology, eliminating the need for manual text extraction and formatting before audio conversion. This automation enhances efficiency and usability.
[0010] Next, embodiments of the present application may provide a system for text-to-audio conversion that processes text from images, PDFs, DOCX files, and PowerPoint presentations, ensuring versatility across different document sources.
[0011] Next, embodiments of the present application may provide a system for text-to-audio conversion that significantly benefits visually impaired users by providing an easy-to-use text-to-audio conversion system, allowing them to access printed or digital content audibly without dependency on others.
[0012] Next, embodiments of the present application may provide a system for text-to-audio conversion that incorporates pre-processing techniques such as noise reduction and contrast enhancement, improving OCR accuracy. The text-to-speech conversion is optimized for clear and coherent audio output.
[0013] These and other advantages will be apparent from the present application of the embodiments described herein.
[0014] The preceding is a simplified summary to provide an understanding of some embodiments of the present invention. This summary is neither an extensive nor exhaustive overview of the present invention and its various embodiments. The summary presents selected concepts of the embodiments of the present invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The above and still further features and advantages of embodiments of the present invention will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings, and wherein:
[0016] FIG. 1A illustrates a block diagram of a system for text-to-audio conversion for visually impaired using optical character recognition, according to an embodiment of the present invention;
[0017] FIG. 1B illustrates an exemplary implementation of the system 100, according to an embodiment of the present invention; and
[0018] FIG. 2 depicts a flowchart of a method for text-to-audio conversion for visually impaired using optical character recognition, according to an embodiment of the present invention.
[0019] The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including but not limited to. To facilitate understanding, like reference numerals have been used, where possible, to designate like elements common to the figures. Optional portions of the figures may be illustrated using dashed or dotted lines, unless the context of usage indicates otherwise.
DETAILED DESCRIPTION
[0020] The following description includes the preferred best mode of one embodiment of the present invention. It will be clear from this description that the invention is not limited to these illustrated embodiments, and that the invention also includes a variety of modifications and embodiments thereto. Therefore, the present description should be seen as illustrative and not limiting. While the invention is susceptible to various modifications and alternative constructions, it should be understood that there is no intention to limit the invention to the specific form disclosed; on the contrary, the invention is to cover all modifications, alternative constructions, and equivalents falling within the scope of the invention as defined in the claims.
[0021] In any embodiment described herein, the open-ended terms "comprising", "comprises", and the like (which are synonymous with "including", "having", and "characterized by") may be replaced by the respective partially closed phrases "consisting essentially of", "consists essentially of", and the like, or the respective closed phrases "consisting of", "consists of", and the like.
[0022] As used herein, the singular forms “a”, “an”, and “the” designate both the singular and the plural, unless expressly stated to designate the singular only.
[0023] FIG. 1A illustrates a block diagram of a system 100 for text-to-audio conversion for visually impaired using optical character recognition, according to an embodiment of the present invention. The system 100 may be adapted to identify and recognize text in an uploaded digital file. The system 100 may further be adapted to narrate the recognized text. Further, the system 100 may be adapted to generate an audio file with the narration of the recognized text. The system 100 may further be adapted to store the generated audio file.
[0024] According to the embodiments of the present invention, the system 100 may incorporate non-limiting hardware components to enhance processing speed and efficiency. For example, the system 100 may comprise an input unit 102, a processor 104, an output unit 106, a storage unit 108, and a user interface 110. In an embodiment of the present invention, the hardware components of the system 100 may be integrated with computer-executable instructions for overcoming the challenges and the limitations of the existing systems.
[0025] In an embodiment of the present invention, the input unit 102 may be adapted to upload the digital file(s) to the system 100. The input unit 102 may be, but not limited to, a mobile, a computer, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the input unit 102, including known, related art, and/or later developed technologies.
[0026] In an embodiment of the present invention, the processor 104 is communicatively connected to the input unit 102. The processor 104 may be configured to receive the uploaded digital files from the input unit 102. The processor 104 may be configured to check a presence of text in the received files.
[0027] The processor 104 may be configured to enhance input quality of the received files by performing operations such as, but not limited to, a resizing, a noise reduction, a contrast adjustment, and so forth. Embodiments of the present invention are intended to include or otherwise cover any operations for enhancing input quality of the received files, including known, related art, and/or later developed technologies.
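The quality-enhancement operations described in paragraph [0027] can be illustrated with a minimal sketch. The specification does not prescribe an implementation, so the functions below are hypothetical: images are modeled as 2-D lists of grayscale values (0-255), contrast adjustment is shown as a linear contrast stretch, and resizing as nearest-neighbour sampling.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values to span the full 0-255 range
    (one simple form of the contrast adjustment mentioned in [0027])."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in pixels]
    return [[round((v - lo) * 255.0 / (hi - lo)) for v in row] for row in pixels]

def resize_nearest(pixels, new_w, new_h):
    """Nearest-neighbour resize, the simplest of the resizing options."""
    h, w = len(pixels), len(pixels[0])
    return [[pixels[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]
```

In practice an image library (e.g. OpenCV or Pillow) would perform these operations, along with noise reduction such as median filtering; the sketch only shows the shape of the step.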
[0028] The processor 104 may be configured to activate a text extraction engine to extract text in the received files. The text extraction engine may be adapted to utilize Tesseract Optical Character Recognition (OCR) to extract text from non-character readable digital files. The non-character readable digital files may be, but not limited to, an image file, a video file, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the non-character readable digital files, including known, related art, and/or later developed technologies.
[0029] The text extraction engine may be adapted to utilize American Standard Code for Information Interchange (ASCII) libraries to extract text from character readable digital files. The character readable digital files may be, but not limited to, a document file, a presentation file, a spreadsheet file, a computer-coding language file, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the character readable digital files, including known, related art, and/or later developed technologies.
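Paragraphs [0028]-[0029] describe two extraction routes: OCR for non-character readable files and direct character reading for character readable files. A minimal dispatch sketch follows; the extension lists and function name are illustrative assumptions, since the specification names the categories but does not enumerate formats.

```python
import os

# Hypothetical extension lists, not part of the disclosure.
CHARACTER_READABLE = {".txt", ".docx", ".pptx", ".xlsx", ".py", ".java"}
NON_CHARACTER_READABLE = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".mp4"}

def choose_extraction_route(filename):
    """Pick the extraction route for a received file, per [0028]-[0029]."""
    ext = os.path.splitext(filename.lower())[1]
    if ext in NON_CHARACTER_READABLE:
        return "ocr"      # would be handed to Tesseract OCR
    if ext in CHARACTER_READABLE:
        return "direct"   # text read directly from the file's character data
    return "unknown"
```

For example, `choose_extraction_route("scan.PNG")` selects the OCR route, while `choose_extraction_route("report.docx")` selects direct character extraction.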
[0030] The processor 104 may be configured to sanitize the extracted text to remove non-XML-compliant characters. Embodiments of the present invention are intended to include or otherwise cover any type of the non-XML-compliant characters, including known, related art, and/or later developed technologies. The processor 104 may be configured to parse the sanitized text through a Text-to-Speech (TTS) engine to generate corresponding audio files. In a preferred embodiment of the present invention, the Text-to-Speech (TTS) engine may be a Google Text-to-Speech (TTS) engine. Embodiments of the present invention are intended to include or otherwise cover any type of the Text-to-Speech (TTS) engine, including known, related art, and/or later developed technologies.
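The sanitization step in paragraph [0030] can be illustrated with a minimal sketch. The function below is hypothetical (the specification does not prescribe an implementation); it drops every character outside the ranges that the XML 1.0 specification permits, which is one common reading of "non-XML-compliant characters".

```python
def sanitize_for_xml(text):
    """Remove characters that are not legal in XML 1.0 documents
    (one interpretation of the sanitization step in [0030])."""
    def is_xml_char(ch):
        cp = ord(ch)
        return (cp in (0x9, 0xA, 0xD)          # tab, LF, CR
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)
    return "".join(ch for ch in text if is_xml_char(ch))
```

Such a filter matters when the sanitized text is handed to a TTS engine whose request format is XML-based (e.g. SSML), where control characters left over from OCR would otherwise cause parse errors.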
[0031] The processor 104 may be configured to play back the generated audio files using the output unit 106. The processor 104 may be, but not limited to, a Programmable Logic Control (PLC) unit, a microprocessor, a development board, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the processor 104 including known, related art, and/or later developed technologies.
[0032] In an embodiment of the present invention, the output unit 106 may be adapted to play back the audio files generated by the processor 104. The output unit 106 may be, but not limited to, a speaker, a headphone, an earphone, a hearing aid device, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the output unit 106, including known, related art, and/or later developed technologies.
[0033] In an embodiment of the present invention, the storage unit 108 may be adapted to store the audio files generated by the processor 104. The storage unit 108 may be, but not limited to, a flash drive, a hard drive, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the storage unit 108, including known, related art, and/or later developed technologies.
[0034] In an embodiment of the present invention, the user interface 110 may be configured to allow a user to control operations such as, but not limited to, select input file types, manage text extraction settings, monitor processing status, and so forth. Embodiments of the present invention are intended to include or otherwise cover any operations that may be carried out by the user interface 110, including known, related art, and/or later developed technologies.
[0035] FIG. 1B illustrates an exemplary implementation of the system 100, according to an embodiment of the present invention. In an exemplary embodiment of the present invention, the system 100 may be designed to assist visually impaired individuals in accessing textual information through an intuitive and accessible interface. The system 100 may allow users to upload digital files containing text through the input unit 102, which may include a mobile device, a computer, or a dedicated assistive device equipped with accessibility features such as voice commands or Braille input. Once a file is uploaded, the processor 104 of the system 100 may process the file by enhancing its quality, extracting the text using Optical Character Recognition (OCR), and converting the extracted text into an audio format using a Text-to-Speech (TTS) engine. The processed audio is then played back through the output unit 106, which may be a speaker, headphones, or a hearing aid device.
[0036] In some embodiments of the present invention, the system 100 may be configured to analyze the extracted text to provide contextual enhancements, such as summarizing lengthy documents or explaining complex terms using a built-in AI assistant. Additionally, the system 100 may be adapted to recognize the output unit 106 dynamically and adjust the audio playback accordingly. In some embodiments of the present invention, the system 100 may recognize the output unit 106 by retrieving metadata from the connected device, such as Device Identification (Device ID), Frequency Response Range, Impedance, and supported audio codecs (Coder-Decoder). This metadata may be accessed through Bluetooth profiles, Universal Serial Bus (USB) descriptors, or System Application Programming Interfaces (APIs). Based on the extracted metadata, the Central Processing Unit (CPU) within the processor 104 may dynamically adjust Gain Control, Equalization (EQ), and Noise Suppression (NS) settings to optimize audio playback for the specific output unit 106.
[0037] For example, if the output unit 106 is a hearing aid device, the system 100 may enable frequency adjustments to enhance speech clarity and reduce background noise. If the output unit 106 is a speaker, the system 100 may enhance volume and adjust echo reduction settings for a clearer sound. If the output unit 106 is a headphone, the system 100 may apply binaural audio processing to create a more immersive experience.
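The device-specific adjustments in paragraphs [0036]-[0037] can be modeled as a lookup of per-device playback profiles keyed by the recognized output unit 106. The profile names and values below are invented for illustration only; a real implementation would derive them from the device metadata (Device ID, frequency response, codecs) described in [0036].

```python
# Hypothetical playback profiles mirroring the examples in [0037].
DEVICE_PROFILES = {
    "hearing_aid": {"gain_db": 6, "eq": "speech_boost", "noise_suppression": True},
    "speaker":     {"gain_db": 3, "eq": "flat",         "echo_reduction": True},
    "headphone":   {"gain_db": 0, "eq": "binaural",     "noise_suppression": False},
}

def playback_settings(device_type):
    """Return playback settings for a recognized output unit, with a
    neutral fallback for devices the system cannot classify."""
    return DEVICE_PROFILES.get(device_type, {"gain_db": 0, "eq": "flat"})
```

A table-driven design like this keeps the device-recognition logic (metadata retrieval) separate from the audio-tuning policy, so new output devices can be supported by adding a profile rather than changing code paths.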
[0038] FIG. 2 depicts a flowchart of a method 200 for text-to-audio conversion for visually impaired using optical character recognition, according to an embodiment of the present invention.
[0039] At step 202, the system 100 may receive the uploaded digital files from the input unit 102.
[0040] At step 204, the system 100 may check the presence of text in the received files.
[0041] At step 206, the system 100 may enhance the input quality of the received files by performing operations, as discussed above.
[0042] At step 208, the system 100 may activate the text extraction engine to extract text in the received files.
[0043] At step 210, the system 100 may sanitize the extracted text to remove the non-XML-compliant characters.
[0044] At step 212, the system 100 may parse the sanitized text through the Text-to-Speech (TTS) engine to generate the corresponding audio files.
[0045] At step 214, the system 100 may play back the generated audio files using the output unit 106.
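Steps 202-214 above can be sketched as a single pipeline function. This is a minimal sketch, not the claimed implementation: the five callables are stand-ins for the stages of FIG. 2, and the presence-of-text check (step 204) is applied after extraction here, since in this simplified model that is when the text becomes available.

```python
def text_to_audio_pipeline(file_bytes, filename,
                           enhance, extract_text, sanitize, synthesize, play):
    """Run the stages of method 200 in order, using caller-supplied
    stage functions (hypothetical signatures, for illustration)."""
    enhanced = enhance(file_bytes)            # step 206: quality enhancement
    text = extract_text(enhanced, filename)   # step 208: OCR / direct extraction
    if not text:                              # step 204: no text present
        return None
    clean = sanitize(text)                    # step 210: strip non-XML chars
    audio = synthesize(clean)                 # step 212: TTS synthesis
    play(audio)                               # step 214: playback on output unit
    return audio
```

Wiring the earlier sketches (or real Tesseract and TTS back-ends) into these parameters yields the end-to-end flow of the system 100 without hard-coding any one engine.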
[0046] While the invention has been described in connection with what is presently considered to be the most practical and various embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims.
[0047] This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.
Claims: CLAIMS
I/We Claim:
1. A system (100) for text-to-audio conversion for visually impaired using optical character recognition, the system (100) comprising:
an input unit (102) adapted to upload digital files; and
a processor (104) communicatively connected to the input unit (102), characterized in that the processor (104) is configured to:
receive the uploaded digital files from the input unit (102);
check a presence of text in the received files;
enhance input quality of the received files by performing operations selected from a resizing, a noise reduction, a contrast adjustment, or a combination thereof;
activate a text extraction engine to extract text in the received files, wherein the text extraction engine is adapted to utilize Tesseract Optical Character Recognition (OCR) to extract text from non-character readable digital files;
sanitize the extracted text to remove non-XML-compliant characters;
parse the sanitized text through a Text-to-Speech (TTS) engine to generate corresponding audio files; and
play back the generated audio files using an output unit (106).
2. The system (100) as claimed in claim 1, wherein the text extraction engine is adapted to utilize American Standard Code for Information Interchange (ASCII) libraries to extract text from character readable digital files.
3. The system (100) as claimed in claim 1, wherein a format of the digital files is selected from a document file, a presentation file, a spreadsheet file, an image file, a video file, a computer-coding language file, or a combination thereof.
4. The system (100) as claimed in claim 1, comprising a user interface (110) configured to allow a user to select input file types, manage text extraction settings, monitor processing status, or a combination thereof.
5. The system (100) as claimed in claim 1, wherein the output unit (106) is selected from a speaker, a headphone, an earphone, a hearing aid device, or a combination thereof.
6. The system (100) as claimed in claim 1, wherein the generated audio files are stored in a storage unit (108).
7. A method (200) for text-to-audio conversion for visually impaired using optical character recognition, the method (200) is characterized by steps of:
receiving uploaded digital files from an input unit (102);
checking a presence of text in the received files;
enhancing input quality of the received files by performing operations selected from a resizing, a noise reduction, a contrast adjustment, or a combination thereof;
activating a text extraction engine to extract text in the received files, wherein the text extraction engine is adapted to utilize Tesseract Optical Character Recognition (OCR) to extract text from non-character readable digital files;
sanitizing the extracted text to remove non-XML-compliant characters;
parsing the sanitized text through a Text-to-Speech (TTS) engine to generate corresponding audio files; and
playing back the generated audio files using an output unit (106).
8. The method (200) as claimed in claim 7, wherein the text extraction engine is adapted to utilize American Standard Code for Information Interchange (ASCII) libraries to extract text from character readable digital files.
9. The method (200) as claimed in claim 7, wherein the output unit (106) is selected from a speaker, a headphone, an earphone, a hearing aid device, or a combination thereof.
10. The method (200) as claimed in claim 7, wherein the generated audio files are stored in a storage unit (108).
Date: March 10, 2025
Place: Noida
Nainsi Rastogi
Patent Agent (IN/PA-2372)
Agent for the Applicant