
Transformer Based Audio To Text Transcription System

Abstract: A transformer-based audio-to-text transcription system (100) is disclosed. The system (100) comprises an input unit (102) and a processor (104). The processor (104) is configured to: receive the audio input from the input unit (102); execute a Convolutional Neural Network (CNN) model (106) to process the received audio; develop spectrograms of the processed audio for filtration of noise and extraction of speech features; model temporal dependencies and contextual relationships for speech recognition in the processed audio; employ a Bidirectional Long Short-Term Memory (BiLSTM) engine (108) to capture sequential dependencies in the processed audio; employ a Connectionist Temporal Classification (CTC) loss (110) and language models (112) for reducing Word Error Rate (WER) and Character Error Rate (CER); align the processed audio; and generate textual outputs for obtaining a transcription of the processed audio. The system (100) delivers improved transcription accuracy, especially in noisy environments. Claims: 10, Figures: 2


Patent Information

Application #
Filing Date
06 March 2025
Publication Number
12/2025
Publication Type
INA
Invention Field
ELECTRONICS
Status
Email
Parent Application

Applicants

SR University
SR University, Ananthasagar, Warangal, Telangana, India - 506371 | patent@sru.edu.in | 08702818333

Inventors

1. Ch. Aparna
SR University, Ananthasagar, Hasanparthy (PO), Warangal, Telangana, India - 506371
2. Dr. Rajchandar K
SR University, Ananthasagar, Hasanparthy (PO), Warangal, Telangana, India - 506371

Specification

Description:
BACKGROUND
Field of Invention
[001] Embodiments of the present invention generally relate to an audio-to-text transcription tool and particularly to a transformer-based audio-to-text transcription system.
Description of Related Art
[002] Audio-to-text transcription systems have significantly evolved over the years, with various techniques employed to enhance accuracy and efficiency. Traditional methods relied heavily on statistical models such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) to process and recognize speech patterns. These approaches, while foundational in automatic speech recognition (ASR), struggled with handling variations in speaker accents, noise interference, and real-time processing demands. As a result, researchers turned to deep learning methodologies, leveraging neural networks to improve transcription quality.
[003] Neural network-based systems, including early implementations of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) models, provided significant improvements over conventional techniques. These models could capture temporal dependencies in audio signals, leading to better contextual understanding and reduced Word Error Rates. However, they often required extensive computational resources and struggled with scalability when applied to real-world noisy environments or multiple-speaker scenarios. The introduction of Convolutional Neural Networks (CNNs) helped enhance feature extraction, but challenges remained in achieving high accuracy across diverse linguistic conditions.
[004] In recent years, transformer-based models have emerged as a powerful solution for speech-to-text conversion. These models leverage self-attention mechanisms to capture long-range dependencies and contextual relationships within audio data. While transformers have demonstrated superior transcription accuracy, integrating them effectively with traditional audio processing techniques remains an area of active research. Current solutions still face limitations in balancing accuracy, speed, and computational efficiency, particularly for real-time applications such as live captioning and accessibility services. As technology advances, there is a continuous need for robust frameworks that address these challenges while ensuring seamless and precise transcription capabilities.
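The specification contains no code; as an illustrative aside only, the self-attention mechanism referenced above can be sketched as follows. This is a minimal NumPy sketch, not part of the disclosed system: the sequence length, feature dimension, and the use of identity query/key/value projections are all assumptions for illustration (a real transformer learns separate projection matrices per head).

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of frame vectors.

    X: (T, d) array of T audio-frame embeddings. Illustrative only: the
    query/key/value projections are the identity here, whereas a real
    transformer learns separate weight matrices for each.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                 # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over keys
    return weights @ X                             # each frame mixes in context

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))                   # 5 frames, 8-dim features
out = self_attention(frames)
print(out.shape)                                   # (5, 8)
```

Because every output frame is a weighted mixture of all input frames, the mechanism captures the long-range dependencies that RNNs must propagate step by step.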
[005] There is thus a need for an improved and advanced transformer-based audio-to-text transcription system that addresses the aforementioned limitations in a more efficient manner.
SUMMARY
[006] Embodiments in accordance with the present invention provide a transformer-based audio-to-text transcription system. The system comprises an input unit adapted to receive an audio input. The system further comprises a processor communicatively connected to the input unit. The processor is configured to receive the audio input from the input unit; execute a Convolutional Neural Network (CNN) model to process the received audio; develop spectrograms of the processed audio for filtration of noise and extraction of fine-grained speech features; model temporal dependencies and contextual relationships for speech recognition in the processed audio; employ a Bidirectional Long Short-Term Memory (BiLSTM) engine to capture sequential dependencies in the processed audio in a forward direction and a backward direction; employ a Connectionist Temporal Classification (CTC) loss and language models for reducing Word Error Rate (WER) and Character Error Rate (CER); execute a location-sensitive attention mechanism to align the processed audio; and generate textual outputs, using a latency optimization framework, for obtaining a transcription of the processed audio.
[007] Embodiments in accordance with the present invention further provide a method for transformer-based audio-to-text transcription. The method comprises the steps of receiving an audio input from an input unit; executing a Convolutional Neural Network (CNN) model to process the received audio; developing spectrograms of the processed audio for filtration of noise and extraction of fine-grained speech features; modeling temporal dependencies and contextual relationships for speech recognition in the processed audio; employing a Bidirectional Long Short-Term Memory (BiLSTM) engine; employing a Connectionist Temporal Classification (CTC) loss and language models; executing a location-sensitive attention mechanism to align the processed audio; and generating textual outputs, using a latency optimization framework, for obtaining a transcription of the processed audio.
[008] Embodiments of the present invention may provide a number of advantages depending on their particular configuration. First, embodiments of the present application may provide a transformer-based audio-to-text transcription system.
[009] Next, embodiments of the present application may provide an audio-to-text transcription system that integrates Convolutional Neural Networks (CNNs) for feature extraction and transformers for contextual modeling, resulting in significantly improved transcription accuracy, especially in noisy environments.
[0010] Next, embodiments of the present application may provide an audio-to-text transcription system that is designed to handle speaker variations, background noise, and multiple accents effectively. Data augmentation techniques, such as noise addition and pitch variation, ensure that the model performs reliably across diverse audio conditions.
[0011] Next, embodiments of the present application may provide an audio-to-text transcription system that is suitable for live captioning and accessibility services, making it ideal for real-time applications where speed is critical without compromising accuracy.
[0012] Next, embodiments of the present application may provide an audio-to-text transcription system that ensures that speech-to-text alignment remains precise, reducing word errors and improving the fluency of transcriptions compared to traditional models.
[0013] Next, embodiments of the present application may provide an audio-to-text transcription system that is designed to support multiple languages and dialects, making it a future-proof solution for global speech-to-text transcription needs.
[0014] These and other advantages will be apparent from the present application of the embodiments described herein.
[0015] The preceding is a simplified summary to provide an understanding of some embodiments of the present invention. This summary is neither an extensive nor exhaustive overview of the present invention and its various embodiments. The summary presents selected concepts of the embodiments of the present invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] The above and still further features and advantages of embodiments of the present invention will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings, and wherein:
[0017] FIG. 1 illustrates a block diagram of a transformer-based audio-to-text transcription system, according to an embodiment of the present invention.
[0018] FIG. 2 depicts a flowchart of a method for transformer-based audio-to-text transcription, according to an embodiment of the present invention.
[0019] The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including but not limited to. To facilitate understanding, like reference numerals have been used, where possible, to designate like elements common to the figures. Optional portions of the figures may be illustrated using dashed or dotted lines, unless the context of usage indicates otherwise.
DETAILED DESCRIPTION
[0020] The following description includes the preferred best mode of one embodiment of the present invention. It will be clear from this description of the invention that the invention is not limited to these illustrated embodiments but that the invention also includes a variety of modifications and embodiments thereto. Therefore, the present description should be seen as illustrative and not limiting. While the invention is susceptible to various modifications and alternative constructions, it should be understood, that there is no intention to limit the invention to the specific form disclosed, but, on the contrary, the invention is to cover all modifications, alternative constructions, and equivalents falling within the scope of the invention as defined in the claims.
[0021] In any embodiment described herein, the open-ended terms "comprising", "comprises", and the like (which are synonymous with "including", "having", and "characterized by") may be replaced by the respective partially closed phrases "consisting essentially of", "consists essentially of", and the like, or the respective closed phrases "consisting of", "consists of", and the like.
[0022] As used herein, the singular forms “a”, “an”, and “the” designate both the singular and the plural, unless expressly stated to designate the singular only.
[0023] FIG. 1 illustrates a block diagram of a transformer-based audio-to-text transcription system 100 (hereinafter referred to as the system 100), according to an embodiment of the present invention. The system 100 may be adapted to identify a vocal snippet and/or a speech snippet in an uploaded digital file. Further, the system 100 may be adapted to recognize the vocal snippet and/or the speech snippet and convert the same into a transcription. The digital file may be, but is not limited to, an audio file, a presentation file, a video file, a video conferencing call, an audio conferencing call, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the digital file that may be uploaded to the system 100, including known, related art, and/or later developed technologies.
[0024] The system 100 may comprise an input unit 102, a processor 104, a Convolutional Neural Network (CNN) model 106, a Bidirectional Long Short-Term Memory (BiLSTM) engine 108, a Connectionist Temporal Classification (CTC) loss 110, language models 112, a location-sensitive attention mechanism 114, and a latency optimization framework 116.
[0025] In an embodiment of the present invention, the input unit 102 may be adapted to upload the digital file(s) to the system 100. The input unit 102 may be, but not limited to, a mobile, a computer, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the input unit 102, including known, related art, and/or later developed technologies.
[0026] In an embodiment of the present invention, the processor 104 is communicatively connected to the input unit 102. The processor 104 may be configured to receive the audio input from the input unit 102. The processor 104 may be configured to execute the Convolutional Neural Network (CNN) model 106 to process the received audio. The Convolutional Neural Network (CNN) model 106 may utilize a spectrogram-based preprocessing to distinguish speech components from noise. The processor 104 may be configured to develop spectrograms of the processed audio for filtration of noise and extraction of fine-grained speech features. The processor 104 may be configured to model temporal dependencies and contextual relationships for speech recognition in the processed audio. The processor 104 may be configured to employ the Bidirectional Long Short-Term Memory (BiLSTM) engine 108 to capture sequential dependencies in the processed audio in a forward direction and a backward direction.
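As an illustrative aside (not part of the disclosed system), the spectrogram-based preprocessing described above can be sketched in NumPy: windowed frames are transformed to the frequency domain and log-compressed, yielding the time-frequency image a CNN front end would consume. The frame length, hop size, and test signal here are assumptions for illustration.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude spectrogram: Hann-windowed frames -> FFT magnitude
    -> log compression. A CNN front end would treat the result as an
    image whose rows are time frames and columns are frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)
    return np.log1p(mag)                        # compress dynamic range

sr = 8000
t = np.arange(sr) / sr                          # one second of audio
tone = np.sin(2 * np.pi * 440 * t)              # 440 Hz "speech" stand-in
tone += 0.1 * np.random.default_rng(1).normal(size=sr)  # background noise
spec = log_spectrogram(tone)
print(spec.shape)                               # (61, 129)
```

The tone concentrates energy in a narrow band of bins while the noise spreads thinly across all of them, which is exactly the separation the spectrogram representation gives the CNN to exploit.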
[0027] The Bidirectional Long Short-Term Memory (BiLSTM) engine 108 may enhance sequential speech recognition by considering past audio contexts and future audio contexts. The processor 104 may be configured to employ the Connectionist Temporal Classification (CTC) loss 110 and the language models 112 for reducing Word Error Rate (WER) and Character Error Rate (CER). The processor 104 may be configured to execute the location-sensitive attention mechanism 114 to align the processed audio. The location-sensitive attention mechanism 114 may map input acoustic signals to text tokens, thereby ensuring synchronized transcription in multi-speaker and overlapping speech scenarios. The processor 104 may be configured to generate textual outputs, using the latency optimization framework 116, for obtaining the transcription of the processed audio. The processor 104 may be, but is not limited to, a Programmable Logic Control (PLC) unit, a microprocessor, a development board, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the processor 104, including known, related art, and/or later developed technologies.
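As an illustrative aside, the decoding side of the CTC objective mentioned above can be sketched in a few lines. CTC lets the network emit one label (or a blank) per audio frame without frame-level alignment; decoding then merges repeats and drops blanks. The toy alphabet and frame sequence below are assumptions for illustration only.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame best path into a label sequence the CTC way:
    merge consecutive repeated ids, then drop blanks. (Training uses the
    differentiable CTC loss; this is only the greedy decoding step.)"""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# Per-frame predictions: blank, c, c, blank, a, a, a, t  ->  "cat"
alphabet = {1: "c", 2: "a", 3: "t"}
ids = ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 3])
print("".join(alphabet[i] for i in ids))  # cat
```

The blank symbol is what lets CTC represent genuinely repeated characters: the frames `[1, 0, 1]` decode to two occurrences of label 1, while `[1, 1]` collapses to one.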
[0028] In an exemplary embodiment of the present invention, the system 100 may be adapted to identify and transcribe text from an audio input recorded in a multi-sound environment, such as a wedding ceremony. Weddings typically involve multiple overlapping sound sources, including conversations, music, clapping, and ambient noise. The system 100 may effectively isolate and recognize a specific vocal snippet, such as the father’s speech during a wedding toast. In this scenario, the input unit 102 may receive an audio file recorded at the wedding. The processor 104 processes the received audio using the Convolutional Neural Network (CNN) model 106. The CNN model 106 may generate spectrograms that help distinguish different sound components by filtering background noise and isolating dominant speech features. The processor 104 subsequently employs the Bidirectional Long Short-Term Memory (BiLSTM) engine 108 to analyze sequential dependencies in both forward and backward directions. This may allow the system 100 to differentiate the father’s voice from other overlapping sounds, such as wedding music or audience chatter.
[0029] Further, to improve transcription accuracy, the processor 104 may utilize the Connectionist Temporal Classification (CTC) loss 110 and the language models 112. These components work together to minimize Word Error Rate (WER) and Character Error Rate (CER), so that the father’s speech is correctly transcribed even in the presence of acoustic disturbances. Furthermore, the location-sensitive attention mechanism 114 is employed to enhance alignment between the audio input and the corresponding text. This mechanism ensures that even if multiple speakers are talking at the same time, the system 100 can track and transcribe the correct speaker’s voice, such as the father’s speech, with high precision. Finally, the latency optimization framework 116 is utilized to generate real-time or near real-time textual output, delivering an accurate transcription of the father’s speech within the multi-sound wedding environment. The output can be stored as text, displayed in captions, or integrated with other digital platforms for accessibility and documentation.
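As an illustrative aside, the Word Error Rate that paragraphs [0027] and [0029] seek to minimize is the word-level Levenshtein (edit) distance normalized by reference length; CER is the same computation over characters. The example transcript below is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via the standard edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word out of five reference words -> WER 0.2
print(word_error_rate("to the bride and groom", "to the bride groom"))  # 0.2
```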
[0030] In another exemplary embodiment, the system 100 may be used in a newsroom setting, where multiple journalists and reporters are speaking simultaneously. The system can isolate and transcribe a specific reporter’s voice accurately, ensuring real-time documentation of live news coverage despite background chatter and overlapping discussions. Furthermore, an exemplary application of the location-sensitive attention mechanism 114 can be seen in a courtroom setting, where multiple speakers, including judges, attorneys, and witnesses, are speaking in succession or simultaneously. The system 100 may utilize the location-sensitive attention mechanism 114 to track the speaker's position and voice characteristics, such as regional accents for accurate attribution of statements to the correct individuals. This enhances legal documentation and prevents misinterpretation of court proceedings.
[0031] Embodiments of the present invention may extend to various real-world applications, such as legal proceedings, conferences, and other scenarios where multiple overlapping voices need to be transcribed with high accuracy. The system 100 may demonstrate significant potential in enhancing speech recognition technology, particularly in complex auditory environments.
[0032] FIG. 2 depicts a flowchart of a method 200 for the transformer-based audio-to-text transcription, according to an embodiment of the present invention.
[0033] At step 202, the system 100 may receive the audio input from the input unit 102.
[0034] At step 204, the system 100 may execute the Convolutional Neural Network (CNN) model 106 to process the received audio.
[0035] At step 206, the system 100 may develop the spectrograms of the processed audio for filtration of noise and extraction of fine-grained speech features.
[0036] At step 208, the system 100 may model the temporal dependencies and the contextual relationships for the speech recognition in the processed audio.
[0037] At step 210, the system 100 may employ the Bidirectional Long Short-Term Memory (BiLSTM) engine 108 to capture the sequential dependencies in the processed audio in the forward direction and the backward direction.
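As an illustrative aside, the forward/backward capture of step 210 can be sketched with a toy bidirectional recurrence in NumPy. This is not the disclosed BiLSTM engine 108: a real BiLSTM adds input, forget, and output gating, and the weights, dimensions, and random inputs here are assumptions for illustration. Only the shapes and the forward/backward concatenation carry over.

```python
import numpy as np

def birnn_context(X, h_dim=4, seed=0):
    """Run a simple tanh RNN over the frames left-to-right (past context)
    and right-to-left (future context), then concatenate the two hidden
    states per frame, as a BiLSTM does."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Wx = rng.normal(scale=0.1, size=(d, h_dim))      # input weights
    Wh = rng.normal(scale=0.1, size=(h_dim, h_dim))  # recurrent weights

    def run(seq):
        h, out = np.zeros(h_dim), []
        for x in seq:
            h = np.tanh(x @ Wx + h @ Wh)
            out.append(h)
        return np.stack(out)

    fwd = run(X)                      # summarizes frames 0..t
    bwd = run(X[::-1])[::-1]          # summarizes frames t..T, re-aligned
    return np.concatenate([fwd, bwd], axis=1)  # (T, 2 * h_dim)

frames = np.random.default_rng(1).normal(size=(6, 8))  # 6 frames, 8-dim
ctx = birnn_context(frames)
print(ctx.shape)                     # (6, 8)
```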
[0038] At step 212, the system 100 may employ the Connectionist Temporal Classification (CTC) loss 110 and the language models 112 for reducing the Word Error Rate (WER) and the Character Error Rate (CER).
[0039] At step 214, the system 100 may execute the location-sensitive attention mechanism 114 to align the processed audio.
[0040] At step 216, the system 100 may generate the textual outputs, using the latency optimization framework 116, for obtaining the transcription of the processed audio.
[0041] While the invention has been described in connection with what is presently considered to be the most practical and various embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims.
[0042] This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.
Claims:
CLAIMS
I/We Claim:
1. A transformer-based audio-to-text transcription system (100), the system (100) comprising:
an input unit (102) adapted to receive an audio input;
a processor (104) communicatively connected to the input unit (102), characterized in that the processor (104) is configured to:
receive the audio input from the input unit (102);
execute a Convolutional Neural Network (CNN) model (106) to process the received audio;
develop spectrograms of the processed audio for filtration of noise and extraction of fine-grained speech features;
model temporal dependencies and contextual relationships for speech recognition in the processed audio;
employ a Bidirectional Long Short-Term Memory (BiLSTM) engine (108) to capture sequential dependencies in the processed audio in a forward direction and a backward direction;
employ a Connectionist Temporal Classification (CTC) loss (110) and language models (112) for reducing Word Error Rate (WER) and Character Error Rate (CER);
execute a location-sensitive attention mechanism (114) to align the processed audio; and
generate textual outputs, using a latency optimization framework (116), for obtaining a transcription of the processed audio.
2. The system (100) as claimed in claim 1, wherein the Convolutional Neural Network (CNN) model (106) utilizes a spectrogram-based preprocessing to distinguish speech components from noise.
3. The system (100) as claimed in claim 1, wherein the location-sensitive attention mechanism (114) maps input acoustic signals to text tokens, ensuring synchronized transcription in multi-speaker and overlapping speech scenarios.
4. The system (100) as claimed in claim 1, wherein the Bidirectional Long Short-Term Memory (BiLSTM) engine (108) enhances sequential speech recognition by considering past audio contexts and future audio contexts.
5. A method (200) for transformer-based audio-to-text transcription, the method (200) characterized by the steps of:
receiving an audio input from an input unit (102);
executing a Convolutional Neural Network (CNN) model (106) to process the received audio;
developing spectrograms of the processed audio for filtration of noise and extraction of fine-grained speech features;
modeling temporal dependencies and contextual relationships for speech recognition in the processed audio;
employing a Bidirectional Long Short-Term Memory (BiLSTM) engine (108);
employing a Connectionist Temporal Classification (CTC) loss (110) and language models (112);
executing a location-sensitive attention mechanism (114) to align the processed audio; and
generating textual outputs, using a latency optimization framework (116), for obtaining a transcription of the processed audio.
6. The method (200) as claimed in claim 5, wherein the Convolutional Neural Network (CNN) model (106) utilizes a spectrogram-based preprocessing to distinguish speech components from noise.
7. The method (200) as claimed in claim 5, wherein the location-sensitive attention mechanism (114) maps input acoustic signals to text tokens, ensuring synchronized transcription in multi-speaker and overlapping speech scenarios.
8. The method (200) as claimed in claim 5, wherein the Bidirectional Long Short-Term Memory (BiLSTM) engine (108) enhances sequential speech recognition by considering past audio contexts and future audio contexts.
9. The method (200) as claimed in claim 5, wherein the Connectionist Temporal Classification (CTC) loss (110) and the language models (112) are adapted to reduce Word Error Rate (WER) and Character Error Rate (CER).
10. The method (200) as claimed in claim 5, wherein the Bidirectional Long Short-Term Memory (BiLSTM) engine (108) is adapted to capture sequential dependencies in the processed audio in a forward direction and a backward direction.
Date: March 05, 2025
Place: Noida

Nainsi Rastogi
Patent Agent (IN/PA-2372)
Agent for the Applicant

Documents

Application Documents

# Name Date
1 202541019973-STATEMENT OF UNDERTAKING (FORM 3) [06-03-2025(online)].pdf 2025-03-06
2 202541019973-REQUEST FOR EARLY PUBLICATION(FORM-9) [06-03-2025(online)].pdf 2025-03-06
3 202541019973-POWER OF AUTHORITY [06-03-2025(online)].pdf 2025-03-06
4 202541019973-OTHERS [06-03-2025(online)].pdf 2025-03-06
5 202541019973-FORM-9 [06-03-2025(online)].pdf 2025-03-06
6 202541019973-FORM FOR SMALL ENTITY(FORM-28) [06-03-2025(online)].pdf 2025-03-06
7 202541019973-FORM 1 [06-03-2025(online)].pdf 2025-03-06
8 202541019973-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [06-03-2025(online)].pdf 2025-03-06
9 202541019973-EDUCATIONAL INSTITUTION(S) [06-03-2025(online)].pdf 2025-03-06
10 202541019973-DRAWINGS [06-03-2025(online)].pdf 2025-03-06
11 202541019973-DECLARATION OF INVENTORSHIP (FORM 5) [06-03-2025(online)].pdf 2025-03-06
12 202541019973-COMPLETE SPECIFICATION [06-03-2025(online)].pdf 2025-03-06
13 202541019973-Proof of Right [21-05-2025(online)].pdf 2025-05-21