
Real Time Speech To Text Transcription For Inclusive Communication

Abstract: Disclosed herein is a real-time Automatic Speech Recognition (ASR) system capable of transcribing English and Hindi speech using advanced speech-to-text models together with a MAX7219 LED dot matrix display for live text visualization. The system accomplishes transcription using models such as Whisper, Wav2Vec2 and Retrieval-Augmented Generation (RAG). The core functionality of the project consists of capturing spoken Hindi speech through a microphone, processing the speech with the ASR model, and dynamically showing the recognized text on a scrolling LED matrix. By providing a meaningful real-time text representation for people with hearing impairment, the system increases accessibility, particularly in social and public communication settings. Fig. 1


Patent Information

Application #: 202543077030
Filing Date: 13 August 2025
Publication Number: 36/2025
Publication Type: INA
Invention Field: ELECTRONICS
Parent Application: 202441087514

Applicants

AMRITA VISHWA VIDYAPEETHAM
Bengaluru Campus Kasavanahalli, Carmelaram P.O. Bangalore – 560035, India

Inventors

1. SHARMA, Vishwash
#404 Mahaveer rich Apt, AGB layout, chikkasandra, chikkabanvara, Jalahalli cross, 560090, Bangalore, Karnataka, India
2. SATHYAVARAPU, Sri Jaswanth
224, Anandapuram, Varada, K.Kotapadu, Visakhapatnam, Andhra Pradesh, 531022
3. RAMESH, Sreyas
Swathi Puliyakotte House, Kunissery, Alathur, Palakkad, Kerala-678681, India
4. SUKANTH, B N
#125, Skanda Sukriti, Kadubeesanahalli, Bengaluru, Karnataka-560103, India
5. PALANISWAMY, Suja
#102, MB4, Suryacity Phase 1, Chandapura, Anekal Road, Bengaluru, Karnataka-560099, India

Specification

Description: This application is an improvement of the invention claimed in the specification of the main patent application no. 202441087514 titled “REAL-TIME SPEECH-TO-TEXT HOLOGRAPHIC COMMUNICATION” filed on 13th November 2024.

STATEMENT OF INVENTION AS DISCLOSED IN THE PARENT APPLICATION NO. 202441087514:
The invention as disclosed in the application titled “REAL-TIME SPEECH-TO-TEXT HOLOGRAPHIC COMMUNICATION” (parent application) relates to a real-time speech-to-text conversion system using holographic display technology to aid communication for the deaf and hard-of-hearing (DHH). More particularly, the parent invention discloses a real-time speech-to-text conversion system which integrates advanced speech recognition technologies with innovative holographic displays. The system combines holographic displays, dual speech recognition techniques, and real-time speech-to-text processing to provide enhanced real-time communication and accessibility.

The system works on two comprehensive methodologies that combine cutting-edge speech recognition technology with advanced holographic projection. The first methodology uses conventional speech recognition, and the second is an enhanced version of the Wav2Vec2 model, which has been retrained to improve its efficacy with diverse speech patterns. The system is deployed on a Raspberry Pi and is interfaced with a USB microphone which captures the spoken language effectively. Upon recognition of the speech, the text is instantly displayed on a holographic screen made up of LED matrices set against an acrylic glass surface, creating a visually engaging representation of floating text. The integration of these two recognition methods enhances the system's accuracy and reliability in processing real-time speech. Moreover, by providing a visual output of spoken words, this system not only bridges communication gaps but also enriches the interaction experience within various social and professional settings.

A high-fidelity microphone (MP) and an advanced Automatic Speech Recognition (ASR) processing unit ensure rapid and accurate speech-to-text conversion for real-time communication. The system of the invention goes beyond conventional displays by incorporating holographic projection technology and utilizing an LED Matrix 7219 and acrylic glass, by which it projects the transcribed text in a floating, mid-air format that is visually engaging and accessible. Further, the system incorporates a speech recognition module (SRM) which optimizes both speed and accuracy, providing immediate visual feedback that enhances communication for users with hearing impairments. The system uses a real-time Speech Recognition Library approach and a Wav2Vec2 approach to process the audio data with heightened sensitivity to linguistic nuances, significantly improving the accuracy of the text output.

FIELD OF THE INVENTION:
The present invention relates to a real-time ASR system for inclusive communication. More particularly, the present invention discloses a real-time ASR system integrated with advanced speech-to-text models for dynamically displaying the text recognized through the ASR model on a scrolling LED matrix, resulting in a holographic preview of floating text and enhancing comprehension for deaf users.

BACKGROUND OF THE INVENTION:
Speech is one of the primary means of communication through which people express their thoughts, ideas and emotions. Yet people with hearing disabilities face many difficulties in speech-based interaction, making conversations with peers uncomfortable. A hearing or auditory disability is usually characterized by a decreased ability, or total inability, to hear (deafness). Individuals who are hard of hearing, as well as individuals who are unable to hear, are typically considered to have a hearing disability. As a result, these individuals often rely on visual and tactile mediums for receiving and communicating information. Such mediums take the form of a variety of intelligent assistive technologies and devices that improve accessibility through amplified sound, tactile feedback, visual cues and better technology access, helping users obtain information in numerous environments. Thus, assistive technologies can fill the communication gap in inclusive settings such as educational institutions, workplaces and public spaces.

Technological advancement in the field of intelligent assistive communication has led to the integration of ASR technology with holographic displays, offering novel approaches to assist those who are DHH. Broadly speaking, speech recognition, or speech-to-text, is the ability of a machine or program to identify words spoken aloud and convert them into readable text. Speech recognition draws on a broad array of research in computer science, linguistics and computer engineering. It is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers. Speech recognition systems use computer algorithms to process and interpret spoken words and convert them into text.
With the advent of deep learning models and progress in Automatic Speech Recognition (ASR), it is now possible to perform accurate and efficient speech-to-text conversion. In real-time ASR systems, immediate transcription is very valuable, enabling seamless communication. ASR is the technology that allows human beings to use their voices to speak with a computer interface in a way that resembles normal human conversation. ASR technology converts spoken language into text by analyzing sound waves and mapping them to linguistic units such as phonemes or words.

Available research reflects efforts to improve communication with the deaf and hard of hearing by utilizing advanced technologies for converting speech to text, with emphasis on a single language or a combination of languages.

In one such study, Malik et al., 2024 present a comparison of various data mining techniques for feature extraction in isolated Hindi speech recognition, discussing how ASR offers a way to convert spoken words into text and makes the technology accessible to the large portion of the Indian population that does not speak English. The purpose of the study is to design an accurate system for the identification of isolated Hindi words (‘Shunya’ to ‘Nau’) based on MFCC, PLP and LPC feature extraction methods.

Another study by Feng et al., 2024 addresses the critical issue of bias in state-of-the-art ASR systems, assessing ASR performance across different speaker groups in light of factors such as under-representation in the training data, within-group variation, biased transcriptions, across-group variation and recording equipment quality as potential sources of bias.

Considerable work has been conducted on architectures for converting speech to text, among which decoder-only architectures have been studied for speech processing tasks by Wu et al., 2023, who highlight a novel approach that injects acoustic information into text-based large language models using Connectionist Temporal Classification and an audio encoder to compress acoustic features into the continuous embedding space of the LLM.

In another study, Orken et al., 2022 proposed a system for automatic recognition of Kazakh speech which uses connectionist temporal classification in a transformer-based model, having the advantage of fast learning.

Abujar and Hasan, 2016 developed a Bengali speech-to-text conversion system using the ‘DeepSpeech’ model to create a neural network that recognizes audio files containing speech and converts them into text.

The work of Nikam et al., 2022 presents a novel encoder-decoder based ASR model that is both robust and end-to-end for the Hindi language. The work incorporates the QuartzNet model in two different architectures, QuartzNet-5x5 and QuartzNet-15x5, keeping all other parameters the same, and the novel model shows a good improvement in accuracy.

Another study by Abdulaziz et al., 2023 explores the essential inclusive educational needs of deaf and hard-of-hearing students in Oman, whose schools and colleges lack American Sign Language (ASL) support, presenting a plan to combine speech-text recognition technology with animation-based education utilizing artificial intelligence, machine learning and cloud infrastructure to establish accessible learning platforms that transform speech and text into ASL.

OBJECT OF THE INVENTION:
In order to obviate the drawbacks of the existing state of the art, the present invention discloses a real-time Automatic Speech Recognition (ASR) system for inclusive communication for the deaf and hard-of-hearing and a method of automatic speech recognition using the system.

The main object of the invention is to provide a real-time ASR system integrated with advanced speech-to-text models for dynamically displaying the text recognized through the ASR model on a scrolling LED matrix, resulting in a holographic preview of floating text and enhancing comprehension for deaf users.

Another object of the invention is to provide a real-time Automatic Speech Recognition (ASR) system capable of transcribing English and Hindi, with a primary focus on Hindi, but also designed to cater to different regional languages.

Another object of the invention is to provide a real-time Automatic Speech Recognition (ASR) system wherein rare words of a regional language can be translated accurately with a low computational footprint.

Another object of the invention is to provide a real-time Automatic Speech Recognition (ASR) system by integrating the cutting-edge Automatic Speech Recognition (ASR) technologies with compact hardware components to achieve a real-time speech-to-text solution with a holographic display.

Another object of the invention is to provide a real-time Automatic Speech Recognition (ASR) system with a scrolling display that overcomes the limitations of conventional screen-based systems, providing enhanced readability and accessibility, particularly in public or high-noise environments.

SUMMARY OF THE INVENTION:
The present invention discloses an innovative real-time Automatic Speech Recognition (ASR) system, namely “HoloSpeak”, for inclusive communication with people having hearing impairment. The system is integrated with advanced speech-to-text models for dynamically displaying the text recognized through the ASR model on a scrolling LED matrix, resulting in a holographic preview of floating text that enhances comprehension for the deaf and provides a meaningful text representation for people with any form of hearing impairment. The system is capable of recognizing both English and Hindi, with a primary focus on Hindi.

The invention is aimed at building a real-time ASR system supporting both English and Hindi, with primary optimization for Hindi speech-to-text, based on advanced ASR models including Whisper, Wav2Vec2 and a Retrieval-Augmented Generation (RAG) model, wherein the transcribed text is rendered on a MAX7219 LED dot matrix to visually output live transcription. Such an implementation has many applications and can aid people with hearing impairments, or anyone who wishes to see a visual presentation of speech in a noisy environment where audio communication is infeasible. The system operates at a 16 kHz sampling rate to capture the pitch patterns typical of Hindi speech, ensuring accurate real-time transcription and visualization, which is especially useful in public environments such as classrooms, airports and railway stations.

The invention is focused on utilizing advanced ASR techniques along with a hardware-based display system to achieve better real-time transcription accuracy coupled with low latency. This work extends the use of ASR by integrating it with an LED-based scrolling display, building on the parent invention on assistive speech recognition. The ASR model is fine-tuned using Indian-accented Hindi data to precisely recognize retroflex and aspirated consonants such as “tha” and “kha”, which are commonly misinterpreted by standard ASR systems. This results in exceptionally low WER (0.08) and CER (0.03) for Hindi transcription. The use of Indian phonetic datasets makes the model suitable for regional deployment and supports phoneme-level speech fidelity for native communication needs. In addition to transcribing speech, the system also presents this content visually in an efficient and user-friendly manner.

The model has been designed such that it can be refined to improve accuracy, optimize real-time processing and incorporate the Retrieval-Augmented Generation (RAG) model for increased contextual understanding. The RAG model enables real-time contextualization in environments such as railway stations, airports, or any infrastructure integrated with Large Language Models (LLMs). By dynamically retrieving relevant domain-specific data, the RAG architecture enhances the accuracy and relevance of responses to user queries. Additionally, it facilitates high-fidelity translation by leveraging retrieved context to produce semantically and syntactically accurate language outputs. A dedicated denoising pipeline integrates RNNoise and DeepFilterNet for adaptive noise reduction in real time, enabling high transcription accuracy even in acoustically challenging environments such as transportation hubs. This ensures intelligibility of critical announcements, including flight/train numbers, scheduled arrivals, delays and platform details via integrated display logic. Therefore, this work embodies a move towards making speech recognition technologies more accessible and impactful in real-world settings.

BRIEF DESCRIPTION OF THE DRAWINGS:
Fig. 1: depicts the Flow Diagram of the methodology of “HoloSpeak”
Fig. 2: depicts the DeepFilterNet Model Flow Diagram
Fig. 3: depicts the Whisper model Architecture
Fig. 4: depicts the Wav2Vec 2.0 Model Architecture

Fig. 5: depicts Real-Time Speech Translation with Gemini RAG and FAISS
Fig. 6: depicts the Real-Time Train Information Retrieval Using RAG and Multilingual Display
Fig. 7: depicts the Real-Time Aircraft Information Retrieval Using RAG and Multilingual Display
Fig. 8: depicts the Whisper Model WER and CER over Steps
Fig. 9: depicts the Whisper model training loss Graph
Fig. 10: depicts the Whisper Model Evaluation loss
Fig. 11: depicts the Speech-to-text outcome of the Whisper Model
Fig. 12: depicts the Gemini translation to Hindi
Fig. 13: depicts the RAG + Gemini translation to Hindi
Fig. 14: depicts the Gemini Announcement for train No 12723 Telangana Express
Fig. 15: depicts the RAG + Gemini Announcement for train No 12723
Fig. 16: depicts the Train Bot response for queries using Gemini
Fig. 17: depicts the Train Bot response using RAG + Gemini
Fig. 18: depicts the GPT 4o’s response
Fig. 19: depicts the Flight RAG model response
Fig. 20: depicts the RAG retrieval information about flight No. 6E 5334
Fig. 21: depicts the text Scrolling Display Result
Fig. 22: depicts the Text and Number Scrolling Display Result

DETAILED DESCRIPTION OF THE INVENTION:
The present invention discloses “HoloSpeak”, a real-time Automatic Speech Recognition (ASR) system and a method of its operation, integrated with advanced speech-to-text models for dynamically displaying the recognized text to enhance comprehension for the deaf and hearing impaired.

A flow diagram of the system architecture is depicted in Fig. 1, which shows how the speech-driven system processes spoken input and displays relevant information. The process begins with speech input, which is first routed through a Noise Reduction Module (NRM). This module, for which the RNNoise and DeepFilterNet techniques were implemented and compared, cleans the audio signal. The de-noised speech then proceeds to the Speech Recognition Module (SRM), where it is converted into text; the SRM was evaluated with both the Whisper and Wav2Vec2 models for accurate transcription, producing the speech-to-text output. The extracted text is then fed into a Retrieval-Augmented Generation (RAG) model, which processes the text to generate an appropriate response or retrieve relevant information. Finally, the output from the RAG model is sent to a controller deployed on a Raspberry Pi with a Display Module (DM), which drives a display to visualize the final result.

RNNoise:
RNNoise analyses the audio to understand the sounds in speech and decides which sounds are speech and which are background noise. Rather than using fixed rules, it relies on a neural network to clean the signal. The signal is split into short frames of 20 ms each and passed through a Fast Fourier Transform (FFT). From every frame the system extracts Bark Frequency Cepstral Coefficients (BFCCs), pitch correlation, per-band energy and temporal features. These features are passed into a GRU-based neural network. The recurrent architecture is able to track the order of frames, making it easier for the model to tell speech apart from noise. The network outputs a gain for every frequency band, indicating how much of it should be retained or attenuated. These spectral gains are applied to the FFT magnitudes, and the result is converted back to a time-domain signal using the inverse FFT. To produce a high-quality signal, overlapping audio frames are recombined via overlap-add. RNNoise is trained on pairs of clean and noisy speech with a mean-squared-error objective to learn the gain masks. Its small size of about 90 KB allows it to run fast on CPUs and embedded systems, which is ideal for devices like the Raspberry Pi.
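The frame-by-frame gain-masking idea described above can be illustrated with a minimal NumPy sketch. This is purely conceptual (not the actual RNNoise C implementation): the per-band gains come from a placeholder function standing in for the GRU network's output, and the frame/hop sizes follow the 20 ms framing mentioned in the text.

import numpy as np

SR = 48_000                 # RNNoise operates on 48 kHz audio
FRAME = int(0.020 * SR)     # 20 ms analysis frames
HOP = FRAME // 2            # 50% overlap for overlap-add reconstruction

def band_gains(spectrum: np.ndarray) -> np.ndarray:
    """Placeholder for the GRU network's per-band gains; here derived from a
    crude magnitude heuristic purely for illustration."""
    mag = np.abs(spectrum)
    return np.clip(mag / (mag.mean() + 1e-8), 0.0, 1.0)

def denoise(noisy: np.ndarray) -> np.ndarray:
    window = np.hanning(FRAME)
    out = np.zeros(len(noisy) + FRAME)
    for start in range(0, len(noisy) - FRAME, HOP):
        frame = noisy[start:start + FRAME] * window
        spec = np.fft.rfft(frame)                    # analysis FFT
        spec *= band_gains(spec)                     # apply predicted gains
        out[start:start + FRAME] += np.fft.irfft(spec) * window  # overlap-add
    return out[:len(noisy)]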

DeepFilterNet:
The DeepFilterNet system uses a complex masking method to suppress background noise while preserving key speech components, as shown in Fig. 2. The process involves taking a noisy speech signal, processing it frame by frame, generating a complex-valued mask, and applying it to the noisy signal in the frequency domain. Without explicit phase analysis, this method enhances speech and attenuates noise, though it may introduce artifacts or robotic effects.

The audio input is converted into the time-frequency domain through a Short-Time Fourier Transform (STFT). Each frame provides log-power features, delta parameters and sometimes extra data such as noise references or a signal-to-noise ratio (SNR) estimate. These features are fed to a small neural network whose architecture is based on Temporal Convolutional Networks (TCNs) and Gated Recurrent Units (GRUs) to model both local and temporal dependencies. The design supports real-time operation because the network is fast and causal. For each frequency bin it produces filter masks (real and imaginary parts) that are applied to the noisy speech's STFT to obtain an enhanced STFT, which is then converted back to an audio signal using the inverse STFT. During training, a time-domain loss, spectral distance and the STOI and PESQ metrics are combined into a perceptually motivated loss function.
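For reference, the published DeepFilterNet Python package exposes a small enhancement API; the sketch below follows its documented usage (the file names are placeholders, and this assumes the `deepfilternet` distribution providing the `df.enhance` helpers).

from df.enhance import enhance, init_df, load_audio, save_audio

# Load the default pretrained DeepFilterNet model and its state
model, df_state, _ = init_df()

# Load a noisy recording at the model's expected sampling rate
audio, _ = load_audio("noisy.wav", sr=df_state.sr())

# Apply complex spectral masking and write out the enhanced signal
enhanced = enhance(model, df_state, audio)
save_audio("enhanced.wav", enhanced, df_state.sr())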

Table 1 below compares the two noise reduction models, RNNoise and DeepFilterNet, based on their Signal-to-Noise Ratio (SNR) scores.

Table 1 Comparison of the noise reduction model performance:
Model SNR (Noisy) SNR (Denoised)
RNNoise 5 dB 7.98 dB
DeepFilterNet 4.99 dB 11.93 dB

It is clear from the table that there is a moderate improvement of 2.98 dB in the case of RNNoise, indicating that RNNoise was successful in reducing some noise but not in drastically improving signal clarity, since RNNoise is based on a compact GRU/RNN model, which limits the depth of spectral detail it can handle. On the other hand, DeepFilterNet shows a significant improvement of 6.94 dB, due to its complex spectral masking and deeper architecture.
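The SNR figures in Table 1 can be computed as shown below, assuming a clean reference signal is available for the noisy and denoised versions; the improvement reported above is simply the difference of the two values.

import numpy as np

def snr_db(clean: np.ndarray, processed: np.ndarray) -> float:
    """SNR in dB of a processed signal measured against its clean reference."""
    noise = processed - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

# improvement = snr_db(clean, denoised) - snr_db(clean, noisy)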

WHISPER MODEL
The Whisper model applies a detailed series of processing steps to turn Hindi speech into text. The input audio is transformed into 80-dimensional log-Mel spectrograms, which preserve important speech details while suppressing noise. An encoder captures the main speech patterns, including Hindi-specific sounds such as retroflex and aspirated consonants. Rare words can be translated accurately because the decoder employs Byte-Pair Encoding (BPE) and beam search (num_beams=5). Fine-tuning on Indian Hindi data makes the model better at identifying sounds such as “tha” and “kha”. Operating at a 16 kHz sampling rate allows the speech to preserve key pitch information in real time.
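A minimal transcription sketch of this stage using the Hugging Face transformers library is given below; the checkpoint name is a placeholder for the fine-tuned Hindi model, and the beam-search setting mirrors the num_beams=5 configuration described above.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_ID = "openai/whisper-small"   # placeholder: replace with the fine-tuned Hindi checkpoint
processor = WhisperProcessor.from_pretrained(MODEL_ID, language="Hindi", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID).eval()

def transcribe(waveform_16khz):
    # The processor's feature extractor produces the 80-bin log-Mel spectrogram
    inputs = processor(waveform_16khz, sampling_rate=16_000, return_tensors="pt")
    forced_ids = processor.get_decoder_prompt_ids(language="Hindi", task="transcribe")
    with torch.no_grad():
        ids = model.generate(inputs.input_features,
                             forced_decoder_ids=forced_ids,
                             num_beams=5, max_new_tokens=225)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]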

Dataset Description:
The dataset, sourced from Kaggle, includes 3,843 Hindi audio files (.wav) and a transcription.txt file, totalling about 5.56 hours of speech from dialogues and news broadcasts. It was normalized to 16 kHz mono audio for Whisper compatibility. Preprocessing included resampling, amplitude normalization, removal of unwanted silence and quality checks (automated and manual). The data was split into 90% training and 10% validation sets, a process made seamless by Hugging Face Datasets, which also handled removing background noise and excluding segments with overlapping speakers. The pipeline can additionally apply time-scale adjustment and acoustic simulation to cover regional accents such as Bhojpuri and Marwari.
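A short sketch of the 16 kHz resampling and 90/10 split with Hugging Face Datasets follows; the local data directory and column layout are assumptions for illustration.

from datasets import Audio, load_dataset

# Placeholder path to a folder of .wav files plus transcriptions
ds = load_dataset("audiofolder", data_dir="hindi_speech")["train"]

# Decode and resample all audio to 16 kHz for Whisper compatibility
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# 90% training / 10% validation split
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]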

Whisper Model Architecture:
The Whisper Small model of the present invention has 244 million parameters, 12 encoder and 12 decoder layers with 8 attention heads and a hidden state size of 768. The audio is processed into log-Mel spectrograms by dividing it into 25 ms windows with 10 ms steps to preserve syllable-level detail. The Whisper model architecture used in the process is depicted in Fig. 3. The encoder processes spectral frames via multi-head attention and GELU-activated feedforward networks. The decoder generates Hindi tokens (starting with the “hi” language token) using beam search and BPE (50k tokens) for robust Hindi language generation. Hindi adaptation involved modified Mel filters for retroflex sounds, gradient checkpointing (60% memory reduction), FP16 training (2× faster) and dynamic batching (30% more efficient). Attention maps confirm focus on essential Hindi phonemes. Key techniques such as beam search and FP16 training reduce WER by up to 12% and boost speed. The model can incorporate knowledge distillation for on-device deployment.

Training Whisper Model:
Fine-tuning was done on Whisper Small using a batch size of 8 and a learning rate of 1e-5, optimized for GPU usage (NVIDIA T4). Training lasted 3,600 steps, with fp16=True and gradient_checkpointing=True for efficient memory usage. Validation was performed every 200 steps, with early stopping (patience=3) to prevent overfitting. The maximum generation length was 225 tokens, and label smoothing (0.1) improved generalization to noisy inputs. Model checkpoints were saved with a limit of 2 to conserve space. The model was trained for 6 hours, balancing speed, memory and performance, and successfully adapted Whisper to handle the phonetic and morphological complexities of Hindi.
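The configuration above maps onto the Hugging Face Seq2SeqTrainer as sketched below; `model`, `train_ds`, `val_ds`, `data_collator` and `compute_metrics` are assumed to be defined elsewhere (e.g. in the earlier sketches), so this is a configuration outline rather than a complete script.

from transformers import (EarlyStoppingCallback, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-hi",
    per_device_train_batch_size=8,      # batch size 8
    learning_rate=1e-5,                 # learning rate 1e-5
    max_steps=3600,                     # 3,600 training steps
    fp16=True,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    eval_steps=200,                     # validate every 200 steps
    save_steps=200,
    save_total_limit=2,                 # keep only the two latest checkpoints
    predict_with_generate=True,
    generation_max_length=225,          # max generation length of 225 tokens
    label_smoothing_factor=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=val_ds,
    data_collator=data_collator, compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience=3
)
trainer.train()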

WAV2VEC2 MODEL
The Wav2Vec2 model, a speech recognition model developed by Facebook, has had a tremendous impact on the domain of ASR as it does not rely on intensive manual transcription of speech. Fig. 4 depicts the architecture of the Wav2Vec 2.0 model, showing three primary components:
• Feature Encoder: converts raw audio into latent speech representations using convolutional layers. It transforms 30ms of audio into a 512-dimensional vector every 10ms.
• Context Network: applies a Transformer-based model over the encoded features to capture long-range dependencies and contextual patterns in speech, enhancing semantic understanding.
• Quantization Module: discretizes continuous features into a finite set of speech units, enabling self-supervised learning by predicting masked units and improving representation quality.
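A minimal inference sketch for this architecture using Hugging Face transformers is shown below; the checkpoint name is a placeholder for the Hindi fine-tuned model discussed in this section, and greedy CTC decoding is used for simplicity.

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CKPT = "facebook/wav2vec2-base-960h"   # placeholder: replace with the Hindi fine-tuned checkpoint
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()

def transcribe(waveform_16khz):
    inputs = processor(waveform_16khz, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits     # (batch, time, vocab)
    pred_ids = torch.argmax(logits, dim=-1)            # greedy CTC decoding
    return processor.batch_decode(pred_ids)[0]         # collapses repeats and blanks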

Dataset Description:
The Mozilla Common Voice dataset is a free, publicly available, multilingual collection of voice recordings and corresponding text, designed to improve speech recognition technologies and ensure that everyone has access to building their own speech recognition models. Each entry in the dataset consists of a unique MP3 file and its respective text file. The latest release of the dataset has grown to 31,841 hours of recorded speech, with 20,789 hours validated by the community, across 129 languages. In this work, Common Voice 20.0 Hindi has been used, which consists of 458.53 MB of data comprising 23 recorded hours, of which 16 hours are validated, from 447 voices in MP3 format.

Pre-Processing:
The pre-processing step is undertaken after gathering the dataset and includes cleaning the gathered data using the following techniques:
- Normalization:
Normalizing the text is a crucial step in preparing textual data for speech recognition as it transforms the text into a standardized and consistent format. The normalization function used in the present invention includes Regular Expression, Lowercase Conversion, Unicode Normalization and Whitespace Normalization.
- Dropping Unnecessary columns:
In this technique, columns irrelevant to training such as client ID, age, gender, etc., are removed. After normalizing and dropping unnecessary columns, a vocabulary dictionary is extracted and constructed from the text data in the datasets. This process is crucial for training models that convert audio into text.
- Vocabulary Extraction:
In this technique, unique characters are extracted from combined sentences and characters from both datasets are merged and sorted. Special tokens like [PAD] and [UNK] are added for handling padding and unknown characters.
- Feature Extraction:
Raw audio is normalized and resampled (typically to 16 kHz) after which Wav2Vec2FeatureExtractor converts audio into feature vectors suitable for the model.
- Data Collation:
A data collator dynamically pads inputs and labels within each batch. This enables efficient batching, conserving resources while maintaining performance. The dataset is split into 80% training and 20% testing.
- Connectionist Temporal Classification (CTC):
CTC is a decoding method used in sequence tasks where input-output alignment is unknown, particularly when input sequences are longer than the outputs. It handles variable-length outputs by introducing a special blank token and merging repeated characters, enabling efficient decoding. The CTC loss computes the sum of probabilities over all valid alignments for a target sequence, eliminating the need for explicit alignment during training. While effective for speech recognition, CTC struggles with modelling long-term dependencies and complex output structures. Modern ASR systems, like Wav2Vec 2.0, often combine CTC with attention-based encoder-decoder architectures to improve context understanding and alignment accuracy.
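The CTC decoding rule described above (merge repeated symbols, then drop the blank token) can be illustrated with a toy example; the blank symbol "-" and the sample path are of course only for demonstration.

# Toy illustration of CTC post-processing on a best-path string
BLANK = "-"

def ctc_collapse(path: str) -> str:
    # 1) merge consecutive repeats, 2) remove blank tokens
    merged = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(c for c in merged if c != BLANK)

assert ctc_collapse("hh-ee--ll-ll--oo") == "hello"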

REAL-TIME SPEECH-TO-TEXT TRANSLATION SYSTEM WITH EXAMPLES:
The Real-Time Speech Translation flow diagram has been depicted in Fig. 5. The system begins by loading an English-Hindi translation dataset (130,475 entries) using Pandas from a CSV file. After removing missing values, each English sentence is converted into numerical form using the paraphrase-multilingual-MiniLM-L12-v2 model from Sentence Transformers, chosen for its multilingual capabilities and efficient performance. MiniLM is a compact transformer model trained via contrastive learning and knowledge distillation, offering fast and effective sentence embeddings. For retrieval, FAISS (Facebook AI Similarity Search) is used with an IndexFlatL2 setup that performs brute-force nearest neighbour searches based on L2 distance. Given an English query, it is embedded and matched against stored vectors to retrieve the most relevant Hindi translations.
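The retrieval step just described can be sketched as follows; the CSV column names are assumptions, while the embedding model and the FAISS IndexFlatL2 setup follow the description above.

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("english_hindi_pairs.csv").dropna()        # ~130,475 sentence pairs (placeholder path)
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Embed all English sentences and index them for brute-force L2 search
embeddings = encoder.encode(df["english"].tolist(), convert_to_numpy=True).astype(np.float32)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

def retrieve(query: str, k: int = 3) -> list:
    """Return the k Hindi translations whose English sources are nearest to the query."""
    q = encoder.encode([query], convert_to_numpy=True).astype(np.float32)
    _, idx = index.search(q, k)
    return df["hindi"].iloc[idx[0]].tolist()   # candidates passed on to Gemini for refinement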

The retrieved results are passed to Gemini via a structured prompt using a POST API call. Gemini refines the translation for contextual accuracy and fluency. If the API fails, an error message is displayed. The system demonstrates real-time translation, as shown in the example with “Cheers!”, yielding contextually similar Hindi outputs. The system, as observed in Fig. 6, enables multilingual, real-time train information retrieval using Retrieval-Augmented Generation (RAG). It combines ChromaDB for vector-based similarity search with Gemini for generating accurate responses.

As shown in Fig. 6, the process begins when a user inputs a query like “Train 12712, running late.” Using the MiniLM-L12-V2 model, the system extracts key elements (e.g., the train number) and embeds the input for semantic understanding. The embedded query is then compared with entries in ChromaDB, a vector database containing train schedules, delays and platform updates. Unlike strict keyword matching, ChromaDB uses contextual similarity to identify relevant information, even from varied phrasings (e.g., “Train 12712 is behind schedule”). Gemini then generates a context-aware, multilingual response containing the train status details based on the retrieved data, making the system useful for both passengers and railway staff and enhancing accessibility by communicating information in users' preferred languages. The output is forwarded to a Display Module (DM) that transmits the message via LED matrices, LCD screens or other digital displays in railway stations. This real-time multilingual system improves the passenger experience by providing up-to-date information in an inclusive manner.
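A minimal sketch of this retrieval-then-generate flow with ChromaDB is given below; the stored documents are illustrative sample data, and the Gemini call is left as a placeholder since the exact prompt and endpoint are implementation specific.

import chromadb

client = chromadb.Client()
trains = client.create_collection("train_status")

# Illustrative schedule/status strings (sample data, not real timetable records)
trains.add(
    ids=["12712", "12723"],
    documents=["Train 12712 is running 40 minutes late, platform 4",
               "Train 12723 Telangana Express departs 06:25 from platform 1"],
)

def retrieve_context(query: str, k: int = 1) -> str:
    """Return the most semantically similar stored status string."""
    hits = trains.query(query_texts=[query], n_results=k)
    return hits["documents"][0][0]

context = retrieve_context("Train 12712, running late")
# prompt = f"Announce in Hindi and English using only this context: {context}"
# response = gemini_generate(prompt)   # placeholder for the structured Gemini POST call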

The Real-Time Aircraft Information Retrieval System shown in Fig. 7 enables real-time flight information retrieval and multilingual communication using RAG. It combines semantic vector search (via ChromaDB or FAISS) with generative AI (Gemini) for context-aware responses. When a user enters a query like “Flight TK 4720, delayed by an hour”, the system uses the MiniLM-L12-V2 model to extract key entities (e.g., the flight number). The input is embedded into a vector space and matched against a flight database containing structured information such as flight number, airline name, departure airport, arrival airport, scheduled times (departure and arrival), estimated times, flight status, aircraft model, and terminal/gate information. Once the relevant data is retrieved, Gemini synthesizes a natural-language response that includes real-time flight updates, delays, gate information, and more. The output is translated into the user's preferred language, ensuring better communication for both passengers and airport staff. This hybrid approach of semantic search and generative AI ensures accurate, contextual and multilingual transportation updates.

Table 2 contrasts the performance of the two ASR models, Whisper and Wav2Vec2, with respect to three essential metrics, namely the loss function, WER and CER:
• Loss: The model's error during training, showing how well the model predicts the correct training data. Generally, lower loss values indicate a better-fitting model.
• Word Error Rate (WER): A standard metric that measures the percentage of words incorrectly transcribed by the ASR system, providing a high-level understanding of the system's accuracy at the word level.
• Character Error Rate (CER): A metric that measures the percentage of incorrectly transcribed characters, interpreting the transcription in a more granular manner, particularly for languages that do not have well-defined word boundaries or applications that are sensitive to fine-level errors.

Both WER and CER are very important metrics for assessing the usability of an ASR system, and lower values correspond to better accuracy and better performance in real-world applications.
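WER and CER values such as those in Table 2 can be computed, for example, with the jiwer package (an assumption about tooling; any equivalent edit-distance implementation would work). The reference/hypothesis pair below is only an illustrative example.

from jiwer import cer, wer

reference = "मेरा नाम राहुल है"
hypothesis = "मेरा नाम राहुल हैं"

print(f"WER: {wer(reference, hypothesis):.2f}")   # fraction of incorrectly transcribed words
print(f"CER: {cer(reference, hypothesis):.2f}")   # fraction of incorrectly transcribed characters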

Table 2 Comparison of the ASR model performance
Model Loss Word Error Rate Character Error Rate
Whisper 0.1569 0.2220 0.0904
Wav2Vec2 0.3691 0.3285 0.0875

Evaluation results of WHISPER, as depicted in Fig. 8, indicate strong performance with a Word Error Rate (WER) of 0.22 and a Character Error Rate (CER) of 0.09, approaching human-level transcription accuracy. The attention maps effectively localized critical features in Hindi, such as retroflex consonants and vowel duration, enabling strong performance on these phonetic structures.

The WHISPER model achieved 2.8× faster inference than Wav2Vec2-Hindi and improved WER by 14% over IndicWhisper on the Common Voice Hindi dataset. Mixed-language training and data augmentation were employed to address code-switching and dialectal variations. An n-gram language model rescoring mechanism is integrated to potentially improve WER by an additional 2-3% in future iterations.

Training from step 200 (WER: 45.45%, CER: 20.14%) to step 3000 (WER: 8.16%, CER: 3.53%) showed rapid adaptation. The decreasing training-validation loss gap and steady WER/CER decline confirm generalization, not memorization. The model grasped character-level patterns first, then improved at word-level recognition. Training loss dropped sharply between steps 100 and 500, then tapered, showing convergence by step 1500 as observed in Fig. 9. This indicates stable and effective learning of Hindi phonetics and grammar. The smooth curve validates the use of a learning rate of 1e-5, batch size of 8 and gradient clipping. Further training may bring diminishing returns beyond step 3000.

Validation loss reduced from 0.55 to 0.0782 over 3000 steps, with most gains seen early (steps 200-350), as observed in Fig. 10. The steady decline confirms strong generalization. Minimal fluctuations suggest effective regularization and no overfitting, validating the training configuration. The final transcription shows fluent, natural Hindi text with accurate sentence structure, spelling, and dialogue flow as observed in Fig. 11. It demonstrates successful handling of long-form speech via chunking and recombination, confirming the model’s full end-to-end speech-to-text capability.

While Wav2Vec 2.0 competed well with WHISPER, as observed in Table 2, it may still underperform WHISPER in noisy or real-world audio settings, possible reasons being differences in architecture, training data and decoding strategies, indicating that the model architecture plays a key role in speech-to-text transformation. WHISPER employs a Transformer-based encoder-decoder architecture with a built-in language model, enabling it to correct words even when parts of the audio are unclear. In contrast, Wav2Vec 2.0 uses an encoder-only model trained to learn audio representations, which are later fine-tuned on labelled data. This limits its ability to recover from acoustic ambiguity, especially if decoding is done without a strong external language model.

Secondly, the amount and diversity of training data contribute heavily to robustness. Whisper is trained on over 680,000 hours of multilingual and noisy internet sourced audio. This includes various accents, environments, and background noises, making it highly robust to real-world scenarios. On the other hand, Wav2Vec 2.0, especially in its common pretrained form, is usually trained on clean datasets like LibriSpeech, which consists mostly of audiobooks, as a result of which, it performs best in clean conditions and may struggle with noise or domain mismatch.

It is important to note that the decoding strategy also matters in speech-to-text transformation. Whisper includes an integrated beam search decoder and a language model, which help improve word-level accuracy and contextual correctness, whereas Wav2Vec models often rely on greedy decoding unless paired with external decoders like KenLM. When only a simple greedy decoder is used, this significantly limits Wav2Vec2's performance on real-world data.

Thus, it is clear from the results depicted in Table 2 that Whisper outperforms Wav2Vec2 on all metrics (Loss, WER, CER) and is therefore the better-suited system.

RETRIEVAL AUGMENTED GENERATION (RAG) MODEL
The RAG model employed in the system of the present invention processes the text extracted and fed to it from the Speech Recognition Module (SRM) to generate an appropriate response or retrieve relevant real-time information, and displays it on screen for visibility and accessibility to others.

Translating Language Model for English-to-Hindi Translation:
The performance of the RAG model is assessed for outputting text when integrated with the Gemini model. Fig. 12 shows that the outputs of the plain Gemini translation are not concise and contain vague answers, whereas the RAG+Gemini fine-tuned model depicted in Fig. 13 clearly produces complete and concise output for the translating bot.

Training the RAG Model:
Figs. 14 and 15 depict the results of text analysis of Gemini vs RAG+Gemini announcements for a certain train. The main difference between the results in Fig. 14 and Fig. 15 is that the model without RAG needs manual input of train details, whereas the Gemini+RAG model can retrieve the information on its own. The model in Fig. 14 is unable to generate the announcement as it does not have complete information and asks the user to provide it, whereas the RAG-integrated model is able to generate the announcement in all the main languages.

In Fig. 16, the bot accurately states that it does not have the proper information to answer the query, whereas in Fig. 17 the train bot using RAG+Gemini is able to provide accurate information about the query.

AIRPORT RAG MODEL
Figures 18-20 depict the announcement systems using the RAG model. There are many differences between the announcements generated by GPT-4o in Fig. 18 and by the RAG model in Fig. 19, which has been trained using FAISS and Gemini, a notable difference being that the RAG announcement is consistent across all the languages. Correct flight information, as seen in Fig. 20, has been used in generating the RAG response, and correct time and terminal information has also been provided. Thus, RAG models are well suited for post-arrival/departure information and can be improved further with real-time RAG model training to give real-time reports.

HOLOGRAM
The hardware used in the invention displays rolling text in the form of a hologram with the help of an LED display. The display exhibits the results of the real-time speech-to-text conversion and its analysis performed by the deployed software in collaboration with the hardware for a visible outcome. The hardware implementation of the system, “HoloSpeak”, comprises compact, low-power embedded components optimized for real-time performance and ease of deployment in public spaces. The key hardware components are: a Raspberry Pi 5; a USB microphone; a MAX7219 LED dot matrix display (8x32 configuration); and a power supply (5V/32A).
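Driving the 8x32 MAX7219 matrix from the Raspberry Pi can be sketched as below, assuming the luma.led_matrix library as the tooling choice; note that the built-in CP437 font covers Latin characters and digits, so Devanagari output as in Fig. 21 would additionally require a custom glyph set.

from luma.core.interface.serial import spi, noop
from luma.core.legacy import show_message
from luma.core.legacy.font import CP437_FONT, proportional
from luma.led_matrix.device import max7219

serial = spi(port=0, device=0, gpio=noop())
device = max7219(serial, cascaded=4, block_orientation=-90)  # four 8x8 blocks = 8x32

def scroll(text: str, delay: float = 0.05) -> None:
    """Scroll transcribed text across the LED matrix."""
    show_message(device, text, fill="white",
                 font=proportional(CP437_FONT), scroll_delay=delay)

scroll("Train 12723 Telangana Express - Platform 1")   # illustrative message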

Fig. 21 depicts the result of Hindi scrolling text on the LED display, which reflects the readiness of the system for showcasing a regional language other than English, making it useful for regional audiences who can benefit from the system. The system opens the door to multiple regional languages, especially with the help of the developed RAG and ASR models. The display, as indicated in Fig. 21, directly interacts with the interface of the RAG model to show the appropriate output as per the user's requirement, helping reach people with disabilities.

Fig. 22 depicts the result of Text and Number Scrolling Display showing the train-number, from which it can be inferred that the model and the display are prepared and tested to show alphanumeric scrolling text. This shows the utility of the system in public places where announcements are made or display boards are needed for assistance in providing information to the people, especially benefitting those with hearing impairment.

The “HoloSpeak” represents a major breakthrough in the evolution of effective communication systems for people with hearing disabilities. By integrating cutting-edge Automatic Speech Recognition (ASR) technologies including OpenAI's Whisper, Facebook's Wav2Vec2 and the Retrieval-Augmented Generation (RAG) module with compact hardware components such as the MAX7219 LED dot matrix and Raspberry Pi 5, the system achieves a real-time Hindi speech-to-text solution with a holographic display. This system guarantees high transcription accuracy and incorporates a new visual interface based on the Ghost illusion, enabling text to be observed as a floating hologram. This display method overcomes the limitations of conventional screen-based systems, providing enhanced readability and accessibility, particularly in public or high-noise environments.

The system incorporates well-established noise suppression algorithms such as RNNoise and DeepFilterNet to make speech processing reliable in very noisy environments. Comparative studies reveal that the Whisper model outperforms the other ASR technologies in both WER and CER, and the system is especially suited for on-device embedded applications. Further, the modular architecture and low computational footprint make the system suitable for deployment in resource-constrained environments such as classrooms, transportation hubs and service desks. In general, the invention showcases the application of modern AI and embedded technologies in designing powerful real-world solutions that close the communication gap for assistive communities.

The HoloSpeak has been designed to be trained for understanding more Indian languages as well as for processing sentences that contain Hindi as well as English language. Its outreach could be improved by addition of gestures or emotion recognition features to enable it to function better for non-verbal communication. The system can be modulated to become more intelligent by making use of recent conversations to refine accuracy. The device can be made smaller and wearable by deploying glasses or clips, thus making it much more convenient to be carried and used anywhere. Providing users with options to adjust the text size, color, and speed could render it more convenient to different needs. The system is very helpful in multilingual areas, where real-time translation can be an added feature. Finally, the device preserves the user privacy and can work even in the absence of internet connectivity.
Claims:
We Claim:
1. A real-time Automatic Speech Recognition (ASR) system, “HoloSpeak”, for inclusive communication, the System (S) comprising:
- USB microphone for capturing audio signals,
- Noise Reduction Module (NRM), for cleaning the audio signal,
- Speech Recognition Module (SRM) for converting audio signals into text,
- Retrieval Augmented Generation Module (RAG) for processing the text and retrieving relevant information,
- Raspberry Pi 5 as the core computational unit,
- Display Module (DM) for visualizing the output text,
characterized in that, the system is integrated with advanced speech-to-text models for dynamically transcribing and displaying the text recognized through the ASR model on the scrolling LED matrix resulting in holographic preview of floating text in a regional language, providing a meaningful text representation.
2. The System as claimed in claim 1, wherein Noise Reduction Module (NRM) processes the speech inputs through the deployed RNNoise and DeepFilterNet techniques to suppress background noise while preserving key speech components.
3. The System (S) as claimed in claim 1, wherein transcription of speech-to-text is achieved by integrating Whisper, Wav2Vec2, and a Retrieval Augmented Generation (RAG) module.
4. The System as claimed in claim 1, wherein the text is transcribed via MAX7219 LED dot matrix to display the text on screen, in real-time for visibility and accessibility.
5. The System (S) as claimed in claim 1, wherein the Whisper employs a Transformer-based encoder-decoder architecture with a built-in language model for capturing speech patterns and translating rare words.
6. The System (S) as claimed in claim 1, wherein the audio input is transformed into 80-dimensional Log-Mel Spectrogram for preserving speech details and reducing noise.
7. The System (S) as claimed in claim 1, wherein the Wav2Vec2 model employs a Transformer-based context network over convolutional feature encoders to learn contextual speech representations, enabling robust modeling of speech patterns and improved recognition of underrepresented phonetic units.
8. The System (S) as claimed in claim 1, wherein the RAG model uses the MiniLM model for sentence transformation from English to Hindi.
9. The system as claimed in claim 1, wherein the transcribed speech is displayed as scrolling text on an LED display comprising an 8x32 LED Dot Matrix Display.
10. The System (S) as claimed in claim 1, wherein the said System (S) is designed to transcribe English and Hindi speech into text by the RAG and ASR modules and displaying the transcribed speech as a scrolling text.
11. The System (S) as claimed in claim 1, wherein rare words can be translated accurately by the decoder employing Byte-Pair Encoding (BPE) and beam search (num_beams=5).
12. A method of real-time automatic speech recognition by the System (S) as claimed in claim 1, comprising the steps of:
- inputting speech through the Noise Reduction Module (NRM) for removing noise and cleaning audio signals,
- applying the de-noised speech to the Speech Recognition Module (SRM) for transcription and Speech-to-text output,
- feeding the extracted text into the Retrieval Augmented Generation Model (RAG) for processing the text by retrieving relevant information to generate an appropriate response,
- sending the output from the RAG Model to a Controller deployed in Raspberry Pi of the Display Module (DM) for driving a display to visualize the final result.

Documents

Application Documents

# Name Date
1 202543077030-STATEMENT OF UNDERTAKING (FORM 3) [13-08-2025(online)].pdf 2025-08-13
2 202543077030-FORM-9 [13-08-2025(online)].pdf 2025-08-13
3 202543077030-FORM FOR SMALL ENTITY(FORM-28) [13-08-2025(online)].pdf 2025-08-13
4 202543077030-FORM 18 [13-08-2025(online)].pdf 2025-08-13
5 202543077030-FORM 1 [13-08-2025(online)].pdf 2025-08-13
6 202543077030-FIGURE OF ABSTRACT [13-08-2025(online)].pdf 2025-08-13
7 202543077030-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [13-08-2025(online)].pdf 2025-08-13
8 202543077030-EVIDENCE FOR REGISTRATION UNDER SSI [13-08-2025(online)].pdf 2025-08-13
9 202543077030-EDUCATIONAL INSTITUTION(S) [13-08-2025(online)].pdf 2025-08-13
10 202543077030-DRAWINGS [13-08-2025(online)].pdf 2025-08-13
11 202543077030-DECLARATION OF INVENTORSHIP (FORM 5) [13-08-2025(online)].pdf 2025-08-13
12 202543077030-COMPLETE SPECIFICATION [13-08-2025(online)].pdf 2025-08-13
13 202543077030-FORM-26 [25-09-2025(online)].pdf 2025-09-25