
System And Method For Speech Recognition Using Machine Transliteration And Transfer Learning

Abstract: A method for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning is disclosed. The method includes a training stage. The training stage includes receiving a training set of a plurality of audio files and an input text corresponding to the audio input in any input language using the speech recognition engine; transliterating the training set to transform the input text into transliterated text that includes characters of a base language; and training an acoustic model with the plurality of audio files and corresponding transliterated text using transfer learning. The method further includes an inference stage. The inference stage includes performing decoding on an output of the trained acoustic model to generate text comprising characters of the base language at inference, and transliterating the generated text to output text comprising characters in the input language using reverse transliteration.


Patent Information

Application #
Filing Date
01 December 2021
Publication Number
51/2022
Publication Type
INA
Invention Field
ELECTRONICS
Status
Email
patents@formulateip.com
Parent Application
Patent Number
Legal Status
Grant Date
2024-05-26
Renewal Date

Applicants

TALENT UNLIMITED ONLINE SERVICES PRIVATE LIMITED
202, S/F 94 Meghdoot Nehru Place, South Delhi Delhi India 110019

Inventors

1. RAHUL PRASAD
122/02 Silver Oaks Apartments, DLF Phase 1 Gurugram Haryana, India, 122002
2. ANKIT PRASAD
L 263, DLF Park Place, DLF Phase 5, Sector 54, Gurugram, Haryana, India, 122002
3. ABHISHEK SHARMA
Flat No. S - 502, Homes 121, Sector 121, Gautam Buddha Nagar, Noida, Uttar Pradesh, India, 201301

Specification

A) FIELD OF THE INVENTION
[0002] The present invention, in general, relates to speech recognition. More particularly, the present invention relates to a system and a method of machine transliteration and transfer learning. Specifically, the present invention describes a system and a method for speech recognition using machine transliteration and transfer learning.

B) BACKGROUND OF THE INVENTION
[0003] Conventional speech-to-text (STT) or voice-to-text conversions for a plurality of input languages require a plurality of acoustic models and language models. These conversions require training of multiple, different artificial intelligence (AI) models from scratch, which is time consuming. That is, conventional speech recognition methods require training of a plurality of acoustic models and use of a plurality of language models for STT conversions. In some implementations, all words that are not in the first script are transliterated into the first script. The existing methods include accessing a set of data indicating language examples for a first script, wherein at least some of the language examples include words in the first script and words in one or more other scripts; the method then transliterates at least portions of some of the language examples to the first script to generate a training data set having words transliterated into the first script.
[0004] Hence, there is a need for reducing the time required for training these models and for using transfer learning to create an acoustic model for any language from a single acoustic model, for example, an acoustic model that is pre-trained on English data, to transcribe a plurality of languages. Conventional sequence-to-sequence speech transcription and recognition models are based on recurrent neural network (RNN) models with complex deep neural network layers. Training these RNN models is time intensive and expensive to implement.
[0005] Hence, there is a long-felt need for a system and a method for converting speech in any input language into text using machine transliteration and transfer learning, while addressing the above-recited problems associated with the related art.

C) SUMMARY OF THE INVENTION
[0006] This summary is provided to introduce a selection of concepts in a simplified form that are further disclosed in the detailed description. This summary is not intended to determine the scope of the claimed subject matter.
[0007] The present invention addresses the above-recited need for a system and a method for converting speech in one of a plurality of input languages, for example, Hindi, into text comprising, for example, Devanagari characters, using machine transliteration and transfer learning. The method disclosed herein employs an artificial intelligence (AI)-based speech recognition engine executable by at least one processor for converting speech in any input language into text using machine transliteration and transfer learning.
[0008] In an aspect, a processor-implemented method for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning is provided. The method includes a training stage. The training stage includes receiving a training set of a plurality of audio files and an input text corresponding to the audio input in any input language using the speech recognition engine. The training stage further includes transliterating the training set to transform the input text into transliterated text comprising characters of a base language, and training an acoustic model with the plurality of audio files and corresponding transliterated text using transfer learning. The method further includes an inference stage. The inference stage includes performing decoding on an output of the trained acoustic model to generate text comprising characters of the base language, and transliterating the generated text to output text comprising characters in the input language using reverse transliteration.
[0009] According to an embodiment, transliterating includes transliterating languages using a speech recognition engine.
[0010] According to an embodiment, the method further includes sanitizing the data during the preparation of the data set using the speech recognition engine.
[0011] According to an embodiment, the sanitization includes removal of at least one of duplication of a plurality of audio files and corresponding transcript text and/or blanks in the text.
[0012] According to an embodiment, sanitization includes removing audio-transcript pairs where audios or transcripts are repeated, keeping only one such pair selected randomly, wherein repeated audios are identified using a checksum. If a plurality of audio-transcript pairs is present with the same transcript, only one such audio-transcript pair is kept in the data. Sanitization also includes removing audio-transcript pairs where the audio is noisy. The noisy audio is detected by identifying audios which have too much noise, either programmatically or manually.
[0013] According to an embodiment, the transliterating is the process of converting text of any language written in one script to another script.
[0014] According to an embodiment, the transliterating is achieved using a plurality of methodologies comprising at least one of rule-based machine transliteration, Finite-state Transducers (FST) based transliteration, AI based transliteration, neural model-based transliteration, and mapping-based transliteration.
[0015] According to an embodiment, the rule-based transliteration is performed using a collection of rules stored on the disk in a key-pair format and processed using the Python programming language to transform text in one script to another using defined rules.
[0016] According to an embodiment, the rules are one of phrase-based, word-based, sub-word-based, and character-based rules.
[0017] According to an embodiment, the rule-based machine transliteration includes executing phrase-based replacements followed by at least one of word-based, sub-word-based, and character-based replacements, and discarding transcripts with characters left from the initial script.
[0018] According to an embodiment, the trained acoustic models are used to perform speech-to-text conversion for a plurality of languages.
[0019] According to an embodiment, the speech recognition engine trains an acoustic model using transfer learning over a pre-trained acoustic model of the base language, with the plurality of audio files and corresponding transliterated text in the base language characters.
[0020] According to an embodiment, the pre-trained acoustic model is trained on a plurality of datasets of the base language.
[0021] According to an embodiment, the pre-trained acoustic model is trained on a plurality of datasets of the base language and wherein the model trained using transfer learning over pre-trained acoustic model learns optimally when the characters in the transcript are from the base language itself.
[0022] According to an embodiment, the transfer learning is a machine learning method which reuses a pre-trained acoustic model developed for converting speech in a base language as a starting point for training an acoustic model for converting speech in an input language.
[0023] According to an embodiment, the transfer learning is a technique of training machine learning models in which a pre-trained model or checkpoint is used to assign starting weights for model training.
[0024] According to an embodiment, a checkpoint is a collection of model weights which is read programmatically, and the weights are used as the starting model weights for the model to be trained.
[0025] According to an embodiment, the speech recognition engine executes either the beam search decoding algorithm or another functionally equivalent decoding algorithm on the output of the trained acoustic model to increase the accuracy of the generated text which includes characters of the base language.
[0026] According to an embodiment, the beam search decoding is a method of extracting an output sequence from ML models in which, instead of picking the single output at each time step with the maximum score or probability given by the ML model, a plurality of alternatives is selected for the output sequence at each timestep based on conditional probability.
[0027] According to an embodiment, the number of alternatives at each timestep is called beam width.
[0028] According to an embodiment, the implementation is performed in the Python programming language, and the output of the acoustic model is used as its input in this case.
[0029] According to an embodiment, the probabilities which are returned by the acoustic model are used as the input to beam search decoding along with a language model to return a certain number of alternatives and the one with max conditional probability is picked.
[0030] In another aspect, a system for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning is provided. The system includes a memory unit including one or more executable modules and a processor configured to execute the one or more executable modules for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning. The one or more executable modules include a data reception module that receives a training set including a plurality of audio files and corresponding transcript texts in any input language. A data transformation module transforms the input text into transliterated text comprising characters of a base language. A training module trains an acoustic model with the plurality of audio files and corresponding transliterated text. An inference module performs decoding on an output of the trained acoustic model to generate text comprising characters of the base language. A database stores the plurality of audio files received as speech input for speech-to-text conversion and a corpus containing large datasets of curated and augmented texts.
[0031] According to an embodiment, the inference module is further configured to improve the accuracy of the generated text comprising characters of the base language.
[0032] According to an embodiment, the inference module is further configured for receiving an audio file as input from the user, processing the input audio data through an acoustic model, and generating output text in base language characters that is reverse transliterated to obtain text in the original input language characters through a pre-trained customized language model.
[0033] In one or more embodiments, related systems comprise circuitry and/or programming for effecting the present invention. In an embodiment, the circuitry and/or programming are any combination of hardware, software, and/or firmware configured to implement the present invention depending upon the design choices of a system designer. Also, in an embodiment, various structural elements are employed depending on the design choices of the system designer.

D) BRIEF DESCRIPTION OF THE DRAWINGS
[0034] The foregoing summary, as well as the following detailed description, is better understood when read in conjunction with the appended drawings. For illustrating the present invention, exemplary constructions of the present invention are shown in the drawings. However, the present invention is not limited to the specific methods and components disclosed herein. The description of a method step or a component referenced by a numeral in a drawing is applicable to the description of that method step or component shown by that same numeral in any subsequent drawing herein.
[0035] FIG. 1 illustrates an architectural block diagram of a system for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning, according to an embodiment of the present invention.
[0036] FIG. 2 illustrates a high-level flowchart showing a process flow for converting an audio input in any input language into text at an inference stage, according to an embodiment of the present invention.
[0037] FIG. 3 illustrates a flowchart of a method for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning, according to an embodiment of the present invention.

E) DETAILED DESCRIPTION OF THE EMBODIMENTS
[0038] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0039] Various embodiments disclosed herein provide a method and a system for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning. The system and method disclosed herein use the pre-trained English acoustic model for training and implementing speech recognition for any language in the world. The use of pre-trained acoustic models reduces the time required for training and developing the acoustic model. To use an acoustic model that is pre-trained on a Latin or English dataset, the present invention transliterates the input received in any input language into Latin or English characters, thereby reducing the training time substantially over training from scratch for a plurality of languages. The present invention precludes the need for training a plurality of acoustic models from scratch, as a single pre-trained English acoustic model is used as the starting point. Moreover, the use of a pre-trained acoustic model is computationally less expensive.
[0040] FIG. 1 illustrates an architectural block diagram of an exemplary implementation of a system 101 for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning, according to an embodiment of the present invention. The system 101 may include a computer system programmable using high-level computer programming languages. The system 101 is an electronic device, for example, one or more of a personal computer, a tablet computing device, a mobile computer, a mobile phone, a smart phone, a portable computing device, a laptop, a personal digital assistant, a wearable computing device such as smart glasses, a smart watch, etc., a touch centric device, a workstation, a client device, a server, a portable electronic device, a network enabled computing device, an interactive network enabled communication device, an image capture device, any other suitable computing equipment, combinations of multiple pieces of computing equipment, etc. In an embodiment, the speech recognition engine 107 is implemented in the system 101 using programmed and purposeful hardware. In an embodiment, the speech recognition engine 107 is a computer-embeddable system that converts speech in any input language into text using machine transliteration and transfer learning.
[0041] In an embodiment, the speech recognition engine 107 is accessible to users at inference, for example, through a broad spectrum of technologies and user devices such as smart phones, tablet computing devices, endpoint devices, etc., with access to a network, for example, a short-range network or a long-range network. The network is, for example, one of the internet, an intranet, a wired network, a wireless network, a network that implements Wi-Fi® of Wi-Fi Alliance Corporation, a mobile telecommunication network, etc., or a network formed from any combination of these networks.
[0042] As illustrated in FIG. 1, the system 101 comprises at least one processor 102 and a non-transitory, computer-readable storage medium, for example, a memory unit 106, for storing computer program instructions defined by modules, for example, 108, 109, 110, 111, etc., of the speech recognition engine 107. In an embodiment, the modules, for example, 108, 109, 110, 111, 112, and the like, of the speech recognition engine 107 are stored in the memory unit 106 as illustrated in FIG. 1. The processor 102 is operably and communicatively coupled to the memory unit 106 for executing the computer program instructions defined by the modules, for example, 108, 109, 110, 111, etc., of the speech recognition engine 107. The processor 102 refers to any one or more microprocessors, central processing unit (CPU) devices, finite state machines, computers, microcontrollers, digital signal processors, logic, a logic device, an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a chip, etc., or any combination thereof, capable of executing computer programs or a series of commands, instructions, or state transitions. The speech recognition engine 107 is not limited to employing the processor 102. In an embodiment, the speech recognition engine 107 employs one or more controllers or microcontrollers.
[0043] As illustrated in FIG. 1, the system 101 comprises a data bus 113, a display unit 103, a network interface 104, and common modules 105. The data bus 113 permits communications between the modules, for example, 102, 103, 104, 105, and 106. The display unit 103, via a graphical user interface (GUI) 103a, displays information, display interfaces, user interface elements such as checkboxes, input text fields, etc., for example, for allowing a user to invoke and execute the speech recognition engine 107 at the inference stage, where the user device 101 receives the audio input from the user for speech-to-text conversion, and perform input actions for triggering various functions to transcribe the audio input into text of the base language.
[0044] The network interface 104 enables connection of the speech recognition engine 107 to the network. The network interface 104 is, for example, one or more of infrared interfaces, interfaces implementing Wi-Fi® of Wi-Fi Alliance Corporation, universal serial bus interfaces, FireWire® interfaces of Apple Inc., interfaces based on transmission control protocol/internet protocol, interfaces based on wireless communications technology such as satellite technology, radio frequency technology, near field communication, etc. The common modules 105 of the system 101 comprise, for example, input/output (I/O) controllers, input devices, output devices, fixed media drives such as hard drives, removable media drives for receiving removable media, etc. Computer applications and programs are used for operating the speech recognition engine 107. The programs are loaded onto fixed media drives and into the memory unit 106 via the removable media drives. In an embodiment, the computer applications and programs are loaded into the memory unit 106 directly via the network.
[0045] In an embodiment, the speech recognition engine 107 comprises modules defining computer program instructions, which when executed by the processor 102, cause the processor 102 to convert speech in any input language into text using machine transliteration and transfer learning. In an embodiment, the modules of the speech recognition engine 107 comprise a data reception module 108, a data transformation module 109, a training module 110, an inference module 111, and a database 112. The database 112 stores, for example, audio files received as speech input for speech-to-text conversion and a corpus containing large datasets of curated and augmented texts. The data reception module 108 receives a training set of a plurality of audio files and input text corresponding to the audio input in any language (for example, Hindi) using the speech recognition engine. The data transformation module 109 transliterates the training set to transform the input text into transliterated text comprising characters of a base language, for example, Latin or English, using any transliteration methodology, for example, rule-based machine transliteration, WFST based transliteration, AI-based machine transliteration, mapping-based transliteration, and the like.
[0046] In an embodiment, the training module 110 trains an acoustic model with the plurality of audio files and corresponding transliterated text based on, for example, transfer learning over a pre-trained acoustic model in the base language. The pre-trained acoustic model is trained on multiple datasets of the base language. The inference module 111 performs decoding, for example, on an output of the trained acoustic model to generate text comprising characters of the base language, for example, Latin or English. In an embodiment, the inference module 111 improves the accuracy of the generated text comprising characters of the base language, for example, Latin or English, by using a pre-trained customized language model such as a pre-trained Hindi language model with Latin characters. The generated text is transliterated by the inference module 111 to output text comprising characters in the input language, for example, Devanagari characters, using reverse transliteration with any transliteration methodology, for example, rule-based machine transliteration, WFST based transliteration, AI-based machine transliteration, mapping-based transliteration, etc. Further, the inference module 111 executes the inference stage disclosed in the detailed description of FIG. 2, where the inference module 111 receives an audio file as input from the user, processes the input audio data through the acoustic model, and in an embodiment, through the pre-trained customized language model, and generates output text in characters of a base language, which is reverse transliterated to obtain text in the original input language.
[0047] The data reception module 108, the data transformation module 109, the training module 110, and the inference module 111 are disclosed above as software executed by the processor 102. In an embodiment, the modules, for example, 108, 109, 110, and 111, of the speech recognition engine 107 are implemented completely in hardware. In another embodiment, the modules 108, 109, 110, and 111 of the speech recognition engine 107 are implemented by logic circuits to carry out their respective functions disclosed above. In another embodiment, the speech recognition engine 107 is also implemented as a combination of hardware and software including one or more processors, for example, 102, that are used to implement the modules, for example, 108, 109, 110, and 111, of the speech recognition engine 107. The processor 102 retrieves instructions defined by the data reception module 108, the data transformation module 109, the training module 110, and the inference module 111 from the memory unit 106 for performing the respective functions disclosed above. The non-transitory, computer-readable storage medium disclosed herein stores computer program instructions executable by the processor 102 for converting speech in any input language into text using machine transliteration and transfer learning.
[0048] FIG. 2 illustrates a high-level flowchart showing a process flow for converting an audio input in any input language into text at an inference stage, according to an embodiment of the present invention. As illustrated in FIG. 2, the speech recognition engine receives the audio input 201 and applies the audio input 201 to the acoustic model 202, which transcribes text in base language characters. In one embodiment, the speech recognition engine improves the accuracy of the transcribed text by using the pre-trained customized language model 203 while decoding. In an additional embodiment, the speech recognition engine is configured to perform beam search decoding to increase the accuracy of the generated transcribed text. At the inference stage 204, the speech recognition engine performs reverse transliteration to generate output text data 205 in the original input language.
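By way of a non-limiting illustration, the following Python sketch mirrors the process flow of FIG. 2. The callables acoustic_model, decode, and reverse_transliterate are hypothetical placeholders (they are not named in the specification) standing in for the acoustic model 202, the decoding step with the optional language model 203, and the reverse transliteration producing output text data 205, respectively.

```python
from typing import Callable, Sequence

def transcribe(audio: bytes,
               acoustic_model: Callable[[bytes], Sequence[Sequence[float]]],
               decode: Callable[[Sequence[Sequence[float]]], str],
               reverse_transliterate: Callable[[str], str]) -> str:
    """Hypothetical end-to-end inference flow corresponding to FIG. 2."""
    # The acoustic model (202) emits per-timestep probabilities over base-language characters.
    char_probs = acoustic_model(audio)
    # Decoding (e.g. beam search, optionally assisted by a language model 203) yields base-language text.
    base_language_text = decode(char_probs)
    # Reverse transliteration maps the base-language text back to the input-language script (output 205).
    return reverse_transliterate(base_language_text)
```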
[0049] FIG. 3 illustrates a flowchart of a method for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning, according to an embodiment of the present invention. The method disclosed herein employs an artificial intelligence (AI)-based speech recognition engine executable by at least one processor for converting speech in an input language into text. For purposes of illustration, the detailed description refers to a speech or audio input in an input language, for example, Hindi, being converted into text comprising Devanagari characters; however, the scope of the method and the system disclosed herein is not limited to the input language being Hindi but may be extended to include any other language. The speech recognition engine is configured to transliterate text in any input language to text comprising characters of a base language. In one embodiment, the speech recognition engine is configured to transliterate text in any input language to text comprising characters of the Latin or English language.
[0050] In the method disclosed herein, at step 302, the training stage begins. At step 304, a speech recognition engine receives a training set including pairs of audio files and transcript text (or input text) corresponding to the audio files in any input language, for example, Hindi. The input text comprises characters of the input language, for example, Devanagari characters. At step 306, the speech recognition engine transliterates the transcript text into transliterated text comprising characters of a base language, for example, Latin or English, using any transliteration methodology, for example, rule-based machine transliteration, WFST based transliteration, AI-based machine transliteration, mapping-based transliteration, and the like.
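As a non-limiting sketch of step 306, the snippet below rewrites each (audio file, Devanagari transcript) pair of the training set as an (audio file, Latin transcript) pair. The transliterate_to_latin function and its toy word-level lookup are illustrative assumptions only; any of the methodologies listed above may be substituted.

```python
def transliterate_to_latin(text: str) -> str:
    # Toy word-level lookup used purely for illustration; a real system would apply
    # a full rule set, an FST, a mapping table, or a trained transliteration model.
    toy_map = {"क्या": "kya", "कर": "kar", "रहे": "rahe", "हो": "ho"}
    return " ".join(toy_map.get(word, word) for word in text.split())

# Each training pair keeps its audio file but gets a base-language (Latin) transcript.
training_set = [("clip_0001.wav", "क्या कर रहे हो")]
transliterated_set = [(audio, transliterate_to_latin(text)) for audio, text in training_set]
print(transliterated_set)  # [('clip_0001.wav', 'kya kar rahe ho')]
```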
[0051] In an embodiment, instead of a phonetic language, the speech recognition engine is configured to transliterate graphical languages, for example, Mandarin, into similar sounding words or characters. In one embodiment, instead of transliteration into Latin or English characters, the speech recognition engine is configured to transliterate into characters of any other base language and train an AI model over the base language's pre-trained acoustic model, thereby providing ample possibilities to do transfer learning over pre-trained acoustic models of various languages and then using the trained acoustic models to perform speech-to-text conversion for a plurality of languages. In one embodiment, said any other base language may be a language other than a graphical language, so that acoustic models pre-trained in various languages are trained and those trained acoustic models are then used to perform speech-to-text conversion for a plurality of languages.
[0052] At step 308, the speech recognition engine trains an acoustic model using transfer learning over a pre-trained acoustic model of the base language, with the audio files and corresponding transliterated text in the base language characters. The pre-trained acoustic model is trained on multiple datasets of the base language. For example, the acoustic model is pre-trained on Latin characters in the English language. Since the pre-trained acoustic model is trained over Latin characters or English alphabets, in an example, the speech recognition engine performs the above-mentioned machine transliteration for converting Devanagari characters into Latin characters or English alphabets to allow training of the acoustic model using transfer learning. For example, when the speech recognition engine receives an audio clip and a transcription of the audio clip recorded in Devanagari characters such as "क्या कर रहे हो", the speech recognition engine converts the Devanagari transcription into the Latin characters or English alphabets "Kya kar rahe ho" using any method of transliteration, thereby allowing transfer learning over the pre-trained Latin or English-based acoustic model with the transliterated Latin or English character transcripts and corresponding Hindi audios. In one embodiment, the speech recognition engine is configured to sanitize the data during the preparation of the data set. The data sanitization includes removal of at least one of duplication of audio files and corresponding transcript text or blanks in the text. In an embodiment, the data sanitization involves removing at least one of: duplicate audios, duplicate transcripts, and noisy audio. Any audios which are repeated are removed by using a checksum to identify identical audios. If there are multiple audios with the same transcript, only one of them is kept in the data. Audios which have too much noise are identified either programmatically or manually and excluded from the data. In some embodiments, the data sanitization may include any kind of manipulation of the training set to improve results.
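A minimal sketch of the sanitization described above is given below, assuming the training data is available as (audio path, transcript, noise flag) tuples; the noise flag stands in for any programmatic or manual noise check, and the sketch keeps the first occurrence of a repeated audio or transcript rather than a randomly selected one.

```python
import hashlib
from pathlib import Path

def sanitize(pairs):
    """Drop repeated audios (identified by checksum), keep only one pair per
    repeated transcript, and exclude pairs whose audio is flagged as noisy."""
    seen_checksums, seen_transcripts, clean = set(), set(), []
    for audio_path, transcript, is_noisy in pairs:
        if is_noisy:
            continue                                   # noisy audio is excluded
        checksum = hashlib.md5(Path(audio_path).read_bytes()).hexdigest()
        if checksum in seen_checksums:                 # repeated audio
            continue
        if transcript in seen_transcripts:             # repeated transcript
            continue
        seen_checksums.add(checksum)
        seen_transcripts.add(transcript)
        clean.append((audio_path, transcript))
    return clean
```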
[0053] In the method disclosed herein, the acoustic model training uses a pre-trained acoustic model of the base language for multiple other input languages, thereby reducing training time. The pre-trained acoustic model learns optimally when exposed to its own base language characters. In one embodiment, the acoustic model is a standardized model that is pre-trained on Latin or English characters, and during transfer learning it learns optimally when exposed to text comprising characters of the Latin or English language, thereby increasing the accuracy of training. Transfer learning is a machine learning method that reuses a pre-trained acoustic model developed for converting speech in a base language, for example, Latin or English, into text, as a starting point for training an acoustic model for converting speech in an input language, for example, Hindi, into text. Transfer learning is a technique of training machine learning models in which a pre-trained model or checkpoint is used to assign starting weights for model training. A checkpoint is a collection of model weights which is read programmatically, and these weights are then used as the starting model weights for the model to be trained.
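The warm start from a checkpoint can be sketched as follows. The specification does not name a training framework; PyTorch and the file name english_acoustic_model.ckpt are assumptions made only for illustration.

```python
import torch

def load_pretrained_weights(model: torch.nn.Module, checkpoint_path: str) -> torch.nn.Module:
    """Read a checkpoint (a collection of model weights) programmatically and use it
    as the starting weights for the model to be trained, as described above."""
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict)  # base-language weights become the starting point
    return model

# acoustic_model = load_pretrained_weights(acoustic_model, "english_acoustic_model.ckpt")
# ...then continue training on (audio, Latin-transliterated transcript) pairs of the input language.
```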
[0054] At step 310, inference is performed. At step 312, the speech recognition engine receives an audio input, passes the audio input through the trained acoustic model, and then performs decoding on the output of the trained acoustic model to generate text comprising characters of the base language, for example, Latin or English. In an embodiment, the speech recognition engine additionally executes the beam search decoding algorithm or any other functionally equivalent decoding algorithm on the output of the trained acoustic model to increase the accuracy of the generated text that comprises characters of the base language, for example, Latin or English. Beam search decoding is a method of extracting an output sequence from ML models in which, instead of picking the single output at each time step with the maximum score or probability given by the ML model, the system selects multiple alternatives for the output sequence at each timestep based on conditional probability. The number of alternatives at each timestep is called the beam width. The implementation is performed in the Python programming language, and the output of the acoustic model is used as its input in this case. The probabilities which are returned by the acoustic model are used as the input to beam search decoding along with a language model to return a certain number of alternatives, and the one with the maximum conditional probability is picked. In an embodiment, the speech recognition engine improves the inference accuracy of the generated text comprising characters of the base language, for example, Latin or English, by using a pre-trained customized language model, for example, a pre-trained Hindi customized language model over English characters.
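A simplified beam search over the per-timestep character probabilities returned by the acoustic model might look as follows; language-model rescoring and CTC-style collapsing of repeats and blanks, which a production decoder would add, are omitted for brevity, so this is a sketch of the technique rather than the engine's actual decoder.

```python
import math
from typing import Sequence

def beam_search_decode(probs: Sequence[Sequence[float]],
                       alphabet: str,
                       beam_width: int = 3) -> str:
    """Keep the `beam_width` best partial sequences by cumulative log-probability at
    each timestep instead of greedily taking the single most probable character."""
    beams = [("", 0.0)]                               # (partial text, log-probability)
    for step_probs in probs:
        candidates = [
            (text + alphabet[i], score + math.log(p + 1e-12))
            for text, score in beams
            for i, p in enumerate(step_probs)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]               # retain `beam_width` alternatives
    return beams[0][0]                                # alternative with maximum score

# Example: three timesteps over a toy two-character alphabet.
print(beam_search_decode([[0.1, 0.9], [0.8, 0.2], [0.6, 0.4]], alphabet="ka", beam_width=2))
```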
[0055] At step 314, the speech recognition engine then transliterates the generated text of the input language in the base language characters, for example, English or Latin, to output text comprising characters in the input language, for example, Devanagari characters, using reverse transliteration with any transliteration methodology, for example, rule-based machine transliteration, WFST based transliteration, AI-based machine transliteration, mapping-based transliteration, etc. A detailed description of the flow of converting an audio input in any input language into text at an inference stage from a user perspective through a user device is given in FIG. 2. Transliteration is the process of converting text of any language written in one script to another script. There can be multiple methodologies to achieve the same, for example, rule-based machine transliteration, FST based transliteration, mapping-based transliteration, and the like. A rule-based transliteration system is a collection of rules stored on the disk in a key-pair format and processed using the Python programming language to transform text in one script to another using defined rules. These rules can be sub-word based, phrase based, or character based. First, the phrase-based replacements are executed, subsequently the word-based replacements are executed, then the sub-word based replacements are executed, and after that the character-based replacements are executed. Any transcripts which still have some characters left from the initial script are discarded.
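A minimal sketch of such a rule-based transliterator is shown below. The tiny rule tables are illustrative assumptions only (real rule sets are loaded from disk), and running the same machinery with an inverted rule set gives the reverse transliteration of step 314.

```python
import re

# Key-pair rules applied in the stated order: phrase-based first, then word-based,
# then sub-word-based, then character-based. A transcript that still contains
# characters of the initial script afterwards is discarded (returned as None).
PHRASE_RULES  = {"क्या कर रहे हो": "kya kar rahe ho"}
WORD_RULES    = {"क्या": "kya", "कर": "kar"}
SUBWORD_RULES = {"रहे": "rahe"}
CHAR_RULES    = {"ह": "h", "ो": "o"}

DEVANAGARI = re.compile(r"[\u0900-\u097F]")          # character range of the initial script

def rule_based_transliterate(text: str):
    for rules in (PHRASE_RULES, WORD_RULES, SUBWORD_RULES, CHAR_RULES):
        for source, target in rules.items():
            text = text.replace(source, target)
    return None if DEVANAGARI.search(text) else text  # discard leftovers from the initial script

print(rule_based_transliterate("क्या कर रहे हो"))      # -> 'kya kar rahe ho'
```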
[0056] The present invention is based on an AI-based speech recognition engine for conversion of speech-to-text or voice-to-text. The present invention transliterates any language received as multiple language character input into characters of a base language, for example, Latin characters, and uses the Latin character transcripts along with the audios for training AI models for speech recognition. The present invention transcribes speech-to-text for any language in the world, thereby providing a convenient conversational platform that makes communication effortless. The present invention performs pre-processing in the form of machine transliteration of characters of any input language, for example, Devanagari characters, into Latin characters, on the training set before training the acoustic models. The present invention then performs training of the acoustic model in characters of a base language, for example, Latin or English characters, using transfer learning on a pre-trained English acoustic model.
[0057] The transcription from the trained Latin or English-based acoustic model is in Latin or English characters, which are then converted into Devanagari characters using reverse transliteration. The present invention uses the pre-trained English acoustic model for training and implementing speech recognition for any language in the world. The use of pre-trained acoustic models reduces the time required for training and developing the acoustic model. To use an acoustic model that is pre-trained on a Latin or English dataset, the present invention transliterates the training set transcripts in any input language into Latin or English characters, thereby reducing the training time over training from scratch for a plurality of languages. The present invention precludes the need for training a plurality of acoustic models from scratch, as a single pre-trained English acoustic model is used as the starting point. Moreover, the use of a pre-trained acoustic model is computationally less expensive.
[0058] The present invention has multiple applications involving speech-to-text or voice-to-text conversions in Hindi, other Indic languages, or any other language spoken in the world. The present invention can be used by third parties, research industries, firms or academic institutions working on speech recognition, businesses requiring data-driven strategies, research-based industries, software sectors, cloud-based companies, AI-based conversation media entities, etc. The present invention precludes the need for investing substantial amounts of money, time, and human resources on building AI models for speech recognition for multiple languages.
[0059] The foregoing examples and illustrative implementations of various embodiments have been provided merely for explanation and are in no way to be construed as limiting the present invention. While the present invention has been described with reference to various embodiments, illustrative implementations, drawings, and techniques, it is understood that the words which have been used herein are words of description and illustration, rather than words of limitation. Furthermore, although the present invention has been described herein with reference to particular means, materials, embodiments, techniques, and implementations, the present invention is not intended to be limited to the particulars disclosed herein; rather, the present invention extends to all functionally equivalent structures, methods, and uses, such as are within the scope of the appended claims. It will be understood by those skilled in the art, having the benefit of the teachings of this specification, that the present invention is capable of modifications and other embodiments may be effected and changes may be made thereto, without departing from the scope and spirit of the present invention.

I/We Claim:

1. A processor implemented method for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning, the method comprising steps of:
providing a training stage comprising:
receiving a training set of a plurality of audio files and an input text corresponding to the audio input in any input language using a speech recognition engine;
transliterating the training set to transform the input text into transliterated text comprising characters of a base language; and
training an acoustic model with the plurality of audio files and corresponding transliterated text using transfer learning; and
providing an inference stage;
performing decoding on an output of the trained acoustic model to generate text comprising characters of the base language at the inference stage; and
transliterating the generated text to output text comprising characters in the input language using reverse transliteration, and wherein transliterating comprises transliterating using a speech recognition engine.

2. The processor implemented method of claim 1, further comprising sanitizing the data during the preparation of the data set using the speech recognition engine, and wherein the sanitization comprises removal of at least one of duplication of the plurality of audio files and corresponding transcript text and/or blanks in the text.

3. The processor implemented method of claim 1, wherein sanitization comprises:
removing the audio transcript pairs, where audios or transcripts are repeated and keeping only one such pair selected randomly, and repeated audios are identified using checksum;
removing the audio transcript pairs if a plurality of audio transcript pairs is present with the same transcript, by keeping only one such audio transcript pair in the data; and
removing audio transcript pairs where the audio is noisy, and wherein noisy audio is detected by identifying audios which have too much noise either programmatically or manually.

4. The processor implemented method of claim 1, wherein the transliterating is the process of converting text of any language written in one script to another script, and wherein the transliterating is achieved using a plurality of methodologies comprising at least one of rule-based machine transliteration, Finite-state Transducers (FST) based transliteration, AI based transliteration, neural model-based transliteration, and mapping-based transliteration.

5. The processor implemented method of claim 4, wherein the rule-based transliteration is performed using a collection of rules stored on the disk in a key-pair format and processed using the Python programming language to transform text in one script to another using defined rules, and wherein the rules are one of sub-word based, word based, phrase based, and character based.

6. The processor implemented method of claim 5, wherein the rule-based machine transliteration comprises:
executing the phrase-based replacements comprising at least one of word based, sub-word based and character-based replacements; and
discarding the characters left from the initial script of the transcripts.
7. The processor implemented method of claim 1, wherein the trained acoustic models are used to perform speech-to-text conversion for a plurality of languages.

8. The processor implemented method of claim 1, wherein the speech recognition engine trains an acoustic model using transfer learning over a pre-trained acoustic model of the base language with the plurality of audio files and corresponding transliterated text in the base language characters.

9. The processor implemented method of claim 8, wherein the pre-trained acoustic model is trained on a plurality of datasets of the base language, and wherein the model trained using transfer learning over the pre-trained acoustic model learns optimally when the characters in the transcript are from the base language itself.

10. The processor implemented method of claim 1, wherein the transfer learning is a machine learning method which reuses a pre-trained acoustic model developed for converting speech in a base language as a starting point for training an acoustic model for converting speech in an input language.

11. The processor implemented method of claim 10, wherein the transfer learning is a technique of training machine learning models in which a pre-trained model or checkpoint is used to assign starting weights for model training.

12. The processor implemented method of claim 10, wherein a checkpoint is a collection of model weights which is read programmatically, and the weights are used as the starting model weights for the model to be trained.
13. The processor implemented method of claim 1, wherein the speech recognition engine executes either the beam search decoding algorithm or another functionally equivalent decoding algorithm on the output of the trained acoustic model to increase the accuracy of the generated text which comprises characters of the base language, and wherein the beam search decoding is a method of extracting an output sequence from ML models in which, instead of picking the single output at each time step with the maximum score or probability given by the ML model, a plurality of alternatives is selected for the output sequence at each timestep based on conditional probability, and wherein the number of alternatives at each timestep is called beam width, and wherein the output of the acoustic model is used as the input to it in this case.

14. The processor implemented method of claim 13, wherein the probabilities which are returned by the acoustic model are used as the input to beam search decoding along with a language model to return a certain number of alternatives, and the one with the maximum conditional probability is picked.

15. The processor implemented method of claim 1, wherein the speech recognition engine transliterates the generated text of the input language in the base language characters, to output text comprising characters in the input language, using reverse transliteration using any transliteration methodology.

16. A system for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning, the system comprising:
a memory unit comprising an artificial intelligence (AI) based speech recognition engine comprising one or more executable modules; and
a processor configured to execute the one or more executable modules for converting speech in one of a plurality of input languages into text using machine transliteration and transfer learning, the one or more executable modules comprising:
a data reception module that receives a training set including a plurality of audio files and corresponding transcript texts in any input language;
a data transformation module that transforms the input text into transliterated text comprising characters of a base language;
a training module that trains an acoustic model with the plurality of audio files and corresponding transliterated text;
an inference module that performs decoding on an output of the trained acoustic model to generate text comprising characters of the base language; and
a database that stores the plurality of audio files received as speech input for speech-to-text conversion and a corpus containing large datasets of curated and augmented texts;
wherein the inference module is configured to improve the accuracy of the generated text comprising characters of the base language through a pre-trained customized language model, and wherein the inference module is configured for receiving an audio file as input from the user;
processing the input audio data through an acoustic model; and
generating output text data in a base language that is reverse transliterated to obtain text in the original input language characters.

Documents

Application Documents

# Name Date
1 202111055740-PROVISIONAL SPECIFICATION [01-12-2021(online)].pdf 2021-12-01
2 202111055740-POWER OF AUTHORITY [01-12-2021(online)].pdf 2021-12-01
3 202111055740-FORM FOR STARTUP [01-12-2021(online)].pdf 2021-12-01
4 202111055740-FORM FOR SMALL ENTITY(FORM-28) [01-12-2021(online)].pdf 2021-12-01
5 202111055740-FORM 1 [01-12-2021(online)].pdf 2021-12-01
6 202111055740-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [01-12-2021(online)].pdf 2021-12-01
7 202111055740-EVIDENCE FOR REGISTRATION UNDER SSI [01-12-2021(online)].pdf 2021-12-01
8 202111055740-DRAWINGS [01-12-2021(online)].pdf 2021-12-01
9 202111055740-DECLARATION OF INVENTORSHIP (FORM 5) [01-12-2021(online)].pdf 2021-12-01
10 202111055740-Request Letter-Correspondence [24-11-2022(online)].pdf 2022-11-24
11 202111055740-Power of Attorney [24-11-2022(online)].pdf 2022-11-24
12 202111055740-FORM28 [24-11-2022(online)].pdf 2022-11-24
13 202111055740-Form 1 (Submitted on date of filing) [24-11-2022(online)].pdf 2022-11-24
14 202111055740-Covering Letter [24-11-2022(online)].pdf 2022-11-24
15 202111055740-Proof of Right [01-12-2022(online)].pdf 2022-12-01
16 202111055740-OTHERS [01-12-2022(online)].pdf 2022-12-01
17 202111055740-FORM FOR SMALL ENTITY [01-12-2022(online)].pdf 2022-12-01
18 202111055740-FORM 3 [01-12-2022(online)].pdf 2022-12-01
19 202111055740-EVIDENCE FOR REGISTRATION UNDER SSI [01-12-2022(online)].pdf 2022-12-01
20 202111055740-ENDORSEMENT BY INVENTORS [01-12-2022(online)].pdf 2022-12-01
21 202111055740-DRAWING [01-12-2022(online)].pdf 2022-12-01
22 202111055740-CORRESPONDENCE-OTHERS [01-12-2022(online)].pdf 2022-12-01
23 202111055740-COMPLETE SPECIFICATION [01-12-2022(online)].pdf 2022-12-01
24 202111055740-FORM-9 [13-12-2022(online)].pdf 2022-12-13
25 202111055740-MSME CERTIFICATE [14-12-2022(online)].pdf 2022-12-14
26 202111055740-FORM28 [14-12-2022(online)].pdf 2022-12-14
27 202111055740-FORM 18A [14-12-2022(online)].pdf 2022-12-14
28 202111055740-FER.pdf 2023-01-04
29 202111055740-RELEVANT DOCUMENTS [04-07-2023(online)].pdf 2023-07-04
30 202111055740-RELEVANT DOCUMENTS [04-07-2023(online)]-1.pdf 2023-07-04
31 202111055740-POA [04-07-2023(online)].pdf 2023-07-04
32 202111055740-POA [04-07-2023(online)]-1.pdf 2023-07-04
33 202111055740-OTHERS [04-07-2023(online)].pdf 2023-07-04
34 202111055740-MARKED COPIES OF AMENDEMENTS [04-07-2023(online)].pdf 2023-07-04
35 202111055740-MARKED COPIES OF AMENDEMENTS [04-07-2023(online)]-1.pdf 2023-07-04
36 202111055740-FORM 13 [04-07-2023(online)].pdf 2023-07-04
37 202111055740-FORM 13 [04-07-2023(online)]-1.pdf 2023-07-04
38 202111055740-FER_SER_REPLY [04-07-2023(online)].pdf 2023-07-04
39 202111055740-CORRESPONDENCE [04-07-2023(online)].pdf 2023-07-04
40 202111055740-COMPLETE SPECIFICATION [04-07-2023(online)].pdf 2023-07-04
41 202111055740-CLAIMS [04-07-2023(online)].pdf 2023-07-04
42 202111055740-AMMENDED DOCUMENTS [04-07-2023(online)].pdf 2023-07-04
43 202111055740-AMMENDED DOCUMENTS [04-07-2023(online)]-1.pdf 2023-07-04
44 202111055740-ABSTRACT [04-07-2023(online)].pdf 2023-07-04
45 202111055740-FORM 3 [09-11-2023(online)].pdf 2023-11-09
46 202111055740-US(14)-HearingNotice-(HearingDate-23-02-2024).pdf 2024-02-06
47 202111055740-Correspondence to notify the Controller [16-02-2024(online)].pdf 2024-02-16
48 202111055740-Written submissions and relevant documents [07-03-2024(online)].pdf 2024-03-07
49 202111055740-FORM 3 [07-03-2024(online)].pdf 2024-03-07
50 202111055740-RELEVANT DOCUMENTS [11-03-2024(online)].pdf 2024-03-11
51 202111055740-POA [11-03-2024(online)].pdf 2024-03-11
52 202111055740-PETITION UNDER RULE 137 [11-03-2024(online)].pdf 2024-03-11
53 202111055740-MARKED COPIES OF AMENDEMENTS [11-03-2024(online)].pdf 2024-03-11
54 202111055740-FORM 13 [11-03-2024(online)].pdf 2024-03-11
55 202111055740-AMMENDED DOCUMENTS [11-03-2024(online)].pdf 2024-03-11
56 202111055740-PatentCertificate26-05-2024.pdf 2024-05-26
57 202111055740-IntimationOfGrant26-05-2024.pdf 2024-05-26
58 202111055740-PROOF OF ALTERATION [15-09-2025(online)].pdf 2025-09-15
59 202111055740-FORM-26 [15-09-2025(online)].pdf 2025-09-15

Search Strategy

1 202111055740E_04-01-2023.pdf

ERegister / Renewals

3rd: 30 Jul 2024

From 01/12/2023 - To 01/12/2024

4th: 30 Jul 2024

From 01/12/2024 - To 01/12/2025

5th: 20 Nov 2025

From 01/12/2025 - To 01/12/2026

6th: 20 Nov 2025

From 01/12/2026 - To 01/12/2027

7th: 20 Nov 2025

From 01/12/2027 - To 01/12/2028

8th: 20 Nov 2025

From 01/12/2028 - To 01/12/2029

9th: 20 Nov 2025

From 01/12/2029 - To 01/12/2030