
A System And Method For End To End Continual Speech To Speech Translation

Abstract: Disclosed is a system (100) for speech-to-speech translation. The system (100) includes an input unit (102) and processing circuitry (108). The input unit (102) is configured to receive a first speech in a first language. The processing circuitry (108) is configured to combine a current language pair into a pre-stored dataset to generate a combined dataset, such that the pre-stored dataset comprises previous language pairs that are continually trained on speech-to-speech translations. The processing circuitry (108) implements an end-to-end continual speech-to-speech translation model (105) to extract a plurality of acoustic features from the first speech and then generates a series of feature vectors based on the plurality of acoustic features. The processing circuitry (108) then identifies one or more salient features and generates a unit sequence for each salient feature to generate a waveform that corresponds to a second speech in a second language. FIG. 1 is the reference figure.


Patent Information

Application #: 202421050380
Filing Date: 01 July 2024
Publication Number: 36/2024
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Grant Date: 2025-06-11

Applicants

IITI DRISHTI CPS FOUNDATION
IIT Indore, Khandwa Road Simrol, Indore, Madhya Pradesh, 453552, India

Inventors

1. Ankit Malviya
CSE department, IIT Indore, Khandwa Road Simrol, Indore, Madhya Pradesh, 453552, India
2. Aditi Rao S
CSE department, IIT Indore, Khandwa Road Simrol, Indore, Madhya Pradesh, 453552, India
3. Balaram Sarkar
CSE department, IIT Indore, Khandwa Road Simrol, Indore, Madhya Pradesh, 453552, India
4. Chandresh Kumar Maurya
CSE department, IIT Indore, Khandwa Road Simrol, Indore, Madhya Pradesh, 453552, India

Specification

Description:
TECHNICAL FIELD
The present disclosure relates generally to the field of language translations. More specifically, the present disclosure relates to a system and a method for end-to-end continual speech-to-speech translation.
BACKGROUND
In the rapidly globalizing world, effective communication across language barriers remains a significant challenge. Traditional translation methods, such as human interpreters or text-based translation software, often fall short in real-time applications and require substantial resources. Additionally, these methods can be inefficient and impractical in everyday scenarios, such as spontaneous conversations, live broadcasts, or real-time customer support interactions.
Existing speech-to-speech translation systems face several critical limitations. Firstly, they often struggle with maintaining accuracy and fluency across multiple languages, especially when switching between them. This is partly due to the significant variations in linguistic structures, phonetic nuances, and contextual dependencies inherent in different languages. Conventional models also tend to suffer from catastrophic forgetting, a major challenge in the field of machine learning.
The catastrophic forgetting occurs when a machine learning model loses previously acquired knowledge upon learning new information. In the context of speech-to-speech translation, this means that when the model is trained on new language pairs, it may forget how to accurately translate previously learned languages. This happens because the model's parameters are adjusted to optimize performance on the new data, often at the expense of degrading performance on older data. This problem is exacerbated in multilingual models where the need to switch between and retain multiple languages simultaneously is crucial.
Moreover, the training of these models typically requires extensive and diverse datasets, which can be heavily imbalanced with respect to language representation. This imbalance can bias the model towards languages with more abundant data, resulting in suboptimal performance for less-represented languages. Additionally, current approaches often necessitate storing and retraining entire datasets whenever new language pairs are introduced, leading to inefficiencies in terms of computational resources and time.
To address the limitations and challenges, there is a pressing need for a technical solution that seamlessly handles multiple languages, prevents catastrophic forgetting during continual learning, and effectively manages imbalanced datasets for reliable speech-to-speech translation.
SUMMARY
A system and method for end-to-end continual speech-to-speech translation are disclosed. The system includes an input unit and processing circuitry. The input unit is configured to receive a first speech in a first language. The processing circuitry is coupled to the input unit and configured to implement an end-to-end continual speech-to-speech translation model that is configured to extract a plurality of acoustic features from the first speech. The end-to-end speech-to-speech translation model generates a series of feature vectors to capture contextual information from the first speech based on the plurality of acoustic features. The end-to-end speech-to-speech translation model further determines one or more salient features of the feature vectors, such that the one or more salient features represent contextually relevant and linguistically significant features of the first speech. The end-to-end speech-to-speech translation model further generates a unit sequence for each salient feature of the one or more salient features and combines the unit sequence of each salient feature to generate a waveform, such that the waveform corresponds to a second speech in a second language.
In some aspects of the present disclosure, the end-to-end continual speech-to-speech translation model implements one or more attention techniques to capture the contextual information from the first speech.
In some aspects of the present disclosure, the end-to-end continual speech-to-speech translation model predicts a duration prior to the generation of the waveform.
In some aspects of the present disclosure, the system further includes an output unit that is configured to produce the second speech in the second language from the waveform.
In some aspects of the present disclosure, the processing circuitry employs a vocoder technique for waveform generation in the second language.
In some aspects of the present disclosure, to train the end-to-end continual speech-to-speech translation model, the processing circuitry is configured to receive a current language pair, such that the current language pair comprises the first language and the second language. The processing circuitry then combines the current language pair into a pre-stored dataset to generate a combined dataset, such that the pre-stored dataset comprises previous language pairs that are trained on end-to-end continual speech-to-speech translations. The processing circuitry implements a gradient representative sampling (GRS) technique to generate a representative set that approximates the gradient of the language pairs in the combined dataset and selects the language pairs with maximum diversity from the combined dataset. The processing circuitry then implements a combination of a language-balanced sampling technique and a regular random sampling technique to balance the language pairs in the combined dataset.
In an aspect of the present disclosure, a method for end-to-end continual speech-to-speech translation is disclosed. The method begins with receiving a first speech by way of an input unit. The method further includes implementing an end-to-end continual speech-to-speech translation model by way of the processing circuitry. The method further includes extracting a plurality of acoustic features from the first speech by way of the end-to-end continual speech-to-speech translation model. The method further includes generating a series of feature vectors to capture contextual information from the first speech based on the plurality of acoustic features by way of the end-to-end continual speech-to-speech translation model. The method further includes determining one or more salient features of the feature vectors such that the one or more salient features represent contextually relevant and linguistically significant features of the first speech by way of the end-to-end continual speech-to-speech translation model. The method further includes generating a unit sequence for each salient feature of the one or more salient features by way of the end-to-end continual speech-to-speech translation model. The method further includes combining the unit sequence of each salient feature to generate a waveform such that the waveform corresponds to a second speech in a second language by way of the end-to-end continual speech-to-speech translation model. The method concludes with producing the second speech in the second language from the waveform by way of an output unit coupled to the processing circuitry.
In some aspects of the present disclosure, the method further includes implementing one or more attention techniques to capture the contextual information from the first speech by way of the end-to-end continual speech-to-speech translation model.
In some aspects of the present disclosure, the method further includes predicting a duration prior to the generation of the waveform by way of the end-to-end continual speech-to-speech translation model.
In an aspect of the present disclosure, a method for training the end-to-end continual speech-to-speech translation model by processing circuitry is disclosed. The method includes receiving a current language pair by way of the processing circuitry, such that the current language pair comprises the first language and the second language. The method further includes combining the current language pair into a pre-stored dataset to generate a combined dataset by way of the processing circuitry, such that the pre-stored dataset comprises previous language pairs that are trained on end-to-end continual speech-to-speech translations. The method also includes implementing a gradient representative sampling (GRS) technique by way of the processing circuitry to generate a representative set that approximates the gradient of the language pairs in the combined dataset and selecting the language pairs with maximum diversity from the combined dataset. Additionally, the method includes implementing a combination of a language-balanced sampling technique and a regular random sampling technique by way of the processing circuitry to balance the language pairs in the combined dataset.
BRIEF DESCRIPTION OF THE DRAWINGS
The description refers to provided drawings in which similar reference characters refer to similar parts throughout the different views, and in which:
FIG. 1 illustrates a block diagram of a system for end-to-end continual speech-to-speech translation, in accordance with an aspect of the present disclosure;
FIG. 2 illustrates a block diagram of processing circuitry of the system for end-to-end continual speech-to-speech translation, in accordance with an aspect of the present disclosure;
FIG. 3 illustrates a flow chart that depicts a method for end-to-end continual speech-to-speech translation, in accordance with an aspect of the present disclosure; and
FIG. 4 illustrates a flow chart of a method for continually training an end-to-end speech-to-speech translation model of the system for end-to-end continual speech-to-speech translation of FIG. 1, in accordance with an aspect of the present disclosure.
To facilitate understanding, like reference numerals have been used, where possible, to designate like elements common to the figures.
DETAILED DESCRIPTION OF DRAWINGS
Various aspects of the present disclosure provide a system and method for end-to-end continual speech-to-speech translation. The following description provides specific details of certain aspects of the disclosure illustrated in the drawings to provide a thorough understanding of those aspects. It should be recognized, however, that the present disclosure can be reflected in additional aspects and the disclosure may be practiced without some of the details in the following description.
The various aspects including the example aspects are now described more fully with reference to the accompanying drawings, in which the various aspects of the disclosure are shown. The disclosure may, however, be embodied in different forms and should not be construed as limited to the aspects set forth herein. Rather, these aspects are provided so that this disclosure is thorough and complete, and fully conveys the scope of the disclosure to those skilled in the art. In the drawings, the sizes of components may be exaggerated for clarity.
It is understood that when an element is referred to as being “on,” “connected to,” or “coupled to” another element, it can be directly on, connected to, or coupled to the other element or intervening elements that may be present. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
The subject matter of example aspects, as disclosed herein, is described with specificity to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventor/inventors have contemplated that the claimed subject matter might also be embodied in other ways, including different features or combinations of features similar to the ones described in this document, in conjunction with other technologies.
The aspects herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting aspects that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the aspects herein. The examples used herein are intended merely to facilitate an understanding of ways in which the aspects herein may be practiced and to further enable those of skill in the art to practice the aspects herein. Accordingly, the examples should not be construed as limiting the scope of the aspects herein.
FIG. 1 illustrates a block diagram of a system for end-to-end continual speech-to-speech translation, in accordance with an aspect of the present disclosure. The system 100 for end-to-end continual speech-to-speech translation (hereinafter referred to and denoted as “the system 100”) may include an input unit 102, an information processing apparatus 104, and an output unit 106. The information processing apparatus 104 may include processing circuitry 108 and a database 110. The input unit 102, the information processing apparatus 104, and the output unit 106 may be communicatively coupled to each other by way of a communication network 112. The communication network 112 may include suitable logic, circuitry, and interfaces that may be configured to provide a plurality of network ports and a plurality of communication channels for transmission and reception of data related to operations of various entities in the system 100. Each network port may correspond to a virtual address (or a physical machine address) for transmission and reception of the communication data. For example, the virtual address may be an Internet Protocol version 4 (IPv4) address (or an IPv6 address), and the physical address may be a Media Access Control (MAC) address. The communication data may be transmitted or received via the communication protocols. Examples of the communication protocols may include, but are not limited to, Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), Simple Mail Transfer Protocol (SMTP), Domain Name System (DNS) protocol, Common Management Interface Protocol (CMIP), Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Long Term Evolution (LTE) communication protocols, or any combination thereof.
In some aspects of the present disclosure, the communication data may be transmitted or received via at least one communication channel of a plurality of communication channels in the communication network 112. The communication channels may include, but are not limited to, a wireless channel, a wired channel, or a combination thereof. The wireless or wired channel may be associated with a data standard which may be defined by one of a Local Area Network (LAN), a Personal Area Network (PAN), a Wireless Local Area Network (WLAN), a Wireless Sensor Network (WSN), a Wide Area Network (WAN), a Wireless Wide Area Network (WWAN), a Metropolitan Area Network (MAN), a Satellite Network, the Internet, a Fiber Optic Network, a Coaxial Cable Network, an Infrared (IR) network, a Radio Frequency (RF) network, and a combination thereof. Aspects of the present disclosure are intended to include or otherwise cover any type of communication channel, including known, related art, and/or later developed technologies.
The input unit 102 may be configured to receive the first speech in the first language. The first speech mentioned herein may refer to speech that is spoken in the first language. The second speech mentioned herein may refer to speech into which the first speech is translated, i.e., speech in the second language. For example, a person speaks a sentence in English, "Hello, how are you?" English in this case is the "first language." The system 100 then processes the sentence and translates it into Spanish. The translated output in Spanish, "Hola, ¿cómo estás?", is in the "second language". In some aspects of the present disclosure, the input unit 102 may include a microphone, a digital signal processor, a speech recognition module, and a wireless receiver. Aspects of the present disclosure are intended to include and/or otherwise cover all input units, including known, related art, and/or later developed technologies.
The information processing apparatus 104 may be a network of computers, a framework, or a combination thereof, that may provide a generalized approach to create a server implementation. In some aspects of the present disclosure, the information processing apparatus 104 may be a server. Examples of the server may include, but are not limited to, personal computers, laptops, mini-computers, mainframe computers, any non-transient and tangible machine that can execute a machine-readable code, cloud-based servers, distributed server networks, or a network of computer systems. The server may be realized through various web-based technologies such as, but not limited to, a Java web-framework, a .NET framework, a personal home page (PHP) framework, or any other web-application framework. The server 104 may include one or more processing circuitries, of which the processing circuitry 108 is shown, and a non-transitory computer-readable storage medium 110.
The non-transitory computer-readable storage medium 110 (hereinafter interchangeably referred to and designated as “the database 110”) may be configured to store the logic, instructions, circuitry, interfaces, and/or codes of the processing circuitry 108 for executing various operations.
The processing circuitry 108 may be configured to execute various operations associated with the system 100. The processing circuitry 108 may execute one or more processes like data processing, decision-making, and signal generation to execute the various operations in the system 100. The processing circuitry 108 may execute one or more operations of the system 100 by generating the signals and communicating with the various components involved in the system 100. The processing circuitry 108 may be configured to implement an end-to-end continual speech-to-speech translation model 105. The end-to-end continual speech-to-speech translation model 105 may be a neural network-based architecture that may be designed to handle multiple languages. In some aspects of the present disclosure, the end-to-end continual speech-to-speech translation model 105 may incorporate advanced components such as encoders, decoders, and vocoders.
The end-to-end continual speech-to-speech translation model 105 may be configured to extract a plurality of acoustic features from the first speech. The acoustic features may include, but are not limited to, pitch, tone, frequency spectrum, energy, formants, cepstral coefficients (such as Mel-Frequency Cepstral Coefficients or MFCCs), and temporal dynamics.
In some aspects of the present disclosure, the end-to-end continual speech-to-speech translation model 105 exemplifies a robust neural network architecture that may be designed for continual learning, adept at handling diverse languages and nuanced speech characteristics. The end-to-end continual speech-to-speech translation model 105 incorporates advanced components such as encoders, decoders, and vocoders, each playing a pivotal role in the translation process. The encoders utilize attention mechanisms to capture contextual information from the first speech, while decoders generate target language outputs based on the contextual information captured. The vocoder, crucial for synthesizing natural-sounding speech, predicts durations and generates waveforms that correspond to a translated speech (i.e., translation of the first speech into the second speech). The end-to-end continual speech-to-speech translation model 105 continuously updates and refines parameters of the end-to-end continual speech-to-speech translation model 105 using a dynamic dataset that includes both current language inputs and a replay buffer of previously encountered language samples.
The end-to-end continual speech-to-speech translation model 105 may be configured to generate a series of feature vectors to capture contextual information from the first speech based on the plurality of acoustic features. The series of feature vectors may include time-aligned representations of the acoustic features, statistical measures of the speech signal (such as mean and variance), phonetic context information, prosodic features (such as intonation and stress patterns), and embeddings derived from deep learning models that encapsulate semantic and syntactic information from the speech. The feature vectors provide a comprehensive and nuanced representation of the input speech, enabling the translation model to accurately understand and process spoken language. In some aspects of the present disclosure, the end-to-end continual speech-to-speech translation model 105 may implement one or more attention techniques to capture the contextual information from the first speech.
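As a non-limiting illustration only, the following sketch shows how contextual feature vectors might be produced by a self-attention encoder over per-frame acoustic features. It assumes PyTorch; the class name AcousticEncoder, the dimensions, and the layer counts are illustrative assumptions and are not taken from the disclosure.

```python
# Minimal sketch (not the patented model): a self-attention encoder that turns a
# sequence of per-frame acoustic features into context-aware feature vectors.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, n_feats: int = 80, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)           # lift raw acoustic features to model dimension
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, n_feats) acoustic features, e.g. filterbanks or MFCCs
        return self.encoder(self.proj(feats))             # (batch, time, d_model) contextual vectors

# Usage: 100 frames of 80-dimensional features -> 100 contextual feature vectors
vectors = AcousticEncoder()(torch.randn(1, 100, 80))
print(vectors.shape)  # torch.Size([1, 100, 256])
```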
The end-to-end continual speech-to-speech translation model 105 may be further configured to determine one or more salient features of the feature vectors. The one or more salient features represent contextually relevant and linguistically significant features of the first speech. The end-to-end continual speech-to-speech translation model 105 determines the one or more salient features by implementing a wave2seq technique.
The end-to-end continual speech-to-speech translation model 105 may be further configured to generate a unit sequence for each salient feature of the one or more salient features. The unit sequence as mentioned herein may refer to phonetic or linguistic units that correspond to the identified salient features. The unit sequence may include phonemes, syllables, words, or sub-word components, which together represent the essential elements of the speech in the first language.
The end-to-end continual speech-to-speech translation model 105 may be further configured to combine the unit sequence of each salient feature to generate a waveform, such that the waveform corresponds to the second speech in the second language. The waveform as mentioned herein may refer to a continuous signal that represents the audible second speech in the second language. The waveform may be generated by synthesizing the phonetic or linguistic units into a coherent and natural-sounding speech signal, capturing the intonation, rhythm, and prosody of the target language (second language). The end-to-end continual speech-to-speech translation model 105 may be configured to predict a duration prior to the generation of the waveform. In some aspects of the present disclosure, the prediction of the duration may refer to estimating the length of time for which each segment or unit of the second speech is to be pronounced. The speech signals (first or second speech signals) may have varying durations for different phonemes, syllables, and words, influenced by factors like emphasis, context, and speaker characteristics. The waveform generation by the processing circuitry 108 may further employ the wave2seq technique for combining the unit sequence of each salient feature, such that the wave2seq technique utilizes a vocoder to synthesize the second speech in the second language.
The processing circuitry 108 may be configured to train the end-to-end continual speech-to-speech translation model 105. The processing circuitry 108 may be configured to receive a current language pair that may include the first language and the second language. The processing circuitry 108 may be configured to combine the current language pair into a pre-stored dataset. The pre-stored dataset stored in the database 110 may include the previous language pairs that may be continually trained on end-to-end speech-to-speech translation. In other words, the pre-stored dataset may include the previous language pairs that may be already translated. Further, the processing circuitry 108 may be configured to implement a gradient representative sampling (GRS) technique to generate a representative set that may approximate the gradient of the language pairs in the combined dataset and select language pairs with maximum diversity from the combined dataset. Further, the processing circuitry 108 may be configured to implement a combination of a language-balanced sampling technique and a regular random sampling technique to balance the language pairs in the combined dataset.
The database 110 may be configured to store a variety of essential data components that enable efficient and accurate translation across multiple languages. The variety of essential data may include datasets of previous language pairs that have been used for training, which consist of recordings of speech in different languages along with their corresponding translations. Additionally, the database 110 may store the current language pair data, capturing both the first speech in the first language and the translated speech (or a second speech) in the second language. The database 110 may be configured to store the plurality of acoustic features and the series of feature vectors that may be extracted from the first speech, and thereby provide the end-to-end continual speech-to-speech translation model 105 with the necessary contextual information.
In some aspects of the present disclosure, the contextual information may include phonetic context; prosodic features such as intonation, stress patterns, and rhythm; temporal dynamics of the speech signal; linguistic features like syntax and semantics; speaker-specific characteristics; and environmental noise patterns.
Furthermore, the database 110 may be further configured to maintain the one or more salient features identified from the series of feature vectors, unit sequences generated for these salient features, and the combined waveforms that correspond to the translated speech in the target language. The database 110 may be further configured to incorporate a replay buffer, which is a pivotal component of the continual learning approach. The replay buffer may store the previously trained language samples, allowing the model to maintain performance across multiple languages by preventing catastrophic forgetting. By storing this comprehensive range of data, the database 110 may support the ability of the end-to-end continual speech-to-speech translation model 105 to learn incrementally, update continuously, and deliver reliable speech-to-speech translation.
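As a non-limiting illustration, a replay buffer of the kind described above might be kept with per-language-pair reservoir sampling, as sketched below. The class name, the capacity, and the reservoir policy are assumptions for illustration, not the disclosed implementation.

```python
# Minimal sketch of a replay buffer for continual training, assuming reservoir
# sampling per language pair; names and policies here are illustrative only.
import random

class ReplayBuffer:
    def __init__(self, capacity_per_pair: int = 1000, seed: int = 0):
        self.capacity = capacity_per_pair
        self.buffers = {}      # language pair -> list of stored samples
        self.seen = {}         # language pair -> number of samples observed so far
        self.rng = random.Random(seed)

    def add(self, lang_pair: str, sample) -> None:
        buf = self.buffers.setdefault(lang_pair, [])
        n = self.seen.get(lang_pair, 0)
        if len(buf) < self.capacity:
            buf.append(sample)                 # buffer not yet full: always keep
        else:
            j = self.rng.randint(0, n)         # reservoir sampling: keep with prob capacity/(n+1)
            if j < self.capacity:
                buf[j] = sample
        self.seen[lang_pair] = n + 1

    def sample(self, k: int):
        pool = [s for buf in self.buffers.values() for s in buf]
        return self.rng.sample(pool, min(k, len(pool)))

buffer = ReplayBuffer(capacity_per_pair=2)
for i in range(5):
    buffer.add("en-es", f"utterance_{i}")
print(buffer.sample(2))
```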
It will be apparent to a person having ordinary skill in the art that the database 110 may be configured to store various types of data associated with the system 100, without deviating from the scope of the present disclosure. Examples of the database 110 may include, but are not limited to, a Relational database, a NoSQL database, a Cloud database, an Object-oriented database, and the like. Further, the database 110 may include associated memories that may include, but are not limited to, a ROM, a RAM, a flash memory, a removable storage drive, an HDD, a solid-state memory, a magnetic storage drive, a PROM, an EPROM, and/or an EEPROM. Aspects of the present disclosure are intended to include or otherwise cover any type of the database 110 including known, related art, and/or later developed technologies. In some aspects of the present disclosure, a centralized or distributed network of peripheral memory devices may be interfaced with the server, as an example, on a cloud server.
The output unit 106 may be coupled to the information processing apparatus 104. The output unit 106 may be configured to produce the second speech in the second language from the waveform.
In some aspects of the present disclosure, the output unit 106 may include any one of, or a combination of, a speaker, a headphone, wearable devices, and earphones. Aspects of the present disclosure are intended to include and/or otherwise cover all types of output units without deviating from the scope of the present disclosure.
FIG. 2 illustrates a block diagram of the processing circuitry 108 of the system 100 of FIG. 1, in accordance with an aspect of the present disclosure. The processing circuitry 108 may include a first set of engines and a second set of engines. The processing circuitry 108 may employ the first set of engines to implement the end-to-end continual speech-to-speech translation model 105 for translating the first speech in the first language into the second speech in the second language, and may employ the second set of engines to train the end-to-end continual speech-to-speech translation model 105. The first set of engines may include an acoustic feature extraction engine 202, a feature vector generation engine 204, a salient feature determination engine 206, a unit sequence generation engine 208, and a waveform generation engine 210. The second set of engines may include a dataset combination engine 212, a gradient representative sampling (GRS) engine 214, a language-balanced sampling engine 216, and a regular random sampling engine 218. The acoustic feature extraction engine 202, the feature vector generation engine 204, the salient feature determination engine 206, the unit sequence generation engine 208, the waveform generation engine 210, the dataset combination engine 212, the gradient representative sampling (GRS) engine 214, the language-balanced sampling engine 216, and the regular random sampling engine 218 may be communicatively coupled to each other by way of a communication bus 220.
The communication bus 220 may serve as a fundamental component facilitating data exchange and coordination among various processing engines within the system 100. The communication bus 220 operates as a pathway that allows these engines to transmit data, commands, and synchronization signals efficiently. The communication bus 220 ensures that information flows seamlessly between components, enabling coordinated processing and integration of tasks essential for speech processing and translation. Examples of the communication bus 220 may include a universal serial bus (USB), an inter-integrated circuit (I2C) bus, a controller area network (CAN) bus, a peripheral component interconnect (PCI) bus, and a serial peripheral interface (SPI) bus. Aspects of the present disclosure are intended to include and/or otherwise cover all types of the communication bus 220, without deviating from the scope of the present disclosure.
The acoustic feature extraction engine 202 may be configured to extract the plurality of acoustic features from the first speech. The acoustic features may include properties such as pitch, tone, frequency, and energy. The acoustic feature extraction engine 202 may implement techniques like Mel-Frequency Cepstral Coefficients (MFCCs), spectrogram analysis, and other signal processing algorithms to break down the speech into these fundamental components. The processing circuitry 108 may implement the acoustic feature extraction engine 202 by leveraging Digital Signal Processing (DSP) techniques and specialized libraries designed for the feature extraction.
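For illustration only, acoustic features such as MFCCs, pitch, and energy might be extracted with an open-source signal processing library such as librosa, as sketched below; the library choice, the feature set, and the file name are assumptions rather than the disclosed implementation.

```python
# Minimal sketch of acoustic feature extraction, assuming the librosa library;
# the exact feature set used by the disclosed system may differ.
import librosa

def extract_acoustic_features(wav_path: str, sr: int = 16000) -> dict:
    y, sr = librosa.load(wav_path, sr=sr)                        # mono waveform at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # cepstral coefficients
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)  # pitch contour
    rms = librosa.feature.rms(y=y)                               # frame-level energy
    return {"mfcc": mfcc, "pitch": f0, "energy": rms}

# features = extract_acoustic_features("speech.wav")   # hypothetical input file
```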
Following the extraction of acoustic features, the feature vector generation engine 204 may be configured to generate the series of feature vectors to capture contextual information from the first speech. The feature vector generation engine 204 may organize the extracted features into structured vectors that can be analyzed by subsequent stages of the end-to-end continual speech-to-speech translation model 105. In some aspects, the feature vector generation engine 204 may implement one or more machine learning (ML) techniques to combine the acoustic features in a way that highlights patterns and contextual dependencies within the speech. In some aspects, the processing circuitry 108 may implement the feature vector generation engine 204 by implementing one or more machine learning frameworks and dedicated hardware accelerators for efficient computation.
The salient feature determination engine 206 may be configured to identify the one or more salient features from the generated feature vectors. The salient features may represent contextually relevant and linguistically significant aspects of the first speech. The salient feature determination engine 206 may employ one or more advanced techniques such as attention techniques, which prioritize the most important parts of the input data. The processing circuitry 108 may integrate the salient feature determination engine 206 through one or more neural network models that include attention layers and other relevance-determining techniques.
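For illustration only, the sketch below ranks contextual feature vectors with a learned attention-style query and keeps the top-k frames as "salient". This is a generic attention-pooling heuristic standing in for the determination described above; it is not the wave2seq technique named in the disclosure, and all names and sizes are assumptions.

```python
# Illustrative sketch only: select "salient" feature vectors by scoring frames
# against a learned relevance query and keeping the top-k in temporal order.
import torch
import torch.nn as nn

class SalienceSelector(nn.Module):
    def __init__(self, d_model: int = 256, k: int = 20):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model))    # learned relevance query
        self.k = k

    def forward(self, vectors: torch.Tensor) -> torch.Tensor:
        # vectors: (batch, time, d_model) contextual feature vectors
        scores = vectors @ self.query                       # (batch, time) relevance scores
        weights = scores.softmax(dim=-1)
        k = min(self.k, vectors.size(1))
        top = weights.topk(k, dim=-1).indices               # most salient frame indices
        idx = top.sort(dim=-1).values                       # restore temporal order
        return torch.gather(vectors, 1, idx.unsqueeze(-1).expand(-1, -1, vectors.size(-1)))

salient = SalienceSelector()(torch.randn(1, 100, 256))
print(salient.shape)  # torch.Size([1, 20, 256])
```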
Once the salient features are determined, the unit sequence generation engine 208 may generate the unit sequence for each salient feature of the one or more salient features. The unit sequence generation engine 208 may translate the one or more salient features into the phonetic or the linguistic units that may be processed into speech in the second language. The processing circuitry 108 may implement the unit sequence generation engine 208 by employing sequence-to-sequence models and/or transformers, which are adept at handling sequential data and maintaining context over long sequences.
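As a non-limiting sketch, a unit sequence might be produced by an autoregressive transformer decoder that emits discrete unit IDs conditioned on the salient feature vectors. The vocabulary size, greedy decoding, and class names below are assumptions, not the disclosed design.

```python
# Minimal sketch (assumed design): an autoregressive transformer decoder that
# emits discrete unit IDs conditioned on the salient feature vectors.
import torch
import torch.nn as nn

class UnitDecoder(nn.Module):
    def __init__(self, d_model: int = 256, n_units: int = 1000, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(n_units + 1, d_model)     # +1 slot for a BOS token
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, n_units)
        self.bos = n_units

    @torch.no_grad()
    def greedy_decode(self, memory: torch.Tensor, max_len: int = 50) -> list:
        # memory: (1, time, d_model) salient feature vectors from the encoder side
        units = [self.bos]
        for _ in range(max_len):
            tgt = self.embed(torch.tensor([units]))
            # causal mask so each position only attends to earlier units
            mask = torch.triu(torch.full((len(units), len(units)), float("-inf")), diagonal=1)
            h = self.decoder(tgt, memory, tgt_mask=mask)
            units.append(int(self.out(h[:, -1]).argmax(-1)))
        return units[1:]                                     # drop BOS

units = UnitDecoder().greedy_decode(torch.randn(1, 20, 256))
print(units[:10])
```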
The waveform generation engine 210 may combine the unit sequences of each salient feature to generate a waveform that corresponds to the second speech in the second language. The waveform generation engine 210 may employ a vocoder to synthesize speech from the unit sequences, predicting durations and ensuring the natural flow of speech. The processing circuitry 108 may employ deep learning models, such as HiFi-GAN, WaveNet, or similar generative models, to produce high-quality, natural-sounding speech waveforms.
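As a non-limiting sketch of duration prediction prior to waveform generation, a FastSpeech-style length regulator can predict a frame count per unit and expand the unit embeddings accordingly before a neural vocoder (e.g., HiFi-GAN) synthesizes the waveform. The vocoder call itself is omitted below, and all module names and sizes are illustrative assumptions.

```python
# Minimal sketch, assuming a FastSpeech-style length regulator: a small network
# predicts a duration (in vocoder frames) per unit, the unit embeddings are
# repeated accordingly, and the expanded sequence would then feed a vocoder.
import torch
import torch.nn as nn

class DurationRegulator(nn.Module):
    def __init__(self, n_units: int = 1000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_model)
        self.duration = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        # units: (seq_len,) discrete unit IDs for the target-language speech
        emb = self.embed(units)                               # (seq_len, d_model)
        dur = self.duration(emb).squeeze(-1).exp().round()    # predicted frames per unit
        dur = dur.clamp(min=1).long()                         # at least one frame each
        return emb.repeat_interleave(dur, dim=0)              # (total_frames, d_model)

frames = DurationRegulator()(torch.randint(0, 1000, (12,)))
print(frames.shape)            # torch.Size([N, 256]), N = summed predicted durations
# waveform = hifigan(frames)   # hypothetical vocoder call, not shown here
```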
The dataset combination engine 212 may be configured to receive the current language pair and combine the current language pair with the pre-stored dataset to generate a combined dataset. The pre-stored dataset comprises previous language pairs that may be continually trained on end-to-end continual speech-to-speech translations. The processing circuitry 108 may implement the dataset combination engine 212 by employing a database management system and data processing pipelines to merge and organize training data effectively.
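For illustration only, combining the current language pair's data with the pre-stored dataset could be as simple as concatenating two datasets, as in the following PyTorch sketch with placeholder tensors standing in for real speech data.

```python
# Minimal sketch, assuming PyTorch datasets: concatenate the current language
# pair's data with the pre-stored dataset of previously learned pairs.
import torch
from torch.utils.data import ConcatDataset, TensorDataset

pre_stored = TensorDataset(torch.randn(500, 80), torch.randint(0, 1000, (500,)))    # earlier pairs
current_pair = TensorDataset(torch.randn(100, 80), torch.randint(0, 1000, (100,)))  # new pair
combined = ConcatDataset([pre_stored, current_pair])
print(len(combined))  # 600
```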
The gradient representative sampling (GRS) engine 214 may be configured to generate a representative set that approximates the gradient of the language pairs in the combined dataset. The GRS engine 214 may select language pairs with maximum diversity to ensure that the model is exposed to a wide variety of linguistic patterns. The GRS engine 214 may employ gradient analysis and optimization algorithms to identify the most representative samples. The processing circuitry 108 may integrate the GRS engine 214 with machine learning libraries that support gradient-based sampling and optimization techniques.
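The GRS computation is not spelled out in this excerpt. Purely as an illustration, the sketch below uses a generic greedy gradient-matching selection: it picks examples whose accumulated gradients best track the combined dataset's total gradient, which implicitly favors diverse examples. Every detail here is an assumption, not the disclosed GRS technique.

```python
# Illustrative stand-in for gradient-based representative selection: greedily
# choose a budgeted subset whose gradients cover the dataset's total gradient.
import torch

def select_representative(per_example_grads: torch.Tensor, budget: int) -> list:
    # per_example_grads: (n_examples, d) gradient (or gradient-proxy) per training example
    residual = per_example_grads.sum(dim=0)            # total gradient of the combined dataset
    chosen = []
    for _ in range(budget):
        scores = per_example_grads @ residual          # alignment with the unexplained residual
        if chosen:
            scores[chosen] = float("-inf")             # do not reselect the same example
        best = int(scores.argmax())
        chosen.append(best)
        residual = residual - per_example_grads[best]  # shrink the gradient mass left to cover
    return chosen

grads = torch.randn(200, 32)                           # stand-in per-example gradient proxies
print(select_representative(grads, budget=10))
```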
The language-balanced sampling engine 216 may address the challenge of language imbalance in the training samples. The language-balanced sampling engine 216 may ensure that each language pair is adequately represented in the training data. The language-balanced sampling engine 216 may employ one or more techniques that may maintain a proportional representation of languages, preventing the model from becoming biased towards more frequently occurring languages. The processing circuitry 108 may implement the language-balanced sampling engine 216 through custom sampling strategies and data augmentation techniques.
To prevent overfitting, the regular random sampling engine 218 may randomly select samples from the combined dataset. The regular random sampling engine 218 may ensure that the training process remains robust and adaptable to different linguistic contexts. The processing circuitry 108 may incorporate the regular random sampling engine 218 using stochastic sampling methods and randomization algorithms provided by machine learning frameworks.
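As a non-limiting sketch, the language-balanced and regular random sampling techniques described above might be mixed as follows: part of a batch is drawn with an equal quota per language pair and the rest is drawn uniformly from the whole pool. The 50/50 mix ratio, the function name, and the toy data are assumptions.

```python
# Minimal sketch of mixing language-balanced sampling (equal quota per language
# pair) with regular random sampling over the whole pool; ratios are assumed.
import random

def balanced_plus_random(samples_by_pair: dict, total: int, mix: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    pairs = list(samples_by_pair)
    quota = max(1, int(total * mix) // max(1, len(pairs)))
    batch = []
    for pair in pairs:                                       # language-balanced portion
        pool = samples_by_pair[pair]
        batch += rng.sample(pool, min(quota, len(pool)))
    everything = [s for pool in samples_by_pair.values() for s in pool]
    remaining = max(0, total - len(batch))                   # regular random portion
    batch += rng.sample(everything, min(remaining, len(everything)))
    rng.shuffle(batch)
    return batch

data = {"en-es": [f"es_{i}" for i in range(100)],
        "en-hi": [f"hi_{i}" for i in range(10)],             # under-represented pair
        "en-fr": [f"fr_{i}" for i in range(80)]}
print(len(balanced_plus_random(data, total=24)))             # 24
```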
FIG. 3 illustrates a flow chart of a method 300 for facilitating end-to-end continual speech-to-speech translation by the system 100 of FIG. 1, in accordance with an aspect of the present disclosure. The method 300 may include the following steps for facilitating end-to-end continual speech-to-speech translation:
At step 302, the system 100 may be configured to receive the first speech in the first language. Specifically, the system 100 may be configured to receive the first speech in the first language, by way of the input unit 102.
At step 304, the system 100 may be configured to implement the end-to-end continual speech-to-speech translation model 105. Specifically, the system 100 may be configured to implement the end-to-end continual speech-to-speech translation model 105 by way of the processing circuitry 108.
At step 306, the system 100 may be configured to extract the plurality of the acoustic features from the first speech. Specifically, the system 100 may be configured to extract the plurality of acoustic features from the first speech by way of the end-to-end continual speech-to-speech translation model 105.
At step 308, the system 100 may be configured to implement the one or more attention techniques to capture the contextual information from the first speech. Specifically, the system 100 may be configured to implement the one or more attention techniques to capture the contextual information from the first speech by way of the end-to-end continual speech-to-speech translation model 105.
At step 310, the system 100 may be configured to generate a series of the feature vectors to capture the contextual information from the first speech based on the plurality of the acoustic features. Specifically, the system 100 may be configured to generate the series of the feature vectors to capture the contextual information from the first speech based on the plurality of the acoustic features, by way of the end-to-end continual speech-to-speech translation model 105.
At step 312, the system 100 may be configured to determine the one or more salient features of the feature vectors. Specifically, the system 100 may be configured to determine the one or more salient features of the feature vectors, by way of the end-to-end continual speech-to-speech translation model 105. The one or more salient features represent contextually relevant and linguistically significant features of the first speech.
At step 314, the system 100 may be configured to generate the unit sequence for each salient feature of the one or more salient features. Specifically, the system 100 may be configured to generate the unit sequence for each salient feature of the one or more salient features, by way of the end-to-end continual speech-to-speech translation model 105.
At step 316, the system 100 may be configured to predict the duration prior to the generation of the waveform. Specifically, the system 100 may be configured to predict the duration prior to the generation of the waveform, by way of the end-to-end continual speech-to-speech translation model 105.
At step 318, the system 100 may be configured to combine the unit sequence of each salient feature to generate the waveform. Specifically, the system 100 may be configured to combine the unit sequence of each salient feature to generate the waveform, by way of the end-to-end continual speech-to-speech translation model 105. The waveform corresponds to a second speech in a second language.
At step 320, the system 100 may be configured to produce the second speech in the second language from the waveform. Specifically, the system 100 may be configured to produce the second speech in the second language from the waveform, by way of the output unit 106.
FIG. 4 illustrates a flow chart of a method 400 for continually training the end-to-end continual speech-to-speech translation model 105 by the system 100 of FIG. 1, in accordance with an aspect of the present disclosure. The method 400 may include the following steps for facilitating the continual training of the end-to-end continual speech-to-speech translation model 105:
At step 402, the system 100 may be configured to receive the current language pair. Specifically, the system 100 may be configured to receive the current language pair, by way of the processing circuitry 108.
At step 404, the system 100 may be configured to combine the current language pair into the pre-stored dataset to generate a combined dataset. Specifically, the system 100 may be configured to combine the current language pair into the pre-stored dataset to generate a combined dataset, by way of the processing circuitry 108.
At step 406, the system 100 may be configured to implement the gradient representative sampling (GRS) technique to generate the representative set that approximates the gradient of the language pairs in the combined dataset and select the language pairs with maximum diversity from the combined dataset. Specifically, the system 100 may be configured to implement the gradient representative sampling (GRS) technique to generate the representative set that approximates the gradient of the language pairs in the combined dataset and select the language pairs with maximum diversity from the combined dataset, by way of the processing circuitry 108.
At step 408, the system 100 may be configured to implement the combination of the language-balanced sampling technique and the regular random sampling technique to balance the language pairs in the combined dataset. Specifically, the system 100 may be configured to implement the combination of the language-balanced sampling technique and a regular random sampling technique to balance the language pairs in the combined dataset, by way of the processing circuitry 108.
Advantages of the disclosed system and method for end-to-end continual speech-to-speech translation include, but are not limited to, the following:
1. Efficient Contextual Information Extraction: The processing circuitry efficiently extracts a diverse range of acoustic features from the first speech, enabling comprehensive analysis of pitch, tone, frequency spectrum, energy, formants, and temporal dynamics. This ensures that the translated speech maintains nuanced contextual information crucial for accurate communication across languages.
2. Enhanced Contextual Relevance: By generating feature vectors from the extracted acoustic features, the system 100 captures intricate contextual details embedded in the speech. This capability is pivotal in identifying and emphasizing salient features that are linguistically significant, thereby enhancing the fidelity and relevance of the translated output.
3. Precision in Salient Feature Identification: The method employs robust processing techniques to determine one or more salient features from the feature vectors. These salient features encapsulate the most pertinent aspects of the speech, ensuring that the translated speech retains contextually relevant nuances essential for conveying meaning accurately.
4. Natural and Coherent Waveform Generation: Through the generation of unit sequences for each identified salient feature and their subsequent combination into a waveform, the system produces a second speech in a target language that is natural-sounding and coherent. Predicting the duration prior to waveform generation further refines the output's naturalness, maintaining proper timing and rhythm akin to human speech patterns.
5. Scalable Training and Optimization: For continual training of the end-to-end continual speech-to-speech translation model, the processing circuitry employs advanced techniques such as gradient representative sampling (GRS), language-balanced sampling, and regular random sampling. These methodologies ensure that the model is trained on a diverse and representative dataset, optimizing its performance across various language pairs while minimizing bias and enhancing robustness.
6. Streamlined Output Production: With an integrated output unit, the system seamlessly produces the translated speech in the target language directly from the waveform. This streamlined process simplifies deployment in real-world applications, ensuring efficient and reliable speech translation capabilities.
The foregoing discussion of the present disclosure has been presented for purposes of illustration and description. It is not intended to limit the present disclosure to the form or forms disclosed herein. In the foregoing Detailed Description, for example, various features of the present disclosure are grouped together in one or more aspects or configurations for the purpose of streamlining the disclosure. The features of the aspects or configurations may be combined in alternate aspects or configurations other than those discussed above. This method of disclosure is not to be interpreted as reflecting an intention that the present disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed aspect or configuration. Thus, the following claims are hereby incorporated into this Detailed Description, with each claim standing on its own as a separate aspect of the present disclosure.
Moreover, though the description of the present disclosure has included description of one or more aspects or configurations and certain variations and modifications, other variations, combinations, and modifications are within the scope of the present disclosure, e.g., as may be within the skill and knowledge of those in the art, after understanding the present disclosure. It is intended to obtain rights which include alternative aspects or configurations to the extent permitted, including alternate, interchangeable and/or equivalent structures, functions, ranges or steps to those claimed, whether or not such alternate, interchangeable and/or equivalent structures, functions, ranges or steps are disclosed herein, and without intending to publicly dedicate any patentable subject matter.
Certain terms are used throughout the following description and claims to refer to particular features or components. As one skilled in the art will appreciate, different persons may refer to the same feature or component by different names. This document does not intend to distinguish between components or features that differ in name but not structure or function. While various aspects of the present disclosure have been illustrated and described, it will be clear that the present disclosure is not limited to these aspects only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the present disclosure, as described in the claims.

Claims:

1. A system (100) for end-to-end continual speech-to-speech translation, comprising:
an input unit (102) configured to receive a first speech in a first language;
processing circuitry (108) coupled to the input unit (102) and configured to:
implement an end-to-end continual speech-to-speech translation model (105) that is configured to:
extract a plurality of acoustic features from the first speech;
generate a series of feature vectors to capture contextual information from the first speech based on the plurality of acoustic features;
determine one or more salient features of the feature vectors, wherein the one or more salient features represent contextually relevant and linguistically significant features of the first speech;
generate a unit sequence for each salient feature of the one or more salient features; and
combine the unit sequence of each salient feature to generate a waveform, wherein the waveform corresponds to a second speech in a second language.

2. The system (100) as claimed in claim 1, wherein the end-to-end continual speech-to-speech translation model (105) implements one or more attention techniques to capture the contextual information from the first speech.

3. The system (100) as claimed in claim 1, wherein the end-to-end continual speech-to-speech translation model (105) predicts a duration prior to the generation of the waveform.

4. The system (100) as claimed in claim 1, further comprising an output unit (106) that is configured to produce the second speech in the second language from the waveform.

5. The system (100) as claimed in claim 1, wherein the processing circuitry (108) employs a vocoder technique for waveform generation in the second language.

6. The system (100) as claimed in claim 1, wherein to train the end-to-end continual speech-to-speech translation model (105), the processing circuitry (108) is further configured to:
receive a current language pair, wherein the current language pair comprises the first language and the second language;
combine the current language pair into a pre-stored dataset to generate a combined dataset, wherein the pre-stored dataset comprises previous language pairs that are continually trained on speech-to-speech translations;
implement a gradient representative sampling (GRS) technique to generate a representative set that approximates the gradient of the language pairs in the combined dataset and select the language pairs with maximum diversity from the combined dataset; and
implement a combination of a language-balanced sampling technique and a regular random sampling technique to balance the language pairs in the combined dataset.

7. A method (300) for speech-to-speech translation, comprising:
receiving (302), by way of an input unit (102), a first speech;
implementing (304), by way of processing circuitry (108) coupled to the input unit (102), an end-to-end continual speech-to-speech translation model (105);
extracting (306), by way of the end-to-end continual speech-to-speech translation model (105), a plurality of acoustic features from the first speech;
generating (310), by way of the end-to-end continual speech-to-speech translation model (105), a series of feature vectors to capture contextual information from the first speech based on the plurality of acoustic features;
determining (312), by way of the end-to-end continual speech-to-speech translation model (105), one or more salient features of the feature vectors, wherein the one or more salient features represent contextually relevant and linguistically significant features of the first speech;
generating (314), by way of the end-to-end continual speech-to-speech translation model (105), a unit sequence for each salient feature of the one or more salient features;
combining (318), by way of the end-to-end continual speech-to-speech translation model (105), the unit sequence of each salient feature to generate a waveform, wherein the waveform corresponds to a second speech in a second language; and
producing (320), by way of an output unit (106) coupled to the processing circuitry (108), the second speech in the second language from the waveform.

8. The method as claimed in claim 7, further comprising implementing (308), by way of the end-to-end continual speech-to-speech translation model (105), one or more attention techniques to capture the contextual information from the first speech.

9. The method as claimed in claim 7, further comprising predicting (316), by way of the end-to-end continual speech-to-speech translation model (105), a duration prior to the generation of the waveform.

10. The method as claimed in claim 7, wherein to train the end-to-end continual speech-to-speech translation model (105) by way of the processing circuitry (108), the method further comprises:
receiving (402), by way of the processing circuitry (108), a current language pair, wherein the current language pair comprises the first language and the second language;
combining (404), by way of the processing circuitry (108), the current language pair into a pre-stored dataset to generate a combined dataset, wherein the pre-stored dataset comprises previous language pairs that are continually trained on speech-to-speech translations;
implementing (406), by way of the processing circuitry (108), a gradient representative sampling (GRS) technique to generate a representative set that approximates the gradient of the language pairs in the combined dataset and select the language pairs with maximum diversity from the combined dataset; and
implementing (408), by way of the processing circuitry (108), a combination of a language-balanced sampling technique and a regular random sampling technique to balance the language pairs in the combined dataset.

Documents

Application Documents

# Name Date
1 202421050380-STATEMENT OF UNDERTAKING (FORM 3) [01-07-2024(online)].pdf 2024-07-01
2 202421050380-FORM FOR SMALL ENTITY(FORM-28) [01-07-2024(online)].pdf 2024-07-01
3 202421050380-FORM FOR SMALL ENTITY [01-07-2024(online)].pdf 2024-07-01
4 202421050380-FORM 1 [01-07-2024(online)].pdf 2024-07-01
5 202421050380-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [01-07-2024(online)].pdf 2024-07-01
6 202421050380-EVIDENCE FOR REGISTRATION UNDER SSI [01-07-2024(online)].pdf 2024-07-01
7 202421050380-DRAWINGS [01-07-2024(online)].pdf 2024-07-01
8 202421050380-DECLARATION OF INVENTORSHIP (FORM 5) [01-07-2024(online)].pdf 2024-07-01
9 202421050380-COMPLETE SPECIFICATION [01-07-2024(online)].pdf 2024-07-01
10 Abstract.1.jpg 2024-07-26
11 202421050380-FORM-9 [12-08-2024(online)].pdf 2024-08-12
12 202421050380-FORM-26 [14-08-2024(online)].pdf 2024-08-14
13 202421050380-MSME CERTIFICATE [05-11-2024(online)].pdf 2024-11-05
14 202421050380-FORM28 [05-11-2024(online)].pdf 2024-11-05
15 202421050380-FORM 18A [05-11-2024(online)].pdf 2024-11-05
16 202421050380-FER.pdf 2024-12-02
17 202421050380-FORM 3 [12-12-2024(online)].pdf 2024-12-12
18 202421050380-Proof of Right [31-12-2024(online)].pdf 2024-12-31
19 202421050380-PA [31-12-2024(online)].pdf 2024-12-31
20 202421050380-FORM28 [31-12-2024(online)].pdf 2024-12-31
21 202421050380-EVIDENCE FOR REGISTRATION UNDER SSI [31-12-2024(online)].pdf 2024-12-31
22 202421050380-EDUCATIONAL INSTITUTION(S) [31-12-2024(online)].pdf 2024-12-31
23 202421050380-ASSIGNMENT DOCUMENTS [31-12-2024(online)].pdf 2024-12-31
24 202421050380-8(i)-Substitution-Change Of Applicant - Form 6 [31-12-2024(online)].pdf 2024-12-31
25 202421050380-FER_SER_REPLY [17-03-2025(online)].pdf 2025-03-17
26 202421050380-US(14)-HearingNotice-(HearingDate-06-05-2025).pdf 2025-04-07
27 202421050380-Correspondence to notify the Controller [09-04-2025(online)].pdf 2025-04-09
28 202421050380-US(14)-ExtendedHearingNotice-(HearingDate-14-05-2025)-1200.pdf 2025-04-22
29 202421050380-Correspondence to notify the Controller [05-05-2025(online)].pdf 2025-05-05
30 202421050380-Written submissions and relevant documents [27-05-2025(online)].pdf 2025-05-27
31 202421050380-PatentCertificate11-06-2025.pdf 2025-06-11
32 202421050380-IntimationOfGrant11-06-2025.pdf 2025-06-11

Search Strategy

1 202421050380E_19-11-2024.pdf
