
Method For Building Low Resource Speech Synthesis System

Abstract: The present invention relates to a method for building a low resource Text-to-Speech (TTS) synthesis system. The Text-to-Speech model comprises an encoder and a decoder to convert input text into output speech in a desired target speaker’s voice. The model is first trained on English data to build an English TTS. Once the model is trained with said data, the embedding layer is discarded and the remaining weights are retained for the next stage of training. The model is then re-trained with synthetic data generated by an existing out-of-box Hindi TTS model; this synthetic Hindi data is utilized to train the model in order to minimize the dependency on the availability of data for a low resource language such as Hindi. While training the model on limited real target speaker data, the text encoder weights are frozen such that the encoder is not trained during the fine-tuning of the decoder. The decoder is fine-tuned by training it on the real target speaker data so as to produce speech in the voice of a target speaker of desired choice. (Figure 1)


Patent Information

Application #
202341078243
Filing Date
17 November 2023
Publication Number
52/2023
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

FLIPKART INTERNET PRIVATE LIMITED
Buildings Alyssa, Begonia & Clover, Embassy Tech Village, Outer Ring Road, Devarabeesanahalli Village, Bengaluru - 560103, Karnataka, India

Inventors

1. JOSHI, Raviraj
Flat A-1104, Polite Harmony, Near Sane Chowk, Chikhali, Pune-411062, Maharashtra, India
2. GARERA, Nikesh
I-1638 Brigade Cosmopolis, 286 Whitefield Main Road, Bangalore-560066, Karnataka, India

Specification

Description:

FIELD OF INVENTION

[001] The present invention relates to a Text-to-Speech (TTS) system. Particularly, the present invention relates to a method for building a low resource Text-to-Speech (TTS) system. More particularly, the present invention utilizes existing TTS systems to generate synthetic audio data to build a low resource speech synthesis system.

BACKGROUND OF THE INVENTION

[002] Text-to-Speech (TTS) systems convert written text into spoken words. TTS systems provide enhanced accessibility to users with disabilities such as visual impairment and provide an immersive reading experience to different types of users. These systems are crucial in making information available across different formats and improving inclusivity in communication.
[003] There are many known Text-to-Speech (TTS) systems, such as concatenative, formant or parametric systems. Concatenative systems use pre-recorded human speech units, formant systems synthesize speech from modeled articulatory parameters, and parametric systems generate speech from mathematical models. The prior known Text-to-Speech systems have limitations, including difficulties in producing natural intonation, pronunciation errors and challenges with complex linguistic nuances. Furthermore, owing to a lack of emotional expressiveness and context-dependent errors in speech, achieving human-like prosody and understanding remains challenging in the prior known systems.
[004] A text-to-speech synthesiser can easily be trained on datasets for high resource languages like English. For low resource languages such as Hindi, however, there is a scarcity of high-quality training datasets. Reference can be made to Building Multilingual End-to-End Speech Synthesisers for Indian Languages, Anju Leela Thomas et al. (2019), which addresses the issue of low digital resources. A multilingual end-to-end speech synthesiser is proposed, wherein TTSes are trained for Indian languages using two text representations: character-based and phone-based. The focus is on capitalizing on the similarities among Indian languages for system building, and both text representations are explored in the end-to-end framework. However, the goal is limited to training Indic systems on monolingual and multi-language data.
[005] Moreover, Generic Indic Text-to-speech Synthesisers with Rapid Adaptation in an End-to-end Framework, Anusha Prakash et al. (2020), provides generic voices for Indian languages and further adapts them to languages with low amounts of speech data. Indian languages can be classified into a finite set of families, prominent among them being Indo-Aryan and Dravidian. The proposed work exploits this property to build a generic TTS system in an end-to-end framework, training generic voices by pooling data from multiple languages belonging to the same family. The aforementioned documents attempt to build TTS systems for Indic languages under similar constraints. However, such generic language-specific text-to-speech synthesisers do not provide output speech for a low resource language by building a low-resource, low-budget TTS system in the voice of a specific speaker.
[006] Text-to-Speech (TTS) systems convert text data into speech output and find application in voice-based interfaces. TTS systems are mainly machine learning based models built using end-to-end deep learning approaches. However, training a TTS system requires large amounts of high-quality data in the form of text-audio pairs recorded in studio conditions. The audio recordings are required to be of very high quality with no background noise, and must cover all phoneme and prosodic variations, which makes recording such data a costly operation. Therefore, there is a requirement to build a low-resource (for example, Hindi) and low-budget TTS system with the output speech in the voice of a specific speaker.

OBJECTIVES OF THE INVENTION

[007] The primary objective of the present invention is to provide a method for building a low resource Text-to-Speech synthesis system.
[008] Another objective of the present invention is to use existing TTS systems to generate synthetic audio data from input text.
[009] Still another objective of the present invention is to utilize the synthetic audio data to train the TTS system and to fine-tune it on the voice of a specific speaker of choice.
[0010] Other objectives and advantages of the present invention will become apparent from the following description taken in connection with the accompanying drawings, wherein, by way of illustration and example, the aspects of the present invention are disclosed.

SUMMARY OF THE INVENTION

[0011] The present invention relates to a method for building a low resource Text-to-Speech synthesis system. The Text-to-Speech model comprises an encoder and a decoder to convert monolingual data, such as English data, into output speech in a desired target speaker’s voice. The model is first trained on English data to build an English TTS. Once the model is trained with said data, the embedding layer is discarded and the remaining weights are used in the next stage of training. The weights of the embedding layer are discarded after the first step of training on English data because the character sets of the English and Hindi data do not match; the embedding layer weights are trained afresh in the subsequent step. The model is further re-trained with synthetic data generated by an existing Hindi TTS model. This synthetic Hindi data is utilized to train the model in order to minimize the dependency on the availability of data for a low resource language such as Hindi. While training the model on limited real target speaker data, the model freezes the text encoder weights such that the encoder is not trained during the fine-tuning of the decoder. The decoder is fine-tuned by training it on the real target speaker data so as to produce speech in the voice of a target speaker of choice.

BRIEF DESCRIPTION OF DRAWINGS

[0012] A complete understanding of the present invention may be obtained by reference to the accompanying drawings, when taken in conjunction with the detailed description thereof and in which:

[0013] Figure 1 illustrates a method for building a text to speech (TTS) system for a low resource language in the voice of a specific speaker using synthetic Hindi data.

DETAILED DESCRIPTION OF THE INVENTION

[0014] The following description describes various features and functions of the disclosed system and method with reference to the accompanying figure. In the figure, similar symbols identify similar components, unless context dictates otherwise. The illustrative aspects described herein are not meant to be limiting. It may be readily understood that certain aspects of the disclosed system and method can be arranged and combined in a wide variety of different configurations, all of which are contemplated herein.

[0015] Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

[0016] Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.

[0017] The terms and words used in the following description and claims are not limited to the bibliographical meanings, but are merely used to enable a clear and consistent understanding of the invention. Accordingly, it should be apparent to those skilled in the art that the following description of exemplary embodiments of the present invention is provided for illustration purposes only and not for the purpose of limiting the invention.

[0018] It is to be understood that the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.

[0019] It should be emphasized that the term “comprises/comprising” when used in this specification is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof. The equations used in the specification are only for computation purposes.

[0020] The terms “module” and “corpus” used herein denote a software or hardware component. The meaning of “module” and “corpus” or “audio corpus” is not limited to software or hardware only. The “module” may be configured to operate in conjunction with a generic or a specific processing unit to execute instructions to carry out the functioning of the present invention. The audio corpus comprising audio data may be stored on a local hard drive attached to a computing system or a server in a local or cloud computing environment in a compatible format, for example, raw .wav format. It may be appreciated by a person skilled in the art that the audio data may also be stored in other formats. All the audio data may be re-sampled at 16 kHz and further encoded in 16-bit PCM wav format for training and inference. Such datasets may be used for training one or more modules. Other hardware or software components may also be utilized to implement the aspects of the present invention.
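
In a non-limiting illustration, the resampling and encoding step described above may be sketched in Python, assuming the librosa and soundfile packages; the file paths are illustrative assumptions only:

```python
# Sketch: re-sample a corpus recording to 16 kHz and encode it as
# 16-bit PCM WAV, as described in paragraph [0020]. Paths are
# illustrative assumptions, not part of the disclosed system.
import librosa
import soundfile as sf

def resample_to_16k(src_path: str, dst_path: str) -> None:
    # librosa resamples on load when a target rate is given
    audio, sr = librosa.load(src_path, sr=16000)
    # write 16-bit PCM WAV for training and inference
    sf.write(dst_path, audio, sr, subtype="PCM_16")

resample_to_16k("raw/utterance_0001.wav", "corpus/utterance_0001.wav")
```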

[0021] The systems and methods disclosed in this invention may be implemented by hardware, software, firmware and/or any combination thereof. For example, a processor such as a CPU, GPU, or any other processing unit that can be implemented by logic circuits, microprocessors, integrated circuits, controllers, etc., may be used in the present invention. In a non-limiting example, the model disclosed in the present invention may be implemented using a system configuration of an Intel Xeon (Skylake, IBRS) CPU with 72 cores, an Nvidia A100 Tensor Core GPU with 40 GB VRAM, and 350 GB of RAM, along with other auxiliary hardware and software components. However, the model may be implemented with another configuration of CPU/GPU and memory as per the requirements of the user.

[0022] Software may include different components such as application programs, operating systems, drivers, etc. For example, an application program may include an Application Programming Interface (API) that enables different software components to communicate with each other, whereby internet-based web or mobile applications can access or request remote web services through their APIs. In a non-limiting example, the model of the present invention may be exposed through an API endpoint that is configured to be invoked from the Flipkart mobile application having a voice assistant feature. The API can also be invoked from the customer experience voice bot.
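
In a non-limiting sketch, such an endpoint may be exposed with a Python web framework such as FastAPI; the synthesize function below is a hypothetical wrapper around the trained model and is not part of the disclosed system:

```python
# Sketch: serving a trained TTS model behind an HTTP API endpoint,
# as described in paragraph [0022]. FastAPI is one common choice;
# the model wrapper is a hypothetical placeholder.
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()

class TTSRequest(BaseModel):
    text: str  # query text from the voice assistant or voice bot

def synthesize(text: str) -> bytes:
    # placeholder: in practice this would run the trained text-to-mel
    # and vocoder stages and return encoded WAV bytes
    raise NotImplementedError

@app.post("/tts")
def tts_endpoint(req: TTSRequest) -> Response:
    wav_bytes = synthesize(req.text)
    return Response(content=wav_bytes, media_type="audio/wav")
```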

[0023] While this invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims.

[0024] Accordingly, a Text-to-Speech (TTS) system may be provided for e-commerce mobile or web applications. The e-commerce application may include a voice assistant and a customer experience bot. The query text provided by the user of the e-commerce application is processed by the disclosed TTS system, and an automatic response to the query is provided by the bot in the voice of a specific speaker.

[0025] The present invention provides a method to build a Text-to-Speech (TTS) system for a low resource language in the voice of a specific speaker. The present invention uses two languages, a first language and a second language, in which a text query is provided in the first language and an output speech is generated in the second language in the voice of a desired speaker.

[0026] In an embodiment, a text to speech (TTS) system for a low resource language is provided. A method for building a low resource speech synthesis system is provided that uses an out-of-box existing TTS system to generate synthetic audio data in response to an input text. It may be appreciated by a person skilled in the art that an out-of-box TTS system may refer to any previously trained TTS system for a specific voice. It may refer to any external subscription based TTS service, for example, openly accessible TTS systems such as Google TTS, Microsoft TTS, etc. It may also refer to any internally trained TTS system, for example one built for a particular speaker, whereas a new system is required to be built for another target speaker.

[0027] A synthetic corpus may be created comprising pairs of input text and the corresponding synthetic audio generated by the out-of-box existing TTS system. This synthetic corpus is used to train the TTS system first, followed by fine-tuning on the speaker of choice. Such a TTS system uses very little speaker-specific data in the low resource language. Accordingly, a novel approach has been designed to build a low resource TTS synthesis system that requires only limited low resource language data of a desired speaker.

[0028] For instance, the synthetic corpus may include synthetic audios generated by first selecting a text sample from a target domain (for example, a customer experience voice bot prompt in Devanagari script). The text is then passed to an out-of-box TTS model, which generates speech corresponding to the given text. The generated audio-text pair may further be utilized for the next step of training. Such an audio-text pair is called synthetic since it contains synthetic audio and real text: the audio is synthetic because it was generated by a TTS model and not by a human speaker.
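
In a non-limiting example, the creation of such a synthetic corpus may be sketched in Python; here the gTTS package (a freely accessible wrapper around an online TTS service) stands in for the out-of-box Hindi TTS, and the file names are illustrative assumptions:

```python
# Sketch: build synthetic (text, audio) training pairs by passing
# target-domain Hindi text through an existing out-of-box TTS, as
# described in paragraph [0028]. gTTS is one openly accessible option.
from gtts import gTTS

def build_synthetic_pair(text: str, out_path: str) -> tuple[str, str]:
    # the generated audio is "synthetic"; the paired text is real
    gTTS(text=text, lang="hi").save(out_path)  # gTTS writes MP3 audio
    return text, out_path

with open("target_domain_texts_hi.txt", encoding="utf-8") as f:
    pairs = [
        build_synthetic_pair(line.strip(), f"synthetic/{i:05d}.mp3")
        for i, line in enumerate(f) if line.strip()
    ]
```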

[0029] The present invention implements one or more features of the TTS system using one or more non-limiting examples. Some of the data used for training the TTS system includes: (i) English data; (ii) a synthetic corpus including pairs of text and the corresponding synthetic audio data; and (iii) pure Hindi data, as described below.

[0030] In an embodiment, an audio corpus may be provided, comprising a plurality of audio samples in the first language and the second language. The audio samples are studio recordings of the voices of one or more speakers uttering a text in the first language or the second language. In another embodiment, the audio corpus performs the specific role of storing audio data corresponding to English and Hindi data, recorded in the voices of one or more artist speakers, and may include one or more datasets of pre-recorded audio samples. In yet another embodiment, the audio corpus is a single-speaker corpus containing audio samples in Hindi recorded in the voice of said speaker, who may be a target speaker of desired choice.

[0031] In yet another embodiment, the audio corpus may store a plurality of audio data generated in any speaker’s voice by one or more existing out-of-box TTS systems in response to the input text; such audio data may be termed synthetic audio data. The synthetic audio data may further be utilized to train a TTS model.

[0032] Figure 1 represents an embodiment of the present invention, wherein a neural network model is used to convert input text data in a first language into output speech in a second language. The neural network model comprises a training module, which further comprises an encoder and a decoder.

[0033] Figure 1 illustrates a multi-step model training process. Initially, a model is trained on an English corpus, for example, the LJSpeech corpus. Next, one or more embedding layers are discarded, as the English text is in Roman script whereas the target script may be Devanagari. The model is further trained on synthetic Hindi data generated using an out-of-box TTS system. Finally, the model is fine-tuned on target speaker data, wherein the encoder of the model is configured to be deactivated or frozen and only the decoder of the model is configured to be fine-tuned.

[0034] In an embodiment, the neural network model is a pre-trained model that encodes the language representation, such as characters, words, or sentences, so that one or more embedding vectors may be used for other tasks. In the context of Natural Language Processing (NLP) and deep learning models, especially for tasks involving text data, the embedding layer is a fundamental component that is used to convert categorical variables, such as words or tokens in text, into dense vectors of real numbers, often referred to as embeddings. These embeddings capture semantic relationships between words and allow the model to learn and understand the meaning of words in a continuous vector space. It is well known in the art that an embedding layer is a type of hidden layer in a neural network that maps input information from a high-dimensional space to a lower-dimensional space.
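
As a non-limiting illustration of such an embedding layer, the following PyTorch snippet maps character indices to dense vectors; the vocabulary size and embedding dimension are illustrative values only:

```python
# Sketch: a character embedding layer as discussed in paragraph [0034].
# Sizes are illustrative assumptions; Tacotron 2-style models commonly
# embed input symbols into 512-dimensional vectors.
import torch
import torch.nn as nn

vocab_size = 80      # e.g. characters plus punctuation (assumed)
embedding_dim = 512  # dense vector size (assumed)

embedding = nn.Embedding(vocab_size, embedding_dim)
char_ids = torch.tensor([[12, 5, 41, 41, 8]])  # a dummy encoded string
dense = embedding(char_ids)
print(dense.shape)  # torch.Size([1, 5, 512])
```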

[0035] Some of the features of the present invention may be implemented using a two-stage Text-to-Speech synthesis model based on the Tacotron 2 and WaveGlow architectures. Tacotron 2 is a neural network architecture that generates Mel-spectrogram frames directly from input text using an encoder-decoder architecture, whereas WaveGlow is a flow-based model that utilizes Mel-spectrogram frames to generate output speech. It can be appreciated by a person skilled in the art that other architectures used in existing Text-to-Speech systems are also available, for example, FastSpeech 2, FastPitch, Transformer-TTS, MelGAN, HiFi-GAN, StyleMelGAN, etc.

[0036] In some embodiments, a Tacotron 2 speech synthesis model may be implemented to generate a Mel-spectrogram from the input text. Tacotron 2 is a neural network architecture for Text-to-Speech synthesis and mainly includes a recurrent sequence-to-sequence feature prediction network that predicts a sequence of Mel-spectrogram frames from an input character sequence. It consists of an encoder, which creates an internal representation of the input character sequence, and a decoder, which turns this representation into a Mel-spectrogram.

[0037] In some embodiments, a WaveGlow generative model may be implemented to generate speech output from a Mel-spectrogram. WaveGlow is a flow-based, modified WaveNet vocoder that generates time-domain waveform samples conditioned on the predicted Mel-spectrogram frames. WaveGlow utilizes a single network and can be trained using only a single cost function, which makes the training procedure simple and stable.
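
A non-limiting inference sketch of this two-stage pipeline is given below using the publicly available NVIDIA reference checkpoints on PyTorch Hub; the English voice and the CUDA device are assumptions made purely to illustrate the two stages, not the invention's trained models:

```python
# Sketch: two-stage inference (text -> Mel-spectrogram -> waveform)
# as described in paragraphs [0036]-[0037], using public reference
# checkpoints. A CUDA-capable GPU is assumed.
import torch

hub = 'NVIDIA/DeepLearningExamples:torchhub'
tacotron2 = torch.hub.load(hub, 'nvidia_tacotron2', model_math='fp32').to('cuda').eval()
waveglow = torch.hub.load(hub, 'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()
utils = torch.hub.load(hub, 'nvidia_tts_utils')

sequences, lengths = utils.prepare_input_sequence(["hello world"])
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # stage 1: text to Mel frames
    audio = waveglow.infer(mel)                      # stage 2: Mel frames to waveform
```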

[0038] In a preferred embodiment, the Text-to-Speech system implements a training module. The training module is first pre-trained on English data on a primary TTS model. Further, the training module provides an input text to pass through an existing secondary TTS model to generate one or more audio samples corresponding to the input text in the voice of the speaker included in said secondary TTS model. This creates one or more audio-text pairs that are further utilized to train the primary TTS model. In an embodiment, the audio-text pairs may be called a synthetic corpus comprising the synthetic audio data and the input text. Furthermore, one or more audio recordings of a desired speaker of choice are utilized to fine-tune the primary TTS model. During the fine-tuning process, the encoder is frozen, and the decoder is trained on target speaker data comprising a plurality of audio recordings of a target speaker of desired choice. In another embodiment, the primary TTS model is an English TTS model and the secondary TTS model is an existing out-of-box Hindi TTS model.

[0039] The training module implements a three-step process. In Fig. 1, reference numeral 110 illustrates the first step, wherein the model is first trained with high resource English language data to build an English TTS model; once such a model is trained with said data, the embedding layer of this model may be discarded and the remaining weights are used in the next stage of training. The weights of the embedding layer are discarded after the first step because the character sets of the English and Hindi data do not match. In the first step (110), English data in Roman script is used, whereas in the second step (120) and third step (130), Hindi data in Devanagari script is used; the embedding layer weights are trained afresh in the second step. At the second step (120), the model is re-trained with synthetic data generated by an existing Hindi TTS model. This synthetic Hindi data is utilized to train the model in order to minimize the dependency on the availability of data for a low resource language such as Hindi. At the third step (130), while training the model on limited real target speaker data, the model freezes the text encoder weights such that the encoder is not trained during the fine-tuning of the decoder. During the fine-tuning step, the decoder is configured to train on the target speaker data so as to select a target speaker of desired choice. The real target speaker data is a limited set of Hindi audio recordings in the voice of a desired speaker.
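
The three-step schedule may be sketched in PyTorch as follows; the toy network, checkpoint path and module names (embedding, encoder, decoder) are assumptions made for illustration and do not reproduce the actual model of the invention:

```python
# Sketch of the transfer-learning schedule in paragraph [0039]:
# (1) reuse English weights but discard the character embedding,
# (2) re-train on synthetic Hindi data (training loop omitted),
# (3) freeze the encoder and fine-tune only the decoder.
import torch
import torch.nn as nn

class ToyTTS(nn.Module):  # stand-in for a Tacotron 2-style network
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(80, 512)
        self.encoder = nn.LSTM(512, 256, batch_first=True)
        self.decoder = nn.LSTM(256, 80, batch_first=True)

model = ToyTTS()

# Step 1 -> 2: drop the Roman-script embedding whose vocabulary does
# not match the Devanagari targets; keep all remaining weights.
state = torch.load("english_tts.pt", map_location="cpu")  # path assumed
state = {k: v for k, v in state.items() if not k.startswith("embedding.")}
model.load_state_dict(state, strict=False)  # embedding re-initialised

# Step 3: freeze the text encoder; only the decoder is updated on the
# limited real recordings of the target speaker.
for p in model.encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```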

[0040] The present invention provides a technical solution to the problem of how to build a low resource Text-to-Speech synthesis system that provides output speech using limited low resource training data. The present invention provides a technical solution for converting text inputs into speech output in a low resource language in the voice of a desired speaker. Thus, as described in the present invention, a TTS system is provided that performs input text to monolingual speech conversion with limited availability of low resource language training data. Synthetic data may be generated in a domain of choice, for example, an e-commerce interface or a virtual medical consultation portal/application; an input text may be provided and corresponding synthetic audio data may be generated. This is helpful because specific domains such as virtual medical consultation may have a specific vocabulary that is difficult to pronounce. Data may thus be generated in the target domain of the final application (for example, a voice assistant), and training on such domain-specific data is always more helpful than training on out-of-domain data.

[0041] In an embodiment, the present invention discloses a Text-to-Speech system comprising: a first database for receiving data in a first language; an audio corpus comprising a plurality of audio samples in the first language and a second low resource language, and synthetic audio data; and a training module adapted to process the data received from the first database and the audio corpus.

[0042] In the embodiments, the training module comprises an encoder and a decoder, wherein the training module is configured to train with the first language and the audio sample corresponding to the first language, and the encoder is further configured to discard one or more embedding weights associated with the first language.

[0043] In the embodiments, the training module is further configured to train with the synthetic audio data in an existing model configured to process data in the second language.

[0044] In the embodiments, the training module provides a fine tuning of the decoder, wherein the fine tuning includes training of the decoder with the audio sample corresponding to the second language.

[0045] In the embodiments, the training module is further configured to select an audio sample of a target speaker and discontinue the training of the encoder.

[0046] In the embodiments, the first language is English language and the second language is Hindi language.

[0047] In the embodiments, the audio sample of the target speaker is the output audio in the voice of a desired speaker of choice.

[0048] According to an aspect of the invention, the converted output speech is emitted by a playback means operably coupled to the processor, wherein the playback means is selected from, but not limited to, a speaker, mobile speaker, wireless speaker, Bluetooth speaker, sound bar, etc.

[0049] In an embodiment, a method of building a text-to-speech system for a low resource language may be implemented. The method includes at least the steps of:
i. receiving data in a first language through a first database;
ii. training by an encoder in a training module with the data received from the first database and a plurality of audio samples corresponding to the first language from an audio corpus;
iii. discarding one or more embedding weights associated with the first language by the encoder;
iv. training the training module with synthetic audio data in an existing model configured to process data in a second language; and
v. providing by the training module a fine tuning of a decoder, wherein the fine tuning includes training of the decoder to select an audio sample of a target speaker from the audio corpus and discontinue the training of the encoder.

[0050] In an embodiment, the present invention may include a computer program product that may comprise computer-readable instructions for causing a processor to carry out aspects of the present invention on a computer or a processing device.

[0051] While this invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims.
Claims:

WE CLAIM:

1. A text-to-speech system comprising:
i. a first database for receiving data in a first language;
ii. an audio corpus comprising a plurality of audio samples in the first language and a second low resource language, and synthetic audio data; and
iii. a training module adapted to process the data received from the first database and the audio corpus,
wherein
the training module comprises an encoder and a decoder, wherein
• the training module is configured to train with the first language and the audio sample corresponding to the first language, and the encoder is further configured to discard one or more embedding weights associated with the first language,
• the training module is further configured to train with the synthetic audio data in an existing model configured to process data in the second language,
• the training module provides a fine tuning of the decoder, wherein the fine tuning includes training of the decoder with the audio sample corresponding to the second language, and
• the training module is further configured to select an audio sample of a target speaker and discontinue the training of the encoder.

2. The system as claimed in claim 1, wherein the first language is English language.

3. The system as claimed in claim 1, wherein the second language is Hindi language.

4. The system as claimed in claim 1, wherein the audio sample of the target speaker is the output audio in the voice of a desired speaker of choice.

5. A method of building a text-to-speech system for a low resource language as claimed in claim 1, wherein the method comprises the steps of:
i. receiving data in a first language through a first database;
ii. training by an encoder in a training module with the data received from the first database and a plurality of audio samples corresponding to the first language from an audio corpus;
iii. discarding one or more embedding weights associated with the first language by the encoder;
iv. training the training module with synthetic audio data in an existing model configured to process data in the second language; and
v. providing by the training module a fine tuning of a decoder, wherein the fine tuning includes training of the decoder to select an audio sample of a target speaker from the audio corpus and discontinue the training of the encoder.

6. The method as claimed in claim 5, wherein the first database receives English language data.

7. The method as claimed in claim 6, wherein the second language and the target speaker data are Hindi.

8. A computer program product comprising computer-readable instructions for implementing the method of claim 5 on a computer or a processing device.

Documents

Application Documents

# Name Date
1 202341078243-STATEMENT OF UNDERTAKING (FORM 3) [17-11-2023(online)].pdf 2023-11-17
2 202341078243-REQUEST FOR EXAMINATION (FORM-18) [17-11-2023(online)].pdf 2023-11-17
3 202341078243-REQUEST FOR EARLY PUBLICATION(FORM-9) [17-11-2023(online)].pdf 2023-11-17
4 202341078243-PROOF OF RIGHT [17-11-2023(online)].pdf 2023-11-17
5 202341078243-POWER OF AUTHORITY [17-11-2023(online)].pdf 2023-11-17
6 202341078243-FORM-9 [17-11-2023(online)].pdf 2023-11-17
7 202341078243-FORM 18 [17-11-2023(online)].pdf 2023-11-17
8 202341078243-FORM 1 [17-11-2023(online)].pdf 2023-11-17
9 202341078243-DRAWINGS [17-11-2023(online)].pdf 2023-11-17
10 202341078243-DECLARATION OF INVENTORSHIP (FORM 5) [17-11-2023(online)].pdf 2023-11-17
11 202341078243-COMPLETE SPECIFICATION [17-11-2023(online)].pdf 2023-11-17
12 202341078243-FER.pdf 2025-05-13
13 202341078243-OTHERS [27-06-2025(online)].pdf 2025-06-27
14 202341078243-FORM-26 [27-06-2025(online)].pdf 2025-06-27
15 202341078243-FER_SER_REPLY [27-06-2025(online)].pdf 2025-06-27
16 202341078243-COMPLETE SPECIFICATION [27-06-2025(online)].pdf 2025-06-27
17 202341078243-CLAIMS [27-06-2025(online)].pdf 2025-06-27
18 202341078243-US(14)-HearingNotice-(HearingDate-10-11-2025).pdf 2025-10-09
19 202341078243-US(14)-ExtendedHearingNotice-(HearingDate-12-11-2025)-1100.pdf 2025-11-06
20 202341078243-Correspondence to notify the Controller [07-11-2025(online)].pdf 2025-11-07
21 202341078243-Annexure [07-11-2025(online)].pdf 2025-11-07

Search Strategy

1 202341078243_SearchStrategyNew_E_saerch_texttospeechE_24-02-2025.pdf