Abstract: A method and speech synthesis system (101) for controlling speech characteristics in speech synthesis systems is disclosed. The speech synthesis system (101) receives input data comprising text and expression data. The speech synthesis system identifies one of a predefined audio template or an audio embedding based on the expression data and generates an audio recording for the text by generating expressions based on the expression data, using one of the predefined audio template or the audio embedding, using a pretrained speech control model. The speech control model is trained using a dataset comprising a plurality of training texts and a portion of a training audio. Essentially, the speech control model is trained by disentangling at least one of time duration, emotion or prosodic information between the training text and the portion of the training audio. The portion of the training audio is one portion of a plurality of portions of the training audio, having an expression of the expression data, used for training the speech control model. Fig.1A-1B
Claims: We claim:
1. A method of controlling speech characteristics in speech synthesis systems, the method comprising:
receiving, by a speech synthesis system (101), an input data from a user device (103), wherein the input data comprises a text and an expression data;
identifying, by the speech synthesis system (101), one of a predefined audio template or an audio embedding based on the expression data; and
generating, by the speech synthesis system (101), an audio recording for the text by generating expressions based on the expression data using one of the predefined audio template or the audio embedding using a pretrained speech control model (207), wherein the speech control model (207) is trained using a dataset comprising a plurality of training texts and a portion of a training audio,
wherein the portion of the training audio is one portion of a plurality of portions of the training audio, having an expression of the expression data, used for training the speech control model (207).
2. The method as claimed in claim 1, wherein the expression data comprises information regarding expression in which the text is to be recorded.
3. The method as claimed in claim 1, wherein the expression comprises emotions, voice texture and characteristics, pitch, duration, loudness, timbre, and prosody elements; wherein the emotions comprise happy, angry, polite, sad, fear and surprise.
4. The method as claimed in claim 1, wherein the input data is received from a Natural Language Understanding (NLU) unit of the user device (103).
5. The method as claimed in claim 1, wherein the speech control model (207) is trained using a loss function computed between an output of the speech control model (207) generated using the portion of the training audio and an entire audio recording associated with the portion of the training audio.
6. The method as claimed in claim 1, wherein the dataset is associated with a plurality of expressions and the portion of the corresponding audio for each training text is selected based on a time period of the entire audio associated with the training text.
7. The method as claimed in claim 1, wherein the training comprises learning and forming, by the speech control model (207), a plurality of clusters of similar expressions in the dataset.
8. The method as claimed in claim 1, wherein the audio embedding is identified from a plurality of clusters of similar expressions.
9. The method as claimed in claim 1, wherein the dataset for training the speech control model (207) is generated by:
obtaining a text corpus comprising at least one of text and audio associated with a domain from one or more sources and performing text normalization to remove errors from the text;
splitting the normalized text corpus into a predefined number of paragraphs based on a reading capability of a voice artist, wherein the voice artist performs auto-recording of the number of paragraphs using one or more beacons to auto-correct the recorded audio; and
splitting the paragraphs into a predefined number of sentences by aligning the paragraphs with audio at a word level for training the speech control model using an Automatic Speech Recognition (ASR) technique.
10. The method as claimed in claim 9, wherein the text normalization is performed using normalization rules, wherein the normalization rules are generated based on parameters for multilingual languages comprising alphabets, abbreviations, letter sequences, numbers, cardinal numbers, ordinal numbers, number ranges, and formats and representations for money, dates, percentages, scientific numbers, telephone numbers and alphanumericals.
11. The method as claimed in claim 1, wherein the speech control model (207) is trained to learn phonetic information from the training text and acoustic features from the portion of the training audio.
12. The method as claimed in claim 11, wherein the phonetic information is learned based on phoneme rules comprising suffix rules, prefix rules, syllabic rules, and pattern rules associated with different accents.
13. The method as claimed in claim 1, further comprising:
controlling pauses in the dataset by identifying and categorizing silent regions in the dataset during training into a plurality of clusters based on duration, and
mapping each of the plurality of clusters to unique silence phones.
14. The method as claimed in claim 1, wherein training the speech control model (207) comprises disentangling at least one of time duration, emotion or prosodic information between the training text and the portion of the training audio.
15. The method as claimed in claim 1, wherein the audio embedding comprises a plurality of clusters of similar expressions.
16. The method as claimed in claim 1, wherein generating expressions based on the expression data further comprises concatenating different expressions for different words in the text.
17. A speech synthesis system (101) for controlling speech characteristics, comprising:
a processor (113); and
a memory (111) communicatively coupled to the processor (113), wherein the memory (111) stores processor instructions, which, on execution, cause the processor (113) to:
receive an input data from a user device (103), wherein the input data comprises a text and an expression data,
identify one of a predefined audio template or an audio embedding based on the expression data; and
generate an audio recording for the text by generating expressions based on the expression data using one of the predefined audio template or the audio embedding using a pretrained speech control model (207), wherein the speech control model (207) is trained using a dataset comprising a plurality of training texts and a portion of a training audio,
wherein the portion of the training audio is one portion of a plurality of portions of the training audio, having an expression of the expression data, used for training the speech control model (207).
18. The speech synthesis system (101) as claimed in claim 17, wherein the expression data comprises information regarding expression in which the text is to be recorded.
19. The speech synthesis system (101) as claimed in claim 17, wherein the expression comprises emotions, voice texture and characteristics, pitch, duration, loudness, timbre, and prosody elements; wherein the emotions comprise happy, angry, polite, sad, fear and surprise.
20. The speech synthesis system (101) as claimed in claim 17, wherein the input data is received from a Natural Language Understanding (NLU) unit of the user device (103).
21. The speech synthesis system (101) as claimed in claim 17, wherein the processor (113) trains the speech control model (207) using a loss function computed between an output of the speech control model generated using the portion of the training audio and an entire audio recording associated with the portion of the training audio.
22. The speech synthesis system (101) as claimed in claim 17, wherein the processor (113) selects the dataset associated with a plurality of expressions and the portion of the corresponding audio for each training text based on a time period of the entire audio associated with the training text.
23. The speech synthesis system (101) as claimed in claim 17, wherein the training comprises learning and forming, by the speech control model (207), a plurality of clusters of similar expressions in the dataset.
24. The speech synthesis system (101) as claimed in claim 17, wherein the audio embedding is identified from a plurality of clusters of similar expressions.
25. The speech synthesis system (101) as claimed in claim 17, wherein the processor (113) generates the dataset for training the speech control model (207) by:
obtaining a text corpus comprising at least one of text and audio associated with a domain from one or more sources and performing text normalization to remove errors from the text;
splitting the normalized text corpus into a predefined number of paragraphs based on a reading capability of a voice artist, wherein the voice artist performs auto-recording of the number of paragraphs using one or more beacons to auto-correct the recorded audio; and
splitting the paragraphs into a predefined number of sentences by aligning the paragraphs with audio at a word level for training the speech control model using an Automatic Speech Recognition (ASR) technique.
26. The speech synthesis system (101) as claimed in claim 25, wherein the processor (113) performs the text normalization using normalization rules, wherein the normalization rules are generated based on parameters for multilingual languages comprising alphabets, abbreviations, letter sequences, numbers, cardinal numbers, ordinal numbers, number ranges, and formats and representations for money, dates, percentages, scientific numbers, telephone numbers and alphanumericals.
27. The speech synthesis system (101) as claimed in claim 17, wherein the speech control model (207) is trained to learn phonetic information from the training text and acoustic features from the portion of the training audio.
28. The speech synthesis system (101) as claimed in claim 27, wherein the phonetic information is learned based on phoneme rules comprising suffix rules, prefix rules, syllabic rules, and pattern rules associated with different accents.
29. The speech synthesis system (101) as claimed in claim 17, further comprising:
controlling pauses in the dataset by identifying and categorizing silent regions in the dataset during training into a plurality of clusters based on duration, and
mapping each of the plurality of clusters to unique silence phones.
30. The speech synthesis system (101) as claimed in claim 17, wherein the processor (113) trains the speech control model (207) by disentangling at least one of duration, emotion or prosodic information between the training text and the portion of the training audio.
31. The speech synthesis system (101) as claimed in claim 17, wherein the audio embedding comprises a plurality of clusters of similar expressions.
32. The speech synthesis system (101) as claimed in claim 17, wherein the processor (113) generates expressions based on the expression data by concatenating different expressions for different words in the text.
Description:
TECHNICAL FIELD
[001] The present subject matter is related in general to speech synthesis systems, and more particularly, but not exclusively, to a method and system for controlling speech characteristics in speech synthesis systems.
BACKGROUND
[002] In recent years, text-to-speech technology has achieved significant progress and is an active area of research and development in providing different human-computer interactive systems. Modern text-to-speech systems are able to produce natural and high-quality speech, but speech contains factors of variation (e.g., emotion, pitch, duration, loudness, timbre) that text alone cannot convey.
[003] Also, there exists a lack of publicly available information, techniques, and strategies for audio data collection for speech synthesis purposes. Existing systems only mention dataset statistics, such as the number of speakers and total hours used to train speech synthesis models, but fail to disclose details on the acoustic nature of the dataset or the process and system used to procure it. In addition, there exists a lack of publicly available information on developing text normalization rules to remove ambiguities from texts and help synthesis models pronounce accurately. Existing systems do not include any details on normalizing multilingual (code-mixed) text.
[004] Further, existing systems lack publicly available information on developing and evaluating rules to map words to their underlying phones in new accents/languages and on testing them on large-scale vocabularies to be used in real-time applications. Currently, different supervised and unsupervised frameworks have been explored to enable controllability of expressions in speech synthesis. However, these frameworks have the drawback of not being able to control expressions (emotions, pitch, duration, loudness, timbre, etc.) at a finer resolution, and most often apply expressions at a sentence/recording level. Thus, existing speech synthesis models fail to maintain consistent speech rates when synthesizing longer text (containing multiple sentences); the speech rates tend to vary randomly, leading to an unnatural perception. In addition, existing speech synthesis models do not always support pauses of varying duration based on user requirements, and where they do, it is typically through a signal processing approach applied after synthesis.
[005] The information disclosed in this background of the disclosure section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
SUMMARY
[006] In an embodiment, the present disclosure may relate to a method for controlling speech characteristics in speech synthesis systems. The method includes receiving input data from a user device, the input data comprising a text and an expression data. The method includes identifying one of a predefined audio template or an audio embedding based on the expression data and generating an audio recording for the text by generating expressions based on the expression data, using one of the predefined audio template or the audio embedding, using a pretrained speech control model. The speech control model is trained using a dataset comprising a plurality of training texts and a portion of a training audio. The portion of the training audio is one portion of a plurality of portions of the training audio, having an expression of the expression data, used for training the speech control model.
[007] In an embodiment, the present disclosure may relate to a speech synthesis system for controlling speech characteristics in speech synthesis systems. The speech synthesis system may comprise a processor and a memory communicatively coupled to the processor, where the memory stores processor-executable instructions, which, on execution, may cause the speech synthesis system to receive input data from a user device, the input data comprising a text and an expression data. The speech synthesis system identifies one of a predefined audio template or an audio embedding based on the expression data and generates an audio recording for the text by generating expressions based on the expression data, using one of the predefined audio template or the audio embedding, using a pretrained speech control model. The speech control model is trained using a dataset comprising a plurality of training texts and a portion of a training audio. The portion of the training audio is one portion of a plurality of portions of the training audio, having an expression of the expression data, used for training the speech control model.
[008] The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
[009] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of system and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figures, in which:
[010] Fig.1A illustrates an exemplary embodiment for controlling speech characteristics in speech synthesis system in accordance with some embodiments of the present disclosure;
[011] Fig.1B illustrates an exemplary embodiment for data collection for training speech synthesis system in accordance with some embodiments of the present disclosure;
[012] Fig.2 shows a detailed block diagram of a speech synthesis system in accordance with some embodiments of the present disclosure;
[013] Fig.3A shows an exemplary block for training speech control model in accordance with some embodiments of the present disclosure;
[014] Fig.3B shows an exemplary graphical representation for audio embeddings in accordance with some embodiments of the present disclosure;
[015] Fig.4A-4B show an exemplary speech control model for audio template and audio embeddings, respectively in accordance with some embodiments of the present disclosure;
[016] Fig.5A-5B show exemplary representations for single expression embedding and multiple expression embedding, respectively in accordance with some embodiments of the present disclosure;
[017] Fig.6A-6B show exemplary representations for controlling speech characteristics in a voice assistance system in accordance with some embodiments of the present disclosure;
[018] Fig.7 illustrates a flowchart showing a method for controlling speech characteristics in speech synthesis system in accordance with some embodiments of present disclosure; and
[019] Fig.8 illustrates a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
[020] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[021] In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
[022] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and the scope of the disclosure.
[023] The terms “comprises,” “comprising,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device, or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup, device, or method. In other words, one or more elements in a system or apparatus preceded by “comprises… a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or method.
[024] In the following detailed description of the embodiments of the disclosure, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.
[025] Embodiments of the present disclosure may relate to a method and a speech synthesis system for controlling speech characteristics in speech synthesis systems. Speech synthesis systems perform artificial production of human speech; for example, a Text-to-Speech (TTS) system converts normal language text into speech, while other systems render symbolic linguistic representations, such as phonetic transcriptions, into speech. Speech synthesis systems are used in various applications such as voice cloning, voice assistance systems, TTS applications, and the like. Accordingly, based on the different applications and associated domains, collection of a dataset is required for training the systems. Currently, there exists a lack of publicly available information, techniques, and strategies for audio data collection for speech synthesis purposes. Existing systems only mention dataset statistics, such as the number of speakers and total hours used to train speech synthesis models, but fail to disclose details on the acoustic nature of the dataset or the process and system used to procure it.
[026] Also, in any sentence/phrase, expressions and emphasis of words play a crucial role. A sentence can be pronounced in many different ways based on different voice characteristics. Hence, audio/speech data provides additional aspects which may be controlled suitably during the speech synthesis process. Currently, different supervised and unsupervised frameworks have been explored to enable controllability of expressions in speech synthesis. However, these frameworks have the drawback of not being able to control expressions (emotions, pitch, speech rate, amplitude, etc.) at a finer resolution, and most often apply expressions at a sentence/recording level. Thus, existing speech synthesis models fail to maintain consistent speech rates when synthesizing longer text (containing multiple sentences); the speech rates tend to vary randomly, leading to an unnatural perception.
[027] The present disclosure resolves this problem by providing a method and system to effectively control the expressiveness of speech synthesis systems using audio templates from training data and enabling a transfer of the acoustic characteristics associated with a selected audio template to the audio to be synthesized, using a speech control model. The speech control model is trained using a dataset which includes a training text and a portion of a training audio, where the portion of the training audio is one portion of a plurality of portions of the training audio with the required acoustic characteristics. In addition, the present disclosure discloses a dataset collection process, including details on the acoustic nature of the dataset. Thus, the present disclosure enables learning of better acoustic features that are not phone-related from audio templates, and further helps in generating a consistent speech rate during synthesis irrespective of text length.
[028] Fig.1A illustrates an exemplary environment for controlling speech characteristics in speech synthesis system in accordance with some embodiments of the present disclosure.
[029] As shown in Fig.1A, an environment 100 includes a speech synthesis system 101 connected via a communication network 105 with a plurality of user devices 103 (such as a user device 1031, a user device 1032, ..., a user device 103N). In an embodiment, the speech synthesis system 101 may be connected to a database (not shown explicitly in Fig.1A). A person skilled in the art would understand that the environment 100 may also include any other units not mentioned explicitly in the present disclosure.
[030] The speech synthesis system 101 is a system for controlling speech characteristics and is used in different speech related applications such as, voice cloning, voice bot assistance, Text to Speech (TTS) models and the like. In an embodiment, the speech synthesis system 101 may be configured with/within any other system (not shown explicitly in Fig.1A). The speech synthesis system 101 may include an I/O interface 109, a memory 111 and a processor 113. The I/O interface 109 may be configured to receive data from the plurality of user devices 103. The data from the I/O interface 109 may be stored in the memory 111. The memory 111 may be communicatively coupled to the processor 113 of the speech synthesis system 101. The memory 111 may also store processor instructions which may cause the processor 113 to execute the instructions for controlling speech characteristics.
[031] The speech synthesis system 101 may receive input data from a user device of the plurality of user devices 103. The input data includes a text and an expression data. The expression data indicates information regarding the expressions in which the text is to be recorded. In an embodiment, the expression may include, but is not limited to, emotions, voice texture and characteristics, pitch, duration, loudness, timbre, and prosody elements, where the emotions may include happy, angry, polite, sad, fear, surprise, and the like. A person skilled in the art would understand that any other expression or emotion, not mentioned explicitly in the present disclosure, may also be included in the expression data. For example, in a voice cloning instance, the expression data indicating the expressions required for a text to be cloned is received from the user device.
[032] In another embodiment, the expression data may be obtained/determined based on the text or response to be provided to a user of the user device, for example, when the input data is a query in a voice assistant system. In such a case, the expression data may be obtained based on a response to the query, i.e., the text. In one example, a Natural Language Understanding (NLU) unit configured in the speech synthesis system 101 may generate the response to the query and the associated expression data. In an embodiment, the NLU unit may be trained to obtain the expression data from a plurality of expressions prestored in the speech synthesis system 101.
[033] On receiving the input data, the speech synthesis system 101 may identify one of a predefined audio template or an audio embedding based on the expression data. Essentially, the audio template or audio embedding which represents the expressions indicated in the expression data is identified. The audio template or the audio embedding is stored during a training phase. In an embodiment, the audio template may be a recording of any text with the required expression as indicated in the expression data, while the audio embedding is identified from prestored embeddings comprising a plurality of clusters of similar expressions.
[034] Once the audio template or the audio embedding is identified, the speech synthesis system 101 generates an audio recording for the text from the input data by generating expressions, using one of the audio template or the audio embedding, using a speech control model. The speech control model is a deep neural network model and is trained in advance using deep neural network techniques. The speech control model is trained using a dataset comprising a plurality of training texts and a portion of a training audio, where the portion of the training audio is one portion of a plurality of portions of the training audio having an expression of the expression data. Unlike existing systems, where a one-to-one mapping is created between the text and the audio template, the speech control model is trained using only a portion of the training audio in order to learn phonetic information from the training text and acoustic features from the audio portion. The phonetic information may be learned based on phoneme rules comprising suffix rules, prefix rules, syllabic rules, and pattern rules associated with different accents. Further, the speech control model is trained for learning and forming a plurality of clusters of similar expressions in the dataset.
[035] Particularly, the speech control model is trained using loss information computed between an output of the speech control model generated using the portion of the audio and the entire audio recording associated with the portion of the audio. Essentially, the speech control model is trained by disentangling at least one of time duration, emotion or prosodic information between the training text and the portion of the training audio. As a result of disentangling the training text and the portion of the audio data, the present disclosure allows the speech control model to learn better acoustic features which are not phone-related from the audio template, and further helps in generating a consistent speech rate during synthesis irrespective of the text length.
[036] Further, the dataset used for training the speech control model is associated with a plurality of expressions, and the portion of the corresponding audio for each training text is selected based on a time period of the entire audio associated with the training text. In an embodiment, the time period may be selected using various techniques. For instance, from the entire/original audio, a start time and an end time may be selected randomly. Thus, selecting portions of audio using different time periods during training trains the speech control model with different portions and durations of the same audio recording, which makes the speech control model generic and avoids any kind of overfitting. The dataset for the training is collected during a data collection process.
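A minimal sketch of the random portion selection described above, assuming the training audio is represented as a mel-spectrogram stored in a NumPy array; the function name and the minimum-fraction parameter are illustrative assumptions rather than part of the disclosure:

```python
import numpy as np

def sample_audio_portion(spectrogram, min_fraction=0.3, rng=None):
    """Return a randomly chosen contiguous portion of a (frames, mel_bins)
    spectrogram. A different start/end is drawn on every pass, so the model
    sees many portions and durations of the same recording and does not
    overfit to any single alignment between text and audio."""
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spectrogram.shape[0]
    min_len = max(1, int(min_fraction * n_frames))
    length = int(rng.integers(min_len, n_frames + 1))    # random duration
    start = int(rng.integers(0, n_frames - length + 1))  # random start time
    return spectrogram[start:start + length]

full_spectrogram = np.random.rand(400, 80)               # stand-in for a real recording
portion = sample_audio_portion(full_spectrogram)
print(portion.shape)                                     # e.g. (253, 80)
```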
[037] Fig.1B illustrates an exemplary embodiment for data collection for training the speech synthesis system in accordance with some embodiments of the present disclosure. As shown, Fig.1B represents a data collection device 114 which may or may not be a part of the speech synthesis system 101. The data collection device 114 includes various components such as a text normalization unit 115, a text paragraph splitter 116, a display 117, an audio post-processing unit 118, an audio paragraph splitter 119 and an Automatic Speech Recognition (ASR) unit 121. In an embodiment, the data collection device 114 enables collection of recordings across multiple studios and with multiple voice-artists simultaneously, and allows creation of single-speaker, multi-speaker, and multilingual speech datasets rapidly. Initially, as shown, a text corpus comprising at least one of text and audio associated with a domain is obtained from one or more sources and provided to the text normalization unit 115 to perform text normalization and remove errors from the text, in order to avoid ambiguity in the text for a voice-artist while recording.
[038] For instance, a number such as “343” can be spoken in multiple ways, like “three four three”, “three hundred and forty-three” or “three forty-three”. The text normalization unit 115 removes such ambiguities by textually representing such numbers, dates, abbreviations, and the like. Particularly, the text normalization unit 115 uses normalization rules which are generated based on parameters for multilingual languages comprising alphabets, abbreviations, letter sequences, numbers, cardinal numbers, ordinal numbers, number ranges, formats and representations for money, dates, percentages, scientific numbers, telephone numbers, alphanumericals and the like. Further, the text paragraph splitter 116 splits the normalized text corpus into a predefined number of paragraphs based on a reading capability of the voice artist. The predefined number of paragraphs is displayed on the display 117. In an embodiment, paragraph-wise text is also one of the outputs of a recording session. In an embodiment, the audio recording is kept on from the moment the voice-artist starts recording and is turned off only at the end of the recording session.
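The following is a hedged, English-only sketch of such rule-based normalization; the specific rules, their ordering, and the digit-wise expansion of numbers are illustrative assumptions rather than the full multilingual rule set of the text normalization unit 115:

```python
import re

# Illustrative (hypothetical) normalization rules: each rule is a regex and a
# replacement that rewrites the match into its spoken form.
_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def _number_to_words(match):
    # Simplified cardinal expansion: "343" -> "three four three" (digit-wise),
    # one of the unambiguous spoken forms mentioned above.
    return " ".join(_ONES[int(d)] for d in match.group())

NORMALIZATION_RULES = [
    (re.compile(r"\bDr\."), "Doctor"),          # abbreviation rule
    (re.compile(r"(\d+)%"), r"\1 percent"),     # percentage rule
    (re.compile(r"\d+"), _number_to_words),     # number rule (applied last)
]

def normalize(text):
    for pattern, replacement in NORMALIZATION_RULES:
        text = pattern.sub(replacement, text)
    return text

print(normalize("Dr. Smith paid 343% more"))
# -> "Doctor Smith paid three four three percent more"
```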
[039] After successfully reading each paragraph, the voice-artist marks the completion with an audio beacon, which ends up being recorded in the audio recording. That is, the voice artist performs auto-recording of the number of paragraphs using one or more beacons to auto-correct the recorded audio. In an embodiment, the audio beacon is played back using either the same device used as the text display or a handy button that is connected to the display 117. Additionally, playing the audio beacon updates the display 117 with the next paragraph to read. For unsuccessful attempts, the voice-artist plays back a separate beacon, which retains the paragraph on the display 117 so that the voice-artist is allowed to perform a successful reading. After the audio beacons are detected, session-wise recordings are split into usable paragraph-wise recordings and unsuccessful attempts are discarded. The audio post-processing unit 118 pre-processes the paragraph-wise recordings to remove the session-wise background noise from the recordings, trim silences at the beginning and end of the recordings, normalize loudness and apply any necessary filters across recording sessions. Thereafter, sentence-level recordings are obtained by aligning the paragraph-wise text to the audio at a word level using a forced alignment technique of Automatic Speech Recognition (ASR). Thereafter, the paragraphs are split by the audio paragraph splitter 119 and the ASR unit 121 into a predefined number of sentences by aligning the paragraphs with audio at a word level for training the speech control model. Further, the data collection device 114 may control pauses in the dataset by identifying and categorizing silent regions in the dataset during training into a plurality of clusters based on duration, and mapping each of the plurality of clusters to unique silence phones.
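A minimal sketch of the pause-control step, assuming the silent-region durations have already been extracted (e.g., from the forced alignment) and using scikit-learn's KMeans as one possible clustering choice; the number of clusters and the SIL-phone naming are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def map_silences_to_phones(silence_durations_sec, n_clusters=4, seed=0):
    """Cluster silent-region durations and map each cluster to a unique
    silence phone symbol (e.g. SIL0 for the shortest pauses, SIL3 for the
    longest), so pause length becomes a controllable token in the phone
    sequence."""
    durations = np.asarray(silence_durations_sec, dtype=float).reshape(-1, 1)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(durations)
    # Order clusters by mean duration so the phone indices are interpretable.
    order = np.argsort(kmeans.cluster_centers_.ravel())
    rank = {cluster: i for i, cluster in enumerate(order)}
    return [f"SIL{rank[label]}" for label in kmeans.labels_], kmeans

phones, _ = map_silences_to_phones([0.05, 0.08, 0.3, 0.35, 0.9, 1.2, 0.07, 0.4])
print(phones)   # e.g. ['SIL0', 'SIL0', 'SIL1', 'SIL1', 'SIL2', 'SIL3', 'SIL0', 'SIL1']
```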
[040] Fig.2 shows a detailed block diagram of a speech synthesis system in accordance with some embodiments of the present disclosure.
[041] The speech synthesis system 101 may include data 200 and one or more modules 211 which are described herein in detail. In an embodiment, data 200 may be stored within the memory 111. The data 200 may include, for example, input data 201, template data 203, training data 205, speech control model 207, and other data 209.
[042] The input data 201 includes the input received from the user device of the plurality of user devices 103. In an embodiment, the input may be a query. The input may include the text and the information regarding the expression in which the text is to be recorded. In an embodiment, the input may also be received from the NLU unit, which may be configured either in the speech synthesis system 101 or in the user device. Consider a voice assistance application: for instance, the text may be “I cannot believe this is happening” and the expression data is “happy and fast tone.”
[043] The template data 203 may include a plurality of predefined audio templates and audio embeddings based on the training dataset. The predefined audio templates and audio embeddings are stored based on different expressions such as emotions, voice texture and characteristics, pitch, duration, loudness, timbre, and prosody elements. The audio embeddings include a plurality of clusters formed based on similar expressions in the dataset.
[044] The training data 205 includes the training dataset comprising a plurality of training texts and a plurality of portions of a training audio used individually for training the speech control model 207.
[045] The speech control model 207 is a machine learning model for controlling speech characteristics. The speech control model 207 may be a Convolutional Neural Network (CNN) model or any other combination of Deep Neural Networks (DNNs). A person skilled in the art would understand that a CNN is only an exemplary technique, and the machine learning model may also include any other combination of machine learning techniques.
[046] The other data 209 may store data, including temporary data and temporary files, generated by modules 211 for performing the various functions of the speech synthesis system 101.
[047] In an embodiment, the data 200 in the memory 111 are processed by the one or more modules 211 present within the memory 111 of the speech synthesis system 101. In an embodiment, the one or more modules 211 may be implemented as dedicated units. As used herein, the term module refers to an Application Specific Integrated Circuit (ASIC), an electronic circuit, a Field-Programmable Gate Array (FPGA), a Programmable System-on-Chip (PSoC), a combinational logic circuit, and/or other suitable components that provide the described functionality. In some implementations, the one or more modules 211 may be communicatively coupled to the processor 113 for performing one or more functions of the speech synthesis system 101. The said modules 211, when configured with the functionality defined in the present disclosure, will result in novel hardware.
[048] In one implementation, the one or more modules 211 may include, but are not limited to, a receiving module 213, a training module 215, a template identification module 217, and an audio generation module 219. The one or more modules 211 may also include other modules 221 to perform various miscellaneous functionalities of the speech synthesis system 101. The other modules 221 may include the Natural Language Understanding (NLU) module to generate the response to the query and the associated expression data. In an embodiment, the NLU module may be trained to obtain the expression data from a plurality of expressions prestored in the speech synthesis system 101. The other modules 221 may include a pause controlling module which controls pauses in the dataset by identifying and categorizing silent regions in the dataset during training into a plurality of clusters based on duration, and mapping each of the plurality of clusters to unique silence phones.
[049] The receiving module 213 may receive the input data from the user device of the plurality of user devices 103. The input data includes the text and the expression data.
[050] The training module 215 may train the speech control model 207 using the dataset comprising a plurality of training texts and the portion of a training audio. The portion of the training audio is one portion of the plurality of portions of the training audio, having an expression of the expression data, used for training the speech control model 207. The dataset is associated with the plurality of expressions, and the portion of the corresponding audio for each training text is selected based on a time period of the entire audio associated with the training text. The training module 215 trains the speech control model 207 using partial portions of an audio recording instead of a complete recording in order to disentangle at least one of time duration, emotion or prosodic information between the training text and the portion of the training audio.
[051] Fig.3A shows an exemplary block for training the speech control model in accordance with some embodiments of the present disclosure. Fig.3A shows an exemplary environment for training the speech control model 207 based on template-based synthesis, i.e., synthesis with a specific style (voice characteristic) chosen from the training dataset. As shown, the speech control model 207 is trained with a training dataset comprising the training text and audio related to different domains. Particularly, spectrograms are pre-computed from the training audio. Essentially, the dataset includes two inputs, the input text and the spectrogram matched with the text during training. Further, a variable-length spectrogram is passed through a bottleneck feature extractor to obtain a fixed-length embedding vector, which is used to predict a spectrogram with acoustic characteristics similar to the original spectrogram. Thus, a loss is computed between the predicted spectrogram and the original spectrogram, which enables learning of phonetic information from the input text and of other voice characteristics, such as voice texture, speed, and prosody elements, directly from the input spectrogram.
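A hedged PyTorch sketch of this training block, in which a GRU stands in for the bottleneck feature extractor that pools the variable-length spectrogram into a fixed-length embedding; the layer sizes and the use of an L1 reconstruction loss are illustrative assumptions, not the disclosed architecture:

```python
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    """Pools a variable-length spectrogram (batch, frames, n_mels) into a
    fixed-length expression embedding."""
    def __init__(self, n_mels=80, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, embed_dim, batch_first=True)

    def forward(self, spectrogram):
        _, last_hidden = self.rnn(spectrogram)      # (1, batch, embed_dim)
        return last_hidden.squeeze(0)               # fixed-length embedding

# During training, the loss is computed between the spectrogram predicted by
# the decoder (conditioned on the text plus this embedding) and the original.
extractor = BottleneckExtractor()
portion = torch.randn(4, 120, 80)                   # variable-length audio portion
embedding = extractor(portion)                      # (4, 128)
predicted = torch.randn(4, 200, 80)                 # stand-in for the decoder output
original = torch.randn(4, 200, 80)                  # original (full) spectrogram
loss = nn.functional.l1_loss(predicted, original)   # reconstruction loss
```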
[052] Returning to Fig.2, the training module 215 may train the speech control model 207 by learning and forming the plurality of clusters of similar expressions in the dataset. Fig.3B shows an exemplary representation for audio embeddings in accordance with some embodiments of the present disclosure. Essentially, during training, the speech control model 207 is trained with all the recordings in the dataset and may internally learn to cluster similar emotions in the dataset together in an unsupervised manner, i.e., without any explicit emotion labels. The audio embeddings, as shown, form a latent space, such as a 2D Cartesian space, with a specific range of x-axis and y-axis coordinates for different emotions in the dataset. For example, in the image as shown, each color represents an emotion that is learnt by the speech control model 207.
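A minimal sketch of grouping the learnt audio embeddings into such a 2D latent space without emotion labels, using PCA and KMeans from scikit-learn as stand-ins for whatever clustering the model learns internally; the embedding dimension and the number of expressions are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_expression_embeddings(embeddings, n_expressions=6):
    """Unsupervised grouping of audio embeddings: project to 2D (x, y)
    coordinates and cluster, without any explicit emotion labels. Each
    cluster centroid can later be reused as an 'audio embedding' for a
    given expression."""
    coords = PCA(n_components=2).fit_transform(embeddings)     # x, y coordinates
    kmeans = KMeans(n_clusters=n_expressions, n_init=10, random_state=0).fit(coords)
    return coords, kmeans.labels_, kmeans.cluster_centers_

# Usage: embeddings produced by the trained model for all recordings
embeddings = np.random.randn(500, 128).astype(np.float32)      # placeholder values
coords, labels, centroids = cluster_expression_embeddings(embeddings)
print(centroids.shape)                                          # (6, 2)
```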
[053] Fig.4A-4B show an exemplary speech control model for an audio template and audio embeddings, respectively, in accordance with some embodiments of the present disclosure. As shown, the speech control model 207 may consist of a frontend 401 and an audio generator 403, such as a vocoder. The frontend 401 maps the input text, and the audio template as shown in Fig.4A, or the audio embedding as shown in Fig.4B, to the corresponding spectrogram of the audio to be synthesized. The audio generator 403 then maps this spectrogram to audio samples which can be played back. In an embodiment, the audio generator 403 is modelled using a generative adversarial network trained with multi-resolution spectral and adversarial loss functions. In an embodiment, the frontend 401 of the speech control model 207 is built using a sequence-to-sequence framework consisting of three key layers, namely, encoders 405, such as a phoneme encoder 4051 and an audio encoder 4052, an attention layer 406 and a decoder 407.
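A hedged sketch of the multi-resolution spectral part of such a vocoder objective (the adversarial term supplied by the GAN discriminator is omitted); the FFT sizes and the log-magnitude L1 formulation are illustrative assumptions:

```python
import torch

def multi_resolution_stft_loss(generated, reference, fft_sizes=(512, 1024, 2048)):
    """Compare generated and reference waveforms at several STFT resolutions;
    the adversarial loss would be added separately by the discriminator."""
    loss = torch.zeros(())
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft)
        spec_g = torch.stft(generated, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        spec_r = torch.stft(reference, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        loss = loss + torch.nn.functional.l1_loss(torch.log(spec_g + 1e-5),
                                                  torch.log(spec_r + 1e-5))
    return loss / len(fft_sizes)

fake = torch.randn(2, 16000)    # one second of generated audio at 16 kHz
real = torch.randn(2, 16000)    # corresponding reference audio
print(multi_resolution_stft_loss(fake, real))
```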
[054] The phoneme encoder 4051 maps the inputs that are sequential in nature to a sequence in a latent space. With the help of the attention layer 406, the decoder 407 then generates a sequence of spectrogram frames of audio, one spectrogram frame after another. Specifically, when a portion of an audio is used as input, as shown in Fig.4A, the phoneme encoder 4051 and the audio encoder 4052 are used, one each for input A, which is the text, and input B.1, which is a portion of audio. Both these inputs can be considered to be sequential in nature, i.e., input A includes a sequence of phonemes/characters/words corresponding to the text, and input B.1 consists of a sequence of audio samples. In an embodiment, the encoders 405 may include separate Convolutional Recurrent Neural Networks (CRNN) for both inputs. The output of the input A CRNN encoder 4051 is a sequence of textual embeddings. However, the output of the input B CRNN encoder 4052 is a single vector of expression embedding, which is identical to the input B.2 in Fig.4B. This expression embedding contains the information of the speech characteristics which is used for generating the output audio. The expression embedding is concatenated with the textual embedding at every time step and provided to the attention layer 406 along with the decoder 407 to generate a spectrogram output having the characteristics encoded in the expression embedding. In an embodiment, the decoder 407 is implemented using stacked recurrent and convolutional neural networks.
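A minimal PyTorch sketch of the concatenation step just described, assuming the phoneme encoder output and the expression embedding are already available as tensors; the dimensions shown are illustrative:

```python
import torch

def condition_text_on_expression(text_embeddings, expression_embedding):
    """Concatenate one fixed expression embedding to every step of the
    textual embedding sequence before it reaches the attention layer.

    text_embeddings:      (batch, phoneme_steps, text_dim)
    expression_embedding: (batch, expr_dim)
    returns:              (batch, phoneme_steps, text_dim + expr_dim)
    """
    steps = text_embeddings.size(1)
    expanded = expression_embedding.unsqueeze(1).expand(-1, steps, -1)
    return torch.cat([text_embeddings, expanded], dim=-1)

text_seq = torch.randn(2, 37, 256)     # output of the phoneme encoder (input A)
expr_vec = torch.randn(2, 128)         # output of the audio encoder (input B)
print(condition_text_on_expression(text_seq, expr_vec).shape)  # (2, 37, 384)
```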
[055] Returning to Fig.2, the template identification module 217 may identify one of the predefined audio template or the audio embedding based on the expression data. For instance, if the expression data includes an expression indicating a sad tone for the text, an audio template associated with a sad tone may be identified from the plurality of predefined audio templates, or an audio embedding which refers to a sad emotion may be identified. Essentially, to identify the audio embedding, the template identification module 217 may map either the required expression or a portion of a recording to the predefined latent space with the plurality of clusters of similar expressions. Particularly, the template identification module 217 checks each cluster of different emotions, and a suitable emotion cluster and its associated values are used instead of the audio recording itself.
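A hedged sketch of this lookup; the template filenames, centroid values, and the function name are hypothetical placeholders built from the training-time clusters, not part of the disclosure:

```python
import numpy as np

# Hypothetical lookup tables built from the training data: one reference
# recording (audio template) and one cluster centroid per learnt expression.
AUDIO_TEMPLATES = {"happy": "sample_4.wav", "angry": "sample_2.wav", "slow": "sample_7.wav"}
CLUSTER_CENTROIDS = {"happy": np.array([0.9, 1.1]),
                     "angry": np.array([-1.2, 0.4]),
                     "slow": np.array([0.1, -1.5])}

def identify_conditioning(expression, reference_point=None):
    """Return the predefined audio template for the requested expression, or,
    when a point in the latent space is supplied instead, the nearest emotion
    cluster's centroid as the audio embedding."""
    if reference_point is None:
        return "template", AUDIO_TEMPLATES[expression.lower()]
    nearest = min(CLUSTER_CENTROIDS,
                  key=lambda name: float(np.linalg.norm(CLUSTER_CENTROIDS[name] - reference_point)))
    return "embedding", CLUSTER_CENTROIDS[nearest]

print(identify_conditioning("happy"))                      # ('template', 'sample_4.wav')
print(identify_conditioning("", np.array([1.0, 1.0]))[0])  # 'embedding'
```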
[056] The audio generation module 219 may generate an audio recording for the text by generating expressions based on the expression data, using one of the predefined audio template or the audio embedding identified by the template identification module 217, by using the speech control model 207. The audio generation module 219 may generate the audio recording by using either a single expression embedding or multiple expression embeddings. Fig.5A-5B show exemplary representations for single expression embedding and multiple expression embedding, respectively, in accordance with some embodiments of the present disclosure. For instance, if an entire audio recording is required with a constant expression, the audio generation module 219 uses a single expression embedding and concatenates it with the text based on the expression data. The expression may be any of the pre-defined expression templates in tone, emotion, speed, and other categories, as shown in Fig.5A. However, if different words in the audio recording are required to have different combinations of expressions, then the audio generation module 219 concatenates different expression embeddings for different words in the text, as shown in Fig.5B. This allows expressions to be controlled both at a recording level and at a granular word level.
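A minimal sketch of the multiple-expression case of Fig.5B, assuming the word-to-phoneme spans are known from the frontend alignment; the data structures and names are illustrative assumptions:

```python
import torch

def per_word_expression_embeddings(word_spans, expression_bank, word_to_expression):
    """Build a sequence of expression embeddings where different words get
    different expressions (Fig.5B); a single expression for the whole text
    (Fig.5A) is the special case where every word maps to the same one.

    word_spans:          list of (start_step, end_step) for each word in the
                         phoneme sequence
    expression_bank:     dict mapping expression name -> (expr_dim,) tensor
    word_to_expression:  expression name chosen for each word
    """
    chunks = []
    for (start, end), name in zip(word_spans, word_to_expression):
        chunks.append(expression_bank[name].unsqueeze(0).expand(end - start, -1))
    return torch.cat(chunks, dim=0)              # (phoneme_steps, expr_dim)

bank = {"happy": torch.randn(128), "fast": torch.randn(128)}
spans = [(0, 4), (4, 9), (9, 12)]                # three words over 12 phoneme steps
seq = per_word_expression_embeddings(spans, bank, ["happy", "happy", "fast"])
print(seq.shape)                                 # torch.Size([12, 128])
```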
[057] Fig.6A-6B show exemplary representations for controlling speech characteristics in a voice assistance system in accordance with some embodiments of the present disclosure.
[058] Consider a dataset with, say, ten recordings as shown below, collected with different emotions and expressions. For example, the sample 4 recording says ‘This is beautiful’ in a happy emotion.
Recordings:
Sample 1: Angry, how can you do this?
Sample 2: Angry, this is rubbish!
Sample 3: Happy, I cannot believe this is happening.
Sample 4: Happy, this is beautiful.
Sample 5: Fast, hello my name is Mark.
Sample 6: Fast, I am from Atlanta, where are you from?
Sample 7: Slow, I am Sam, from India.
Sample 8: Slow, I think there is some issue with network.
Sample 9: High pitch, I cannot sing in that high pitch.
Sample 10: Low pitch, I can only sing in this low pitch.
[059] The dataset only includes the text and does not contain explicit emotion labels (such as Angry, Happy, Fast, Slow, Low/High Pitch, etc.) for the recordings. The labels shown with samples 1-10 indicate the explicit emotions only for ease of understanding the expression in the text.
[060] Initially, during the training phase, the speech control model 207 may be provided with two inputs, namely, the training text (a), i.e., the samples 1-10, and a portion of the original recording (b) as shown, and generates one output, which is a recording speaking the input text with the emotion present in the portion of the input recording (b). The portion, or partial recording, is used in order to enable the speech control model 207 to learn to output a recording which is identical to the full recording of the input (b). As an example, with the above ten-sample dataset, the input (a) from sample 1 is the text ‘How can you do this?’, and a random portion of the sample 1 ‘Angry’ emotion recording is used as input (b). The speech control model 207 may output a recording speaking the input (a) text with the ‘angry’ emotion. In an embodiment, a portion of the recording in input (b) may be chosen using different techniques, for instance, by randomly choosing a start time and an end time from the actual/original recording. This way the speech control model 207 learns from different portions and durations of the same recording every time it iterates through the recording, which avoids any kind of overfitting. The process of using a partial portion of the audio recording in the input (b) instead of a complete recording helps in disentangling the emotion/expression (input (b)) from the text (input (a)). This is because, during training, if the complete recording is used as input (b) instead of a portion, the learning task becomes trivial for the speech control model 207, thereby trivially minimizing the loss used for training. Typically, in existing scenarios, due to the consideration of the complete audio recording for training, the models may copy input (b) to the output, as the input and output are identical during training. Thus, when a model learns to copy and paste, it not only learns the overall emotion in input (b), but also learns the emotion in input (b) with respect to the input (a) text and its duration, in order to copy the emotion at the respective duration/text in the output audio. This may introduce unknown artifacts during testing/in real-time, because, during real-time situations, an input (b) recording with the same text and desired emotion as input (a) is unavailable. For instance, during training, the speech control model 207 may be trained with sample 3 of the dataset, a ‘Happy’ emotion recording with the text ‘I cannot believe this is happening.’ However, in real-time situations, there may be a requirement for speaking the same text in an ‘angry’ emotion. This indicates that, during real-time situations, an audio recording/template with the same text and the ‘Angry’ emotion would be required as input (b), which is not available. However, in the present disclosure, such a recording is not required by the speech control model 207, since the speech control model 207 is trained with only a portion of the recording as input (b) instead of the complete recording. This enables the speech control model 207 not to learn the emotion with respect to any time or text of input (a), but rather to learn the text and timing information of the output audio from input (a) and the emotion from input (b).
[061] Similar to the training phase, during a real-time situation, the speech control model 207 may be provided with (a) text and (b) an emotion template, in order to generate an output which is a recording speaking the text in input (a) with the emotion in the input (b) template. In an embodiment, the input (b) may be provided to the speech control model 207 in two ways. Firstly, an audio template may be identified from the dataset with the emotion required for the text. For example, if a happy emotion is required, sample 4 may be selected from the dataset above as the input (b). Similarly, for a slow speech rate, sample 7 may be identified.
[062] Alternatively, as shown in Fig.6B, a cluster of different emotions indicating the audio embedding may be identified and used as the input (b) instead of an audio recording/template.
[063] In real-time, the speech control model 207 may include predefined templates of emotion recordings (for the case of the B.1 input) or embeddings (for the B.2 input format). Based on the ten-sample dataset discussed above, post training the speech control model 207, emotions such as Angry, Happy, Fast, Slow, High-Pitch, and Low-Pitch may be learnt by the speech control model 207. Also, for each of these emotions, an audio template or an audio embedding may be identified. In an embodiment, the speech control model 207 may be controlled by the Natural Language Understanding (NLU) engine/unit.
[064] Further, based on the emotion and query text of the user, for instance, if a user is sad and uses a voice assistant with a query, the NLU engine/unit in real-time generates a response in text (input A to the model) and a corresponding emotion to reply in (input B to the model). In an embodiment, the NLU engine/unit is trained such that only one among the predefined audio templates is selected. Thereafter, as shown in Fig.6A-6B, the text output of the NLU engine/unit is used as input A to the speech control model 207, and for input B, the NLU engine/unit may select one among the known set of audio templates. Particularly, in the current example, the NLU engine/unit may select a template defined between 1 and 6. The selected template is then used for generating the audio recording for the text in the input A.
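A hedged sketch of this real-time flow under the ten-sample example; the template table, the respond function, and the synthesize call on the model are hypothetical placeholders rather than an actual API of the disclosed system:

```python
# Templates 1..6 learnt from the ten-sample dataset (illustrative mapping).
EMOTION_TEMPLATES = {
    "angry": "sample_2.wav", "happy": "sample_4.wav", "fast": "sample_5.wav",
    "slow": "sample_7.wav", "high_pitch": "sample_9.wav", "low_pitch": "sample_10.wav",
}

def respond(nlu_text, nlu_emotion, speech_control_model):
    """Input A is the NLU response text; input B is one of the six predefined
    audio templates selected by the NLU unit for the required emotion."""
    template = EMOTION_TEMPLATES[nlu_emotion]                    # input B
    return speech_control_model.synthesize(text=nlu_text,        # input A
                                           audio_template=template)

# Example (model object omitted): a sad user query answered in a slow, calm style
# audio = respond("I am sorry to hear that, let me help you.", "slow", model)
```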
[065] Fig.7 illustrates a flowchart showing a method for controlling speech characteristics in speech synthesis system in accordance with some embodiments of present disclosure.
[066] As illustrated in Fig.7, the method 700 includes one or more blocks for controlling speech characteristics in speech synthesis system. The method 700 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, and functions, which perform particular functions or implement particular abstract data types.
[067] The order in which the method 700 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method. Additionally, individual blocks may be deleted from the methods without departing from the scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof.
[068] At block 701, the input data is received by the receiving module 213 from the user device. The input data includes the text and the expression data.
[069] At block 703, the one of the predefined audio template or the audio embedding is identified by the template identification module 217 based on the expression data.
[070] At block 705, the audio recording for the text is generated by the audio generation module 219 by generating expressions based on the expression data using one of the predefined audio template or the audio embedding using the pretrained speech control model. The speech control model is trained by the training module 215 using the dataset comprising a plurality of training texts and the portion of the training audio. The portion of the training audio is one portion of the plurality of portions of the training audio, having an expression of the expression data, used for training the speech control model 207.
Computing System
[071] Fig.8 illustrates a block diagram of an exemplary computer system 800 for implementing embodiments consistent with the present disclosure. In an embodiment, the computer system 800 may be used to implement the speech synthesis system 101. The computer system 800 may include a central processing unit (“CPU” or “processor”) 802. The processor 802 may include at least one data processor for controlling speech characteristics in speech synthesis systems. The processor 802 may include specialized processing units such as, integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
[072] The processor 802 may be disposed in communication with one or more input/output (I/O) devices (not shown) via an I/O interface 801. The I/O interface 801 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
[073] Using the I/O interface 801, the computer system 800 may communicate with one or more I/O devices such as input devices 812 and output devices 813. For example, the input devices 812 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, stylus, scanner, storage device, transceiver, video device/source, etc. The output devices 813 may be a printer, fax machine, video display (e.g., Cathode Ray Tube (CRT), Liquid Crystal Display (LCD), Light-Emitting Diode (LED), plasma, Plasma Display Panel (PDP), Organic Light-Emitting Diode display (OLED) or the like), audio speaker, etc.
[074] In some embodiments, the computer system 800 consists of the speech synthesis system 101. The processor 802 may be disposed in communication with the communication network 809 via a network interface 803. The network interface 803 may communicate with the communication network 809. The network interface 803 may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 809 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc. Using the network interface 803 and the communication network 809, the computer system 800 may communicate with user devices 814.
[075] The communication network 809 includes, but is not limited to, a direct interconnection, an e-commerce network, a peer to peer (P2P) network, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, Wi-Fi, and such. The communication network 809 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each other. Further, the communication network 809 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, etc.
[076] In some embodiments, the processor 802 may be disposed in communication with a memory 805 (e.g., RAM, ROM, etc. not shown in Fig.8) via a storage interface 804. The storage interface 804 may connect to memory 805 including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as, serial advanced technology attachment (SATA), Integrated Drive Electronics (IDE), IEEE-1394, Universal Serial Bus (USB), fiber channel, Small Computer Systems Interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, Redundant Array of Independent Discs (RAID), solid-state memory devices, solid-state drives, etc.
[077] The memory 805 may store a collection of program or database components, including, without limitation, a user interface 806, an operating system 807, etc. In some embodiments, the computer system 800 may store user/application data, such as the data, variables, records, etc., as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase.
[078] The operating system 807 may facilitate resource management and operation of the computer system 800. Examples of operating systems include, without limitation, APPLE MACINTOSH® OS X, UNIX®, UNIX-like system distributions (e.g., BERKELEY SOFTWARE DISTRIBUTION™ (BSD), FREEBSD™, NETBSD™, OPENBSD™, etc.), LINUX DISTRIBUTIONS™ (e.g., RED HAT™, UBUNTU™, KUBUNTU™, etc.), IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8, 10 etc.), APPLE® IOS™, GOOGLE® ANDROID™, BLACKBERRY® OS, or the like.
[079] In some embodiments, the computer system 800 may implement a web browser 808 stored program component. The web browser 808 may be a hypertext viewing application, for example MICROSOFT® INTERNET EXPLORER™, GOOGLE® CHROME™, MOZILLA® FIREFOX™, APPLE® SAFARI™, etc. Secure web browsing may be provided using Secure Hypertext Transport Protocol (HTTPS), Secure Sockets Layer (SSL), Transport Layer Security (TLS), etc. Web browsers 808 may utilize facilities such as AJAX™, DHTML™, ADOBE® FLASH™, JAVASCRIPT™, JAVA™, Application Programming Interfaces (APIs), etc. In some embodiments, the computer system 800 may implement a mail server stored program component. The mail server may be an Internet mail server such as Microsoft Exchange, or the like. The mail server may utilize facilities such as ASP™, ACTIVEX™, ANSI™ C++/C#, MICROSOFT® .NET™, CGI SCRIPTS™, JAVA™, JAVASCRIPT™, PERL™, PHP™, PYTHON™, WEBOBJECTS™, etc. The mail server may utilize communication protocols such as Internet Message Access Protocol (IMAP), Messaging Application Programming Interface (MAPI), MICROSOFT® Exchange, Post Office Protocol (POP), Simple Mail Transfer Protocol (SMTP), or the like. In some embodiments, the computer system 800 may implement a mail client stored program component. The mail client may be a mail viewing application, such as APPLE® MAIL™, MICROSOFT® ENTOURAGE™, MICROSOFT® OUTLOOK™, MOZILLA® THUNDERBIRD™, etc.
[080] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile memory, non-volatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[081] Advantages of the present disclosure:
[082] An embodiment of the present disclosure enables the network to learn the different voice characteristics present in the training data in an unsupervised way, without the requirement of additional labels; an illustrative sketch of this unsupervised grouping is given after paragraph [085] below. This simplifies the training process and avoids the overhead of additional label training.
[083] An embodiment of the present disclosure enables individually selecting speech from diverse training data for synthesizing the recordings.
[084] An embodiment of the present disclosure seamlessly learns different characteristics and replicates them during inference with a correct and precise audio template.
[085] An embodiment of the present disclosure allows the network to learn better acoustic features that are not phone-related from the template, and further helps in generating a consistent speech rate during synthesis irrespective of the text length. This improves the overall performance of the network while controlling speech characteristics in various speech synthesis systems.
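By way of illustration only, the following Python sketch shows one possible realisation of the unsupervised grouping of voice characteristics referred to in paragraph [082]; the use of k-means over utterance-level audio embeddings, and the embedding dimensionality, are assumptions of this sketch rather than requirements of the disclosure.

```python
# Hedged sketch: cluster utterance-level audio embeddings without any
# expression labels, so similar expressions end up in the same cluster.
import numpy as np
from sklearn.cluster import KMeans

def cluster_expressions(audio_embeddings: np.ndarray, n_clusters: int = 6):
    """Group (N, D) audio embeddings into clusters of similar expressions."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(audio_embeddings)   # unsupervised assignment
    # Each centroid can later serve as the audio embedding selected for one
    # expression during synthesis.
    return labels, kmeans.cluster_centers_

# Example with random stand-in embeddings (500 utterances, 128-dim each).
labels, centroids = cluster_expressions(np.random.randn(500, 128))
```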
[086] The disclosed method and system overcome the technical problem of one-to-one mapping between text and audio data, which requires an exact audio template for indicating the expression in the text. Speech characteristics are instead controlled using a speech control model trained on a dataset comprising a plurality of training text and a portion of a training audio, which enables disentangling at least one of time duration, emotion or prosodic information between the training text and the portion of the training audio. That is, the speech control model is trained to learn phonetic information from the training text and acoustic features from the audio portion. Therefore, the present disclosure enables learning of better acoustic features that are not phone-related from the template, and further helps in generating a consistent speech rate during synthesis irrespective of the text length. This improves the overall performance of the network while controlling speech characteristics in various speech synthesis systems.
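By way of illustration only, the following Python (PyTorch) sketch shows one possible training step consistent with the above description: the model receives the full training text but only a random slice of the corresponding audio, and the loss is computed against the entire recording. The mel-spectrogram representation, the slice length and the L1 loss are assumptions of this sketch, not limitations of the disclosure.

```python
# Hedged sketch of one training step: the partial audio reference encourages
# the model to take phonetic content from the text and acoustic/prosodic
# cues from the audio portion (disentanglement as described above).
import torch
import torch.nn.functional as F

def training_step(model, optimizer, phonemes, full_mel, slice_frames=80):
    """phonemes: (1, T_text) token ids; full_mel: (1, T_audio, n_mels)."""
    t_audio = full_mel.size(1)
    start = torch.randint(0, max(1, t_audio - slice_frames), (1,)).item()
    audio_portion = full_mel[:, start:start + slice_frames, :]  # partial reference

    # The model conditions the text on the short audio portion only.
    predicted_mel = model(phonemes, audio_portion)

    # The loss compares the output against the *entire* recording, so the
    # model must infer duration and prosody beyond the given slice.
    loss = F.l1_loss(predicted_mel, full_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```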
[087] Currently, there exists a lack of publicly available information, techniques, and strategies for audio data collection for speech synthesis purposes. Existing systems only mention dataset statistics such as the number of speakers and total hours used to train speech synthesis models, but fail to disclose details on the acoustic nature of the dataset or the process and system used to procure it. Also, in any sentence or phrase, expressions and emphasis of words play a crucial role. A sentence can be pronounced in many different ways based on different voice characteristics. Hence, audio/speech data provides additional aspects which may be controlled suitably during the speech synthesis process. Currently, different supervised and unsupervised frameworks have been explored to enable controllability of expressions in speech synthesis. However, these frameworks have the drawback of not being able to control expressions (emotions, pitch, speech rate, amplitude, etc.) at a finer resolution, and most often apply expressions at a sentence/recording level. Thus, existing speech synthesis models fail to maintain consistent speech rates when synthesizing longer text (containing multiple sentences); the rates tend to vary randomly, leading to an unnatural perception.
[088] In light of the above-mentioned advantages and the technical advancements provided by the disclosed method and system, the claimed steps as discussed above are not routine, conventional, or well understood in the art, as the claimed steps provide solutions to the existing problems in conventional technologies. Further, the claimed steps clearly bring an improvement in the functioning of the device itself as the claimed steps provide a technical solution to a technical problem.
[089] The described operations may be implemented as a method, system or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The described operations may be implemented as code maintained in a “non-transitory computer readable medium,” where a processor may read and execute the code from the computer readable medium. The processor is at least one of a microprocessor and a processor capable of processing and executing the queries. A non-transitory computer readable medium may include media such as magnetic storage medium (e.g., hard disk drives, floppy disks, tape, etc.), optical storage (CD-ROMs, DVDs, optical disks, etc.), volatile and non-volatile memory devices (e.g., EEPROMs, ROMs, PROMs, RAMs, DRAMs, SRAMs, Flash Memory, firmware, programmable logic, etc.), etc. Further, non-transitory computer-readable media include all computer-readable media except for a transitory signal. The code implementing the described operations may further be implemented in hardware logic (e.g., an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc.).
[090] Still further, the code implementing the described operations may be implemented in “transmission signals,” where transmission signals may propagate through space or through a transmission medium, such as an optical fiber, copper wire, etc. The transmission signals in which the code or logic is encoded may further include a wireless signal, satellite transmission, radio waves, infrared signals, Bluetooth, etc. The transmission signals in which the code or logic is encoded are capable of being transmitted by a transmitting station and received by a receiving station, where the code or logic encoded in the transmission signal may be decoded and stored in hardware or a non-transitory computer readable medium at the receiving and transmitting stations or devices. An “article of manufacture” includes a non-transitory computer readable medium, hardware logic, and/or transmission signals in which code may be implemented. A device in which the code implementing the described embodiments of operations is encoded may include a computer readable medium or hardware logic. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the invention, and that the article of manufacture may include a suitable information bearing medium known in the art.
[091] The terms “an embodiment,” “embodiment,” “embodiments,” “the embodiment,” “the embodiments”, “one or more embodiments”, “some embodiments”, and “one embodiment” mean “one or more (but not all) embodiments of the invention(s)” unless expressly specified otherwise.
[092] The terms “including,” “comprising,” “having” and variations thereof mean “including but not limited to,” unless expressly specified otherwise.
[093] The enumerated listing of items does not imply that any or all of the items are mutually exclusive, unless expressly specified otherwise.
[094] The terms “a,” “an” and “the” mean “one or more,” unless expressly specified otherwise.
[095] A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention.
[096] When a single device or article is described herein, it will be readily apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be readily apparent that a single device/article may be used in place of the more than one device or article, or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.
[097] The illustrated operations of Fig.7 show certain events occurring in a certain order. In alternative embodiments, certain operations may be performed in a different order, modified, or removed. Moreover, steps may be added to the above-described logic and still conform to the described embodiments. Further, operations described herein may occur sequentially or certain operations may be processed in parallel. Yet further, operations may be performed by a single processing unit or by distributed processing units.
[098] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
[099] While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
Referral numerals:
| Reference Number | Description |
|---|---|
| 100 | Environment |
| 101 | Speech synthesis system |
| 103 | Plurality of user devices |
| 105 | Communication network |
| 109 | I/O interface |
| 111 | Memory |
| 113 | Processor |
| 114 | Data collection device |
| 115 | Text normalization unit |
| 116 | Text paragraph splitter |
| 117 | Display |
| 118 | Post-processing unit |
| 119 | Audio paragraph splitter |
| 121 | ASR |
| 200 | Data |
| 201 | Input data |
| 203 | Template data |
| 205 | Training data |
| 207 | Speech control model |
| 209 | Other data |
| 211 | Modules |
| 213 | Receiving module |
| 215 | Training module |
| 217 | Template identification module |
| 219 | Audio generation module |
| 221 | Other modules |
| 401 | Frontend |
| 403 | Vocoder |
| 405 | Encoders |
| 4051 | Phoneme encoder |
| 4052 | Audio encoder |
| 406 | Attention layer |
| 407 | Decoder |
| 800 | Computer system |
| 801 | I/O interface |
| 802 | Processor |
| 803 | Network interface |
| 804 | Storage interface |
| 805 | Memory |
| 806 | User interface |
| 807 | Operating system |
| 808 | Web browser |
| 809 | Communication network |
| 812 | Input devices |
| 813 | Output devices |
| 814 | User device |