Abstract: A system 100 and method to synthesize speech of a language can include a server including one or more system modules with one or more databases and one or more processors operatively coupled with memory storing instructions executable by the one or more processors. The one or more processors are configured to obtain an original speech signal 104 to detect the emotion associated with the transmitted speech 104 from a subject 102, and to synthesize speech by converting a text to an audio format in a neutral voice which is then represented as a mel spectrogram, which is further manipulated to synthesize an engaging audio clip with emotion using deep learning algorithms including Transformers, Generative Adversarial Networks (GANs), Autoencoders and Decoders. Multiple deep learning datasets are used for the same subject 102 in a particular language, ranging in emotion from neutral to other emotive sounds including calm, happy, sad, angry, fearful, disgust, and surprised.
DESC:TECHNICAL FIELD
[0001] The present disclosure relates generally to the field of speech synthesis. In particular, the disclosure is about a system and method for synthesizing multi-lingual speech from text or speech into one or more emotive speech outputs, more particularly in Indian languages.
BACKGROUND
[0002] Presently, most customer-oriented organisations use digital devices that can answer users’ queries, give instructions for certain tasks, and even talk to the users. In other segments, written articles are read aloud by machines or customer queries are handled automatically. Many businesses now rely on their text-to-speech and speech synthesis engines, chatbots, and voice assistants to interact with their customers, helping to reduce both the workload and the workforce.
[0003] Currently, the technologies available to synthesize speech are predominantly built for a single language; moreover, tonality changes are not optimised to make the speech more emotive in nature, with expressions such as laughter, sadness, or crying. Another limitation is that they do not produce the same result when given the same instructions, and making the voices more engaging requires a large amount of training data to learn the underlying patterns, which makes it very difficult to replicate the same voice characteristics.
[0004] Many attempts have been made to make these speeches more conversational, particularly for English, Chinese, and well-known European languages, but the same methods are not very effective when applied to Indian languages.
[0005] While a lot has been achieved on this front in the last few years, the level of engagement using these synthesized voices is very low. A paper titled “Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques” by Kim et al., available at https://github.com/mindslab-ai/assem-vc, discloses voice conversion from one speaker to another while maintaining the intonation of the source speaker. The paper discloses transfer of prosodic information from a source speaker’s voice to a different speaker’s voice within a set of fixed speakers, in English only.
[0006] Another paper titled “Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis” by Zaidi et al., available at https://github.com/ubisoft/ubisoft-laforge-daft-exprt, uses FiLM conditioning layers to strategically inject different prosodic information into all parts of the architecture. The model explicitly encodes traditional low-level prosody features such as pitch, loudness and duration, but also higher-level prosodic information that helps generate convincing voices in highly expressive styles. Speaker identity and prosodic information are disentangled through an adversarial training strategy that enables accurate prosody transfer across speakers. However, the proposed method cannot add emotions to the same audio clip. Further, the method is limited to English and would not work for Indian languages.
[0007] While the cited references disclose transfer of prosodic information from a source speaker’s voice to a different speaker’s voice within a set of fixed speakers, they do not add multiple emotions to the same audio clip and lack the ability to transfer or tune the rhythm, style, tone, etc. for the same speaker. Moreover, the methods are limited to the English language and do not extrapolate to Indian languages, which use different scripts; there is therefore scope for a better solution that works for Indian languages, with emotions, in the same speaker’s voice.
[0008] There is, therefore, a need to provide a simple, efficient, and cost-effective system and method for synthesizing emotive, synthetically generated speech in any language, particularly Indian languages, with emotional expression.
OBJECTS OF THE INVENTION
[0009] A general object of the present disclosure is to provide a system and method for synthesizing emotive speech in different languages, especially Indian languages.
[0010] An object of the present disclosure is to provide a simple, efficient, and cost-effective system for synthesizing emotive speeches.
[0011] Another object of the present disclosure is to manipulate particular parts of the speech signal to make it sound more emotive in nature.
[0012] Another object of the present disclosure is to provide creation of deterministic voices by breaking down tonality and emotion changes into separate parameters for tuning.
[0013] Yet another object of the present disclosure is to provide synthesis of all Indian languages by considering the frequency domain of the speech signal.
SUMMARY
[0014] Aspects of the present disclosure relate generally to the field of speech synthesis. In particular, the disclosure is about a system and method for synthesizing multi-lingual speech from text or speech into one or more emotive speech outputs, more particularly in Indian languages.
[0015] In an aspect, the disclosure is about a system to synthesize emotive speech of a language, which can include a server including one or more system modules with one or more databases and one or more processors operatively coupled with memory storing instructions executable by the one or more processors. The one or more processors are configured to obtain an original speech signal to detect the emotion associated with the transmitted speech from a subject to synthesize; convert received text in the speech into a neutral mel spectrogram representation to synthesize speech as per the subject's command and emotion; extract features from the speech to separate the amplitude envelope and frequency boundaries with other relevant components for each word using statistical and deep learning algorithms; train the features to correct distortion appearing in the speech due to power changes in different frequency bands and reflected in the mel spectrogram, using one or more deep learning algorithms including convolutional neural networks or generative adversarial networks; extract the spectral roll-off to get base frequencies for the complete speech; compare the obtained mel spectrogram with the most similar stored mel spectrogram of speech to adjust power at the different frequencies based on the required emotion of the subject; and generate a result with a new or same voice after manipulating each word to ensure the desired emotion is synthesized in the final speech.
[0016] In an embodiment, the one or more modules include a data acquisition module, and a computational module.
[0017] In an embodiment, the data acquisition module is selected from, including but not limited to, a smartphone, tablet, laptop, and desktop capable of synthesizing any form of text and speech.
[0018] In an embodiment, the data acquisition module includes a keyboard for the subject to input the text in one or more languages, and a microphone to generate an audio clip and record voice to be transcribed to text.
[0019] In an embodiment, the system synthesizes emotive speech in one or more languages, including popular Indian languages.
[0020] In an embodiment, the system uses frequency domain of the input speech for synthesis, and the system transmits the same speech and emotions in a deterministic manner when parameters of the received speech are similar.
[0021] Another aspect of the disclosure is a method for synthesizing emotive speech for a language, including steps for obtaining an original speech signal for detecting the emotion associated with the transmitted speech from a subject for synthesizing; converting received text in the speech into a neutral mel spectrogram representation for synthesizing speech as per the subject's command and emotion; extracting features from the speech for separating the amplitude envelope and frequency boundaries with other relevant components for each word using statistical and deep learning algorithms; training the features for correcting distortion appearing in the speech due to power changes in different frequency bands and reflected in the mel spectrogram, using one or more deep learning algorithms including convolutional neural networks or generative adversarial networks; extracting the spectral roll-off for getting base frequencies for the complete speech; comparing the obtained mel spectrogram with the most similar stored mel spectrogram of speech for adjusting power at the different frequencies based on the required emotion of the subject; and generating a result with a new or same voice after manipulating each word to ensure the desired emotion is synthesized in the final speech.
[0022] In an embodiment, the deep learning algorithms used to train the model includes Long-short term memory (LSTM), Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Transformers, Generative Adversarial Networks (GAN), Auto encoders and Decoders.
[0023] In an embodiment, multiple deep learning datasets are used for the same subject in a particular language, ranging in emotion from neutral to other emotive sounds.
[0024] In an embodiment, the other emotive sounds include, but are not limited to, neutral, calm, happy, sad, angry, fearful, disgust, and surprised.
[0025] Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which numerals represent like components.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] The accompanying drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
[0027] FIG. 1 illustrates an exemplary block diagram for the proposed system of synthesizing emotive speech for a language, in accordance with a first embodiment of the present disclosure.
[0028] FIG. 2 illustrates an exemplary method flow diagram for synthesizing emotive speech for a language, in accordance with embodiments of the present disclosure.
[0029] FIG. 3 illustrates an exemplary frequency curve for the neutral Mel Spectrogram for the text “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)” spoken by a subject, in accordance with a first embodiment of the present disclosure.
[0030] FIG. 4 illustrates exemplary word boundaries of the frequency curve for the text “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)” using a statistical method and deep learning algorithm, in accordance with a first embodiment of the present disclosure.
[0031] FIG. 5 illustrates an exemplary spectral roll-off performed on the mel spectrogram, in accordance with a first embodiment of the present disclosure.
[0032] FIG. 6 illustrates an exemplary spectral roll-off for a happy emotion for the text “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)”, in accordance with a first embodiment of the present disclosure.
[0033] FIG. 7 illustrates an exemplary happy mel spectrogram for the text “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)”, in accordance with a first embodiment of the present disclosure.
[0034] In the following description, embodiments of the invention are described in sufficient detail, with reference to the accompanying drawings, to enable those skilled in the art to practice the invention, and it is understood that other embodiments may be utilized and that logical changes may be made without departing from the spirit or scope of the invention. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure as defined by the appended claims.
[0035] Embodiments explained herein relate generally to the field of speech synthesis. In particular, the disclosure is about a system and method for synthesizing multi-lingual speech from text or speech into one or more emotive speech outputs, more particularly in Indian languages.
[0036] Embodiments of the present disclosure relate to a system and method to synthesize emotive speech of a language, which can include a server including one or more system modules with one or more databases and one or more processors operatively coupled with memory storing instructions executable by the one or more processors. The one or more processors are configured to obtain an original speech signal to detect the emotion associated with the transmitted speech from a subject, and to synthesize speech by converting a text to an audio format in a neutral voice which is then represented as a mel spectrogram, which is further manipulated to synthesize an engaging audio clip with emotion using deep learning algorithms including but not limited to Transformers, RNNs, CNNs, GANs, Autoencoders and Decoders. Multiple deep learning datasets are used for the same subject in a particular language, ranging in emotion from neutral to other emotive sounds including calm, happy, sad, angry, fearful, disgust, and surprised.
[0037] Referring to FIG. 1 where an exemplary block diagram for the proposed system 100 for synthesizing emotive speech for a language is shown. System 100 to synthesize emotive speech of a language can include a server comprising one or more system modules with one or more databases and one or more processors operatively coupled with memory storing instructions executable by one or more processors.
[0038] In an embodiment, the one or more modules in system 100 include a data acquisition module 106, and a computational module 112, and one or more databases. The data acquisition module 106 is selected from, including but not limited to, a smartphone, tablet, laptop, and desktop capable of synthesizing any form of text and speech.
[0039] In an embodiment, the data acquisition module 106 also includes a keyboard 108 for the subject 102 to input the text in one or more languages, and a microphone 110 to generate an audio clip and record voice to be transcribed to text. The audio clip comprises several single-frequency sound waves captured with different amplitudes.
[0040] In an embodiment, the system 100 synthesizes emotive speech 104 in one or more languages, including but not limited to English, Hindi, Tamil, Panjabi, and other popular Indian languages.
[0041] In an aspect, speech can carry many types of emotions, and the emotion may be detected from audio-visual and/or textual cues. Emotional speech synthesis allows machines to communicate like a real person. Many types of emotions such as neutral, calm, happy, sad, angry, fearful, disgust, and surprised are generated as output speech with different intensities. The emotion may be supplied as an explicit input by the user, which can be synthesized to tune any voice containing at least one emotion.
[0042] In an embodiment, system 100 synthesizes emotive speech in a language such that the speech, being a synthetic machine-generated voice, is more engaging. The subject 102 can use the keyboard 108 or microphone 110 to input an audio clip or record their speech. Generally, the speech recording function is used to record the original speech.
[0043] In an embodiment, the text data received from subject 102 is passed from the data acquisition module 106 to the computational module 112, where the text is converted into speech as audio in a neutral-sounding voice. The computational module 112 includes at least one or more machine learning or deep learning algorithms for generating the neutral-sounding audio clip.
[0044] In an embodiment, feature extraction is performed by the computational module 112. Features of the speech signal include spectral features such as the sound of the voice; prosodic information such as accent, stress, rhythm, tone, pitch, intonation and melody of the speech; phonetic features, i.e. the kinds of spoken phonemes, reductions and elaborations; idiolectal features, that is, the choice of words; and semantic features that give meaning. All of them can be influenced by emotional expression.
[0045] In an embodiment, features from the speech 104 are separated as the amplitude envelope and frequency boundaries with other relevant components for each word, using statistical and deep learning algorithms. Emotional embedding requires emotional data combined with the obtained speech or text to implement high-quality emotional speech.
[0046] In an embodiment, input text or speech 104 obtained from one or more subjects 102, with multiple emotions or without emotions, needs to be trained on a database of emotional types, irrespective of whether rules are derived from the data or statistical models are produced. The databases are often recorded by actors or taken from real-life data.
[0047] In an embodiment, the deep learning algorithms used to train the model include Long Short-Term Memory (LSTM), Transformers, Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), Autoencoders and Decoders. LSTM is a recurrent neural network architecture used in deep learning that captures long-term dependencies, making it ideal for sequence prediction. Transformers are used for processing sequential data such as natural language text, sound signals or time series data. A Generative Adversarial Network (GAN) creates new data instances that resemble the training data. To reduce noise and dimensionality, and to focus only on areas of real values and real people while processing data, autoencoders are used in the disclosure. Decoders are used to convert the analogue signal into digital data which can be easily processed in the computational module 112.
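By way of a non-limiting illustration only, the following Python sketch shows one possible autoencoder over mel-spectrogram frames, reflecting the noise- and dimensionality-reduction role described above; the layer sizes, latent dimension, and placeholder data are assumptions and not part of the disclosed model.

```python
import torch
import torch.nn as nn

class MelFrameAutoencoder(nn.Module):
    """Compress 80-bin mel frames to a small latent code and reconstruct them."""
    def __init__(self, n_mels: int = 80, latent: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(),
                                     nn.Linear(64, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                     nn.Linear(64, n_mels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = MelFrameAutoencoder()
frames = torch.randn(32, 80)                           # placeholder batch of mel frames
loss = nn.functional.mse_loss(model(frames), frames)   # reconstruction objective
loss.backward()                                        # gradients for one training step
```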
[0048] In an embodiment, system 100 uses the Fourier transform, which allows decomposing the input signal 104 into its individual frequencies and their amplitudes. In other words, it converts the input signal 104 from the time domain into the frequency domain. The result is called a spectrogram. This is possible because every signal can be decomposed into a set of sine and cosine waves that add up to the original signal.
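As a minimal sketch of this time-domain to frequency-domain conversion, assuming the librosa library and a hypothetical input file "neutral.wav" (the n_fft and hop_length values are illustrative):

```python
import numpy as np
import librosa

y, sr = librosa.load("neutral.wav", sr=22050)        # time-domain input signal
stft = librosa.stft(y, n_fft=1024, hop_length=256)   # complex frequency bins per time frame
spectrogram = np.abs(stft)                           # amplitude of each frequency over time
spectrogram_db = librosa.amplitude_to_db(spectrogram, ref=np.max)
print(spectrogram_db.shape)                          # (frequency bins, time frames)
```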
[0049] In an embodiment, a training dataset is used for correcting distortion appearing in the speech 104 due to power changes in different frequency bands. System 100 further extracts the spectral roll-off to get base frequencies for the complete speech 104 and compares the obtained mel spectrogram with the most similar stored mel spectrogram of speech, to adjust power at the different frequencies based on the required emotion of the subject 102. The result 114 is generated with a new voice after manipulating each word so that the desired emotion is synthesized in the final speech.
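A hedged sketch of the comparison step, under the assumptions that the stored emotive mel spectrograms have the same shape as the neutral one and that a simple mean-squared distance is an acceptable similarity measure (the disclosure does not fix the metric):

```python
import numpy as np

def closest_stored_mel(neutral_mel, stored_mels):
    """Return the key of the stored emotive mel spectrogram closest to neutral_mel.

    stored_mels: dict mapping emotion name -> (n_mels, frames) array of the same
    shape as neutral_mel. Mean-squared distance is an assumed similarity metric.
    """
    return min(stored_mels,
               key=lambda name: float(np.mean((stored_mels[name] - neutral_mel) ** 2)))

# Toy usage with random placeholder spectrograms:
neutral = np.random.rand(80, 100)
bank = {"happy": np.random.rand(80, 100), "sad": np.random.rand(80, 100)}
print(closest_stored_mel(neutral, bank))
```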
[0050] In an embodiment, system 100 applies a multi-component amplitude and frequency modulated (AFM) signal model suitable for speech phonemes. Using the Discrete Energy Separation Algorithm, the Amplitude Envelope (AE) and Instantaneous Frequency (IF) of each speech resonance (component) are extracted. The estimated modulating parameters can be considered as features for the corresponding speech phoneme. Finally, the obtained input speech 104 is converted and produced with the desired emotion, yielding a happy emotional speech 114 as per the disclosure.
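For illustration only, the sketch below estimates the amplitude envelope and instantaneous frequency using the analytic (Hilbert) signal rather than the Discrete Energy Separation Algorithm named above; it is a simpler stand-in that yields comparable AE/IF features, demonstrated here on a synthetic test tone.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_and_instantaneous_frequency(x, fs):
    """Amplitude envelope and instantaneous frequency from the analytic signal."""
    analytic = hilbert(x)
    amplitude_envelope = np.abs(analytic)
    phase = np.unwrap(np.angle(analytic))
    inst_freq_hz = np.diff(phase) * fs / (2.0 * np.pi)   # one value per sample pair
    return amplitude_envelope, inst_freq_hz

# Toy usage: a 440 Hz tone with a slow amplitude modulation.
fs = 16000
t = np.arange(fs) / fs
tone = (1.0 + 0.5 * np.sin(2 * np.pi * 2 * t)) * np.sin(2 * np.pi * 440 * t)
ae, inst_f = envelope_and_instantaneous_frequency(tone, fs)
print(ae.mean(), inst_f.mean())   # envelope ~1.0, instantaneous frequency ~440 Hz
```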
[0051] FIG. 2 illustrates an exemplary method flow diagram 200 for synthesizing emotive speech for a language. The method 200 to convert any natural voice into the required emotive expression involves manipulation of the mel spectrogram of any recorded or synthesized speech using a combination of statistical, machine learning, deep learning, transformer, and GAN-based approaches.
[0052] In an embodiment, method 200 for synthesizing emotive speech for a language includes step (202) for obtaining an original speech signal 104 for detecting the emotion associated with the transmitted speech from a subject 102 for synthesizing, and step (204) for converting the received text in the speech 104 into a neutral mel spectrogram representation for synthesizing speech as per the subject 102 command and emotion.
[0053] In an embodiment, step (206) defines extracting features from the speech 104 for separating the amplitude envelope and frequency boundaries with other relevant components for each word using statistical and deep learning algorithms. Step (208) defines training the features for correcting distortion appearing in the speech 104 due to power changes in different frequency bands and reflected in the mel spectrogram, using one or more deep learning algorithms including convolutional neural networks or generative adversarial networks.
[0054] In an embodiment, as per step (210), extracting the spectral roll-off for getting base frequencies for the complete speech 104 is performed. Step (212) defines comparing the obtained mel spectrogram with the most similar stored mel spectrogram of speech for adjusting power at the different frequencies based on the required emotion of the subject 102, and step (214) defines generating a result with a new voice after manipulating each word to ensure the desired emotion is synthesized in the final speech.
[0055] FIG. 3 illustrates an exemplary frequency curve 300 for the neutral Mel Spectrogram for the text “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)” spoken by a subject. Initially, the audio generated or recorded in a neutral-sounding voice in a particular language is converted into Amplitude Envelope and Instantaneous Frequency features using the Discrete Energy Separation Algorithm. The frequency is plotted on the Y-axis on a log scale, with the amplitude dimension in decibels, to form the spectrogram over time on the X-axis. The spectrograms of several emotions, spoken by different subjects 102 but always with the same sentence “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)”, will have different frequencies and respective amplitudes. One such spectrogram 300 is shown in FIG. 3, depicting the amplitude of frequencies over time.
[0056] In an embodiment, waveforms are first processed by trimming the silent sections of the input 104. This is necessary because the model requires a static input dimension. Each processed voice signal 104 is transformed into a mel spectrogram representation.
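A minimal sketch of this preprocessing, assuming librosa, a hypothetical file "neutral.wav", and illustrative parameter values (top_db, n_fft, hop_length, and n_mels are not specified in the disclosure):

```python
import numpy as np
import librosa

y, sr = librosa.load("neutral.wav", sr=22050)        # hypothetical neutral-voice clip
y_trimmed, _ = librosa.effects.trim(y, top_db=30)    # drop leading/trailing silence
mel = librosa.feature.melspectrogram(y=y_trimmed, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)        # log-amplitude mel spectrogram
```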
[0057] In an aspect, a spectrogram is a visual depiction of a signal’s frequency composition over time, and the mel spectrogram is a spectrogram whose frequencies are converted to the mel scale. The mel scale is approximately linear at low frequencies and logarithmic at higher frequencies, and is used to provide sound information similar to what a human would perceive. The mel spectrogram can be represented by the formula:
m = 2595 log10(1 + f/700),
where m represents mel and f represents frequency in hertz.
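The same formula expressed as a small helper in code (the function name hz_to_mel is illustrative; librosa offers an equivalent conversion via librosa.hz_to_mel with htk=True):

```python
import numpy as np

def hz_to_mel(f_hz):
    # m = 2595 * log10(1 + f / 700), the formula of paragraph [0057]
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

print(hz_to_mel(1000.0))   # ~1000 mel: 1000 Hz maps to roughly 1000 on the mel scale
```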
[0058] FIG. 4 illustrates exemplary word boundaries 400 of the frequency curve for the text “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)” using a statistical method and deep learning algorithm. After converting the voice into a mel spectrum with neutral emotion, the next step is to find boundaries 402 for every word in the audio.
[0059] In an embodiment, FIG. 4 shows the boundary locations on the mel spectrogram for each word of the Hindi audio clip of the disclosure, where each word of the sentence “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)” is depicted between a pair of lines. Calculation of the boundaries is done using existing statistical and deep learning methods. Statistical approaches combine the flexibility of parametric synthesis with the naturalness of a large database. This approach inherits many of the tools and processes of automatic speech recognition, which define the speech sounds by means of a sequence of states, each representing part of a phoneme, with statistical probabilities learnt for the transitions between states and the mapping of these transitions onto sequences of words in a text. It includes a representation of phoneme frequency 402, duration (seconds) and amplitude 404 as characteristics of each state, and thus produces not just an acoustic sequence for synthesis but also an indication of the prosody of the word. The simulation of emotional styles is usually done by shifting the parameters of the source speech signal with respect to a target emotional style.
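As a rough, non-limiting baseline for such boundary detection, an energy-based silence split can approximate word boundaries for clearly articulated speech; the disclosure's statistical and deep learning segmentation would replace this in practice. The file name and top_db threshold below are assumptions.

```python
import librosa

y, sr = librosa.load("neutral.wav", sr=22050)        # hypothetical Hindi clip
intervals = librosa.effects.split(y, top_db=25)      # non-silent regions (sample indices)
boundaries_sec = [(start / sr, end / sr) for start, end in intervals]
print(boundaries_sec)                                # rough per-word time boundaries
```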
[0060] FIG. 5 illustrates an exemplary spectral roll-off 500 performed on the mel spectrogram. Spectral features, based on the short-term power spectrum of sound, contain rich information about expression and emotion. Mel spectral distortion is a widely adopted metric to measure the spectrum similarity.
[0061] In an embodiment, once the word boundaries are extracted, the contour, or spectral roll-off, has to be found to get base frequencies for the complete speech, giving curve 502 as shown in FIG. 5. The spectral roll-off is performed on each Hindi word of the obtained mel spectrogram.
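A minimal sketch of a spectral roll-off computation with librosa; the 85% roll-off percentage and the input file are assumptions, as the disclosure does not fix these values:

```python
import librosa

y, sr = librosa.load("neutral.wav", sr=22050)                      # hypothetical clip
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)
contour_hz = rolloff[0]   # per-frame frequency below which 85% of spectral energy lies
```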
[0062] In an embodiment, to achieve this, the percentages of the individual emotions must sum to exactly 100, and various internal emotion states are synthesized by adjusting these percentages.
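Purely as an illustration of this constraint, the sketch below blends per-emotion roll-off contours with percentages that must total 100; the contour representation and linear weighting rule are assumptions, not the disclosed mixing scheme.

```python
import numpy as np

def mix_emotion_contours(contours, percentages):
    """Blend per-emotion roll-off contours; the percentages must sum to 100."""
    assert abs(sum(percentages.values()) - 100.0) < 1e-6, "percentages must total 100"
    names = list(contours)
    stacked = np.stack([contours[n] for n in names])              # (n_emotions, frames)
    weights = np.array([percentages[n] / 100.0 for n in names])[:, None]
    return (weights * stacked).sum(axis=0)

# Toy usage: 70% happy, 30% neutral over a 100-frame contour.
frames = 100
mixed = mix_emotion_contours(
    {"happy": np.full(frames, 3000.0), "neutral": np.full(frames, 2000.0)},
    {"happy": 70.0, "neutral": 30.0})
print(mixed[0])   # 2700.0
```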
[0063] FIG. 6 illustrates an exemplary spectral roll-off curve 600 for a happy emotion for the text “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)”, where the mel spectrogram that acts as a base to create the emotive mel spectrogram is taken for the ‘happy’ emotion. The mel spectrogram needs to be reconstructed or manipulated using the spectral roll-off for the desired emotion, curve 602, as shown in FIG. 6. This can lead to a lot of distortion in the end sound because of the power changes in the different frequency bands. In order to prevent this, a sophisticated approach can be used which transforms the complete mel spectrogram such that all the transitions become smooth when heard by a person. The obtained curve 602 corresponds to a happy mel spectrogram.
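One hedged way to illustrate such a smoothed manipulation: shift per-band log-mel power toward a target gain and smooth the gain along time so transitions are gradual rather than abrupt. The gain rule, Gaussian smoothing, and placeholder values below are assumptions, not the disclosed transformation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def apply_emotive_gain(mel_db, gain_db, sigma_frames=5):
    """Shift per-band log-mel power toward the target emotion, smoothing the
    gain along time so transitions sound natural rather than distorted."""
    smoothed = gaussian_filter1d(gain_db, sigma=sigma_frames, axis=1)
    return mel_db + smoothed

# Toy usage: boost upper mel bands by 6 dB over the second half of the clip.
mel_db = np.random.uniform(-80.0, 0.0, size=(80, 200))   # placeholder log-mel spectrogram
gain = np.zeros_like(mel_db)
gain[40:, 100:] = 6.0
happy_mel_db = apply_emotive_gain(mel_db, gain)
```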
[0064] FIG. 7 illustrates the final reconstructed exemplary happy mel spectrogram 700 for the text “main (मैं) aaj (आज) bahut (बहुत) dino (दिनों) baad (बाद) chutti (छुट्टी) par (पर) gaya (गया)”. It is derived from curve 602, where the amplitude of the voice is depicted as 702 and the range of manipulated frequencies as 704.
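To turn a manipulated mel spectrogram back into audible speech, a Griffin-Lim inversion (available in librosa) can serve as a stand-in vocoder in a sketch such as the one below; the disclosure's own deep learning reconstruction would normally perform this step, and the placeholder spectrogram here is random data for illustration only.

```python
import numpy as np
import librosa

mel_db_happy = np.random.uniform(-80.0, 0.0, size=(80, 200))   # placeholder manipulated log-mel
mel_power = librosa.db_to_power(mel_db_happy)                  # back to power scale
audio = librosa.feature.inverse.mel_to_audio(mel_power, sr=22050,
                                             n_fft=1024, hop_length=256)
```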
[0065] In an embodiment, deep learning models using convolutional neural networks, long short-term memory, transformers, generative adversarial networks, autoencoders and decoders can be trained on multiple datasets of the same speaker in a particular language, ranging in emotion from neutral to the happy emotive sound achieved in the disclosure.
[0066] In an embodiment, one potential application of mixed emotion synthesis is building an emotion transition system 100, which aims to gradually transition the emotion state from one to another. The disclosure is similar in that a neutral emotion is converted into a happy emotion. Compared with emotional voice conversion, the key challenge of emotion transition is to synthesize internal states between different emotion types, and in the disclosure these internal states have been obtained by mixing different emotions.
[0067] Furthermore, when a new voice is generated, each word is manipulated in a way that the desired emotion is synthesized in the final audio speech. The above techniques are applied on multiple datasets of different subjects 102 in a particular language to find the relationship between the spectral roll-off and power for different emotions. Hence, any voice that is recorded or synthesized can be tuned to be more emotive in nature.
[0068] In an embodiment, evaluation of the emotive speech is usually done with perception tests, often of the forced-choice variety. The evaluation text is designed to be emotionally neutral. The aim is to judge and rate the adequateness of the speech expression, rather than to obtain a simple identification mark. Synthetic emotional expression tends to be exaggerated, and the results show high recognition rates of 80% and above, although this depends on the number of emotions to identify, in comparison with emotion recognition.
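As a toy illustration of such a forced-choice evaluation, the recognition rate is simply the fraction of listener labels that match the intended emotion; the labels below are made-up examples, not experimental data.

```python
def recognition_rate(intended, perceived):
    """Fraction of listener responses that match the intended emotion label."""
    matches = sum(1 for a, b in zip(intended, perceived) if a == b)
    return matches / len(intended)

print(recognition_rate(["happy", "sad", "happy", "angry"],
                       ["happy", "sad", "angry", "angry"]))   # 0.75
```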
[0069] Thus, compared with the different methods disclosed in the prior art, the disclosed system and method are multi-lingual and the synthesized emotion is more natural. The system is cost-effective, as a single configuration can be used for various Indian languages without the need to configure a separate system for each.
[0070] While the foregoing describes various embodiments of the invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. The scope of the invention is determined by the claims that follow. The invention is not limited to the described embodiments, versions or examples, which are included to enable a person having ordinary skill in the art to make and use the invention when combined with information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE INVENTION
[0071] The present disclosure provides a system and method for synthesizing emotive speech from speech or text in different languages, especially Indian languages.
[0072] The present disclosure provides a simple, efficient, and cost-effective system for synthesizing emotive speeches.
[0073] The present disclosure provides manipulation of particular parts of the speech signal to make it sound more emotive in nature.
[0074] The present disclosure provides creation of deterministic voices by breaking down tonality and emotion changes into separate parameters for tuning.
[0075] The present disclosure provides synthesis of all Indian languages by considering the frequency domain of the speech signal.
CLAIMS:
1. A system (100) to synthesize speech of a language, the system (100) comprising:
a server comprising one or more system modules with one or more databases and one or more processors operatively coupled with memory storing instructions executable by the one or more processors, wherein the one or more processors are configured to:
obtain an original speech signal (104) to detect emotion associated with the transmitted speech from a subject (102) to synthesize;
convert received text in the speech (104) into a neutral mel spectrogram (300) representation to synthesize speech as per the subject (102) command and emotion;
extract features from the speech (104) to separate amplitude envelope (402) and frequency (404) boundaries with other relevant components for each word using statistical and deep learning algorithms;
train the features to correct distortion appearing in the speech (104) due to the power changes in different bands of frequency and reflected in the mel spectrogram (300) using one or more deep learning algorithms including convolutional neural networks or generative adversarial networks;
extract spectral roll-off (500) to get base frequencies for the complete speech (104);
compare the obtained mel spectrogram (300) with the most similar stored mel spectrogram of speech to adjust power at the different frequencies based on required emotion of the subject (102); and
generate result (700) with a new or same voice after manipulating each word to ensure desired emotion is synthesized in the final speech.
2. The system as claimed in claim 1, wherein the one or more modules include a data acquisition module (106), and a computational module (112).
3. The system as claimed in claim 2, wherein the data acquisition module (106) is selected from, including but not limited to, a smartphone, tablet, laptop, and desktop capable of synthesizing any form of text and speech.
4. The system as claimed in claim 2, wherein the data acquisition module (106) comprises a keyboard (108) for the subject (102) to input the text in one or more languages, and a microphone (110) to generate an audio clip and record voice to be transcribed to text.
5. The system as claimed in claim 4, wherein the system (100) is synthesizing emotive speech (104) in one or more languages including popular Indian languages.
6. The system as claimed in claim 5, wherein the system (100) uses frequency domain of the input speech (104) for synthesis, and wherein the system (100) transmits same speech (700) and emotions in a deterministic manner when parameters of the received speech (104) are similar.
7. A method (200) for synthesizing emotive speech for a language, the method (200) comprising steps for:
obtaining an original speech signal (104) for detecting emotion associated with the transmitted speech from a subject (102) for synthesizing;
converting received text in the speech (104) and then into a neutral mel spectrogram (300) representation for synthesizing speech as per the subject (102) command and emotion;
extracting features from the speech (104) for separating amplitude (402) envelope and frequency (404) boundaries with other relevant components for each word using statistical and deep learning algorithms;
training the features for correcting distortion appearing in the speech (104) due to the power changes in different bands of frequency and reflected in the mel spectrogram (300) using one or more deep learning algorithm including convolutional neural network or generative adversarial networks;
extracting spectral roll-off (500) for getting base frequencies for the complete speech (104);
comparing the obtained mel spectrogram (300) with the most similar stored mel spectrogram of speech for adjusting power at the different frequencies based on required emotion of the subject (102); and
generating result (700) with a new voice after manipulating each word to ensure desired emotion is synthesized in the final speech.
8. The method as claimed in claim 7, wherein the deep learning algorithms used to train the model includes Long-short term memory (LSTM), Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Transformers, Generative Adversarial Networks (GAN), Autoencoders and Decoders.
9. The method as claimed in claim 7, wherein multiple deep learning datasets are used for the same subject (102) in a particular language ranging its emotion from neutral to other emotive sounds.
10. The method as claimed in claim 9, wherein the other emotive sounds include, but are not limited to, neutral, calm, happy, sad, angry, fearful, disgust, and surprised.