
System And Method Of Automated Audio Output

Abstract: The present disclosure relates to a system and a method for facilitating automated conversion of an input text by a user into a synthesized speech based audio output based on an artificial intelligence architecture. The implementation involves processing, at a text processing engine of the system, an input text received from a user device associated with the user. A plurality of audio datasets is concatenated, through an artificial intelligence engine of the system, to obtain a first output comprising a concatenated speech. The concatenated speech may be refined, through the AI engine, based on one or more pre-defined prosody based attributes to obtain a second output comprising the synthesized speech based audio output, which is not robotic in nature and provides an enhanced audio quality.


Patent Information

Application #:
Filing Date: 01 January 2021
Publication Number: 01/2022
Publication Type: INA
Invention Field: ELECTRONICS
Status:
Email: jioipr@zmail.ril.com
Parent Application:

Applicants

JIO PLATFORMS LIMITED
Office-101, Saffron, Nr. Centre Point, Panchwati 5 Rasta, Ambawadi, Ahmedabad - 380006, Gujarat, India.

Inventors

1. LANKA, Raghuram
Plot No: J-8, House No: 1-55-193/J-8, CMC Layout, Kondapur, Hyderabad - 500084, Telangana, India.
2. PAILLA, Balakrishna
137, Marigold, L&T Serene County, Gachi Bowli, Hyderabad - 500032, Telangana, India.
3. KUMAR, Dr. Shailesh
Flat No. C-16, Madhuvanam Apartment, Kanha Shantivanam, Hyderabad - 500084, Telangana, India.
4. ROHILLA, Sourabh
Flat No. 170, DDA MIG Flats, Shiv Mandir Road, Madipur, New Delhi - 110063, India.
5. RATNA, Anand
Shri Ramchandra Mission Ashram Campus, Jungle Tinkonia No. 2, Pipraich Road, Gorakhpur - 273014, Uttar Pradesh, India.

Specification

Claims:

1. A system for facilitating automated conversion of an input text into a synthesized speech based audio output based on an artificial intelligence architecture, the system comprising:
a processor that executes a set of executable instructions that are stored in a memory, upon which execution, the processor causes the system to:
process, at a text processing engine of the system, an input text received from a user device associated with the user, wherein the input text is parsed to obtain a set of parsed text, the parsing being done to identify one or more pre-defined attributes associated with each parsed text;
concatenate, through an artificial intelligence (AI) engine of the system, a plurality of audio datasets to obtain a first output comprising a concatenated speech, wherein the concatenation is done by stitching the plurality of audio datasets in a predefined sequence based on the arrangement of each parsed set with respect to the input text, wherein the plurality of audio datasets are generated based on mapping of one or more pre-defined audio parameters of each audio dataset with the one or more pre-defined attributes of each parsed text; and
refine, through the AI engine, the concatenated speech based on one or more pre-defined prosody based attributes to obtain a second output comprising the synthesized speech based audio output that is relatively more refined than the concatenated speech and wherein the system enables real-time conversion of the input text into the synthesized speech based audio output.
2. The system as claimed in claim 1, wherein the input text is received in form of a first set of data packets from the user device associated with the user, wherein the synthesized speech based audio output is converted by the system into a waveform that is transmitted to the user device in real time in form of a second set of data packets, upon receipt of the input text.
3. The system as claimed in claim 1, wherein the pre-defined attributes associated with each parsed text comprises any or a combination of a script corresponding to a language, a word, a syllable, a space and a punctuation mark.
4. The system as claimed in claim 1, wherein the plurality of audio datasets are pre-stored in a database, wherein the one or more pre-defined audio parameters comprise any or a combination of a language, a dialect, a tone, extent of relevancy to the parsed text and voice attributes.
5. The system as claimed in claim 1, wherein the AI engine of the system includes a self-learning model based neural network comprising a plurality of layers, wherein the self-learning model is trained and updated based on received input data and the synthesized speech based audio output.
6. The system as claimed in claim 1, wherein the refinement may enable alteration of one or more characteristics of the concatenated speech, wherein the alteration of one or more characteristics is selected from energy based normalization, noise removal, voice activity detection, voice trimming, Cepstral smoothing and time-domain smoothing.
7. The system as claimed in claim 1, wherein the energy based normalization enables refinement in amplitude of the concatenated speech, wherein the noise removal enables identification of a threshold noise value by comparison with a low-noise segment of the concatenated speech, wherein effect of the threshold noise value is deducted from the concatenated speech.
8. The system as claimed in claim 1, wherein the voice activity detection and voice trimming enables removal of one or more unwarranted non-audio based silent segments, wherein the Cepstral smoothing facilitates refining of one or more stitched boundaries between each audio dataset in the concatenated speech, and wherein the time domain smoothing enables to generate time domain audio from the concatenated features on which a simple low pass moving average filter is applied to smoothen the generated audio.
9. The system as claimed in claim 1, wherein the prosody based attributes for refinement of the concatenated speech are selected from any or a combination of intonation feature, stress pattern feature, variations in loudness, pausing attribute, amplitude and rhythm.
10. A method for facilitating automated conversion of an input text by a user into a synthesized speech based audio output, the method comprising the steps:
processing, at a text processing engine of the system, an input text received from a user device associated with the user, wherein the input text is parsed to obtain a set of parsed text, the parsing being done to identify one or more pre-defined attributes associated with each parsed text;
concatenating, through an artificial intelligence engine of the system, a plurality of audio datasets to obtain a first output comprising a concatenated speech, wherein the concatenation is done by stitching the plurality of audio datasets in a predefined sequence based on the arrangement of each parsed set with respect to the input text, wherein the plurality of audio datasets are generated based on mapping of one or more pre-defined audio parameters of each audio dataset with the plurality of attributes of each parsed text; and
refining, through the AI engine, the concatenated speech based on one or more pre-defined prosody based attributes to obtain a second output comprising the synthesized speech based audio output, wherein synthesized speech based audio output is relatively more refined than the concatenated speech and wherein the system enables real-time conversion of the input text into the synthesized speech based audio output.
11. The method as claimed in claim 10, wherein the input text is received in form of a first set of data packets from the user device associated with the user, wherein the synthesized speech based audio output is converted by the system into a waveform that is transmitted to the user device in real time in form of a second set of data packets, upon receipt of the input text.
12. The method as claimed in claim 10, wherein the pre-defined attributes associated with each parsed text comprises any or a combination of a script corresponding to a language, a word, a syllable, a space and a punctuation mark.
13. The method as claimed in claim 10, wherein the plurality of audio datasets are pre-stored in a database, wherein the one or more pre-defined audio parameters comprise any or a combination of a language, a dialect, a tone, extent of relevancy to the parsed text and voice attributes.
14. The method as claimed in claim 10, wherein the AI engine of the system includes a self-learning model based neural network comprising a plurality of layers, wherein the self-learning model is trained and updated based on received input data and the synthesized speech based audio output.
15. The method as claimed in claim 10, wherein the refinement may enable alteration of one or more characteristics of the concatenated speech, wherein the alteration of one or more characteristics is selected from energy based normalization, noise removal, voice activity detection, voice trimming, Cepstral smoothing and time-domain smoothing.
16. The method as claimed in claim 10, wherein the energy based normalization enables refinement in amplitude of the concatenated speech, wherein the noise removal enables identification of a threshold noise value by comparison with a low-noise segment of the concatenated speech, wherein effect of the threshold noise value is deducted from the concatenated speech.
17. The method as claimed in claim 10, wherein the voice activity detection and voice trimming enables removal of one or more unwarranted non-audio based silent segments, wherein the Cepstral smoothing facilitates refining of one or more stitched boundaries between each audio dataset in the concatenated speech, and wherein the time domain smoothing enables to generate time domain audio from the concatenated features on which a simple low pass moving average filter is applied to smoothen the generated audio.
18. The method as claimed in claim 10, wherein the prosody based attributes for refinement of the concatenated speech are selected from any or a combination of intonation feature, stress pattern feature, variations in loudness, pausing attribute, amplitude and rhythm.
Description:

FIELD OF INVENTION
[0001] The embodiments of the present disclosure generally relate to facilitating conversion of text to speech. More particularly, the present disclosure relates to a system and method for facilitating automated conversion of an input text by a user into a synthesized speech based audio output based on an artificial intelligence based architecture.

BACKGROUND OF THE INVENTION
[0002] The following description of related art is intended to provide background information pertaining to the field of the disclosure. This section may include certain aspects of the art that may be related to various features of the present disclosure. However, it should be appreciated that this section be used only to enhance the understanding of the reader with respect to the present disclosure, and not as admissions of prior art.
[0003] Language is one of the basic modes of communication in human civilization, wherein more than 6500 languages are used around the world. It may be a common human tendency to prefer to interact in one’s own or native language even if official languages may be known to a person. Similarly, an individual may also like listening to any audio such as an audio book, in the language of their choice. However, it may be extremely difficult to convert text into audio using human recordings as such recordings may be limited. Hence a system or model that can generate human quality audio for as many languages as possible is greatly desired.
[0004] However, this is extremely impractical, as such models/systems, like conventional deep learning-based text to speech (TTS) conversion, may require a large quantity of training data and may be computationally expensive to run. Hence, even the best state-of-the-art solutions may be restricted to supporting just a few languages, which severely restricts the freedom of a user to choose his/her preferred language. Further, the conventional systems may generate a low-quality audio output that may sound synthetic or robotic in nature, thus negatively impacting user experience.
[0005] There is, therefore, a need in the art for a system and a method that can facilitate automated conversion of an input text by a user into a synthesized audio output by using a limited audio database, while providing a high quality audio output that resembles human speech, thereby providing an enhanced user experience.

OBJECTS OF THE PRESENT DISCLOSURE
[0006] Some of the objects of the present disclosure, which at least one embodiment herein satisfies are as listed herein below.
[0007] It is an object of the present disclosure to provide a system and a method for facilitating real-time automated conversion of an input text by a user into a synthesized speech based audio output.
[0008] It is an object of the present disclosure to provide a system and a method that can generate human quality speech audio from text in any language.
[0009] It is an object of the present disclosure to provide a system and a method that do not require a huge amount of data for facilitating the conversion of text to audio.
[0010] It is an object of the present disclosure to provide a system and a method for enhancing user experience and enabling an audio output that does not sound synthetic or robotic, unlike the output of conventional techniques.

SUMMARY
[0011] This section is provided to introduce certain objects and aspects of the present invention in a simplified form that are further described below in the detailed description. This summary is not intended to identify the key features or the scope of the claimed subject matter.
[0012] In order to achieve the aforementioned objectives, the present disclosure provides a system and method for facilitating automated conversion of an input text by a user into a synthesized speech based audio output based on an artificial intelligence based architecture. In an aspect, the system includes a processor that executes a set of executable instructions stored in a memory which, upon execution, may cause the system to process, at a text processing engine of the system, an input text received from a user device associated with the user, wherein the input text may be parsed to obtain a set of parsed text, the parsing being done to identify one or more pre-defined attributes associated with each parsed text. Further, the system can concatenate, through an artificial intelligence engine of the system, a plurality of audio datasets to obtain a first output comprising a concatenated speech, wherein the concatenation may be done by stitching the plurality of audio datasets in a predefined sequence based on the arrangement of each parsed set with respect to the input text, wherein the plurality of audio datasets may be generated based on mapping of one or more pre-defined audio parameters of each audio dataset with the pre-defined attributes of each parsed text. Furthermore, the system can refine, through the AI engine, the concatenated speech based on one or more pre-defined prosody based attributes to obtain a second output including the synthesized speech based audio output, which is relatively more refined than the concatenated speech. In an embodiment, the system may enable real-time conversion of the input text into the synthesized speech based audio output.
[0013] In another aspect, the present disclosure includes a method for facilitating automated conversion of an input text by a user into a synthesized speech based audio output. The method may be executed by a processor of a system, and includes the steps of processing, at a text processing engine of the system, an input text received from a user device associated with the user, wherein the input text may be parsed to obtain a set of parsed text, the parsing being done to identify one or more pre-defined attributes associated with each parsed text. The method can include concatenating, through an artificial intelligence engine of the system, a plurality of audio datasets to obtain a first output comprising a concatenated speech, wherein the concatenation may be done by stitching the plurality of audio datasets in a predefined sequence based on the arrangement of each parsed set with respect to the input text, wherein the plurality of audio datasets may be generated based on mapping of one or more pre-defined audio parameters of each audio dataset with the plurality of attributes of each parsed text. The method can include refining, through the AI engine, the concatenated speech based on one or more pre-defined prosody based attributes to obtain a second output comprising the synthesized speech based audio output, wherein synthesized speech based audio output may be relatively more refined than the concatenated speech.

BRIEF DESCRIPTION OF DRAWINGS
[0014] The accompanying drawings, which are incorporated herein and constitute a part of this invention, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present invention. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that such drawings include electrical components, electronic components or circuitry commonly used to implement such components.
[0015] FIG. 1 illustrates an exemplary architecture (100) for conversion of text to speech using conventional technique.
[0016] FIG. 2A illustrates an exemplary network architecture (200) in which or with which a system (270) of the present disclosure can be implemented, in accordance with an embodiment of the present disclosure.
[0017] FIG. 2B illustrates an exemplary representation (250) of system (270) or a centralized server (260), in accordance with an embodiment of the present disclosure.
[0018] FIG. 2C illustrates an exemplary representation (270) of a plurality of audio datasets generated for parsed text, in accordance with an embodiment of the present disclosure.
[0019] FIG. 3 illustrates an exemplary flow diagram (300) representation depicting the significant steps in conversion of an input text to a speech audio output, in accordance with an embodiment of the present disclosure.
[0020] FIG. 4 illustrates an exemplary representation (400) of components for generation of concatenated speech, in accordance with an embodiment of the present disclosure.
[0021] FIG. 5 illustrates an exemplary representation (500) depicting an artificial neural network associated with an artificial intelligence (AI) engine (216) of system (270), in accordance with an embodiment of the present disclosure.
[0022] FIG. 6A illustrates an exemplary representation (600) of an audio waveform generated by system (270), in accordance with an embodiment of the present disclosure.
[0023] FIG. 6B illustrates exemplary representation (650) for refinement of a concatenated speech output by using voice activity detection and trimming technique, in accordance with an embodiment of the present disclosure.
[0024] FIG. 7 illustrates exemplary method flow diagram (700) depicting a method for facilitating automated conversion of an input text by a user into a synthesized speech based audio output, in accordance with an embodiment of the present disclosure.
[0025] FIG. 8 refers to the exemplary computer system (800) in which or with which embodiments of the present disclosure can be utilized, in accordance with embodiments of the present disclosure.
[0026] The foregoing shall be more apparent from the following more detailed description of the invention.

DETAILED DESCRIPTION OF INVENTION
[0027] In the following description, for the purposes of explanation, various specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, that embodiments of the present disclosure may be practiced without these specific details. Several features described hereafter can each be used independently of one another or with any combination of other features. An individual feature may not address all of the problems discussed above or might address only some of the problems discussed above. Some of the problems discussed above might not be fully addressed by any of the features described herein.
[0028] The ensuing description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth.
[0029] Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.
[0030] Also, it is noted that individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.
[0031] The word “exemplary” and/or “demonstrative” is used herein to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited by such examples. In addition, any aspect or design described herein as “exemplary” and/or “demonstrative” is not necessarily to be construed as preferred or advantageous over other aspects or designs, nor is it meant to preclude equivalent exemplary structures and techniques known to those of ordinary skill in the art. Furthermore, to the extent that the terms “includes,” “has,” “contains,” and other similar words are used in either the detailed description or the claims, such terms are intended to be inclusive—in a manner similar to the term “comprising” as an open transition word—without precluding any additional or other elements.
[0032] Reference throughout this specification to “one embodiment” or “an embodiment” or “an instance” or “one instance” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
[0033] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0034] The present invention provides an artificial intelligence based solution for automatic speech synthesis in response to a text input, such that the system and method enable text-to-speech conversion for a wide variety of languages, which may also include any native language or the corresponding dialects. Further, the amount of data required by the present system for model training may be very small in comparison to conventional deep learning architectures that rely heavily on huge amounts of data, as depicted in FIG. 1. As illustrated in FIG. 1, an exemplary architecture (100) for conversion of text to speech using a conventional deep learning technique is shown, wherein a test text sentence (102) may be fed to a conventional deep learning based text-to-speech (TTS) conversion system (104), wherein the deep learning architecture may involve very large and complex networks with greater than 10 million parameters (108), which in turn may demand a huge amount of annotated data (110) as input, possibly involving more than ten thousand hours of data. The complex networks therein may also require a wide variety of input data with specified parameters (112) such as language information, text based features, phoneme distribution and the like. However, such complex systems may not be able to provide a refined audio as they fail to focus on the aspects of prosody, intonation and pitch, due to which the corresponding audio may sound robotic. Thus, the conventional deep learning techniques may be limited by the above mentioned aspects. Further, even though current state-of-the-art deep learning models can incorporate prosody, style and intonation, they would still require a huge amount of data, rely on complex models that may be difficult to deploy on smartphones, and the overall implementation may not be cost-effective, especially considering that such models would need to be trained for every single language and corresponding dialect. The present disclosure overcomes these concerns by ensuring that, using limited data, a wide variety of language based text can be converted to audio and that the final audio generated is of high quality, thereby enhancing the user experience in a cost-effective manner. Further, the models used by the system in the present disclosure are self-learning models that can be trained and updated without the requirement of a huge amount of data.
[0035] Referring to FIG. 2A that illustrates an exemplary network architecture (200) in which or with which system (270) of the present disclosure can be implemented, in accordance with an embodiment of the present disclosure. As illustrated, the exemplary architecture (200) includes a system (270) equipped with an artificial intelligence (AI) engine (216) for facilitating automated conversion of an input text by a user into a synthesized speech based audio output based on AI architecture. The input text may be sent by a user (252) using a user device (254) such that the input text may be transmitted in form of data packets from the user device (254) to the system (270) via a network (258). The system (270) may be communicably coupled to a centralized server (260) and to an external database (256). The external database (256) may include a plurality of pre-stored audio datasets that may be used by the system (270) for speech synthesis in generating the final audio output. The centralized server (260) may enable storing the received input text and the corresponding output.
[0036] In an embodiment, the user (252) may type the text query in any language, wherein the input text may be received in form of a first set of data packets from the user device (254). The user (252) may be any individual interested in listening to audio in a preferred language of their choice. The system may generate a synthesized speech based audio output in response to the text query, wherein the audio output may be converted by the system (270) into a waveform that may be transmitted to the user device in real time in form of a second set of data packets, upon receipt of the input text from the user. The user device (254) may be equipped with an in-built audio output component, or coupled to an external device, that enables the user to hear the synthesized speech based audio output on the user device (254).
[0037] In an embodiment, the user device (254) may communicate with the system (270) via a set of executable instructions residing on any operating system, including but not limited to, Android TM, iOS TM, Kai OS TM and the like. In an embodiment, the user device (254) may include, but is not limited to, any electrical, electronic or electro-mechanical equipment, or a combination of one or more of the above devices, such as a mobile phone, smartphone, virtual reality (VR) device, augmented reality (AR) device, laptop, general-purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device, wherein the user device may include one or more in-built or externally coupled accessories including, but not limited to, a keyboard and input devices for receiving input from a user such as a touch pad, a touch enabled screen, an electronic pen and the like. It may be appreciated that the user device (254) may not be restricted to the mentioned devices and various other devices may be used.
[0038] In an embodiment, the system (270) enables processing of the input text received from a user device associated with the user, wherein the input text may be parsed to obtain a set of parsed text. In an embodiment, the parsing may be done to identify one or more pre-defined attributes associated with each parsed text. In an embodiment, the pre-defined attributes associated with each parsed text may include any or a combination of a script corresponding to a language, a word, a syllable, a space and a punctuation mark. In an exemplary embodiment, an input text such as "My name is Kridha" may be parsed into a set of four parsed texts, including a first parsed text "My", a second parsed text "name", a third parsed text "is" and a fourth parsed text "Kridha", to identify the words therein. In another exemplary embodiment, the parsing may be done to identify the syllables. In another exemplary embodiment, the parsing may be done to identify the language of the input text. In another exemplary embodiment, the parsing may be done to identify the punctuation marks so as to enable consideration of expression at the time of speech synthesis. The text processing may not be limited to parsing; other types of attribute extraction may also be performed to identify one or more features of the input text.
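As an illustration of the kind of parsing described above, the following Python sketch splits an input text into words, spaces and punctuation marks and tags each parsed piece with simple attributes. The regular expression, the attribute names and the crude script check are assumptions made for this example; they are not taken from the disclosure.

```python
import re

# Illustrative sketch only: parse an input text into words, spaces and
# punctuation marks, and tag each parsed piece with simple attributes.
TOKEN_PATTERN = re.compile(r"\w+|\s+|[^\w\s]", re.UNICODE)

def parse_input_text(input_text):
    parsed = []
    for piece in TOKEN_PATTERN.findall(input_text):
        if piece.isspace():
            kind = "space"
        elif piece.isalnum():
            kind = "word"
        else:
            kind = "punctuation"
        parsed.append({
            "text": piece,
            "attribute": kind,
            # crude script hint (Devanagari vs. Latin), purely illustrative
            "script": "devanagari" if any("\u0900" <= c <= "\u097F" for c in piece) else "latin",
        })
    return parsed

print(parse_input_text("My name is Kridha"))
# -> [{'text': 'My', 'attribute': 'word', ...}, {'text': ' ', 'attribute': 'space', ...}, ...]
```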
[0039] In an embodiment, the system (270) may concatenate a plurality of audio datasets to obtain a first output comprising a concatenated speech, wherein the concatenation may be achieved through the AI engine (216) of the system (270). In general terminology, the term "concatenate" refers to linking in a chain or in a series. In accordance with the present disclosure, concatenation means to link or stitch the plurality of audio datasets in a predefined sequence, wherein this sequence may be dependent on the arrangement of each parsed set with respect to the input text. In an exemplary embodiment, and considering the previous example as shown in FIG. 2C, the input text "My name is Kridha" may correspond to four parsed texts, namely "My", "name", "is" and "Kridha", such that a plurality of audio datasets that are most relevant to the four parsed texts may be concatenated in the same sequence in which the parsed texts occur in the input text. Similarly, several other parameters may be considered. In an embodiment, the one or more pre-defined audio parameters may include any or a combination of a language, a dialect, a tone, extent of relevancy to the parsed text and voice attributes.
[0040] In an embodiment, the plurality of audio datasets may be generated based on mapping of one or more pre-defined audio parameters of each audio dataset with the pre-defined attributes of each parsed text. In an embodiment, to achieve this, one of the pre-defined attributes, for example, the language of the parsed text, may be identified, and then, based on the language of the parsed text, the corresponding language based audio parameters may be chosen by mapping the language. In another exemplary embodiment, the pre-defined attribute of the parsed text may be identified based on the punctuation marks, wherein, based on the type of punctuation mark such as a question mark, an exclamation mark, a comma, a full stop and the like, the AI engine (216) may map the pre-defined audio parameters such that an appropriate tone and voice attribute may be considered while generating the plurality of audio datasets. In an embodiment, the plurality of audio datasets may be generated based on previously stored audio data in a database of the system (270) or from an external database (256).
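The mapping and stitching steps described in the two paragraphs above might be sketched as follows, assuming the pre-stored audio datasets are held in a dictionary keyed by script and word and represented as NumPy arrays at a fixed sample rate; the key structure, the fallback pause and all names are illustrative assumptions rather than details from the disclosure.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate for the pre-stored clips

def lookup_audio(parsed_item, audio_db):
    """Map a parsed text to the most relevant pre-stored audio dataset."""
    key = (parsed_item["script"], parsed_item["text"].lower())
    clip = audio_db.get(key)
    if clip is None:
        # fall back to a short pause for unknown pieces
        clip = np.zeros(int(0.05 * SAMPLE_RATE), dtype=np.float32)
    return clip

def concatenate_speech(parsed_texts, audio_db):
    """Stitch the looked-up clips in the order the parsed texts occur."""
    clips = [lookup_audio(p, audio_db) for p in parsed_texts if p["attribute"] != "space"]
    return np.concatenate(clips) if clips else np.zeros(0, dtype=np.float32)
```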
[0041] The concatenated speech may not be refined and may sound synthetic in nature. However, the AI engine (216) of the system (270) can refine the concatenated speech based on one or more pre-defined prosody based attributes to obtain a second output including the synthesized speech based audio output, which may be relatively more refined than the concatenated speech. In an embodiment, the refinement of the concatenated speech can include any or a combination of steps selected from energy based normalization, noise removal, voice activity detection, voice trimming, Cepstral smoothing and time-domain smoothing. These refinement techniques may improve the overall quality of the audio and render human quality audio that is much better than both the concatenated audio and the output of conventional techniques. Further, the system (270) enables conversion of the input text into the synthesized speech based audio output in real-time.
[0042] In an embodiment, the system (270) may include one or more processors coupled with a memory, wherein the memory may store instructions which, when executed by the one or more processors, may cause the system to perform the generation of an automated audio output in response to an input text. FIG. 2B, with reference to FIG. 2A, illustrates an exemplary representation of the system (270)/centralized server (260) for facilitating automated conversion of an input text by a user into a synthesized speech based audio output based on an artificial intelligence based architecture, in accordance with an embodiment of the present disclosure. In an aspect, the system (270)/centralized server (260) may comprise one or more processor(s) (202). The one or more processor(s) (202) may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that process data based on operational instructions. Among other capabilities, the one or more processor(s) (202) may be configured to fetch and execute computer-readable instructions stored in a memory (206) of the system (270). The memory (206) may be configured to store one or more computer-readable instructions or routines in a non-transitory computer readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory (206) may comprise any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0043] In an embodiment, the system (270)/centralized server (260) may include an interface(s) 204. The interface(s) 204 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 204 may facilitate communication of the system (270). The interface(s) 204 may also provide a communication pathway for one or more components of the system (270) or the centralized server (260). Examples of such components include, but are not limited to, processing engine(s) 208 and a database 210.
[0044] The processing engine(s) (208) may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) (208). In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) (208) may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) (208) may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) (208). In such examples, the system (270) /centralized server (260) may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system (270) /centralized server (260) and the processing resource. In other examples, the processing engine(s) (208) may be implemented by electronic circuitry.
[0045] The processing engine (208) may include one or more engines selected from any of a text processing engine (212), a speech synthesis engine (214), the AI engine (216), a learning module (218) and other engines (220). In an embodiment, the text processing engine (212) of the system (270) can process the input text received from the user device (254) associated with the user (252). As explained earlier, the processing of the input text may mainly involve parsing to obtain a set of parsed text. The text processing engine (212) may include a set of pre-stored parameters that can be used for identification of pre-defined attributes associated with each parsed text after the parsing of the input text. In an embodiment, prior to processing, the text processing engine may pre-process the input text to remove any errors or meaningless words. The speech synthesis engine (214) may enable generation or retrieval of the plurality of audio datasets from the external database based on the mapping of one or more pre-defined audio parameters of each audio dataset with the one or more pre-defined attributes of each parsed text, which audio datasets are then concatenated by the AI engine. The AI engine (216) may refine the concatenated speech based on prosody based attributes to provide a better listening experience to a user, such that the final refined synthesized speech based audio output may not sound robotic. The prosody based attributes may be selected from any or a combination of intonation feature, stress pattern feature, variations in loudness, pausing attribute, amplitude and rhythm. By a general definition, the term "prosody" refers to one or more elements of a speech, including linguistic functions such as intonation, tone, stress and rhythm, that enhance the expressiveness of a sentence. The learning module (218) may include a self-learning model that can be updated with the input text and the generated audio output, based on which the model can keep updating in real time. The other engines (220) may include a waveform generation engine to obtain the synthesized speech based audio output as a waveform that is sent to the user device in form of data packets. The database (210) may comprise data that may be either stored or generated as a result of functionalities implemented by any of the components of the processing engine(s) (208) of the system (270)/centralized server (260). The database (210) may also store data fed to the AI engine (216) during the learning phase.
[0046] FIG. 3 illustrates an exemplary flow diagram (300) depicting the significant steps in conversion of an input text to a speech audio output, in accordance with an embodiment of the present disclosure. In an embodiment and as shown in FIG. 3, the input text (320) may be received from a user device, wherein the input text may be processed (302) by parsing to obtain a set of parsed text, wherein parsing may be done to identify one or more pre-defined attributes associated with each parsed text. At 304, a plurality of audio datasets may be generated based on mapping of one or more pre-defined audio parameters of each audio dataset with the one or more pre-defined attributes of each parsed text, wherein the plurality of audio datasets may be obtained from a database of audio blocks (310), which could be an external database or a database of the system. The AI engine (216) may concatenate the plurality of audio datasets by stitching them in a predefined sequence based on the arrangement of each parsed set with respect to the input text to obtain a first output including a concatenated speech. At 306, the concatenated speech may be refined by the AI engine based on one or more pre-defined prosody based attributes to obtain a synthesized speech based audio output that is relatively more refined than the concatenated speech. The audio output may be transformed into a waveform (308) and the corresponding output speech may be sent to the user device.
[0047] FIG. 4 illustrates an exemplary representation (400) of components for generation of concatenated speech, in accordance with an embodiment of the present disclosure. As illustrated in FIG. 4, the system (270) may enable concatenation of the plurality of audio datasets by using an impulse train generator (404) and a white noise generator (406), based on the pattern of audio and pauses needed in the generation of the concatenated speech. The concatenation can be controlled by a switch (408), and upon determining the system gain (410), the vocal tract transfer function (412) may be activated based on parameters (414) to obtain the output speech (416).
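FIG. 4 describes what is essentially a classic source-filter arrangement. The sketch below, offered only as an illustration, selects between an impulse-train and a white-noise excitation, applies a gain and passes the result through an all-pole "vocal tract" filter; the sample rate, pitch value and LPC coefficients are placeholder assumptions, not values from the disclosure.

```python
import numpy as np
from scipy.signal import lfilter

SAMPLE_RATE = 16000  # assumed

def impulse_train(n_samples, pitch_hz):
    """Periodic excitation used for voiced sounds."""
    period = int(SAMPLE_RATE / pitch_hz)
    excitation = np.zeros(n_samples)
    excitation[::period] = 1.0
    return excitation

def synthesize_frame(n_samples, voiced, gain, lpc_coeffs, pitch_hz=120.0):
    # switch between periodic (voiced) and white-noise (unvoiced) excitation
    source = impulse_train(n_samples, pitch_hz) if voiced else np.random.randn(n_samples)
    # vocal tract modelled as an all-pole filter gain / A(z)
    return lfilter([gain], np.concatenate(([1.0], lpc_coeffs)), source)

# e.g. a 10 ms voiced frame through a toy 2nd-order filter
frame = synthesize_frame(160, voiced=True, gain=0.8, lpc_coeffs=np.array([-1.3, 0.5]))
```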
[0048] The AI engine (216) may be associated with one or more self-learning model based neural networks, wherein the self-learning model may be trained and updated based on the received input data and the synthesized speech based audio output. In an exemplary embodiment, the neural network associated with the AI engine (216) of the system (270) may include multiple layers, as shown in an exemplary representation (500) in FIG. 5. In an embodiment, the neural network (500) may include neurons (represented as circles), wherein the neural network may have three basic layers, including an input layer (502) (D1, D2, D3… Dn), a hidden layer (504) and an output layer (506) (Y). In an exemplary embodiment, the input text and the generated synthesized output may be used to train and update the neural network in real-time.
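For orientation only, a minimal forward pass through the three-layer arrangement of FIG. 5 (input D1..Dn, one hidden layer, output Y) could look like the following; the layer sizes, random weights and sigmoid activation are assumptions, not values from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 8, 4, 1  # placeholder layer sizes

W1 = rng.normal(size=(n_inputs, n_hidden))   # input -> hidden weights
W2 = rng.normal(size=(n_hidden, n_outputs))  # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(d):
    """d: input vector [D1, ..., Dn]; returns the output Y."""
    hidden = sigmoid(d @ W1)
    return sigmoid(hidden @ W2)

y = forward(rng.normal(size=n_inputs))
```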
[0049] The final waveform generated as the audio output may be obtained after refining the concatenated speech, wherein the refinement may enable alteration of one or more characteristics of the concatenated speech. FIG. 6A illustrates an exemplary representation (600) of an audio waveform generated by the system (270), in accordance with an embodiment of the present disclosure. As illustrated in FIG. 6A and as per an embodiment, the waveform 604 includes a plurality of frames 606 (each of length 10 ms), wherein it can be seen that the amplitude of the waveform varies as shown in 602. In addition, each frame of the audio is associated with a floating point value (608), cepstral coefficients (610), a pitch parameter (612) and an LPC gain (614).
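A sketch of the framing described for FIG. 6A is given below, assuming a 16 kHz sample rate so that a 10 ms frame spans 160 samples; the per-frame feature values computed here (frame RMS as a stand-in for the gain, zeros for the cepstra and pitch) are placeholders for the parameters named above.

```python
import numpy as np

SAMPLE_RATE = 16000              # assumed
FRAME_LEN = int(0.010 * SAMPLE_RATE)  # 10 ms -> 160 samples

def frame_waveform(waveform, n_cepstra=13):
    """Split a waveform into 10 ms frames and attach per-frame parameters."""
    frames = []
    for start in range(0, len(waveform) - FRAME_LEN + 1, FRAME_LEN):
        frame = waveform[start:start + FRAME_LEN]
        frames.append({
            "samples": frame.astype(np.float32),
            "gain": float(np.sqrt(np.mean(frame ** 2))),   # LPC-gain stand-in
            "cepstral_coefficients": np.zeros(n_cepstra),  # placeholder
            "pitch": 0.0,                                  # placeholder
        })
    return frames
```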
[0050] In an embodiment, during refinement, the alteration of one or more characteristics of the concatenated speech may be selected from energy-based normalization, noise removal, voice activity detection, voice trimming, Cepstral smoothing and time-domain smoothing. In an embodiment, the energy based normalization may enable refinement in the amplitude of the concatenated speech. It is commonly observed that all the audio files need not be of the same amplitude range, which makes simple stitching during concatenation ineffective and also introduces a problem in deciding the threshold value of noise for the next block. In an exemplary embodiment, energy based normalization can be done using Equation 1 and Equation 2 (not reproduced here), and an exemplary result of energy based normalization on a chosen sequence is provided in the form of an audio file (also not reproduced here).
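Since Equations 1 and 2 are not reproduced here, the following is only a hedged sketch of a standard RMS based energy normalization consistent with the description above: each segment is scaled so its RMS energy matches a common target before stitching. The target value is an arbitrary choice for illustration.

```python
import numpy as np

def rms(x):
    """Root-mean-square energy of a segment."""
    return float(np.sqrt(np.mean(np.square(x)))) if len(x) else 0.0

def normalize_energy(segment, target_rms=0.1, eps=1e-8):
    """Scale a segment so that its RMS energy matches target_rms."""
    return segment * (target_rms / (rms(segment) + eps))

# Normalizing each clip before concatenation keeps the noise threshold
# used in the next (noise removal) block consistent across segments.
```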

[0051] In an embodiment, the noise removal may include identification of a threshold noise value by comparison with a low-noise segment of the concatenated speech, wherein the effect of the threshold noise value may be deducted from the concatenated speech using corresponding equations (not reproduced here). An exemplary result of this block is provided in the form of an audio file (also not reproduced here).
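The noise-removal equations are likewise not reproduced, so the sketch below shows one standard way to realise the idea described above, namely estimating a noise floor from the quietest (low-noise) segment and deducting it from the magnitude spectrum (simple spectral subtraction). The frame length and the choice of the quietest frame as the noise estimate are assumptions.

```python
import numpy as np

FRAME = 512  # assumed analysis frame length

def remove_noise(signal):
    """Subtract a noise floor estimated from the quietest frame."""
    n_frames = len(signal) // FRAME
    frames = signal[:n_frames * FRAME].reshape(n_frames, FRAME)
    spectra = np.fft.rfft(frames, axis=1)
    magnitudes = np.abs(spectra)

    # noise threshold taken from the lowest-energy (low-noise) frame
    quietest = np.argmin(np.sum(magnitudes ** 2, axis=1))
    noise_floor = magnitudes[quietest]

    # deduct the noise floor, never letting magnitudes go negative
    cleaned_mag = np.maximum(magnitudes - noise_floor, 0.0)
    cleaned = np.fft.irfft(cleaned_mag * np.exp(1j * np.angle(spectra)), n=FRAME, axis=1)
    return cleaned.reshape(-1)
```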

[0052] In an embodiment, the voice activity detection (VAD) and voice trimming may enable removal of one or more unwarranted non-audio based silent segments. The individual recorded audios have beginning and trailing silences; to remove these, VAD is used to detect silence regions using an energy based approach, and these parts are removed from the audio, as illustrated in FIG. 6B, which shows an exemplary representation (650) for refinement of a concatenated speech output by using the voice activity detection and trimming technique, in accordance with an embodiment of the present disclosure. As shown in FIG. 6B, the input signal 652 (the audio that needs to be refined) is obtained and can be assessed at data block 654 or data block 670, wherein at 656 the data may be subjected to a Fast Fourier Transform (FFT), after which peak picking (658), tone detection (660), reconstruction (662), LSPE calculation (664) and detection (666) may occur. On data block 670, noise level detection (672) and energy detection (674) may occur, which can be compared by a comparator (676). Both the streams (666) and (676) are subjected to fusion (678) to obtain the VAD output (680). An exemplary result of these blocks is provided in the form of an audio file (not reproduced here).
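A minimal, assumption-laden sketch of the energy-detection branch of FIG. 6B is shown below: frame energies are compared against a threshold relative to the loudest frame, and leading and trailing silent regions are trimmed. The frame length and the -40 dB threshold are illustrative choices only.

```python
import numpy as np

FRAME = 160  # 10 ms at an assumed 16 kHz sample rate

def trim_silence(signal, threshold_db=-40.0):
    """Energy-based VAD: drop leading/trailing frames below the threshold."""
    n_frames = len(signal) // FRAME
    frames = signal[:n_frames * FRAME].reshape(n_frames, FRAME)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = energy_db > (energy_db.max() + threshold_db)
    if not active.any():
        return signal[:0]  # nothing but silence
    first = np.argmax(active)
    last = n_frames - np.argmax(active[::-1]) - 1
    return signal[first * FRAME:(last + 1) * FRAME]
```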

[0053] In an embodiment, the Cepstral smoothing facilitates refining of one or more stitched boundaries between each audio dataset in the concatenated speech. To smooth the transitions at the boundary of stitched words, cepstral domain smoothing is performed on the stitched feature vectors using an equation (not reproduced here) in which G is the smoothened signal, k is the frequency bin, m is the time index, and β is a tuning parameter (0.1 being a workable value). An exemplary result of this block is provided in the form of an audio file (also not reproduced here).
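The smoothing equation itself is not reproduced here. One common construction consistent with the symbols defined above (G the smoothed signal, k the frequency bin, m the time index, β ≈ 0.1) is a first-order recursive smoothing of the stitched feature vectors across time; the sketch below implements that reading as an assumption, not as the equation from the filing.

```python
import numpy as np

def cepstral_smooth(features, beta=0.1):
    """features: array of shape (time, bins); returns a smoothed copy G[m, k].

    Assumed recursion: G[m] = beta * G[m-1] + (1 - beta) * features[m],
    which softens discontinuities at the stitched word boundaries.
    """
    smoothed = np.copy(features).astype(np.float64)
    for m in range(1, smoothed.shape[0]):
        smoothed[m] = beta * smoothed[m - 1] + (1.0 - beta) * features[m]
    return smoothed
```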

[0054] In an embodiment, the time domain smoothing enables generation of time domain audio from the concatenated features, on which a simple low pass moving average filter is applied to smoothen the generated audio. The time domain audio is created from the concatenated features, and a simple low pass moving average filter is applied to it to smoothen the audio and aid continuity, using a convolution based equation (not reproduced here), where * denotes convolution. An exemplary result of all the blocks is provided in the form of an audio file (also not reproduced here).
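As a hedged illustration of the convolution described above, the sketch below applies a short rectangular moving-average kernel to the generated time-domain audio; the five-sample kernel length is an assumption.

```python
import numpy as np

def moving_average_smooth(audio, kernel_len=5):
    """Apply a simple low-pass moving average filter: y = x * h."""
    kernel = np.ones(kernel_len) / kernel_len  # rectangular low-pass FIR
    return np.convolve(audio, kernel, mode="same")
```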

[0055] FIG. 7 illustrates exemplary method flow diagram (700) depicting a method for facilitating automated conversion of an input text by a user into a synthesized speech based audio output, in accordance with an embodiment of the present disclosure. At step 702, the method includes the step of processing, at a text processing engine of the system, an input text received from a user device associated with the user, wherein the input text is parsed to obtain a set of parsed text, the parsing being done to identify one or more pre-defined attributes associated with each parsed text. At step 704, the method includes the step of concatenating, through an artificial intelligence engine of the system, a plurality of audio datasets to obtain a first output comprising a concatenated speech, wherein the concatenation is done by stitching the plurality of audio datasets in a predefined sequence based on the arrangement of each parsed set with respect to the input text, wherein the plurality of audio datasets may be generated based on mapping of one or more pre-defined audio parameters of each audio dataset with the plurality of attributes of each parsed text. At step 706, the method includes the step of refining, through the AI engine, the concatenated speech based on one or more pre-defined prosody based attributes to obtain a second output comprising the synthesized speech based audio output, wherein synthesized speech based audio output is relatively more refined than the concatenated speech and wherein the system enables real-time conversion of the input text into the synthesized speech based audio output.
[0056] FIG. 8 illustrates an exemplary computer system in which or with which embodiments of the present disclosure can be utilized, in accordance with embodiments of the present disclosure. As shown in FIG. 8, the computer system 800 can include an external storage device 810, a bus 820, a main memory 830, a read only memory 840, a mass storage device 850, a communication port 860, and a processor 870. A person skilled in the art will appreciate that the computer system may include more than one processor and communication ports. Examples of processor 870 include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system on chip processors or other future processors. Processor 870 may include various modules associated with embodiments of the present disclosure. Communication port 860 can be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. Communication port 860 may be chosen depending on a network, such as a Local Area Network (LAN), a Wide Area Network (WAN), or any network to which the computer system connects. Memory 830 can be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art. Read-only memory 840 can be any static storage device(s), e.g., but not limited to, Programmable Read Only Memory (PROM) chips for storing static information, e.g., start-up or BIOS instructions for processor 870. Mass storage 850 may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g., those available from Seagate (e.g., the Seagate Barracuda 7252 family) or Hitachi (e.g., the Hitachi Deskstar 7K1000), one or more optical discs, or Redundant Array of Independent Disks (RAID) storage, e.g., an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc.
[0057] Bus 820 communicatively couples processor(s) 870 with the other memory, storage and communication blocks. Bus 820 can be, e.g., a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems, as well as other buses, such as a front side bus (FSB), which connects processor 870 to the software system.
[0058] Optionally, operator and administrative interfaces, e.g. a display, keyboard, and a cursor control device, may also be coupled to bus 820 to support direct operator interaction with a computer system. Other operator and administrative interfaces can be provided through network connections connected through communication port 860. The external storage device 810 can be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc - Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk-Read Only Memory (DVD-ROM). Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system limit the scope of the present disclosure.
[0059] Thus, the present disclosure provides a unique and inventive solution for facilitating automated conversion of an input text by a user into a synthesized speech based audio output based on an artificial intelligence architecture, thus providing an automated and improved user experience. The solution offered by the present disclosure enables extremely fast, real-time deployment, such that a user typing a text may get the corresponding speech output in real time, and the system provides human quality audio, unlike existing state-of-the-art models which produce robotic-sounding audio. Further, the present disclosure does not require a huge corpus or a resource-intensive synthesizer, and the present system can correctly model prosody and intonation. At the same time, the system can be used for all dialect variations and nuances even within a single language, thus making the system a universal text to speech (TTS) converting medium for any user desiring speech generation in any language or dialect. Moreover, the ingenious AI engine of the system does not require a huge amount of data, unlike conventional deep learning systems having complex networks. Thus, the system and method of the present disclosure provide an efficient, economical tool that delivers an enhanced user experience.
[0060] While considerable emphasis has been placed herein on the preferred embodiments, it will be appreciated that many embodiments can be made and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other changes in the preferred embodiments of the invention will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.

Documents

Application Documents

# Name Date
1 202121000018-FORM-26 [28-02-2025(online)].pdf 2025-02-28
2 202121000018-FORM-8 [11-10-2024(online)].pdf 2024-10-11
3 202121000018-Annexure [26-03-2024(online)].pdf 2024-03-26
4 202121000018-FORM 3 [26-03-2024(online)].pdf 2024-03-26
5 202121000018-Written submissions and relevant documents [26-03-2024(online)].pdf 2024-03-26
6 202121000018-Correspondence to notify the Controller [07-03-2024(online)].pdf 2024-03-07
7 202121000018-FORM-26 [07-03-2024(online)].pdf 2024-03-07
8 202121000018-US(14)-HearingNotice-(HearingDate-11-03-2024).pdf 2024-02-12
9 202121000018-ABSTRACT [27-07-2022(online)].pdf 2022-07-27
10 202121000018-CLAIMS [27-07-2022(online)].pdf 2022-07-27
11 202121000018-CORRESPONDENCE [27-07-2022(online)].pdf 2022-07-27
12 202121000018-FER_SER_REPLY [27-07-2022(online)].pdf 2022-07-27
13 202121000018-FORM 3 [27-07-2022(online)].pdf 2022-07-27
14 202121000018-FORM 3 [27-06-2022(online)].pdf 2022-06-27
15 202121000018-FER.pdf 2022-01-28
16 202121000018 CORRESPONDANCE WIPO DAS 11-01-2022.pdf 2022-01-11
17 202121000018-Covering Letter [06-01-2022(online)].pdf 2022-01-06
18 202121000018-FORM-9 [29-12-2021(online)].pdf 2021-12-29
19 Abstract1.jpg 2021-10-19
20 202121000018-Proof of Right [15-06-2021(online)].pdf 2021-06-15
21 202121000018-FORM-26 [17-03-2021(online)].pdf 2021-03-17
22 202121000018-COMPLETE SPECIFICATION [01-01-2021(online)].pdf 2021-01-01
23 202121000018-DECLARATION OF INVENTORSHIP (FORM 5) [01-01-2021(online)].pdf 2021-01-01
24 202121000018-DRAWINGS [01-01-2021(online)].pdf 2021-01-01
25 202121000018-FORM 1 [01-01-2021(online)].pdf 2021-01-01
26 202121000018-FORM 18 [01-01-2021(online)].pdf 2021-01-01
27 202121000018-REQUEST FOR EXAMINATION (FORM-18) [01-01-2021(online)].pdf 2021-01-01
28 202121000018-STATEMENT OF UNDERTAKING (FORM 3) [01-01-2021(online)].pdf 2021-01-01

Search Strategy

1 SS_202121000018E_27-01-2022.pdf