Abstract: A system for text-to-speech synthesis is disclosed. The system includes a concatenation module configured to generate one or more sentences at runtime by concatenating a plurality of speech units and to produce a resulting waveform for the plurality of concatenated speech units. The system includes a variance analysis module configured to calculate a spectrogram of the resulting waveform by a short-time Fourier transform and to analyse the variance of the calculated spectrogram across the energies of all frequencies. The system includes a click sound removal module configured to detect the click sound using a formulated threshold, segregate the click sound over the spectrogram by adding a time window around the detected click sound, and minimise the amplitude of the click sound by using a fade-out filter before the click sound and a fade-in filter after the click sound.
[0001] Embodiments of the present disclosure relate to generating synthesized
speech, and more particularly to a system and a method for text-to-speech synthesis
by concatenating multiple speech units.
BACKGROUND
[0002] Generally, any input text undergoes morphological analysis and syntactic
parsing in the language processing unit, and then undergoes accent and intonation
processes in the prosodic processing unit to output phoneme string and prosodic
features or suprasegmental features (pitch or fundamental frequency, duration or
phoneme duration time, power, and the like). After all such processes, a speech
signal is synthesized from the phoneme string and the prosodic features.
[0003] The quality of any such speech synthesizer is assessed by the speech
similarity to the human voice and by the speech clarity. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to
listen to written words on a computing device.
[0004] Conventionally, for speech synthesis, small synthesis units (e.g., CV,
CVC, VCV, and the like (V=vowel, C=consonant)) are stored, and are selectively
read out. Such speech synthesis relies on spectrogram analysis. The spectrogram is
an image of frequencies, energies and their variation across time. In such a
method, the frequencies and durations of the stored speech units are controlled,
and the segments are then connected to generate synthetic speech. An efficient
concatenation of two adjacent speech units during speech synthesis avoids
introducing any acoustical mismatch or unwanted noise. However, during the
joining or "concatenating" of multiple snippets, if the transition between two
snippets is not smooth, then a "click" sound is heard at the joint. This can be
very irritating for the end user, and the present disclosure aims to minimise the
loudness of such sounds.
[0005] The presently known techniques are not efficient at analysing the variance
of spectrograms to detect and fix points of discontinuity in concatenative speech
synthesis.
[0006] Hence, there is a need for an improved system for text-to-speech
synthesis by concatenating multiple speech units and a method of operating the
same, thereby addressing the aforementioned issues.
BRIEF DESCRIPTION
[0007] In accordance with one embodiment of the disclosure, a system for
text-to-speech synthesis is disclosed. The system includes a concatenation module
operable by one or more processors. The concatenation module is configured to
generate one or more sentences at runtime by concatenating a plurality of speech
units and produce a resulting waveform for the plurality of concatenated speech units.
The system also includes a variance analysis module operable by the one or more
processors. The variance analysis module is operatively coupled to the
concatenation module. The variance analysis module is configured to calculate a
spectrogram of the resulting waveform by a short-time Fourier transform.
[0008] The variance analysis module is also configured to analyse the variance
of the calculated spectrogram across the energies of all frequencies. The system
also includes a click sound removal module operable by the one or more processors.
The click sound removal module is operatively coupled to the variance analysis
module. The click sound removal module is configured to formulate a pre-determined
threshold for which the calculated spectrogram contains a click sound. The click
sound removal module is also configured to detect the click sound using the
formulated threshold. The click sound removal module is also configured to
segregate the click sound over the spectrogram by adding a time window around the
detected click sound.
configured to minimise the amplitude of the click sound by using a fade-out filter
before the click sound and a fade-in filter after the click sound.
[0009] In accordance with one embodiment of the disclosure, a method for
text-to-speech synthesis is disclosed. The method includes generating one or more
sentences at runtime by concatenating a plurality of speech units and producing a
resulting waveform for the plurality of concatenated speech units. The method also
includes calculating a spectrogram of the resulting waveform by a short-time
Fourier transform. The method also includes analysing the variance of the
calculated spectrogram across the energies of all frequencies. The method also
includes formulating a pre-determined threshold for which the calculated
spectrogram contains a click sound.
[0010] The method also includes detecting the click sound using the formulated
threshold. The method also includes segregating the click sound over the
spectrogram by adding a time window around the detected click sound. The method
also includes minimising the amplitude of the click sound by using a fade-out
filter before the click sound and a fade-in filter after the click sound.
[0011] To further clarify the advantages and features of the present disclosure, a
more particular description of the disclosure will follow by reference to specific
embodiments thereof, which are illustrated in the appended figures. It is to be
appreciated that these figures depict only typical embodiments of the disclosure and
are therefore not to be considered limiting in scope. The disclosure will be described
and explained with additional specificity and detail with the appended figures.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The disclosure will be described and explained with additional specificity
and detail with the accompanying figures in which:
[0013] FIG. 1 is a block diagram representation of a system for text-to-speech
synthesis in accordance with an embodiment of the present disclosure;
[0014] FIG. 2 is a flowchart representation of an embodiment representing click
removable process of FIG. 1 in accordance with an embodiment of the present
disclosure;
[0015] FIG. 3a is a graphical representation of an embodiment representing the
spectrogram before threshold computation in accordance with an embodiment of
the present disclosure;
[0016] FIG. 3b is a graphical representation of an embodiment representing the
spectrogram after threshold computation in accordance with an embodiment of the
present disclosure;
[0017] FIG. 4a is a graphical representation of an embodiment representing the
alignment measure of -0.9391 in accordance with an embodiment of the present
disclosure;
[0018] FIG. 4b is a graphical representation of an embodiment representing the
alignment measure of -0.1866 in accordance with an embodiment of the present
disclosure;
[0019] FIG. 5 is a block diagram of a computer or a server in accordance with
an embodiment of the present disclosure; and
[0020] FIG. 6 is a flowchart representing the steps of a method for text-to-speech
synthesis in accordance with an embodiment of the present disclosure.
[0021] Further, those skilled in the art will appreciate that elements in the figures
are illustrated for simplicity and may not have necessarily been drawn to scale.
Furthermore, in terms of the construction of the device, one or more components of
the device may have been represented in the figures by conventional symbols, and
the figures may show only those specific details that are pertinent to understanding
the embodiments of the present disclosure so as not to obscure the figures with
details that will be readily apparent to those skilled in the art having the benefit of
the description herein.
DETAILED DESCRIPTION
[0022] For the purpose of promoting an understanding of the principles of the
disclosure, reference will now be made to the embodiment illustrated in the figures
and specific language will be used to describe them. It will nevertheless be
understood that no limitation of the scope of the disclosure is thereby intended.
Such alterations and further modifications in the illustrated system, and
such further applications of the principles of the disclosure as would normally occur
to those skilled in the art are to be construed as being within the scope of the present
disclosure.
[0023] The terms "comprises", "comprising", or any other variations thereof, are
intended to cover a non-exclusive inclusion, such that a process or method that
comprises a list of steps does not include only those steps but may include other
steps not expressly listed or inherent to such a process or method. Similarly, one or
more devices or subsystems or elements or structures or components preceded by
"comprises... a" does not, without more constraints, preclude the existence of other
devices, subsystems, elements, structures, components, additional devices,
additional subsystems, additional elements, additional structures or additional
components. Appearances of the phrase "in an embodiment", "in another
embodiment" and similar language throughout this specification may, but do not
necessarily, all refer to the same embodiment.
[0024] Unless otherwise defined, all technical and scientific terms used herein
have the same meaning as commonly understood by those skilled in the art to which
this disclosure belongs. The system, methods, and examples provided herein are
only illustrative and not intended to be limiting.
[0025] In the following specification and the claims, reference will be made to
a number of terms, which shall be defined to have the following meanings. The
singular forms “a”, “an”, and “the” include plural references unless the context
clearly dictates otherwise.
[0026] A computer system (standalone, client or server computer system)
configured by an application may constitute a “module” that is configured and
operated to perform certain operations. In one embodiment, the “module” may be
implemented mechanically or electronically, so a module may comprise dedicated
circuitry or logic that is permanently configured (within a special-purpose
processor) to perform certain operations. In another embodiment, a “module” may
also comprise programmable logic or circuitry (as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured
by software to perform certain operations.
[0027] Accordingly, the term “module” should be understood to encompass a
tangible entity, be that an entity that is physically constructed, permanently
configured (hardwired), or temporarily configured (programmed) to operate in a
certain manner and/or to perform certain operations described herein.
[0028] The present invention provides a system and a method for analysing the
variation of frequencies and energies in a particular time window to detect if the
time window contains a discontinuity in speech. The system is configured to
automatically analyse the variance of spectrograms to detect and fix points of
discontinuity in concatenative speech synthesis.
[0029] FIG. 1 is a block diagram representation of a system (10) for text-to-speech synthesis in accordance with an embodiment of the present disclosure.
Speech synthesis is an artificial production of human speech. A computer system
used for this purpose is called a speech computer or speech synthesizer and is
implemented via software or hardware products. A text-to-speech (TTS) system
converts normal language text and symbolic linguistic representations like phonetic
transcriptions into speech. The disclosed system (10) basically uses concatenative
speech synthesis, whereby large datasets of spoken speech are collected which are
then searched through to produce a new sentence at runtime.
[0030] The system (10) comprises a concatenation module (20) operable by one
or more processors. The concatenation module (20) is configured to generate one
or more sentences at runtime by concatenating a plurality of speech units. The
concatenation module (20) produces a resulting waveform for a plurality of
concatenated speech units. In such an embodiment, the one or more generated
sentences comprise the plurality of speech units, joints between two adjacent
speech units of the plurality, and one or more static sections.
[0031] In one specific embodiment, during the joining or “concatenating” of
multiple speech snippets, if the transition between two snippets is not smooth, then
a “click” sound is heard at the joint. The click sound refers to unwanted noise
present between two waveforms for the plurality of concatenated speech units. In
one exemplary embodiment, the concatenation module (20) generates the one or
more sentences by receiving input of one or more texts and extracting a
corresponding speech unit from a data set of spoken speech for each of the received
one or more texts.
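As a rough sketch, the unit lookup and joining step may be illustrated as follows (a minimal sketch only; the function and variable names are hypothetical, and in a real system the units would be retrieved from the recorded data set of spoken speech):

```python
import numpy as np

def concatenate_units(units):
    """Join speech-unit waveforms end to end into one resulting waveform.

    units: list of 1-D NumPy arrays sampled at the same rate.  The
    boundaries between adjacent units are the joints where an audible
    "click" can appear if the transition is not smooth.
    """
    return np.concatenate(units)

# Two illustrative unit waveforms; the jump at the joint (0.1 -> -0.8)
# is the kind of discontinuity that produces a click.
a = np.array([0.0, 0.1, 0.2, 0.1])
b = np.array([-0.8, 0.3, 0.0])
wave = concatenate_units([a, b])
```

The abrupt amplitude jump at sample index 4 is exactly the kind of joint discontinuity the later modules are designed to detect and attenuate.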
[0032] Furthermore, to speed up synthesis, static sections of the audio are often
stored as binaries which are then concatenated to a dynamic string at runtime.
Such a method guarantees a smooth transition between those concatenated pieces of audio.
Additionally, the system (10) is designed to further analyse the variation of
frequencies and energies in a particular time window to detect if the time window
contains a discontinuity in speech or the click sound.
[0033] Moreover, the system (10) includes a variance analysis module (30)
operable by the one or more processors. The variance analysis module (30) is
operatively coupled to the concatenation module (20). The variance analysis
module (30) is configured to calculate a spectrogram of the resulting waveform by a
short-time Fourier transform. As used herein, the term "short-time Fourier
transform" refers to a Fourier-related transform used to determine the sinusoidal
frequency and phase content of local sections of a signal as it changes over time.
[0034] Here, the short-time Fourier transform is computed using NumPy in
Python. NumPy is a Python library for working with arrays and further has
functions for working in the domains of linear algebra, Fourier transforms, and
matrices. The variance analysis module (30) is also configured to analyse the
variance of the calculated spectrogram across the energies of all frequencies.
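A minimal sketch of such an STFT-based magnitude spectrogram using only NumPy (the window and hop sizes here are illustrative placeholders, not values taken from the disclosure):

```python
import numpy as np

def spectrogram(wave, win_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Frames the signal with a Hann window and takes the real FFT of
    each frame; rows are frequency bins, columns are time frames.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(wave) - win_len) // hop
    frames = np.stack([wave[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

spec = spectrogram(np.random.default_rng(0).standard_normal(2048))
# spec has win_len // 2 + 1 = 129 frequency bins and 15 time frames
```

At a 16 kHz sampling rate, a 256-sample window corresponds to 16 ms, within the 12.5-30 ms range described below.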
[0035] The basic premise of the whole process is that a click sound spreads its
energy across all frequencies, so that all click sounds are eventually detected.
In one embodiment, such analysis leads to a low variance in energy across
frequencies at the click. In such an embodiment, the system (10) analyses the
frequencies in a window of time described by its window length, which usually
ranges from 12.5 ms to 30 ms. The process is repeated by skipping to the next
frame by a hop length of around 12.5 ms until the entire audio speech is covered.
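Under the premise above, the per-frame variance check may be sketched as follows (assuming a magnitude spectrogram with frequency bins as rows; the function name is illustrative, and the detection threshold itself is formulated separately):

```python
import numpy as np

def frame_variance(spec):
    """Variance of energy across frequency bins, per time frame.

    spec: magnitude spectrogram (freq_bins x frames).  A broadband
    click spreads energy roughly evenly over all frequencies, so its
    frame shows an unusually low variance across the bins.
    """
    return np.var(spec, axis=0)

# Frame 0: tonal sound (energy concentrated in one bin) -> high variance.
# Frame 1: click-like (equal energy in every bin) -> zero variance.
spec = np.array([[9.0, 1.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])
var = frame_variance(spec)
```

Frames whose variance falls below the formulated threshold are candidate click locations.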
[0036] The system (10) also includes a click sound removal module (40)
operable by the one or more processors. The click sound removal module (40) is
operatively coupled to the variance analysis module (30). The click sound
removal module (40) is configured to formulate a pre-determined threshold for
which the calculated spectrogram contains a click sound. The click sound
removal module (40) is also configured to detect the click sound using the
formulated threshold.
[0037] Furthermore, the click sound removal module (40) segregates the click
sound over the spectrogram by adding a time window around the detected click
sound. The click sound removal module (40) is also configured to minimise the
amplitude of the click sound by using a fade-out filter before the click sound
and a fade-in filter after the click sound, thereby removing the unwanted click
sound. FIG. 2 explains the whole removal process in detail.
[0038] FIG. 2 is a flowchart representation of an embodiment representing the
click removal process (70) of FIG. 1 in accordance with an embodiment of the
present disclosure. The system receives the input waveform to detect and remove
any click present during synthesis in step 60. The short-time Fourier transform
(STFT) of the entered waveform is computed and rendered as a spectrogram in
steps 70 and 80. Furthermore, the standard deviation is calculated for such
waveform in step 90. Such standard deviation helps to analyse the frequencies
and energies in a particular time window and thereby detect whether the time
window contains a discontinuity in speech or the click sound.
[0039] FIG. 3a is a graphical representation of an embodiment representing the
spectrogram (130) before threshold computation in accordance with an embodiment
of the present disclosure. It is to be noted that before threshold computation and
applying it for click sound removal, the spectrogram for the original sound contains
artifacts representing the click sounds. Like any other sound, click sound is made
up of multiple frequencies. Therefore, thresholds need to be determined for each of
the contributing formant frequencies in the click sounds. FIG. 3b is a graphical
representation of an embodiment representing the spectrogram (140) after threshold
computation in accordance with an embodiment of the present disclosure.
[0040] A pre-determined threshold for which the calculated spectrogram
contains a click sound is applied to detect the click in step 100. The click
sound is segregated or localized in step 110. Once the system (10) has zeroed in
on the position of the clicks in the taken sample, fade-out filters and fade-in
filters are applied around the clicks to reduce the click sound in step 120.
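Steps 110 and 120 may be sketched as follows (the indices, fade length, and linear ramps are illustrative; silencing the click window itself is an assumption, since the disclosure specifies only the surrounding fade-out and fade-in filters):

```python
import numpy as np

def suppress_click(wave, start, end, fade_len=32):
    """Attenuate a localized click: fade out before it, fade in after it.

    start/end are sample indices bounding the detected click window.
    """
    out = wave.astype(float).copy()
    a = max(0, start - fade_len)
    b = min(len(out), end + fade_len)
    out[a:start] *= np.linspace(1.0, 0.0, start - a)  # fade out into the click
    out[start:end] = 0.0                              # silenced click window
    out[end:b] *= np.linspace(0.0, 1.0, b - end)      # fade in after the click
    return out

clean = suppress_click(np.ones(100), start=40, end=50, fade_len=10)
```

Because the ramps start and end at unity gain, the waveform outside the fade regions is left untouched.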
[0041] It is pertinent to note that the silences perceived by the human ear usually
contain small energies distributed across frequencies. This also leads to a low
variance due to which differentiating between silences and clicks becomes difficult.
However, the energy difference between clicks and silences helps to easily
differentiate between the two.
[0042] The system (10) further includes an alignment plot analysis module
operable by the one or more processors. The alignment plot analysis module is
operatively coupled to the concatenation module (20). The alignment plot analysis
module is configured to recognize the weights distribution in a matrix by weighing
the numbers with their x, y position and the square of their x, y position. In
such an embodiment, the matrix is either a rectangular matrix or a square matrix.
[0043] The alignment plot analysis module is also configured to evaluate
weights clustered along a diagonal in the matrices. In such embodiment, the
diagonal refers to weights lying on the line from position (0, 0) to point (a, b) for
the matrix. Lastly, the alignment plot analysis module predicts the consistency
of the one or more generated sentences. The resulting value lies between -1 and 1,
which may then be tuned to decide whether the output attention is good enough.
Here, the resulting value indicates the alignment measure.
[0044] In such an embodiment, for the analysis, the alignment plot analysis
module weighs the numbers of the matrix with both their x, y position and the
square of their x, y position to arrive at the final measure. It is pertinent to
note that larger numbers along or close to the diagonal push the score towards 1,
whereas a completely randomly distributed matrix will yield a score closer to 0.
The final computation powers the created matrix to produce a number between -1
and 1 which indicates whether the attention plot was good. It is pertinent to
note that when the decoder output at a particular frame is not dependent on the
corresponding encoder step, but rather on a step earlier in time (such as in the
case of pauses due to a comma or full stop), the plot inherently deviates from a
pure diagonal output to more of a "step"-like output.
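One way to realize such a position-weighted measure is a weighted Pearson correlation between row and column indices, which uses exactly the first and second moments of the x, y positions; this is an illustrative interpretation of the weighting described above, not necessarily the disclosure's exact formula:

```python
import numpy as np

def alignment_measure(att):
    """Score how diagonal an attention matrix is, in [-1, 1].

    Treats each cell's value as a weight on its (x, y) position and
    computes the weighted Pearson correlation between row and column
    indices (first moments mx, my; second moments sx, sy).
    """
    y, x = np.indices(att.shape)
    w = att / att.sum()
    mx, my = (w * x).sum(), (w * y).sum()
    cov = (w * (x - mx) * (y - my)).sum()
    sx = np.sqrt((w * (x - mx) ** 2).sum())
    sy = np.sqrt((w * (y - my) ** 2).sum())
    return cov / (sx * sy)

score = alignment_measure(np.eye(5))  # weights exactly on the diagonal -> ~1.0
```

With this formulation, weights clustered along the (0, 0)-to-(a, b) diagonal yield a score near 1, an anti-diagonal matrix yields a score near -1, and a uniformly random matrix yields a score near 0.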
[0045] FIG. 4a is a graphical representation of an embodiment representing the
alignment measure (150) of -0.9391 in accordance with an embodiment of the
present disclosure. FIG. 4b is a graphical representation of an embodiment
representing the alignment measure (160) of -0.1866 in accordance with an
embodiment of the present disclosure. Text-to-speech is a probabilistic
generative process. The encoder and the decoder are two components of the
pipeline. The diagonalization plot is taken between encoder time steps and
decoder time steps. It basically maps the weight given to an encoder time step
while decoding is happening step by step. A more diagonal-looking plot has a good
correlation with the output generated from the text-to-speech system. The present
invention proposes a method to measure this diagonalization, so that the quality
of text-to-speech synthesis can be automatically determined. A lower value means
poor-quality synthesis; a higher value means good-quality text-to-speech output.
[0046] Simultaneously, the process analyses the degree of "diagonal-ness" of a
rectangular matrix by treating it as a square matrix, after resizing the matrix
as an image and then converting it back into a matrix. A high degree of
"diagonal-ness" means that the alignment was good and the output has a high
probability of being correct. Traditional methods rely on eyeballing the
attention plots to look for a diagonal line in text-to-speech models. Such
methods are cumbersome and prone to human error. Moreover, such a "goodness"
detection method helps in monitoring during the training of deep-learning
text-to-speech synthesis models, as well as during synthesis, where attention
mechanisms have a tendency to fail 1-4% of the time.
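The image-style resize to a square may be sketched with plain NumPy as follows (a minimal nearest-neighbour sketch; the disclosure does not name an interpolation method, and a production system might use a proper image-resize routine instead):

```python
import numpy as np

def as_square(mat, size=64):
    """Treat a rectangular matrix as a square one by nearest-neighbour
    resampling, as if resizing it as an image."""
    rows = np.arange(size) * mat.shape[0] // size   # source row per output row
    cols = np.arange(size) * mat.shape[1] // size   # source col per output col
    return mat[np.ix_(rows, cols)]

# A 30 x 90 attention matrix becomes a 64 x 64 square matrix.
square = as_square(np.ones((30, 90)))
```

The squared matrix can then be fed directly to the diagonal-ness measure described above.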
[0047] FIG. 5 is a block diagram of a computer or a server (170) in accordance
with an embodiment of the present disclosure. The server (170) includes
processor(s) (200), and memory (180) coupled to the processor(s) (200).
[0048] The processor(s) (200), as used herein, means any type of computational
circuit, such as, but not limited to, a microprocessor, a microcontroller, a complex
instruction set computing microprocessor, a reduced instruction set computing
microprocessor, a very long instruction word microprocessor, an explicitly parallel
instruction computing microprocessor, a digital signal processor, or any other type
of processing circuit, or a combination thereof.
[0049] The memory (180) includes a plurality of modules stored in the form of
an executable program which instructs the processor (200) via the bus (190) to
perform the method steps illustrated in FIG. 6. The memory (180) has the
following modules: the concatenation module (20), the variance analysis module
(30) and the click sound removal module (40).
[0050] The concatenation module (20) is configured to generate one or more
sentences at runtime by concatenating the plurality of speech units and produce
a resulting waveform for the plurality of concatenated speech units. The variance
analysis module (30) is configured to calculate a spectrogram of the resulting
waveform by the short-time Fourier transform.
[0051] The variance analysis module (30) is also configured to analyse the
variance of the calculated spectrogram across the energies of all frequencies.
The click sound removal module (40) is configured to formulate a pre-determined
threshold for which the calculated spectrogram contains a click sound. The click
sound removal module (40) is also configured to detect the click sound using the
formulated threshold. The click sound removal module (40) is also configured to
segregate the click sound over the spectrogram by adding a time window around
the detected click sound. The click sound removal module (40) is also configured
to minimise the amplitude of the click sound by using a fade-out filter before
the click sound and a fade-in filter after the click sound.
[0052] Computer memory elements may include any suitable memory device(s)
for storing data and executable program, such as read only memory, random access
memory, erasable programmable read only memory, electrically erasable
programmable read only memory, hard drive, removable media drive for handling
memory cards and the like. Embodiments of the present subject matter may be
implemented in conjunction with program modules, including functions,
procedures, data structures, and application programs, for performing tasks, or
defining abstract data types or low-level hardware contexts. Executable program
stored on any of the above-mentioned storage media may be executable by the
processor(s) (200).
[0053] FIG. 6 is a flowchart representing the steps of a method (210) for
text-to-speech synthesis in accordance with an embodiment of the present
disclosure. The method (210) includes generating one or more sentences at runtime
by concatenating the plurality of speech units and producing a resulting waveform
for the plurality of concatenated speech units in step 220. In one embodiment,
the one or more sentences are generated and the resulting waveform is produced
by a concatenation module. In another embodiment, the one or more generated
sentences comprise the plurality of speech units, joints between two adjacent
speech units of the plurality, and one or more static sections.
[0054] The method (210) also includes calculating a spectrogram of the
resulting waveform by a short-time Fourier transform in step 230. In one
embodiment, the spectrogram of the resulting waveform is calculated by a
variance analysis module.
[0055] The method (210) also includes analysing the variance of the calculated
spectrogram across the energies of all frequencies in step 240. In one
embodiment, the variance is analysed by the variance analysis module.
[0056] The method (210) also includes formulating a pre-determined threshold
for which the calculated spectrogram contains a click sound in step 250. In one
embodiment, the pre-determined threshold is formulated by a click removal
module. In another embodiment, the click sound refers to unwanted noise present
between two waveforms for the plurality of concatenated speech units.
[0057] The method (210) also includes detecting the click sound using the
formulated threshold in step 260. In one embodiment, the click sound is detected
by the click removal module.
[0058] The method (210) also includes segregating the click sound over the
spectrogram by adding a time window around the detected click sound in step 270.
In one embodiment, the segregating is performed by the click removal module.
[0059] The method (210) also includes minimising the amplitude of the click
sound by using a fade-out filter before the click sound and a fade-in filter
after the click sound in step 280. In one embodiment, the minimising is
performed by the click removal module.
[0060] The method (210) further comprises recognizing the weights distribution
in a matrix by weighing the numbers with their x, y position and the square of
their x, y position. In one embodiment, the weights distribution is recognized
by an alignment plot analysis module.
[0061] The method (210) further includes evaluating the weights clustered
along a diagonal in the matrix, wherein the diagonal refers to the weights lying
on the line from position (0, 0) to point (a, b) for the matrix. In one
embodiment, the weights are evaluated by the alignment plot analysis module.
[0062] The method (210) further comprises predicting the consistency of the
one or more generated sentences. In one embodiment, the consistency is predicted
by the alignment plot analysis module.
[0063] The present disclosure provides a cost-effective system for
text-to-speech synthesis. The system removes the click sound, or unwanted sound,
produced during concatenation of speech units by using a spectrogram variance
detection technique. Thereby, no human intervention is required for removal of
the click sound. Moreover, by using the alignment plot analysis method, the
disclosed system may easily predict the "goodness" of the output from the
diagonal line plot.
[0064] While specific language has been used to describe the disclosure, any
limitations arising on account of the same are not intended. As would be apparent
to a person skilled in the art, various working modifications may be made to the
method in order to implement the inventive concept as taught herein.
[0065] The figures and the foregoing description give examples of
embodiments. Those skilled in the art will appreciate that one or more of the
described elements may well be combined into a single functional element.
Alternatively, certain elements may be split into multiple functional elements.
Elements from one embodiment may be added to another embodiment. For
example, order of processes described herein may be changed and are not limited
to the manner described herein. Moreover, the actions of any flow diagram need
not be implemented in the order shown; nor do all of the acts necessarily need
to be performed. Also, those acts that are not dependent on other acts may be
performed in parallel with the other acts. The scope of embodiments is by no
means limited by
these specific examples.
WE CLAIM:
1. A system (10) for text-to-speech synthesis, comprising:
a concatenation module (20), operable by one or more processors,
configured to generate one or more sentences at runtime by concatenating
a plurality of speech units, and produce a resulting waveform for a plurality
of concatenated speech units, wherein the one or more generated
sentences comprise the plurality of speech units, joints between two
adjacent speech units of the plurality, and one or more static sections;
a variance analysis module (30), operable by the one or more
processors, and operatively coupled to the concatenation module (20),
wherein the variance analysis module (30) is configured to:
calculate spectrogram of the resulting waveform by short time
Fourier transform; and
analyse variance of the calculated spectrogram across energies of
all frequencies; and
a click sound removal module (40), operable by the one or more
processors, and operatively coupled to the variance analysis module (30),
wherein the click sound removal module (40) is configured to:
formulate a pre-determined threshold for which the calculated
spectrogram contains a click sound, wherein the click sound refers
to unwanted noise present between two waveforms for the plurality
of concatenated speech units;
detect the click sound using the formulated threshold;
segregate the click sound over the spectrogram by adding a
time window around the click sound detected using the formulated
threshold; and
minimise the amplitude of the click sound by applying a fade-out
filter before the click sound and a fade-in filter after the click sound.
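The pipeline of claim 1 can be illustrated in code. The sketch below is hypothetical: the claims do not fix the STFT parameters, the threshold statistic (mean plus k standard deviations is an assumption), or the fade window length, and the function name `remove_clicks` is illustrative.

```python
import numpy as np

def remove_clicks(waveform, sr, nperseg=256, k=3.0, fade_ms=10.0):
    """Attenuate click sounds at unit joints, per the claimed pipeline."""
    hop = nperseg // 2
    n_frames = 1 + (len(waveform) - nperseg) // hop
    window = np.hanning(nperseg)
    # Spectrogram of the concatenated waveform via a short-time Fourier
    # transform over overlapping, windowed frames.
    frames = np.stack([waveform[i * hop:i * hop + nperseg] * window
                       for i in range(n_frames)])
    energy = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Variance of the calculated spectrogram across energies of all
    # frequencies, one value per time frame.
    var = energy.var(axis=1)
    # "Formulated threshold": mean + k standard deviations (an assumption;
    # the claims do not specify the statistic).
    threshold = var.mean() + k * var.std()
    out = waveform.astype(float).copy()
    half = int(sr * fade_ms / 1000)  # time window around each click
    for idx in np.flatnonzero(var > threshold):
        c = idx * hop + nperseg // 2  # frame centre, in samples
        lo, hi = max(c - half, 0), min(c + half, len(out))
        out[lo:c] *= np.linspace(1.0, 0.0, c - lo)  # fade out before click
        out[c:hi] *= np.linspace(0.0, 1.0, hi - c)  # fade in after click
    return out
```

A loud broadband burst at a joint raises the per-frame spectral-energy variance above the threshold, and the surrounding fade-out/fade-in ramps pull the click's amplitude toward zero.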
2. The system (10) as claimed in claim 1, wherein the concatenation module
(20) is further configured to generate the one or more sentences by receiving an
input of one or more texts, and extracting a corresponding speech unit from a data
set of spoken speech for each of the received one or more texts.
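Claim 2's extract-and-concatenate step can be sketched as follows. The token scheme, the `unit_bank` dictionary, and the fixed-length silent gap standing in for a "static section" are all illustrative assumptions, not the patented data set.

```python
import numpy as np

def synthesize(tokens, unit_bank, silence=200):
    """Join recorded speech units, one per input token, into one waveform."""
    gap = np.zeros(silence)  # static section inserted between units
    pieces = []
    for tok in tokens:
        pieces.append(unit_bank[tok])  # extract the spoken unit for the text
        pieces.append(gap)
    # Drop the trailing gap; return an empty waveform for empty input.
    return np.concatenate(pieces[:-1]) if pieces else np.zeros(0)
```

The joints between adjacent units in the returned waveform are exactly where the click-removal stage of claim 1 would later be applied.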
3. The system (10) as claimed in claim 1, further comprising an alignment plot
analysis module operable by the one or more processors, and operatively coupled
to the concatenation module (20), wherein the alignment plot analysis module is
configured to:
recognize a weight distribution in a matrix by weighting the values
with their (x, y) position and the square of the (x, y) position;
evaluate weights clustered along a diagonal of the matrix, wherein
the diagonal refers to weights lying on the line from position (0, 0) to point
(a, b) of the matrix; and
predict consistency of the one or more generated sentences.
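One plausible reading of claim 3 is a moment-based test of how tightly alignment weights cluster around the (0, 0)-to-(a, b) diagonal. The statistic below (weighted mean squared distance to the diagonal) and the `cutoff` value are assumptions; the claims do not fix the exact formula.

```python
import numpy as np

def diagonal_clustering(weights):
    """Weighted mean squared distance of a weight matrix to its diagonal."""
    a = weights.shape[0] - 1
    b = weights.shape[1] - 1
    i, j = np.indices(weights.shape)  # row (i) and column (j) positions
    # Perpendicular distance of each cell to the line from (0, 0) to (a, b).
    dist = np.abs(b * i - a * j) / np.hypot(a, b)
    # Positions and squared positions of the weights enter through this
    # second moment about the diagonal; small values mean tight clustering.
    return float((weights * dist ** 2).sum() / weights.sum())

def is_consistent(weights, cutoff=1.0):
    # A sentence is predicted consistent when its alignment weights hug
    # the diagonal; the cutoff is an illustrative assumption.
    return diagonal_clustering(weights) < cutoff
```

An identity-like alignment (each output step attending to the matching input step) scores zero, while weights scattered off the diagonal score high and would flag the generated sentence as inconsistent.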
4. The system (10) as claimed in claim 3, wherein the matrix comprises one of
a rectangular matrix or a square matrix.
5. A method (210) for converting text into speech, comprising:
generating, by a concatenation module, one or more sentences at
runtime by concatenating a plurality of speech units, and producing a
resulting waveform for a plurality of concatenated speech units (220);
calculating, by a variance analysis module, a spectrogram of the
resulting waveform by short-time Fourier transform (230);
analysing, by the variance analysis module, variance of the calculated
spectrogram across energies of all frequencies (240);
formulating, by a click sound removal module, a pre-determined
threshold for which the calculated spectrogram contains a click sound
(250);
detecting, by the click sound removal module, the click sound using
the formulated threshold (260);
segregating, by the click sound removal module, the click sound over
the spectrogram by adding a time window around the click sound detected
using the formulated threshold (270); and
minimising, by the click sound removal module, the amplitude of the
click sound by applying a fade-out filter before the click sound and a
fade-in filter after the click sound (280).
6. The method (210) as claimed in claim 5, wherein the one or more sentences
generated by the concatenation module comprise the plurality of speech units,
joints between two adjacent speech units of the plurality of speech units, and one
or more static sections.
7. The method (210) as claimed in claim 5, wherein, in formulating the
pre-determined threshold by the click sound removal module, the click sound refers
to unwanted noise present between two waveforms for the plurality of concatenated
speech units.
8. The method (210) as claimed in claim 5, further comprising recognizing, by
an alignment plot analysis module, a weight distribution in a matrix by weighting
the values with their (x, y) position and the square of the (x, y) position.
9. The method (210) as claimed in claim 8, further comprising evaluating, by
the alignment plot analysis module, weights clustered along a diagonal of the
matrix, wherein the diagonal refers to weights lying on the line from position (0,
0) to point (a, b) of the matrix.
10. The method (210) as claimed in claim 8, further comprising predicting, by
the alignment plot analysis module, consistency of the one or more generated
sentences.
| # | Name | Date |
|---|---|---|
| 1 | 202111001387-STATEMENT OF UNDERTAKING (FORM 3) [12-01-2021(online)].pdf | 2021-01-12 |
| 2 | 202111001387-FORM-15 [04-08-2023(online)].pdf | 2023-08-04 |
| 3 | 202111001387-PROOF OF RIGHT [12-01-2021(online)].pdf | 2021-01-12 |
| 4 | 202111001387-RELEVANT DOCUMENTS [04-08-2023(online)]-1.pdf | 2023-08-04 |
| 5 | 202111001387-RELEVANT DOCUMENTS [04-08-2023(online)].pdf | 2023-08-04 |
| 6 | 202111001387-RELEVANT DOCUMENTS [29-09-2022(online)].pdf | 2022-09-29 |
| 7 | 202111001387-IntimationOfGrant14-12-2021.pdf | 2021-12-14 |
| 8 | 202111001387-PatentCertificate14-12-2021.pdf | 2021-12-14 |
| 9 | 202111001387-Annexure [23-11-2021(online)].pdf | 2021-11-23 |
| 10 | 202111001387-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [12-01-2021(online)].pdf | 2021-01-12 |
| 11 | 202111001387-EVIDENCE FOR REGISTRATION UNDER SSI [12-01-2021(online)].pdf | 2021-01-12 |
| 12 | 202111001387-Correspondence to notify the Controller [08-11-2021(online)].pdf | 2021-11-08 |
| 13 | 202111001387-DECLARATION OF INVENTORSHIP (FORM 5) [12-01-2021(online)].pdf | 2021-01-12 |
| 14 | 202111001387-COMPLETE SPECIFICATION [12-01-2021(online)].pdf | 2021-01-12 |
| 15 | 202111001387-ABSTRACT [26-07-2021(online)].pdf | 2021-07-26 |
| 16 | 202111001387-CLAIMS [26-07-2021(online)].pdf | 2021-07-26 |
| 17 | 202111001387-FER_SER_REPLY [26-07-2021(online)].pdf | 2021-07-26 |
| 18 | 202111001387-OTHERS [26-07-2021(online)].pdf | 2021-07-26 |
| 19 | 202111001387-FORM-26 [20-01-2021(online)].pdf | 2021-01-20 |
| 20 | 202111001387-US(14)-HearingNotice-(HearingDate-16-11-2021).pdf | 2021-10-19 |
| 21 | 202111001387-FER.pdf | 2021-10-19 |
| 22 | 202111001387-DRAWINGS [12-01-2021(online)].pdf | 2021-01-12 |
| 23 | 202111001387-FORM-26 [15-11-2021(online)].pdf | 2021-11-15 |
| 24 | 202111001387-Written submissions and relevant documents [23-11-2021(online)].pdf | 2021-11-23 |
| 25 | 202111001387-FORM 1 [12-01-2021(online)].pdf | 2021-01-12 |
| 26 | 202111001387-FORM 18A [12-01-2021(online)].pdf | 2021-01-12 |
| 27 | 202111001387-FORM FOR SMALL ENTITY(FORM-28) [12-01-2021(online)].pdf | 2021-01-12 |
| 28 | 202111001387-FORM FOR STARTUP [12-01-2021(online)].pdf | 2021-01-12 |
| 29 | 202111001387-FORM-9 [12-01-2021(online)].pdf | 2021-01-12 |
| 30 | 202111001387-FORM28 [12-01-2021(online)].pdf | 2021-01-12 |
| 31 | 202111001387-POWER OF AUTHORITY [04-08-2023(online)].pdf | 2023-08-04 |
| 32 | 202111001387-STARTUP [12-01-2021(online)].pdf | 2021-01-12 |
| 33 | IN 384249-F-15-Decision ur 84(2).pdf | 2024-02-16 |