
Systems And Methods For Assessing And Evaluating Speech Quality

Abstract: Communication skills are vital and need to be evaluated. Evaluating them is a challenging task when performed by machines, since identifying the critical parameters affecting a speaker's fluency is not an objective exercise and qualitative assessment of these parameters needs comprehension of the content. Embodiments of the present disclosure provide machine learning system(s) with the intelligence to understand and assess the speaking skills of a person, wherein features are extracted from the speech signal to estimate speech parameters affecting speech quality and to compute a predictive score. The present disclosure further dynamically selects and provides a set of questions based on the context of the current speech. The set of questions is dynamically selected by querying relevant historical data and determining the context of unresolved queries.


Patent Information

Application #
Filing Date
16 July 2018
Publication Number
03/2020
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
ip@legasis.in
Parent Application
Patent Number
Legal Status
Grant Date
2024-03-01
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai - 400021, Maharashtra, India

Inventors

1. TOMMY, Robin
Tata Consultancy Services Limited, Bodhi Park (CLC) Technopark Campus, Kariyavattom P. O., Thiruvananthapuram - 695581, Kerala, India
2. S, Sarath
Tata Consultancy Services Limited, Bodhi Park (CLC) Technopark Campus, Kariyavattom P. O., Thiruvananthapuram - 695581, Kerala, India
3. JAMES, Nithin
Tata Consultancy Services Limited, Bodhi Park (CLC) Technopark Campus, Kariyavattom P. O., Thiruvananthapuram - 695581, Kerala, India
4. DAS, Arun
Tata Consultancy Services Limited, Bodhi Park (CLC) Technopark Campus, Kariyavattom P. O., Thiruvananthapuram - 695581, Kerala, India

Specification

Claims:

1. A processor implemented method, comprising:
receiving, via one or more hardware processors, a speech signal comprising conversation between one or more participants (202);
pre-processing, via the one or more hardware processors, the received speech signal to obtain a preprocessed speech signal (204);
identifying, using a Gaussian mixture model (GMM) based training model, one or more speakers in the preprocessed speech signal and extracting information pertaining to time taken by each of the one or more speakers, content presented by the one or more speakers, wherein the one or more speakers are identified and the information is extracted by using magnitude and phase based features techniques obtained by applying the GMM based training model on the preprocessed speech signal (206);
identifying one or more peaks in the preprocessed signal and computing energy levels in the identified one or more peaks (208);
estimating, using the one or more peaks and the energy levels, one or more speech parameters comprising number of pauses, speech rate, articulation rate, phonation rate, number of syllables, average syllable duration, number of repetitive fillers, mother tongue influence, one or more overlaps between conversations associated with each of the one or more speakers, or combinations thereof (210);
estimating clarity and coherence in the preprocessed signal using the one or more speech parameters (212); and
determining, using the clarity and coherence, fluency of each of the one or more speakers (214).

2. The processor implemented method of claim 1, wherein the clarity and coherence are estimated using one or more Deep Neural Networks (DNN).
3. The processor implemented method of claim 1, further comprising computing, based on the one or more estimated speech parameters, a predictive score for the estimated clarity and coherence.

4. The processor implemented method of claim 3, further comprising, determining, based on the predictive score, consistency of context in the preprocessed signal by the one or more speakers, wherein the consistency of context indicates speaking skills and characteristics of each of the one or more speakers.

5. The processor implemented method of claim 4, further comprising dynamically selecting and providing a set of questions to each of the one or more speakers based on the speaking skills and characteristics.

6. A system (100) comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive, a speech signal comprising conversation between one or more participants;
pre-process the received speech signal to obtain a preprocessed speech signal;
identify, using a Gaussian mixture model (GMM) based training model, one or more speakers in the preprocessed speech signal and extract information pertaining to time taken by each of the one or more speakers, content presented by the one or more speakers, wherein the one or more speakers are identified and the information is extracted by using magnitude and phase based features techniques obtained by applying the GMM based training model on the preprocessed speech signal;
identify one or more peaks in the preprocessed signal and compute energy levels in the identified one or more peaks;
estimate, using the one or more peaks and the energy levels, one or more speech parameters comprising number of pauses, speech rate, articulation rate, phonation rate, number of syllables, average syllable duration, number of repetitive fillers, mother tongue influence, one or more overlaps between conversations associated with each of the one or more speakers, or combinations thereof;
estimate clarity and coherence in the preprocessed signal using the one or more speech parameters; and
determine, using the clarity and coherence, fluency of each of the one or more speakers.

7. The system of claim 6, wherein the clarity and coherence are estimated using one or more Deep Neural Networks (DNN).

8. The system of claim 6, wherein the one or more hardware processors are further configured by the instructions to compute, based on the one or more estimated speech parameters, a predictive score for the estimated clarity and coherence.

9. The system of claim 8, wherein the one or more hardware processors are further configured by the instructions to determine, based on the predictive score, consistency of context in the preprocessed signal by the one or more speakers, wherein the consistency of context indicates speaking skills and characteristics of each of the one or more speakers.

10. The system of claim 9, wherein the one or more hardware processors are further configured by the instructions to dynamically select and provide, a set of questions from a questions pool, to each of the one or more speakers based on the speaking skills and characteristics.
Description:

FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:

SYSTEMS AND METHODS FOR ASSESSING AND EVALUATING SPEECH QUALITY

Applicant

Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
The disclosure herein generally relates to speech signal processing systems, and, more particularly, to systems and methods for assessing and evaluating speech quality in speech signals.

BACKGROUND
Communication skills are vital in every sphere of life. Hence, evaluating and improving communication skills is mandatory in any learning program. Doing so by machine involves challenging tasks such as identifying the critical parameters affecting the fluency of speakers, since these are not objective measures and their qualitative assessment needs comprehension of the content. Conventional systems and methods strictly consider this a task to be accomplished only with human effort. The evaluation needs an expert resource at hand and, as with any such task, the evaluation parameters are prone to change from person to person. Even where there is machine intervention, the results may be prone to errors.

SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a processor implemented method for assessing and evaluating speech quality in speech signals, comprising: receiving, via one or more hardware processors, a speech signal comprising conversation between one or more participants; pre-processing, via the one or more hardware processors, the received speech signal to obtain a preprocessed speech signal; identifying, using a Gaussian mixture model (GMM) based training model, one or more speakers in the preprocessed speech signal and extracting information pertaining to time taken by each of the one or more speakers, content presented by the one or more speakers, wherein the one or more speakers are identified and the information is extracted by using magnitude and phase based features techniques obtained by applying the GMM based training model on the preprocessed speech signal; identifying one or more peaks in the preprocessed signal and computing energy levels in the identified one or more peaks; estimating, using the one or more peaks and the energy levels, one or more speech parameters comprising number of pauses, speech rate, articulation rate, phonation rate, number of syllables, average syllable duration, number of repetitive fillers, mother tongue influence, one or more overlaps between conversations associated with each of the one or more speakers, or combinations thereof; estimating clarity and coherence in the preprocessed signal using the one or more speech parameters; and determining, using the clarity and coherence, fluency of each of the one or more speakers.
In an embodiment, the clarity and coherence are estimated using one or more Deep Neural Networks (DNN). In an embodiment, the method may further comprise computing, based on the one or more estimated speech parameters, a predictive score for the estimated clarity and coherence.
In an embodiment, the method may further comprise determining, based on the predictive score, consistency of context in the preprocessed signal by the one or more speakers, wherein the consistency of context indicates speaking skills and characteristics of each of the one or more speakers. In an embodiment, the method may further comprise dynamically selecting and providing a set of questions to each of the one or more speakers based on the speaking skills and characteristics.
In another aspect, there is provided a system for assessing and evaluating speech quality in speech signals, comprising: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive, a speech signal comprising conversation between one or more participants; pre-process the received speech signal to obtain a preprocessed speech signal; identify, using a Gaussian mixture model (GMM) based training model, one or more speakers in the preprocessed speech signal and extracting information pertaining to time taken by each of the one or more speakers, content presented by the one or more speakers, wherein the one or more speakers are identified and the information is extracted by using magnitude and phase based features techniques obtained by applying the GMM based training model on the preprocessed speech signal; identify one or more peaks in the preprocessed signal and computing energy levels in the identified one or more peaks; estimate, using the one or more peaks and the energy levels, one or more speech parameters comprising number of pauses, speech rate, articulation rate, phonation rate, number of syllables, average syllable duration, number of repetitive fillers, mother tongue influence, one or more overlaps between conversations associated with each of the one or more speakers, or combinations thereof; estimate clarity and coherence in the preprocessed signal using the one or more speech parameters; and determining, using the clarity and coherence, fluency of each of the one or more speakers. In an embodiment, the clarity and coherence are estimated using one or more Deep Neural Networks (DNN).
In an embodiment, the one or more hardware processors are further configured by the instructions to compute, based on the one or more estimated speech parameters, a predictive score for the estimated clarity and coherence. In an embodiment, the one or more hardware processors are further configured by the instructions to determine, based on the predictive score, consistency of context in the preprocessed signal by the one or more speakers, wherein the consistency of context indicates speaking skills and characteristics of each of the one or more speakers.
In an embodiment, the one or more hardware processors are further configured by the instructions to dynamically select and provide, a set of questions from a questions pool, to each of the one or more speakers based on the speaking skills and characteristics.
In yet another aspect, there is provided one or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors causes assessing and evaluating speech quality in speech signals, comprising: receiving, via the one or more hardware processors, a speech signal comprising conversation between one or more participants; pre-processing, via the one or more hardware processors, the received speech signal to obtain a preprocessed speech signal; identifying, using a Gaussian mixture model (GMM) based training model, one or more speakers in the preprocessed speech signal and extracting information pertaining to time taken by each of the one or more speakers, content presented by the one or more speakers, wherein the one or more speakers are identified and the information is extracted by using magnitude and phase based features techniques obtained by applying the GMM based training model on the preprocessed speech signal; identifying one or more peaks in the preprocessed signal and computing energy levels in the identified one or more peaks; estimating, using the one or more peaks and the energy levels, one or more speech parameters comprising number of pauses, speech rate, articulation rate, phonation rate, number of syllables, average syllable duration, number of repetitive fillers, mother tongue influence, one or more overlaps between conversations associated with each of the one or more speakers, or combinations thereof; estimating clarity and coherence in the preprocessed signal using the one or more speech parameters; and determining, using the clarity and coherence, fluency of each of the one or more speakers.
In an embodiment, the clarity and coherence are estimated using one or more Deep Neural Networks (DNN). In an embodiment, the instructions may further cause computing, based on the one or more estimated speech parameters, a predictive score for the estimated clarity and coherence.
In an embodiment, the instructions may further cause determining, based on the predictive score, consistency of context in the preprocessed signal by the one or more speakers, wherein the consistency of context indicates speaking skills and characteristics of each of the one or more speakers. In an embodiment, the instructions may further cause dynamically selecting and providing a set of questions to each of the one or more speakers based on the speaking skills and characteristics.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary block diagram of a system for evaluating and assessing speech signals in accordance with an embodiment of the present disclosure.
FIG. 2 illustrates an exemplary flow diagram illustrating a method for evaluating and assessing speech signals using the system of FIG. 1 according to an embodiment of the present disclosure.
FIG. 3A depicts a graphical representation of a speech signal pertaining to one or more users in accordance with an embodiment of the present disclosure.
FIG. 3B depicts a spectrogram of the audio signal specific to a speaker with noise and silences in accordance with an embodiment of the present disclosure.
FIG. 4A depicts a graphical representation of a pre-processed signal in accordance with an embodiment of the present disclosure.
FIG. 4B depicts a spectrogram of the pre-processed signal in accordance with an embodiment of the present disclosure.
FIG. 5 depicts a Mel Frequency Cepstral Coefficients (MFCC) technique implemented by the system of FIG. 1 in accordance with an embodiment of the present disclosure.
FIG. 6A depicts a graphical presentation of energy level/peaks computed in the pre-processed signal in accordance with an embodiment of the present disclosure.
FIG. 6B depicts a graphical presentation of filter(s) in accordance with an embodiment of the present disclosure.
FIG. 7 depicts a speech signal with number of pauses in accordance with an example embodiment of the present disclosure.
FIG. 8A depicts a graphical representation of MFCC data for given speech in accordance with an embodiment of the present disclosure.
FIG. 8B depicts a graphical presentation of speech given having corresponding MFCC Features being mapped in accordance with an embodiment of the present disclosure.
FIG. 9 depicts a block diagram illustrating a Deep Neural Network (DNN) implementation by the system of FIG. 1 in accordance with an embodiment of the present disclosure.
FIG. 10 depicts an exemplary DNN implemented by the system of FIG. 1 for computing predictive score based on one or more estimated speech parameters in accordance with an embodiment of the present disclosure.
FIG. 11 depicts a block diagram illustrating steps involved in context identification within speech in accordance with an embodiment of the present disclosure.
FIG. 12 depicts a block diagram illustrating steps involved in dynamical question selection for each speaker based on prediction score (or predictive score) in accordance with an embodiment of the present disclosure.
FIG. 13 depicts an exemplary block diagram illustrating steps implemented by the system of FIG. 1 in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
Referring now to the drawings, and more particularly to FIGS. 1 through 13, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary block diagram of a system 100 for evaluating and assessing speech signals in accordance with an embodiment of the present disclosure. In an embodiment, the system 100 may also be referred as Speech Signal Evaluation and Assessment System (SSEAS), and interchangeably used hereinafter. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The memory 102 comprises a database 108. The one or more processors 104 that are hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
The database 108 may store information pertaining to inputs (e.g., speech signals) obtained pertaining to one or more users. Further, the database 108 stores information pertaining to inputs fed to the system 100 and/or outputs generated by the system (e.g., at each stage), specific to the methodology described herein. More specifically, the database 108 stores information being processed at each step of the proposed methodology.
FIG. 2, with reference to FIG. 1, illustrates an exemplary flow diagram illustrating a method for evaluating and assessing speech signals using the system 100 of FIG. 1 according to an embodiment of the present disclosure. In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to the components of the system 100 as depicted in FIG. 1, and the flow diagram of FIG. 2. In an embodiment of the present disclosure, at step 202, the one or more hardware processors 104 receive a speech signal comprising conversation between one or more participants. FIG. 3A, with reference to FIGS. 1-2, depicts a graphical representation of a speech signal pertaining to one or more users in accordance with an embodiment of the present disclosure. More specifically, FIG. 3A depicts an audio speech signal collected from the user(s) (or speaker(s)) in the time domain with noise and with silences. FIG. 3B, with reference to FIGS. 1-3A, depicts a spectrogram of the audio signal specific to a speaker with noise and silences in accordance with an embodiment of the present disclosure.
In an embodiment of the present disclosure, at step 204, the one or more hardware processors 104 pre-process the received speech signal to obtain a pre-processed speech signal. The pre-processed signal is now free from distortion or noise. The audio (or speech) signals are recorded from the one or more users on which the evaluation is to be done. It may be noted that there could be instances where there are no overlaps in the audio samples. The conversation between speakers is taken as an independent continuous voice sample for an individual speaker and further processed for training the system 100. In an embodiment, a user who uses the system 100 for evaluating his/her speech quality and on whom the speech parameters are analyzed may be referred to as a speaker. An audio sample of approximately 120 seconds duration was taken for analysis. This duration is taken under the assumption that the speaker covers all the phonemes in the linguistic content, which yields a better analyzed result. The silence in the audio signal yields unnecessary information (content) and could affect the continuity of the signal. Thus the audio signal is pre-processed to extract only the speech data from the audio signal for better analysis. FIG. 4A, with reference to FIGS. 1-3B, depicts a graphical representation of a pre-processed signal in accordance with an embodiment of the present disclosure. More specifically, FIG. 4A depicts an audio signal that is pre-processed wherein unnecessary silence has been removed for further analysis. FIG. 4B, with reference to FIGS. 1-4A, depicts a spectrogram of the pre-processed signal in accordance with an embodiment of the present disclosure. More specifically, FIG. 4B depicts a spectrogram of an audio signal that is pre-processed wherein unnecessary silence has been removed for further analysis. FIG. 4A depicts the pre-processed signal in the time domain whereas FIG. 4B depicts the pre-processed signal in the frequency domain. It is to be understood and noted that with the spectrogram output (e.g., refer to FIGS. 4A-4B), the system 100 is enabled to identify how many speakers are actually speaking, which is only possible when the speech signal output is represented or depicted in the frequency domain.
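By way of a non-limiting illustration, the pre-processing described above may be sketched as follows, assuming the librosa library, an energy-threshold based silence removal, and placeholder file names and thresholds that are not part of the disclosure:

import numpy as np
import librosa

# Load roughly 120 seconds of speech at a 16 kHz sampling rate (assumed values).
y, sr = librosa.load("speaker_sample.wav", sr=16000, duration=120.0)

# Detect non-silent intervals; frames quieter than top_db below the peak are treated as silence.
intervals = librosa.effects.split(y, top_db=30)

# Concatenate only the voiced segments to obtain the pre-processed signal.
y_clean = np.concatenate([y[start:end] for start, end in intervals])

total_duration = len(y) / sr          # T, in seconds
phonation_time = len(y_clean) / sr    # PT = T - S, silence removed
silence_duration = total_duration - phonation_time
num_pauses = max(len(intervals) - 1, 0)  # gaps between voiced intervals

print(phonation_time, silence_duration, num_pauses)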
The system 100 is further configured to generate a GMM based training model for a particular speaker. In an embodiment of the present disclosure, at step 206, the one or more hardware processors 104 identify, using a GMM based training model, one or more speakers in the preprocessed speech signal and extract information pertaining to the time taken by each of the one or more speakers and the content presented by the one or more speakers. In the present disclosure, the one or more speakers are identified and the information is extracted by using magnitude and phase based features techniques. In the above step, the information is extracted by applying the GMM based training model on the preprocessed speech signal, in one example embodiment. The GMM based training model that is generated may be stored in the database 108 and executed by the system 100 accordingly to generate desired output(s) (e.g., for instance, the one or more speakers identified in the preprocessed speech signal and extracted information such as the time taken by each of the one or more speakers, the content presented by the one or more speakers, etc.). In an embodiment of the present disclosure, the magnitude and phase based features technique(s) comprise, but are not limited to, the Mel Frequency Cepstral Coefficients (MFCC) technique, the Gammatone Frequency Cepstral Coefficients (GFCC) technique, the Cochlear Frequency Cepstral Coefficients (CFCC) technique, the group delay and product spectrum MFCC technique, or combinations thereof. In an embodiment, the magnitude and phase based features technique(s) may be stored in the database 108 and executed by the system 100 accordingly to generate desired output(s) (e.g., for instance, information extraction as mentioned above).
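As a non-limiting sketch of the GMM based identification described above, the following example trains one Gaussian mixture model per enrolled speaker on MFCC features and scores an unknown segment against each model; the component count, file names and helper function are assumptions made only for illustration:

import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Return frame-wise MFCC vectors (frames x coefficients) for an audio file."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Train one GMM per enrolled speaker (training files are hypothetical placeholders).
speaker_models = {}
for name, path in {"speaker_1": "speaker1_train.wav", "speaker_2": "speaker2_train.wav"}.items():
    gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
    gmm.fit(mfcc_features(path))
    speaker_models[name] = gmm

# Identify the most likely speaker of an unknown segment by average log-likelihood.
segment = mfcc_features("unknown_segment.wav")
scores = {name: gmm.score(segment) for name, gmm in speaker_models.items()}
identified = max(scores, key=scores.get)
print(identified, scores)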
Signals in the time domain are used to determine the characteristic of amplitude. The frequency domain helps in determining the rate at which the signal is changing. But for speech signals, this depends on various aspects and features. So it is necessary to represent the signals as simple vectors (magnitude and phase). This is done only to extract more information from the speech signals. As the speech signal is traversed from the time domain to the magnitude/phase domain, a translation of the signal data is seen. In order to get speech parameters using magnitude and phase based techniques, the system 100 represents the signal in that domain (magnitude/phase vectors) to avoid a mismatch in domain, which would otherwise result in loss of accurate information about the signal. In the present disclosure, the type of magnitude and phase based technique implemented by the system 100 is the Mel Frequency Cepstral Coefficients (MFCC) technique. It is to be understood by a person having ordinary skill in the art that this implementation example shall not be construed as limiting the scope of the embodiments and the present disclosure.
FIG. 5, with reference to FIGS. 1 through 4B, depicts a Mel Frequency Cepstral Coefficients (MFCC) technique implemented by the system 100 of FIG. 1 in accordance with an embodiment of the present disclosure. It is to be understood by a person skilled in the art that speech consists of sounds generated by a human and filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out. If the shape is determined accurately, this should give an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short time power spectrum, and MFCCs are used to understand and represent this envelope (e.g., feature(s) affecting an individual's voice). The continuous speech signal(s) is/are broken into frames and transformed using the Fourier Transform to study the characteristics of the signal. Using a Mel filterbank, the MFCCs are generated. They are used to represent the disturbance caused by the shape of the vocal tract which affects the individual's speech. This technique mimics the human ear, i.e., how a voice is perceived in the ear. Using the DCT (Discrete Cosine Transform) it is converted from the frequency to the time domain for further analysis. The advantages of the Mel filterbank implemented by the system 100 are as follows:
It applies the Mel-frequency scaling, which is a perceptual scale that helps to simulate the way the human ear works. This corresponds to better resolution at low frequencies and less at high frequencies.
Using a triangular filter-bank (Mel filterbank type), the system 100 captures the energy at each critical band and obtains a rough approximation of the spectrum shape, as well as smooths the harmonic structure. In theory this could be computed on raw DFT bins, but then the feature dimensionality would not be reduced. Therefore the purpose of the system 100 in implementing such a triangular filter-bank (Mel filterbank type) and performing filter-bank analysis is to capture the spectral envelope.
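A brief, non-limiting sketch of the filter-bank analysis outlined above is given below, assuming librosa; the frame size and number of Mel bands are illustrative choices rather than values prescribed by the disclosure:

import numpy as np
import librosa

y, sr = librosa.load("speaker_sample.wav", sr=16000)

# Short-time Fourier transform of the framed signal.
n_fft, hop = 512, 160
power_spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2

# Triangular Mel filter bank: better resolution at low frequencies, coarser at high.
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=26)

# Energy captured in each critical band per frame (a rough spectral-envelope description).
band_energies = mel_fb @ power_spec

# Log compression followed by DCT yields the MFCCs used elsewhere in the pipeline.
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(band_energies), n_mfcc=13)
print(mfcc.shape)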
In an embodiment of the present disclosure, at step 208, the one or more hardware processors 104 identify one or more peaks in the pre-processed signal and compute energy levels in the identified one or more peaks. FIG. 6A, with reference to FIGS. 1 through 5, depicts a graphical presentation of energy levels/peaks computed in the pre-processed signal in accordance with an embodiment of the present disclosure. FIG. 6B, with reference to FIGS. 1 through 6A, depicts a graphical presentation of filter(s) in accordance with an embodiment of the present disclosure. More specifically, FIG. 6B depicts a Mel filterbank, in one example embodiment.
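A minimal, non-limiting sketch of the peak and energy-level identification of step 208 is shown below, assuming frame-wise short-time energy and the peak finder from scipy; the frame length and prominence threshold are assumptions:

import numpy as np
from scipy.signal import find_peaks
import librosa

y, sr = librosa.load("speaker_sample_clean.wav", sr=16000)

# Frame-wise short-time energy of the pre-processed signal.
frame_len, hop = 400, 160                      # 25 ms frames, 10 ms hop at 16 kHz
frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
energy = np.sum(frames ** 2, axis=0)

# Peaks in the energy contour roughly correspond to syllable nuclei.
peaks, _ = find_peaks(energy, prominence=0.1 * energy.max())
peak_energies = energy[peaks]

print(len(peaks), peak_energies[:5])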
In an embodiment of the present disclosure, at step 210, the one or more hardware processors 104 estimate, using the one or more peaks and the energy levels of FIG. 6A-6B, one or more speech parameters comprising number of pauses, speech rate, articulation rate, phonation rate, number of syllables, average syllable duration, number of repetitive fillers, native language influence, one or more overlaps between conversations associated with each of the one or more speakers, or combinations thereof.
FIG. 7, with reference to FIGS. 1 through 6B, depicts a speech signal with a number of pauses in accordance with an example embodiment of the present disclosure. In an embodiment, number of pauses refers to discontinuity in the flow of speech, and (generally) reflects the delay in an individual's sentence formation capability, which also affects the conversation flow. In the present disclosure, speech rate refers to the rate at which words are spoken by the user and reflects the overall time used by the speaker to deliver the message in a given time. The number of pauses can be taken as a direct map to a person's fluency. A fluent person makes only the required number of pauses, each of limited duration. An aberration from this can be considered as a penalty point.
In the present disclosure, articulation rate refers to how quickly sound segments are produced by the speaker. In the present disclosure, phonation rate refers to the rate at which the speaker pronounces the phonemes. In the present disclosure, a syllable is a building block of speech and a unit of pronunciation having one vowel sound, with or without surrounding consonants, forming the whole or a part of a word. The syllable rate and phonation rate help the system 100 to analyze the word formation ability of individual(s). In the present disclosure, average syllable duration refers to the average time taken for forming those syllables. In the present disclosure, number of repetitive fillers refers to the number of repetitive words used by a speaker in the entire conversation. In the present disclosure, native language influence (or mother tongue influence) refers to a common factor affecting a person's speech. This can be detected using GMMs (Gaussian mixture models) by calculating the MFCCs (Mel Frequency Cepstral Coefficients).
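By way of illustration only, the syllable and repetitive-filler counts may be approximated from the transcript; the vowel-group heuristic and the filler word list below are assumptions made for the sketch and are not techniques mandated by the disclosure:

import re
from collections import Counter

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (heuristic only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def filler_counts(transcript, fillers=("um", "uh", "like", "basically", "actually")):
    """Count occurrences of an assumed list of filler words in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w in fillers)
    return words, counts

text = "Good morning, my name is John Doe. I am a qualified accountant ..."
words, fillers = filler_counts(text)
total_syllables = sum(count_syllables(w) for w in words)
print(len(words), total_syllables, fillers)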
Referring back to step 208, the audio signal is further converted into the frequency domain by FFT for analysis, and the signal is enveloped with the Mel filterbank to collect the MFCC data, which can be used to calculate the native language influence (e.g., mother tongue influence) on the speaker's speech. During this stage, the system 100 also identifies who the actual speakers are.
FIG. 8A, with reference to FIGS. 1 through 7, depicts a graphical representation of MFCC data for a given speech in accordance with an embodiment of the present disclosure. More specifically, FIG. 8A depicts the Mel Frequency Cepstral Coefficients that are utilized by the system 100 to identify who the actual speakers are. FIG. 8B, with reference to FIGS. 1 through 8A, depicts a graphical presentation of the given speech with its corresponding MFCC features mapped, in accordance with an embodiment of the present disclosure.
Referring back to the steps of FIG. 2, more particularly, at step 212, in an embodiment of the present disclosure, the one or more hardware processors 104 estimate clarity and coherence in the preprocessed signal using the one or more speech parameters (as computed above). Clarity and coherence are the main aspects of the speech, and these describe and analyze the degree of similarity between the speech and the topic. Clarity and coherence enable the system 100 (or end user) to understand the ability of the user to make understandable statements related to the topic. Clarity and coherence also deal with the coherent sentences constructed throughout the speech. It is to be understood that calculation of a relative measure of clarity/coherence in the speech data is dependent on word formation rather than articulation. Thus the audio data (or speech signal) is converted into text using speech-to-text conversion (e.g., speech to text processing technique(s) and/or text processing technique(s) known in the art). This processed text is fed as input to Recurrent Neural Networks (RNN) where the coherence/clarity is calculated, wherein the dependency of the current word on the previous word and the next word formed is utilized, which provides (or gives) a quantitative measure. FIG. 9, with reference to FIGS. 1 through 8B, depicts a block diagram illustrating a Deep Neural Network (DNN) implementation by the system 100 of FIG. 1 in accordance with an embodiment of the present disclosure. More specifically, FIG. 9 depicts a Recurrent Neural Network implementation by the system 100 of FIG. 1 for evaluating and assessing speech signals in accordance with an embodiment of the present disclosure. Upon clarity and coherence estimation by the system 100 in the RNN, the system 100 is further configured to determine, using the clarity and coherence, the fluency of each of the one or more speakers at step 214.
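A minimal, non-limiting sketch of such an RNN-based clarity/coherence scorer is given below in PyTorch; the vocabulary size, embedding size and single-score output head are assumptions about one possible realization and not the trained network of the disclosure:

import torch
import torch.nn as nn

class CoherenceScorer(nn.Module):
    """Scores a tokenized transcript with a recurrent network; output in [0, 1]."""
    def __init__(self, vocab_size=10000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)
        _, (h_n, _) = self.rnn(emb)               # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h_n[-1]))  # (batch, 1) clarity/coherence score

# Example: a batch of one (already tokenized) transcript with placeholder token ids.
model = CoherenceScorer()
tokens = torch.randint(0, 10000, (1, 75))
print(model(tokens))                              # untrained output, e.g. tensor([[0.52]])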
Upon estimation of clarity and coherence and further determination of the fluency of each speaker in the speech signal, the system 100 is further configured to compute a predictive score (also referred to as 'predicted score') for the estimated clarity and coherence. FIG. 10 depicts an exemplary DNN implemented by the system 100 of FIG. 1 for computing a predictive score based on one or more estimated speech parameters in accordance with an embodiment of the present disclosure. More specifically, the system 100 implements the DNN for computing a predictive score for the estimated clarity and coherence based on the one or more estimated speech parameters. The DNN (and/or the RNN) may be either internally connected to the system 100 or may be an external component that is connected to the system 100 via the I/O interface(s) 106, in one example embodiment. In an embodiment, the speech parameters, for example, number of pauses, speech rate, articulation rate, phonation rate, number of syllables, average syllable duration, number of repetitive fillers, mother tongue influence, clarity/coherence and fluency, are calculated based on the speaker's speech signals. These speech parameters extracted from the audio data are fed into the Deep Neural Network to compute the predictive score. In an embodiment of the present disclosure, the Deep Neural Network is trained on a large set of speakers with a scale of predictive score(s) which are analyzed and pre-trained accurately; using these pre-trained models, it further performs a comparative analysis for a particular speaker over the large population, thereby assessing and evaluating the speaker(s). The system further determines consistency (or contextual consistency) in the preprocessed signal by the one or more speakers. The consistency of speaking by the speakers indicates the speaking skills and characteristics of each of the one or more speakers.
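One possible, non-limiting shape of such a predictive-score network is sketched below in PyTorch, assuming nine input speech parameters and a single regression output; the layer sizes are illustrative, and the feature values are loosely taken from the worked example later in this description (partly assumed):

import torch
import torch.nn as nn

# Inputs (assumed ordering): pauses, speech rate, articulation rate, phonation time,
# syllables, avg syllable duration, repetitive fillers, native language influence,
# clarity/coherence.
score_net = nn.Sequential(
    nn.Linear(9, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),               # predictive score on a 0-10 scale
)

features = torch.tensor([[6.0, 72.0, 172.5, 40.0, 115.0, 0.35, 2.0, 0.2, 0.75]])
predicted_score = score_net(features)
print(predicted_score)              # untrained output; MSE training is sketched later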
Upon evaluating and assessing the speaker(s), the system 100 (or the one or more hardware processors 104, or the DNN implemented by the system 100) dynamically selects and provides a set of questions to each of the one or more speakers based on the speaking skills and characteristics. Such dynamic selection and provision of a set of questions is enabled wherein the system 100 crawls historical data (or training data) to identify a similar context as compared to the current context under consideration and determines whether any points that were discussed in a previous event having similar topics are addressed with solutions in the current event. For example, say in an event 1 there were certain agenda points that were discussed. If the system 100 identifies any future event with a similar context focusing on one or more of the agenda points previously discussed in the event 1, then the system 100 dynamically selects a set of questions and provides the questions to the speakers who can answer them during a current event (e.g., say event 3). In an embodiment of the present disclosure, a large pool of questions is classified based on the predicted score scale. So when the speaker is evaluated in the system 100, a score is generated based on his/her speech parameters. This score reflects the speaker's potential, skills and characteristics. Thus, depending on the score for the speaker, the question is fetched from the pool. FIG. 11, with reference to FIGS. 1 through 10, depicts a block diagram illustrating steps involved in context identification within speech in accordance with an embodiment of the present disclosure. FIG. 12, with reference to FIGS. 1 through 11, depicts a block diagram illustrating steps involved in dynamic question selection for each speaker based on the prediction score (or predictive score) in accordance with an embodiment of the present disclosure.
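By way of illustration only, the context matching against historical events could be realized with a simple keyword-overlap measure such as the following; the event data and the threshold are hypothetical, and the disclosure does not prescribe a particular similarity measure:

def context_overlap(current_topics, past_topics):
    """Jaccard overlap between the current context and a past event's agenda points."""
    current, past = set(current_topics), set(past_topics)
    return len(current & past) / len(current | past) if current | past else 0.0

historical_events = {
    "event 1": {"topics": {"budgeting", "audit", "deadlines"},
                "unresolved": ["Which accounting applications are you familiar with"]},
}

current_context = {"audit", "deadlines", "reporting"}
for name, event in historical_events.items():
    if context_overlap(current_context, event["topics"]) > 0.3:   # assumed threshold
        print(name, "->", event["unresolved"])    # candidate questions to carry forward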
FIG. 13, with reference to FIGS. 1 through 12, depicts an exemplary block diagram illustrating steps implemented by the system 100 of FIG. 1 in accordance with an embodiment of the present disclosure. In FIG. 13, it is seen that the system 100 (or DNN) computes/estimates various speech parameters, clarity and coherence, and predictive score for a particular speech. In an example embodiment of the present disclosure, the speech under consideration is an interview scenario. The transcript of the interview from the scenario is as below:
Speaker 1:
“Good morning, my name is John Doe. I am a qualified accountant with six years of experience in finance industry. I have a reputation for my attention to detail and delivering within strict deadlines and enjoy working with financial data.
Going forward I want to work in a challenging financial role within the same industry and your organization is one in which I believe I could settle down and make a real contribution.”
The following speech parameters were estimated by the system:
Time: 60 seconds
Phonation rate/time: 40 seconds
Number of pauses: 350 milliseconds
Speech rate: (words per minute): 72
Articulation rate (syllables per minute): 172.5
Average Syllables (syllables per minute): 115
Native Language Influence (based on MFCC data): 0.2
Clarity and Coherence: 0.75 (relative score)
The above speech parameters were computed by way of example expressions/equations shown below:
The audio sample of 60 seconds (refer audio signal or speech signal of FIG. 3A) was recorded with 16 Kilo Hertz (KHz) sampling rate and 16 bit integer bit-depth (b) by the system.
Fs = 16 KHz
Bit-depth (b) = 16 bit signed Integer (also referred as audio bit depth)
Duration= 60 seconds
Then the duration of silence in the audio (or speech signal) is determined by VAD (Voice Activity Detection) and a new audio is created without the unnecessary silences. The duration of silence (S) and the number of pauses (Np) are measured and the new audio duration is calculated, which gives the Phonation Time (PT), or the total time used by the speaker only for delivering the speech content, by way of the example expression below:
PT=T-S
T=60 seconds
S=20 seconds
Phonation Time (PT) = 60-20= 40 seconds
Np =6
The above audio is processed into text by Speech Text Conversion. The processed text is shown below by way of example:
Processed Text:
“Good morning, my name is John Doe. I am qualified accountant with the six years’ experience in finance industry. I have a reputation for my attention to detail and delivering within strict deadlines and enjoy working with financial data.
Going forward I want to work in a challenging finance role within the same industry and your organization is one in which I believe I could settle down and make a real contribution.”
Then number of words (Nw) and syllables (Ns) are calculated in the processed text.
Nw = 72 words
Ns = 115 syllables
Depending on the silence duration and Number of pauses, Average Pause duration (Avg. Pause) is calculated by way of example expression/equation below:
Avg.P=S/Np
S = 20 seconds
Np = 6
Avg. P = 20/6 = 3.33 seconds approximately
Depending on the number of words and the total time, the speech rate was calculated by way of the example expression/equation below:
SR=Nw/T
Nw = 72
T = 60 seconds = 1 minute
SR = 72/1= 72 (words per minute)
Depending on the Number of syllables (Ns) and Phonation time (PT), the Articulation Rate (AR) was calculated by way of example expression/equation below:
AR=Ns/PT
Ns = 115
PT = 40 seconds = 0.666 minutes approximately
AR = 115/0.666 = 172.5 syllables per minute approximately.
Depending on the Number of syllables (Ns) and total time (T), average syllables was calculated by way of example expression/equation below
Avg.S=Ns/T
Ns = 115
T = 60 seconds = 1 minute
Avg.S = 115/1 = 115 syllables per minute
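The arithmetic above can be reproduced directly; a small sketch using the example values (T = 60 s, S = 20 s, Np = 6, Nw = 72, Ns = 115) is given below:

T, S, Np = 60.0, 20.0, 6          # total time (s), silence (s), number of pauses
Nw, Ns = 72, 115                  # words and syllables from the processed text

PT = T - S                        # phonation time: 40 s
avg_pause = S / Np                # ~3.33 s
SR = Nw / (T / 60.0)              # speech rate: 72 words per minute
AR = Ns / (PT / 60.0)             # articulation rate: 172.5 syllables per minute
avg_syllables = Ns / (T / 60.0)   # 115 syllables per minute

print(PT, round(avg_pause, 2), SR, AR, avg_syllables)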
Native Language Influence was estimated/computed as shown below:
The MFCC data of the audio is calculated by the phase and magnitude spectrum technique and is given as a C vector (13 coefficients). The C observations are interpreted by a language model (implemented by the system 100) and the log likelihood is calculated as shown below.
P(C|λ) = (1/N) Σ_(n=0)^N log p(C^(n)|λ)
where λ corresponds to a language model.
The system 100 takes input spoken English (by the speaker 1). So using the above equation it was noted that the probability of the speaker spoke English and the Influence Score is calculated by a conditional complementary probability of English Language Model implemented by the system 100. The influence score was computed by way of example below:
InfluenceScore = 1 - p(C|λe)
where p(C|λe) is the probability that the speaker spoke English.
For the above example, Native Language Influence was calculated as 0.2 (refer https://www.ripublication.com/ijaer18/ijaerv13n5_79.pdf, and https://pdfs.semanticscholar.org/052d/0325403e4c8c1fdc9833732314956333febd.pdf)
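A hedged, non-limiting sketch of this computation is given below, assuming a Gaussian mixture model serves as the English language model and that a competing model is available so that the two average log-likelihoods can be normalized into a probability; this two-model normalization is an assumption made only for illustration, as the disclosure specifies only the complementary probability:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in MFCC matrices (frames x 13); real usage would pass features of actual speech.
english_train = rng.normal(0.0, 1.0, size=(500, 13))
other_train = rng.normal(0.5, 1.2, size=(500, 13))
test_frames = rng.normal(0.1, 1.0, size=(200, 13))

english_gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(english_train)
other_gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(other_train)

ll_en = english_gmm.score(test_frames)        # (1/N) * sum_n log p(C(n) | lambda_e)
ll_other = other_gmm.score(test_frames)
p_english = np.exp(ll_en) / (np.exp(ll_en) + np.exp(ll_other))  # two-model normalization (assumption)
influence_score = 1.0 - p_english             # InfluenceScore = 1 - p(C | lambda_e)
print(influence_score)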
Below is an example of how the system 100 has computed Predicted Score for the input speech signal specific to a speaker (refer example text above):
Predictive Score is predicted by the Deep Neural Network (DNN), which is trained with a Mean Square Error (MSE) loss function using the speech parameters as its inputs. The Mean Square Error (MSE) loss function is computed by way of the following exemplary expression/equation:
MSE = (1/n) Σ_(i=1)^n (Y_i - Ŷ_i)^2
where MSE is the mean square error, Y_i is the ground truth score, Ŷ_i is the estimated or predicted score, and n is the total number of dataset points.
Based on all the speech parameters (features), the DNN (implemented by the system 100) predicts a probability for each class (e.g., classes from 0 to 10). For the above example, the predicted class is 8 (the Predicted Score for the speaker's speech).
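A brief, non-limiting sketch of such MSE training in PyTorch is shown below, using a small network of the kind sketched earlier; the synthetic data and hyperparameters are placeholders:

import torch
import torch.nn as nn

# Placeholder training set: 256 speakers x 9 speech parameters, scores on a 0-10 scale.
X = torch.rand(256, 9)
y = torch.rand(256, 1) * 10.0

model = nn.Sequential(nn.Linear(9, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                        # MSE = (1/n) * sum (Y_i - Y_hat_i)^2

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(float(loss))                            # final training MSE on the placeholder data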
Initially, the system 100 assigns a topic to the user, e.g., "Introduce Yourself", on which the user speaks. Depending on the voice and the processed text data, the speech parameters are calculated as mentioned above. Based on the speech parameters, the predictive score (e.g., 8.0) is computed/generated by the Deep Neural Network, and based on the answers, the context is determined by Natural Language Processing. Based on the context and the Predicted Score, the questions are selected from a wide range of questions, in one example embodiment.
The below question pool gives examples of questions trained in the Deep Neural Network. The Sample Question Pool(s) are provided by way of example below:
Question | Context | Predicted Score
Can you say more about your work in this field | Financial Management | 6
Which accounting applications are you familiar with | Financial Management | 7
What do you consider to be the biggest challenge facing the accounting profession today? | Financial Management | 8

Depending on the context and Predicted Score, the next question is fetched from the pool and presented to the speaker. In this case, the next question would be “What do you consider to be the biggest challenge facing the accounting profession today?”
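The fetch described above can be illustrated with a simple lookup over the sample pool; the pool structure and the closest-score rule are assumptions made for illustration:

question_pool = [
    {"question": "Can you say more about your work in this field",
     "context": "Financial Management", "score": 6},
    {"question": "Which accounting applications are you familiar with",
     "context": "Financial Management", "score": 7},
    {"question": "What do you consider to be the biggest challenge facing the accounting profession today?",
     "context": "Financial Management", "score": 8},
]

def next_question(context, predicted_score, pool):
    """Return the pool question matching the context whose score is closest to the speaker's."""
    candidates = [q for q in pool if q["context"] == context]
    return min(candidates, key=lambda q: abs(q["score"] - predicted_score)) if candidates else None

print(next_question("Financial Management", 8, question_pool)["question"])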
The present disclosure overcomes the technical problem by providing a technical solution that segments the speech signal, performs frequency spectrum analysis, derives features from it and trains complex networks (DNN) using the same, which is considered to be difficult. The present disclosure estimates speech parameters (or qualitative and non-objective features), for example, native language influence (e.g., mother tongue influence), which is quite a challenging (or daunting) task, as can be realized from conventional systems and methods. Further, in the present disclosure, the system 100 is trained with training data that enables it to intelligently understand and assess the speaking skills of a person (or subsequent users and associated speech signals) by extracting features from the speech signal(s), for example, number of pauses, number of repetitive fillers, and the like. The system of the present disclosure further extracts textual content and maps the content with a summary of the speech to analyze the context. The system 100 may be implemented in a client server architecture, wherein the scenario involves a user who uses a recording device and speaks on a given topic. The recorded speech signal may be sent to a server side (or a computer system) where the speech and its transcript are generated and the fluency and the content the user spoke about are measured. The system 100 further generates score(s) that are fed into a database (e.g., say the database 108) from where they can be retrieved on demand.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.

Documents

Application Documents

# Name Date
1 201821026519-IntimationOfGrant01-03-2024.pdf 2024-03-01
2 201821026519-PatentCertificate01-03-2024.pdf 2024-03-01
3 201821026519-Written submissions and relevant documents [21-02-2024(online)].pdf 2024-02-21
4 201821026519-Correspondence to notify the Controller [06-02-2024(online)].pdf 2024-02-06
5 201821026519-FORM-26 [06-02-2024(online)]-1.pdf 2024-02-06
6 201821026519-FORM-26 [06-02-2024(online)].pdf 2024-02-06
7 201821026519-US(14)-HearingNotice-(HearingDate-07-02-2024).pdf 2024-01-12
8 201821026519-FER.pdf 2021-10-18
9 201821026519-ABSTRACT [01-04-2021(online)].pdf 2021-04-01
10 201821026519-CLAIMS [01-04-2021(online)].pdf 2021-04-01
11 201821026519-COMPLETE SPECIFICATION [01-04-2021(online)].pdf 2021-04-01
12 201821026519-FER_SER_REPLY [01-04-2021(online)].pdf 2021-04-01
13 201821026519-OTHERS [01-04-2021(online)].pdf 2021-04-01
14 201821026519-ORIGINAL UR 6(1A) FORM 1-011018.pdf 2019-02-18
15 201821026519-ORIGINAL UR 6(1A) FORM 26-120918.pdf 2019-02-13
16 201821026519-Proof of Right (MANDATORY) [25-09-2018(online)].pdf 2018-09-25
17 Abstract1.jpg 2018-09-07
18 201821026519-FORM-26 [05-09-2018(online)].pdf 2018-09-05
19 201821026519-COMPLETE SPECIFICATION [16-07-2018(online)].pdf 2018-07-16
20 201821026519-DRAWINGS [16-07-2018(online)].pdf 2018-07-16
21 201821026519-FIGURE OF ABSTRACT [16-07-2018(online)].jpg 2018-07-16
22 201821026519-FORM 1 [16-07-2018(online)].pdf 2018-07-16
23 201821026519-FORM 18 [16-07-2018(online)].pdf 2018-07-16
24 201821026519-REQUEST FOR EXAMINATION (FORM-18) [16-07-2018(online)].pdf 2018-07-16
25 201821026519-STATEMENT OF UNDERTAKING (FORM 3) [16-07-2018(online)].pdf 2018-07-16

Search Strategy

1 SearchStrategy_201821026519E_30-09-2020.pdf

ERegister / Renewals

3rd: 30 May 2024

From 16/07/2020 - To 16/07/2021

4th: 30 May 2024

From 16/07/2021 - To 16/07/2022

5th: 30 May 2024

From 16/07/2022 - To 16/07/2023

6th: 30 May 2024

From 16/07/2023 - To 16/07/2024

7th: 30 May 2024

From 16/07/2024 - To 16/07/2025

8th: 10 Jul 2025

From 16/07/2025 - To 16/07/2026