
"Method And Apparatus For Recognizing Human Emotion Expressions Based On Speech Signal"

Abstract: A method and apparatus for recognizing human emotion expressions based on a speech signal is disclosed. In one embodiment, a method includes filtering an input speech signal corresponding to a voiced acoustic expression of an individual and determining glottal closure instance (GCI) locations in the speech signal using the filtered signal. Further, the method includes computing instantaneous values of a set of acoustic features of the speech signal using the estimated GCI locations in the speech signal. The instantaneous values of the set of acoustic features are normalized with reference to a same set of acoustic features of a neutral speech signal associated with the individual. A multi-dimensional joint distribution is generated based on the normalized instantaneous values of the acoustic features and matched with a pre-defined emotion specific template. Moreover, the method includes determining a human emotion expression associated with the individual by calculating the number of matches found.


Patent Information

Application #:
Filing Date: 28 October 2013
Publication Number: 19/2015
Publication Type: INA
Invention Field: COMMUNICATION
Status:
Email: mail@lexorbis.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2023-02-02
Renewal Date:

Applicants

SAMSUNG R&D INSTITUTE INDIA – BANGALORE PRIVATE LIMITED
# 2870, ORION Building, Bagmane Constellation Business Park, Outer Ring Road, Doddanakundi Circle, Marathahalli Post, Bangalore-560 037
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY, HYDERABAD
Gachibowli, Hyderabad, Andhra Pradesh, 500032

Inventors

1. Bayya Yegnanarayana
IIIT- HYDERABAD, Gachibowli, Hyderabad, Andhra Pradesh, 500032
2. Mittal Vinay Kumar
IIIT- HYDERABAD, Gachibowli, Hyderabad, Andhra Pradesh, 500032
3. Moogi Pratibha
Employed at Samsung R&D Institute India – Bangalore Private Limited having its office at # 2870, ORION Building, Bagmane Constellation Business Park, Outer Ring Road, Doddanakundi Circle, Marathahalli Post, Bangalore-560 037
4. Gangamohan Paidi
IIIT- HYDERABAD, Gachibowli, Hyderabad, Andhra Pradesh, 500032
5. Kadiri Sudarsana Reddy
IIIT- HYDERABAD, Gachibowli, Hyderabad, Andhra Pradesh, 500032

Specification

CLAIMS:
1. A method of recognizing human emotion expressions based on speech signal corresponding to a voiced acoustic expression of an individual, comprising:
determining, using a processor, glottal closure instance (GCI) locations in a speech signal corresponding to a voice activity of an individual;
computing instantaneous values of a set of acoustic features of the speech signal using the determined GCI locations in the speech signal;
normalizing the instantaneous values of the set of acoustic features of the speech signal with reference to a same set of acoustic features of a neutral speech signal associated with the individual; and
determining a human emotion expression associated with the individual based on the normalized instantaneous values of the set of acoustic features.

2. The method as claimed in claim 1, further comprising:
generating a zero frequency filtered (ZFF) signal by applying a filter on the speech signal; and
applying a running average filter on the filtered speech signal.

3. The method as claimed in claim 2, wherein computing the instantaneous values of the set of acoustic features of the speech signal comprises:
computing instantaneous value of fundamental pitch frequency (F0) using the estimated GCI locations in the filtered speech signal;
computing instantaneous value of strength of excitation (SoE) as slope of the filtered speech signal on the estimated GCI locations in the filtered speech signal; and
computing an instantaneous value of energy of excitation (EoE) as energy around the estimated GCI locations in the filtered speech signal.

4. The method as claimed in claim 3, wherein normalizing the instantaneous values of the set of acoustic features of the speech signal comprises:
capturing a neutral speech signal corresponding to a voice activity associated with the individual as a part of a pre-registering step;
computing mean value of each of a set of acoustic features of the neutral speech signal; and
computing normalized values of the set of acoustic features of the speech signal based on the instantaneous values of the set of acoustic features of the speech signal and the mean values of the set of acoustic features of the neutral speech signal.

5. The method as claimed in claim 2, further comprising:
extracting rate and amount of the voice activity in the filtered speech signal; and
determining whether the rate and amount of speech in the filtered speech signal is greater than a pre-defined threshold.

6. The method as claimed in claims 4 and 5, wherein determining the human emotion expression associated with the individual comprises:
generating a multi-dimensional joint distribution of the normalized values of the set of acoustic features;
matching each of the multi-dimensional joint distribution of the normalized values of the set of acoustic features with corresponding multi-dimensional joint distribution of the emotion specific voice expression templates, wherein each of the emotion specific voice expression templates correspond to one of emotion categories; and
recognizing a human emotion expression associated with the individual based on the outcome of the comparison.

7. The method as claimed in claim 6, wherein generating the multi-dimensional joint distribution of the normalized values of the set of acoustic features comprises:
generating a first multi-dimensional joint distribution of the normalized values of the fundamental pitch frequency (F0) and the strength of excitation (SoE);
generating a second multi-dimensional joint distribution of the normalized values of the fundamental pitch frequency (F0) and the Energy of Excitation (EoE); and
generating a third multi-dimensional joint distribution of the normalized values of the energy of excitation (EoE) and the strength of excitation (SoE).

8. The method of claim 7, wherein recognizing the human emotion expression associated with the individual based on the outcome of the comparison comprises:
computing a score indicating a probability of each multi-dimensional joint distribution of the normalized values matching corresponding multi-dimensional joint distribution of the emotion specific voice expression templates; and
determining a count of each multi-dimensional joint distribution of the normalized values matching the corresponding multi-dimensional joint distribution in one of the emotion specific voice expression templates based on the score; and
determining the human emotion expression of the individual as belonging to one of the emotion categories based on the matched count of the multi-dimensional joint distribution of the normalized values matching the corresponding multi-dimensional joint distribution of one of the emotion specific voice expression templates.

9. The method of claim 8, wherein determining the human emotion expression of the individual as belonging to one of the emotion categories based on the matched count comprises:
performing hierarchical classification of the matched counts corresponding to the emotion categories; and
determining a human emotion expression as belonging to one of the emotion categories based on the hierarchical classification.

10. The method of claim 1, further comprising:
outputting the human emotion expression of the individual as belonging to one of the emotion categories.

11. An apparatus of recognizing human emotion expressions based on speech signal corresponding to a voiced acoustic expression of an individual, comprising:
a processor; and
a memory comprising a human emotional expression detection module configured for:
determining glottal closure instance (GCI) locations in a speech signal corresponding to a voice activity of an individual;
computing instantaneous values of a set of acoustic features of the speech signal using the estimated GCI locations in the speech signal;
normalizing the instantaneous values of the set of acoustic features of the speech signal with reference to a same set of acoustic features of a neutral speech signal associated with the individual; and
determining a human emotion expression associated with the individual based on the normalized instantaneous values of the set of acoustic features.

12. The apparatus as claimed in claim 11, wherein the human emotional expression detection module is configured for:
generating a zero frequency filtered (ZFF) signal by applying a filter on the speech signal; and
applying a running average filter on the filtered speech signal.

13. The apparatus as claimed in claim 11, wherein in obtaining the instantaneous values of the set of acoustic features of the speech signal, the human emotional expression detection module is configured for:
computing instantaneous value of fundamental pitch frequency (F0) using the estimated GCI locations in the filtered speech signal;
computing instantaneous value of strength of excitation (SoE) as slope of the filtered speech signal on the estimated GCI locations in the filtered speech signal; and
computing an instantaneous value of energy of excitation (EoE) as energy around the estimated GCI locations in the filtered speech signal.

14. The apparatus as claimed in claim 13, wherein in normalizing the instantaneous values of the set of acoustic features of the speech signal, the human emotional expression detection module is configured for:
capturing a neutral speech signal corresponding to a voice activity associated with the individual as a part of a pre-registering step;
computing mean value of each of a set of acoustic features of the neutral speech signal; and
computing normalized values of the set of acoustic features of the speech signal based on the instantaneous values of the set of acoustic features of the speech signal and the mean values of the set of acoustic features of the neutral speech signal.

15. The apparatus as claimed in claim 12, wherein the human emotional expression detection module is further configured for:
extracting rate and amount of the voice activity in the filtered speech signal; and
determining whether the rate and amount of speech in the filtered speech signal is greater than a pre-defined threshold.

16. The apparatus as claimed in claims 14 and 15, wherein in recognizing the human emotion expression associated with the individual, the human emotional expression detection module is configured for:
generating a multi-dimensional joint distribution of the normalized values of the set of acoustic features;
matching each of the multi-dimensional joint distribution of the normalized values of the set of acoustic features with corresponding multi-dimensional joint distribution of the emotion specific voice expression templates, wherein each of the emotion specific voice expression templates correspond to one of emotion categories; and
recognizing a human emotion expression associated with the individual based on the outcome of the comparison.

17. The apparatus as claimed in claim 16, wherein in generating the multi-dimensional joint distribution of the normalized values of the set of acoustic features, the human emotional expression detection module is configured for:
generating a first multi-dimensional joint distribution of the normalized values of the fundamental pitch frequency (F0) and the strength of excitation (SoE);
generating a second multi-dimensional joint distribution of the normalized values of the fundamental pitch frequency (F0) and the Energy of Excitation (EoE); and
generating a third multi-dimensional joint distribution of the normalized values of the energy of excitation (EoE) and the strength of excitation (SoE).

18. The apparatus of claim 17, wherein in recognizing the human emotion expression associated with the individual based on the outcome of the comparison, the human emotional expression detection module is configured for:
computing a score indicating a probability of each multi-dimensional joint distribution of the normalized values matching corresponding multi-dimensional joint distribution of the emotion specific voice expression templates; and
determining a count of each multi-dimensional joint distribution of the normalized values matching the corresponding multi-dimensional joint distribution in one of the emotion specific voice expression templates based on the score; and
determining the human emotion expression of the individual as belonging to one of the emotion categories based on the matched count of the multi-dimensional joint distribution of the normalized values matching the corresponding multi-dimensional joint distribution of one of the emotion specific voice expression templates.

19. The apparatus of claim 18, wherein in determining the human emotion expression of the individual as belonging to one of the emotion categories based on the matched count, the human emotional expression detection module is configured for:
performing hierarchical classification of the matched counts corresponding to the emotion categories; and
determining a human emotion expression as belonging to one of the emotion categories based on the hierarchical classification.

20. The apparatus of claim 11, wherein the human emotional expression detection module is further configured for:
outputting the human emotion expression of the individual as belonging to one of the emotion categories.

FIELD OF THE INVENTION

The present invention generally relates to the field of human emotion expression recognition, and more particularly relates to a method and apparatus for recognizing human emotion expressions based on speech signal.

BACKGROUND OF THE INVENTION

Emotion is a subjective, conscious experience that is characterized primarily by psychophysiological expressions, biological reactions, and mental states. Detection of emotions such as Happy, Anger, Sad, etc. helps to interpret the contextual background and could be beneficial for inferring a user's engagement level, level of frustration, level of depression, etc., especially in user scenarios such as user mood monitoring systems, user health monitoring systems, automatic media recommendation systems, and so on.

Typically, human emotion expression recognition is performed using either facial expression recognition or speech recognition. Human emotion expression recognition methods based on the speech signal include differentiation of speech based on acoustic cues. Based on the different acoustic cues, the speech signal is analyzed for the user's different voice expressions, and the human emotion corresponding to the underlying voice expression is detected. Typically, such emotion detection is performed using machine learning algorithms. Human emotion expression detection can be performed in real time as well as on a recorded speech signal.

The underlying method requires the user to register his or her “Neutral” voice (speech) as the very first step. The system is language independent, spoken-content independent, and also speaker independent, with the condition that any new user is first asked to register his or her “Neutral” speech before the follow-up emotion recognition begins.

BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS

Figure 1 is a block diagram of an exemplary expression recognition device, according to one embodiment.

Figure 2 is an exploded view of a human emotion expression detection module, according to one embodiment.

Figure 3 is a process flowchart illustrating an exemplary method of recognizing a human emotion expression of an individual based on a speech signal, according to one embodiment.

Figure 4 is a process flowchart illustrating an exemplary method of generating a multi-dimensional joint distribution of normalized values of a set of acoustic features of a speech signal, according to one embodiment.

Figure 5A is a pictorial representation of a zero frequency filtering (ZFF) signal used for computing instantaneous values of a set of acoustic features, according to one embodiment.

Figure 5B is a schematic representation illustrating a process of matching two dimensional joint distribution values associated with the individual with pre-defined emotion specific expression templates, according to one embodiment.

Figure 5C is a diagrammatic representation illustrating an exemplary hierarchical decision tree for determining a final human emotion expression of the individual based on matched count, according to one embodiment.

Figure 6 is a block diagram of an exemplary emotion recognition device, such as those shown in Figure 1, showing various components for implementing embodiments of the present subject matter.

DETAILED DESCRIPTION OF THE INVENTION

A method and apparatus for recognizing human emotion expressions based on speech signal is disclosed. In the following detailed description of the embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

Figure 1 is a block diagram of an exemplary expression recognition device 100, according to one embodiment. The expression recognition device 100 includes a processor 102 and a memory 104. The expression recognition device 100 may be a laptop, a desktop, a smart phone, a tablet, a special purpose computer, and the like. According to the present invention, the memory 104 includes a human emotion expression detection module 106 for recognizing human emotion expressions based on a speech signal corresponding to a voiced acoustic expression of an individual. The voiced acoustic expression of an individual is the voice that is recorded when the individual speaks.

It is understood that the present invention can be implemented with hardware, software, or a combination thereof. In a hardware implementation, the present invention can be implemented with one of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microprocessor, other electronic units, or a combination thereof, which are designed to recognize human emotion expressions based on a speech signal corresponding to a voiced acoustic expression of an individual. In a software implementation, the present invention can be implemented with a module (e.g., the human emotion expression detection module 106) for recognizing human emotion expressions based on a speech signal corresponding to a voiced acoustic expression of an individual. The software module is stored in the memory unit 104 and executed by the processor 102. Various means widely known to those skilled in the art can be used as the memory unit 104 or the processor 102.

Figure 2 is an exploded view of the human emotion expression detection module 106, according to one embodiment. The human emotion expression detection module 106 includes a filtering module 202, an acoustic feature extraction module 204, a multi-dimensional joint distribution module 206, and an expression classification module 208. The filtering module 202 segments an input speech signal corresponding to a voiced acoustic expression of an individual into a plurality of audio frames and generates a zero frequency filtering (ZFF) signal from the plurality of audio frames. The input speech signal may correspond to live speech captured using a microphone. Alternatively, the input speech signal may correspond to pre-recorded speech data.

The acoustic feature extraction module 204 estimates glottal closure instance (GCI) locations in the speech signal based on the ZFF signal and computes instantaneous values of a set of acoustic features based on the GCI locations in the speech signal. The set of acoustic features includes a fundamental frequency (F0), strength of excitation (SoE) and energy of excitation (EoE). Additionally, the acoustic feature extraction module 204 normalizes the instantaneous values of the set of acoustic features.

The multi-dimensional joint distribution module 206 generates a multi-dimensional joint distribution of the normalized values of the set of acoustic features. The expression classification module 208 matches the multi-dimensional joint distribution of the normalized values of the set of acoustic features with pre-defined emotion specific expression templates and determines a human emotion expression corresponding to the speech signal based on the matching counts of the multi-dimensional joint distribution with the pre-defined emotion specific expression templates.

Figure 3 is a process flowchart 300 illustrating an exemplary method of recognizing a human emotion expression of an individual based on a speech signal, according to one embodiment. At step 302, a speech signal for which a human emotion expression is to be recognized is input from a voiced acoustic expression of the individual (e.g., a speaker). At step 303, the speech signal is preprocessed and segmented into a plurality of audio frames. For example, the real-time incoming speech signal is preprocessed by applying a Linear Predictive Coding (LPC) filter, and the residual signal is transformed into a Hilbert energy envelope signal. In some embodiments, the speech signal is segmented into 500 ms to 2 s long audio frames. At step 304, a zero frequency filtering (ZFF) signal is generated by applying a zero frequency filtering method on the preprocessed audio frames.

At step 305, a running average filter is applied on the ZFF signal to remove the mean trend from the ZFF signal. Mean trend removal is the process of removing the short term average computed over the ZFF signal. At step 306, glottal closure instance (GCI) locations in the mean trend removed signal are determined. Glottal closure instants (GCIs) are the instants of significant excitation of the vocal tract. At step 308, instantaneous values of a set of acoustic features are calculated based on the GCI locations. In an exemplary embodiment, the acoustic features include fundamental frequency (F0), strength of excitation (SoE) and energy of excitation (EoE). The graphical representation of the GCI locations in the ZFF signal and the computation of the acoustic features are illustrated in Figure 5A.
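By way of a non-limiting illustration, steps 304 to 306 may be sketched in Python with numpy; the function and parameter names (zff_gci, trend_win_ms) and the number of trend-removal passes are assumptions made for this sketch only and do not form part of the described method.

import numpy as np

def zff_gci(speech, fs, trend_win_ms=20.0):
    """Sketch of zero frequency filtering (ZFF) and GCI detection.

    speech : 1-D numpy array holding one audio frame
    fs     : sampling rate in Hz
    Returns the trend-removed ZFF signal and the GCI sample indices.
    """
    # Difference the signal to suppress any DC offset before filtering.
    x = np.diff(speech, prepend=speech[0])

    # Cascade of two ideal zero-frequency resonators (double cumulative sum).
    y = np.cumsum(np.cumsum(x))

    # Running average filter to remove the slowly varying mean trend; the
    # window length is of the order of an average pitch period. A second
    # pass is applied here because a single pass may leave residual trend.
    win = max(3, int(fs * trend_win_ms / 1000.0))
    kernel = np.ones(win) / win
    zff = y - np.convolve(y, kernel, mode="same")
    zff = zff - np.convolve(zff, kernel, mode="same")

    # GCIs are taken at the negative-to-positive zero crossings of the ZFF signal.
    gci = np.where((zff[:-1] < 0) & (zff[1:] >= 0))[0] + 1
    return zff, gci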

In an exemplary implementation, the instantaneous value of the fundamental frequency (F0) is calculated based on the time period between two consecutive GCI locations. Consider that the time period between two consecutive GCI locations (gci(0) and gci(1)) is T0. In such a case, the fundamental frequency (F0) corresponding to the speech signal is 1/T0. The instantaneous value of energy of excitation (EoE) is computed as the energy around the GCI location of the speech signal. The instantaneous value of strength of excitation (SoE) of the speech signal is computed as the slope of the speech signal at the GCI locations, as shown in Figure 5A.
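Continuing the sketch, step 308 may be illustrated as follows, reusing the zff signal and gci indices from the previous sketch; the window length used for the EoE energy is an assumed value for illustration.

def instantaneous_features(zff, speech, gci, fs, eoe_win_ms=2.0):
    """Per-GCI instantaneous F0, SoE and EoE (illustrative sketch only)."""
    half = max(1, int(fs * eoe_win_ms / 2000.0))
    f0, soe, eoe = [], [], []
    for k in range(len(gci) - 1):
        n = int(gci[k])
        t0 = (gci[k + 1] - gci[k]) / fs                 # time period between consecutive GCIs
        f0.append(1.0 / t0)                             # F0 = 1 / T0
        lo, hi = max(n - 1, 0), min(n + 1, len(zff) - 1)
        soe.append(abs(zff[hi] - zff[lo]) / max(hi - lo, 1))   # slope of the ZFF signal at the GCI
        seg = speech[max(0, n - half): n + half + 1]
        eoe.append(float(np.mean(seg ** 2)))            # energy around the GCI
    return np.array(f0), np.array(soe), np.array(eoe)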

At step 310, the computed instantaneous values of the acoustic features (i.e., F0, SoE, and EoE) are normalized with reference to a same set of acoustic features of a neutral speech signal corresponding to a voiced acoustic expression of the individual. For normalizing the instantaneous values, a neutral speech signal corresponding to the voiced acoustic expression associated with the individual is captured during a pre-registration process. Then, a mean value of each of a set of acoustic features of the neutral speech signal is computed. Accordingly, normalized values of the set of acoustic features of the speech signal are computed based on the instantaneous values of the set of acoustic features of the speech signal and the mean values of the set of acoustic features of the neutral speech signal.

For instance, the normalized value of instantaneous value of fundamental frequency (Y_F0) is computed as follows:
Y_F0 = (X_F0-M_F0)/M_F0
where X_F0 is the instantaneous value of the fundamental frequency (F0) and M_F0 is the mean of the fundamental frequency of the neutral speech signal of the individual.

The normalized value of instantaneous value of strength of excitation (Y_SoE) is computed using following equation:
Y_SoE = (X_SoE-M_SoE)/M_SoE
where X_SoE is instantaneous value of strength of excitation (SoE) at the GCI locations of the speech signal and M_SoE is mean value of strength of excitation (SoE) of the neutral speech signal.

Similarly, the normalized value of instantaneous value of energy of excitation (Y_EoE) is computed using the following equation:
Y_EoE = (X_EoE-M_EoE)/M_EoE
where X_EoE is instantaneous value of energy of excitation at the GCI locations of the speech signal and M_EoE is mean value of energy of excitation of the neutral speech signal.
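The three normalization equations above may be expressed, purely for illustration and assuming the feature arrays from the earlier sketches, as:

def normalize_features(f0, soe, eoe, neutral_means):
    """Normalize instantaneous features against the speaker's neutral speech.

    neutral_means : dict holding the mean F0, SoE and EoE measured from the
                    neutral speech registered by the same individual.
    """
    y_f0 = (f0 - neutral_means["F0"]) / neutral_means["F0"]      # Y_F0 = (X_F0 - M_F0) / M_F0
    y_soe = (soe - neutral_means["SoE"]) / neutral_means["SoE"]  # Y_SoE = (X_SoE - M_SoE) / M_SoE
    y_eoe = (eoe - neutral_means["EoE"]) / neutral_means["EoE"]  # Y_EoE = (X_EoE - M_EoE) / M_EoE
    return y_f0, y_soe, y_eoe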

At step 312, a multi-dimensional joint distribution of the normalized instantaneous values of the set of acoustic features of the speech signal is generated. According to the present invention, using the acoustic features such as fundamental frequency, strength of excitation and energy of excitation, two dimensional joint distributions may be generated. For example, a two dimensional joint distribution of the normalized values of the fundamental pitch frequency (F0) and the strength of excitation (SoE) is generated. Also, a two dimensional joint distribution of the normalized values of the fundamental pitch frequency (F0) and the energy of excitation (EoE) is generated. Additionally, a two dimensional joint distribution of the normalized values of the strength of excitation (SoE) and the energy of excitation (EoE) is generated.
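For illustration, each two dimensional joint distribution may be summarized by its sample mean vector and 2x2 covariance matrix, as in the representation given further below for D1, D2 and D3; the Gaussian summary and the numpy helper names are assumptions of this sketch.

def joint_distributions(y_f0, y_soe, y_eoe):
    """Summarize each two dimensional joint distribution by mean and covariance.

    D1: (SoE, F0), D2: (F0, EoE), D3: (SoE, EoE), following the ordering used
    in the mean and variance representation below.
    """
    def gaussian_2d(a, b):
        data = np.vstack([a, b])                   # 2 x N matrix of normalized samples
        return data.mean(axis=1), np.cov(data)     # mean vector and 2x2 covariance matrix
    d1 = gaussian_2d(y_soe, y_f0)
    d2 = gaussian_2d(y_f0, y_eoe)
    d3 = gaussian_2d(y_soe, y_eoe)
    return d1, d2, d3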

At step 314, the generated multi-dimensional joint distribution of the normalized values is matched with each of the pre-defined emotion specific expression templates to recognize human emotion expressions. A pre-defined emotion specific voice expression template corresponds to one of the emotion categories. For example, the emotion categories may include angry, sad, neutral, and happy. A pre-defined emotion specific voice expression template is generated by collecting a plurality of samples of speech signals for which the human emotion expressions are known. In some embodiments, each multi-dimensional joint distribution of the normalized values of the set of acoustic features is matched with the corresponding multi-dimensional joint distribution of each of the emotion specific voice expression templates. The process of matching the multi-dimensional joint distribution with the pre-defined emotion specific voice expression templates is illustrated in Figure 5B.

At step 316, a human emotion expression is determined based on the number of times the multi-dimensional joint distribution of the normalized values matches the corresponding multi-dimensional joint distribution of one of the emotion specific voice expression templates. In some embodiments, hierarchical classification of the matched counts corresponding to the emotion categories is performed, and a final human emotion expression is determined as belonging to one of the emotion categories based on the hierarchical classification. A matched count corresponding to an emotion category may correspond to the number of times the multi-dimensional joint distribution of the normalized values matches the corresponding multi-dimensional joint distribution of the emotion specific voice expression templates of the respective emotion category. A detailed process of determining the human emotion expression from the matched count is illustrated in Figure 5C. At step 318, the human emotion expression of the individual as belonging to one of the emotion categories is output.

Figure 4 is a process flowchart 400 illustrating an exemplary method of generating a multi-dimensional joint distribution of the normalized values of the set of acoustic features of the speech signal, according to one embodiment. Prior to performing step 312 of Figure 3, steps 402 to 406 are performed to determine whether the rate and amount of speech required for generating the multi-dimensional joint distribution are present in the filtered speech signal. At step 402, a voiced acoustic expression of an individual is detected from a captured speech signal. At step 404, the rate and amount of speech are extracted from the speech signal. At step 406, it is determined whether the extracted rate and amount of speech are greater than a pre-defined threshold. The pre-defined threshold is the minimum rate and amount of speech required in a speech signal to detect a human emotion expression in the speech signal. If the rate and amount of speech are less than the pre-defined threshold, then at step 408, the process 400 is terminated. That is, at step 408, no human emotion expression is detected. If the rate and amount of speech are greater than the pre-defined threshold, then step 312 of Figure 3 is performed, in which a multi-dimensional joint distribution of the normalized values of the set of acoustic features is generated.
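A hypothetical sketch of this gating step follows; the specific measures of the rate and amount of voice activity and the threshold values shown are illustrative placeholders rather than values prescribed by the method.

def enough_voiced_speech(zff, gci, fs, min_voiced_fraction=0.3, min_gci_rate=50.0):
    """Gate on the rate and amount of voice activity (illustrative only).

    'Amount' is approximated by the fraction of samples whose ZFF energy
    exceeds a small floor, and 'rate' by the number of detected GCIs per
    second; both measures and thresholds are assumptions of this sketch.
    """
    duration = len(zff) / fs
    energy = zff ** 2
    if duration == 0 or energy.max() == 0:
        return False
    voiced_fraction = float(np.mean(energy > 0.01 * energy.max()))  # crude "amount" of voiced speech
    gci_rate = len(gci) / duration                                   # crude "rate" of voice activity
    return voiced_fraction > min_voiced_fraction and gci_rate > min_gci_rate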

Figure 5A is a pictorial representation 500 of a zero frequency filtering (ZFF) signal 502 used for computing instantaneous values of a set of acoustic features, according to one embodiment. As can be seen from Figure 5A, the ZFF signal 502 contains GCI locations 504 of a speech signal corresponding to voiced acoustic expressions of an individual. According to the present invention, the expression recognition device 100 determines GCI locations 504 in the speech signal using the ZFF signal 502. The expression recognition device 100 computes instantaneous values of the set of acoustic features from the GCI locations 504. For example, the expression recognition device 100 computes fundamental frequency (F0) based on distance between any two successive GCI locations 504 (e.g., gci(0) and gci(1)). The expression recognition device 100 also computes strength of excitation (SoE) 506 by computing slope of the speech signal at one of the GCI locations. Furthermore, the expression recognition device 100 computes energy of excitation (EoE) 508 by computing energy around one of the GCI locations. The expression recognition device 100 normalizes the values of F0, SoE, and EoE and uses the normalized values of F0, SoE and EoE for generating two dimensional joint distribution (D1, D2, D3).

The expression recognition device 100 generates the first two dimensional joint distribution (D1) using the normalized values of the fundamental pitch frequency (F0) and the strength of excitation (SoE). The expression recognition device 100 generates the second two dimensional joint distribution (D2) using the normalized values of the fundamental pitch frequency (F0) and the energy of excitation (EoE). The expression recognition device 100 generates the third two dimensional joint distribution (D3) using the normalized values of the energy of excitation (EoE) and the strength of excitation (SoE). For example, the two dimensional joint distributions D1, D2 and D3 are represented as the mean and variance of the normalized values of the acoustic features of the speech signal. The means (MD1, MD2 and MD3) can be represented as follows:

MD1 = [m_SoE  m_F0];
MD2 = [m_F0  m_EoE]; and
MD3 = [m_SoE  m_EoE].

The variances (VarD1, VarD2 and VarD3) can be represented as follows:
VarD1 (Σ_(SoE,F0)) = [σ_(SoE,SoE)  σ_(SoE,F0); σ_(SoE,F0)  σ_(F0,F0)];
VarD2 (Σ_(F0,EoE)) = [σ_(F0,F0)  σ_(F0,EoE); σ_(EoE,F0)  σ_(EoE,EoE)]; and
VarD3 (Σ_(SoE,EoE)) = [σ_(SoE,SoE)  σ_(SoE,EoE); σ_(SoE,EoE)  σ_(EoE,EoE)].

The expression recognition device 100 matches the multi-dimensional joint distributions (D1, D2, and D3) with pre-defined emotion specific expression templates corresponding to different emotion categories (e.g., angry, sad, happy, and neutral) for one or more speakers to determine human emotion expression.

As shown in Figure 5B, for each of the emotion categories, the predefined emotion specific expression templates correspond to two dimensional joint distributions (D1, D2, and D3) of instantaneous values of acoustic features associated with a speech signal of speaker 1 for three sentences and two dimensional joint distributions (D1, D2, and D3) of instantaneous values of acoustic features associated with a speech signal of speaker 2 for three sentences. It can be noted that the predefined emotion specific expression templates of each emotion category may correspond to two dimensional joint distributions of instantaneous values of acoustic features for ‘N’ number of speakers and ‘M’ number of sentences.

Each of the emotion specific templates corresponding to each emotion category for each speaker is represented in terms of mean and variance of instantaneous values of acoustic features of the speech signal. For instance, the mean of two dimensional distributions D1, D2 and D3 corresponding to each emotion category for specific speaker is represented as follows:

[M_D1; M_D2; M_D3] = [m_(SoE,i)  m_(F0,i); m_(F0,i)  m_(EoE,i); m_(SoE,i)  m_(EoE,i)]

The variance of two dimensional distributions D1, D2 and D3 corresponding to each emotion category for specific speaker is calculated as follows:
Var D1 = [σ_(SoE,SoE,i)  σ_(SoE,F0,i); σ_(SoE,F0,i)  σ_(F0,F0,i)];
Var D2 = [σ_(F0,F0,i)  σ_(F0,EoE,i); σ_(EoE,F0,i)  σ_(EoE,EoE,i)]; and
Var D3 = [σ_(SoE,SoE,i)  σ_(SoE,EoE,i); σ_(SoE,EoE,i)  σ_(EoE,EoE,i)].


Figure 5B is a schematic representation 575 illustrating a process of matching the two dimensional joint distribution values (D1, D2, and D3) associated with the individual with the pre-defined emotion specific expression templates, according to one embodiment. The expression recognition device 100 matches each two dimensional joint distribution (D1, D2, or D3) with the corresponding two dimensional joint distribution (D1, D2, or D3) of the pre-defined emotion specific expression templates belonging to each of the emotion categories for different speakers (e.g., speaker 1 and speaker 2).

In an exemplary implementation, the expression recognition device 100 computes a score (also referred to as the KL distance) for each pre-defined emotion specific template corresponding to each two dimensional distribution (D1, D2 or D3) for each emotion category. The score is based on a Kullback–Leibler (KL) distance measure between the mean and variance of the instantaneous values of the acoustic features of the speech signal and the mean and variance of each of the pre-defined emotion specific templates corresponding to the respective two dimensional distribution for each emotion category associated with a specific speaker.

Here, K is the number of emotion specific templates per speaker, E is the emotion corresponding to the speaker, and D is the order of the multi-dimensional joint distribution.

The template distribution is the joint distribution for a particular speaker for a given emotion from a pre-defined emotion specific template, and the test distribution is the joint distribution computed from the speaker's input speech for the same joint distribution.

Consider a case where the mean and variance of the two dimensional distribution D1 are represented as:
MD1 = [M_0  M_1]; and
VarD1 = [σ_00  σ_01; σ_10  σ_11].

The mean and variance of the two dimensional distribution D1 for a predefined emotion specific template are represented as:
MD1 for the predefined emotion specific template = [M̄_0  M̄_1]; and

VarD1 for the predefined emotion specific template = [σ̄_00  σ̄_01; σ̄_10  σ̄_11].

The KL distance measure is calculated using the following equation:
KL score = 1/2 (Val_D1 + Val_D2 + Val_D3 - 2)
where Val_D1 = tr(Σ_1^(-1) Σ_0) = tr[σ_00  σ_01; σ_10  σ_11] = σ_00 + σ_11,
Val_D2 = [M1 - M2] Σ_1^(-1) [M1 - M2]^T, and
Val_D3 = -ln((σ_00 σ_11 - σ_01 σ_10) / (σ̄_00 σ̄_11 - σ̄_01 σ̄_10)).

The above formula gives the KL distance measure for two dimensional joint distributions. In the case of higher-dimensional distributions, the score can be measured using the formula below.
D_KL(N_0 ‖ N_1) = 1/2 (tr(Σ_1^(-1) Σ_0) + (μ_1 - μ_0)^T Σ_1^(-1) (μ_1 - μ_0) - k - ln(det Σ_0 / det Σ_1))
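For illustration, the closed-form expression above for two Gaussian distributions may be computed as follows; the numpy-based helper and variable names are assumptions of this sketch, with the means and covariances taken from the earlier sketches.

def kl_distance(mean0, cov0, mean1, cov1):
    """KL divergence D_KL(N0 || N1) between two multivariate Gaussians,
    matching the closed-form expression above (k is the dimensionality)."""
    k = mean0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mean1 - mean0
    term_trace = np.trace(inv1 @ cov0)                      # tr(Σ1^-1 Σ0)
    term_mahal = float(diff @ inv1 @ diff)                  # (μ1-μ0)^T Σ1^-1 (μ1-μ0)
    term_logdet = np.log(np.linalg.det(cov0) / np.linalg.det(cov1))
    return 0.5 * (term_trace + term_mahal - k - term_logdet)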

It can be noted that a two dimensional joint distribution (D1, D2 or D3) is said to match the corresponding predefined emotion specific template for the respective two dimensional joint distribution belonging to one of the emotion categories for a specific speaker when the score for that template is the best (i.e., the minimum KL distance) among the emotion categories. It can also be noted that the score indicates the probability of a match between a two dimensional joint distribution associated with an individual and the corresponding predefined emotion specific template for the respective two dimensional joint distribution belonging to one of the emotion categories, with a smaller KL distance corresponding to a higher probability of match.

Based on the score, the expression recognition device 100 maintains a count of matches between each of the two dimensional joint distributions and the corresponding predefined emotion specific template for the respective two dimensional joint distribution belonging to the different emotion categories for each speaker.

Every time one of the two dimensional joint distributions matches the corresponding predefined emotion specific template for the respective two dimensional joint distribution belonging to one of the emotion categories for a speaker, the expression recognition device 100 increments the matched count for the corresponding emotion, as explained below.

For example, consider i = 1 for the first speaker and j = 1 for the first distribution; the minimum score for that distribution is then determined. The matched count of the emotion category E for which the minimum score is obtained is incremented by one. Likewise, for all emotion categories, the KL distance between the emotion specific template and the test distribution for a given speaker is calculated. The template having the lowest score is considered the winning template, and the matched count for the emotion category of the winning template is incremented by one.
Once the process of matching is complete for all the predefined emotion specific expression templates, the expression recognition device 100 determines the total matched count for each emotion category, which in turn is used for further classification and for determining the human emotion expression of the individual, as explained below.
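A hypothetical sketch of this matched-count accumulation follows, assuming the kl_distance helper from the previous sketch and an offline-built dictionary of per-emotion, per-speaker templates; the data layout shown is an assumption for illustration only.

def matched_counts(test_dists, templates):
    """Accumulate matched counts per emotion category (illustrative sketch).

    test_dists : {"D1": (mean, cov), "D2": (mean, cov), "D3": (mean, cov)} for the input speech
    templates  : {emotion: {speaker_id: {"D1": (mean, cov), ...}}} built offline
    For each joint distribution and speaker, the template with the lowest KL
    distance wins and its emotion's matched count is incremented by one.
    """
    counts = {emotion: 0 for emotion in templates}
    speakers = {spk for per_spk in templates.values() for spk in per_spk}
    for name, (mean0, cov0) in test_dists.items():
        for spk in speakers:
            scores = {}
            for emotion, per_spk in templates.items():
                mean1, cov1 = per_spk[spk][name]
                scores[emotion] = kl_distance(mean0, cov0, mean1, cov1)
            counts[min(scores, key=scores.get)] += 1      # winning template's emotion
    return counts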

Figure 5C is a diagrammatic representation 590 illustrating an exemplary hierarchical decision tree for determining the final human emotion expression of the individual based on the matched count, according to one embodiment. Based on the matched counts for each emotion category, the matched count for emotions ‘angry’ or ‘happy’ is compared with the matched count for emotions ‘sad’ or ‘neutral’. If the matched count of emotions ‘angry’ or ‘happy’ is greater than the matched count of emotions ‘sad’ or ‘neutral’, the matched count of emotion ‘angry’ is compared with the matched count of emotion ‘happy’. If the matched count of emotion ‘angry’ is greater than the matched count of emotion ‘happy’, the expression recognition device 100 outputs the human emotion expression recognized from the voiced acoustic expression of the individual as ‘angry’. On the contrary, if the matched count of emotion ‘happy’ is greater than the matched count of emotion ‘angry’, the expression recognition device 100 outputs the human emotion expression recognized from the voiced acoustic expression of the individual as ‘happy’.

Similarly, if the matched count of emotions ‘sad’ or ‘neutral’ is greater than the matched count of emotions ‘angry’ or ‘happy’, the matched count of emotion ‘sad’ is compared with the matched count of emotion ‘neutral’. If the matched count of emotion ‘sad’ is greater than the matched count of emotion ‘neutral’, the expression recognition device 100 outputs human emotion expression recognized from the voiced acoustic expression of the individual as ‘sad’. On the contrary, if the matched count of emotion ‘neutral’ is greater than the matched count of emotion ‘sad’, the expression recognition device 100 outputs human emotion expression recognized from the voiced acoustic expression of the individual as ‘neutral’.
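The hierarchical decision described above and illustrated in Figure 5C may be sketched, for the four example emotion categories, as follows (the count keys are assumed to match those produced by the earlier matched-count sketch):

def decide_emotion(counts):
    """Hierarchical decision tree over matched counts: first angry/happy vs
    sad/neutral, then a comparison within the winning pair."""
    if counts["angry"] + counts["happy"] > counts["sad"] + counts["neutral"]:
        return "angry" if counts["angry"] > counts["happy"] else "happy"
    return "sad" if counts["sad"] > counts["neutral"] else "neutral"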

Figure 6 is a block diagram of an exemplary emotion recognition device 100, such as the one shown in Figure 1, showing various components for implementing embodiments of the present subject matter. In Figure 6, the emotion recognition device 100 includes the processor 102, the memory 104, a display 602, an input device 604, a cursor control 606, a read only memory (ROM) 608, and a bus 610.

The processor 102, as used herein, means any type of computational circuit, such as, but not limited to, a microprocessor, a microcontroller, a complex instruction set computing microprocessor, a reduced instruction set computing microprocessor, a very long instruction word microprocessor, an explicitly parallel instruction computing microprocessor, a graphics processor, a digital signal processor, or any other type of processing circuit. The processor 102 may also include embedded controllers, such as generic or programmable logic devices or arrays, application specific integrated circuits, single-chip computers, smart cards, and the like.

The memory 104 and the ROM 608 may be volatile memory and non-volatile memory. The memory 104 includes the human emotion expression detection module 106 for recognizing human emotion expression based on a speech signal corresponding to a voiced acoustic expression of an individual, according to one or more embodiments described above. A variety of computer-readable storage media may be stored in and accessed from the memory elements. Memory elements may include any suitable memory device(s) for storing data and machine-readable instructions, such as read only memory, random access memory, erasable programmable read only memory, electrically erasable programmable read only memory, hard drive, removable media drive for handling compact disks, digital video disks, diskettes, magnetic tape cartridges, memory cards, and the like.

Embodiments of the present subject matter may be implemented in conjunction with modules, including functions, procedures, data structures, and application programs, for performing tasks, or defining abstract data types or low-level hardware contexts. The human emotion expression detection module 106 may be stored in the form of machine-readable instructions on any of the above-mentioned storage media and may be executed by the processor 102. For example, a computer program may include machine-readable instructions, that when executed by the processor 102, cause the processor 102 to recognize human emotion expression based on a speech signal corresponding to a voiced acoustic expression of an individual, according to the teachings and herein described embodiments of the present subject matter. In one embodiment, the computer program may be included on a compact disk-read only memory (CD-ROM) and loaded from the CD-ROM to a hard drive in the non-volatile memory.

The bus 610 acts as an interconnect between the various components of the emotion recognition device 100. The components such as the display 602, the input device 604, and the cursor control 606 are well known to the person skilled in the art and hence the explanation thereof is omitted.

The present embodiments have been described with reference to specific example embodiments; it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments. Furthermore, the various devices, modules, and the like described herein may be enabled and operated using hardware circuitry, for example, complementary metal oxide semiconductor based logic circuitry, firmware, software and/or any combination of hardware, firmware, and/or software embodied in a machine readable medium. For example, the various electrical structures and methods may be embodied using transistors, logic gates, and electrical circuits, such as an application specific integrated circuit.

Documents

Application Documents

# Name Date
1 2013_SAIT_443_Form 5.pdf 2013-10-29
2 2012_SAIT_443_Drawings for filing.pdf 2013-10-29
3 2012_SAIT_443_CS for filing.pdf 2013-10-29
4 4845-CHE-2013 FORM-1 16-12-2013.pdf 2013-12-16
5 4845-CHE-2013 CORRESPONDENCE OTHERS 16-12-2013.pdf 2013-12-16
6 4854-CHE-2013 CORRESPONDENCE OTHERS 07-01-2014.pdf 2014-01-07
7 4854-CHE-2013 POWER OF ATTORNEY 07-01-2014.pdf 2014-01-07
8 abstract 4854-CHE-2013.jpg 2014-09-18
9 4854-CHE-2013-AMENDED DOCUMENTS [23-07-2019(online)].pdf 2019-07-23
10 4854-CHE-2013-RELEVANT DOCUMENTS [23-07-2019(online)].pdf 2019-07-23
11 4854-CHE-2013-FORM 13 [23-07-2019(online)].pdf 2019-07-23
12 4854-CHE-2013-FER.pdf 2019-11-15
13 4854-CHE-2013-ABSTRACT [18-04-2020(online)].pdf 2020-04-18
14 4854-CHE-2013-CLAIMS [18-04-2020(online)].pdf 2020-04-18
15 4854-CHE-2013-FER_SER_REPLY [18-04-2020(online)].pdf 2020-04-18
16 4854-CHE-2013-OTHERS [18-04-2020(online)].pdf 2020-04-18
17 4854-CHE-2013-FORM-26 [16-07-2021(online)].pdf 2021-07-16
18 4854-CHE-2013-IntimationOfGrant02-02-2023.pdf 2023-02-02
19 4854-CHE-2013-PatentCertificate02-02-2023.pdf 2023-02-02

Search Strategy

1 search_15-11-2019.pdf

ERegister / Renewals

3rd: 30 Mar 2023 (From 28/10/2015 To 28/10/2016)
4th: 30 Mar 2023 (From 28/10/2016 To 28/10/2017)
5th: 30 Mar 2023 (From 28/10/2017 To 28/10/2018)
6th: 30 Mar 2023 (From 28/10/2018 To 28/10/2019)
7th: 30 Mar 2023 (From 28/10/2019 To 28/10/2020)
8th: 30 Mar 2023 (From 28/10/2020 To 28/10/2021)
9th: 30 Mar 2023 (From 28/10/2021 To 28/10/2022)
10th: 30 Mar 2023 (From 28/10/2022 To 28/10/2023)
11th: 09 Oct 2023 (From 28/10/2023 To 28/10/2024)
12th: 28 Sep 2024 (From 28/10/2024 To 28/10/2025)
13th: 28 Oct 2025 (From 28/10/2025 To 28/10/2026)