Abstract: The present invention discloses a method and device for recognizing human emotion from a video stream. The method comprises detecting a face in the video stream, aligning a plurality of fiducial points on the detected face, detecting an emotion state of the face by analyzing an appearance change in the face of the video stream with respect to an initial neutral frame of the video stream, predicting action units based on the detected emotion state of the face, and recognizing the human emotion based on the predicted one or more action units. Predicting the action units based on the detected emotion state includes predicting one or more action units (AU) based on an extracted plurality of appearance based features, predicting one or more action units by detecting a geometrical change in a facial part using the fiducial points, and predicting one or more action units by detecting a change in appearance based features in the face. Figure 3
CLAIMS:
1. A method of recognizing human emotion expressions in a video stream comprising:
detecting one or more faces in the video stream;
aligning a plurality of fiducial points on the detected face in the video stream;
detecting an emotion state of the face by analyzing an appearance change in the face of the video stream with respect to an initial neutral frame of the video stream;
predicting one or more action units based on the detected emotion state of the face; and
recognizing the human emotion based on the predicted one or more action units.
2. The method as claimed in claim 1, further comprising:
training an offline module using a plurality of images.
3. The method as claimed in claim 2, wherein the training of the offline module using the plurality of images comprises:
detecting the face in the video stream and labeling one or more action units (AU) based on peak expression area of the detected face;
cropping the detected face in a pre-defined size by maintaining a pre-defined ratio between different facial parts;
selecting the region of interest on the cropped face for each of the action units based on the facial parts contributing to the action unit;
extracting one or more appearance based features by applying one or more filter banks and constructing one or more feature vectors;
selecting one or more appearance based features by a feature reduction module, using shape of the action unit, confidence measure of the action unit and support for the action unit in the face;
training a support vector machine (SVM) classifier for each action unit for a frame level data to generate AU models; and
storing the generated action unit models.
4. The method as claimed in claim 1, wherein predicting one or more action units based on the detected appearance change in the face comprises:
predicting one or more action units (AU) based on an extracted plurality of appearance based features;
predicting one or more action units by detecting geometrical change in at least one facial part using the fiducial point; and
predicting one or more action units by detecting change in appearance based features in the face around the fiducial points based on a pre-defined threshold logic.
5. The method as claimed in claims 1 and 4, wherein predicting one or more action units (AU) based on the extracted plurality of appearance based features comprises:
cropping the detected face in a pre-defined size by maintaining a pre-defined ratio between different facial parts;
selecting the region of interest on the cropped face for each of the selected action units based on the facial parts present in the region of interest;
extracting one or more appearance based features by applying one or more filter banks and constructing one or more feature vectors; and
predicting one or more action units corresponding to the appearance based features using the pre-stored action unit models in the offline training module.
6. The method as claimed in claims 1 and 4, wherein predicting the one or more action units by detecting geometrical change in at least one facial part using the fiducial point comprises:
detecting geometrical change in one or more facial parts with respect to fiducial points; and
obtaining the geometrical features based on fiducial points, wherein the fiducial points are moved with respect to a pre-defined reference point.
7. The method as claimed in claim 1, wherein predicting the one or more action units by detecting change in appearance based features in the face comprises:
extracting texture histogram feature at each emotion key point by considering a local region; and
applying a threshold based detection logic to detect the change in appearance of the detected face in the video as compared to the initial neutral face.
8. The method as claimed in claims 1 and 5 to 7, wherein recognizing the human emotion expression comprises:
collecting the predicted one or more action units;
determining one or more final action units from the collected predicted one or more action units; and
mapping the final one or more action units to one or more corresponding emotions using a statistical relationship between the selected action units and emotions.
9. A device for recognizing human emotion expressions in a video stream comprising:
a fiducial point fitting module for aligning a plurality of fiducial points on a face in the video stream;
a change detection module coupled with the fiducial point fitting module for detecting an emotion state of the face based on an appearance change in the face of the video stream with respect to an initial neutral frame of the video stream;
an AU prediction module coupled with the change detection module for predicting one or more action units based on the detected emotion state; and
a mapping module coupled with the AU prediction module for mapping the one or more action units to one or more corresponding emotions.
10. The device as claimed in claim 9, further comprising:
an offline training module for training using a plurality of images.
11. The device as claimed in claim 9, wherein the offline training module comprises:
a face detection module for detecting the face in the video stream and labeling one or more action units (AU) based on peak expression area of the detected face;
a face cropper module coupled with the face detection module for cropping the detected face in a pre-defined size by maintaining a pre-defined ratio between different facial parts;
a region of interest selection module coupled with the face cropper module for selecting the region of interest on the cropped face for each of the action units based on the facial parts present in the region of interest;
an appearance based feature extraction module coupled with the region of interest selection module for extracting one or more appearance based features by applying a filter bank and constructing one or more feature vectors;
a feature reduction module coupled with the appearance based feature extraction module for selecting one or more appearance based features using the shape of the action unit, a confidence measure of the action unit and support for the action unit in the face;
an AU modelling module coupled with the feature reduction module for training a support vector machine (SVM) classifier for each action unit for frame level data to generate AU models; and
a pre-stored AU model database for storing the generated action unit models.
12. The device as claimed in claim 9, wherein the AU prediction module comprises:
an appearance based feature AU prediction module for predicting one or more action unit (AU) based on a plurality of appearance based features;
a geometrical change based AU prediction module coupled with the appearance based feature AU prediction module for predicting one or more action units by detecting geometrical change in at least one facial part using the fiducial point; and
an appearance change based AU prediction module coupled with the geometrical change based AU prediction module for predicting one or more action units by detecting change in appearance based features in the face around the fiducial points based on a pre-defined threshold logic.
13. The device as claimed in claim 12, wherein the appearance based feature AU prediction module comprises:
a face and pupil detection module for detecting the face in the video stream;
a face cropper module for cropping the detected face in a pre-defined size by maintaining a pre-defined ratio between different facial parts;
a region of interest selection module for selecting the region of interest on the cropped face for each of the selected action units based on the facial parts present in the region of interest;
an appearance based feature extraction module for extracting one or more appearance based features by applying the filter bank and constructing one or more feature vectors; and
a feature based AU prediction module for predicting one or more action units corresponding to the appearance based features using the pre-stored action unit models in the offline training module.
14. The device as claimed in claim 12, wherein the geometrical change based AU prediction module is adapted for:
detecting geometrical change in one or more facial parts with respect to the fiducial points in the initial neutral face; and
obtaining the geometrical features based on fiducial points, wherein the fiducial points are moved with respect to a pre-defined reference point.
15. The device as claimed in claim 12, wherein the appearance change based AU prediction module is adapted for:
extracting texture histogram feature at each emotion key point by considering a local region; and
applying a threshold based detection logic to detect the change in appearance of the detected face in the video as compared to the initial neutral face.
FIELD OF THE INVENTION
The present invention generally relates to the field of human emotion expression recognition, and more particularly relates to a method and apparatus for recognizing human emotion from a video stream by detecting the appearance and movement of different facial parts.
BACKGROUND OF THE INVENTION
Emotion is a subjective, conscious experience that is characterized primarily by psychophysiological expressions, biological reactions, and mental states. Detection of emotions such as Happy, Anger, Sad, etc. helps to interpret the contextual background and could be beneficial for inferring user’s engagement level, level of frustration, level of depression etc., especially for the user scenarios such as user mood monitoring systems, user health monitoring systems, and automatic media recommendation systems and so on.
Typically, human emotion expression recognition is performed using either facial expression recognition or speech recognition. Human emotion expression recognition methods based on facial expression include differentiation of appearance and geometric changes of facial parts. Typically, the facial expression specific information present on a face is encoded by a set of muscle action units (AU). In a facial expression recognition system, AUs are detected and analyzed to infer the underlying human emotions. Efficient feature description of the face plays a crucial role in detecting facial expressions accurately. The feature descriptor needs to be robust to the challenges of realistic scenarios, such as a wide range of illumination changes, errors in face tracking and fiducial point detection, variations in image resolution or scale, and the presence of head pose variations.
The rate of false alarms for emotion expressions that are predicted based on the appearance or the muscle movement of the face increases if the detected AUs are not appropriate. Further, the movements of some facial parts are similar for several emotions, which also increases the rate of false alarms.
SUMMARY OF THE INVENTION
An objective of the present invention is to provide a method and system for recognizing human emotion expressions in a video stream.
Another objective of the present invention is to reduce false positives and overlooks (missed detections) in an emotion detection system.
An embodiment of the present invention describes a method of recognizing human emotion expressions in a video stream. The method comprises detecting one or more faces in the video stream, aligning a plurality of fiducial points on the detected face in the video stream, detecting an emotion state of the face by analyzing an appearance change in the face of the video stream with respect to an initial neutral frame of the video stream, predicting one or more action units based on the detected emotion state of the face, and recognizing the human emotion based on the predicted one or more action units. The method further comprises training an offline module using a plurality of images.
Another embodiment of the present invention describes a method of training an offline module using a plurality of images. The method comprises detecting the face in the video stream and labeling one or more action units (AU) based on the peak expression area of the detected face, cropping the detected face in a pre-defined size by maintaining a pre-defined ratio between different facial parts, selecting the region of interest on the cropped face for each of the action units based on the facial parts contributing to the action unit, extracting one or more appearance based features by applying one or more filter banks and constructing one or more feature vectors, selecting one or more appearance based features by a feature reduction module using the shape of the action unit, a confidence measure of the action unit and support for the action unit in the face, training a support vector machine (SVM) classifier for each action unit for frame level data to generate AU models, and storing the generated action unit models.
In one embodiment of the present invention, predicting one or more action units based on the detected appearance change in the face comprises predicting one or more action units (AU) based on an extracted plurality of appearance based features, predicting one or more action units by detecting geometrical change in at least one facial part using the fiducial point, and predicting one or more action units by detecting change in appearance based features in the face around the fiducial points based on a pre-defined threshold logic.
According to another aspect of the present invention, predicting one or more action units (AU) based on the extracted plurality of appearance based features comprises cropping the detected face in a pre-defined size by maintaining a pre-defined ratio between different facial parts, selecting the region of interest on the cropped face for each of the selected action units based on the facial parts present in the region of interest, extracting one or more appearance based features by applying one or more filter banks and constructing one or more feature vectors, and predicting one or more action units corresponding to the appearance based features using the pre-stored action unit models in the offline training module.
Furthermore, predicting the one or more action units by detecting geometrical change in at least one facial part using the fiducial point comprises detecting geometrical change in one or more facial parts with respect to the fiducial points, and obtaining the geometrical features based on the fiducial points, wherein the fiducial points are moved with respect to a pre-defined reference point. The prediction of one or more action units by detecting change in appearance based features in the face comprises extracting a texture histogram feature at each emotion key point by considering a local region, and applying a threshold based detection logic to detect the change in appearance of the detected face in the video as compared to the initial neutral face.
According to another embodiment of the present invention, recognizing the human emotion comprises collecting the predicted one or more action units, determining one or more final action units from the collected predicted one or more action units, and mapping the final one or more action units to one or more corresponding emotions using a statistical relationship between the selected action units and emotions.
In yet another embodiment of the present invention, a device for recognizing human emotion expressions in a video stream comprises a fiducial point fitting module for aligning a plurality of fiducial points on a face in the video stream, a change detection module coupled with the fiducial point fitting module for detecting an emotion state of the face based on an appearance change in the face of the video stream with respect to an initial neutral frame of the video stream, an AU prediction module coupled with the change detection module for predicting one or more action units based on the detected emotion state, and a mapping module coupled with the AU prediction module for mapping the one or more action units to one or more corresponding emotions.
The device according to the present invention further comprises an offline training module for training using a plurality of images. The offline training module comprises a face detection module for detecting the face in the video stream and labeling one or more action units (AU) based on the peak expression area of the detected face, a face cropper module coupled with the face detection module for cropping the detected face in a pre-defined size by maintaining a pre-defined ratio between different facial parts, a region of interest selection module coupled with the face cropper module for selecting the region of interest on the cropped face for each of the action units based on the facial parts present in the region of interest, an appearance based feature extraction module coupled with the region of interest selection module for extracting one or more appearance based features by applying a filter bank and constructing one or more feature vectors, a feature reduction module coupled with the appearance based feature extraction module for selecting one or more appearance based features using the shape of the action unit, a confidence measure of the action unit and support for the action unit in the face, an AU modelling module coupled with the feature reduction module for training a support vector machine (SVM) classifier for each action unit for frame level data to generate AU models, and a pre-stored AU model database for storing the generated action unit models.
Yet another embodiment of the present invention describes the AU prediction module, which comprises an appearance based feature AU prediction module for predicting one or more action units (AU) based on a plurality of appearance based features, a geometrical change based AU prediction module coupled with the appearance based feature AU prediction module for predicting one or more action units by detecting geometrical change in at least one facial part using the fiducial point, and an appearance change based AU prediction module coupled with the geometrical change based AU prediction module for predicting one or more action units by detecting change in appearance based features in the face around the fiducial points based on a pre-defined threshold logic.
According to one aspect of the present invention, the appearance based feature AU prediction module comprises a face and pupil detection module for detecting the face in the video stream, a face cropper module for cropping the detected face in a pre-defined size by maintaining a pre-defined ratio between different facial parts, a region of interest selection module for selecting the region of interest on the cropped face for each of the selected action units based on the facial parts present in the region of interest, an appearance based feature extraction module for extracting one or more appearance based features by applying the filter bank and constructing one or more feature vectors, and a feature based AU prediction module for predicting one or more action units corresponding to the appearance based features using the pre-stored action unit models in the offline training module.
The geometrical change based AU prediction module according to the present invention is adapted for detecting geometrical change in one or more facial parts with respect to the fiducial points in the initial neutral face, and obtaining the geometrical features based on the fiducial points, wherein the fiducial points are moved with respect to a pre-defined reference point.
The appearance change based AU prediction module according to the present invention is adapted for extracting a texture histogram feature at each emotion key point by considering a local region, and applying a threshold based detection logic to detect the change in appearance of the detected face in the video as compared to the initial neutral face.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The aforementioned aspects and other features of the present invention will be explained in the following description, taken in conjunction with the accompanying drawings, wherein:
Figure 1 is a block diagram of an exemplary emotion recognition device, according to one embodiment.
Figure 2 is an exploded view of a human emotion detection module, according to an embodiment herein.
Figure 2A is an exploded view of an appearance based action unit prediction module, such as the one shown in Figure 2, according to an embodiment herein.
Figure 3 is a flowchart illustrating an exemplary method of recognizing human emotion from faces detected in a video stream, according to one embodiment.
Figure 4 is an exploded view of an offline training module, according to an embodiment herein.
Figure 5 is a flowchart illustrating an exemplary method of training an offline training module, according to one embodiment.
Figure 6 is a flowchart illustrating an exemplary method of predicting action units based on the appearance based feature detection, according to one embodiment.
Figure 7 is a flowchart illustrating an exemplary method of predicting action units based on the geometrical change in facial parts, according to one embodiment.
Figure 8 is a flowchart illustrating an exemplary method of predicting action units based on the change in appearance, according to one embodiment.
Figure 9 is a block diagram of an emotion recognition device, such as those shown in Figure 1, showing various components for implementing embodiments of the present subject matter.
Although specific features of the present invention are shown in some drawings and not in others, this is done for convenience only, as each feature may be combined with any or all of the other features in accordance with the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The embodiments of the present invention will now be described in detail with reference to the accompanying drawings. However, the present invention is not limited to these embodiments. The present invention can be modified in various forms. Thus, the embodiments of the present invention are only provided to explain the present invention more clearly to those of ordinary skill in the art. In the accompanying drawings, like reference numerals are used to indicate like components.
The specification may refer to “an”, “one” or “some” embodiment(s) in several locations. This does not necessarily imply that each such reference is to the same embodiment(s), or that the feature only applies to a single embodiment. Single features of different embodiments may also be combined to provide other embodiments.
As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms “includes”, “comprises”, “including” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations and arrangements of one or more of the associated listed items.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
A method and device for recognizing human emotion from a video stream is disclosed. In the following detailed description of the embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
Figure 1 is a block diagram of an exemplary emotion recognition device 100, according to one embodiment. The emotion recognition device 100 includes a processor 102 and a memory 104. The emotion recognition device 100 may be a laptop, a desktop, a smart phone, a tablet, a special purpose computer and the like. It is understood that the present invention can be implemented in hardware, software, or a combination thereof. In a hardware implementation, the present invention can be implemented with one of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microprocessor, other electronic units, or a combination thereof, which are designed to recognize emotion from a video signal. In a software implementation, the present invention can be implemented with a module (e.g., a human emotion detection module 106) for detecting human emotion from a video stream. The software module is stored in the memory unit 104 and executed by the processor 102. Various means widely known to those skilled in the art can be used as the memory unit 104 or the processor 102.
Figure 2 is an exploded view of a human emotion detection module, according to an embodiment herein. The human emotion detection module 106 according to the present invention includes a fiducial point fitting module 204, a change detection module 206, an AU prediction module 210, a facial biases prediction module 216 and a mapping module 218. The AU prediction module 210 includes an appearance based feature action unit prediction module 208, a geometrical change based AU prediction module 212 and an appearance change based AU prediction module 214.
The fiducial point fitting module 204 detects one or more faces in the incoming video stream and the corresponding fiducial points on static images or video sequences in real time. Each detected face is treated as an individual entity and its emotion is detected in parallel. A plurality of faces and fiducial points are stored in the emotion recognition device 100 by training an offline module. The process of training the offline module is explained in Figure 4 and Figure 5. The detected fiducial points are used for accurate face alignment and cropping. Moreover, the fiducial points help to detect geometrical movement and appearance change occurring in certain regions of the face. Accurate detection of the fiducial points on the face is done with the help of Constrained Local Model (CLM) based fiducial point fitting. CLM combines the output of classifiers trained on local patch descriptors from the neighborhood of the fiducial points with the global constraints learnt on a shape model to find the best possible locations of the points on the new face.
The constrained local model (CLM) point tracking is made more accurate by:
splitting the shape model of the face into parts based on the strength of the non-rigid behavior of the face detected in the video stream; and
applying a sequence of Haar-detectors to initialize and localize the search region, and eventually exploiting these modifications to reduce the search iteration complexity, as sketched below.
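The Haar-detector based initialization can be illustrated with a short sketch. This is a minimal example, assuming OpenCV's bundled frontal-face cascade and typical detection parameters; the CLM fitting itself (patch experts plus shape-model constraints) is not reproduced here.

```python
# Minimal sketch: seeding the fiducial-point search region with a Haar detector.
# Assumes OpenCV's bundled frontal-face cascade; parameters are illustrative.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def initial_search_region(gray_frame):
    """Return (x, y, w, h) of the largest detected face, or None.

    The box only seeds the CLM search, which keeps the number of search
    iterations small compared to scanning the whole frame.
    """
    faces = face_cascade.detectMultiScale(
        gray_frame, scaleFactor=1.1, minNeighbors=5, minSize=(60, 60))
    if len(faces) == 0:
        return None
    return max(faces, key=lambda box: box[2] * box[3])
```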
The change detection module 206 detects the change in the face with reference to the initial frame. For instance, the initial frame is considered as the neutral face. The change detection module 206 generates a set of key emotion points with respect to the CLM points such that they are sensitive to facial emotion changes and robust to alignment errors at the same time. A Local Ternary Pattern (LTP) histogram computed at each emotion point in the registered neutral frame is used as a model for that point. The same procedure is applied to the subsequent frames, and changes are finally detected based on the similarity between the neutral face in the initial frame and the current image.
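A minimal sketch of the LTP histogram used as the per-point neutral model is given below. It uses full 256-bin upper/lower codes rather than the 59 uniform-pattern bins mentioned later, and the histogram-intersection similarity is an illustrative assumption.

```python
import numpy as np

def ltp_histogram(patch, t=5):
    """Upper/lower Local Ternary Pattern histograms of a grayscale patch.

    Each pixel is compared with its 8 neighbours; differences above +t set a
    bit in the 'upper' code, differences below -t in the 'lower' code. Border
    pixels are skipped. Returns a normalised 512-dimensional histogram.
    """
    h, w = patch.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    upper, lower = np.zeros(256), np.zeros(256)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = int(patch[y, x])
            up_code = lo_code = 0
            for bit, (dy, dx) in enumerate(offsets):
                d = int(patch[y + dy, x + dx]) - c
                if d > t:
                    up_code |= 1 << bit
                elif d < -t:
                    lo_code |= 1 << bit
            upper[up_code] += 1
            lower[lo_code] += 1
    hist = np.concatenate([upper, lower])
    return hist / (hist.sum() + 1e-9)

def similarity_to_neutral(current_patch, neutral_hist, t=5):
    # Histogram intersection: 1.0 means an identical texture distribution.
    return np.minimum(ltp_histogram(current_patch, t), neutral_hist).sum()
```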
The appearance based feature action unit prediction module 208 includes a face and pupil detection module 230, a face cropper module 232, a region of interest selection module 234, an appearance based feature extraction module 236 and an action unit prediction module 238. An exploded view of the appearance based feature action unit prediction module 208 is illustrated in Figure 2A. The appearance based feature action unit prediction module 208 predicts a plurality of action units based on the extracted appearance based features and a plurality of action units stored in the pre-stored AU model database 209 in the emotion detection device, as disclosed in Figure 4 and Figure 5.
According to one embodiment of the present invention, the geometrical change based action unit prediction module 212 uses the initial neutral face registered for the image to decide whether the CLM points for the current frame have moved with respect to the reference points, where the reference points are set in the initial neutral face. The currently fitted CLM points are corrected for fine variations and taken to the mean shape space for computing the geometric variations. The mean shape space is the common space for computing all geometric and appearance based features related to the fiducial points. The mean shape of the human face is learned offline using a large dataset and stored for further alignment of all input face shapes.
In order to detect the geometric variation, the facial parts that deliver a high range of emotion specific movements, such as the eyebrows and lips, are considered. The geometrical change based AU prediction module 212 measures the amount of vertical upward and downward shift in the selected facial part by comparing the currently fitted CLM points (after correcting for the fine variations) against the reference facial parts. For instance, if the facial part considered for detecting geometrical change is the eyebrows, the geometrical change based AU prediction module 212 interprets an upward vertical shift beyond a threshold as AU3 and a downward vertical shift beyond a threshold as AU4. The decisions on AU3 and AU4 are made only if the current nose points are seen to be geometrically stable with respect to the reference nose points.
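A minimal sketch of this rule is given below, assuming the fiducial points have already been mapped to the mean shape space; the point indices and thresholds are illustrative assumptions, and the AU3/AU4 labels follow the description above.

```python
import numpy as np

def predict_brow_aus(neutral_pts, current_pts, brow_idx, nose_idx,
                     shift_thresh=4.0, nose_thresh=2.0):
    """Predict brow AUs from vertical fiducial-point shifts (illustrative).

    neutral_pts, current_pts: (N, 2) arrays of fiducial points already taken
    to the mean shape space. brow_idx / nose_idx are index lists (assumed).
    Image coordinates grow downward, so a negative dy is an upward shift.
    """
    dy_brow = np.mean(current_pts[brow_idx, 1] - neutral_pts[brow_idx, 1])
    dy_nose = np.abs(current_pts[nose_idx, 1] - neutral_pts[nose_idx, 1]).max()
    aus = []
    if dy_nose < nose_thresh:          # decide only when the nose is stable
        if dy_brow < -shift_thresh:    # brows moved up beyond the threshold
            aus.append("AU3")
        elif dy_brow > shift_thresh:   # brows moved down beyond the threshold
            aus.append("AU4")
    return aus
```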
The appearance change based action unit prediction module 214 predicts one or more action units corresponding to the change in appearance of the face detected in the video stream. According to one embodiment of the present invention, the appearance change based action unit prediction module 214 extracts LTP histograms at each emotion key point of the detected face, such as the lips, eyebrows, chin, etc. The histogram is 59-dimensional. For emotion points such as the forehead, the entire region is not considered; instead, only one quarter of the region in both directions is considered for histogram evaluation, as the forehead emotion changes are prominent in those regions. The histograms used in the present invention incorporate the local texture information. The similarities at paired emotion points with respect to the initial neutral face are fused using MAX logic, and the resultant value is compared with a predefined threshold value. Unpaired emotion points are directly compared with the predefined threshold values. Further, the appearance change based action unit prediction module 214 obtains a plurality of similarity values for appearance changes in regions such as the cheek, eyebrows, forehead, below the mouth, above the mouth, the mouth corner and the mouth middle. The similarity values are compared with a predefined threshold, and the result of the comparison is used to determine the change in appearance of the selected region. A statistical model for the neutral appearance is created for every key emotion point using the LTP histograms extracted from the initial neutral reference frames. The similarity of the current histogram to its respective statistical model is evaluated using the Mahalanobis distance measure. The threshold values for each key emotion point are different and are evaluated empirically. The similarity values are computed from the statistical vector.
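A minimal sketch of the per-point statistical model and Mahalanobis-distance test is shown below. The regularised covariance and the per-point thresholds are assumptions; in practice the thresholds are evaluated empirically as stated above.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def build_neutral_model(neutral_hists, ridge=1e-3):
    """Mean and inverse covariance of LTP histograms from neutral frames.

    neutral_hists: (n_frames, dim) array. The small ridge term keeps the
    covariance invertible when only a few neutral frames are available.
    """
    mean = neutral_hists.mean(axis=0)
    cov = np.cov(neutral_hists, rowvar=False)
    cov += ridge * np.eye(neutral_hists.shape[1])
    return mean, np.linalg.inv(cov)

def appearance_changed(current_hist, mean, inv_cov, threshold):
    """Flag a change when the distance to the neutral model exceeds the
    empirically chosen per-point threshold."""
    return mahalanobis(current_hist, mean, inv_cov) > threshold

def fuse_paired(dist_left, dist_right):
    # Paired emotion points (e.g. left/right brow) are fused with MAX logic
    # before thresholding.
    return max(dist_left, dist_right)
```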
The mapping module 218 collects the action units predicted by the appearance based feature action unit prediction module 208, the geometrical change based action unit prediction module 212 and the appearance change based action unit prediction module 214 to decide on the final emotion present on the face detected from the video signal. A final AU is determined from the predicted set of action units based on the number of occurrences of that particular AU. The mapping module maps the set of detected AUs (viz., cheek raiser (AU6), brow lowerer (AU4)) to one or more emotions (viz., anger, fear, sadness, happiness, surprise, disgust, etc.) using statistical relationships between the AUs and emotions and a distance measure based on the cost calculated using the Longest Common Subsequence (LCS) technique.
Figure 2A is an exploded view of the appearance based action unit prediction module 208, according to an embodiment herein. The appearance based action unit prediction module 208 includes the face and pupil detection module 230, the face cropper module 232, the region of interest selection module 234, the appearance based feature extraction module 236 and the action unit prediction module 238.
The face and pupil detection module 230 localizes the faces to be analyzed in the video stream using the fiducial points. The fiducial points are detected using the Constrained Local Model (CLM) method. In order to perform the localization, the face is first centralized and then rotated based on the angle made by the straight line connecting the pupils with the X-axis. For scaling, the pupils are used as reference points on the rotated face image and a fixed distance is maintained between them. The nose and chin points are further used to make sure the complete face gets enclosed in a predefined fixed size template. The face cropper module 232 uses the localized face and marked pupil points to crop the face region while maintaining a standard ratio between the various facial features, like the eyes, mouth and chin, using variable scale factors. The face is rotated and scaled to obtain normalized faces of uniform size. The region of interest selection module 234 selects the region of interest (ROI) on the face to be used for each of the action units (AU) by understanding the relationship between the face regions and the AU. The region of interest selection module 234 according to the present invention plays a major role in the reduction of false positives and overlooks.
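A minimal sketch of the pupil-based rotation, scaling and cropping is shown below; the inter-pupil distance, template size and eye placement are illustrative assumptions.

```python
import numpy as np
import cv2

def align_face(image, left_pupil, right_pupil, eye_dist=60, out_size=(128, 160)):
    """Rotate and scale a face so the pupils are level and a fixed distance
    apart, then crop a fixed-size template.

    left_pupil, right_pupil: (x, y) pixel coordinates of the pupils.
    out_size: (width, height) of the cropped template.
    """
    (lx, ly), (rx, ry) = left_pupil, right_pupil
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))      # tilt of the eye line
    scale = eye_dist / max(np.hypot(rx - lx, ry - ly), 1e-6)
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Shift so the eye midpoint lands at a fixed position in the template.
    M[0, 2] += out_size[0] / 2.0 - center[0]
    M[1, 2] += out_size[1] * 0.35 - center[1]             # eyes ~35% from the top
    return cv2.warpAffine(image, M, out_size)
```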
The appearance based feature extraction module 236 derives features from the cropped face by applying filter banks and constructs feature vectors of large size in order to capture the appearance of facial events uniquely, representing local events (textures) around each pixel. Two methods are used for extracting the features: a Gabor feature extraction method using a Gabor filter bank, and a Spin Local Gradient Binary Patterns (LGBP) based method.
A Gabor filter is a linear filter used for edge detection. The frequency and orientation representations of Gabor filters are similar to those of the human visual system, and the filters are appropriate for texture representation and discrimination. The Gabor feature extraction method generates smooth gradient responses of the detected face at M scales and N orientations.
An equation for calculating the Gabor feature for a detected face is represented as follows:
ψ_k(x) = (‖k‖²/σ²) · exp(−‖k‖²‖x‖²/(2σ²)) · [exp(i k·x) − exp(−σ²/2)]
where k is the wave vector determining the scale and orientation of the filter, x is the pixel position, and σ controls the width of the Gaussian envelope.
The Gabor features capture the emphasis (magnitude of response) of different facial events at various scales and orientations.
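A minimal sketch of a Gabor filter bank and its magnitude responses is shown below; the number of scales and orientations (M = 5, N = 8) and the wavelength and bandwidth settings are illustrative assumptions.

```python
import numpy as np
import cv2

def gabor_magnitude_responses(face, scales=5, orientations=8):
    """Gabor magnitude responses of a cropped face at M scales and N orientations.

    cv2.getGaborKernel returns a real (cosine-phase) kernel, so the magnitude
    is obtained by combining the 0 and pi/2 phase responses.
    """
    face = np.float32(face)
    responses = []
    for s in range(scales):
        lambd = 4.0 * (2 ** (0.5 * s))        # wavelength grows with scale
        sigma = 0.56 * lambd                  # envelope tied to the wavelength
        ksize = int(6 * sigma) | 1            # odd kernel size covering the envelope
        for o in range(orientations):
            theta = np.pi * o / orientations
            k_even = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5, 0)
            k_odd = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5, np.pi / 2)
            re = cv2.filter2D(face, cv2.CV_32F, k_even)
            im = cv2.filter2D(face, cv2.CV_32F, k_odd)
            responses.append(cv2.magnitude(re, im))
    return responses                          # M * N magnitude images
```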
The Spin Local Gradient Binary Patterns (LGBP) based method generates a code/pattern for each pixel in each of the detected faces obtained from the face and pupil detection module 230 by integrating the relative information around the pixel at a radius that is a function of the corresponding response image scale. The Local Gradient Binary Pattern features collect the features extracted through the SPIN support with P radial distances and Q angles of the detected face. The method steps involved in extracting features and predicting AUs based on appearance using the SPIN LGBP method include segmenting each LGBP response image into blocks, for example four quadrants, imposing the SPIN support on each of the blocks, calculating the histogram of patterns falling under each of the sub-blocks in the SPIN support, and concatenating the calculated histogram patterns to determine a feature vector.
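The block-wise histogram construction can be sketched as follows. This is a simplified illustration: the radial SPIN support is approximated by a rectangular grid, and the 'nri_uniform' LBP variant with 8 sampling points is used because it yields the 59-bin histograms mentioned elsewhere in the description.

```python
import numpy as np
import cv2
from skimage.feature import local_binary_pattern

def lgbp_feature_vector(responses, grid=(2, 2), points=8, radius=1):
    """Block-wise LBP histograms over filter response images, concatenated
    into a single feature vector. Grid size and LBP radius are assumptions."""
    feats = []
    for resp in responses:
        # Normalise the response to 8-bit before computing LBP codes.
        img = cv2.normalize(resp, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        codes = local_binary_pattern(img, points, radius, method="nri_uniform")
        h, w = codes.shape
        bh, bw = h // grid[0], w // grid[1]
        for by in range(grid[0]):
            for bx in range(grid[1]):
                block = codes[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
                # 59 non-rotation-invariant uniform patterns for P = 8.
                hist, _ = np.histogram(block, bins=59, range=(0, 59), density=True)
                feats.append(hist)
    return np.concatenate(feats)
```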
The action unit prediction module 238 predicts the presence or absence of each action unit using the respective feature vectors extracted from the module 236. The AU prediction is based on Support Vector Machine based classification which uses the offline pre-stored model for each action unit.
Figure 3 is a flowchart illustrating an exemplary method of recognizing human emotion in a video stream, according to one embodiment. At step 304, a video stream is input to the human emotion detection module 106. At step 303, one or more faces in the video stream are recognized and sent to the human emotion detection module 106. At step 306, the fiducial points are fitted to the detected human face using the technique of constrained local model fitting. At step 308, a change in the initially recognized face is detected. No emotion is detected if there is no change in the initially recognized face. The emotions for the faces are detected only if there exists an appearance change in the recognized face with respect to the initial face, as shown in step 310.
Further, one or more action units are predicted for detecting the human emotion. Action units (AUs) are representations of the muscular activity that produces facial appearance changes. According to one embodiment of the present invention, three methods are used to predict one or more action units corresponding to a recognized face. The methods for predicting AUs include the AU prediction method using appearance based features (step 314), the geometrical change based AU prediction method (step 318) and the appearance change based AU prediction method (step 320).
For the AU prediction method using appearance based features, an offline training module is trained to create a pre-stored AU model database. A detailed explanation for appearance based AU prediction is illustrated in Figure 6.
At step 318, action units are predicted based on the geometrical change of facial parts. The AU prediction method based on geometrical change in facial part as shown in step 318 is explained in detail in Figure 7.
At step 320, action units corresponding to the change in appearance are detected. The change in appearance of each face is detected with respect to an initially registered neutral face. The AU prediction based on change in appearance as shown in step 320 is illustrated in detail in Figure 8.
At step 322, one or more final action units are determined from the different predicted action units. According to one embodiment of the present invention, predictions of the same action unit by different prediction methods are detected. If one particular action unit is predicted by at least two prediction methods, then that particular action unit is considered as a final action unit. Likewise, a different combination of prediction methods may be valid for particular action units. Considering all of these, the final action units are selected.
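A minimal sketch of this 2-of-3 voting rule is shown below; per-AU exceptions, where a single method may suffice, are not modelled here.

```python
from collections import Counter

def final_action_units(appearance_aus, geometric_aus, change_aus, min_votes=2):
    """Keep an AU as 'final' when at least `min_votes` of the three prediction
    methods agree on it. Each argument is the list of AUs one method predicted."""
    votes = Counter(appearance_aus) + Counter(geometric_aus) + Counter(change_aus)
    return sorted(au for au, n in votes.items() if n >= min_votes)
```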
At step 324, the action units predicted by the AU prediction methods are mapped to an emotion by using statistical relationships between the AUs and emotions and a suitable distance measure. The statistical relationship between the AUs and emotions is obtained using the discriminative power concept [H].
H = P(Y_j | X_i) − P~(Y_j | X_i)
where P(Y_j | X_i) is the probability of AU Y_j given that emotion X_i has occurred, and P~(Y_j | X_i) is the probability of AU Y_j given that emotion X_i has not occurred.
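A small sketch of estimating H from labelled training data is given below; the (emotion label, AU set) sample layout is an assumption made for illustration.

```python
def discriminative_power(samples, au, emotion):
    """H = P(AU | emotion) - P(AU | not emotion), estimated from labelled samples.

    samples: list of (emotion_label, set_of_AUs) pairs.
    """
    with_e = [aus for e, aus in samples if e == emotion]
    without_e = [aus for e, aus in samples if e != emotion]
    p_given_e = sum(au in aus for aus in with_e) / max(len(with_e), 1)
    p_given_not_e = sum(au in aus for aus in without_e) / max(len(without_e), 1)
    return p_given_e - p_given_not_e
```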
The test string of predicted AUs from the input image is matched against the template strings to find the emotion. An approximate string matching technique (Longest Common Subsequence - LCS) is used to compare the test and template strings. LCS helps in handling various errors such as: detection of irrelevant AUs (Insertion), missing of relevant AUs (Deletion), and false substitution (Replacement).
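The LCS-based matching can be sketched as follows; the emotion templates shown are illustrative placeholders, not the templates derived from the statistical relationships above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two AU sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

# Illustrative templates only; the real template strings come from the
# AU-emotion statistical relationships.
TEMPLATES = {"happy": ["AU6", "AU12"], "surprise": ["AU1", "AU2", "AU5", "AU27"]}

def map_aus_to_emotion(predicted_aus, templates=TEMPLATES):
    """Pick the emotion whose template shares the longest common subsequence
    with the predicted AU string, tolerating insertions, deletions and
    substitutions in the test string."""
    return max(templates, key=lambda e: lcs_length(predicted_aus, templates[e]))
```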
At step 236, the mapped emotion is outputted.
Figure 4 is an exploded view of an offline training module, according to an embodiment herein. The offline training module is used for creating a pre-stored face and fiducial point database and a plurality of action unit model database for comparing the action units predicted using appearance based features. This is performed during the development of the human emotion detection module using a pre-defined number of images.
According to one embodiment of the present invention, in order to create the pre-stored AU model database 209, the offline training module 400 includes a database of images 402, a face detection module 404, a face cropper module 406, a region of interest selection module 408, an appearance based feature extraction module 410, a feature reduction module 412 and an AU modelling module 414. In order to create the pre-stored face and fiducial points database 203, the offline training module includes an appearance and shape modelling module 403.
The face detection module 404 localizes the face(s) in each of the training images using manually marked pupil points. The face cropper module 406 uses the localized face and marked pupil points to crop the face region while maintaining a standard ratio between the various facial features, like the eyes, mouth and chin, using varied scale factors. The face is rotated and scaled to obtain normalized faces of uniform size. The region of interest selection module 408 selects the region of interest (ROI) on the face to be used for each of the action units (AU) by understanding the relationship between the face regions and the AU. The region of interest selection module 408 according to the present invention plays a major role in the reduction of false positives and overlooks. The appearance based feature extraction module 410 derives features from the cropped face by applying filter banks and constructs feature vectors of large size in order to capture the appearance of facial events uniquely, representing local events (textures) around each pixel.
The face cropper module 406, the region of interest selection module 408 and the appearance based feature extraction module 410 present in the offline training module 400 function similarly to the face cropper module 232, the region of interest selection module 234 and the appearance based feature extraction module 236 of the appearance based action unit prediction module 208.
The feature reduction module 412 selectively chooses relevant and sufficient features for further classification. In one embodiment of the present invention, the feature reduction is done using the AdaBoost technique. Moreover, the feature reduction according to the present invention depends on the shape dependency of the AU, a confidence measure and the muscle size of the facial part under consideration.
For example, the action units corresponding to the features extracted by the appearance based feature extraction module 410 may have a high dependency on other action units based on the shape of the AU. For instance, AU 27 (mouth open) does not depend on any other action, while AU 7 (eye lid tightener) happens when trying to make AU 27. The confidence measure of each action unit with respect to emotions is also taken into consideration when the feature reduction is performed. That is, if an AU has more correlation for inferring emotions, more importance is given to that particular AU. For instance, AU 28 is highly important for Happy and AU 9 is highly important for Disgust, whereas AU 7 only acts as a supporting action for Anger and Disgust. Besides, the size of the facial action muscles also helps in the feature reduction process. For example, AU 27 indicates a bigger muscle, and hence more features around that muscle, while AU 5 indicates a smaller muscle, and hence fewer features around that muscle.
A Support Vector Machine (SVM) is used to train the action unit modelling module 414 on images in the image database. A separate classifier is trained for every AU and that results in a representative model for each AU. The AU models derived from the AU modelling module 414 are stored in pre-stored AU database 209.
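A minimal sketch of the per-AU training pipeline, with AdaBoost-based feature reduction followed by an SVM classifier, is shown below; the number of kept features and the SVM kernel are assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

def train_au_model(features, labels, n_selected=200):
    """Train one binary AU model for the pre-stored AU model database.

    features: (n_frames, n_dims) appearance based feature vectors.
    labels: 1 if the AU is present in the frame, else 0.
    """
    # Boosted decision stumps rank feature importance for this AU.
    booster = AdaBoostClassifier(n_estimators=n_selected).fit(features, labels)
    selected = np.argsort(booster.feature_importances_)[::-1][:n_selected]
    svm = SVC(kernel="rbf").fit(features[:, selected], labels)
    return {"selected": selected, "svm": svm}

def predict_au(model, feature_vector):
    """Presence/absence decision for one AU from its stored model."""
    x = feature_vector[model["selected"]].reshape(1, -1)
    return bool(model["svm"].predict(x)[0])
```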
The appearance and shape modelling module 403 creates patches around each fiducial point of the face. Further, the appearance and shape modelling module 403 models the appearance of the patches by learning an SVM on the texture data obtained from the patches. The global shape of the face is modelled using the coordinate locations of the fiducial points in the CLM database and performing PCA on the shape data. The major modes of variation in shape are represented by the eigenvectors, and the corresponding eigenvalues represent the strength of those modes of variation.
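A minimal sketch of the PCA shape model is shown below, assuming the training shapes have already been aligned (e.g. Procrustes-normalised); the retained-variance fraction is an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_shape_model(shapes, variance_kept=0.95):
    """Point-distribution shape model from training fiducial points.

    shapes: (n_faces, n_points, 2) array of aligned fiducial coordinates.
    Returns the mean shape, the eigenvectors (modes of variation) and the
    eigenvalues (strength of each mode) to be stored in the database.
    """
    flat = shapes.reshape(len(shapes), -1)          # (n_faces, 2 * n_points)
    pca = PCA(n_components=variance_kept, svd_solver="full").fit(flat)
    return {
        "mean_shape": pca.mean_.reshape(-1, 2),
        "eigenvectors": pca.components_,            # one mode per row
        "eigenvalues": pca.explained_variance_,
    }
```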
The pre-stored face and fiducial points database 203 stores the eigenvectors and eigenvalues of the shape model and the SVM weight vectors of the appearance model. In one exemplary embodiment, an xml file is used for storing the eigenvalues and corresponding eigenvectors. The stored eigenvalues and corresponding eigenvectors are used for detecting the fiducial points and CLM points of a face to detect human emotion using the human emotion detection module 106.
Figure 5 is a flowchart illustrating an exemplary method of training an offline training module, according to one embodiment. At step 502, an image is fetched from the image database 402. At step 503, one or more faces are detected in the fetched image by the face detection module 404. At step 504, the detected face is cropped in a pre-defined size by maintaining a pre-defined ratio between different facial parts such as the eyebrows, lips, jaw, etc. At step 506, a region of interest is selected on the cropped face for each of the action units (AU) based on the correlation power between the action units and the regions of interest. At step 508, one or more appearance based features are extracted using the fiducial points and CLM points. At step 510, relevant appearance based features are selected using the feature reduction module 412. In one exemplary embodiment of the present invention, the shape of the selected action units, the confidence measure corresponding to the AUs under consideration and the muscle size of the facial parts are considered for the selection of relevant appearance based features. At step 512, action unit models are generated by training a support vector machine.
Figure 6 is a flowchart illustrating an exemplary method of predicting action units based on appearance based feature detection, according to one embodiment. At step 602, one or more faces in an input video stream are detected. At step 604, the detected face is cropped in a pre-defined size by maintaining a pre-defined ratio between different facial parts such as the eyebrows, lips, jaw, etc. According to one exemplary embodiment of the present invention, cropping of the detected face is done based on the fiducial point alignment. The facial parts are localized based on the fiducial points, and the different facial parts are detected based on the specific ratio between them. At step 606, a region of interest is selected on the cropped face for each of the action units (AU) based on the correlation power between the action units and the regions of interest.
At step 608, one or more appearance based features are extracted by applying filter banks and constructing feature vectors of large size in order to capture the appearance of facial events uniquely, representing local events (textures) around each pixel. Two methods are used for extracting the features: the Gabor feature extraction method using a Gabor filter bank, and the Spin Local Gradient Binary Patterns (LGBP) based method.
A Gabor filter is a linear filter used for edge detection. The frequency and orientation representations of Gabor filters are similar to those of the human visual system, and the filters are appropriate for texture representation and discrimination. The Gabor feature extraction method generates smooth gradient responses of the detected face at M scales and N orientations.
An equation for calculating the Gabor feature for a detected face is represented as follows:
ψ_k(x) = (‖k‖²/σ²) · exp(−‖k‖²‖x‖²/(2σ²)) · [exp(i k·x) − exp(−σ²/2)]
where k is the wave vector determining the scale and orientation of the filter, x is the pixel position, and σ controls the width of the Gaussian envelope.
The Gabor features capture the emphasis (magnitude of response) of different facial events at various scales and orientations.
The Spin Local Gradient Binary Patterns (LGBP) based method generates a code/pattern for each pixel in each of the detected faces obtained from the face and pupil detection module 230 by integrating the relative information around the pixel at a radius that is a function of the corresponding response image scale. The Local Gradient Binary Pattern features collect the features extracted through the SPIN support with P radial distances and Q angles. The method steps involved in extracting features and predicting AUs based on appearance using the SPIN LGBP method include segmenting each LGBP response image into blocks, for example four quadrants, imposing the SPIN support on each of the blocks, calculating the histogram of patterns falling under each of the sub-blocks in the SPIN support, and concatenating the calculated histogram patterns to determine a feature vector.
At step 610, the different action units corresponding to the appearance based features are predicted by SVM based classification using the pre-stored AU models present in the pre-stored AU database 209 and the appearance based features extracted at step 608.
Figure 7 is a flowchart illustrating an exemplary method of predicting action units based on the geometrical change in facial parts, according to one embodiment. The geometrical change in one or more facial parts is used to predict one or more AUs to detect the human emotion. According to one embodiment of the present invention, a reference shape (i.e., the initial neutral face) of the detected face is registered for the image to decide if the fiducial points for the current frame have moved with respect to the reference points. For instance, consider that the eyebrows in the detected face move vertically from a position X to a position Y. Then, the currently fitted fiducial points are corrected for the affine variations and taken to the mean shape space for computing the geometric variations. In order to detect the geometric variation, the facial parts that deliver a high range of emotion specific movements, such as the eyebrows and lips, are considered.
At step 702, geometrical changes in the facial parts are detected by measuring the amount of vertical upward and downward shift in the selected facial part. In one exemplary embodiment of the present invention, the geometrical change of a facial part is detected by comparing the currently fitted fiducial points (after correcting for the affine variations) against the reference facial parts. At step 704, the change and direction of movement of the facial parts are estimated by comparing the currently fitted fiducial points (after correcting for the affine variations) against the reference facial parts. At step 706, the action units corresponding to the estimated geometrical change are predicted. For instance, if the facial part considered for detecting geometrical change is the eyebrows, then an upward vertical shift beyond a threshold is interpreted as AU3 and a downward vertical shift beyond a threshold as AU4. The current nose points are also compared against the reference points. The decisions on AU3 and AU4 are made only if the current nose points are seen to be geometrically stable with respect to the reference nose points.
Figure 8 is a flowchart illustrating an exemplary method of predicting action units based on the change in appearance, according to one embodiment. At step 802, appearance based features are obtained based on the fiducial points. This detects the change in the detected face in real time by extracting LTP histograms at each emotion key point of the detected face, such as the lips, eyebrows, chin, etc.
At step 806, a threshold based detection logic is applied to the appearance based features. The output of the threshold detection logic is used to determine whether any change has happened in that region, and the corresponding AU is predicted if a change is detected, as shown in step 808.
Figure 9 is a block diagram of an emotion recognition device, such as the one shown in Figure 1, showing various components for implementing embodiments of the present subject matter. In Figure 9, the emotion detection device 100 includes the processor 102, the memory 104, a display 902, an input device 904, a cursor control 906, a read only memory (ROM) 908, and a bus 910.
The processor 102, as used herein, means any type of computational circuit, such as, but not limited to, a microprocessor, a microcontroller, a complex instruction set computing microprocessor, a reduced instruction set computing microprocessor, a very long instruction word microprocessor, an explicitly parallel instruction computing microprocessor, a graphics processor, a digital signal processor, or any other type of processing circuit. The processor 102 may also include embedded controllers, such as generic or programmable logic devices or arrays, application specific integrated circuits, single-chip computers, smart cards, and the like.
The memory 104 and the ROM 908 may be volatile memory and non-volatile memory. The memory 104 includes the human emotion detection module 106 for detecting emotion of faces detected in a video stream according to one or more embodiments described above. A variety of computer-readable storage media may be stored in and accessed from the memory elements. Memory elements may include any suitable memory device(s) for storing data and machine-readable instructions, such as read only memory, random access memory, erasable programmable read only memory, electrically erasable programmable read only memory, hard drive, removable media drive for handling compact disks, digital video disks, diskettes, magnetic tape cartridges, memory cards, and the like.
Embodiments of the present subject matter may be implemented in conjunction with modules, including functions, procedures, data structures, and application programs, for performing tasks, or defining abstract data types or low-level hardware contexts. The human emotion detection module 106 may be stored in the form of machine-readable instructions on any of the above-mentioned storage media and may be executed by the processor 102. For example, a computer program may include machine-readable instructions, that when executed by the processor 102, cause the processor 102 to detect emotion of one or more faces detected in a video stream according to the teachings and herein described embodiments of the present subject matter. In one embodiment, the computer program may be included on a compact disk-read only memory (CD-ROM) and loaded from the CD-ROM to a hard drive in the non-volatile memory.
The bus 910 acts as an interconnect between the various components of the emotion detection device 100. The components such as the display 902, the input device 904, and the cursor control 906 are well known to persons skilled in the art and hence their explanation is omitted.
The present embodiments have been described with reference to specific example embodiments; it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments. Furthermore, the various devices, modules, and the like described herein may be enabled and operated using hardware circuitry, for example, complementary metal oxide semiconductor based logic circuitry, firmware, software and/or any combination of hardware, firmware, and/or software embodied in a machine readable medium. For example, the various electrical structure and methods may be embodied using transistors, logic gates, and electrical circuits, such as application specific integrated circuit.
Although the embodiments herein are described with various specific embodiments, it will be obvious for a person skilled in the art to practice the invention with modifications. However, all such modifications are deemed to be within the scope of the claims. It is also to be understood that the following claims are intended to cover all of the generic and specific features of the embodiments described herein and all the statements of the scope of the embodiments which as a matter of language might be said to fall there between.