
Music Recommendation System And Method Thereof

Abstract: The present invention relates to a music recommendation system using speech and body gesture emotions. More particularly, the present invention relates to a music recommendation system that recognizes emotions from both body gestures and speech input to recommend songs by using a few-shot meta-learning-based Siamese Neural Network to obtain the best accuracy and to add a level of personalized choice of songs for the user. Fig. 4


Patent Information

Application #
Filing Date
06 February 2023
Publication Number
10/2023
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
sunita@skslaw.org
Parent Application
Patent Number
Legal Status
Grant Date
2025-07-22
Renewal Date

Applicants

AMRITA VISHWA VIDYAPEETHAM
Amrita School of Computing, Kasavanahalli, Carmelaram P.O., Bangalore – 560035, India

Inventors

1. NAMBIAR, Kirti R.
10/1, “RAAGALAYA”, Borewell Road, Annasandra Palya, Bangalore, Karnataka 560017
2. PALANISWAMY, Suja
#102, MB4 Suryanagara Phase 1, Chandapura, Anekal road, Bangalore, Karnataka 560099

Specification

Description:
FIELD OF THE INVENTION:
The present invention relates to a music recommendation system using speech and body gesture emotions. More particularly, the present invention relates to a music recommendation system that recognizes emotions from both body gestures and speech input to recommend songs by using a few-shot meta-learning based Siamese Neural Network to obtain the best accuracy.

BACKGROUND OF THE INVENTION:
Music has always been a very popular form of entertainment and technology has been quick to recognize this. With the growth of music streaming platforms, users are left with a multitude of options to choose from. This calls for a system that makes music organization and search management easier.

Speech is a very important part of human communication and is used to convey both linguistic and paralinguistic information. Classical speech recognition relies on retrieving only the linguistic information and fails to identify paralinguistic information such as the gender, emotion and state of mind of the speaker.

However, relying solely on emotions from speech may not be very effective, as it is subject to variations. Emotions recognized from speech alone are subject to variations across people of different genders, races and regions. Speech emotion recognition may also be ineffective for speech-impaired people.

The recent advancements in the field of Human-Computer Interaction (HCI) have bridged the gap between computers and human beings. Emotion recognition is an HCI domain that has gained a significant advantage owing to this development. With the present-day emphasis on mental health, virtual learning and productivity, emotion recognition has become a very important area of research. Emotion recognition is the process of recognizing emotion from human interactions and is often used to evaluate a non-verbal response in a system. Emotions are highly influenced by social context and play a significant role in determining human behavior. Emotions may be expressed through a combination of facial expressions, speech, body language, etc. A single modality such as facial expression or speech alone may not give the accurate emotion of the person. Therefore, emotion recognition can be treated as a multi-modal problem.

One article, “Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis” by Loic Kessous et al., discloses a model for multimodal emotion recognition using a convolutional neural network (CNN) to extract features with linear addition. The emotion recognition is done with the help of facial expressions and body movements. The CNN model was trained using apex frames. Feature-level fusion was performed using Compact Bilinear Pooling, where the outer product of two vectors is calculated and linearized into a matrix. This helps to reduce the dimensions of the combined feature. This method achieved an accuracy of 87.2% on the face and body gesture (FABO) dataset.

Another article, “Deep Emotion Recognition through Upper Body Movements and Facial Expression” by Chaudhary Muhammad Aqdus Ilyas et al., demonstrates a model that learns to recognize emotions from upper body movements and facial expressions by combining the features gathered from facial expressions and body gestures using a CNN and Long Short-Term Memory (LSTM). The FABO dataset is used to train the model, while the Geneva Multimodal Emotion Portrayals (GEMEP) dataset is used to test it. The model was able to obtain an accuracy of 77.7% for facial expressions alone.

Another article, “Smart Music Player integrating Facial Emotion Recognition and Music Mood Recommendation” by Shlok Gilda et al., proposed a music recommendation model that makes recommendations based on the emotions recognized from the facial expressions of the user by using a two-dimensional CNN, achieving an accuracy of 90.23%.

Recommending songs based on emotions derived from body gestures and speech is relatively unexplored and requires a large number of samples, since enumerating all body motions is difficult. Emotion recognition from body gestures coupled with meta-learning is a fairly under-researched topic that would give more accurate results.

As a result of the above limitations, there is a need for a music recommendation system based on the emotions recognized from speech and body gestures by using multi-modal emotion recognition to give more accurate results.

OBJECT OF THE INVENTION:
In order to obviate the drawbacks of the existing state of the art, the present invention discloses a system for music recommendation based on the emotions recognized from the speech and body gestures.

The main object of the present invention is to provide a system for music recommendation based on emotions identified through speech and body gestures.

Another object of the present invention is to provide a system for music recommendation that recognizes emotions from both body gestures such as facial expressions and speech input to recommend songs by using few-shot meta-learning based Siamese Neural Network.

Yet another object of the invention is to provide a system for music recommendation based on emotions derived by combining both the speech and body gesture emotions by using element-wise multiplication to counter the dimensionality problem.

Yet another object of the invention is to provide a system for music recommendation by recognizing emotions from body gestures based on an “apex” frame extracted from every video to reduce the computing time to yield optimum accuracy.

Yet another object of the invention is to provide a system for music recommendation by combining both the speech and body gesture emotions by using a Siamese network, a 4-shot learning model, to yield better accuracy.

Yet another object of the invention is to provide a system for music recommendation by recognizing both the speech and body gesture emotions to add a level of personalized choice of songs.

SUMMARY OF THE INVENTION:
Accordingly, the present invention relates to a music recommendation system that recognizes emotions from both body gestures and speech input to recommend songs by using few-shot meta-learning based Siamese Neural Network to obtain the best accuracy.

Emotion plays a vital role in enabling human beings to express their feelings. These feelings can be expressed through various gestures, such as body language, voice tone or facial expressions. These emotions help an individual understand what exactly is being conveyed, which leads to better interaction. It has also been acknowledged that when a person remains in a bad and sad mood, he is capable of drifting into a form of mental depression which may adversely affect his health.

Music has a major impact on a person's mood. It has a unique ability to uplift one's mood. If a user receives a recommendation based on his preference, it will also improve his listening experience. Music is a form of art which lightens the mood of a person. Besides, it is also an entertainment medium for music lovers and listeners.

The present invention aims to alleviate the sad or bad mood of the user (U) by creating an automatic music search management system (S) which plays music according to the user's emotion recognized from both speech and body gestures, such as facial expressions, captured through an input-output device and fused to obtain an 'input' emotion that is fed into the meta-learning based Siamese Neural Network to provide multi-modal emotion recognition. The said input emotion is computed by the Recommendation Module (RM). The system therefore addresses the gap between manual personalization and automatic personalization of the music choice of the said user (U) by identifying the mood and selecting the appropriate music from the pre-determined playlist of the user (U) from the music application.

The present invention provides a system for music recommendation that recognizes emotions from both body gestures and speech input to recommend songs by using few-shot meta-learning based Siamese Neural Network to obtain the best accuracy.

Further, the present invention provides a system for music recommendation by recognizing both the speech and body gesture emotions to add a level of personalized choice of songs for the user.

STATEMENT OF INVENTION:
The music recommendation system of the present invention comprises an emotion module and a recommendation module. The emotion module has an input-output device which can capture video clips containing both audio and video of the user, the Geneva Multimodal Emotion Portrayals (GEMEP) dataset, a valence-arousal model and a feedback mechanism. The video clips are language specific.

The recommendation module has a 4-layer neural network architecture, a music application programming interface (API) such as the user's Spotify API, and a music output module.

The emotions of the user are extracted from the video by analyzing the audio and the visuals, which are fused using a fusion module based on a meta-learning based Siamese Neural Network to obtain an input emotion, thereby providing precise multi-modal emotion recognition. This emotion forms the input of the recommendation module, which scrutinizes the indexed pre-recommended list on the music API of said user and recommends a song corresponding to the input emotion of said user, thereby providing appropriate music based on multi-modal emotion recognition, and plays the music on the music output module.

The invention also discloses a method for music recommendation from the user's preferred list on the music API corresponding to the input emotion of the user. The preferred list on the music API is classified into 4 categories by the Valence Arousal module, namely: happy, serene, energetic and sad. When the music is recommended based on the input emotion, the user has a choice to pick from the recommended list and play the same on the music output device.

BRIEF DESCRIPTION OF THE DRAWINGS:
Fig. 1 depicts a system for music recommendation using speech and body gesture emotions

Fig. 2 depicts different body gestures conveying different emotions

Fig. 3 depicts samples of the GEMEP dataset

Fig. 4 depicts general architecture of the music recommendation system

Fig. 5 depicts frames extracted from the video displaying the emotion “anger”

Fig. 6 depicts last frames for the emotions Anger, Fear, Joy, Pride and Sad respectively

Fig. 7 depicts apex frames for the emotions Anger, Fear, Joy, Pride and Sad respectively

Fig. 8 depicts a spectrogram of the audio used to train the network

Fig. 9 depicts element-wise fusion of apex frame and speech in order to be trained using Siamese Model

Fig. 10 depicts the song recommendations being made for the user mood detected as “Anger”

DETAILED DESCRIPTION OF THE INVENTION:
The present invention relates to a music recommendation system using speech and body gesture emotions. More particularly, the present invention relates to a music recommendation system that recognizes emotions from both body gestures and speech input to recommend songs by using a few-shot meta-learning-based Siamese Neural Network to obtain the best accuracy.

Music is a form of art known to have a strong connection with a person's emotions. It has a unique ability to uplift one's mood. With the growth of music streaming platforms, users are left with a multitude of options to choose from. If a user receives a recommendation based on his preference, it will also improve his listening experience. This calls for a music recommendation system that recognizes the user's emotions.

Emotion recognition is the process of recognizing emotion from human interactions and is often used to evaluate a non-verbal response in a system. Emotions are highly influenced by social context and play a significant role in determining human behavior. Emotions may be expressed through a combination of facial expressions, speech, body language etc. A single modality such as facial expression or speech alone may not give the accurate emotion of the person. Therefore, emotion recognition can be treated as a multi-modal problem.

Emotions recognized from speech alone are subject to variations across people of different genders, races and regions. Speech emotion recognition may also be ineffective for the speech-impaired. In an era when AI companies are focusing on accessibility for the disabled, a multi-modal form of emotion recognition will make personalization effective even for the speech impaired.

During interactions, body gestures are a key component in expressing emotions. They include the movement of hands and other body parts, along with changes in facial expression, to convey various emotions and thoughts. Body gestures convey crucial information regarding a person's affective and psychological inner state and happen to be far more under-researched than other forms of emotion recognition such as facial expressions. The present invention focuses on recognizing emotions from body gestures, along with speech input.

Accordingly, the present invention provides a music recommendation system that recognizes emotions from both body gestures and speech input to recommend songs by using few-shot meta-learning-based Siamese Neural Network to obtain the best accuracy as shown in Fig. 1.

The present invention aims to personalize the music search using a music recommendation system that will suggest music depending on the user's emotion recognized from both speech and body gestures, thereby addressing the gap in personalization in the existing applications and increasing user satisfaction. Fig. 2 shows examples of different body gestures that convey different emotions.

Since emotion recognition does not depend on a single modality, using multi-modal emotion recognition makes the system of the present invention more robust and yields better accuracy. Accordingly, in the present invention, the emotional information obtained from both speech and body gestures is combined to conclude the final emotion of the user by using a Siamese network based meta-learning technique, which reduces the dependence of the model on training data, with a low contrastive loss to train the model.

Recommendation systems are frameworks used to recommend items to customers. For example, the suggestion of the next movie to watch on a movie streaming platform is the work of a classic recommendation system. Recommendation systems are mainly of 3 types: (a) collaborative filtering, (b) content-based and (c) hybrid recommender systems. The present invention aims to build a recommender system that takes into account the user's emotional state at the time and maps it to an appropriate song based on its mood. The model is built with a feedback mechanism that helps to take into account the user's preferences and choice at the time of the recommendation.

Meta-learning is a class of machine learning algorithms that learn how to learn across a number of different prediction experiments. It also observes how different machine learning algorithms learn during training and then learns from this "experience". In this class of learning, the model relies on a smaller amount of training data and is less prone to overfitting. Meta-learning employs techniques such as zero-shot, one-shot and few-shot learning to learn effectively from a minimum number of data points.

Zero Shot Learning (ZSL) is a machine learning problem in which a learner observes samples from classes that were not viewed during training and identifies the category to which they belong. The observed and unobserved categories are combined using zero-shot approaches, which use auxiliary information to represent observable differentiating features of objects.

One-shot learning refers to classification tasks in which a large number of predictions must be made based on only one (or a few) instances of each class. This refers to issues in the face detection and recognition domain, such as facial identification and facial authentication, in which people must be accurately identified based on one or a few sample pictures employing a variety of facial expressions, lighting conditions, accessories, and hairstyles. Siamese networks are a common one-shot learning model that compares a learned feature vector for known and candidate cases. Contrastive loss and triplet loss functions, which provide the cornerstone for modern face recognition systems, can be used to learn high-quality face embedding vectors.

Traditional machine learning algorithms require as much data as the model can handle, and large amounts of training data allow the model to predict more accurately. Few-shot learning, also known as one-shot learning or low-shot learning, is a machine learning concept in which a model is trained with very little data. This form of learning seeks to develop accurate machine learning models with little training data. It has a number of advantages, including the ability to reduce time, computing expenses and data analysis costs.

Few-shot learning methods work on the premise of training a model using a small amount of data, whereas zero-shot learning approaches rely on the idea of using no training data for each class in order for models to make effective predictions. One-shot learning methods are a blend of few-shot and zero-shot learning methods in which only one instance per class is used to train the models. The majority of facial recognition systems use one-shot learning approaches, training the model with only one photograph of the user.

Accordingly, the present invention caters to the need of the music search management by creating a system for music recommendation which plays music according to the user’s emotion recognized by both speech and body gestures and therefore addresses the gap in personalization in the existing applications.

Datasets
The present invention uses the Geneva Multimodal Emotion Portrayals (GEMEP) dataset. It is a compilation of audio and video recordings in which ten actors portray 17 different emotional states, each with its own set of linguistic contents and styles of expression. The recorded videos have a resolution of 720x576 pixels and a frame rate of 25 frames per second. The videos are of 2 seconds duration on average, and the audio is in French. The emotion classes are admiration, amusement, anger, anxiety, contempt, despair, disgust, fear, interest, irritation, joy, pleasure, pride, relief, sadness, surprise and tenderness. For better classification performance, only 5 classes – anger, fear, joy, pride and sadness – were considered for classification. A sample of the GEMEP dataset used is shown in Fig. 3.

The dataset used for the Music Recommendation Module is the “data moods” dataset. It contains a list of over 600 songs along with their features extracted using a music application programming interface (API), in this case the Spotify API. Each song is mapped to one of the following moods – happy, serene, energetic and sad – using the Valence Arousal model. These moods are mapped to appropriate user emotions to make recommendations.
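For illustration, a minimal Python sketch of such a Valence Arousal mood mapping is given below. The specification does not disclose the exact thresholds, so the 0.5 cut-offs and the use of the Spotify "energy" feature as the arousal axis are assumptions made purely for illustration.

```python
# Illustrative Valence-Arousal mood mapping (assumed thresholds, not the
# patented implementation): valence = musical positivity, energy = arousal.
def map_mood(valence: float, energy: float) -> str:
    """Map a song's (valence, energy) pair to one of the four mood classes."""
    if valence >= 0.5 and energy >= 0.5:
        return "Happy"      # positive and high arousal
    if valence >= 0.5:
        return "Serene"     # positive but low arousal
    if energy >= 0.5:
        return "Energetic"  # negative but high arousal
    return "Sad"            # negative and low arousal

# Example: a bright, up-tempo track is mapped to "Happy".
print(map_mood(0.8, 0.9))
```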

The present invention consists of two modules:
(I) Emotion Module (EM)
(II) Recommendation Module (RM)
The Emotion Module consists of a Siamese model that is trained to recognize emotions from speech and body gestures. The recommendation module consists of a 4-layer neural network architecture and is responsible for recommending songs based on the user's emotions. Fig. 4 illustrates the general architecture of the music recommendation system.
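A minimal sketch of a 4-layer neural network of the kind used by the recommendation module is shown below, assuming the Keras API; the layer widths, activations and optimizer are illustrative assumptions, since the specification states only that the module uses a 4-layer architecture with four output mood classes.

```python
# Illustrative 4-layer classifier (assumed sizes/activations); the input is
# a vector of song features plus the recognized user emotion, and the output
# is one of the four moods (happy, serene, energetic, sad).
import tensorflow as tf

def build_recommendation_model(n_features: int, n_moods: int = 4) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_moods, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```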

(I) EMOTION MODULE (EM)
The Emotion Module (EM) comprises at least one input-output device capable of capturing emotions from speech and body gestures such as facial expressions. Said Emotion Module (EM) is responsible for extracting the emotion information from the user's speech input signals and body gestures. These features are combined using element-wise multiplication and passed as input to the model. The model therefore determines the emotion by considering both signals using a Siamese network. The steps involved include video processing, speech processing and feature combination, which are explained below:

(a) Video Processing
For emotion recognition from body gestures, the Geneva Multimodal Emotion Portrayals (GEMEP) dataset was used in the present invention. The dataset that originally contained 17 classes of emotions recorded by 10 actors was trimmed to 5 classes – anger, fear, joy, pride and sadness – in order to improve the performance of the model. The videos have a frame rate of 25 frames per second (fps) and are of varying durations. The video was thus first processed by splitting it into frames at a frame rate of 25 fps. The frames extracted from the video displaying “anger” are shown in Fig. 5.

From these frames an “apex” frame is chosen. An apex frame is a frame that displays the peak emotion. It is identified by finding the frame that is at the maximum distance from the “neutral” frame. A neutral frame is one that displays no emotion. The GEMEP dataset does not have a neutral frame; however, the last frames are the ones that display minimum emotion and are used as the reference to find the apex frame. The difference in the degree of emotion displayed can clearly be perceived in Fig. 6 and Fig. 7, which depict the last frames and apex frames for the emotions Anger, Fear, Joy, Pride and Sad respectively.
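A minimal Python sketch of this apex-frame selection is given below, assuming OpenCV for frame extraction and a pixel-wise Euclidean distance from the last (least emotional) frame; the distance measure is an assumption, since the specification does not state how the distance is computed.

```python
# Illustrative apex-frame selection: the last frame is used as the reference
# (minimum-emotion) frame, and the frame farthest from it is taken as apex.
import cv2
import numpy as np

def extract_frames(video_path: str) -> list:
    """Split the 25 fps GEMEP video into grayscale frames."""
    frames, cap = [], cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    return frames

def apex_frame(frames: list) -> np.ndarray:
    """Return the frame at maximum pixel-wise distance from the last frame."""
    reference = frames[-1].astype(np.float32)
    distances = [np.linalg.norm(f.astype(np.float32) - reference) for f in frames]
    return frames[int(np.argmax(distances))]
```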

(b) Speech Processing
The audio from the same set of videos used for body gesture emotion recognition was extracted and used to train the Siamese model. This helps to avoid issues arising due to a difference in the classes used to train the audio and video modalities that might affect the performance of the model. These audio signals are then converted into visual representations of the signal, called spectrograms, using a Hanning window of size 512.

Steps to create a spectrogram (illustrated in the sketch following this list):
• The audio is split into overlapping windows or frames.
• A Short Time Fourier Transform (STFT) is performed on each window and the absolute value is taken. The STFT computes Discrete Fourier Transforms (DFT) over small contiguous windows to represent a signal in the time-frequency domain. This function returns a complex-valued matrix D, where np.abs(D[f, t]) represents the magnitude of frequency bin f at frame t.
• A vertical line represents the magnitude vs frequency in each of the produced windows.
• The windows are converted into decibels or the mel scale. These are perceptually relevant scales that are logarithmic in nature.
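The sketch below illustrates these steps in Python, assuming the librosa library; the 512-sample Hann window follows the text above, while the hop length of 256 samples is an assumption.

```python
# Illustrative spectrogram computation: STFT with a 512-sample Hann window,
# magnitude |D[f, t]|, then conversion to the decibel scale.
import numpy as np
import librosa

def audio_to_spectrogram(audio_path: str) -> np.ndarray:
    y, sr = librosa.load(audio_path, sr=None)              # audio extracted from the video
    D = librosa.stft(y, n_fft=512, hop_length=256, window="hann")
    magnitude = np.abs(D)                                   # magnitude of each frequency bin
    return librosa.amplitude_to_db(magnitude, ref=np.max)   # decibel (log) scale
```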

(c) Feature Combination
One of the main challenges in multi-modal emotion recognition is choosing when to combine the input features. The two most common ways of fusion are feature level fusion and decision level fusion.

Feature level fusion is the technique of combining the features before they are passed to the training model. Therefore, a single combined feature set is passed to the classifier model for training. This is often done by concatenation or performing outer product on the features. However, combination of 2 or more features might result in a drastic increase in the dimension of the feature, thereby increasing the computational complexity.

Decision level fusion is another approach in feature fusion for multi-modal emotion recognition. In this method the features are trained separately and combined at the end. This technique, on the other hand, fails to recognize the relationship between input modalities.

In the present invention, feature level combination is used to train the model to recognize emotions from both speech and body gestures. Fig. 9 depicts the combination of the apex frame and audio spectrogram before training. In order to keep the dimensionality of the data in check, element-wise multiplication of features is used. Element-wise multiplication is the method where corresponding elements in 2 matrices of the same dimensions are multiplied to give a new matrix of the same dimension as the output, i.e., [a, b] x [c, d] = [a·c, b·d]. In the given example, the inputs and the resulting output are all of size 1x2.
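A minimal sketch of this element-wise fusion is given below; resizing both features to a common 64x64 shape is an assumption for illustration, since the text requires only that the two features have the same dimensions.

```python
# Illustrative element-wise (Hadamard) fusion of the apex frame and the
# spectrogram; the fused feature keeps the same dimensions as the inputs.
import cv2
import numpy as np

def fuse_features(apex: np.ndarray, spectrogram: np.ndarray,
                  size: tuple = (64, 64)) -> np.ndarray:
    a = cv2.resize(apex.astype(np.float32), size)          # body-gesture feature
    b = cv2.resize(spectrogram.astype(np.float32), size)   # speech feature
    return a * b                                           # element-wise product
```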

The resulting feature vector is given as input to a Siamese model that calculates the difference between the inputs of its 2 branches and classifies the inputs into the appropriate classes. The difference between the outputs can be calculated by using a loss function such as contrastive loss, binary cross-entropy or triplet loss. In the present invention, the contrastive loss function was used.

By contrasting the inputs, the contrastive loss function distinguishes between similar and dissimilar inputs. The purpose of the contrastive loss function is to promote intra-class compactness by minimizing the distance between positive pairs whilst increasing the distance between negative pairs.
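A minimal sketch of the contrastive loss function is given below, assuming the TensorFlow API and a margin of 1.0 (the margin value is an assumption); it takes the pair label (1 for similar, 0 for dissimilar) and the distance computed between the two Siamese branches.

```python
# Illustrative contrastive loss: similar pairs are pulled together, dissimilar
# pairs are pushed apart up to the margin.
import tensorflow as tf

def contrastive_loss(y_true, distance, margin: float = 1.0):
    y_true = tf.cast(y_true, distance.dtype)
    positive = y_true * tf.square(distance)                                     # pull similar pairs
    negative = (1.0 - y_true) * tf.square(tf.maximum(margin - distance, 0.0))   # push dissimilar pairs
    return tf.reduce_mean(positive + negative)
```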

(II) Recommendation Module
The recommendation module takes the user emotion as input and recommends a song accordingly. This module uses the dataset data_moods.csv, which contains a list of songs along with their IDs and features obtained using a music application programming interface such as the Spotify API, to play appropriate music for the user on a music output (MO) device. For the purposes of working this invention, the inventors used the pre-determined list obtained from the Spotify API; this is in no way limiting of the invention. The said features obtained using Spotify were used to classify the songs into 4 different moods using the Valence Arousal model. The Spotify API returns the following data regarding the music (an illustrative retrieval sketch follows this list):

• Acousticness: A rating ranging from 0.0 to 1.0 that indicates whether or not a track is acoustic. A rating of 1.0 denotes a high degree of certainty that the track is acoustic.
• Danceability: This is a musical term that refers to how ideal a song is for dancing, taking into account a number of factors such as tempo, rhythm stability, beat strength, and general regularity. The least danceable value is 0.0, while the most danceable value is 1.0.
• Energy: A 0.0 to 1.0 scale that offers a subjective estimate of activity and intensity. Typical energetic music has a fast, loud and boisterous vibe to it. “Death metal”, for example, has a high amount of energy, while a “Bach” prelude scores low on the scale. Perceptual characteristics such as dynamic range, perceived loudness, timbre, onset rate and general entropy influence this attribute.
• Instrumentalness: Predicts whether a track contains no vocals. In this context, "ooh" and "aah" sounds are considered instrumental. Rap and spoken-word tracks are clearly "vocal". The closer the instrumentalness score gets to 1.0, the more likely the track is devoid of vocals. Instrumental recordings are represented by values above 0.5, and as the value approaches 1.0, confidence increases.
• Liveness: The term "liveness" refers to whether or not the recording was performed in front of an audience. The higher the liveness value, the more probable it is that the track was performed live. If the value is larger than 0.8, the track is almost certainly live.
• Loudness: The decibel (dB) measurement of a track's overall volume. Loudness values are averaged across the duration of a song and can be used to compare the relative loudness of different recordings. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). The decibel values typically range from -60 to 0 dB.
• Speechiness: A feature that identifies whether or not a track contains spoken words. The more speech-like the recording is (e.g., talk show, audiobook, poetry), the closer the attribute value is to 1.0. Tracks with a value of 0.66 or above almost certainly contain spoken language.
• Valence: A scale that ranges from 0.0 to 1.0 and describes the musical positivity of a track. Music with a high valence sounds more positive (e.g., happy, cheerful, euphoric), whereas music with a low valence sounds more negative (e.g., sad, depressed, angry).
• Tempo: The approximate tempo of a song in beats per minute (BPM). The average beat duration determines the tempo, which is the speed or pace of a work in musical terms.
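For illustration, the sketch below retrieves these audio features for a track using the spotipy client for the Spotify Web API; the client credentials and the track ID are placeholders.

```python
# Illustrative retrieval of Spotify audio features (acousticness, valence,
# energy, tempo, etc.) used to build a "data moods" style dataset.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID",            # placeholder credential
    client_secret="YOUR_CLIENT_SECRET"))   # placeholder credential

track_id = "TRACK_ID_PLACEHOLDER"          # placeholder Spotify track ID
features = sp.audio_features([track_id])[0]
print(features["valence"], features["energy"], features["tempo"])
```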

The present invention is supported by non-limiting experimental data as detailed below. Sentiment analysis was performed on the dataset to extract the subjectivity and polarity from the names of the songs. Subjectivity indicates the amount of personal opinion contained in a text and varies from 0 to 1, with 0 being the least subjective and 1 the most. Polarity expresses the degree of negativity or positivity of the text and varies between -1 and 1. These, along with the acoustic features of the song and the user emotions, were used to train the module. The average accuracy of the present music recommendation system is calculated by the following equation:

Average Accuracy = Nct / Nq … Eq. (1)
wherein Nct represents the total number of correctly classified images and Nq represents the total number of query images during testing.

Table 1: Overall accuracy obtained for the emotion module and recommendation module

Module | Modality | Accuracy
Emotion Module | Speech | 55.0%
Emotion Module | Body gesture | 70.14%
Emotion Module | Speech + Body gesture | 72.20%
Recommendation Module | NA | 98.02%

The overall accuracy obtained for the emotion module and recommendation module of the present invention is shown in Table 1. It is observed that when the same model was trained using speech alone, it acquired an accuracy of 55%, which improved significantly with the use of body gestures in emotion recognition, acquiring an accuracy of 70.14%. The model gives the best results when using emotions from both speech and body gestures, acquiring an accuracy of 72.20%.

Table 2: Confusion matrix for speech based emotion recognition


Table 3: Confusion matrix for body gesture based emotion recognition


Table 4: Confusion matrix for speech + body gesture based emotion recognition

Tables 2, 3 and 4 show the class-wise confusion matrix obtained using different modalities i.e. based on speech, body gesture, speech + body gesture respectively in emotion recognition module. It may be observed that Anger has the highest classification accuracy and Pride has the least. This is because Anger is a more powerful emotion when compared to the other classes and is characterized by a distinct change in body language, facial expressions and tone. Pride on the other hand is more subtle and less expressive and is therefore often misclassified as Anger or Fear.

The recommendation module achieves an accuracy of 98.02%, which indicates the likelihood of the recommended song being of the appropriate mood as per the user emotion, and the system plays it on the music output (MO) device (Fig. 10). The ratio of training to testing samples was 4:1 (an 80-20 split). Fig. 10 shows the song recommendations being made for the user mood detected as “Anger”. The emotion “Anger” is mapped to the mood “Calm”, as these songs can soothe an angry or disturbed listener. Therefore, calm and soothing songs such as “Adrift”, “City Lights”, “Lost” and “Dew Drops” are recommended.

Accordingly, the present invention provides a multi-modal emotion-based music recommendation system that makes music recommendations depending on the user's emotions by recognizing both speech and body gesture emotions to improve emotion recognition. The features obtained from the two modalities are fused together using element-wise multiplication in order to maintain the dimensionality of the data and given as input to the Siamese network, which uses contrastive loss to make effective classifications. The Siamese network is a one-shot meta-learning technique that classifies inputs by contrasting the difference between them.

Using 4-shot learning to recognize 5 different classes of emotions – anger, fear, joy, pride and sadness – the emotion module was trained on 20 audio and video samples, with which an accuracy of 72.20% was achieved in the Emotion Recognition Module. It could be observed that emotion recognition using speech input alone was only able to achieve an accuracy of 55.0%, which improved significantly when combined with the body gestures in the video component.
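For illustration, a minimal sketch of building similar/dissimilar training pairs from such a small 4-shot, 5-class support set (20 fused features) is given below; the pairing strategy is an assumption, not the disclosed training procedure.

```python
# Illustrative pair construction for contrastive Siamese training: every pair
# of fused features is labeled 1 if the two samples share an emotion class
# and 0 otherwise.
import itertools
import numpy as np

def make_pairs(features: np.ndarray, labels: np.ndarray):
    pairs, targets = [], []
    for i, j in itertools.combinations(range(len(features)), 2):
        pairs.append((features[i], features[j]))
        targets.append(1 if labels[i] == labels[j] else 0)
    return np.array(pairs), np.array(targets)
```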

The recommendation module of the present invention was trained on a dataset containing over 600 songs, along with features obtained using Spotify API. The module effectively maps the user emotions to appropriate song moods and recommends songs accordingly. This model used 4-fold cross validation and achieved an accuracy of 98.02%.
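A minimal sketch of such a 4-fold cross validation protocol is shown below, assuming scikit-learn; the feature matrix, labels and the stand-in classifier (whose hidden-layer widths are assumed) are placeholders for illustration only.

```python
# Illustrative 4-fold cross validation over the song/mood dataset; the
# MLPClassifier with three hidden layers plus an output layer stands in for
# the 4-layer recommendation network.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

def cross_validate(X: np.ndarray, y: np.ndarray) -> float:
    kfold, scores = KFold(n_splits=4, shuffle=True, random_state=0), []
    for train_idx, test_idx in kfold.split(X):
        clf = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=500)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))   # average accuracy across the 4 folds
```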
Claims:
1. A music recommendation system (S) comprising:
(a) an emotion module (EM) having an input-output device capable of capturing audio-visual clips of a user (U), the Geneva Multimodal Emotion Portrayals (GEMEP) dataset to obtain an input emotion, and a Valence Arousal (VA) model to classify songs in the pre-recommended list of said user in a music API,
(b) a recommendation module (RM) having a 4-layer neural network architecture, a music application programming interface (API) of the user (U) and a music output (MO) module with a feedback mechanism,
wherein said emotion module (EM) extracts the emotions of the user (U) from the body gestures of said user (U) from at least one video and from the speech of the said user (U) from said same video, and fuses both emotions by a fusion model using a meta-learning based Siamese Neural Network to obtain an input emotion (IE), thereby providing precise multi-modal emotion recognition; and
wherein said input emotion (IE) is input into the recommendation module (RM), which scrutinizes the indexed pre-recommended list on the music API of said user (U) and recommends a song corresponding to the input emotion (IE) of said user (U), thereby providing appropriate music based on multi-modal emotion recognition, and plays the music on the music output (MO) module based on said feedback mechanism.

2. Music recommendation system (S) as claimed in claim 1, wherein said indexed pre-recommended list comprises a list of songs each mapped to one of the moods classified as happy, serene, energetic and sad, using said Valence Arousal (VA) model.
3. Music recommendation system (S) as claimed in claim 1 wherein moods of said user (U) are mapped to corresponding input emotion (IE) by the recommendation module (RM).

4. Music recommendation system (S) as claimed in claim 1, wherein said fusion model is selected from feature level fusion and decision level fusion, preferably feature level fusion.

5. Music recommendation system (S) as claimed in claim 1, wherein said system yields 98.02% accuracy by recognizing emotions by combination of both speech and body gestures.

6. Music recommendation system (S) as claimed in claim 1, wherein said system yields the highest accuracy for the emotion ‘anger’.

7. Method for music recommendation by using the system as claimed in claim 1, wherein said method for music recommendation comprises the steps of:
- recognizing emotions from body gestures from a video by using the Geneva Multimodal Emotion Portrayals (GEMEP) dataset according to the emotion class, i.e. anger, fear, joy, pride and sadness, by selecting an 'apex' frame; and
- recognizing emotions from the speech by converting audio signals into visual representations of the signal, "spectrograms"; and
- combining the emotions recognized from both speech and body gestures by using feature level fusion to obtain an input emotion for the recommendation module; and
- recommending music by said recommendation module by using a 4-layer neural network architecture which takes said input emotion from said emotion module as the user's emotion
for providing appropriate music based on multi-modal emotion recognition and playing the music on the music output (MO) module.

Documents

Application Documents

# Name Date
1 202341007344-STATEMENT OF UNDERTAKING (FORM 3) [06-02-2023(online)].pdf 2023-02-06
2 202341007344-FORM FOR SMALL ENTITY(FORM-28) [06-02-2023(online)].pdf 2023-02-06
3 202341007344-FORM FOR SMALL ENTITY [06-02-2023(online)].pdf 2023-02-06
4 202341007344-FORM 1 [06-02-2023(online)].pdf 2023-02-06
5 202341007344-FIGURE OF ABSTRACT [06-02-2023(online)].pdf 2023-02-06
6 202341007344-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [06-02-2023(online)].pdf 2023-02-06
7 202341007344-EDUCATIONAL INSTITUTION(S) [06-02-2023(online)].pdf 2023-02-06
8 202341007344-DRAWINGS [06-02-2023(online)].pdf 2023-02-06
9 202341007344-DECLARATION OF INVENTORSHIP (FORM 5) [06-02-2023(online)].pdf 2023-02-06
10 202341007344-COMPLETE SPECIFICATION [06-02-2023(online)].pdf 2023-02-06
11 202341007344-Proof of Right [03-03-2023(online)].pdf 2023-03-03
12 202341007344-ENDORSEMENT BY INVENTORS [03-03-2023(online)].pdf 2023-03-03
13 202341007344-FORM-9 [07-03-2023(online)].pdf 2023-03-07
14 202341007344-FORM 18 [07-03-2023(online)].pdf 2023-03-07
15 202341007344-FORM-26 [21-03-2023(online)].pdf 2023-03-21
16 202341007344-Correspondence_Power Of Attorney_30-03-2023.pdf 2023-03-30
17 202341007344-FER.pdf 2023-10-03
18 202341007344-MARKED COPIES OF AMENDEMENTS [01-04-2024(online)].pdf 2024-04-01
19 202341007344-FORM 13 [01-04-2024(online)].pdf 2024-04-01
20 202341007344-AMMENDED DOCUMENTS [01-04-2024(online)].pdf 2024-04-01
21 202341007344-OTHERS [02-04-2024(online)].pdf 2024-04-02
22 202341007344-FER_SER_REPLY [02-04-2024(online)].pdf 2024-04-02
23 202341007344-DRAWING [02-04-2024(online)].pdf 2024-04-02
24 202341007344-CLAIMS [02-04-2024(online)].pdf 2024-04-02
25 202341007344-US(14)-HearingNotice-(HearingDate-01-07-2025).pdf 2025-05-28
26 202341007344-FORM-8 [24-06-2025(online)].pdf 2025-06-24
27 202341007344-Response to office action [27-06-2025(online)].pdf 2025-06-27
28 202341007344-Correspondence to notify the Controller [27-06-2025(online)].pdf 2025-06-27
29 202341007344-Response to office action [10-07-2025(online)].pdf 2025-07-10
30 202341007344-RELEVANT DOCUMENTS [10-07-2025(online)].pdf 2025-07-10
31 202341007344-MARKED COPIES OF AMENDEMENTS [10-07-2025(online)].pdf 2025-07-10
32 202341007344-FORM 13 [10-07-2025(online)].pdf 2025-07-10
33 202341007344-Annexure [10-07-2025(online)].pdf 2025-07-10
34 202341007344-AMMENDED DOCUMENTS [10-07-2025(online)].pdf 2025-07-10
35 202341007344-PatentCertificate22-07-2025.pdf 2025-07-22
36 202341007344-IntimationOfGrant22-07-2025.pdf 2025-07-22

Search Strategy

1 Screenshot2023-09-30144610E_30-09-2023.pdf

ERegister / Renewals

3rd: 28 Jul 2025

From 06/02/2025 - To 06/02/2026