Abstract: A method (300) for determining real time customer sentiments and corresponding actionable interventions. The method (300) includes receiving (302) in real time, via user interface (110), multimodal data of video feed (216) corresponding to customer-agent interaction; determining (314) set of sentiments corresponding to customer-agent interaction from text data, audio data, and video data using set of AI models; generating (332) final sentiment (218) corresponding to customer-agent interaction using configurable matrix to merge text sentiment, audio sentiment, and facial sentiment; when final sentiment (218) of customer-agent interaction is below predefined threshold sentiment, identifying (334) causal factors (220) for final sentiment (218) based on set of sentiments using topic modelling algorithm; generating (336) in real time, remedies (222) corresponding to causal factors (220) using GenAI model; predicting (338) customer churn (224) based on video data corresponding to customer (404) using churn prediction model. [To be published with FIG. 2]
DESCRIPTION
Technical Field
[0001] This disclosure relates generally to sentiment detectors, and more particularly to method and system for determining real-time customer sentiments and corresponding actionable interventions.
Background
[0002] Currently, customer relationship management in customer interactions relies entirely on the skills and proficiency of human agents (or advisors). The agents may be required to manually identify customer reactions using a variety of signals from the customer (such as language, tone of voice, audio pitch, and facial expressions). Further, the agent may need to interpret these signals and correlate them with the communication to determine whether the interaction has a positive, negative, or neutral impact on the customer. If the impact is negative or neutral, the agent must then take corrective action.
[0003] However, the manual process has several limitations, such as a high dependency on agent skill and availability. When the stakes increase (for example, in interactions relating to higher-value transactions), the challenge for the agent to positively address customer queries also increases. Faltering under such circumstances can negatively impact the customer interaction and possible customer upsell, cross-sell, and retention. Additionally, interpreting customer feedback in the form of signals (verbal and non-verbal) across all communication modes is a difficult skill to master and may not come naturally to all individuals.
[0004] The present invention is directed to overcome one or more limitations stated above or any other limitations associated with the known arts.
SUMMARY
[0005] In one embodiment, a method for determining real-time customer sentiments and corresponding actionable interventions is disclosed. In one example, the method may include receiving, via a user interface, multimodal data of a video feed corresponding to a customer-agent interaction. It should be noted that the multimodal data may include text data, audio data, and video data corresponding to each of a customer and an agent. The method may further include determining a set of sentiments corresponding to the customer-agent interaction from the text data, the audio data, and the video data using a set of Artificial Intelligence (AI) models. It should be noted that the set of sentiments may include a text sentiment, an audio sentiment, and a facial sentiment for each of the customer and the agent. The method may further include generating a final sentiment corresponding to the customer-agent interaction using a configurable matrix to merge the text sentiment, the audio sentiment, and the facial sentiment. When the final sentiment of the customer-agent interaction is below a predefined threshold sentiment, the method may further include identifying one or more causal factors for the final sentiment based on the set of sentiments using a topic modelling algorithm. The method may further include generating in real-time, one or more remedies corresponding to the one or more causal factors using a Generative Artificial Intelligence (GenAI) model. The method may further include predicting a customer churn based on the multimodal data corresponding to the customer using a churn prediction model. It should be noted that the churn prediction model is a classifier model.
[0006] In another embodiment, a system for determining real-time customer sentiments and corresponding actionable interventions is disclosed. In one example, the system may include a processor and a computer-readable medium communicatively coupled to the processor. The computer-readable medium may store processor-executable instructions, which, on execution, may cause the processor to receive in real-time, via a user interface, multimodal data of a video feed corresponding to a customer-agent interaction. It should be noted that the multimodal data may include text data, audio data, and video data corresponding to each of a customer and an agent. The processor-executable instructions, on execution, may further cause the processor to determine a set of sentiments corresponding to the customer-agent interaction from the text data, the audio data, and the video data using a set of AI models. It should be noted that the set of sentiments may include a text sentiment, an audio sentiment, and a facial sentiment for each of the customer and the agent. The processor-executable instructions, on execution, may further cause the processor to generate a final sentiment corresponding to the customer-agent interaction using a configurable matrix to merge the text sentiment, the audio sentiment, and the facial sentiment. When the final sentiment of the customer-agent interaction is below a predefined threshold sentiment, the processor-executable instructions, on execution, may further cause the processor to identify one or more causal factors for the final sentiment based on the set of sentiments using a topic modelling algorithm. The processor-executable instructions, on execution, may further cause the processor to generate in real-time one or more remedies corresponding to the one or more causal factors using a GenAI model. The processor-executable instructions, on execution, may further cause the processor to predict a customer churn based on the multimodal data corresponding to the customer using a churn prediction model. It should be noted that the churn prediction model is a classifier model.
[0007] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
[0009] FIG. 1 is a block diagram of an exemplary system for determining real-time customer sentiments and corresponding actionable interventions, in accordance with some embodiments of the present disclosure.
[0010] FIG. 2 illustrates a functional block diagram of a system for determining real-time customer sentiments and corresponding actionable interventions, in accordance with some embodiments of the present disclosure.
[0011] FIGS. 3A and 3B illustrate a flow diagram of an exemplary process for determining real-time customer sentiments and corresponding actionable interventions, in accordance with some embodiments of the present disclosure.
[0012] FIG. 4 illustrates a flow diagram of a detailed exemplary process for determining real-time customer sentiments and corresponding actionable interventions, in accordance with some embodiments of the present disclosure.
[0013] FIGS. 5A and 5B illustrate various exemplary Graphical User Interfaces (GUIs), in accordance with some embodiments of the present disclosure.
[0014] FIG. 6 illustrates a flow diagram of an exemplary process for training a churn prediction model, in accordance with some embodiments of the present disclosure.
[0015] FIG. 7 is a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
DETAILED DESCRIPTION
[0016] Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[0017] Referring now to FIG. 1, a block diagram of an exemplary system 100 for determining real-time customer sentiments and corresponding actionable interventions is illustrated, in accordance with some embodiments of the present disclosure. The system 100 may include a detecting device 102. The detecting device 102 may be, for example, but may not be limited to, a server, a desktop, a laptop, a notebook, a netbook, a tablet, a smartphone, a mobile phone, or any other computing device, in accordance with some embodiments of the present disclosure. The detecting device 102 may determine in real-time, sentiment corresponding to a customer-agent interaction from multimodal interaction data based on multiple communication signals (such as spoken words, facial expressions, tone of dialogue). For a negative sentiment, the detecting device 102 may identify causal factors for the negative sentiment. Further, the detecting device 102 may generate remedies corresponding to the causal factors. Additionally, the detecting device 102 may predict whether the customer will churn.
[0018] As will be described in greater detail in conjunction with FIGS. 2 – 7, the detecting device 102 may receive in real-time, via a user interface, multimodal data of a video feed corresponding to a customer-agent interaction. The multimodal data may include text data, audio data, and video data corresponding to each of a customer and an agent. The detecting device 102 may further determine a set of sentiments corresponding to the customer-agent interaction from the text data, the audio data, and the video data using a set of Artificial Intelligence (AI) models. The set of sentiments may include a text sentiment, an audio sentiment, and a facial sentiment for each of the customer and the agent. The detecting device 102 may further generate a final sentiment corresponding to the customer-agent interaction using a configurable matrix to merge the text sentiment, the audio sentiment, and the facial sentiment. When the final sentiment of the customer-agent interaction is below a predefined threshold sentiment, the detecting device 102 may further identify one or more causal factors for the final sentiment based on the set of sentiments using a topic modelling algorithm. The detecting device 102 may further generate in real-time, one or more remedies corresponding to the one or more causal factors using a Generative Artificial Intelligence (GenAI) model. The detecting device 102 may further predict a customer churn based on the multimodal data corresponding to the customer using a churn prediction model. It should be noted that the churn prediction model is a classifier model.
[0019] In some embodiments, the detecting device 102 may include one or more processors 104 and a memory 106. Further, the memory 106 may store instructions that, when executed by the one or more processors 104, may cause the one or more processors 104 to determine real-time customer sentiments and corresponding actionable interventions, in accordance with aspects of the present disclosure. The memory 106 may also store various data (for example, multimodal data, a set of sentiments, a final sentiment, one or more causal factors, one or more remedies, and the like) that may be captured, processed, and/or required by the system 100.
[0020] The system 100 may further include a display 108. The system 100 may interact with a user interface 110 accessible via the display 108. The system 100 may also include one or more external devices 112. In some embodiments, the detecting device 102 may interact with the one or more external devices 112 over a communication network 114 for sending or receiving various data. The communication network 114 may include, for example, but may not be limited to, a wireless fidelity (Wi-Fi) network, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, and a combination thereof. The one or more external devices 112 may include, but may not be limited to, a remote server, a laptop, a netbook, a notebook, a smartphone, a mobile phone, a tablet, or any other computing device.
[0021] Referring now to FIG. 2, a functional block diagram of a system 200 for determining real-time customer sentiments and corresponding actionable interventions is illustrated, in accordance with some embodiments of the present disclosure. FIG. 2 is explained in conjunction with FIG. 1. The system 200 may be analogous to the system 100. The system 200 may implement the detecting device 102. The detecting device 102 may include, within the memory 106, a receiving module 202, a speaker shift detecting module 204, a sentiment determining module 206, a factor identifying module 208, a remedy generating module 210, a churn predicting module 212, and an AI module 214. In an embodiment, the AI module 214 may include a set of AI models, a generative AI model, and a churn prediction model (which is a classifier model).
[0022] Upon initiation of a customer-agent interaction over a video call, the receiving module 202 may receive in real-time, via a user interface (such as the user interface 110), multimodal data of a video feed 216 corresponding to the customer-agent interaction. The multimodal data may include text data, audio data, and video data corresponding to each of a customer and an agent. Further, the receiving module 202 may send the video feed 216 to the speaker shift detecting module 204.
[0023] Further, the speaker shift detecting module 204 may extract the audio data from the video feed 216 using an audio extraction technique. The audio extraction technique may separate audio from the video feed 216. Further, the speaker shift detecting module 204 may process in real time, a set of data points from the audio data based on predefined time intervals (for example, every 10 seconds, 15 seconds, 20 seconds, or the like). The set of data points may correspond to audio frequency, audio amplitude, or a combination thereof.
[0024] To process the set of data points, the speaker shift detecting module 204 may calculate a mean of the set of data points using an audio processing technique. Further, the speaker shift detecting module 204 may detect a speaker shift based on a comparison of the mean with a predefined threshold value to obtain one or more speaker shifts in the customer-agent interaction. The speaker shift is indicative of a change in an active speaker during the customer-agent interaction. It should be noted that the active speaker is one of the customer or the agent. In other words, the speaker shift is a time instance in the customer-agent interaction when the active speaker changes from the customer to the agent or vice versa.
[0025] Further, the speaker shift detecting module 204 may segregate the audio data of the customer-agent interaction into customer audio data and agent audio data based on the one or more speaker shifts. The customer audio data may include one or more parts of the audio data where the active speaker corresponds to the customer. The agent audio data may include one or more parts of the audio data where the active speaker corresponds to the agent. Further, the speaker shift detecting module 204 may send the customer audio data and the agent audio data to the sentiment determining module 206.
[0026] Further, the sentiment determining module 206 may send the text data, the audio data, and the video data to the AI module 214. The AI module 214 may determine a set of sentiments corresponding to the customer-agent interaction from the text data, the audio data, and the video data using the set of AI models. The set of AI models may include AI models configured to determine the set of sentiments. In an embodiment, the set of sentiments may include a text sentiment, an audio sentiment, and a facial sentiment for each of the customer and the agent. In such an embodiment, the set of AI models may include a Natural Language Processing (NLP) model to determine the text sentiment, a supervised Machine Learning (ML) model to determine the audio sentiment, and a Computer Vision (CV) model to determine the facial sentiment.
[0027] To determine the text sentiment, the sentiment determining module 206 may transform the customer audio data and the agent audio data into customer text data (i.e., customer transcript) and agent text data (i.e., agent transcript), respectively, using a speech-to-text algorithm. For example, the speech-to-text algorithm may be, but may not be limited to, Google© Speech Recognition, Wav2Vec, Deep Speech, or the like. The customer text data may include one or more parts of the text data where the active speaker corresponds to the customer. The agent text data may include one or more parts of the text data where the active speaker corresponds to the agent. Further, the AI module 214 may determine, via the NLP model, a customer text sentiment and an agent text sentiment based on the customer text data and the agent text data, respectively.
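By way of a non-limiting illustration only, the following sketch shows one possible way to transcribe the segmented customer and agent audio and to score the text sentiment. The library choices (the speech_recognition package with Google Speech Recognition, and a pretrained transformers sentiment pipeline) and the segment file names are illustrative assumptions and are not required by the present disclosure.

```python
# Illustrative sketch: transcribe segmented audio and score text sentiment.
# Library choices and file names are assumptions for illustration only.
import speech_recognition as sr
from transformers import pipeline

def transcribe(wav_path: str) -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)      # read the whole speaker segment
    return recognizer.recognize_google(audio)  # speech-to-text (Google Speech Recognition)

sentiment_model = pipeline("sentiment-analysis")        # pretrained NLP sentiment classifier

customer_text = transcribe("customer_segment.wav")      # hypothetical customer segment
agent_text = transcribe("agent_segment.wav")            # hypothetical agent segment
customer_text_sentiment = sentiment_model(customer_text)[0]  # e.g., {'label': 'NEGATIVE', ...}
agent_text_sentiment = sentiment_model(agent_text)[0]
```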
[0028] To determine the audio sentiment, the sentiment determining module 206 may first extract a set of customer audio features (e.g., tone of the customer audio data, volume of the customer audio data, frequency of the customer audio data, or the like) and a set of agent audio features (e.g., tone of the agent audio data, volume of the agent audio data, frequency of the agent audio data, or the like) from the customer audio data and the agent audio data, respectively, using Mel Frequency Cepstral Coefficients (MFCC). By way of an example, the tone of the customer audio data and the tone of the agent audio data may be sad, happy, fearful, stressful, calm, angry, etc. Further, the AI module 214 may determine, via the supervised ML model, a customer audio sentiment and an agent audio sentiment based on the customer audio features and the agent audio features, respectively. The supervised ML model may be fine-tuned using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset.
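By way of a non-limiting illustration only, the following sketch outlines MFCC feature extraction and a supervised classifier for the audio sentiment. The librosa and scikit-learn calls, the SVM choice, and the training paths/labels are illustrative assumptions; any supervised ML model trained on a labelled corpus such as RAVDESS may be substituted.

```python
# Illustrative sketch: MFCC features from a speaker segment feed a supervised classifier.
# The SVM and the RAVDESS-style labels are example choices only.
import librosa
import numpy as np
from sklearn.svm import SVC

def mfcc_features(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Return one fixed-length MFCC vector summarising a speaker segment."""
    signal, sample_rate = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def train_audio_sentiment_model(paths, labels) -> SVC:
    """Fit a supervised model on labelled audio clips (e.g., RAVDESS)."""
    features = np.stack([mfcc_features(p) for p in paths])
    model = SVC(probability=True)
    model.fit(features, labels)
    return model

# Usage: model = train_audio_sentiment_model(train_paths, train_labels)
#        customer_tone = model.predict([mfcc_features("customer_segment.wav")])[0]
```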
[0029] To determine the facial sentiment, the sentiment determining module 206 may extract customer video data and agent video data from the multimodal data based on the customer audio data and the agent audio data, respectively. The customer video data may include one or more parts of the video data where the active speaker corresponds to the customer. The agent video data may include one or more parts of the video data where the active speaker corresponds to the agent. In other words, the sentiment determining module 206 may split the video feed 216 at time instances corresponding to the one or more speaker shifts to obtain the customer video data and the agent video data.
[0030] Further, for each frame in each of the customer video data and the agent video data, the AI module 214 may detect, via the CV model, a face of the active speaker in the frame. Further, the AI module 214 may determine, via an image classification model, a facial expression of the detected face using Reading the Mind in the Eyes Test (RMET) analysis. The RMET analysis is a psychological assessment designed to measure an individual's ability to infer mental states (such as emotion, intention, or the like) of others by looking at their eye region. For example, the image classification model may be, but may not be limited to, ResNet (Residual Networks), Visual Geometry Group (VGG), Vision Transformers (ViT), or the like.
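By way of a non-limiting illustration only, the following sketch detects the active speaker's face in each frame with a Haar cascade and passes the face crop to an expression classifier. The cascade file and the expression_model callable are illustrative assumptions standing in for the CV model and the fine-tuned image classification model described above.

```python
# Illustrative sketch: per-frame face detection and expression classification.
# `expression_model` is a placeholder for a fine-tuned classifier (e.g., ResNet/VGG/ViT).
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def facial_expressions(video_path: str, expression_model):
    """Yield one predicted expression label per frame containing a detectable face."""
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in faces:
            face_crop = frame[y:y + h, x:x + w]
            yield expression_model(face_crop)   # e.g., 'irritated', 'calm', 'happy'
    capture.release()
```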
[0031] Further, the sentiment determining module 206 may determine, via the image classification model, a customer video sentiment and an agent video sentiment based on the facial expression determined in the customer video data and the agent video data, respectively. Further, the sentiment determining module 206 may generate a final sentiment 218 of the customer-agent interaction using a configurable matrix to merge the text sentiment, the audio sentiment, and the facial sentiment. The final sentiment 218 may be, but may not be limited to, positive, negative, or neutral.
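By way of a non-limiting illustration only, the following sketch shows one possible configurable merge of the three modality sentiments. The default majority-vote rule with a facial-sentiment tie-break is an assumption; in practice the matrix entries may be edited per deployment (see FIG. 5B).

```python
# Illustrative sketch: merge text, audio, and facial sentiments via a configurable matrix.
# Explicit (user-edited) matrix entries take priority; otherwise a majority vote applies.
from collections import Counter

def merge_sentiments(text, audio, facial, override_matrix=None):
    key = (text, audio, facial)
    if override_matrix and key in override_matrix:   # explicit, configured entry wins
        return override_matrix[key]
    counts = Counter([text, audio, facial])
    label, votes = counts.most_common(1)[0]
    return label if votes >= 2 else facial           # assumed tie-break on facial sentiment

# Example: merge_sentiments("positive", "negative", "negative") -> "negative"
```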
[0032] When the final sentiment 218 of the customer-agent interaction is below a predefined threshold sentiment, the factor identifying module 208 may identify one or more causal factors 220 for the final sentiment 218 based on the set of sentiments using a topic modelling algorithm. For example, the one or more causal factors 220 may be false promises, service delays, quality concerns, or the like. For example, the topic modelling algorithm may be, but may not be limited to, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-negative Matrix Factorization (NMF), and the like. Further, the factor identifying module 208 may send the one or more causal factors 220 to the remedy generating module 210. Further, the remedy generating module 210 may send the one or more causal factors 220 to the AI module 214. The AI module 214 may generate in real time, one or more remedies 222 corresponding to the one or more causal factors using the GenAI model. By way of an example, the one or more remedies 222 may include apologizing for miscommunication, clarifying the situation, offering feasible solutions, or the like.
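By way of a non-limiting illustration only, the following sketch applies LDA over negative-sentiment utterances to surface causal-factor keywords and then composes a prompt for a GenAI model. The scikit-learn LDA calls, the prompt wording, and the commented genai_model.generate call are illustrative assumptions, not a specific GenAI API.

```python
# Illustrative sketch: topic modelling (LDA) for causal factors, then a GenAI remedy prompt.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def causal_factor_keywords(negative_utterances, n_topics=1, n_words=5):
    """Return the top keywords per topic found in the negative-sentiment utterances."""
    vectorizer = CountVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(negative_utterances)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(doc_term)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-n_words:]] for topic in lda.components_]

def remedy_prompt(keywords):
    """Build a prompt asking the GenAI model for agent-actionable remedies."""
    return ("The customer sentiment is negative. Likely causal factors: "
            f"{', '.join(keywords)}. Suggest concise remedies the agent can apply now.")

# remedies = genai_model.generate(remedy_prompt(causal_factor_keywords(utterances)[0]))
```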
[0033] Further, the churn predicting module 212 may send the multimodal data to the AI module 214. The AI module 214 may predict a customer churn 224 based on the multimodal data corresponding to the customer using a churn prediction model. The churn prediction may be a process of identifying customers or agents who are likely to stop using products (or services) in the near future. The churn prediction is widely used across industries (such as telecommunications, subscription services, banking, retail, or the like). The customer churn 224 may be represented with ‘0’ or ‘1’, where ‘0’ may correspond to a prediction that the customer will not churn, and ‘1’ may correspond to a prediction that the customer will churn in the near future (or immediately). It should be noted that the churn prediction model is a classifier model. The churn predicting module 212 may train the churn prediction model using multimodal training data corresponding to historical customer-agent interactions. The churn prediction model may include an ensemble of an NLP model, a deep learning model, and an encoder-based model.
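By way of a non-limiting illustration only, the following sketch combines the three modality-specific churn models by majority vote to produce the ‘0’/‘1’ churn label. The sub-models and their input encodings are assumed to be trained as described in conjunction with FIG. 6; majority voting is only one possible ensembling rule.

```python
# Illustrative sketch: ensemble churn prediction over text, audio, and video sub-models.
# Each sub-model is assumed to output 0 (no churn) or 1 (churn) for its modality.
def predict_churn(text_model, audio_model, video_model,
                  customer_text, customer_audio_features, customer_frames) -> int:
    votes = [
        text_model.predict([customer_text])[0],             # NLP model on the transcript
        audio_model.predict([customer_audio_features])[0],  # deep learning model on audio
        video_model.predict([customer_frames])[0],          # encoder-based model on video
    ]
    return 1 if sum(votes) >= 2 else 0                      # 1 = customer likely to churn
```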
[0034] It should be noted that all such aforementioned modules 202 – 214 may be represented as a single module or a combination of different modules. Further, as will be appreciated by those skilled in the art, each of the modules 202 – 214 may reside, in whole or in parts, on one device or multiple devices in communication with each other. In some embodiments, each of the modules 202 – 214 may be implemented as dedicated hardware circuit comprising custom application-specific integrated circuit (ASIC) or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. Each of the modules 202 – 214 may also be implemented in a programmable hardware device such as a field programmable gate array (FPGA), programmable array logic, programmable logic device, and so forth. Alternatively, each of the modules 202 – 214 may be implemented in software for execution by various types of processors (e.g., processor 104). An identified module of executable code may, for instance, include one or more physical or logical blocks of computer instructions, which may, for instance, be organized as an object, procedure, function, or other construct. Nevertheless, the executables of an identified module or component need not be physically located together but may include disparate instructions stored in different locations which, when joined logically together, include the module and achieve the stated purpose of the module. Indeed, a module of executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different applications, and across several memory devices.
[0035] As will be appreciated by one skilled in the art, a variety of processes may be employed for determining real-time customer sentiments and corresponding actionable interventions. For example, the exemplary system 100 and the associated detecting device 102, may determine real-time customer sentiments and corresponding actionable interventions, by the processes discussed herein. In particular, as will be appreciated by those of ordinary skill in the art, control logic and/or automated routines for performing the techniques and steps described herein may be implemented by the system 100 and the associated detecting device 102 either by hardware, software, or combinations of hardware and software. For example, suitable code may be accessed and executed by the one or more processors on the system 100 to perform some or all of the techniques described herein. Similarly, application specific integrated circuits (ASICs) configured to perform some or all of the processes described herein may be included in the one or more processors on the system 100.
[0036] Referring now to FIGS. 3A and 3B, an exemplary process 300 for determining real time customer sentiments and corresponding actionable interventions is illustrated via a flow chart, in accordance with some embodiments of the present disclosure. FIGS. 3A and 3B are explained in conjunction with FIG. 1 and FIG. 2. The process 300 may be implemented by the detecting device 102 of the system 100. In some embodiments, the process 300 may include receiving in real time, by a receiving module (such as the receiving module 202) via a user interface, multimodal data of a video feed (such as the video feed 216) corresponding to a customer-agent interaction, at step 302. The multimodal data may include text data, audio data, and video data corresponding to each of a customer and an agent.
[0037] Upon receiving the video feed, the process 300 may include extracting, by a speaker shift detecting module (such as the speaker shift detecting module 204), the audio data from the video feed using an audio extraction technique, at step 304. Further, the process 300 may include processing in real-time, by the speaker shift detecting module, a set of data points from the audio data based on predefined time intervals, at step 306. The step 306 may include steps 308 and 310.
[0038] Further, the process 300 may include calculating, by the speaker shift detecting module, a mean of the set of data points using an audio processing technique, at step 308. Further, the process 300 may include detecting, by the speaker shift detecting module, a speaker shift based on a comparison of the mean with a predefined threshold value to obtain one or more speaker shifts in the customer-agent interaction, at step 310. The speaker shift is indicative of a change in an active speaker during the customer-agent interaction. It should be noted that the active speaker is one of the customer or the agent.
[0039] Upon detecting the one or more speaker shifts, the process 300 may include segregating, by the speaker shift detecting module, the audio data of the customer-agent interaction into customer audio data and agent audio data based on the one or more speaker shifts, at step 312. The customer audio data may include one or more parts of the audio data where the active speaker corresponds to the customer. The agent audio data may include one or more parts of the audio data where the active speaker corresponds to the agent.
[0040] Further, the process 300 may include determining, by a sentiment determining module (such as the sentiment determining module 206), a set of sentiments corresponding to the customer-agent interaction from the text data, the audio data, and the video data using a set of AI models, at step 314. The set of sentiments may include a text sentiment, an audio sentiment, and a facial sentiment for each of the customer and the agent. The step 314 may include steps 316, 318, 320, 322, 324, 326, 328, and 330.
[0041] To determine the set of sentiments corresponding to the customer-agent interaction, the process 300 may include transforming, by the sentiment determining module, the customer audio data and the agent audio data into customer text data and agent text data, respectively, using a speech-to-text algorithm, at step 316. The customer text data may include one or more parts of the text data where the active speaker corresponds to the customer. The agent text data may include one or more parts of the text data where the active speaker corresponds to the agent.
[0042] Further, the process 300 may include determining, by the sentiment determining module via an NLP model, a customer text sentiment and an agent text sentiment based on the customer text data and the agent text data, respectively, at step 318. It should be noted that the NLP model is one of the set of AI models.
[0043] Additionally, the process 300 may include extracting, by the sentiment determining module, a set of customer audio features and a set of agent audio features from the customer audio data and the agent audio data, respectively, using MFCC, at step 320. Further, the process 300 may include determining, by the sentiment determining module via a supervised ML model, a customer audio sentiment and an agent audio sentiment based on the customer audio features and the agent audio features, respectively, at step 322. It should be noted that the supervised ML model is one of the set of AI models.
[0044] Additionally, from the multimodal data, the process 300 may include extracting, by the sentiment determining module, customer video data and agent video data based on the customer audio data and the agent audio data, respectively, at step 324. The customer video data may include one or more parts of the video data where the active speaker corresponds to the customer. The agent video data may include one or more parts of the video data where the active speaker corresponds to the agent.
[0045] Further, for each frame in each of the customer video data and the agent video data, the process 300 may include detecting, by the sentiment determining module via a CV model, a face of the active speaker in the frame, at step 326. It should be noted that the CV model is one of the set of AI models. Upon detecting the face of the active speaker, the process 300 may include determining, by the sentiment determining module via an image classification model, a facial expression of the detected face using RMET analysis, at step 328. It should be noted that the image classification model is one of the set of AI models.
[0046] Further, the process 300 may include determining, by the sentiment determining module via the image classification model, a customer video sentiment and an agent video sentiment based on the facial expression determined in the customer video data and the agent video data, respectively, at step 330.
[0047] Once the set of sentiments are determined, the process 300 may include generating, by the sentiment determining module, a final sentiment (such as the final sentiment 218) corresponding to the customer-agent interaction using a configurable matrix to merge the text sentiment, the audio sentiment, and the facial sentiment, at step 332.
[0048] In some embodiments, when the final sentiment of the customer-agent interaction is below a predefined threshold sentiment, the process 300 may include identifying, by a factor identifying module (such as the factor identifying module 208), one or more causal factors (such as the one or more causal factors 220) for the final sentiment based on the set of sentiments using a topic modelling algorithm, at step 334.
[0049] Upon identifying the one or more causal factors, the process 300 may include generating in real time, by a remedy generating module (such as the remedy generating module 210), one or more remedies (such as the one or more remedies 222) corresponding to the one or more causal factors using a GenAI model, at step 336.
[0050] Further, the process 300 may include predicting, by a churn predicting module (such as the churn predicting module 212), a customer churn (such as the customer churn 224) based on the multimodal data corresponding to the customer using a churn prediction model, at step 338. The churn prediction model is a classifier model. The step 338 may include step 340.
[0051] The process 300 may include training, by the churn predicting module, the churn prediction model using multimodal training data corresponding to historical customer-agent interactions, at step 340. The churn prediction model may include an ensemble of an NLP model, a deep learning model, and an encoder-based model.
[0052] Referring now to FIG. 4, a detailed exemplary process 400 for determining real time customer sentiments and corresponding actionable interventions is illustrated via a flow chart, in accordance with some embodiments of the present disclosure. FIG. 4 is explained in conjunction with FIGS. 1, 2, and 3. In some embodiments, the process 400 may include receiving in real time, a video feed (analogous to the video feed 216) corresponding to a video conferencing 402 between a customer 404 and an agent 406. The video feed may include text data, audio data, and video data corresponding to each of the customer 404 and the agent 406.
[0053] Upon receiving the video feed of the video conferencing 402, the process 400 may include audio segmentation for speaker detection, at step 408. To detect the one or more speaker shifts, the audio data may be extracted from the video feed using an audio extraction technique. Further, a set of data points from the audio data may be processed in real-time, based on predefined time intervals. By way of an example, consider a three-data-point window from the audio data at the predefined time interval of ‘0.1’ seconds. Thus, the three data points may be audio (t), audio (t-0.1), and audio (t-0.2). Further, a mean of the set of data points (i.e., the audio (t), the audio (t-0.1), and the audio (t-0.2)) may be calculated using an audio processing technique.
[0054] Similarly, the mean may be continuously calculated as and when new data points are received in real-time. In an embodiment, each window size may include 3 data points. If the mean decreases and becomes ‘0’ (or if there is any sudden jump in the mean), the immediately previous data point (i.e., audio(t-0.2)) may be considered as the end of the segment (i.e., current active speaker) and the immediately next point (i.e., audio(t)) may be the start of the next segment (i.e., change of active speaker).
[0055] Thus, a speaker shift may be detected based on a comparison of the mean with a predefined threshold value to obtain the one or more speaker shifts in the customer-agent interaction. Further, the audio segmentation may segregate the audio data of the customer-agent interaction into customer audio data and agent audio data based on the one or more speaker shifts.
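By way of a non-limiting illustration only, the following sketch implements the windowed-mean check described above, assuming one amplitude data point per predefined time interval (e.g., ‘0.1’ seconds) and a three-data-point window.

```python
# Illustrative sketch: detect speaker shifts where the rolling mean drops to the threshold.
import numpy as np

def detect_speaker_shifts(amplitudes, window=3, threshold=0.0):
    """Return indices where the windowed mean falls to the threshold (a speaker shift)."""
    shifts = []
    for t in range(window - 1, len(amplitudes)):
        mean = np.mean(amplitudes[t - window + 1 : t + 1])
        if mean <= threshold:      # silence between speakers ends the current segment
            shifts.append(t)
    return shifts

# Example: detect_speaker_shifts([0.4, 0.5, 0.3, 0.0, 0.0, 0.0, 0.6, 0.5]) -> [5]
```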
[0056] Once the customer audio data and the agent audio data are obtained, the process 400 may include determining in real time a text sentiment corresponding to the customer-agent interaction using a text sentiment detector 410. To determine the text sentiment corresponding to the customer-agent interaction, the text sentiment detector 410 may transform the customer audio data and the agent audio data into customer text data and agent text data, respectively, using a speech-to-text algorithm (e.g., Wav2Vec).
[0057] Further, the text sentiment detector 410 may determine, via an NLP model, a customer text sentiment and an agent text sentiment based on the customer text data and the agent text data, respectively. Each of the customer text sentiment and the agent text sentiment may be a score indicative of sentiment. Further, the text sentiment detector 410 may determine in real time, the text sentiment corresponding to the video conferencing 402 based on the customer text sentiment and the agent text sentiment. In other words, the text sentiment may be an overall sentiment score based on the customer text sentiment and the agent text sentiment. When the text sentiment is above a predefined threshold score, the text sentiment is classified as ‘positive’ and when the text sentiment is below a predefined threshold score, the text sentiment is classified as ‘negative’. It should be noted that a similar comparison with a predefined threshold score may be used to classify each of the customer text sentiment and the agent text sentiment as ‘positive’ or ‘negative’.
[0058] Additionally, the process 400 may include determining in real-time, an audio sentiment corresponding to the video conferencing 402 using an audio sentiment detector 412. To determine the audio sentiment, the audio sentiment detector 412 may extract a set of customer audio features (e.g., tone of the customer audio data) and a set of agent audio features (e.g., tone of the agent audio data) from the customer audio data and the agent audio data, respectively, using the MFCC. Further, the audio sentiment detector 412 may identify the tone of customer audio data and the tone of agent audio data using a supervised ML model.
[0059] Further, the audio sentiment detector 412 may determine a customer audio sentiment and an agent audio sentiment based on the tone of customer audio data and the tone of agent audio data, respectively, using the supervised ML model. Each of the customer audio sentiment and the agent audio sentiment may be a score indicative of sentiment.
[0060] Upon determining the customer audio sentiment and the agent audio sentiment, the audio sentiment detector 412 may determine, the audio sentiment corresponding to the customer-agent interaction based on the customer audio sentiment and the agent audio sentiment. In other words, the audio sentiment may be an overall sentiment score based on the customer audio sentiment and the agent audio sentiment. When the audio sentiment is above a predefined threshold score, the audio sentiment is classified as ‘positive’ and when the audio sentiment is below a predefined threshold score, the audio sentiment is classified as ‘negative’. It should be noted that a similar comparison with a predefined threshold score may be used to classify each of the customer audio sentiment and the agent audio sentiment as ‘positive’ or ‘negative’.
[0061] Additionally, the process 400 may include determining a facial sentiment corresponding to the customer-agent interaction using a facial emotion detector 414. To determine the facial sentiment, the facial emotion detector 414 may extract customer video data and agent video data based on the customer audio data and the agent audio data, respectively. Further, for each frame in each of the customer video data and the agent video data, the facial emotion detector 414 may detect, via a CV model (e.g., a Cascade Classifier), a face of the active speaker in the frame. Further, the facial emotion detector 414 may determine, via an image classification model (e.g., a State-Of-The-Art (SOTA) image classification model), a facial expression of the detected face using an RMET analysis.
[0062] By way of an example, when the active speaker is the customer 404, the RMET analysis may recognize the facial expression of the customer 404 (e.g., ‘Irritated’). On the other hand, when the active speaker is the agent 406, the RMET analysis may recognize the facial expression of the agent 406 (e.g., ‘Rude’).
[0063] Further, the facial emotion detector 414 may determine, via the image classification model, a customer video sentiment and an agent video sentiment based on the customer facial expression and the agent facial expression, respectively. Each of the customer video sentiment and the agent video sentiment may be a score indicative of sentiment. Further, the facial emotion detector 414 may determine in real time, the facial sentiment corresponding to the customer-agent interaction based on the customer video sentiment and the agent video sentiment. In other words, the facial sentiment may be an overall sentiment score based on the customer video sentiment and the agent video sentiment. When the facial sentiment is above a predefined threshold score, the facial sentiment is classified as ‘positive’ and when the facial sentiment is below a predefined threshold score, the facial sentiment is classified as ‘negative’. It should be noted that a similar comparison with a predefined threshold score may be used to classify each of the customer video sentiment and the agent video sentiment as ‘positive’ or ‘negative’.
[0064] Upon determining the text sentiment, the audio sentiment, and the facial sentiment, the process 400 may include combining (or merging) the text sentiment, the audio sentiment, and the video sentiment using a configurable matrix to generate a final sentiment 416 corresponding to the customer-agent interaction. Further, the process 400 may include comparing the final sentiment 416 corresponding to the customer-agent interaction with a predefined threshold sentiment score.
[0065] In some embodiments, when the final sentiment 416 is below the predefined threshold sentiment score, the final sentiment 416 may correspond to ‘Negative’ sentiment. If the final sentiment 416 is ‘Negative’, the process 400 may include problem identification for the ‘Negative’ final sentiment, at step 418. The identified problems may be analogous to the one or more causal factors 220. The problems may be identified based on the text sentiment, the audio sentiment, and the facial sentiment using a topic modelling algorithm (e.g., an LDA). By way of an example, one of the problems (or key issues) may be ‘False Promises’ for the ‘Negative’ final sentiment 416.
[0066] Further, the process 400 may include real-time remedy generation corresponding to the identified problems, at step 420. The remedies may be analogous to the remedies 222. The remedies may be generated based on the text sentiment, the audio sentiment, the facial sentiment, and the problems using the GenAI model (for example, a Large Language Model (LLM)). In continuation with the above example, the remedy suggested by the GenAI model for the identified problem may be ‘Apologizing for the False Promises’.
[0067] Further, the process 400 may include predicting a customer churn based on the text data, the audio data, and the video data corresponding to the customer 404 using a churn prediction model 422. In continuation with the above example, the customer churn for the customer 404 may be ‘0’ (i.e., customer churn is not likely). This is further explained in greater detail in conjunction with FIG. 6.
[0068] Referring now to FIG. 5A, an exemplary GUI 500A is illustrated, in accordance with some embodiments of the present disclosure. FIG. 5A is explained in conjunction with FIGS. 1, 2, 3, and 4. The GUI 500A may include a set of elements for the agent 406. The GUI 500A may include an element 502A. For example, the agent 406 may continuously monitor (or see) the customer 404 during a customer-agent interaction, via the element 502A. The GUI 500A may further include an element 504A for a text sentiment corresponding to the customer-agent interaction. For example, the text sentiment 504A may be ‘Positive’. The GUI 500A may further include an element 506A for an audio sentiment corresponding to the customer-agent interaction. For example, the audio sentiment 506A may be ‘Negative’. The GUI 500A may further include an element 508A for a facial sentiment corresponding to the customer-agent interaction. For example, the facial sentiment 508A may be ‘Negative’.
[0069] The GUI 500A may further include an element 510A for a time interval (or boundary). For example, the time interval may be ‘0.30’ – ‘0.59’ seconds. The GUI 500A may further include an element 512A for a final sentiment (analogous to the final sentiment 218) corresponding to the customer-agent interaction. For example, the final sentiment 512A may be ‘Negative’. The GUI 500A may further include an element 514A for problems (analogous to the one or more causal factors 220) which cause the final sentiment to be negative. For example, the problems 514A may be ‘False Promises’. The GUI 500A may further include an element 516A for remedies (analogous to the one or more remedies 222) corresponding to the problems. For example, the remedies 516A may be ‘Apologize for Miscommunication’. The GUI 500A may further include an element 518A for the customer churn (analogous to the customer churn 224). For example, the customer churn may be ‘No’.
[0070] Referring now to FIG. 5B, another exemplary GUI 500B is illustrated, in accordance with some embodiments of the present disclosure. FIG. 5B is explained in conjunction with FIGS. 1, 2, 3, 4, and 5A. The GUI 500B may include the final sentiment decision 502B corresponding to a customer-agent interaction. The final sentiment decision 502B may be presented in a tabular format. The final sentiment decision 502B may include a column 504B for an audio sentiment, a column 506B for a video sentiment, a column 508B for a text sentiment, and a column 510B for a final sentiment corresponding to the customer-agent interaction. Each sentiment may be ‘Positive’, ‘Negative’, or ‘Neutral’. For example, ‘Positive’ sentiment may be represented with a ‘laughing emoji’, ‘Negative’ sentiment may be represented with a ‘sad emoji’, and ‘Neutral’ sentiment may be represented with a ‘smiley emoji’.
[0071] For row ‘1’, the audio sentiment 504B may be ‘sad emoji (i.e., Negative)’, the video sentiment 506B may be ‘sad emoji’, the text sentiment 508B may be ‘sad emoji’, and the final sentiment 510B may be ‘sad emoji’.
[0072] For row ‘2’, the audio sentiment 504B may be ‘sad emoji’, the video sentiment 506B may be ‘sad emoji’, the text sentiment 508B may be ‘laughing emoji (Positive)’, and the final sentiment 510B may be ‘sad emoji’.
[0073] For row ‘3’, the audio sentiment 504B may be ‘sad emoji’, the video sentiment 506B may be ‘laughing emoji’, the text sentiment 508B may be ‘sad emoji’, and the final sentiment 510B may be ‘sad emoji’.
[0074] The GUI 500B may further include an edit option 512B. For example, the edit option 512B may be used by the agent 406 to edit the configurable matrix of different sentiments (i.e., the text sentiment, the audio sentiment, the video sentiment, and the final sentiment). By way of an example, if the customer 404 appears ‘negative’ in tone and facial expressions but what the customer 404 is saying is ‘positive’, the final sentiment may be edited to ‘positive’. The GUI 500B may further include a save option 514B. The save option 514B may be used to save the changes, if any.
[0075] Referring now to FIG. 6, an exemplary process 600 for training the churn prediction model 422 is illustrated via a flow chart, in accordance with some embodiments of the present disclosure. FIG. 6 is explained in conjunction with FIGS. 1, 2, 3, 4, 5A, and 5B. In some embodiments, the process 600 may include receiving, by the receiving module 202, multimodal data corresponding to historical customer-agent interactions. The multimodal data may include text data, video data, and audio data. Upon receiving the historical customer-agent interactions, the process 600 may include building an NLP model for predicting the customer churn 224 based on the text data corresponding to the customer 404.
[0076] To build the NLP model, the process 600 may include transforming the audio data of the customer 404 into the text data using a speech-to-text engine. Further, the process 600 may include building the NLP model using a SOTA classification technique. The NLP model may be used to predict customer churn during the customer-agent interaction using the text data.
[0077] Similarly, the process 600 may include building a deep learning model for predicting the customer churn 224 based on the audio data. To build the deep learning model, the process 600 may include identifying different time stamps of the audio data of the customer 404. Further, the process 600 may include extracting customer audio features (e.g., tone of the customer audio data) from the customer audio data using MFCC. Further, the process 600 may include building the deep learning model for customer churn prediction using the customer audio data.
[0078] Similarly, the process 600 may include building an encoder-based model 602 for predicting the customer churn 224 based on the video data corresponding to the customer 404. To build the encoder-based model 602, the process 600 may include extracting frames (or images) corresponding to the historical customer video data at each time stamp. For example, an image may be extracted at time ‘t1’. Similarly, an image may be extracted at time ‘t2’, an image may be extracted at time ‘t3’, …, up to time ‘tn’. Upon extracting the images, the process 600 may include detecting a face using a CV model, at step 602. In continuation with the above example, a face may be detected at time ‘t1’. Similarly, a face may be detected at time ‘t2’, at time ‘t3’, …, up to time ‘tn’.
[0079] Further, the process 600 may include building the encoder-based model 602 to identify attention between different parts of the video data corresponding to the customer 404. Further, the process 600 may include creating a classifier 606 using CLS tokens. The classifier 606 may be used to predict whether the customer will churn or not. Upon building all three models, the process 600 may include combining the NLP model, the deep learning model, and the encoder-based model 602 to build the churn prediction model 422 for customer churn prediction using ensemble-based learning.
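By way of a non-limiting illustration only, the following PyTorch sketch shows an encoder-based model that prepends a learnable CLS token to per-frame face embeddings and classifies churn from the CLS output. The embedding dimension, layer counts, and the upstream step that produces the face embeddings are illustrative assumptions.

```python
# Illustrative sketch: encoder-based churn classifier over per-frame face embeddings.
# Face embeddings are assumed to be produced upstream (e.g., by the CV/face model).
import torch
import torch.nn as nn

class VideoChurnEncoder(nn.Module):
    def __init__(self, embed_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable CLS token
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, 2)        # 0 = no churn, 1 = churn

    def forward(self, frame_embeddings):           # shape: (batch, frames, embed_dim)
        batch = frame_embeddings.size(0)
        cls = self.cls_token.expand(batch, -1, -1)
        encoded = self.encoder(torch.cat([cls, frame_embeddings], dim=1))
        return self.head(encoded[:, 0])            # classify from the CLS position

# Example: logits = VideoChurnEncoder()(torch.randn(1, 30, 256))
```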
[0080] As will be also appreciated, the above-described techniques may take the form of computer or controller implemented processes and apparatuses for practicing those processes. The disclosure can also be embodied in the form of computer program code containing instructions embodied in tangible media, such as floppy diskettes, solid state drives, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer or controller, the computer becomes an apparatus for practicing the invention. The disclosure may also be embodied in the form of computer program code or signal, for example, whether stored in a storage medium, loaded into and/or executed by a computer or controller, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
[0081] The disclosed methods and systems may be implemented on a conventional or a general-purpose computer system, such as a personal computer (PC) or server computer. Referring now to FIG. 7, an exemplary computing system 700 that may be employed to implement processing functionality for various embodiments (e.g., as a SIMD device, client device, server device, one or more processors, or the like) is illustrated. Those skilled in the relevant art will also recognize how to implement the invention using other computer systems or architectures. The computing system 700 may represent, for example, a user device such as a desktop, a laptop, a mobile phone, personal entertainment device, DVR, and so on, or any other type of special or general-purpose computing device as may be desirable or appropriate for a given application or environment. The computing system 700 may include one or more processors, such as a processor 702 that may be implemented using a general or special purpose processing engine such as, for example, a microprocessor, microcontroller or other control logic. In this example, the processor 702 is connected to a bus 704 or other communication medium. In some embodiments, the processor 702 may be an Artificial Intelligence (AI) processor, which may be implemented as a Tensor Processing Unit (TPU), a Graphics Processing Unit (GPU), or a custom programmable solution such as a Field-Programmable Gate Array (FPGA).
[0082] The computing system 700 may also include a memory 706 (main memory), for example, Random Access Memory (RAM) or other dynamic memory, for storing information and instructions to be executed by the processor 702. The memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by the processor 702. The computing system 700 may likewise include a read only memory (“ROM”) or other static storage device coupled to bus 704 for storing static information and instructions for the processor 702.
[0083] The computing system 700 may also include storage devices 708, which may include, for example, a media drive 710 and a removable storage interface. The media drive 710 may include a drive or other mechanism to support fixed or removable storage media, such as a hard disk drive, a floppy disk drive, a magnetic tape drive, an SD card port, a USB port, a micro USB, an optical disk drive, a CD or DVD drive (R or RW), or other removable or fixed media drive. A storage media 712 may include, for example, a hard disk, magnetic tape, flash drive, or other fixed or removable medium that is read by and written to by the media drive 710. As these examples illustrate, the storage media 712 may include a computer-readable storage medium having stored therein particular computer software or data.
[0084] In alternative embodiments, the storage devices 708 may include other similar instrumentalities for allowing computer programs or other instructions or data to be loaded into the computing system 700. Such instrumentalities may include, for example, a removable storage unit 714 and a storage unit interface 716, such as a program cartridge and cartridge interface, a removable memory (for example, a flash memory or other removable memory module) and memory slot, and other removable storage units and interfaces that allow software and data to be transferred from the removable storage unit 714 to the computing system 700.
[0085] The computing system 700 may also include a communications interface 718. The communications interface 718 may be used to allow software and data to be transferred between the computing system 700 and external devices. Examples of the communications interface 718 may include a network interface (such as an Ethernet or other NIC card), a communications port (such as for example, a USB port, a micro USB port), Near field Communication (NFC), etc. Software and data transferred via the communications interface 718 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by the communications interface 718. These signals are provided to the communications interface 718 via a channel 720. The channel 720 may carry signals and may be implemented using a wireless medium, wire or cable, fiber optics, or other communications medium. Some examples of the channel 720 may include a phone line, a cellular phone link, an RF link, a Bluetooth link, a network interface, a local or wide area network, and other communications channels.
[0086] The computing system 700 may further include Input/Output (I/O) devices 722. Examples may include, but are not limited to a display, keypad, microphone, audio speakers, vibrating motor, LED lights, etc. The I/O devices 722 may receive input from a user and also display an output of the computation performed by the processor 702. In this document, the terms “computer program product” and “computer-readable medium” may be used generally to refer to media such as, for example, the memory 706, the storage devices 708, the removable storage unit 714, or signal(s) on the channel 720. These and other forms of computer-readable media may be involved in providing one or more sequences of one or more instructions to the processor 702 for execution. Such instructions, generally referred to as “computer program code” (which may be grouped in the form of computer programs or other groupings), when executed, enable the computing system 700 to perform features or functions of embodiments of the present invention.
[0087] In an embodiment where the elements are implemented using software, the software may be stored in a computer-readable medium and loaded into the computing system 700 using, for example, the removable storage unit 714, the media drive 710 or the communications interface 718. The control logic (in this example, software instructions or computer program code), when executed by the processor 702, causes the processor 702 to perform the functions of the invention as described herein.
[0088] Various embodiments provide a method and system for determining real time customer sentiments and corresponding actionable interventions. The disclosed method and system may receive in real time, via a user interface, multimodal data of a video feed corresponding to a customer-agent interaction. The multimodal data may include text data, audio data, and video data corresponding to each of a customer and an agent. Further, the disclosed method and system may determine a set of sentiments corresponding to the customer-agent interaction from the text data, the audio data, and the video data using a set of AI models. The set of sentiments may include a text sentiment, an audio sentiment, and a facial sentiment for each of the customer and the agent. Further, the disclosed method and system may generate a final sentiment corresponding to the customer-agent interaction using a configurable matrix to merge the text sentiment, the audio sentiment, and the facial sentiment. Further, when the final sentiment of the customer-agent interaction is below a predefined threshold sentiment, the disclosed method and system may identify one or more causal factors for the final sentiment based on the set of sentiments using a topic modelling algorithm. Moreover, the disclosed method and system may generate, in real time, one or more remedies corresponding to the one or more causal factors using a GenAI model. Thereafter, the disclosed method and system may predict a customer churn based on the multimodal data corresponding to the customer using a churn prediction model. It should be noted that the churn prediction model is a classifier model.
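By way of a non-limiting illustration of the sentiment-merging and thresholding steps summarized above, the following Python sketch assumes that each modality yields a numeric score in [-1, 1] and that the configurable matrix is realized as per-modality weights; the names (SentimentSet, merge_sentiments) and the threshold value are illustrative assumptions and are not specified by the disclosure.

```python
# Minimal sketch of merging per-modality sentiments via a configurable matrix
# (here, a dictionary of weights). Scores, names, and threshold are assumed.
from dataclasses import dataclass


@dataclass
class SentimentSet:
    text: float    # text sentiment score in [-1, 1]
    audio: float   # audio sentiment score in [-1, 1]
    facial: float  # facial sentiment score in [-1, 1]


def merge_sentiments(s: SentimentSet, weights: dict[str, float]) -> float:
    """Combine the per-modality sentiments into a single final sentiment."""
    total = sum(weights.values())
    return (weights["text"] * s.text
            + weights["audio"] * s.audio
            + weights["facial"] * s.facial) / total


# Example configurable matrix: weigh facial cues slightly more than text/audio.
config = {"text": 0.3, "audio": 0.3, "facial": 0.4}
customer = SentimentSet(text=-0.2, audio=-0.5, facial=-0.6)

final_sentiment = merge_sentiments(customer, config)
THRESHOLD = 0.0  # predefined threshold sentiment (assumed value)

if final_sentiment < THRESHOLD:
    # Downstream steps (causal factors, remedies, churn) would be triggered here.
    print(f"Negative interaction detected: {final_sentiment:.2f}")
```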
[0089] Thus, the disclosed method and system try to overcome the technical problem of determining real time customer sentiments and corresponding actionable interventions. The disclosed method and system may detect customer expressions (i.e., positive, negative, or neutral) in real time, which allows an agent to take immediate corrective action. This may reduce the delay in addressing customer concerns during interactions (e.g., video conferencing). Further, the disclosed method and system may automate the interpretation of customer reactions, which reduces the cognitive load on agents. This may allow the agents to focus more on problem solving rather than deciphering customer emotions. Further, the disclosed method and system may integrate signals from different sentiments, such as tone of voice (i.e., audio sentiment), audio pitch, and facial expression (i.e., facial sentiment), to ensure comprehensive sentiment detection. This may provide better accuracy than systems that rely on a single signal. Further, the disclosed method and system may operate seamlessly in high-stakes or high-volume interactions, making them suitable for organizations with large customer bases. Additionally, the disclosed method and system may reduce the reliance (or dependency) on highly trained agents, enabling scale without compromising quality. Further, the disclosed method and system may deliver more personalized and empathetic responses by identifying the root causes of dissatisfaction. Additionally, the disclosed method and system may help to build trust and strengthen customer-agent relationships.
[0090] In light of the above-mentioned advantages and the technical advancements provided by the disclosed method and system, the claimed steps as discussed above are not routine, conventional, or well understood in the art, as the claimed steps provide solutions to the existing problems in conventional technologies. Further, the claimed steps clearly bring an improvement in the functioning of the device itself, as the claimed steps provide a technical solution to a technical problem.
[0091] The specification has described a method and system for determining real time customer sentiments and corresponding actionable interventions. The illustrated steps are set out to explain the exemplary embodiments shown, and it should be appreciated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0092] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0093] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.
CLAIMS
I/We Claim:
1. A method (300) for determining real time customer sentiments and corresponding actionable interventions, the method (300) comprising:
receiving (302) in real time, by a detecting device (102) via a user interface (110), multimodal data of a video feed (216) corresponding to a customer-agent interaction, wherein the multimodal data comprises text data, audio data, and video data corresponding to each of a customer (404) and an agent (406);
determining (314), by the detecting device (102), a set of sentiments corresponding to the customer-agent interaction from the text data, the audio data, and the video data using a set of Artificial Intelligence (AI) models, wherein the set of sentiments comprises a text sentiment, an audio sentiment, and a facial sentiment for each of the customer (404) and the agent (406);
generating (332), by the detecting device (102), a final sentiment (218) corresponding to the customer-agent interaction using a configurable matrix to merge the text sentiment, the audio sentiment, and the facial sentiment;
when the final sentiment (218) of the customer-agent interaction is below a predefined threshold sentiment,
identifying (334), by the detecting device (102), one or more causal factors (220) for the final sentiment (218) based on the set of sentiments using a topic modelling algorithm;
generating (336) in the real time, by the detecting device (102), one or more remedies (222) corresponding to the one or more causal factors (220) using a Generative Artificial Intelligence (GenAI) model; and
predicting (338), by the detecting device (102), a customer churn (224) based on the multimodal data corresponding to the customer (404) using a churn prediction model, wherein the churn prediction model is a classifier model.
2. The method (300) as claimed in claim 1, comprising:
extracting (304) audio data from the video feed (216) using an audio extraction technique;
processing (306) in real-time, a set of data points from the audio data based on predefined time intervals, wherein the processing comprises:
calculating (308) a mean of the set of data points using an audio processing technique; and
detecting (310) a speaker shift based on a comparison of the mean with a predefined threshold value to obtain one or more speaker shifts in the customer-agent interaction, wherein the speaker shift is indicative of a change in an active speaker during the customer-agent interaction, wherein the active speaker is one of the customer (404) or the agent (406); and
segregating (312) the audio data of the customer-agent interaction into customer audio data and agent audio data based on the one or more speaker shifts, wherein the customer audio data comprises one or more parts of the audio data where the active speaker corresponds to the customer (404), and wherein the agent audio data comprises one or more parts of the audio data where the active speaker corresponds to the agent (406).
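The speaker-shift and segregation steps of claim 2 can be pictured with the short Python sketch below. It assumes the extracted audio is available as a mono amplitude array (e.g., decoded with a tool such as librosa or ffmpeg), that a shift is flagged when the mean level of a fixed window crosses a configurable threshold, and that segments alternate between customer and agent starting with the customer; all function names and parameter values are assumptions for illustration only.

```python
# Illustrative speaker-shift detection: compare the mean of windowed data
# points against a predefined threshold, then split the audio at each shift.
import numpy as np


def detect_speaker_shifts(samples: np.ndarray, sr: int,
                          window_s: float = 1.0,
                          threshold: float = 0.05) -> list[int]:
    """Return window indices where the active speaker appears to change."""
    window = int(sr * window_s)
    shifts, prev_above = [], None
    for i in range(0, len(samples) - window, window):
        mean_level = float(np.abs(samples[i:i + window]).mean())
        above = mean_level > threshold
        if prev_above is not None and above != prev_above:
            shifts.append(i // window)  # speaker shift detected at this window
        prev_above = above
    return shifts


def segregate(samples: np.ndarray, sr: int, shifts: list[int],
              window_s: float = 1.0) -> tuple[list[np.ndarray], list[np.ndarray]]:
    """Split audio at the shifts, alternating customer/agent (customer first, assumed)."""
    window = int(sr * window_s)
    bounds = [0] + [s * window for s in shifts] + [len(samples)]
    customer, agent = [], []
    for k in range(len(bounds) - 1):
        segment = samples[bounds[k]:bounds[k + 1]]
        (customer if k % 2 == 0 else agent).append(segment)
    return customer, agent
```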
3. The method (300) as claimed in claim 2, wherein determining the set of sentiments corresponding to the customer-agent interaction comprises:
transforming (316) the customer audio data and the agent audio data into customer text data and agent text data, respectively, using a speech-to-text algorithm, wherein customer text data comprises one or more parts of the text data where the active speaker corresponds to the customer (404), and wherein the agent text data comprises one or more parts of the text data where the active speaker corresponds to the agent (406); and
determining (318), via a Natural Language Processing (NLP) model, a customer text sentiment and an agent text sentiment based on the customer text data and the agent text data, respectively, wherein the NLP model is one of the set of AI models.
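For the speech-to-text and text-sentiment steps of claim 3, one possible realization uses off-the-shelf components from the Hugging Face transformers library: an automatic speech recognition pipeline for transcription and a sentiment-analysis pipeline as the NLP model. The specific model checkpoints and file names are assumptions, not requirements of the disclosure.

```python
# Sketch: transcribe a speaker's segregated audio, then score the transcript.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
text_sentiment = pipeline("sentiment-analysis")  # default sentiment checkpoint


def text_sentiment_for(audio_path: str) -> dict:
    """Transcribe one speaker's audio and return its text sentiment."""
    transcript = asr(audio_path)["text"]
    return text_sentiment(transcript)[0]  # e.g. {"label": "NEGATIVE", "score": 0.97}


# Hypothetical usage on the segregated audio files:
# customer_text_sentiment = text_sentiment_for("customer_audio.wav")
# agent_text_sentiment = text_sentiment_for("agent_audio.wav")
```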
4. The method (300) as claimed in claim 3, comprising:
extracting (320) a set of customer audio features and a set of agent audio features from the customer audio data and the agent audio data, respectively, using Mel Frequency Cepstral Coefficients (MFCC); and
determining (322), via a supervised Machine Learning (ML) model, a customer audio sentiment and an agent audio sentiment based on the customer audio features and the agent audio features, respectively, wherein the supervised ML model is one of the set of AI models.
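The MFCC-based audio-sentiment step of claim 4 can be sketched as follows, using librosa for feature extraction and a generic supervised classifier; the choice of RandomForestClassifier, the label set, and the training data are assumptions standing in for whatever supervised ML model is actually used.

```python
# Sketch: MFCC features per audio segment, classified by a supervised model.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def mfcc_features(audio_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load audio and return a fixed-length MFCC feature vector (mean over time)."""
    y, sr = librosa.load(audio_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)


# Supervised ML model trained offline on labelled utterances (data assumed).
clf = RandomForestClassifier(n_estimators=100)
# clf.fit(train_features, train_labels)  # labels e.g. "positive"/"neutral"/"negative"
# customer_audio_sentiment = clf.predict([mfcc_features("customer_audio.wav")])[0]
```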
5. The method (300) as claimed in claim 3, comprising:
from the multimodal data, extracting (324) customer video data and agent video data based on the customer audio data and the agent audio data, respectively, wherein customer video data comprises one or more parts of the video data where the active speaker corresponds to the customer (404), and wherein the agent video data comprises one or more parts of the video data where the active speaker corresponds to the agent (406);
for each frame in each of the customer video data and the agent video data,
detecting (326), via a Computer Vision (CV) model, a face of the active speaker in the frame, wherein the CV model is one of the set of AI models; and
determining (328), via an image classification model, a facial expression of the detected face using Reading Mind in the Eyes Test (RMET) analysis, wherein the image classification model is one of the set of AI models; and
determining (330), via the image classification model, a customer video sentiment and an agent video sentiment based on the facial expression determined in the customer video data and the agent video data, respectively.
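A per-frame facial-sentiment loop in the spirit of claim 5 is sketched below, with OpenCV's Haar cascade as the computer-vision face detector and a placeholder classify_expression() standing in for the image classification model (including any RMET-style analysis), which is not detailed here; the placeholder and file names are assumptions.

```python
# Sketch: detect the active speaker's face in each video frame and classify it.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def classify_expression(face_img) -> str:
    """Placeholder for the image classification model (assumed)."""
    return "neutral"


def facial_sentiments(video_path: str) -> list[str]:
    """Return a per-face expression label for every frame of the given video."""
    sentiments = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in faces:
            sentiments.append(classify_expression(frame[y:y + h, x:x + w]))
    cap.release()
    return sentiments
```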
6. The method (300) as claimed in claim 1, wherein predicting the customer churn (224) comprises training (340) the churn prediction model using multimodal training data corresponding to historical customer-agent interactions, wherein the churn prediction model comprises an ensemble of an NLP model, a deep learning model, and an encoder-based model.
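The ensemble churn classifier of claim 6 could be approximated, for illustration, by a soft-voting ensemble over models trained on features derived from historical customer-agent interactions; the member estimators below are stand-ins for the NLP, deep learning, and encoder-based components, and the feature layout and training data are assumed.

```python
# Sketch: a soft-voting ensemble as a stand-in for the churn prediction model.
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

churn_model = VotingClassifier(
    estimators=[
        ("text_based", LogisticRegression(max_iter=1000)),      # stand-in: NLP model
        ("deep", MLPClassifier(hidden_layer_sizes=(64, 32))),   # stand-in: deep model
        ("boosted", GradientBoostingClassifier()),              # stand-in: encoder-based model
    ],
    voting="soft",
)

# Hypothetical training and inference on multimodal features (data assumed):
# churn_model.fit(historical_interaction_features, churn_labels)
# will_churn = churn_model.predict([current_interaction_features])[0]
```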
7. A system (100) for determining real time customer sentiments and corresponding actionable interventions, the system (100) comprising:
a processor (104); and
a memory (106) communicatively coupled to the processor (104), wherein the memory (106) stores processor executable instructions, which, on execution, causes the processor (104) to:
receive (302) in real time, via a user interface (110), multimodal data of a video feed (216) corresponding to a customer-agent interaction, wherein the multimodal data comprises text data, audio data, and video data corresponding to each of a customer (404) and an agent (406);
determine (314) a set of sentiments corresponding to the customer-agent interaction from the text data, the audio data, and the video data using a set of Artificial Intelligence (AI) models, wherein the set of sentiments comprises a text sentiment, an audio sentiment, and a facial sentiment for each of the customer (404) and the agent (406);
generate (332) a final sentiment (218) corresponding to the customer-agent interaction using a configurable matrix to merge the text sentiment, the audio sentiment and the facial sentiment;
when the final sentiment (218) of the customer-agent interaction is below a predefined threshold sentiment,
identify (334) one or more causal factors (220) for the final sentiment (218) based on the set of sentiments using a topic modelling algorithm;
generate (336) in the real time one or more remedies (222) corresponding to the one or more causal factors (220) using a Generative Artificial Intelligence (GenAI) model; and
predict (338) a customer churn (224) based on the multimodal data corresponding to the customer (404) using a churn prediction model, wherein the churn prediction model is a classifier model.
8. The system (100) as claimed in claim 7, wherein the processor executable instructions cause the processor (104) to:
extract (304) audio data from the video feed (216) using an audio extraction technique;
process (306) in real-time, a set of data points from the audio data based on predefined time intervals, wherein the processing comprises:
calculate (308) a mean of the set of data points using an audio processing technique; and
detect (310) a speaker shift based on a comparison of the mean with a predefined threshold value to obtain one or more speaker shifts in the customer-agent interaction, wherein the speaker shift is indicative of a change in an active speaker during the customer-agent interaction, wherein the active speaker is one of the customer (404) or the agent (406); and
segregate (312) the audio data of the customer-agent interaction into customer audio data and agent audio data based on the one or more speaker shifts, wherein the customer audio data comprises one or more parts of the audio data where the active speaker corresponds to the customer (404), and wherein the agent audio data comprises one or more parts of the audio data where the active speaker corresponds to the agent (406).
9. The system (100) as claimed in claim 8, wherein to determine the set of sentiments corresponding to the customer-agent interaction, the processor executable instructions cause the processor (104) to:
transform (316) the customer audio data and the agent audio data into customer text data and agent text data, respectively, using a speech-to-text algorithm, wherein customer text data comprises one or more parts of the text data where the active speaker corresponds to the customer (404), and wherein the agent text data comprises one or more parts of the text data where the active speaker corresponds to the agent (406); and
determine (318), via a Natural Language Processing (NLP) model, a customer text sentiment and an agent text sentiment based on the customer text data and the agent text data, respectively, wherein the NLP model is one of the set of AI models.
10. The system (100) as claimed in claim 9, wherein the processor executable instructions cause the processor (104) to:
extract (320) a set of customer audio features and a set of agent audio features from the customer audio data and the agent audio data, respectively, using Mel Frequency Cepstral Coefficients (MFCC); and
determine (322), via a supervised Machine Learning (ML) model, a customer audio sentiment and an agent audio sentiment based on the customer audio features and the agent audio features, respectively, wherein the supervised ML model is one of the set of AI models.
11. The system (100) as claimed in claim 9, wherein the processor executable instructions cause the processor (104) to:
from the multimodal data, extract (324) customer video data and agent video data based on the customer audio data and the agent audio data, respectively, wherein customer video data comprises one or more parts of the video data where the active speaker corresponds to the customer (404), and wherein the agent video data comprises one or more parts of the video data where the active speaker corresponds to the agent (406);
for each frame in each of the customer video data and the agent video data,
detect (326), via a Computer Vision (CV) model, a face of the active speaker in the frame, wherein the CV model is one of the set of AI models; and
determine (328), via an image classification model, a facial expression of the detected face using Reading Mind in the Eyes Test (RMET) analysis, wherein the image classification model is one of the set of AI models; and
determine (330), via the image classification model, a customer video sentiment and an agent video sentiment based on the facial expression determined in the customer video data and the agent video data, respectively.
12. The system (100) as claimed in claim 7, wherein to predict the customer churn (224), the processor executable instructions cause the processor (104) to train (340) the churn prediction model using multimodal training data corresponding to historical customer-agent interactions, wherein the churn prediction model comprises an ensemble of an NLP model, a deep learning model, and an encoder-based model.