
System And Method For Detecting Active Participants In A Teleconference

Abstract: The present disclosure relates to a system and method for providing a robust and effective solution to an entity or an organization by enabling the entity to implement a system that enhances the experience of conference users by detecting active participants in a teleconference. Thus, the system and method of the present disclosure may be beneficial for both entities and users.


Patent Information

Application #:
Filing Date: 01 September 2021
Publication Number: 09/2023
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: jioipr@zmail.ril.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2024-08-02
Renewal Date:

Applicants

JIO PLATFORMS LIMITED
Office-101, Saffron, Nr. Centre Point, Panchwati 5 Rasta, Ambawadi, Ahmedabad - 380006, Gujarat, India.

Inventors

1. DUGGAL, Gaurav
Flat 305, Block 18, Rain Tree Park, Kukatpally, Hyderabad - 500072, Telangana, India.
2. KUMAR, Tarun
908, Dasta Concerto, SyNo 55/2B, Sarjpaura Main Road, Anekal Taluk, Yamare Village, Bengaluru - 562125, Karnataka, India.
3. MULPURI, Apparao
Hno:14-24/8/45, Plot no:56/1, Jayalakshmi Nagar Phase 3, Beeramguda, Ameenpur Mandal, Ramachandrapuram - 502032, Telangana, India.
4. KUMARASWAMY, Nuthalapati
16-211/A, Endugumpalem, Nadendla Mandal, Guntur - 522549, Andhra Pradesh, India.
5. GARG, Manoj Kumar
Flat no. 301, Royal Residency, Gwalior Road, Morena - 476001, Madhya Pradesh, India.

Specification

FIELD OF INVENTION
[0001] The embodiments of the present disclosure generally relate to the field of telecommunication. More particularly, the present disclosure relates to a system and method for detecting the top active participants during a teleconference.

BACKGROUND OF THE INVENTION
[0002] The following description of related art is intended to provide background information pertaining to the field of the disclosure. This section may include certain aspects of the art that may be related to various features of the present disclosure. However, it should be appreciated that this section is to be used only to enhance the reader's understanding of the present disclosure, and not as an admission of prior art.
[0003] In today's world, conference calls are becoming an increasingly common communications medium. For example, a large corporation can have offices located throughout the world, and the corporation's employees at different locations are often required to consult with each other by conference call in order to develop conclusions and solutions for pressing problems. It is also crucial to accurately determine the top n participants, since in a video conference it is not always possible to show every participant's video.
[0004] With the advent of large virtual conference rooms, it becomes difficult to track the activity of a large number of participants. Most often, many active participants are missed because tracking measures for determining active participants among such a large number are unavailable. There are other problems associated with conference calls: it is often difficult to recognize who is speaking by voice alone. If several participants in the conference call have similar regional accents or similar-sounding voices, the conference call can become chaotic. Furthermore, two or more participants can speak at the same time, which degrades the conversation. Existing systems also have limited space to show remote participants' video on the client side, as well as limited CPU, memory, and bandwidth. It is therefore difficult to find the top active participants so that only those who are active during the call are shown. Currently there is only one observable provided by the SDK, which identifies who is speaking and who is not; this information alone cannot accurately determine active participants. Moreover, people often forget to unmute themselves before starting to speak, and so fail to present their views and be seen as active participants.
[0005] There is therefore a need in the art to provide a system and a method that can facilitate mitigating the problems associated with the prior art.

OBJECTS OF THE PRESENT DISCLOSURE
[0006] Some of the objects of the present disclosure, which at least one embodiment herein satisfies, are listed herein below.
[0007] It is an object of the present disclosure to enhance identification of active participants in a video conference.
[0008] It is an object of the present disclosure to increase the accuracy of the process of determining participant activity.
[0009] It is an object of the present disclosure to use lip movements and audio pitch level to determine activeness.
[0010] It is an object of the present disclosure to use a time parameter to increase the accuracy.

SUMMARY
[0011] This section is provided to introduce certain objects and aspects of the present disclosure in a simplified form that are further described below in the detailed description. This summary is not intended to identify the key features or the scope of the claimed subject matter.
[0012] In an aspect, the present disclosure provides for a client-side system enabling communication with a server-side system in a teleconference. The client-side system may include one or more processors operatively coupled with a memory, the memory storing instructions which when executed by the one or more processors causes the system to receive a plurality of signals, from one or more sensors associated with the client-side system, extract audio meta data from the plurality of signals received, extract video meta data from the plurality of signals received, generate a user meta data from the extracted audio and video meta data; and, transmit the user meta data to the server-side system. In an embodiment, the server-side system may generate a conference meta data on receipt of the user meta data from a plurality of client-side systems.
[0013] In an embodiment, the one or more processors in the client-side system may be associated with an event detector module configured to detect a first event whenever the one or more sensors detects a user speaking.
[0014] In an embodiment, the event detector module may be further configured to detect a second event whenever the one or more sensors detect changes in facial expressions of the user.
[0015] In an embodiment, the user meta data may include the audio meta data, the video meta data, the first event and the second event.
[0016] In an embodiment, the one or more processors in the client-side system may be further configured to monitor a plurality of features such as Video frames per second (FPS), Video Bitrate, Video Resolution, Video mute status, Audio bitrate, and Audio mute status.
[0017] In an aspect, the present disclosure provides for a server-side system in communication with a plurality of client-side systems in a teleconference. The system may include one or more processors coupled with a memory, the memory storing instructions which when executed by the one or more processors causes the system to receive a plurality of user meta data from the plurality of client-side systems. The system may further extract a plurality of first set of attributes from the plurality of user meta data, the plurality of first set of attributes pertaining to pitch values of each user for a first predetermined period of time, and also extract a plurality of second set of attributes from the plurality of user meta data, the plurality of second set of attributes pertaining to lip movement of each said user for a second predetermined time. The system may then generate a conference meta data based on the extracted plurality of first and second set of attributes; and transmit the conference meta data to the plurality of client-side systems.
[0018] In an embodiment, upon receipt of the conference meta data, the one or more processors in the client-side system may be configured to compare the transmitted conference meta data with a previous conference meta data stored in a database associated with each user computing device and capture one or more changes based on the comparison between the transmitted conference meta data and the previous conference meta data. Further, the system may update a user display interface and manage one or more elements associated with each client-side system based on the captured one or more changes.
[0019] In an embodiment, the conference meta data may be generated in real time.
[0020] In an embodiment, to obtain the conference meta data, the one or more processors operatively coupled to the server side may be configured to verify the audio meta data of each user with the video meta data of said user and establish an active period of each user, and thereby generate a score for each said user based on the verification of the audio meta data of each said user with the video meta data of the user. The system may further compare the score of each user with the respective scores of the plurality of users and, then determine a set of active users in descending order of the score based on the comparison made.
[0021] In an embodiment, the one or more processors operatively coupled to the server-side system may be further configured to detect an audio pitch value of the user from the user meta data whenever the audio pitch value goes above a predefined threshold and thereby determine a background noise based on the detected audio pitch value.
[0022] In an embodiment, the one or more processors in the server-side system may be further configured to determine if the user is actually speaking in the teleconference based on the detected audio pitch and the changes in the facial expressions of the user.
[0023] In an embodiment, the one or more processors in the server-side system may be further configured to update the score of the plurality of users in real time and sort the updated scores in a decreasing order of score.
[0024] In an embodiment, the one or more processors in the server-side system may be further configured to purge the updated score periodically after a predetermined period of time and after a threshold value of the score is reached; and automatically increase the score of the user to an upper limit score if a user reaches the threshold score or speaks for the predetermined period of time.
[0025] In an embodiment, the one or more processors in the server-side system may be further configured to sort the set of active users in a decreasing order whenever a score of a user changes.
[0026] In an aspect, the present disclosure provides for a method for enabling communication of a client-side system with a server-side system in a teleconference. The method may include the steps of receiving, by one or more processors, a plurality of signals from one or more sensors associated with a client-side system. In an embodiment, the one or more processors may be operatively coupled to the client-side system, the one or more processors coupled with a memory storing instructions executed by the one or more processors. The method may further include the step of extracting, by the one or more processors, audio and video meta data from the plurality of signals received and thereby generating, by the one or more processors, a user meta data of the user computing device from the extracted audio and video meta data. The method may then include the step of transmitting, by the one or more processors, the user meta data to the server-side system. In an embodiment, the server-side system may generate a conference meta data on receipt of the user meta data from a plurality of client-side systems.
[0027] In an aspect, the present disclosure provides for a method for enabling communication of a server-side system with a plurality of client-side systems in a teleconference. The method may include the step of receiving, by one or more processors, a plurality of user meta data from the client-side systems. In an embodiment, the one or more processors may be operatively coupled to the server-side system, the one or more processors coupled with a memory storing instructions executed by the one or more processors. The method may further include the step of extracting, by the one or more processors, a plurality of first set of attributes from the plurality of user meta data, the plurality of first set of attributes pertaining to pitch values of each user for a first predetermined period of time. The method may also include the step of extracting, by the one or more processors, a plurality of second set of attributes from the plurality of user meta data, the plurality of second set of attributes pertaining to lip movement of each user for a second predetermined time. The method may then include the step of generating, by the one or more processors, a conference meta data based on the extracted plurality of first and second set of attributes. Furthermore, the method may include the step of transmitting, by the one or more processors, the conference meta data to the plurality of client-side systems.

BRIEF DESCRIPTION OF DRAWINGS
[0028] The accompanying drawings, which are incorporated herein and constitute a part of this invention, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present invention. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that such drawings include the electrical components, electronic components, or circuitry commonly used to implement such components.
[0029] FIG. 1A illustrates an exemplary network architecture in which or with which the client-side system of the present disclosure can be implemented for facilitating enhanced conference call, in accordance with an embodiment of the present disclosure.
[0030] FIG. 1B illustrates an exemplary network architecture in which or with which the server-side system of the present disclosure can be implemented for facilitating enhanced conference call, in accordance with an embodiment of the present disclosure.
[0031] FIG. 2A illustrates an exemplary representation of the client-side system, in accordance with an embodiment of the present disclosure.
[0032] FIG. 2B illustrates an exemplary representation of the server-side system based on an artificial intelligence (AI) based architecture, in accordance with an embodiment of the present disclosure.
[0033] FIG. 3A illustrates an exemplary client-side method flow diagram for detecting active participants in a teleconference, in accordance with an embodiment of the present disclosure.
[0034] FIG. 3B illustrates an exemplary server-side method flow diagram for detecting active participants in a teleconference, in accordance with an embodiment of the present disclosure.
[0035] FIG. 4 illustrates an exemplary block flow diagram depicting components of the system involved in the detection of active participants, in accordance with an embodiment of the present disclosure.
[0036] FIG. 5 illustrates an exemplary block diagram depicting exemplary functional components of the system involved in the detection of active participants, in accordance with an embodiment of the present disclosure.
[0037] FIGs. 6A-6B illustrate generic flow diagrams of implementations of an exemplary conference workflow, in accordance with an embodiment of the present disclosure.
[0038] FIG. 7 illustrates an exemplary computer system in which or with which embodiments of the present invention can be utilized, in accordance with embodiments of the present disclosure.
[0039] The foregoing shall be more apparent from the following more detailed description of the invention.

DETAILED DESCRIPTION OF INVENTION
[0040] In the following description, for the purposes of explanation, various specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, that embodiments of the present disclosure may be practiced without these specific details. Several features described hereafter can each be used independently of one another or with any combination of other features. An individual feature may not address all of the problems discussed above or might address only some of the problems discussed above. Some of the problems discussed above might not be fully addressed by any of the features described herein.
[0041] The ensuing description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth.
[0042] Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.
[0043] The present invention provides a robust and effective solution to an entity or an organization by enabling the entity to implement a system for detection of active participants in a teleconference. Thus, the system and method of the present disclosure may be beneficial for both entities and users.
[0044] Referring to FIG. 1A that illustrates an exemplary network architecture (100) in which or with which a client-side system (110) of the present disclosure can be implemented, in accordance with an embodiment of the present disclosure. As illustrated in FIG. 1A, by way of example and not by way of limitation, the exemplary architecture (100) may include a plurality of conference participants (102-1, 102-2…102-N) (collectively referred to as conference participants (102) or users (102) or participants (102), and individually as conference participant (102) or user (102) or participant (102)) associated with a plurality of client-side systems (110-1, 110-2,…110-N) (individually referred to as client-side system (110) and collectively referred to as client-side systems (110) hereinafter). The client-side systems (110) may be associated with a plurality of user computing devices (104-1, 104-2,…104-N) (also referred to as user devices (104) or computing devices (104) or mobile computing devices (104) collectively, and user device (104) individually), at least a network (106), and at least a centralized server (112). The user computing devices (104) may be mobile computing devices, laptops, palmtops, personal computers, handheld devices and the like.
[0045] Referring to FIG. 1B that illustrates an exemplary network architecture (150) in which or with which a server-side system (114) of the present disclosure can be implemented, in accordance with an embodiment of the present disclosure. As illustrated in FIG. 1B, by way of example and not by way of limitation, the exemplary architecture (150) may include the plurality of client-side systems (110-1, 110-2,…110-N) (individually referred to as client-side system (110) and collectively referred to as client-side systems (110) hereinafter). The client-side systems (110) may be associated with at least a network (106), at least a centralized server (112), and the server-side system (114) associated with an entity (108). More specifically, the exemplary architecture (150) includes the server-side system (114) equipped with an artificial intelligence (AI) engine (234) (illustrated in FIG. 2B) for enhancing the features required for detecting the top active participants (102).
[0046] The user device (104) may be communicably coupled to the centralized server (112) through the network (106) to facilitate communication therewith. As an example, and not by way of limitation, the user computing device (104) may be operatively coupled to the centralized server (112) through the network (106) and may be associated with the entity (108).
[0047] The client-side system (110) may receive a plurality of signals from one or more sensors associated with the client-side system (110). The one or more sensors may include an audio capturer such as a microphone, a camera module such as a camera, and the like. The microphone may be configured to capture audio meta data of the user (102) and the camera may be configured to capture video meta data of the user (102).
[0048] The client-side system (110) may thus be configured to extract audio and video meta data from the plurality of signals received and then generate a user meta data from the extracted audio and video meta data. The client-side system (110) may then transmit the user meta data to the server-side system (114). In an embodiment, the server-side system (114) may generate a conference meta data on receipt of the user meta data from a plurality of client-side systems (110).
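By way of a non-limiting illustration, a minimal sketch of how a client-side system might assemble and transmit such a user meta data record is shown below; all field names and the endpoint path are assumptions, as the disclosure does not prescribe a wire format.

```ts
// Sketch of a client-side "user meta data" record and its transmission to the
// server-side system. Field names and the endpoint are hypothetical.
interface UserMetaData {
  participantId: string;
  timestampMs: number;
  audio: { pitchDb: number; bitrateBps: number; muted: boolean };
  video: { fps: number; bitrateKbps: number; width: number; height: number; muted: boolean };
  speakingEvent: boolean;     // first event: sensors detected the user speaking
  lipMovementEvent: boolean;  // second event: changes in facial expressions
}

async function transmitUserMetaData(meta: UserMetaData): Promise<void> {
  // The disclosure only states that user meta data is transmitted to the
  // server-side system; an HTTP POST is one plausible transport.
  await fetch("/api/conference/user-meta", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(meta),
  });
}
```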
[0049] In an embodiment, the client-side system (110) may be associated with an event detector module (214) configured to detect a first event whenever the one or more sensors detect a user speaking. The event detector module (214) may be further configured to detect a second event whenever the one or more sensors detect changes in facial expressions of the user. In a way of example and not as a limitation, the event detector may be, but is not limited to, an Azure Communication Services (ACS) Calling software development kit (SDK) having a set of instructions such as, but not limited to, an isSpeaking module. In a way of example and not as a limitation, a lip detector raises an event when lips are seen moving by a computer vision module.
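A minimal sketch of such an event detector follows, assuming an isSpeaking-style callback from the calling SDK and a lip-movement callback from the computer vision module; the type and method names are illustrative.

```ts
type DetectedEvent = {
  kind: "FIRST_SPEAKING" | "SECOND_FACIAL";
  participantId: string;
  atMs: number;
};

class EventDetector {
  constructor(private readonly emit: (e: DetectedEvent) => void) {}

  // First event: the audio path (e.g. an isSpeaking-style observable from the
  // calling SDK) reports that the user is speaking.
  handleIsSpeakingChanged(participantId: string, isSpeaking: boolean): void {
    if (isSpeaking) {
      this.emit({ kind: "FIRST_SPEAKING", participantId, atMs: Date.now() });
    }
  }

  // Second event: the computer-vision path reports lip movement, i.e. changes
  // in the user's facial expressions.
  handleLipMovement(participantId: string): void {
    this.emit({ kind: "SECOND_FACIAL", participantId, atMs: Date.now() });
  }
}
```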
[0050] In an embodiment, the user meta data may thus include the audio meta data, the video meta data, the first event and the second event.
[0051] In an embodiment, the client-side system (110) may be further configured to monitor a plurality of features such as Video frames per second (FPS), Video Bitrate, Video Resolution, Video mute status, Audio bitrate, Audio mute status, and the like.
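One plausible realization of this monitoring is a simple once-per-second polling loop; the getStats callback below is a hypothetical stand-in for whatever statistics interface the media stack exposes.

```ts
interface MediaFeatureStats {
  videoFps: number;
  videoBitrateKbps: number;
  videoWidth: number;
  videoHeight: number;
  videoMuted: boolean;
  audioBitrateBps: number;
  audioMuted: boolean;
}

// Polls the media stack once per interval and reports each snapshot.
// Returns a function that stops the monitor.
function startFeatureMonitor(
  getStats: () => MediaFeatureStats,
  report: (stats: MediaFeatureStats) => void,
  intervalMs = 1000, // "every second", per the disclosure
): () => void {
  const timer = setInterval(() => report(getStats()), intervalMs);
  return () => clearInterval(timer);
}
```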
[0052] In an aspect, the server-side system (114) may be configured to receive a plurality of user meta data from the plurality of client-side systems (110). The server-side system (114) may then extract a plurality of first set of attributes from the plurality of user meta data, the plurality of first set of attributes pertaining to pitch values of each user for a first predetermined period of time. For example, a pitch analyzer module operatively coupled to the server-side system (114) may analyze the pitch values. The server-side system (114) may further extract a plurality of second set of attributes from the plurality of user meta data, the plurality of second set of attributes pertaining to lip movement of each user for a second predetermined time. The server-side system (114) may then generate a conference meta data based on the extracted plurality of first and second set of attributes and transmit the conference meta data to the plurality of client-side systems (110).
[0053] In an embodiment, upon receipt of the conference meta data, the client-side system (110) may be configured to compare the transmitted conference meta data with a previous conference meta data stored in a database associated with each user computing device and then capture one or more changes in the transmitted conference meta data based on the comparison between the transmitted conference meta data and the previous conference meta data. The client-side system (110) may further update a user display interface and manage one or more elements associated with each client-side system (110) based on the captured one or more changes.
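As a sketch, the compare-and-capture step can be a shallow per-participant diff between the newly received conference meta data and the cached copy; the data shapes below are assumptions.

```ts
interface ParticipantEntry { score: number; rank: number }
type ConferenceMetaData = Record<string, ParticipantEntry>; // keyed by participantId

// Returns the participant ids whose entries changed between the cached
// (previous) and the freshly transmitted conference meta data.
function capturedChanges(previous: ConferenceMetaData, current: ConferenceMetaData): string[] {
  const ids = new Set([...Object.keys(previous), ...Object.keys(current)]);
  return [...ids].filter((id) => {
    const a = previous[id];
    const b = current[id];
    return !a || !b || a.score !== b.score || a.rank !== b.rank;
  });
}

// The client would then re-render only the tiles affected by the changes.
function updateDisplay(changedIds: string[], render: (id: string) => void): void {
  changedIds.forEach(render);
}
```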
[0054] In an embodiment, the conference meta data may be generated in real time.
[0055] In another embodiment, to obtain the conference meta data, the server-side system (114) may be configured to verify the audio meta data of each user with the video meta data of said user, establish an active period of each user, and then generate a score for each user based on the verification made. The server-side system (114) may then compare the score of each user with the respective scores of the plurality of users and then determine a set of active users in descending order of the score based on the comparison made.
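A minimal sketch of this verify-score-rank pipeline is given below, under the assumption that a user's active period counts toward the score only where the audio meta data (speaking) and video meta data (lip movement) corroborate each other; the scoring rate is illustrative.

```ts
interface UserActivity {
  participantId: string;
  speakingMs: number;      // active period derived from the audio meta data
  lipMovementMs: number;   // active period derived from the video meta data
}

// Score only the portion of the active period corroborated by both the audio
// and the video meta data; 1 point per corroborated second is an assumption.
function scoreUser(a: UserActivity): number {
  const corroboratedMs = Math.min(a.speakingMs, a.lipMovementMs);
  return Math.floor(corroboratedMs / 1000);
}

// Determine the set of active users in descending order of score.
function rankActiveUsers(users: UserActivity[]): { participantId: string; score: number }[] {
  return users
    .map((u) => ({ participantId: u.participantId, score: scoreUser(u) }))
    .sort((x, y) => y.score - x.score);
}
```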
[0056] In another embodiment, the server-side system (114) may be further configured to detect an audio pitch value of the user from the user meta data whenever the audio pitch value goes above a predefined threshold and determine a background noise based on the detected audio pitch value.
[0057] In another embodiment, the server-side system (114) may be further configured to determine if the user is actually speaking in the teleconference based on the detected audio pitch and the changes in the facial expressions of the user.
[0058] In yet another embodiment, the server-side system (114) may be further configured to update the score of the plurality of users in real time and sort the updated scores in a decreasing order of score. The server-side system (114) may further purge the updated scores periodically, after a predetermined period of time and after a threshold value of the score is reached, so that the participant list is not skewed by a person who would otherwise always be shown as an active participant. For example, if a participant spoke for one straight hour, it would be extremely hard for another participant who talks only briefly to beat that score. The server-side system (114) may further automatically increase the score of a user to an upper limit score if the user reaches the threshold score or speaks for the predetermined period of time. Thus, the server-side system (114) may be further configured to sort the set of active users in a decreasing order whenever a score of a user changes. In an exemplary embodiment, if a participant (102) reaches the threshold score or speaks for the predetermined period of time, then the score of the participant may be automatically increased to an upper limit score. For example, the upper limit score can be 5350 (but is not limited to this value) and the participant may be referred to as a wildcard entry.
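A sketch of the purge and wildcard logic follows; the 5350 upper limit comes from the example above, while the purge decay factor and the threshold value are assumptions.

```ts
const UPPER_LIMIT_SCORE = 5350;  // example wildcard ceiling from the disclosure
const WILDCARD_THRESHOLD = 5000; // assumed threshold triggering a wildcard entry

interface ScoredParticipant { participantId: string; score: number; wildcard: boolean }

// Periodic purge: decay every non-wildcard score so a long monologue cannot
// permanently dominate the active-participant list. The decay strategy and
// factor are assumptions; the disclosure only requires a periodic purge.
function purgeScores(list: ScoredParticipant[], decayFactor = 0.5): void {
  for (const p of list) {
    if (!p.wildcard) p.score = Math.floor(p.score * decayFactor);
  }
}

// Wildcard entry: once a participant crosses the threshold (or speaks for the
// predetermined period), pin their score at the upper limit.
function applyWildcard(p: ScoredParticipant): void {
  if (p.score >= WILDCARD_THRESHOLD) {
    p.score = UPPER_LIMIT_SCORE;
    p.wildcard = true;
  }
}
```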
[0059] In an exemplary embodiment, the score list may be a mutable list and may get sorted any time the score of any participant changes.
[0060] In an embodiment, the user computing device (104) may communicate with the client-side system (110) and the server-side system (114) via set of executable instructions residing on any operating system, including but not limited to, Android TM, iOS TM, Kai OS TM and the like. In an embodiment, user computing device (104) may include, but not limited to, any electrical, electronic, electro-mechanical or an equipment or a combination of one or more of the above devices such as mobile phone, smartphone, virtual reality (VR) devices, augmented reality (AR) devices, laptop, a general-purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device, wherein the computing device may include one or more in-built or externally coupled accessories including, but not limited to, a visual aid device such as camera, audio aid, a microphone, a keyboard, input devices for receiving input from a user such as touch pad, touch enabled screen, electronic pen and the like. It may be appreciated that the user computing device (104) may not be restricted to the mentioned devices and various other devices may be used. A smart computing device may be one of the appropriate systems for storing data and other private/sensitive information.
[0061] In an exemplary embodiment, a network (106) may include, by way of example but not limitation, at least a portion of one or more networks having one or more nodes that transmit, receive, forward, generate, buffer, store, route, switch, process, or a combination thereof, etc. one or more messages, packets, signals, waves, voltage or current levels, some combination thereof, or so forth. A network may include, by way of example but not limitation, one or more of: a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a public-switched telephone network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, some combination thereof.
[0062] In another exemplary embodiment, the centralized server (112) may include or comprise, by way of example but not limitation, one or more of: a stand-alone server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, one or more processors executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, some combination thereof.
[0063] In an embodiment, the client-side system (110) may include one or more processors coupled with a memory, wherein the memory may store instructions which when executed by the one or more processors may cause the system to facilitate detection of active participants. FIG. 2A with reference to FIG. 1A, illustrates an exemplary representation of the client-side system (110) for facilitating communication with a server-side system (114), in accordance with an embodiment of the present disclosure. In an aspect, the client-side system (110) may comprise one or more processor(s) (202). The one or more processor(s) (202) may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that process data based on operational instructions. Among other capabilities, the one or more processor(s) (202) may be configured to fetch and execute computer-readable instructions stored in a memory (204) of the client-side system (110). The memory (204) may be configured to store one or more computer-readable instructions or routines in a non-transitory computer readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory (204) may comprise any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0064] In an embodiment, the client-side system (110) may include an interface(s) 206. The interface(s) 206 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 206 may facilitate communication of the client-side system (110). The interface(s) 206 may also provide a communication pathway for one or more components of the client-side system (110). Examples of such components include, but are not limited to, processing engine(s) 208 and a database 210.
[0065] The processing engine(s) (208) may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) (208). In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) (208) may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) (208) may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) (208). In such examples, the client-side system (110) may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the client- side system (110) and the processing resource. In other examples, the processing engine(s) (208) may be implemented by electronic circuitry.
[0066] The processing engine (208) may include one or more engines selected from any of a data acquisition engine (212), an event detector module (214), and other engines (216).
[0067] In an embodiment, the server-side system (114) may include one or more processors coupled with a memory, wherein the memory may store instructions which when executed by the one or more processors may cause the system to facilitate detection of active participants. FIG. 2B with reference to FIG. 1B, illustrates an exemplary representation of the server-side system (114) for facilitating enhanced conference call based on an artificial intelligence (AI) based architecture, in accordance with an embodiment of the present disclosure. In an aspect, the server-side system (114) may comprise one or more processor(s) (222). The one or more processor(s) (222) may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that process data based on operational instructions. Among other capabilities, the one or more processor(s) (222) may be configured to fetch and execute computer-readable instructions stored in a memory (224) of the server-side system (114). The memory (224) may be configured to store one or more computer-readable instructions or routines in a non-transitory computer readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory (224) may comprise any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0068] In an embodiment, the server-side system (114) may include an interface(s) 226. The interface(s) 226 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 226 may facilitate communication of the server-side system (114). The interface(s) 226 may also provide a communication pathway for one or more components of the server-side system (114). Examples of such components include, but are not limited to, processing engine(s) 228 and a database 230.
[0069] The processing engine(s) (228) may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) (228). In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) (228) may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) (228) may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) (228). In such examples, the server-side system (114) may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the server-side system (114) and the processing resource. In other examples, the processing engine(s) (228) may be implemented by electronic circuitry.
[0070] The processing engine (228) may include one or more engines selected from any of a data acquisition engine (232), an artificial intelligence (AI) engine (234), and other engines (236).
[0071] FIG. 3A illustrates an exemplary client-side method flow diagram (300) for enabling communication with a server-side system, in accordance with an embodiment of the present disclosure. At 302, the method may include the step of receiving, by one or more processors (202), a plurality of signals from one or more sensors associated with a client-side system (110), and at 304, the method (300) may further include the step of extracting, by the one or more processors (202), audio meta data from the plurality of signals received.
[0072] Further, the method (300) may include, at 306, the step of extracting, by the one or more processors (202), video meta data from the plurality of signals received.
[0073] Furthermore, the method may include, at 308, the step of transmitting, by the one or more processors (202), the user meta data to the server-side system (114).
[0074] FIG. 3B illustrates an exemplary server-side method flow diagram (350) for facilitating an enhanced conference call, in accordance with an embodiment of the present disclosure. At 352, the method may include the step of receiving, by one or more processors (222), a plurality of user meta data from the client-side systems (110), and at 354, the method (350) may further include the step of extracting, by the one or more processors (222), a plurality of first set of attributes from the plurality of user meta data, the plurality of first set of attributes pertaining to pitch values of each user for a first predetermined period of time.
[0075] Further, the method (350) may include at 356 the step of extracting, by the one or more processors (222), a plurality of second set of attributes from the plurality of user meta data, the plurality of second set of attributes pertaining to lip movement of each said user for a second predetermined time.
[0076] Furthermore, the method may include at 358, the step of generating, by the one or more processors (222), a conference meta data based on the extracted plurality of first and second set of attributes. Further, the method may include at 360, the step of transmitting, by the one or more processors (222), the conference meta data to the plurality of client-side systems (110).
[0077] In an exemplary embodiment, the following signals may be used to identify the top n participants (a sketch combining them into a single activity score follows this list):
• isSpeaking – whether the user is speaking, as reported by the isSpeaking module.
• Lip movement – detected when video is unmuted.
• Audio pitch level.
• Audio bitrate – audio packets (bits per second) which the client is sending to the backend for the communication.
• Audio mute status – audio mute/unmute status of the participant.
• Video bitrate – video packets (kbits per second) which the client is sending to the backend for the communication. This is crucial to determine whether a participant is active in the call; if video bitrate is flowing, the participant can safely be given a higher activity score.
• Video FPS – number of video frames per second which the client is sending to the backend for the communication.
• Video resolution – height and width of each frame which the client is sending to the backend for the communication.
• Video mute status – video mute/unmute status of the participant.
In a way of example and not as a limitation, if a participant has been speaking (the isSpeaking module is true) for 20 straight seconds, then that participant is the most active speaker; the same rule can be added for pitch level and lip movement.
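As promised above, here is a minimal sketch folding the listed signals into a single per-tick activity score; every weight and threshold is an assumption chosen only to mirror the relative emphasis in the text (for example, flowing video bitrate earning a higher score).

```ts
interface SignalSnapshot {
  isSpeaking: boolean;
  lipMovement: boolean;
  audioPitchDb: number;
  audioBitrateBps: number;
  audioMuted: boolean;
  videoBitrateKbps: number;
  videoFps: number;
  videoMuted: boolean;
}

// Hypothetical per-tick scoring of one participant from the listed signals.
function activityScore(s: SignalSnapshot): number {
  let score = 0;
  if (s.isSpeaking && !s.audioMuted) score += 10;          // SDK says user speaks
  if (s.lipMovement && !s.videoMuted) score += 8;          // vision sees lips move
  if (s.audioPitchDb > 60) score += 5;                     // 60 dB example threshold
  if (s.audioBitrateBps > 0 && !s.audioMuted) score += 2;  // audio is flowing
  if (s.videoBitrateKbps > 0 && !s.videoMuted) score += 2; // flowing video bitrate
  if (s.videoFps > 0) score += 1;                          // frames are being sent
  return score;
}
```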
[0078] FIG. 4 illustrates exemplary block flow diagrams (400) depicting components of the system involved in the detection of active participants, in accordance with an embodiment of the present disclosure. As illustrated, in a way of example and not as a limitation, a communication platform for video conferencing may be provided by, but is not limited to, the Azure Communication Services (ACS) Calling client software development kit (SDK) (418). To initialize the ACS Calling client SDK (418), an ACS token (406) can be fetched from a backend (404), which in turn talks to the MSFT ACS backend (414). Once the SDK is initialized, a group call can be joined (416) and the client will start receiving the remote participants' video and audio and render both in the application.
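For the web variant of the ACS Calling SDK, the fetch-token, initialize, and join sequence of FIG. 4 might look roughly as follows; the token endpoint and group id are placeholders, and this is a sketch rather than the disclosed implementation.

```ts
import { CallClient } from "@azure/communication-calling";
import { AzureCommunicationTokenCredential } from "@azure/communication-common";

async function joinGroupCall(groupId: string) {
  // Fetch an ACS token (406) from our own backend (404), which in turn talks
  // to the ACS backend (414). The endpoint path is a placeholder.
  const { token } = await (await fetch("/api/acs-token")).json();

  // Initialize the calling SDK with the token and join the group call (416).
  const callClient = new CallClient();
  const callAgent = await callClient.createCallAgent(
    new AzureCommunicationTokenCredential(token),
  );
  return callAgent.join({ groupId }); // remote audio/video then flows to the client
}
```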
[0079] In reference to FIG. 4, FIG. 5 illustrates an exemplary block diagram (500) depicting functional components of the system involved in the detection of active participants, in accordance with an embodiment of the present disclosure.
[0080] As shown in FIG. 5, the system (110) may include an audio capturer (504) that may interact with the operating system to capture the raw audio which is fed to the ACS SDK (418) for the local participant. Different operating systems (OS) may have different techniques; for example, on Android, AudioManager can capture the raw audio being captured by the mic.
[0081] In an exemplary embodiment, an audio pitch analyzer (502) may analyse the pitch of the audio being captured by the audio capturer (504); the louder the speaker, the higher the pitch value. The pitch value will be constantly fed to a backend service (404) for the participant, so that the entity can use it to accurately identify the active participant.
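The disclosure expresses pitch values in decibels, which reads as a level measurement; a minimal sketch computing the RMS level of a 16-bit PCM frame in dBFS is shown below (treating this value as the "pitch value" fed to the backend is an assumption).

```ts
// Computes the RMS level of one frame of 16-bit PCM samples in dBFS
// (decibels relative to full scale). An all-zero frame yields -Infinity.
function frameLevelDb(samples: Int16Array): number {
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] / 32768; // normalize to [-1, 1)
    sumSquares += s * s;
  }
  const rms = Math.sqrt(sumSquares / samples.length);
  return 20 * Math.log10(rms);
}
```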
[0082] In an exemplary embodiment, an audio module (508) may interact with the audio capturer (504), which in turn talks to the audio pitch analyzer (502). A camera capturer module (506) may capture raw camera frames and feed the frames in parallel to a computer vision module (512) as well as a camera module (510). The computer vision module may analyse each camera frame to see if there is any lip movement and continuously share this information with a voice channel (VC) application (516).
[0083] In an exemplary embodiment, the camera module (510) may interact with the camera capturer module (506), which in turn talks to the computer vision module (512). The VC application (516) is the core module, which may receive the audio pitch information and lip movement information and relay them to the backend (404) for the participant. The ACS Calling SDK may be a media framework for building video conferencing applications.
[0084] In an exemplary embodiment, a renderer (518) may render remote participant video according to the active participant list. For example, the renderer may pick the top 4 or 9 participants based on the platform: for Android this may be 4 and for web/desktop this may be 9, but it is not limited to these values. An active participants detector module (514) may analyze the remote participants' pitch and lip movement events and sort the participants accordingly. The resultant list has the most active participants sorted from first to last.
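Selecting the tiles to render then reduces to taking a platform-dependent prefix of the activity-sorted list, roughly as sketched below; the platform parameter is illustrative.

```ts
// Pick the top-N participants to render, e.g. 4 on Android, 9 on web/desktop.
function participantsToRender<T>(
  sortedByActivity: T[],
  platform: "android" | "web",
): T[] {
  const n = platform === "android" ? 4 : 9;
  return sortedByActivity.slice(0, n);
}
```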
[0085] FIGs. 6A-6B illustrate a generic flow diagram of implementations of an exemplary conference workflow, in accordance with an embodiment of the present disclosure. As illustrated in FIG. 6A, in an aspect, the VC application may initialize the audio module and start audio capture. The audio module may generate audio pitch events whenever audio is detected above 60 dB (but not limited to this value), and the event may be sent to the backend. The audio module may also continuously monitor the audio bitrate and mute status every second; if, during the interval, it sees that the audio bitrate is flowing and the mute status is false, then an event is sent to the backend.
[0086] In an exemplary embodiment, the audio capturer may feed the raw audio frames to the audio pitch analyzer, which converts that feed into a continuous flow of pitch values in decibels. If at any time the audio pitch value is greater than 60 dB for 500 ms straight, an event is generated to the VC application for this participant, which in turn registers the event with the backend. The camera module may generate an event for lip movement when the lips are observed continuously moving for 500 ms straight; this event is registered at the backend in real time. The camera module also monitors various aspects of the video, such as frame rate (FPS), resolution, bitrate, and mute status, every second; for each aspect, an event is generated to the backend.
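The pitch event and the lip-movement event share the same shape: a condition must hold for 500 ms straight before a single event fires. A small sketch of that sustained-condition detector follows; the class and its re-arm behavior are assumptions.

```ts
// Fires onSustained() once when check() has been continuously true for holdMs.
// Re-arms after the condition drops, matching the "500 ms straight" rule.
class SustainedConditionDetector {
  private since: number | null = null;
  private fired = false;

  constructor(
    private readonly holdMs: number,
    private readonly onSustained: () => void,
  ) {}

  update(conditionTrue: boolean, nowMs: number = Date.now()): void {
    if (!conditionTrue) {
      this.since = null;  // condition broke: reset the timer
      this.fired = false; // and re-arm for the next sustained run
      return;
    }
    if (this.since === null) this.since = nowMs;
    if (!this.fired && nowMs - this.since >= this.holdMs) {
      this.fired = true;
      this.onSustained(); // e.g. register a pitch or lip-movement event
    }
  }
}

// Usage: const pitchEvent = new SustainedConditionDetector(500, sendPitchEvent);
// called on every analyzer tick: pitchEvent.update(currentPitchDb > 60);
```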
[0087] In an exemplary embodiment, the camera capturer may feed raw YUV frames (but not limited to this format) to the computer vision module which, in an example and not as a limitation, may identify the lips and observe their movement. If the lips are moving for 500 ms straight, an event is sent to the VC application, which sends the event to the backend against that participant. The ACS Calling SDK may be a third-party SDK which may have an isSpeaking observable for all participants. This event may be generated for any participant whenever the SDK detects that the participant is speaking.
[0088] As illustrated in FIG. 6B, in an aspect, the Active Participant Detector may calculate the scores for participant activity, which are used to sort the list in real time. The events detector may track the following:
• isSpeaking event generated from the third-party SDK.
• Audio pitch level event.
• Lip movement detection event.
• Video FPS.
• Video bitrate.
• Video resolution.
• Video mute status.
• Audio bitrate.
• Audio mute status.
[0089] In an exemplary embodiment, at least two things may be determined from observing the isSpeaking event, the audio pitch values, and the lip sync detector: a participant activity score and a wildcard entry.
[0090] FIG. 7 illustrates an exemplary computer system in which or with which embodiments of the present invention can be utilized, in accordance with embodiments of the present disclosure. As shown in FIG. 7, computer system 700 can include an external storage device 710, a bus 720, a main memory 730, a read-only memory 740, a mass storage device 750, a communication port 760, and a processor 770. A person skilled in the art will appreciate that the computer system may include more than one processor and communication ports. Examples of processor 770 include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system on chip processors or other future processors. Processor 770 may include various modules associated with embodiments of the present invention. Communication port 760 can be any of an RS-232 port for use with a modem-based dial-up connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. Communication port 760 may be chosen depending on a network, such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer system connects. Memory 730 can be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art. Read-only memory 740 can be any static storage device(s), e.g., but not limited to, Programmable Read Only Memory (PROM) chips for storing static information, e.g., start-up or BIOS instructions for processor 770. Mass storage 750 may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g. those available from Seagate (e.g., the Seagate Barracuda 7102 family) or Hitachi (e.g., the Hitachi Deskstar 7K1000), one or more optical discs, Redundant Array of Independent Disks (RAID) storage, e.g. an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc.
[0091] Bus 720 communicatively couples processor(s) 770 with the other memory, storage and communication blocks. Bus 720 can be, e.g. a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems as well as other buses, such as a front side bus (FSB), which connects processor 770 to the software system.
[0092] Optionally, operator and administrative interfaces, e.g. a display, keyboard, and a cursor control device, may also be coupled to bus 720 to support direct operator interaction with a computer system. Other operator and administrative interfaces can be provided through network connections connected through communication port 760. The external storage device 710 can be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc - Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk-Read Only Memory (DVD-ROM). Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system limit the scope of the present disclosure.
[0093] The present disclosure provides an efficient and unique solution for accurately identifying the top n participants. This technique can be extended to observe any number of top n active participants depending on the use case; for example, on mobile the top 4 may be detected, as screen real estate is very small, while for a tablet or desktop at least 9 participants may be shown. Moreover, the system and method further ensure that if a participant starts speaking while the mic is muted, that participant can still be among the top active participants, making communication better.
[0094] While considerable emphasis has been placed herein on the preferred embodiments, it will be appreciated that many embodiments can be made and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other changes in the preferred embodiments of the invention will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.

ADVANTAGES OF THE PRESENT DISCLOSURE
[0095] The present disclosure provides for a system and method that enhances identification of active participants in a video conference.
[0096] The present disclosure provides for a system and method that increases the accuracy of the process of determining participant activity.
[0097] The present disclosure provides for a system and method that uses lip movements and audio pitch level to determine activeness.
[0098] The present disclosure provides for a system and method that uses a time parameter to increase the accuracy.

CLAIMS:
1. A client-side system (110) enabling communication with a server-side system (114) in a teleconference, said system (110) comprising:
one or more processors (202), said one or more processors (202) operatively coupled with a memory (204), said memory storing instructions which when executed by the one or more processors (202) causes the system to:
receive a plurality of signals, from one or more sensors associated with the client-side system (110);
extract audio meta data from the plurality of signals received;
extract video meta data from the plurality of signals received;
generate a user meta data from the extracted audio and video meta data; and,
transmit the user meta data to the server-side system (114), wherein the server-side system (114) generates a conference meta data on receipt of the user meta data from a plurality of client-side systems (110).
2. The system (110) as claimed in claim 1, wherein the one or more processors (202) in the client-side system (110) are associated with an event detector module (214), wherein the event detector module (214) is configured to detect a first event whenever the one or more sensors detects a user speaking.
3. The system (110) as claimed in claim 2, wherein the event detector module (214) is further configured to:
detect a second event whenever the one or more sensors detect changes in facial expressions of the user.
4. The system (110) as claimed in claim 1, wherein the user meta data comprises the audio meta data, the video meta data, the first event and the second event.
5. The system (110) as claimed in claim 1, wherein the one or more processors (202) in the client-side system (110) is further configured to:
monitor a plurality of features such as Video frames per second (FPS), Video Bitrate, Video Resolution, Video mute status, Audio bitrate, and Audio mute status.
6. A server-side system (114) in communication with a plurality of client-side systems (110) in a teleconference, said system comprising:
one or more processors (222), said one or more processors (222) coupled with a memory (224), said memory storing instructions which when executed by the one or more processors (222) causes the system to:
receive a plurality of user meta data from the plurality of client-side systems (110);
extract a plurality of first set of attributes from the plurality of user meta data, the plurality of first set of attributes pertaining to pitch values of each user for a first predetermined period of time;
extract a plurality of second set of attributes from the plurality of user meta data, the plurality of second set of attributes pertaining to lip movement of each said user for a second predetermined time;
generate a conference meta data based on the extracted plurality of first and second set of attributes; and,
transmit the conference meta data to the plurality of client-side systems (110).
7. The system (114) as claimed in claim 6, wherein upon receipt of the conference meta data the one or more processors (202) in the client-side system (110) is configured to:
compare the transmitted conference meta data with a previous conference meta data stored in a database associated with each user computing device;
capture one or more changes in the transmitted conference meta data based on the comparison made with the transmitted conference meta data and the previous conference meta data;
update a user display interface associated with each client-side system (110) based on the captured one or more changes; and,
manage one or more elements of each said client-side system (110) based on the captured one or more changes.
8. The system (114) as claimed in claim 6, wherein the conference meta data is generated in real time.
9. The system (114) as claimed in claim 6, wherein to obtain the conference meta data, the one or more processors (222) operatively coupled to the server side are configured to:
verify the audio meta data of each said user with the video meta data of said user and establish an active period of each said user;
generate a score for each said user based on the verification of the audio meta data of each said user with the video meta data of said user;
compare the score of each said user with the respective scores of the plurality of users; and,
determine a set of active users in descending order of the score based on the comparison made.
10. The system (114) as claimed in claim 7, wherein the one or more processors (222) operatively coupled to the server-side system (114) is further configured to:
detect an audio pitch value of the user from the user meta data whenever the audio pitch value goes above a predefined threshold;
determine a background noise based on the detected audio pitch value.
11. The system (114) as claimed in claim 10, wherein the one or more processors (222) in the server-side system (114) is further configured to:
determine if the user is actually speaking in the teleconference based on the detected audio pitch and the changes in the facial expressions of the user.
12. The system (114) as claimed in claim 10, wherein the one or more processors (222) in the server-side system (114) is further configured to:
update the score of the plurality of users in real time; and,
sort the updated score in a decreasing order of score.
13. The system (114) as claimed in claim 12, wherein the one or more processors (222) in the server-side system (114) is further configured to:
purge the updated score periodically after a predetermined period of time and after a threshold value of the score is reached; and,
automatically increase the score of the user to an upper limit score, if a user reaches the threshold score or speaks for the predetermined period of time.
14. The system (114) as claimed in claim 13, wherein the one or more processors (222) in the server-side system (114) are further configured to:
sort the set of active users in a decreasing order whenever a score of a user changes.
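A sketch of the real-time score maintenance recited in claims 12 to 14, under one plausible reading of the purge and upper-limit rules; the interval, threshold, and cap values are all assumptions.

```python
import time
from typing import Dict, List, Tuple

UPPER_LIMIT = 100.0     # assumed cap on any single user's score
SCORE_THRESHOLD = 80.0  # assumed threshold that triggers the jump to the cap
PURGE_INTERVAL = 60.0   # assumed predetermined period (seconds) between purges

class ScoreBoard:
    def __init__(self) -> None:
        self.scores: Dict[str, float] = {}
        self._last_purge = time.monotonic()

    def update(self, user_id: str, delta: float) -> None:
        score = self.scores.get(user_id, 0.0) + delta
        # A user who crosses the threshold is promoted straight to the upper limit.
        self.scores[user_id] = UPPER_LIMIT if score >= SCORE_THRESHOLD else score
        self._maybe_purge()

    def ranked(self) -> List[Tuple[str, float]]:
        """Active users re-sorted in decreasing order whenever a score changes."""
        return sorted(self.scores.items(), key=lambda kv: kv[1], reverse=True)

    def _maybe_purge(self) -> None:
        # One plausible reading of the purge rule: periodically drop stale low scores
        # so the board stays biased toward recent activity.
        if time.monotonic() - self._last_purge >= PURGE_INTERVAL:
            self.scores = {u: s for u, s in self.scores.items() if s >= SCORE_THRESHOLD}
            self._last_purge = time.monotonic()
```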
15. A method for enabling communication of a client-side system (110) with a server-side system (114) in a teleconference, said method comprising:
receiving, by one or more processors (202), a plurality of signals from one or more sensors associated with the client-side system (110),
wherein the one or more processors (202) are operatively coupled to the client-side system (110), said one or more processors (202) coupled with a memory (204) storing instructions executed by the one or more processors (202);
extracting, by the one or more processors (202), audio meta data from the plurality of signals received;
extracting, by the one or more processors (202), video meta data from the plurality of signals received;
generating, by the one or more processors (202), a user meta data of the user computing device from the extracted audio and video meta data;
transmitting, by the one or more processors (202), the user meta data to the server-side system (114), wherein the server-side system (114) generates a conference meta data on receipt of the user meta data from a plurality of client-side systems (110).
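A minimal sketch of the client-side pipeline recited in claim 15, assuming the raw sensor signals arrive as tagged dictionaries; the field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ClientUserMetaData:
    audio_meta: List[float]  # e.g. pitch samples extracted from the audio signal
    video_meta: List[float]  # e.g. lip-movement deltas extracted from video frames

def build_user_meta_data(signals: List[Dict]) -> ClientUserMetaData:
    """Extract audio and video meta data from raw sensor signals and package them for the server."""
    audio = [s["pitch"] for s in signals if s.get("kind") == "audio"]
    video = [s["lip_delta"] for s in signals if s.get("kind") == "video"]
    return ClientUserMetaData(audio_meta=audio, video_meta=video)
```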
16. The method as claimed in claim 15, wherein the one or more processors (202) in the client-side system (110) are associated with an event detector module, wherein the method further comprises the step of:
detecting, by the event detector module (214), a first event whenever the one or more sensors detect a user speaking.
17. The method as claimed in claim 16, wherein the method further comprises the step of:
detecting, by the event detector module (214), a second event whenever the one or more sensors detect changes in facial expressions of the user.
18. The method as claimed in claim 15, wherein the user meta data comprises the audio meta data, the video meta data, the first event and the second event.
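A sketch of the event detector recited in claims 16 to 18; the event labels and callback shape are assumptions. Per claim 18, both events would travel to the server inside the user meta data.

```python
from typing import Callable

class EventDetector:
    """Raises a first event on speech and a second event on a facial-expression change."""

    def __init__(self, on_event: Callable[[str], None]) -> None:
        self.on_event = on_event

    def feed_audio(self, is_speaking: bool) -> None:
        if is_speaking:
            self.on_event("first_event")   # user detected speaking

    def feed_video(self, expression_changed: bool) -> None:
        if expression_changed:
            self.on_event("second_event")  # facial expression changed
```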
19. The method as claimed in claim 15, wherein the one or more processors (202) in the client-side system (110) are further configured to:
monitor a plurality of features such as video frames per second (FPS), video bitrate, video resolution, video mute status, audio bitrate, and audio mute status.
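A sketch of the feature monitoring recited in claim 19; the field names and the reporting rule are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class MediaStats:
    video_fps: float
    video_bitrate_kbps: float
    video_resolution: str
    video_muted: bool
    audio_bitrate_kbps: float
    audio_muted: bool

def should_report(stats: MediaStats) -> bool:
    """Only streams that are not fully muted contribute meaningful activity signals."""
    return not (stats.audio_muted and stats.video_muted)
```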
20. A method for enabling communication of a server-side system (114) with a plurality of client-side systems (110) in a teleconference, said method comprising:
receiving, by one or more processors (222), a plurality of user meta data from the plurality of client-side systems (110),
wherein the one or more processors (222) are operatively coupled to a memory (224) storing instructions which are executed by the one or more processors (222);
extracting, by the one or more processors (222), a plurality of first set of attributes from the plurality of user meta data, the plurality of first set of attributes pertaining to pitch values of each user for a first predetermined period of time;
extracting, by the one or more processors (222), a plurality of second set of attributes from the plurality of user meta data, the plurality of second set of attributes pertaining to lip movement of each said user for a second predetermined time;
generating, by the one or more processors (222), a conference meta data based on the extracted plurality of first and second set of attributes; and,
transmitting, by the one or more processors (222), the conference meta data to the plurality of client-side systems (110).
21. The method as claimed in claim 20, wherein, upon receipt of the conference meta data by each client-side system (110), the method further comprises the steps of:
comparing, by the one or more processors (202), the transmitted conference meta data with a previous conference meta data stored in a database associated with each user computing device;
capturing, by the one or more processors (202), one or more changes in the transmitted conference meta data based on the comparison made between the transmitted conference meta data and the previous conference meta data;
updating, by the one or more processors (202), a user display interface associated with each client-side system (110) based on the captured one or more changes; and,
managing, by the one or more processors (202), one or more elements of each said client-side system (110) based on the captured one or more changes.
22. The method as claimed in claim 20, wherein the conference meta data is generated in real time.
23. The method as claimed in claim 20, wherein to obtain the conference meta data, the method further comprises the steps of:
verifying, by the one or more processors (222) operatively coupled to the server-side system (114), the audio meta data of each said user with the video meta data of said user and establishing an active period of each said user;
generating, by the one or more processors (222), a score for each said user based on the verification of the audio meta data of each said user with the video meta data of said user;
comparing, by the one or more processors (222), the score of each said user with the respective scores of the plurality of users; and,
determining, by the one or more processors (222), a set of active users in descending order of the score based on the comparison made.
24. The method as claimed in claim 23, wherein the method further comprises the steps of:
detecting, by the one or more processors (222), an audio pitch value of the user whenever the audio pitch value goes above a predefined threshold; and,
determining, by the one or more processors (222), background noise based on the detected audio pitch value.
25. The method as claimed in claim 24, wherein the method further comprises the steps of:
determining, by the one or more processors (222), if the user is actually speaking in the teleconference based on the detected audio pitch and the changes in the facial expressions of the user.
26. The method as claimed in claim 25, wherein the method further comprises the steps of:
updating, by the one or more processors (222), the score of the plurality of users in real time; and,
sorting, by the one or more processors (222), the updated score in a decreasing order of score.
27. The method as claimed in claim 26, wherein the method further comprises the steps of:
purging, by the one or more processors (222), the updated score periodically after a predetermined period of time and after a threshold value of the score is reached; and,
automatically increasing, by the one or more processors (222), the score of the user to an upper limit score, if a user reaches the threshold score or speaks for the predetermined period of time.
28. The method as claimed in claim 27, wherein the method further comprises the steps of:
sorting, by the one or more processors (222), the set of active users in a decreasing order whenever a score of a user changes.

Documents

Application Documents

# Name Date
1 202121039528-STATEMENT OF UNDERTAKING (FORM 3) [01-09-2021(online)].pdf 2021-09-01
2 202121039528-PROVISIONAL SPECIFICATION [01-09-2021(online)].pdf 2021-09-01
3 202121039528-FORM 1 [01-09-2021(online)].pdf 2021-09-01
4 202121039528-DRAWINGS [01-09-2021(online)].pdf 2021-09-01
5 202121039528-DECLARATION OF INVENTORSHIP (FORM 5) [01-09-2021(online)].pdf 2021-09-01
6 202121039528-FORM-26 [14-09-2021(online)].pdf 2021-09-14
7 202121039528-Proof of Right [04-02-2022(online)].pdf 2022-02-04
8 202121039528-ENDORSEMENT BY INVENTORS [27-08-2022(online)].pdf 2022-08-27
9 202121039528-DRAWING [27-08-2022(online)].pdf 2022-08-27
10 202121039528-CORRESPONDENCE-OTHERS [27-08-2022(online)].pdf 2022-08-27
11 202121039528-COMPLETE SPECIFICATION [27-08-2022(online)].pdf 2022-08-27
12 202121039528-FORM 18 [30-08-2022(online)].pdf 2022-08-30
13 Abstract1.jpg 2022-09-15
14 202121039528-FER.pdf 2023-06-20
15 202121039528-FER_SER_REPLY [12-12-2023(online)].pdf 2023-12-12
16 202121039528-CORRESPONDENCE [12-12-2023(online)].pdf 2023-12-12
17 202121039528-COMPLETE SPECIFICATION [12-12-2023(online)].pdf 2023-12-12
18 202121039528-CLAIMS [12-12-2023(online)].pdf 2023-12-12
19 202121039528-ABSTRACT [12-12-2023(online)].pdf 2023-12-12
20 202121039528-US(14)-HearingNotice-(HearingDate-07-06-2024).pdf 2024-05-10
21 202121039528-FORM-26 [05-06-2024(online)].pdf 2024-06-05
22 202121039528-Correspondence to notify the Controller [05-06-2024(online)].pdf 2024-06-05
23 202121039528-Written submissions and relevant documents [24-06-2024(online)].pdf 2024-06-24
24 202121039528-Annexure [24-06-2024(online)].pdf 2024-06-24
25 202121039528-PatentCertificate02-08-2024.pdf 2024-08-02
26 202121039528-IntimationOfGrant02-08-2024.pdf 2024-08-02

Search Strategy

1 sserE_01-06-2023.pdf
2 sseraAE_26-03-2024.pdf

ERegister / Renewals

3rd: 17 Sep 2024 (From 01/09/2023 To 01/09/2024)
4th: 17 Sep 2024 (From 01/09/2024 To 01/09/2025)
5th: 17 Sep 2024 (From 01/09/2025 To 01/09/2026)
6th: 17 Sep 2024 (From 01/09/2026 To 01/09/2027)