ABSTRACT
A METHOD AND SYSTEM FOR FACILITATING A VIRTUAL VIDEO CONFERENCE
Disclosed herein is a system 102 and a method for facilitating a virtual video conference. Instead of a video feed, the system displays a virtual reality (VR) avatar of each attendee that is pre-generated before the attendee joins the virtual conference. The system 102 identifies the expressions depicted by each attendee during the ongoing conference and maps the expressions onto the pre-generated VR avatars, thereby providing a real-life experience to the other attendees during the conversation.
FORM 2
THE PATENTS ACT, 1970
(39 OF 1970)
AND
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See section 10 and rule 13)
“A METHOD AND SYSTEM FOR FACILITATING A VIRTUAL VIDEO
CONFERENCE”
Name and Address of the Applicant:
Zensar Technologies Limited, Plot#4 Zensar Knowledge Park, MIDC, Kharadi, Off Nagar Road,
Pune, Maharashtra – 411014, India
Nationality: India
The following specification particularly describes the invention and the manner in which it is to be
performed.
PRIORITY INFORMATION
This application takes priority from the Indian provisional application number 202021047266.
TECHNICAL FIELD
[001] The present invention generally relates to the field of web-based communications and in
particular, relates to providing a method and a system for facilitating virtual video
conferences between attendees by allowing them to depict their emotions without
compromising their privacy.
BACKGROUND OF INVENTION
[002] The following description includes information that may be useful in understanding the
present invention. It is not an admission that any of the information provided herein is prior
art or relevant to the presently claimed invention, or that any publication specifically or
implicitly referenced is prior art.
[003] Web-based communications such as video conferences have become essential for organizations and a new way of virtually conducting meetings, especially since the Covid-19 pandemic struck and employees of various organizations had to work remotely. Video and audio-conferencing systems provided organizations with the necessary platform to hold meetings and ensure the smooth running of organizational affairs. While video conferencing systems have made it easier for organizations to connect with their employees, clients and others, they also pose certain issues that are generally faced by attendees while attending a video conference. For instance, it may so happen that an attendee is not comfortable sharing his/her visual feed, due to the background of his/her location or his/her appearance. In another instance, it may so happen that although the attendee has no issue with sharing his/her visual feed, there might be network issues at his/her location that do not allow him/her to do so. Either way, in a video conference, if the visual feed of an attendee is not visible to other attendees, it creates a certain disparity, as the emotions (or facial expressions) of the attendee during the video conference cannot be conveyed to other attendees, and therefore creates issues in understanding how the content of the conference is being perceived. Further, if an attendee is forced to share his/her visual feed, his/her appearance or the background might create a bias in the minds of other attendees.
[004] There is, therefore, a need for a system that provides an efficient technique for facilitating
virtual video conferences by allowing the attendees to convey their emotions without
compromising their privacy or being restricted by network issues.
SUMMARY OF INVENTION
[005] The present disclosure overcomes one or more shortcomings of the prior art and provides
additional advantages discussed throughout the present disclosure. Additional features and
advantages are realized through the techniques of the present disclosure. Other
embodiments and aspects of the disclosure are described in detail herein and are considered
a part of the claimed disclosure.
[006] In one non-limiting embodiment of the present disclosure, a method for facilitating a virtual
video conference is disclosed. The method comprises detecting faces of attendees from a
live video feed of an ongoing virtual conference, wherein the attendees are attending the
ongoing virtual conference using attendee devices connected through their individual
networks having a network bandwidth associated therewith. In one particular embodiment,
while the ongoing virtual conference progresses, the method comprises identifying one or
more expressions of the attendees based on the live video feed. The method further
comprises dynamically mapping the one or more expressions of the attendees onto pre-generated virtual reality (VR) avatars of the attendees locally on the attendee devices in
such a manner that the pre-generated virtual reality (VR) avatars mimic the attendees’
emotions, thereby preventing the transmission of actual video of the attendees over their
individual networks and yet conveying attendees’ emotions among themselves. Further,
dynamically mapping the one or more expressions of the attendees onto the pre-generated
virtual reality (VR) avatars comprises mapping nod reactions being performed by the
attendees along with a plurality of emotions being depicted by the attendees in a same time
frame onto the corresponding pre-generated VR avatars of the attendees.
[007] In one non-limiting embodiment of the present disclosure, a system for facilitating a virtual
video conference is disclosed. The system comprises a facial detection unit configured to
detect faces of attendees from a live video feed of an ongoing virtual conference, wherein
the attendees are attending the ongoing virtual conference using attendee devices connected
through their individual network having a network bandwidth associated therewith. In one
particular embodiment, while the ongoing virtual conference progresses, the system
comprises an expression identification unit configured to identify one or more expressions
of the attendees based on the live video feed. The system further comprises an expression
mapping unit configured to dynamically map the one or more expressions of the attendees
onto pre-generated virtual reality (VR) avatars of the attendees locally on the attendee
devices in such a manner that the pre-generated virtual reality (VR) avatars mimic the
attendees’ emotions, thereby preventing the transmission of actual video of the attendees
over their individual networks and yet conveying attendees’ emotions among themselves.
Further, to dynamically map the one or more expressions of the attendees onto the pre-generated virtual reality (VR) avatars, the expression mapping unit maps nod reactions
being performed by the attendees along with a plurality of emotions being depicted by the
attendees in a same time frame onto the corresponding pre-generated VR avatars of the
attendees.
[008] The foregoing summary is illustrative only and is not intended to be in any way limiting.
In addition to the illustrative aspects, embodiments, and features described above, further
aspects, embodiments, and features will become apparent by reference to the drawings and
the following detailed description.
BRIEF DESCRIPTION OF DRAWINGS
[009] The embodiments of the disclosure, as well as a preferred mode of use, further
objectives and advantages thereof, will best be understood by reference to the following
detailed description of an illustrative embodiment when read in conjunction with the
accompanying drawings. One or more embodiments are now described, by way of example
only, with reference to the accompanying drawings in which:
[0010] Figure 1 shows an environment 100 of a system for facilitating a virtual video conference,
in accordance with an embodiment of the present disclosure;
[0011] Figure 2 shows a block diagram 200 illustrating the system for facilitating a virtual video
conference, in accordance with an embodiment of the present disclosure;
[0012] Figure 3 depicts a method 300, by way of a flow diagram, for facilitating a virtual video
conference, in accordance with an embodiment of the present disclosure;
[0013] Figure 3A depicts a method 300A, by way of a flow diagram, for detecting the faces of
the attendees from the live video feed of the ongoing virtual conference using facial
detection technique, in accordance with an embodiment of the present disclosure;
[0014] Figure 3B depicts a method 300B, by way of a flow diagram, for identifying the one or
more expressions of the attendees based on the live video feed, in accordance with an
embodiment of the present disclosure;
[0015] Figure 4 depicts a method 400, by way of a flow diagram, for generating virtual reality
(VR) avatars of an attendee, in accordance with an embodiment of the present disclosure;
and
[0016] Figure 5 shows a block diagram of an exemplary computer system 500 for implementing
the embodiments consistent with the present disclosure.
[0017] The figures depict embodiments of the disclosure for purposes of illustration only. One
skilled in the art will readily recognize from the following description that alternative
embodiments of the structures and methods illustrated herein may be employed without
departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION
[0018] The foregoing has broadly outlined the features and technical advantages of the present
disclosure in order that the detailed description of the disclosure that follows may be better
understood. It should be appreciated by those skilled in the art that the conception and
specific embodiment disclosed may be readily utilized as a basis for modifying or
designing other structures for carrying out the same purposes of the present disclosure.
[0019] The novel features which are believed to be characteristic of the disclosure, both as to its
organization and method of operation, together with further objects and advantages will be
better understood from the following description when considered in connection with the
accompanying figures. It is to be expressly understood, however, that each of the figures
is provided for the purpose of illustration and description only and is not intended as a
definition of the limits of the present disclosure.
[0020] Disclosed herein is a method and system for facilitating a virtual video conference. In
particular, the present disclosure provides a system that creates a virtual reality (VR) avatar
of an attendee who is scheduled to join a video conference and displays the generated VR
avatar of the attendee during the video conference instead of his/her live video feed.
Further, the system goes a step further and maps the emotions (facial expressions) depicted
by the attendee during the video conference onto his/her VR avatar such that the VR avatar
is not static but dynamic in nature, that is, the VR avatar mimics the emotions depicted by
the attendee in real-time. Further, the present disclosure, by preventing the transmission of
live video feed of the attendee over the network, optimizes network bandwidth in order to
avoid any glitches that may interrupt the smooth continuity of the video conference due to
network issues being faced by one or more attendees at their end. The detailed working of
the system is described in upcoming paragraphs.
[0021] Figure 1 shows an environment 100 of a system for facilitating a virtual video conference,
in accordance with an embodiment of the present disclosure. It must be understood by a person skilled in the art that the system may also be implemented in various environments,
other than as shown in Fig. 1.
[0022] The detailed explanation of the exemplary environment 100 is explained in conjunction
with Figure 2 that shows a block diagram 200 of a system 102 for facilitating a virtual
video conference, in accordance with an embodiment of the present disclosure. Although
the present disclosure is explained considering that the system 102 is implemented on a
server, it may be understood that the system 102 may be implemented in a variety of
computing systems, such as a laptop computer, a desktop computer, a notebook, a
workstation, a mainframe computer, a server, a network server, or a cloud-based computing environment.
[0023] In one implementation, the system 102 may comprise an I/O interface 202, a processor
204, a memory 206 and the units 214. The memory 206 may be communicatively coupled
to the processor 204 and the units 214. The memory 206 stores a plurality of pictures 106
and a reference video 208 corresponding to each attendee. The memory further stores the
pre-generated VR avatars 210 of the attendees and a pre-trained emotion detection model
212. The significance and use of each of the stored quantities is explained in the upcoming
paragraphs of the specification. The processor 204 may be implemented as one or more
microprocessors, microcomputers, microcontrollers, digital signal processors, central
processing units, state machines, logic circuitries, and/or any devices that manipulate
signals based on operational instructions. Among other capabilities, the processor 204 is
configured to fetch and execute computer-readable instructions stored in the memory 206.
The I/O interface 202 may include a variety of software and hardware interfaces, for
example, a web interface, a graphical user interface, and the like. The I/O interface 202
may enable the system 102 to communicate with other computing devices, such as web
servers and external data servers (not shown). The I/O interface 202 may facilitate multiple
communications within a wide variety of networks and protocol types, including wired
networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular,
or satellite. The I/O interface 202 may include one or more ports for connecting many
devices to one another or to another server.
[0024] In one implementation, the units 214 may comprise a facial detection unit 216, an
expression identification unit 218, an expression mapping unit 220, a feature identification
unit 222, a feature mapping unit 224 and an avatar generation unit 226. According to
embodiments of the present disclosure, these units 216-226 may comprise hardware components such as a processor, a microprocessor, microcontrollers, or an application-specific integrated circuit for performing various operations of the system 102. It must be understood by a person skilled in the art that the processor 204 may perform all the functions of the units 216-226 according to various embodiments of the present disclosure.
[0025] Now referring back to Figure 1, the environment 100 depicts a system 102 for facilitating
a virtual video conference. For the system 102 to be implemented in real-time, the system
102 first has to perform an onboarding process for each attendee scheduled to join a video
conference.
Onboarding process:
[0026] During the onboarding process, the system 102 creates a VR avatar for each attendee. It
may be noted that the onboarding process may happen in different scenarios. For instance,
when the system 102 is implemented by an organization, the system 102 may onboard each member of the organization beforehand, irrespective of whether he/she needs to attend a virtual conference, thereby eliminating the need to onboard a member before the commencement of a video conference. In another scenario, the system
102 is capable of onboarding a member (or an attendee) right before he/she joins a video
conference, if for instance, the attendee is not a part of the organization or if the attendee
is a part of the organization but has not been onboarded before-hand. However, the
underlying fact remains that an attendee who is scheduled to join a video conference is
onboarded before the commencement of the video conference.
[0027] As illustrated in Figure 1, for onboarding an attendee 104, a plurality of pictures 106
depicting a face of the attendee 104 from different angles is received by the system 102. In
one embodiment, before joining the video conference, the system 102 may require the
attendee 104 to provide pictures of his/her face from different angles using an image
capturing device associated with the attendee device, such as a webcam. A few examples of
the pictures captured at different angles are shown in Figure 1. In accordance with Figure
2, the plurality of pictures 106 is received by the avatar generation unit 226. In an alternate
embodiment, the avatar generation unit 226 may receive a reference video 208 comprising
a video recording depicting the face of attendee 104 from different angles. From the
plurality of pictures 106 received, the avatar generation unit 226 detects a static face of the
attendee 104 using a facial detection technique. In one embodiment, the facial detection
technique as implemented by the avatar generation unit 226 is a HAAR Cascade Classifier
technique that uses a plurality of predefined facial features comprising at least one of line
features, edge features and four-rectangle features to detect the static face of the attendee
104. It may be noted by a skilled person that the phrase “static” has been used herein to
describe a face that does not project any emotions and merely provides a reference for
avatar generation.
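By way of a non-limiting illustration only, the following sketch shows how such Haar-cascade-based detection of a static face from an onboarding picture could be realized. The use of OpenCV and its bundled frontal-face cascade file is an assumption made purely for illustration; the disclosure itself does not prescribe any particular library.
```python
# Illustrative sketch only: detecting a static face in an onboarding picture with a
# Haar cascade classifier. The use of OpenCV and its bundled frontal-face cascade
# is an assumption; the disclosure only names the HAAR Cascade Classifier technique.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def detect_static_face(picture_path: str):
    """Return the largest detected face (x, y, w, h) in one onboarding picture."""
    image = cv2.imread(picture_path)
    if image is None:
        raise FileNotFoundError(picture_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detection, assumed to be the attendee's face.
    return max(faces, key=lambda f: f[2] * f[3])
```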
[0028] Once the static face of the attendee 104 is detected, the avatar generation unit 226 identifies
a plurality of facial identity features corresponding to the static face of the attendee 104.
The plurality of facial identity features comprises at least one of eyes, nose, mouth and
ears. In one embodiment, the avatar generation unit 226 segregates various features of the
face of the attendee 104 such as the eyes, nose, lips, ears etc. and identifies feature points
corresponding to the various features. The feature points may comprise at least one of outer
corner of left eye, outer corner of right eye, inner corner of the right eye, inner corner of
the left eye, bottom of the left eye, bottom of the right eye, top of the left eye, top of the
right eye, left nose corner, right nose corner, left mouth corner, right mouth corner, mouth
top, mouth bottom and chin.
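Purely for illustration, the named facial identity features may be thought of as indices into a standard 68-point facial landmark layout, as sketched below. The landmark scheme, the exact indices and the left/right convention (which here follows the image view) are assumptions, since the disclosure does not prescribe a particular landmarking model.
```python
# Minimal sketch: expressing the feature points named in the disclosure as indices
# into the widely used 68-point facial landmark layout. The scheme and indices are
# assumptions for illustration; left/right here follow the image view.
FEATURE_POINTS = {
    "outer_corner_left_eye": 45,
    "inner_corner_left_eye": 42,
    "outer_corner_right_eye": 36,
    "inner_corner_right_eye": 39,
    "top_left_eye": 44,
    "bottom_left_eye": 46,
    "top_right_eye": 38,
    "bottom_right_eye": 40,
    "left_nose_corner": 31,
    "right_nose_corner": 35,
    "left_mouth_corner": 48,
    "right_mouth_corner": 54,
    "mouth_top": 51,
    "mouth_bottom": 57,
    "chin": 8,
}

def extract_feature_points(landmarks):
    """Pick the named identity features from a full list of (x, y) landmarks."""
    return {name: landmarks[idx] for name, idx in FEATURE_POINTS.items()}
```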
[0029] Further, based on the plurality of pictures 106 received by the avatar generation unit 226
and the plurality of facial identity features identified from them, the avatar generation unit
226 creates a three-dimensional model of the static face of the attendee 104 using a photogrammetry technique, calculating an overlap between subsequent pictures and the depth of the features in each picture. The overlapping regions are then stitched together to create the three-dimensional model of the static face of the attendee 104. The created three-dimensional model of the static face of the attendee 104 is mapped into a virtual reality environment using one or more known graphic techniques including but not limited to
materials painting, texture mapping, etc. in order to create a VR avatar 108 of the attendee
104. A similar procedure is followed for each attendee scheduled to join the video
conference.
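As a non-limiting sketch of the core photogrammetry step, matching feature points observed in two overlapping pictures may be triangulated into three-dimensional points as shown below. The use of OpenCV and the availability of camera projection matrices P1 and P2 are assumptions for illustration; the disclosure describes the approach only in terms of overlap and feature depth.
```python
# Illustrative sketch of the core photogrammetry step: triangulating feature points
# that overlap between two onboarding pictures into 3D points. The use of OpenCV and
# the availability of 3x4 camera projection matrices P1, P2 are assumptions.
import numpy as np
import cv2

def triangulate_feature_points(P1, P2, pts_view1, pts_view2):
    """pts_view1/pts_view2: (N, 2) arrays of matching feature points in two pictures."""
    pts1 = np.asarray(pts_view1, dtype=np.float64).T  # shape (2, N)
    pts2 = np.asarray(pts_view2, dtype=np.float64).T
    homogeneous = cv2.triangulatePoints(P1, P2, pts1, pts2)  # shape (4, N)
    points_3d = (homogeneous[:3] / homogeneous[3]).T         # back to (N, 3)
    return points_3d
```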
Real-time use (When the video conference starts and progresses):
[0030] As the video conference begins, the attendee 104 attends the ongoing virtual conference
using an attendee device connected through his/her individual network having a network
bandwidth associated therewith. It may be noted that the same is true for other attendees
attending the ongoing video conference. Once the conference begins, the pre-generated VR avatar 108 of the attendee 104 is displayed rather than the live visual feed. However, the purpose of the system 102 is not just to display a VR avatar but to allow the VR avatar to mimic the live emotions depicted by an attendee during the video conference. To achieve this, the system 102, through the facial detection unit 216, detects the face of the attendee 104 from a live video feed of the ongoing virtual video conference using the facial detection technique described in the preceding paragraphs. However, since the video conference is ongoing, the facial detection unit 216 samples the live video feed into a plurality of frames and dynamically processes the plurality of frames vis-à-vis the plurality of predefined facial features in order to detect the face of the attendee 104. According to an embodiment, sampling is performed by taking snapshots of the live video at a predefined number of times per second. Further, the feature identification unit 222 identifies the plurality of facial identity features corresponding to the face of the attendee 104 and the feature mapping unit 224 dynamically maps the plurality of facial identity features with the pre-generated VR avatar 108 of the attendee 104, thereby configuring the pre-generated VR avatar 108 to accurately represent the current facial condition of the attendee 104. It may be noted by a skilled person that the procedure followed by the facial detection unit 216, the feature identification unit 222 and the feature mapping unit 224 is similar to the functionality performed by the avatar generation unit 226, with the only difference that the facial detection unit 216, the feature identification unit 222 and the feature mapping unit 224 function in real-time while the video conference is ongoing and therefore utilize the live video of the ongoing virtual conference, whereas the avatar generation unit 226 functions during the onboarding phase and utilizes either the plurality of pictures 106 or the reference video 208 provided by the attendee 104 before the commencement of the video conference. Repetition is hence avoided for the sake of brevity.
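A minimal sketch of this real-time stage, assuming OpenCV, is given below: the live feed is sampled at a predefined number of snapshots per second and the same Haar-cascade detection is applied to each sampled frame. The camera index, sampling rate and pacing logic are illustrative placeholders only.
```python
# Minimal sketch, assuming OpenCV: sampling the live feed at a predefined number of
# snapshots per second and detecting the attendee's face in each sampled frame.
import time
import cv2

def sample_and_detect(snapshots_per_second: int = 5, camera_index: int = 0):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    capture = cv2.VideoCapture(camera_index)
    interval = 1.0 / snapshots_per_second
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            yield frame, faces            # downstream units consume frame + detections
            time.sleep(interval)          # crude pacing to the predefined sampling rate
    finally:
        capture.release()
```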
[0031] Further, to allow the pre-generated VR avatar 108 of the attendee 104 to mimic the
emotions, the expression identification unit 218 identifies one or more expressions of the
attendee 104 based on the live video feed of the ongoing virtual conference. To do so, the
expression identification unit 218 identifies nod reactions being performed by the attendee
104 depicting a reaction of either approval or disapproval during the ongoing virtual
conference. In addition, the expression identification unit 218 also identifies a plurality of
emotions being depicted by the attendee 104 during the ongoing virtual conference based
on a pre-trained emotion detection model 212 and combines the identified nod reactions
and the plurality of emotions depicted to identify the one or more expressions of the
attendee 104.
[0032] In particular, in order to identify nod reactions being performed by the attendee 104, the
expression identification unit 218 first defines a horizontal axis and a vertical axis based
on the face of each attendee such that the horizontal and the vertical axis intersect at a
centroid of the face. Then the expression identification unit 218 detects a movement of the
face along the horizontal axis as a disapproval nod if a distance of the movement along the
horizontal axis is at least equal to a threshold horizontal distance and detects a movement
of the face along the vertical axis as an approval nod if a distance of the movement along
the vertical axis is at least equal to a threshold vertical distance.
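The nod-detection rule described above may be sketched as follows, assuming the face centroid is tracked across the sampled frames. The threshold values are illustrative placeholders only; the disclosure merely requires that the displacement be at least a threshold distance along the respective axis.
```python
# Sketch of the nod-detection rule, assuming the face centroid is tracked across
# sampled frames. Threshold values are illustrative placeholders.
def detect_nod(centroid_history, horizontal_threshold=40.0, vertical_threshold=30.0):
    """centroid_history: list of (x, y) face centroids from consecutive frames."""
    if len(centroid_history) < 2:
        return None
    xs = [c[0] for c in centroid_history]
    ys = [c[1] for c in centroid_history]
    horizontal_movement = max(xs) - min(xs)   # movement along the horizontal axis
    vertical_movement = max(ys) - min(ys)     # movement along the vertical axis
    if horizontal_movement >= horizontal_threshold:
        return "disapproval"                  # side-to-side head movement
    if vertical_movement >= vertical_threshold:
        return "approval"                     # up-and-down head movement
    return None
```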
[0033] Further, in order to identify the plurality of emotions, an emotion depicted by the face in
each frame is assigned to an emotion category by the pre-trained emotion detection model
212. The pre-trained emotion detection model 212 is trained based on an image dataset
comprising a plurality of images classified into an emotion category based on an emotion
depicted by each of the plurality of images. In one embodiment, the emotion categories
comprise at least one of anger, fear, disgust, happiness, sadness, surprise, and contempt.
However, it may be noted by a skilled person that the emotion categories mentioned herein
are merely exemplary and should not be construed as limiting.
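A hedged sketch of this emotion classification step is given below, assigning each sampled face crop to one of the listed emotion categories using a pre-trained classifier. The Keras model file name, the input size and the category ordering are assumptions made for illustration; the disclosure only requires a model trained on an emotion-labelled image dataset.
```python
# Hedged sketch: assigning each sampled face crop to an emotion category with a
# pre-trained classifier. The model file name, input size and category order are
# assumptions for illustration.
import numpy as np
import cv2
import tensorflow as tf

EMOTION_CATEGORIES = ["anger", "fear", "disgust", "happiness",
                      "sadness", "surprise", "contempt"]

emotion_model = tf.keras.models.load_model("emotion_detection_model.h5")  # hypothetical file

def classify_emotion(face_crop_bgr):
    gray = cv2.cvtColor(face_crop_bgr, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (48, 48)).astype("float32") / 255.0
    batch = resized.reshape(1, 48, 48, 1)          # assumed model input shape
    probabilities = emotion_model.predict(batch, verbose=0)[0]
    return EMOTION_CATEGORIES[int(np.argmax(probabilities))]
```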
[0034] The identified one or more expressions are dynamically mapped onto the pre-generated
VR avatar 108 of the attendee 104 such that the pre-generated VR avatar 108 mimics the
one or more expressions depicted by the attendee 104 as the virtual conference progresses.
For instance, if the attendee 104 is disapproving something by moving his/her head in a
horizontal direction while showing an emotion of anger on his/her face as established based
on the comparison of the attendee’s currently depicted emotions with the image dataset
used to train the emotion detection model 212, the pre-generated VR avatar 108 would also nod horizontally and project an emotion of anger in the same time frame. For instance, when the attendee 104 is angry during the ongoing virtual conference, a deviation in the feature points of his/her face, such as a broadening of the eyes and a flaring of the nostrils, will be used to identify the emotion as “anger”.
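Purely as an illustrative sketch, the identified nod reaction and emotion for the same time frame may be combined into a single expression record that is then applied to the pre-generated VR avatar, as shown below. The avatar object and its set_emotion and set_nod methods are hypothetical placeholders standing in for the expression mapping unit's actual rendering calls.
```python
# Minimal sketch: combining the nod reaction and the detected emotion for the same
# time frame into one expression record and applying it to the pre-generated avatar.
# The avatar object and its set_emotion/set_nod methods are hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExpressionUpdate:
    timestamp: float
    emotion: str                 # e.g. "anger", from the emotion detection model
    nod: Optional[str] = None    # "approval", "disapproval" or None

def map_expression_to_avatar(avatar, update: ExpressionUpdate):
    """Apply one identified expression onto the attendee's pre-generated VR avatar."""
    avatar.set_emotion(update.emotion)       # avatar projects the same emotion
    if update.nod == "approval":
        avatar.set_nod(axis="vertical")      # avatar nods up and down
    elif update.nod == "disapproval":
        avatar.set_nod(axis="horizontal")    # avatar shakes side to side
```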
[0035] A similar procedure for mapping the one or more expressions is employed for other
attendees of the virtual conference. Further, since the dynamic mapping of the one or more
expressions onto the pre-generated VR avatars of the attendees takes place locally at each
attendee device, the transmission of the live visual feed over the network of each attendee
is prevented, thereby also optimizing network bandwidth. Thus, one technical advantage achieved by the present disclosure is that the virtual video conference can be conducted even if the network bandwidth is low or insufficient to support the live feed of an attendee’s video. A further technical advantage is that the actual expressions of the attendees are conveyed among them by generating the VR avatars and enabling them to mimic the attendees’ emotions without compromising the attendees’ privacy.
[0036] Figure 3 depicts a method 300 for facilitating a virtual video conference, in accordance
with an embodiment of the present disclosure. The method 300 may be described in the
general context of computer executable instructions. Generally, computer executable
instructions may include routines, programs, objects, components, data structures,
procedures, modules, and functions, which perform specific functions or implement
specific abstract data types.
[0037] The order in which the method 300 is described is not intended to be construed as a
limitation, and any number of the described method blocks may be combined in any order
to implement the method. Additionally, individual blocks may be deleted from the methods
without departing from the spirit and scope of the subject matter described.
[0038] At block 302, the method 300 may include detecting faces of attendees from a live video
feed of an ongoing virtual conference. The attendees attend the ongoing virtual
conference using attendee devices connected through their individual networks having a
network bandwidth associated therewith. To detect the faces of attendees, method 300A is
followed as depicted in Figure 3A through blocks 302-1 to 302-2.
[0039] At block 302-1, the method 300A may include sampling the live video feed into a plurality
of frames.
[0040] At block 302-2, the method 300A may include dynamically processing the plurality of
frames vis-à-vis a plurality of predefined facial features in order to detect the faces of each
of the attendees.
[0041] Now, while the ongoing virtual conference progresses, the method 300 proceeds to blocks
304 and 306.
[0042] At block 304, the method 300 may include identifying one or more expressions of the
attendees based on the live video feed. To identify the one or more expressions, method
300B is followed as depicted in Figure 3B through blocks 304-1 to 304-3.
[0043] At block 304-1, the method 300B may include identifying nod reactions being performed
by the attendees depicting a reaction of either approval or disapproval during the ongoing
virtual conference. Further, to identify the nod reactions, method 300B proceeds to blocks
304-1-1 to 304-1-3.
[0044] At block 304-1-1, the method 300B may include defining a horizontal axis and a vertical
axis based on the face of each attendee such that the horizontal and the vertical axis
intersect at a centroid of the face.
[0045] At block 304-1-2, the method 300B may include detecting a movement of the face along
the horizontal axis as a disapproval nod if a distance of the movement along the horizontal
axis is at least equal to a threshold horizontal distance.
[0046] At block 304-1-3, the method 300B may include detecting a movement of the face along
the vertical axis as an approval nod if a distance of the movement along the vertical axis is
at least equal to a threshold vertical distance.
[0047] At block 304-2, the method 300B may include identifying a plurality of emotions being
depicted by the attendees during the ongoing virtual conference based on a pre-trained
emotion detection model.
[0048] At block 304-3, the method 300B may include combining the nod reactions with the
plurality of emotions of the attendees.
[0049] Figure 4 depicts a method 400 for generating the pre-generated avatars in accordance with
an embodiment of the present disclosure.
[0050] At block 402, the method 400 may include receiving at least one of a plurality of pictures
and a reference video depicting attendees’ face from different angles.
[0051] At block 404, the method 400 may include detecting static faces of attendees from at least
one of the plurality of pictures and a plurality of frames of the video using the facial
detection technique.
[0052] At block 406, the method 400 may include identifying a plurality of facial identity features
corresponding to the static faces of each of the attendees detected.
[0053] At block 408, the method 400 may include creating a three-dimensional model of the static
faces of each of the attendees based on the plurality of facial identity features identified.
[0054] At block 410, the method 400 may include mapping the three-dimensional surface model
to a virtual reality environment in order to generate the VR avatar of each attendee.
Computer System
[0055] Figure 5 illustrates a block diagram of an exemplary computer system 500 for
implementing embodiments consistent with the present disclosure. In an embodiment, the
computer system 500 may be a peripheral device, which is used for authenticating a user.
The computer system 500 may include a central processing unit (“CPU” or “processor”)
502. The processor 502 may comprise at least one data processor for executing program
components for executing user or system-generated business processes. The processor 502
may include specialized processing units such as integrated system (bus) controllers,
memory management control units, floating point units, graphics processing units, digital
signal processing units, etc.
[0056] The processor 502 may be disposed in communication with one or more input/output (I/O)
devices via I/O interface 501. The I/O interface 501 may employ communication
protocols/methods such as, without limitation, audio, analog, digital, stereo, IEEE-1394,
serial bus, Universal Serial Bus (USB), infrared, PS/2, BNC, coaxial, component,
composite, Digital Visual Interface (DVI), high-definition multimedia interface (HDMI),
Radio Frequency (RF) antennas, S-Video, Video Graphics Array (VGA), IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., Code-Division Multiple Access (CDMA), High-Speed
Packet Access (HSPA+), Global System For Mobile Communications (GSM), Long-Term
Evolution (LTE) or the like), etc. Using the I/O interface, the computer system 500 may
communicate with one or more I/O devices.
[0057] In some embodiments, the processor 502 may be disposed in communication with a
communication network 514 via a network interface 503. The network interface 503 may
communicate with the communication network 514. The communication unit may employ
connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted
pair 10/100/1000 Base T), Transmission Control Protocol/Internet Protocol (TCP/IP),
token ring, IEEE 802.11a/b/g/n/x, etc.
[0058] The communication network 514 can be implemented as one of the several types of
networks, such as intranet or Local Area Network (LAN) and such within the organization.
The communication network 514 may either be a dedicated network or a shared network,
which represents an association of several types of networks that use a variety of protocols,
for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet
Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each
other. Further, the communication network 514 may include a variety of network devices,
including routers, bridges, servers, computing devices, storage devices, etc.
[0059] In some embodiments, the processor 502 may be disposed in communication with a
memory 505 (e.g., RAM 512, ROM 513, etc. as shown in FIG. 5) via a storage interface
504. The storage interface 504 may connect to memory 505 including, without limitation,
memory drives, removable disc drives, etc., employing connection protocols such as Serial
Advanced Technology Attachment (SATA), Integrated Drive Electronics (IDE), IEEE-1394, Universal Serial Bus (USB), fiber channel, Small Computer Systems Interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, Redundant Array of Independent Discs (RAID), solid-state
memory devices, solid-state drives, etc.
[0060] The memory 505 may store a collection of program or database components, including, without limitation, a user/application, an operating system, a web browser, a mail client, a mail server, a web server and the like. In some embodiments, the computer system 500 may store user/application data, such as the data, variables, records, etc. as described in this invention. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle® or Sybase®.
[0061] The operating system may facilitate resource management and operation of the computer system. Examples of operating systems include, without limitation, APPLE MACINTOSH® OS X, UNIX®, UNIX-like system distributions (e.g., BERKELEY SOFTWARE DISTRIBUTION™ (BSD), FREEBSD™, NETBSD™, OPENBSD™, etc.), LINUX DISTRIBUTIONS™ (e.g., RED HAT™, UBUNTU™, KUBUNTU™, etc.), IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8, 10 etc.), APPLE® IOS™, GOOGLE® ANDROID™, BLACKBERRY® OS, or the like. A user interface may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system, such as cursors, icons, check boxes, menus, windows, widgets, etc. Graphical User Interfaces (GUIs) may be employed, including, without limitation, APPLE MACINTOSH® operating systems, IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8, 10 etc.), Unix® X-Windows, web interface libraries (e.g., AJAX™, DHTML™, ADOBE® FLASH™, JAVASCRIPT™, JAVA™, etc.), or the like.
[0062] Furthermore, one or more computer-readable storage media may be utilized in
implementing embodiments consistent with the present invention. A computer-readable
storage medium refers to any type of physical memory on which information or data
readable by a processor may be stored. Thus, a computer-readable storage medium may
store instructions for execution by one or more processors, including instructions for
causing the processor(s) to perform steps or stages consistent with the embodiments
described herein. The term “computer-readable medium” should be understood to include
tangible items and exclude carrier waves and transient signals, i.e., non-transitory.
Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile
memory, non-volatile memory, hard drives, Compact Disc (CD) ROMs, Digital Video
Disc (DVDs), flash drives, disks, and any other known physical storage media.
[0063] A description of an embodiment with several components in communication with each
other does not imply that all such components are required. On the contrary, a variety of
optional components are described to illustrate the wide variety of possible embodiments
of the invention.
[0064] When a single device or article is described herein, it will be clear that more than one
device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be clear that a single device/article may be used in place of the more than one device or article, or a different number of devices/articles may be used instead of the
shown number of devices or programs. The functionality and/or the features of a device
may be alternatively embodied by one or more other devices which are not explicitly
described as having such functionality/features. Thus, other embodiments of the invention
need not include the device itself.
[0065] Finally, the language used in the specification has been principally selected for readability
and instructional purposes, and it may not have been selected to delineate or circumscribe
the inventive subject matter. It is therefore intended that the scope of the invention be
limited not by this detailed description, but rather by any claims that issue on an application
based hereon. Accordingly, the embodiments of the present invention are intended to be
illustrative, but not limiting, of the scope of the invention, which is set forth in the following
claims.
[0066] While various aspects and embodiments have been disclosed herein, other aspects and
embodiments will be apparent to those skilled in the art. The various aspects and
embodiments disclosed herein are for purposes of illustration and are not intended to be
limiting, with the true scope and spirit being indicated by the following claims.
Reference Numerals:
Reference Numeral Description
100 Exemplary environment of a system for facilitating a virtual video conference
102 System
104 Attendee
106 Plurality of pictures
108 VR avatar of attendee 104
200 Block diagram of the system 102
202 Input/output interface
204 Processor
206 Memory
208 Reference Video
210 Pre-generated VR avatars
212 Emotion Detection Model
214 Units
216 Facial Detection Unit
218 Expression Identification Unit
220 Expression Mapping Unit
222 Feature Identification Unit
224 Feature Mapping Unit
226 Avatar Generation Unit
We Claim:
1. A method for facilitating a virtual video conference, the method comprising:
detecting (302) faces of attendees from a live video feed of an ongoing virtual
conference, wherein the attendees are attending the ongoing virtual conference using
attendee devices connected through their individual networks having a network bandwidth
associated therewith;
while the ongoing virtual conference progresses:
identifying (304) one or more expressions of the attendees based on the live
video feed, and
dynamically mapping (306) the one or more expressions of the attendees
onto pre-generated virtual reality (VR) avatars of the attendees locally on the
attendee devices in such a manner that the pre-generated virtual reality (VR) avatars
mimic the attendees’ emotions, thereby preventing the transmission of actual video
of the attendees over their individual networks and yet conveying attendees’
emotions among themselves,
wherein dynamically mapping the one or more expressions of the attendees
onto the pre-generated virtual reality (VR) avatars comprises mapping nod
reactions being performed by the attendees along with a plurality of emotions being
depicted by the attendees in a same time frame onto the corresponding pre-generated VR avatars of the attendees.
2. The method as claimed in claim 1, wherein detecting (302) the faces of the attendees from the live video feed of the ongoing virtual conference using a facial detection technique comprises:
sampling (302-1) the live video feed into a plurality of frames, wherein the
sampling is performed by taking snapshots of the live video feed at a predefined number
of times per second; and
dynamically processing (302-2) the plurality of frames vis-à-vis a plurality of
predefined facial features in order to detect the faces of each of the attendees, wherein the
plurality of predefined facial features comprises at least one of line features, edge features
and rectangle features.
3. The method as claimed in claim 1, further comprising:
identifying a plurality of facial identity features corresponding to the faces of each
of the attendees detected, wherein the plurality of facial identity features are identified by
using a set of feature points depicting a location of the plurality of identity features on the
face of each attendee; and
dynamically mapping the plurality of facial identity features vis-à-vis the pre-generated virtual reality (VR) avatars of the attendees, thereby configuring the pre-
generated virtual reality (VR) avatars in accordance with a current facial condition of the
attendees.
4. The method as claimed in claim 1, wherein identifying (304) the one or more expressions
of the attendees based on the live video feed comprises:
identifying (304-1) the nod reactions being performed by the attendees depicting a
reaction of either approval or disapproval during the ongoing virtual conference, wherein
the identifying the nod reactions comprises:
defining (304-1-1) a horizontal axis and a vertical axis based on the face
of each attendee such that the horizontal and the vertical axis intersect at a centroid
of the face;
detecting (304-1-2) a movement of the face along the horizontal axis as a
disapproval nod if a distance of the movement along the horizontal axis is at least
equal to a threshold horizontal distance; and
detecting (304-1-3) a movement of the face along the vertical axis as an
approval nod if a distance of the movement along the vertical axis is at least equal
to a threshold vertical distance; and
identifying (304-2) the plurality of emotions being depicted by the attendees during
the ongoing virtual conference based on a pre-trained emotion detection model, wherein
the pre-trained emotion detection model is trained based on an image dataset comprising a
plurality of pictures classified based on an emotion depicted by each of the plurality of
pictures; and
combining (304-3) the nod reactions with the plurality of emotions of the attendees.
5. The method as claimed in claim 1, wherein the pre-generated VR avatars are generated by:
receiving (402) at least one of a plurality of pictures (106) and a reference video
(208) depicting attendees’ face from different angles;
detecting (404) static faces of attendees from at least one of the plurality of pictures
(106) and a plurality of frames of the video using the facial detection technique;
identifying (406) a plurality of facial identity features corresponding to the static
faces of each of the attendees detected;
creating (408) a three-dimensional model of the static faces of each of the attendees
based on the plurality of facial identity features identified; and
mapping (410) the three-dimensional surface model to a virtual reality environment
in order to generate the VR avatar of each attendee.
6. A system for facilitating a virtual video conference, the system comprising:
a facial detection unit (216) configured to detect faces of attendees from a live video
feed of an ongoing virtual conference, wherein the attendees are attending the ongoing
virtual conference using attendee devices connected through their individual network
having a network bandwidth associated therewith;
while the ongoing virtual conference progresses:
an expression identification unit (218) configured to identify one or more
expressions of the attendees based on the live video feed, and
an expression mapping unit (220) configured to dynamically map the one
or more expressions of the attendees onto pre-generated virtual reality (VR) avatars
of the attendees locally on the attendee devices in such a manner that the pre-generated virtual reality (VR) avatars mimic the attendees’ emotions, thereby
preventing the transmission of actual video of the attendees over their individual
networks and yet conveying attendees’ emotions among themselves,
wherein to dynamically map the one or more expressions of the attendees
onto the pre-generated virtual reality (VR) avatars, the expression mapping unit
maps nod reactions being performed by the attendees along with a plurality of
emotions being depicted by the attendees in a same time frame onto the
corresponding pre-generated VR avatars of the attendees.
7. The system as claimed in claim 6, wherein to detect the faces of the attendees from the live
video feed of the ongoing virtual conference using facial detection technique, the facial
detection unit (216) is further configured to:
sample the live video feed into a plurality of frames, wherein to sample, the facial
detection unit (216) is further configured to take snapshots of the live video feed at a
predefined number of times per second; and
dynamically process the plurality of frames vis-à-vis a plurality of predefined facial
features in order to detect the faces of each of the attendees.
8. The system as claimed in claim 6, further comprising:
a feature identification (222) unit configured to identify a plurality of facial identity
features corresponding to the faces of each of the attendees detected, wherein the plurality
of facial identity features are identified by using a set of feature points depicting a location
of the plurality of identity features on the face of each attendee; and
a feature mapping unit (224) configured to dynamically map the plurality of facial
identity features vis-à-vis the pre-generated virtual reality (VR) avatars of the attendees,
thereby configuring the pre-generated virtual reality (VR) avatars in accordance with a
current facial condition of the attendees.
9. The system as claimed in claim 6, wherein to identify the one or more expressions of the
attendees based on the live video feed, the expression identification unit (218) is further
configured to:
identify the nod reactions being performed by the attendees depicting a reaction of
either approval or disapproval during the ongoing virtual conference, wherein to identify
the nod reactions, the expression identification unit (218) is configured to:
define a horizontal axis and a vertical axis based on the face of each attendee
such that the horizontal and the vertical axis intersect at a centroid of the face;
detect a movement of the face along the horizontal axis as a disapproval nod
if a distance of the movement along the horizontal axis is at least equal to a
threshold horizontal distance; and
detect a movement of the face along the vertical axis as an approval nod if
a distance of the movement along the vertical axis is at least equal to a threshold
vertical distance; and
identify the plurality of emotions being depicted by the attendees during the
ongoing virtual conference based on a pre-trained emotion detection model, wherein the
pre-trained emotion detection model is trained based on an image dataset comprising a
plurality of pictures classified based on an emotion depicted by each of the plurality of
pictures; and
combine the nod reactions with the plurality of emotions of the attendees.
10. The system as claimed in claim 6, further comprising an avatar generation unit (226)
configured to:
receive at least one of a plurality of pictures (106) and a reference video (208)
depicting attendees’ face from different angles;
detect static faces of attendees from at least one of the plurality of pictures (106)
and a plurality of frames of the video using the facial detection technique;
identify a plurality of facial identity features corresponding to the static faces of
each of the attendees detected;
create a three-dimensional model of the static faces of each of the attendees based
on the plurality of facial identity features identified; and
map the three-dimensional surface model to a virtual reality environment in order
to generate the VR avatar of each attendee.