
Method And System To Generate Animated Audio Visual Content

Abstract: Present disclosure relates to techniques for generating animated visual content. Said techniques discuss receiving an audio message and a visual content from a user device and processing the received audio message to generate time-aligned functions, the time-aligned functions include lip shapes, emotions, and body-part movements. It further discusses superimposing the generated time-aligned functions on the visual content in correlation with the audio message and generating an animation of the superimposed audio-visual content to mimic the audio message.


Patent Information

Application #: 202011029365
Filing Date: 10 July 2020
Publication Number: 02/2022
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: IPO@KNSPARTNERS.COM
Parent Application:

Applicants

HIKE PRIVATE LIMITED
4th Floor, Indira Gandhi International Airport, Worldmark 1 Northern Access Rd, Aerocity Delhi New Delhi India 110037

Inventors

1. Dipankar Sarkar
Hike Pvt. Ltd., 4th Floor, Indira Gandhi International Airport, Worldmark 1 Northern Access Rd, Aerocity Delhi New Delhi India 110037
2. Ankur Narang
Hike Pvt. Ltd., 4th Floor, Indira Gandhi International Airport, Worldmark 1 Northern Access Rd, Aerocity Delhi New Delhi India 110037
3. Kavin Bharti Mittal
Hike Pvt. Ltd., 4th Floor, Indira Gandhi International Airport, Worldmark 1 Northern Access Rd, Aerocity Delhi New Delhi India 110037

Specification

[0001] The present disclosure generally relates to communication. More specifically, the
present disclosure relates to generating audio/visual content.
BACKGROUND OF THE INVENTION:
[0002] The popularity and utility of mobile computing devices as well as the prevalent use
of messaging services/ social networking applications have resulted in a corresponding increase in
communication between the users of these devices. For example, users commonly use their devices
to send electronic messages to other users as text messages, chat messages, audio messages, email,
and multimedia by using messaging services/social networking applications.
[0003] These services/applications enable a user to record audio and send the recorded
audio to another user as a voice message. However, when a recipient plays the voice message on a
messaging platform/application, the recipient is unaware of the sender's mood or facial gestures.
Currently, voice messages are one-dimensional, and the recipient has to infer the mood/temperament
from the tone/pitch of the voice. The user generally communicates his/her mood or facial expression
to another user by using stickers/emoticons/emojis/avatars.
[0004] Hence, it would be advantageous, and there exists a need in the art, for messages
to be delivered in multiple dimensions, such as audio-visual messaging.
OBJECTS OF THE INVENTION:
[0005] An object of the present disclosure is to generate visual content in sync with an audio message.
[0006] Another object of the present invention is to create a combined audio-visual
content/message.
[0007] Another object of the present disclosure is to accurately capture and communicate
sender’s emotion using facial expressions, gestures and body movement to the recipient.
SUMMARY OF THE INVENTION:
[0008] The present disclosure overcomes one or more shortcomings of the prior art and
provides additional advantages discussed throughout the present disclosure. Additional features
and advantages are realized through the techniques of the present disclosure. Other embodiments
and aspects of the disclosure are described in detail herein and are considered a part of the claimed
disclosure.
[0009] In one non-limiting embodiment of the present disclosure, a method for generating
animated visual content is disclosed. The method comprises receiving an audio message and a
visual content from a user device, processing the received audio message to generate time-aligned
functions, wherein the time-aligned functions include lip shapes, emotions, and body-part
movements, superimposing the generated time-aligned functions on the visual content in
correlation with the audio message, and generating an animation of the superimposed visual
content to mimic the audio message.
[0010] In one non-limiting embodiment of the present disclosure, the processing of the
received audio message comprises extracting data from the received audio message, wherein the
data comprises keywords and phrases and comparing the extracted data with data stored in a first
database, the first database comprises body-part movements corresponding to the stored data,
based on the comparison, retrieving the time-aligned body-part movements for the received audio
message, generating a visual representation of the received audio message, comparing the
generated visual representation with a second database, the second database comprises a plurality
of visual representations or parts of visual representations tagged with corresponding time-aligned
functions including lip shapes and emotions, and based on the comparison, retrieving time-aligned
functions including lip shapes and emotions for the received audio message.
[0011] In still another non-limiting embodiment of the present disclosure, the superimposing of the
generated time-aligned functions on the visual content comprises generating an animation data file
for the audio message, wherein the animation data file comprises the time-aligned functions,
processing the animation data file line by line to generate a plurality of frames, each frame
comprises time-aligned functions associated with a corresponding line of the animation data file,
and generating the superimposed visual content by combining the plurality of frames.
[0012] In yet another non-limiting embodiment of the present disclosure, the method
further comprises generating an isometric 2.5D animation for the generated animation using an
isometric technique. The body part movements comprise at least head and hand movements, and
the method further comprises storing the generated animation in memory and sending the
generated animation to the user device.
[0013] In yet another non-limiting embodiment of the present disclosure, a system for
generating animated visual content is disclosed. The system comprises a receiving unit, a memory
unit, a processing unit, and a transmitting unit communicatively coupled with each other. The
receiving unit is configured to receive an audio message and a visual content from a user device, and
the memory unit is configured to store the received audio message and the visual content. The
processing unit is configured to process the received audio message to generate time-aligned
functions, the time-aligned functions including lip shapes, emotions, and body-part movements,
superimpose the generated time-aligned functions on the visual content in correlation with the
audio message, and generate an animation of the superimposed visual content to mimic the audio
message.
[0014] In yet another non-limiting embodiment of the present disclosure, the processing
unit is further configured to extract data from the received audio message, wherein the data
comprises keywords and phrases, compare the extracted data with data stored in a first database,
wherein the first database comprises body-part movements corresponding to the stored data, based
on the comparison, retrieve the time-aligned functions including body-part movements for the
received audio message, generate a visual representation of the received audio message, compare
the generated visual representation with a second database, wherein the second database comprises
a plurality of visual representations or parts of visual representations tagged with corresponding
functions including lip shapes and emotions, and based on the comparison, retrieve time-aligned
functions including lip shapes and emotions for the received audio message.
[0015] In yet another non-limiting embodiment of the present disclosure, the processing
unit is further configured to generate an animation data file for the audio message, the animation
data file comprises the time-aligned functions, process the animation data file line by line to
generate a plurality of frames, each frame comprises time-aligned functions associated with a
corresponding line of the animation data file, and generate the superimposed visual content by
combining the plurality of frames.
[0016] In yet another non-limiting embodiment of the present disclosure, the processing
unit is further configured to generate an isometric 2.5D animation for the generated animation
using an isometric technique. The body part movements comprise at least head and hand
movements, the memory unit is further configured to store the generated animation, and the
transmitting unit is configured to send the generated animation to the user device.
[0018] The foregoing summary is illustrative only and is not intended to be in any way
limiting. In addition to the illustrative aspects, embodiments, and features described above, further
aspects, embodiments, and features will become apparent by reference to the drawings and the
following detailed description.
BRIEF DESCRIPTION OF DRAWINGS:
[0019] The features, nature, and advantages of the present disclosure will become more
apparent from the detailed description set forth below when taken in conjunction with the drawings
in which like reference characters identify correspondingly throughout. Some embodiments of
system and/or methods in accordance with embodiments of the present subject matter are now
described, by way of example only, and with reference to the accompanying figures, in which:
[0020] Fig. 1(a)–(b) shows an exemplary environment in a communication network, in
accordance with an embodiment of the present disclosure;
[0021] Fig. 2(a)-(b) illustrates a block diagram of a system in accordance with an
embodiment of the present disclosure;
[0022] Fig. 3 illustrates a flowchart of an exemplary method, in accordance with an
embodiment of the present disclosure;
[0023] Fig. 4 illustrates a flowchart of an exemplary method, in accordance with an
embodiment of the present disclosure;
[0024] It should be appreciated by those skilled in the art that any block diagrams herein
represent conceptual views of illustrative systems embodying the principles of the present subject
matter. Similarly, it will be appreciated that any flow charts, flow diagrams and the like represent
various processes which may be substantially represented in computer readable medium and
executed by a computer or processor, whether or not such computer or processor is explicitly
shown.
DETAILED DESCRIPTION OF DRAWINGS:
[0025] In the present document, the word “exemplary” is used herein to mean “serving as
an example, instance, or illustration.” Any embodiment or implementation of the present subject
matter described herein as “exemplary” is not necessarily to be construed as preferred or
advantageous over other embodiments.
[0026] While the disclosure is susceptible to various modifications and alternative forms,
specific embodiments thereof have been shown by way of example in the drawings and will be
described in detail below. It should be understood, however, that it is not intended to limit the
disclosure to the particular forms disclosed; on the contrary, the disclosure is to cover all
modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0027] The terms “comprises”, “comprising”, “include(s)”, or any other variations thereof,
are intended to cover a non-exclusive inclusion, such that a setup, system or method that comprises
a list of components or steps does not include only those components or steps but may include
other components or steps not expressly listed or inherent to such setup or system or method. In
other words, one or more elements in a system or apparatus preceded by “comprises… a” does
not, without more constraints, preclude the existence of other elements or additional elements in
the system or apparatus.
[0028] In the following detailed description of the embodiments of the disclosure,
reference is made to the accompanying drawings that form a part hereof, and in which are shown
by way of illustration specific embodiments in which the disclosure may be practiced. These
embodiments are described in sufficient detail to enable those skilled in the art to practice the
disclosure, and it is to be understood that other embodiments may be utilized and that changes may
be made without departing from the scope of the present disclosure. The following description is,
therefore, not to be taken in a limiting sense.
[0029] Fig. 1(a) shows an exemplary environment 100a illustrating a scenario of
generating video or animation based on a recorded audio message, in accordance with an
embodiment of the present disclosure.
[0030] In one embodiment of the present disclosure, the environment 100a may comprise
a server 101, a first user device 103 operated by a first user 110, and a second user device 105
operated by a second user 120. The first user device 103 and the second user device 105 may be in
communication with each other. The first user 110 may generate a visual content (for example, not
limited to, an avatar, emoji, sticker, etc.) using the first user device 103. The first user device 103 may
create the visual content by capturing one or more images of the first user 110 or using any existing
visual content and may transmit the same to the server 101. The image may be customized as per
choice or preference of the first user 110. The image may or may not be the representative of the
first user 110.
[0031] In one non-limiting embodiment of the present disclosure, the first user device 103
may automatically generate the visual content based on the previously stored data of the first user
110. In another non-limiting embodiment, the server 101 may receive one or more images of the
first user 110 and generate the visual content of the first user 110 using the procedure discussed
above.
[0032] The server 101 may receive video data of the first user 110 from the first user device
103. The server 101 may extract lip shapes, emotions, and body-part movements with respect to
the audio present in video data and train a neural network (not shown) for lip shapes, emotions and
body-part movement for the audio content in the video data. The body movements may comprise
head, hand, eye, shoulder, and leg movements, etc. The server
101 may store the lip shapes, emotions and body-part movements in the database.
[0033] In an embodiment of the present disclosure, the server 101 may then receive an
audio message from the first user 110 of the first user device 103. The server 101 may process the
received audio message to generate time-aligned functions comprising lip shapes, emotions, and
body-part movements. The server 101 then superimposes the time-aligned functions on the visual
content in correlation with the audio message and generate an animation or a video of the
superimposed visual content to mimic the audio message. The animation may comprise the audio-visual content.
[0034] In an embodiment of the present disclosure, the server 101 may transmit or send
the generated animation or video to the first user device 103. In one non-limiting embodiment of
the present disclosure, the server 101 may transmit or send the generated animation or video
comprising the audio-visual content to the second user device 105. Thus, the server 101 enables the
first user 110 to accurately communicate the first user's 110 emotion using facial expressions and
body movements to the second user 120.
[0035] Fig. 1(b) shows an exemplary environment 100b illustrating a scenario of
generating video or animation based on a recorded audio message, in accordance with another
embodiment of the present disclosure.
[0036] In one embodiment of the present disclosure, the environment 100b may comprise
a first user 110, a first user device 101 operated by the first user 110, a second user device 103
operated by a second user 120, and a server 105. The first user device 101 and the second user device
103 may be in communication with each other via the server 105. The first user device 101 may
receive an audio message, video data and one or more images from the first user 110.
[0037] The first user device 101 may be configured to generate visual content based on the
received one or more images. The first user device 101 may also allow the first user 110 to
customize the generated visual content. The visual content may or may not be the representative
of the first user 110.
[0038] In one non-limiting embodiment of the present disclosure, the first user device 101
may automatically generate the visual content based on the previously stored data of the first user
110. The first user device 101 may train a neural network (not shown) based on the procedure
discussed above.
[0039] The first user device 101 may process the received audio message to generate time-aligned
functions comprising lip shapes, emotions, and body-part movements. The body movements may
comprise hand, eye, shoulder, head, and leg movements of the first user. The first user device 101 then superimposes the time-aligned
functions on the avatar or self-sticker in correlation with the audio message, generate an animation
or a video of the superimposed avatar or self-sticker to mimic the audio message, and transmit the
generated video animation to the second user device 103 through the server 105. The animation or
the video may comprise the audio-visual content. Thus, the first user device 101 enables the first
user 110 to accurately communicate the first user's 110 emotion using facial expressions and body
movements to the second user 120.
[0040] Fig. 2(a) illustrates a block diagram of a system 200 for generating animated visual
content and fig. 2(b) illustrates a block diagram illustrating a processing unit 201, in accordance
with another embodiment of the present disclosure.
[0041] In an embodiment of the present disclosure, the system 200 may include one or
more elements, such as, but not limited to, the Internet, a local area network, a wide area network, a
peer-to-peer network, and/or other similar technologies for connecting various entities as discussed
below. In an aspect, various elements/entities such as a system 210, a first user device 220, and a
second user device 230 of the system 200 as shown in fig. 2(a) may communicate within the
system 200 through a web presence (not shown). In fig. 2(a), only two user devices 220 and 230
are shown for the sake of ease; this should not be construed as limiting the scope, and multiple
user devices may be connected to the system 210.
[0042] The first user device 220 and the second user device 230 may be operated by a first
user and a second user respectively for communication. In one non-limiting embodiment, the first
user devices 220 and the second user device 230 may be operated to interact or communicate in a
virtual environment/platform.
[0043] The first user device 220 and the second user device 230 may represent desktop
computers, laptop computers, mobile devices (e.g., Smart phones or personal digital assistants),
tablet devices, or other types of computing devices, which have computing, messaging and
networking capabilities. The first user device 220 and the second user device 230 may be equipped
with one or more computer storage devices (e.g., RAM, ROM, PROM, SRAM, etc.),
communication unit and one or more processing devices (e.g., central processing units) that are
capable of executing computer program instructions.
[0044] In an exemplary embodiment of the present disclosure, the system 210 may include
various elements such as a processing unit 201, a memory unit 203, a receiving unit 205, and a
transmitting unit 207. The processing unit 201, the memory unit 203, the receiving unit 205 and
the transmitting unit 207 may be communicatively coupled with each other over a wired or wireless
link. As shown in fig. 2(b), the processing unit 201 comprises a neural network 209, a memory
211, and one or more processors 213 coupled to each other. The system 210 may remain operatively
connected to the first user device 220. In one non-limiting embodiment, the system 210 may be
connected to the first user device 220 and the second user device 230 to receive and process the
communication or interactions received from the first user device 220 and the second user device
230.
[0045] The first user device 220 may be configured to capture one or more images of the
first user and generate a visual content (an avatar or a self-sticker) based on the captured one or
more images. The first user may customize or edit the generated visual content based on choice or
preference of the first user or automatically. The first user device may then transmit the visual
content to the system 210.
[0046] The receiving unit 205 comprising a receiver may be configured to receive an audio
message and the visual content from a user device or the first user device 220. The memory unit
203 of the system 210 may be configured to store the received audio message and the visual
content. The processing unit 201 of the system 210 may be configured to process the received
audio message to generate time-aligned functions. The time-aligned functions may include lip
shapes, emotions, and body-part movements. The processing unit 201 of the system 210 may be
configured to superimpose the generated time-aligned functions on the visual content in correlation
with the audio message and generate an animation of the superimposed visual content to mimic
the audio message. The generated animation may be stored in the memory unit 203.
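By way of non-limiting illustration, the following Python sketch outlines how the receive, process, superimpose, and animate stages described above might be wired together; the names and data layouts (for example, TimeAlignedFunctions, process_audio, superimpose, render) are assumptions made only for this example, and the placeholder bodies are elaborated in the later sketches.

```python
# Illustrative pipeline sketch only; names and data layout are assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TimeAlignedFunctions:
    # Each entry pairs a timestamp (in seconds) with a label/identifier.
    lip_shapes: List[Tuple[float, str]] = field(default_factory=list)
    emotions: List[Tuple[float, str]] = field(default_factory=list)
    body_movements: List[Tuple[float, str]] = field(default_factory=list)


def process_audio(audio_message: bytes) -> TimeAlignedFunctions:
    # Processing unit 201: database lookups are sketched in later examples.
    return TimeAlignedFunctions()


def superimpose(funcs: TimeAlignedFunctions, visual_content: bytes) -> list:
    # Apply the time-aligned functions to the visual content, frame by frame.
    return []  # placeholder list of rendered frames


def render(frames: list, audio_message: bytes) -> bytes:
    # Combine the frames (and the audio) into the final animation.
    return b""  # placeholder encoded animation


def generate_animation(audio_message: bytes, visual_content: bytes) -> bytes:
    """Mirrors the flow of system 210: receive -> process -> superimpose -> animate."""
    funcs = process_audio(audio_message)
    frames = superimpose(funcs, visual_content)
    return render(frames, audio_message)
```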
[0047] In an embodiment of the present disclosure, to process the received audio message
to generate time-aligned functions the processing unit 201 may be configured to extract data from
the received audio message. The extracted data may comprise keywords and phrases. The extracted
data may be then compared with stored keywords and phrases in a first database of the memory
211. The first database comprises body-part movements corresponding to the stored keywords and
phrases. The body movements may comprise hand, shoulder, eye, head, and leg movements of the
first user. The processing unit 201 may then be
configured to retrieve the time-aligned functions including the body-part movements for the
received audio message based on the comparison.
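By way of non-limiting illustration, a minimal Python sketch of the first-database lookup described above is shown below. It assumes the audio message has already been transcribed into words (the speech-to-text step is not shown), uses word positions in place of timestamps, and the database contents are invented for illustration only.

```python
# Assumed, illustrative contents of the first database:
# keyword/phrase -> body-part movement identifier.
FIRST_DATABASE = {
    "hello": "wave_hand",
    "yes": "nod_head",
    "no": "shake_head",
    "look": "turn_head",
}


def retrieve_body_movements(transcript_words):
    """Compare the extracted keywords with the stored keywords and return the
    matching body-part movements, keyed by word position in the transcript."""
    movements = []
    for position, word in enumerate(transcript_words):
        movement = FIRST_DATABASE.get(word.lower())
        if movement is not None:
            movements.append((position, movement))
    return movements


print(retrieve_body_movements(["Hello", "do", "you", "want", "to", "look"]))
# -> [(0, 'wave_hand'), (5, 'turn_head')]
```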
[0048] The processing unit 201 may be configured to generate a visual representation of
the received audio message. The visual representation may represent a spectrogram of speech
signal present in the received audio message. The generated spectrogram is a visual representation
of the signal strength of the audio signal over time at the different frequencies present in the waveform.
It is represented by a two-dimensional graph in which time is shown along the horizontal axis,
frequency along the vertical axis, and the amplitude of the frequency components at a particular
time is indicated by the intensity or color of that point in the graph.
[0049] The processing unit 201 may be configured to compute or determine the
spectrogram from the speech signal by applying a Fast Fourier transform (FFT) to the speech signal,
which forms a time-frequency representation. In one non-limiting embodiment of the present
disclosure, the visual representation or the spectrogram may be computed using any other
procedure known to a person skilled in the art.
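By way of non-limiting illustration, the following sketch computes such an FFT-based spectrogram with SciPy; the sampling rate, frame length, and the synthetic sine wave are assumed values standing in for a real speech signal.

```python
# Minimal spectrogram sketch; parameters and the test signal are assumptions.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                   # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
speech = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for a real speech signal

# Short-time FFT: frequency along one axis, time along the other, and the
# magnitude of each frequency component as the intensity of that point.
freqs, times, sxx = spectrogram(speech, fs=fs, nperseg=512, noverlap=256)
print(sxx.shape)                             # (frequency bins, time frames)
```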
[0050] The processing unit 201 may be configured to compare the generated visual
representation with a second database of the memory 211. The second database comprises a
plurality of visual representations or parts of visual representations tagged with corresponding
time-aligned functions including lip shapes and emotions. The processing unit 201 may be then
configured to retrieve time-aligned functions including lip shapes and emotions/gestures for the
received audio message based on the comparison of the generated visual representation with the
plurality of visual representations or parts of visual representations.
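By way of non-limiting illustration, the comparison against the second database might be sketched as a nearest-neighbour search, as below; modelling the database as (template, tagged_functions) pairs whose templates share the query's shape is an assumption made only for this example.

```python
# Illustrative nearest-neighbour comparison; the database layout is assumed.
import numpy as np


def best_match(query_spectrogram, second_database):
    """Return the time-aligned functions tagged on the closest stored
    visual representation (templates are assumed to share the query's shape)."""
    best_functions, best_score = None, -np.inf
    q = query_spectrogram.ravel()
    q = q / (np.linalg.norm(q) + 1e-9)
    for template, tagged_functions in second_database:
        v = template.ravel()
        v = v / (np.linalg.norm(v) + 1e-9)
        score = float(np.dot(q, v))          # cosine similarity
        if score > best_score:
            best_functions, best_score = tagged_functions, score
    return best_functions
```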
[0051] The processing unit 201 may be configured to generate an animation data file for
the audio message, wherein the animation data file comprises the time-aligned functions including
lip shapes, emotions/gestures, and body-part movements. The processing unit 201 may then be
configured to process the animation data file line by line to generate a plurality of frames, each
frame comprising the time-aligned functions associated with a corresponding line of the animation
data file. The animation data file is processed line by line for synchronizing the audio message with
the visual content. The processing unit 201 may then be configured to generate the superimposed
visual content by combining the plurality of frames. In one non-limiting embodiment of the present
disclosure, the processing unit may be further configured to generate an isometric 2.5D animation
for the generated animation using an isometric technique known to a person skilled in the art.
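By way of non-limiting illustration, the line-by-line processing of the animation data file might look as follows; the record format used here (one "<time> <lip shape> <emotion> <movement>" line per record) is an assumed format, since the disclosure does not fix one.

```python
# Illustrative line-by-line parsing of an assumed animation data file format.
from typing import List, NamedTuple


class Frame(NamedTuple):
    time: float        # seconds into the audio message
    lip_shape: str
    emotion: str
    movement: str


def frames_from_animation_file(lines: List[str]) -> List[Frame]:
    frames = []
    for line in lines:
        if not line.strip():
            continue
        time_s, lip_shape, emotion, movement = line.split()
        # Each frame carries the time-aligned functions of its own line,
        # which keeps the rendered frames synchronised with the audio.
        frames.append(Frame(float(time_s), lip_shape, emotion, movement))
    return frames


example = ["0.00 closed neutral idle", "0.25 open happy wave_hand"]
print(frames_from_animation_file(example))
```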
[0052] The transmitting unit 207 comprising a transmitter may be configured to transmit
the generated animation to the first user device 220. The animation may comprise the audio-visual
content. In one non-limiting embodiment, the transmitting unit 207 may be configured to transmit
the generated animation to the second user device 230. Thus, the system 210 facilitates the sender,
i.e., the first user, in accurately communicating the sender's mood, facial expressions, and gestures
by generating an animated visual content representing the audio message. In one non-limiting embodiment, the
system 210 may be a part of the first user device 220.
[0053] In an embodiment of the present disclosure, the one or more processors 213 may
train the neural network. The processing unit 201 may be configured to receive, via the receiving
unit 205, video data from a first user device. The video data may comprise a video of the user. The
one or more processors 213 of the processing unit 201 may be configured to extract the keywords
and phrases from the video data and process the video data to determine body-part movements
corresponding to the extracted keywords and phrases using an image processing technique known
to a person skilled in the art. The body-part movements may comprise various gestures made by
the user while speaking. The body movements may comprise hand, eye, shoulder, head, and leg
movements of the first user.
[0054] The one or more processors 213 may be configured to train the neural network 209
with body-part movements corresponding to the extracted keywords and phrases as input. The
neural network 209 is configured to generate time-aligned function of the body-part movements
as an output based on the training. The memory 211 of the processing unit 201 may be configured
to store the time-aligned function including the body-part movements against the respective
keywords and phrases in the first database.
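By way of non-limiting illustration, a training step of the kind described above might be sketched with PyTorch as below; the vocabulary size, number of movement classes, network shape, and the dummy batch are assumptions made only for this example.

```python
# Minimal training sketch, assuming PyTorch is available; all sizes are placeholders.
import torch
import torch.nn as nn

vocab_size, num_movements = 1000, 12          # assumed sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, 32),             # keyword/phrase index -> embedding
    nn.Flatten(start_dim=1),
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, num_movements),             # scores over body-part movements
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch: keyword indices extracted from the video data and the body-part
# movement observed for each keyword (both would come from the processing above).
keywords = torch.randint(0, vocab_size, (8, 1))
movements = torch.randint(0, num_movements, (8,))

for _ in range(10):                           # a few illustrative steps
    optimizer.zero_grad()
    logits = model(keywords)
    loss = loss_fn(logits, movements)
    loss.backward()
    optimizer.step()
```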
[0055] In one embodiment of the present disclosure, the one or more processors 213 may
be configured to extract the audio signal or speech signal from the video data and generate a visual
representation or spectrogram for the extracted audio signal or speech signal using the procedure
discussed above. The one or more processors 213 may be then configured to extract the visual
features of the video data corresponding to the spectrogram. The visual features may comprise lip
shapes and emotions such as anger, disgust, fear, happiness, sadness, etc. The visual features may
be extracted using, for example, 2-D discrete cosine transform (DCT) and cross-DCT techniques.
The visual feature extraction is not limited to the above-mentioned techniques, and the visual
features may be extracted using any technique known to a person skilled in the art.
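By way of non-limiting illustration, a 2-D DCT feature extraction of the kind mentioned above might be sketched as below; the lip-region size, the random stand-in image, and the choice of keeping the top-left 8x8 coefficients are assumptions made only for this example.

```python
# Illustrative 2-D DCT feature extraction for a lip-region crop.
import numpy as np
from scipy.fftpack import dct

lip_region = np.random.rand(32, 32)           # stand-in grayscale lip-region crop

# 2-D DCT = 1-D DCT applied along rows, then along columns.
coeffs = dct(dct(lip_region, axis=0, norm="ortho"), axis=1, norm="ortho")

# Low-frequency coefficients summarise the lip shape compactly.
features = coeffs[:8, :8].ravel()
print(features.shape)                         # (64,)
```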
[0056] The one or more processors 213 may be configured to train the neural network 209
with lip shapes and emotions corresponding to the spectrogram for the extracted audio signal or
speech signal as input. The neural network 209 is configured to generate time-aligned functions of
the lip shapes and emotions as an output based on the training. The memory 211 of the processing
unit 201 may be configured to store the time-aligned functions including the lip shapes and
emotions against the spectrogram for the extracted audio signal or speech signal in the second
database. The stored time-aligned functions of the lip shapes, emotions, and body movements may
be used for generating an animation to mimic the audio message sent by a user.
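By way of non-limiting illustration, the corresponding training of the neural network 209 on spectrogram frames might be sketched as below, again assuming PyTorch; the frame dimension, label counts, and dummy data are placeholders.

```python
# Illustrative training sketch mapping spectrogram frames to lip shapes and emotions.
import torch
import torch.nn as nn

n_freq_bins, n_lip_shapes, n_emotions = 257, 20, 6   # assumed sizes


class SpectrogramTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_freq_bins, 128), nn.ReLU())
        self.lip_head = nn.Linear(128, n_lip_shapes)
        self.emotion_head = nn.Linear(128, n_emotions)

    def forward(self, frames):
        h = self.backbone(frames)
        return self.lip_head(h), self.emotion_head(h)


model = SpectrogramTagger()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch of spectrogram frames with the lip shape and emotion observed
# in the corresponding video frames.
frames = torch.rand(16, n_freq_bins)
lip_labels = torch.randint(0, n_lip_shapes, (16,))
emotion_labels = torch.randint(0, n_emotions, (16,))

for _ in range(10):
    optimizer.zero_grad()
    lip_logits, emotion_logits = model(frames)
    loss = loss_fn(lip_logits, lip_labels) + loss_fn(emotion_logits, emotion_labels)
    loss.backward()
    optimizer.step()
```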
[0057] Fig. 3 illustrates a flowchart of an exemplary method 300 for generating animated
visual content, in accordance with another embodiment of the present disclosure.
[0058] At block 301, an audio message and a visual content (for example not limited to
avatar, emoji, sticker, etc.) may be received from a user device. The user device may be the first
user device 220 as discussed above. The visual content may comprise an avatar or a self-sticker.
The visual content may be generated by the user device based on one or more captured images of
the user. The visual content generated by the user device may be customized based on choice or
preference of the user.
[0059] At block 303, the received audio message may be processed to generate time-aligned
functions. The time-aligned functions may include lip shapes, emotions, and body-part
movements. The processing of the received audio message may comprise extracting data from the
received audio message. The extracted data may comprise keywords and phrases. The extracted
data may be then compared with stored keywords and phrases in a first database of the memory
211. The first database comprises body-part movements corresponding to the stored keywords and
phrases. Based on the comparison, the time-aligned functions including the body-part movements
may be retrieved for the received audio message.
[0060] In an embodiment of the present disclosure, a visual representation or a spectrogram
is generated based on the received audio message using the procedure as discussed above. The
generated visual representation may be then compared with a second database of the memory. The
second database comprises a plurality of visual representations or parts of visual representations
tagged with corresponding time-aligned functions including lip shapes and emotions. Based on the
comparison with the second database, time-aligned functions including lip shapes and emotions
may be retrieved for the received audio message.
[0061] At block 305, the generated time-aligned functions including the lip shapes,
emotions, and body-part movements may be superimposed on the visual content in correlation
with the audio message. The superimposing of the generated time-aligned functions on the visual
content may comprise generating an animation data file for the audio message. The animation data
file comprises the time-aligned functions including the lip shapes, emotions, and body-part
movements.
[0062] Then, the animation data file may be processed line by line to generate a plurality of
frames. Each frame comprises the time-aligned functions associated with a corresponding line
of the animation data file. The animation data file is processed line by line for synchronizing the
audio message with the visual content. The superimposed visual content may be generated by
combining the plurality of frames.
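By way of non-limiting illustration, combining the plurality of frames into a video clip might be sketched with OpenCV as below; the frame size and rate are assumed, and muxing the original audio message back in (to obtain the audio-visual animation) would require a separate tool and is not shown.

```python
# Illustrative frame-combining sketch, assuming OpenCV (cv2) is available.
import cv2
import numpy as np

width, height, fps = 256, 256, 25
writer = cv2.VideoWriter(
    "animation.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height)
)

# Stand-in frames: in the method above these would be the superimposed
# visual-content frames generated from the animation data file.
for i in range(fps * 2):                      # two seconds of frames
    frame = np.full((height, width, 3), i % 255, dtype=np.uint8)
    writer.write(frame)

writer.release()
```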
[0063] At block 307, an animation of the superimposed visual content is generated to
mimic the audio message. The animation may comprise the audio-visual content. In
one non-limiting embodiment of the present disclosure, an isometric 2.5D animation for the
generated animation may be generated using an isometric technique known to a person skilled in the
art. In another non-limiting embodiment of the present disclosure, an isometric 3D animation for
the generated animation may be generated using an isometric technique known to a person skilled in
the art. The method 300 may further comprise transmitting the generated animation to the user
device. In one non-limiting embodiment, the generated animation may be transmitted to a device
different from the user device.
[0064] Thus, the method 300 facilitates the sender, i.e., the first user, in accurately
communicating the sender's mood, facial expressions, and gestures by generating an animated
visual content representing the audio message. In an embodiment of the present disclosure, the
steps of method 300 may be
performed in an order different from the order described above.
[0065] Fig. 4 illustrates a flowchart of an exemplary method 400 for training a neural
network, in accordance with another embodiment of the present disclosure.
[0066] At block 401, video data may be received from a user device. The video data may
comprise a video of the user. At block 403, the video data is processed to determine the lip shapes,
emotions, and body-part movement with respect to audio present in the video data. The lip shapes,
emotions, and body-part movement with respect to audio may be determined using the techniques
as discussed above.
[0067] At block 407, the neural network may be trained with the lip shapes, emotions, and
body-part movements as input, and the neural network may generate the time-aligned functions
including the lip shapes, emotions, and body-part movements. The time-aligned functions including
the lip shapes and emotions are stored against the spectrogram for the extracted audio signal or
speech signal in the second database. The time-aligned functions including the body-part movements
are stored against the respective keywords and phrases in the first database. The stored time-aligned
functions of the lip shapes, emotions, and body movements may be used for generating an animation
to mimic the audio message sent by a user.
[0068] The memory unit 203 and memory 211 may maintain software organized in
loadable code segments, modules, applications, programs, etc., which may be referred to herein as
software modules. Each of the software modules may include instructions and data that, when
installed or loaded on the one or more processors 213 and executed by the one or more processors
213, contribute to a run-time image that controls the operation of the processors 213. When
executed, certain instructions may cause the one or more processors 213 to perform functions in
accordance with certain methods, algorithms and processes described herein.
[0069] Furthermore, one or more computer-readable storage media may be utilized in
implementing embodiments consistent with the present disclosure. A computer-readable storage
medium refers to any type of physical memory on which information or data readable by a
processor may be stored. Thus, a computer-readable storage medium may store instructions for
execution by one or more processors, including instructions for causing the processor(s) to perform
steps or stages consistent with the embodiments described herein. The term “computer-readable
medium” should be understood to include tangible items and exclude carrier waves and transient
signals, i.e., are non-transitory. Examples include random access memory (RAM), read-only
memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash
drives, disks, and any other known physical storage media.
[0070] Suitable processors include, by way of example, a special purpose processor, a
digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in
association with a DSP core, a controller, a microcontroller, Application Specific Integrated
Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated
circuit (IC), and/or a state machine.
[0071] The illustrated steps are set out to explain the exemplary embodiments shown, and
it should be anticipated that ongoing technological development will change the manner in which
particular functions are performed. These examples are presented herein for purposes of
illustration, and not limitation. Further, the boundaries of the functional building blocks have been
arbitrarily defined herein for the convenience of the description. Alternative boundaries can be
defined so long as the specified functions and relationships thereof are appropriately performed.
Alternatives (including equivalents, extensions, variations, deviations, etc., of those described
herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained
herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the
words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended
to be equivalent in meaning and be open ended in that an item or items following any one of these
words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only
the listed item or items. It must also be noted that as used herein and in the appended claims, the
singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates
otherwise.

We Claim:

1. A method for generating animated audio-visual content, the method comprising:
receiving an audio message and a visual content from a user device;
processing the received audio message to generate time-aligned functions, wherein the
time-aligned functions include lip shapes, emotions, and body-part movements;
superimposing the generated time-aligned functions on the visual content in correlation
with the audio message; and
generating an animation of the superimposed audio-visual content to mimic the audio
message.
2. The method as claimed in claim 1, wherein processing the received audio message
comprises:
extracting data from the received audio message, wherein the data comprises keywords and
phrases;
comparing the extracted data with data stored in a first database, wherein the first database
comprises body-part movements corresponding to the stored data;
based on the comparison, retrieving the time-aligned function including the body-part
movements for the received audio message;
generating a visual representation of the received audio message;
comparing the generated visual representation with a second database, wherein the second
database comprises a plurality of visual representations or parts of visual representations tagged
with corresponding time-aligned functions including lip shapes and emotions; and
based on the comparison, retrieving time-aligned functions including lip shapes and
emotions for the received audio message.
3. The method as claimed in claim 1, wherein superimposing the generated time-aligned
functions on the visual content comprises:
generating an animation data file for the audio message, wherein the animation data file
comprises the time-aligned functions;
processing the animation data file line by line to generate a plurality of frames, wherein each
frame comprises time-aligned functions associated with a corresponding line of the animation data
file; and
generating the superimposed visual content by combining the plurality of frames.
4. The method as claimed in claim 1, further comprising generating an isometric 2.5D
animation for the generated animation using an isometric technique.
5. The method as claimed in claim 1, wherein the body part movements comprise at least
head and hand movements, and wherein the method further comprises:
storing the generated animation in memory; and
sending the generated animation to the user device.
6. A system for generating animated audio-visual content, the system comprising:
a receiving unit configured to receive an audio message and a visual content from a user
device;
a memory unit configured to store the received audio message and the visual content; and
a processing unit coupled to the receiving unit and the memory unit, and configured to:
process the received audio message to generate time-aligned functions, wherein the
time-aligned functions include lip shapes, emotions, and body-part movements;
superimpose the generated time-aligned functions on the visual content in
correlation with the audio message; and
generate an animation of the superimposed audio-visual content to mimic the audio
message.
7. The system as claimed in claim 6, wherein the processing unit is further configured to:
extract data from the received audio message, wherein the data comprises keywords and
phrases;
compare the extracted data with data stored in a first database, wherein the first database
comprises body-part movements corresponding to the stored data;
based on the comparison, retrieve the time-aligned functions including body-part
movements for the received audio message;
generate a visual representation of the received audio message;
compare the generated visual representation with a second database, wherein the second
database comprises a plurality of visual representations or parts of visual representations tagged
with corresponding functions including lip shapes and emotions; and
based on the comparison, retrieve time-aligned functions including lip shapes and emotions
for the received audio message.
8. The system as claimed in claim 6, wherein the processing unit is further configured to:
generate an animation data file for the audio message, wherein the animation data file
comprises the time-aligned functions;
process the animation data file line by line to generate a plurality of frames, wherein each
frame comprises time-aligned functions associated with a corresponding line of the animation data
file; and
generate the superimposed visual content by combining the plurality of frames.
9. The system as claimed in claim 6, wherein the processing unit is further configured to
generate an isometric 2.5D animation for the generated animation using an isometric technique.
10. The system as claimed in claim 6, wherein:
the body part movements comprise at least head and hand movements,
the memory unit is further configured to store the generated animation, and
the system further comprises a transmitting unit configured to send the generated animation
to the user device.

Documents

Application Documents

# Name Date
1 202011029365-FORM 18 [15-05-2024(online)].pdf 2024-05-15
2 202011029365-STATEMENT OF UNDERTAKING (FORM 3) [10-07-2020(online)].pdf 2020-07-10
3 202011029365-POWER OF AUTHORITY [10-07-2020(online)].pdf 2020-07-10
4 202011029365-Proof of Right [23-10-2020(online)].pdf 2020-10-23
5 202011029365-COMPLETE SPECIFICATION [10-07-2020(online)].pdf 2020-07-10
6 202011029365-FORM 1 [10-07-2020(online)].pdf 2020-07-10
7 202011029365-DECLARATION OF INVENTORSHIP (FORM 5) [10-07-2020(online)].pdf 2020-07-10
8 202011029365-DRAWINGS [10-07-2020(online)].pdf 2020-07-10
9 202011029365-FER.pdf 2025-07-04
10 202011029365-FORM 3 [28-08-2025(online)].pdf 2025-08-28

Search Strategy

1 202011029365_SearchStrategyNew_E_SearchHistoryE_04-02-2025.pdf