Abstract: The present disclosure relates to a method comprising receiving source data from a user. The source data comprises user voice data and one or more parameters related to a target voice. The method further comprises modulating the user voice data based on the one or more parameters related to the target voice. The method further comprises synchronizing the modulated user voice data over an animated representation of the user. The method also comprises generating a background based on the modulated voice data and generating a video using the synchronized animated representation and the background.
[0001] The present disclosure generally relates to a video generation system. In
particular, the present disclosure relates to a system and a method for generating an
animated video based on at least user voice data.
BACKGROUND
[0002] The information in this section merely provides background information related to the present disclosure and may not constitute prior art.
[0003] Enormous growth in mobile device technology has enabled users to use mobile devices for a variety of applications. As these media devices have become a new form of entertainment, people are interacting with said devices to record their audio and video and sharing said recorded audio and video with their friends. While conventional media systems facilitate audio/video recording and sharing, such systems provide a relatively narrow entertainment experience and do not provide any analysis or comparison of user-recorded audio with that of other users. Such analysis and comparison are required for allowing emerging singers to learn and improve their vocal skills. Moreover, such conventional systems fail to provide interactive and entertaining components in recorded audio/video which genuinely engage a user.
[0004] Therefore, there is a need for a media generation system which can generate a highly interactive and entertaining video based on at least user voice data.
SUMMARY OF THE INVENTION
[0005] One or more shortcomings of the prior art are overcome, and additional
advantages are provided by the present disclosure. Additional features and advantages
are realized through the techniques of the present disclosure. Other embodiments and
aspects of the disclosure are described in detail herein and are considered a part of the
disclosure.
[0006] In a main aspect, the present disclosure provides a method comprising receiving
source data from a user. The source data comprises user voice data and one or more parameters related to a target voice. The method further comprises modulating the user voice data based on the one or more parameters related to the target voice and synchronizing the modulated user voice data over an animated representation of the user. Further, the method comprises generating a background based on the modulated voice data. The background is generated using one or more
background parameters. The method further comprises generating a video using the
synchronized animated representation and the background.
[0007] According to another aspect, the one or more parameters related to the target
voice comprise at least a name of a singer in the target voice.
[0008] According to yet another aspect, the method comprises receiving at least one of
an image from the user and one or more inputs from the user. The method also
comprises generating the animated representation of the user based on at least one of
the received image and the one or more inputs from the user. The animated
representation of the user at least comprises a facial representation of the user.
[0009] According to yet another aspect, the one or more background parameters
include at least one of genre of song, tempo of song and one or more prestored semantic
maps.
[0010] According to yet another aspect, the method further comprises determining a
score for the user based on modulation required to match the user voice data to the
target voice and providing the score to the user.
[0011] According to yet another aspect, the method comprises determining a score for
multiple users based on modulation required to match user voice data to the target voice
for each user. The method also comprises matching the scores of the multiple users
against each other and providing a rank to the multiple users based on the matching of
scores.
[0012] According to yet another aspect, the method comprises identifying a song and
associated parameters from the source data.
[0013] In another main aspect, the present disclosure provides a system comprising
one or more memories and one or more processors operatively coupled to the one or
more memories. The one or more processors are configured to receive source data from a user. The source data comprises user voice data and one or more parameters related to a target voice. The one or more processors are further configured to modulate the user voice data based on the one or more parameters related to the target voice. The one or more processors are also configured to synchronize the modulated user voice data over
an animated representation of the user. The one or more processors are configured to
generate a background based on the modulated voice data. The background is generated
using one or more background parameters. The one or more processors are also
configured to generate a video using the synchronized animated representation and the
background. The system also comprises a display device configured to display the
video to the user.
[0014] According to yet another aspect, the one or more processors are further
configured to receive at least one of an image from the user and one or more inputs
from the user. The one or more processors are configured to generate the animated
representation of the user based on at least one of the received image and the one or
more inputs from the user. The animated representation of the user at least comprises
a facial representation of the user.
[0015] According to yet another aspect, the one or more processors are further
configured to determine a score for the user based on modulation required to match the
user voice data to the target voice and provide the score to the user.
[0016] According to yet another aspect, the one or more processors are configured to
determine a score for multiple users based on modulation required to match the user
voice data to the target voice for each user. The one or more processors are configured
to match the scores of the multiple users against each other and provide a rank to the
multiple users based on the matching of scores.
[0017] In yet another aspect, the one or more processors are further configured to
identify a song and associated parameters from the source data.
[0018] In the above paragraphs, the most important features of the invention have been outlined in order that the detailed description thereof that follows may be better understood, and in order that the present contribution to the art may be better appreciated. There
are, of course, additional features of the invention that will be described hereinafter and
which will form the subject of the claims appended hereto. Those skilled in the art will
appreciate that the conception upon which this disclosure is based may readily be
utilized as a basis for the designing of other structures for carrying out the several
purposes of the invention. It is important therefore that the claims be regarded as
including such equivalent constructions as do not depart from the spirit and scope of
the invention.
OBJECT OF THE INVENTION
[0019] The object of the present disclosure is to provide a system that generates an
animated video based on at least user voice data.
[0020] Another object of the present disclosure is to provide a system that converts user voice data into a target singer's voice and provides a score based on the modulation required to match the user voice with the target voice.
[0021] Yet another object of the present disclosure is to provide a system that may
generate a background for a video based on at least one of genre of a song, tempo of a
song and one or more prestored semantic maps.
[0022] Still another object of the present disclosure is to provide a system that generates a score for multiple users based on the modulation required to match the user voice data of each user with the target voice. The system may further provide a ranking to each user based on their generated score.
BRIEF DESCRIPTION OF DRAWINGS
[0023] Further aspects and advantages of the present invention will be readily
understood from the following detailed description with reference to the accompanying
drawings, where like reference numerals refer to identical or functionally similar
elements throughout the separate views. The figures together with the detailed
description below, are incorporated in and form part of the specification, and serve to
further illustrate the aspects and explain various principles and advantages, in
accordance with the present invention wherein:
[0024] Fig. 1 illustrates a system for generating and displaying a video in accordance
with an embodiment of the present disclosure.
[0025] Fig. 2 is a block diagram illustrating a user device in accordance with an
embodiment of the present disclosure.
[0026] Fig. 3 is a block diagram illustrating a server for generating a video in
accordance with an embodiment of the present disclosure.
[0027] Fig. 4 illustrates a flowchart of a method of generating and displaying a video in accordance with an embodiment of the present disclosure.
[0028] A person skilled in the art will appreciate that elements in the drawings are illustrated
for simplicity and have not necessarily been drawn to scale. For example, the
dimensions of some of the elements in the drawings may be exaggerated relative to
other elements to help to improve understanding of aspects of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0029] In the present document, the word "exemplary" is used herein to mean
"serving as an example, instance, or illustration." Any embodiment or implementation
of the present subject matter described herein as "exemplary" is not necessarily to be
construed as preferred or advantageous over other embodiments.
[0030] While the disclosure is susceptible to various modifications and alternative
forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not
intended to limit the disclosure to the particular forms disclosed, but on the contrary,
the disclosure is to cover all modifications, equivalents, and alternatives falling within
the scope of the disclosure.
[0031] The terms “comprises”, “comprising”, or any other variations thereof, are
intended to cover a non-exclusive inclusion, such that a setup or device that comprises a
list of components does not include only those components but may include other
components not expressly listed or inherent to such setup or device. In other words,
one or more elements in a system or apparatus preceded by “comprises… a” does not,
without more constraints, preclude the existence of other elements or additional
elements in the system or apparatus or device.
[0032] Disclosed herein is a technique for modulating a user voice into a target voice
and generating an animated video thereof. The technique includes generating an
animated representation of the user and synchronizing the modulated user voice over
the animated representation of the user. The technique also includes generating a
background based on the modulated user voice and one or more background
parameters. The technique further includes generating a video using the generated
background and synchronized animated user representation. The technique therefore
provides an interactive and entertaining video generation and sharing system.
[0033] Fig. 1 illustrates a system 100 for generating and displaying a video in
accordance with an embodiment of the present disclosure. The system 100 includes one
or more user devices 102a-102n (interchangeably referred to as “the user device 102 or
device 102”) and a server 106 communicably coupled to the user device(s) 102 via a
network 104. Examples of the network 104 may include, but are not limited to, the Internet,
a Wide Area Network (WAN), a Local Area Network (LAN), or any combination
thereof.
[0034] The user device 102 may be configured to provide a user interface to a user to
enable the user to interact within the system 100 and generate an animated video, in
accordance with an embodiment of present disclosure. The user device 102 may include
any mobile computing or communication device, such as, but not limited to, a notebook
computer, a personal digital assistant (PDA), a mobile phone, a smartphone, a laptop,
a tablet or any similar class of mobile computing device with sufficient processing,
communication, and audio/video recording and playback capabilities.
[0035] In an exemplary embodiment, the user device 102 may be configured to receive
one or more user inputs. The one or more user inputs may also be referred to as the
source data. The one or more user inputs may include user recorded audio, one or more
parameters related to a target voice, a captured image of the user and the like. In an exemplary embodiment, the user device 102 may be configured to pre-store, or retrieve from an external environment (not shown), a plurality of parameters related to a target sample/voice. The target voice may be defined as a desired voice of a singer in accordance with which the user wants to perform conversion/modulation of his/her recorded voice. In an embodiment, the one or more parameters related to the target voice may include at least a name of the singer in the target voice. In alternative embodiments, the one or more parameters related to the target voice may include voice-related parameters such as, but not limited to, pitch, amplitude, tone and so forth. In some embodiments,
the user device 102 may receive said one or more user inputs in real time. In alternative
embodiments, the user device 102 may be configured to store the one or more user
inputs and process the stored one or more user inputs when required. In an exemplary
embodiment, the user device 102 may be configured to transmit the one or more user
inputs to the server 106 via the network 104. The user device 102 may include any
suitable components of a mobile communication and/or computing device to perform
desired operations of the user device 102.
[0036] In such an embodiment, the server 106 may be configured to receive the source
data (also referred to as the one or more user inputs) from the user through the user
device 102. The server 106 may include one or more memories 108 and one or more
processors 110. The one or more processors 110 may be operatively coupled to the one
or more memories 108. In some embodiments, the one or more memories 108 may be
located within the server 106. In alternative embodiments, the one or more memories
108 may be a part of one or more systems located remotely to the server 106 and
connected via one or more communication networks. The one or more processors 110
may receive the source data from the user. The one or more processors 110 may identify
a song and associated parameters from the source data. The one or more processors 110
may be configured to identify one or more parameters related to the target voice. The
one or more memories 108 coupled to the one or more processors 110 may include
voice related data associated with the target voice. In some embodiments, the one or
more processors 110 may modulate the user voice data included in the source data
based on the one or more parameters related to the target voice. In an embodiment, the
one or more processors 110 may retrieve voice related data associated with target voice
to modulate the user voice data.
[0037] The embodiment described herein can be understood with an example, wherein the user may record his/her voice and specify a name of a singer (for example, Sonu Nigam) in the one or more parameters related to the target voice. The one or more processors 110 may retrieve voice-related data such as pitch, tone, amplitude etc. of the specified singer from the one or more memories 108. The one or more processors 110 may then modulate the user voice into the voice of the targeted singer by modulating the voice data of the user in accordance with the voice-related data of the target singer.
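As an illustrative, hedged sketch of the modulation step described above (not the claimed implementation), the following Python snippet estimates the median pitch of the user recording, compares it against a stored median pitch for the target singer, and pitch-shifts the user audio accordingly. The librosa calls are standard; the stored target parameters (target_median_f0, target_rms) are hypothetical placeholders for the voice-related data kept in the one or more memories 108.

```python
import numpy as np
import librosa
import soundfile as sf

# Hypothetical target-voice parameters retrieved from storage (memories 108).
target_median_f0 = 196.0   # Hz, assumed median pitch of the target singer
target_rms = 0.08          # assumed average loudness of the target voice

def modulate_to_target(user_wav: str, out_wav: str) -> float:
    """Shift the user's pitch/level toward the target voice.

    Returns the pitch shift (in semitones) that was required, which can
    later feed the score generation described in paragraph [0041].
    """
    y, sr = librosa.load(user_wav, sr=None, mono=True)

    # Estimate the user's fundamental frequency with probabilistic YIN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    user_median_f0 = float(np.nanmedian(f0[voiced_flag]))

    # Semitone difference between the user's and the target's median pitch.
    n_steps = 12.0 * np.log2(target_median_f0 / user_median_f0)
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

    # Simple amplitude matching toward the assumed target loudness.
    user_rms = float(np.sqrt(np.mean(y_shifted ** 2)))
    y_out = y_shifted * (target_rms / max(user_rms, 1e-8))

    sf.write(out_wav, y_out, sr)
    return n_steps
```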
[0038] In some embodiments, the one or more processors 110 may be further
configured to receive at least one of an image from the user and one or more inputs
from the user to generate an animated representation of the user. The one or more
processors 110 may generate the animated representation of the user based on at least
one of the received image and the one or more inputs from the user. In an embodiment,
the animated representation of the user at least comprises a facial representation of the user. In another embodiment, the animated representation of the user comprises a
complete body representation of the user. In an exemplary embodiment, the one or
more processors 110 may synchronize the modulated user voice data over the animated
representation of the user. The synchronization may comprise synchronizing facial expressions by varying the characteristics of facial components such as eyes, lips, nose and so forth in accordance with the modulated user voice data. An example of synchronizing facial expressions is lip synchronization based on the lyrics of a song identified from the user voice data.
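Purely as a hedged illustration of how facial synchronization might be driven by the modulated voice (the disclosure does not prescribe this method), the sketch below converts frame-level loudness of the modulated audio into a per-video-frame mouth-openness value that an animation layer could consume. The animation call itself is a hypothetical placeholder.

```python
import numpy as np
import librosa

def mouth_openness_per_frame(audio_path: str, video_fps: int = 25) -> np.ndarray:
    """Map short-time loudness of the modulated voice to mouth openness in [0, 1]."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    hop = int(sr / video_fps)                    # one RMS value per video frame
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    return rms / (rms.max() + 1e-8)              # normalise to 0..1

# Hypothetical usage: drive the avatar's lips frame by frame.
# for frame_idx, openness in enumerate(mouth_openness_per_frame("modulated.wav")):
#     avatar.set_lip_openness(frame_idx, openness)   # placeholder animation API
```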
[0039] Further, the one or more processors 110 may be configured to generate a
background based on the modulated voice data. In an embodiment, the background is
generated using one or more background parameters. The one or more background
parameters include at least one of genre of song, tempo of song and one or more
prestored semantic maps. In some embodiments, the one or more background parameters are stored in the one or more memories 108. In alternative embodiments, the one or more
background parameters may be received from the user through the user device 102.
[0040] The one or more processors 110 may then be configured to generate a video
using the synchronized animated representation of the user and the background. In an
exemplary embodiment, the server 106 may receive a user-recorded voice, a name of a singer and an image of the user. As a result, the server 106 may generate a video of the animated representation of the user dancing and singing the recorded voice, with a background resembling the theme of the song. The server 106 may then transmit the generated video to the user device 102 (which may also be referred to as a display device 102) to display the video to the user. In some other embodiments, the server 106 may receive user input data from one or more user devices 102a-102n and generate a corresponding animated video for each of the one or more users.
[0041] In some embodiments, the one or more processors 110 may be configured to
determine a score for the user based on the modulation required to match the user voice
data to the target voice. The one or more processors 110 may provide the generated
score to the user. The score may be represented in the form of a number or a graphical representation of one or more parameters related to the audio. The score may enable a user to enhance his/her voice quality and match the target singer's voice. Therefore, the system 100 enables an amateur singer to efficiently learn and enhance his/her singing capabilities. In some other embodiments, the one or more processors 110 may be configured to determine a score for multiple users based on the modulation required to match the user voice data to the target voice for each user. The one or more processors 110 may be configured to match the scores of the multiple users against each other and provide a rank to the multiple users based on the matching of scores. Therefore, the system
100 may provide a virtual singing competition environment to the users.
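As a hedged sketch of one way the score in paragraph [0041] could be derived (the disclosure leaves the exact formula open), the snippet below turns the amount of modulation that was required, here the absolute pitch shift in semitones returned by the earlier modulation sketch, into a 0-100 score, where a smaller required correction yields a higher score. The scaling constant is an assumption, not part of the disclosure.

```python
def modulation_score(semitones_required: float, max_correction: float = 12.0) -> int:
    """Score 0-100: less required modulation means the user was closer to the target."""
    correction = min(abs(semitones_required), max_correction)
    return round(100.0 * (1.0 - correction / max_correction))

# Example: a user needing a 2-semitone correction scores higher than one needing 7.
assert modulation_score(2.0) > modulation_score(7.0)
```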
[0042] The embodiment illustrated above is exemplary in nature, and the system 100 and/or any element of the system 100 may include any number of additional components required to perform the desired operations of the system 100.
[0043] Fig. 2 is a block diagram illustrating a user device 200 in accordance with an
embodiment of the present disclosure. The user device 200 may be similar to the user
device 102. In an exemplary embodiment, said user device 200 may be any user device
among user devices 102a-102n, represented in figure 1. The user device 200 may
include any mobile computing or communication device, such as, but not limited to, a
notebook computer, a personal digital assistant (PDA), a mobile phone, a smartphone,
a laptop, a tablet or any similar class of mobile computing device with sufficient
processing, communication, and audio/video recording and playback capabilities. In an
exemplary embodiment, the user device 200 may include an Input / Output unit 202
(also referred to as IO unit 202), a processing unit 204, a memory unit 206 and a
transceiver 208. Further, in an exemplary embodiment, the user device 200 may include
other essential elements that may be required for carrying out one or more
functionalities of the user device 200, however same are not explained for the sake of
brevity.
[0044] Returning to Fig. 2, the IO unit 202 may be configured to receive one or more user inputs. In an embodiment, the IO unit 202 comprises input/output devices such as, but not limited to, an audio recording device, an image capturing device, a touch display, a keypad, one or more sensors, speakers and so forth. In an embodiment,
the audio recording device may be configured to enable the user to record user audio.
The image capturing device may be configured to capture an image of the user and, in combination with the audio recording device, a video of the user. The touch display or the keypad may enable the user to provide one or more manual inputs to the user device 200. In an
embodiment, the IO unit 202 may also be configured to receive one or more parameters
related to a target voice. The target voice may refer to a desired voice of a singer in accordance with which the user wants to perform conversion/modulation of his/her recorded voice. In an embodiment, the one or more parameters related to the target voice may include at least a name of the singer in the target voice. In alternative embodiments, the one or more parameters related to the target voice may include voice-related parameters such as, but not limited to, pitch, amplitude, tone and so forth.
[0045] The user device 200 may also include a processing unit 204 configured to
process the one or more inputs received from the user. The user device 200 may also
include a memory unit 206 configured to store data and/or instructions required for processing by the processing unit 204. In some embodiments, the memory unit 206
may include memory storage devices such as, but not limited to, Read Only Memory
(ROM), Random Access Memory (RAM), flash disk, and so forth. The memory unit
206 may store the one or more inputs received from the user.
[0046] In some embodiments, the processing unit 204 may execute a set of instructions
stored in the memory unit 206 to provide a user interface to the user. The user interface
may allow the user to interact within the system 100 (shown in Fig. 1). The processing
unit 204 may process the one or more user inputs and transmit the processed user inputs
to the server 106. In an exemplary embodiment, the processing unit 204 may process
the user voice data received as one or more user inputs to determine one or more
associated parameters.
[0047] The user device 200 also includes the transceiver 208 configured to transmit
and receive data to and from the server 106 via the network 104 (shown in Fig. 1). In
some embodiments, the processing unit 204 may transmit the one or more user inputs
and/or processed data to the server 106 via the transceiver 208. In some other
embodiments, the processing unit 204 may receive an animated video generated based on the one or more user inputs and/or processed data from the server 106 via the
transceiver 208. The IO unit 202 may include a display device configured to display
the received animated video to the user. In an essential embodiment, the user device
200 may remain connected to the server 106, via the network 104, to carry out the functionalities described in the above paragraphs, or in other words, to achieve the desired objectives of the present invention.
[0048] In an embodiment, the user device 200 may be implemented with an Artificial
Intelligence (AI) model. The AI model may be used to extract information such as
images, search history etc. The user device 200 may transmit the extracted information
to the server 106. In an embodiment, the extracted information may form a part of the
source data and/or one or more user inputs.
[0049] The embodiments illustrated above are exemplary in nature, and the user device 200 may include any other additional components required to perform the desired functionality of the user device 200.
[0050] Fig. 3 is a block diagram illustrating the server 300 for generating a video in
accordance with an embodiment of the present disclosure. In an exemplary
embodiment, the server 300 is similar to the server 106 disclosed in figure 1. The server
300 includes a transceiver unit 302, a modulation unit 304, an avatar generation unit
306, a synchronization unit 308, a background generation unit 310, a video generation
unit 312, a score generation unit 314 and the one or more memory units 316. Each of
the transceiver unit 302, the modulation unit 304, the avatar generation unit 306, the
synchronization unit 308, the background generation unit 310, the video generation unit
312, the score generation unit 314 and the one or more memory units 316 may be
operatively coupled to each other.
[0051] The transceiver unit 302 may be configured to establish a communication
between the one or more user devices 102, 200 and the server 106 via the network 104.
The transceiver unit 302 may be coupled to the transceiver 208 (shown in Fig. 2) of the
user device 200. The transceiver unit 302 may be configured to receive and transmit
data from and to the user device 102.
[0052] In an embodiment, the modulation unit 304, the avatar generation unit 306, the
synchronization unit 308, the background generation unit 310, the video generation unit
312, the score generation unit 314 may be implemented using one or more processors
110 illustrated in Fig. 1. In another embodiment, the modulation unit 304, the avatar
generation unit 306, the synchronization unit 308, the background generation unit 310,
the video generation unit 312 and the score generation unit 314 may be implemented
using one or more processors 110 through a neural network.
[0053] The modulation unit 304 may be configured to receive source data from the
user. The source data comprises user voice data and one or more parameters related
to a target voice. In some embodiments, the modulation unit 304 may be configured to
extract parameters of the user voice data and modulate the extracted parameters
corresponding to the parameters of the target voice. In an embodiment, the parameters
associated with the target voice may be stored in the one or more memory units 316. In
alternative embodiments, the parameters associated with the target voice may be
provided by the user via the user device 102. In some embodiments, the extracted
parameters and/or the parameters associated with the target voice may comprise audio-related characteristics such as tone, pitch, frequency, amplitude and so forth. The
modulation unit 304 may be configured to modulate the extracted parameters of the
user voice in accordance with the parameters associated with the target voice to match
the user voice with the target voice. In some embodiments, the modulation unit 304
may include any suitable electronic or electrical component such as an amplifier, a modulator, a phase shifter, an auto-tuner, and so forth. The embodiments illustrated above either cover or intend to cover any other suitable component to be a part of the modulation unit 304 which may be required to perform the desired functionality of the modulation unit 304. In an exemplary embodiment, the modulation unit 304 may be based on a Generative Adversarial Network (GAN). One or more functionalities of the modulation unit 304 may be implemented using an Artificial Intelligence (AI) based model such as MelGAN to provide high-quality audio synthesis with relatively fewer parameters than competing models.
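To make the GAN-based variant above concrete, a minimal sketch is given below, assuming a MelGAN-style setup in which the user recording is converted to a mel-spectrogram, the spectrogram is transformed toward the target voice, and a neural vocoder renders audio. The voice-conversion model and vocoder objects are hypothetical placeholders; only the mel-spectrogram computation uses a standard librosa call.

```python
import numpy as np
import librosa

def user_voice_to_mel(wav_path: str, n_mels: int = 80) -> np.ndarray:
    """Compute a log mel-spectrogram of the user recording (MelGAN-style input)."""
    y, sr = librosa.load(wav_path, sr=22050, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )
    return np.log(np.clip(mel, 1e-5, None))

# Hypothetical flow of the modulation unit 304 (placeholder model objects):
# mel = user_voice_to_mel("user.wav")
# mel_target = voice_conversion_model(mel, target_id="target_singer")  # placeholder GAN
# audio = melgan_vocoder(mel_target)                                   # placeholder vocoder
```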
[0054] The server 300 further includes the avatar generation unit 306 configured to
generate an animated representation of the user. In some embodiments, the avatar
generation unit 306 may be configured to receive image of a user as an input to generate
an animated representation of the user. In some other embodiments, the avatar
generation unit 306 may be configured to receive a video as an input to generate the
animated representation of the user in form of video. In alternative embodiments, the
avatar generation unit 306 may be configured to receive one or more user inputs
defining various characteristic of the user to generate the animated representation of
the user. In some embodiments, the animated representation of the user may at least
comprise facial representation of the user. In some other embodiments, the animated
representation of the user may comprise complete body representation of the user. The
avatar generation unit 306 may be operatively coupled to the one or more memory units
316 to generate the animated representation of the user. The avatar generation unit 306
may include any suitable image processing hardware components and/or an associated set of instructions which, when executed by the hardware components, are configured to generate the animated representation of the user based on at least one of the image of the user and the one or more user inputs. In an exemplary embodiment, the avatar generation unit 306 may be implemented using an AI-based state-of-the-art model such as, but not limited to, DensePose. DensePose may facilitate the avatar generation unit 306 to convert a 2-Dimensional (2D) image of the user to a 3D, surface-based representation of the user.
In an embodiment, the avatar generation unit 306 may comprise a DensePose block (not shown) and an Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation (U-GAT-IT) block (not shown). The DensePose block may receive a user video as an input and generate a pose for the user's body. The user input image may be combined with the generated pose of the user's body and provided as an input to the U-GAT-IT block. The U-GAT-IT block may perform style transfer and convert the combined video into an avatar-style animated representation of the user.
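A hedged, high-level sketch of the two-stage avatar pipeline described above follows. Both model wrappers (run_densepose, run_ugatit) are hypothetical placeholders standing in for whatever DensePose and U-GAT-IT implementations are actually used; only the frame handling with OpenCV uses standard calls.

```python
from typing import Callable, List
import numpy as np
import cv2

def generate_avatar_frames(
    video_path: str,
    user_image: np.ndarray,
    run_densepose: Callable[[np.ndarray], np.ndarray],   # placeholder pose estimator
    run_ugatit: Callable[[np.ndarray], np.ndarray],      # placeholder style-transfer model
) -> List[np.ndarray]:
    """Pose estimation per frame, combined with the user image, then style transfer."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Assumed to return a BGR image the same size as the input frame.
        pose_map = run_densepose(frame)
        resized = cv2.resize(user_image, (frame.shape[1], frame.shape[0]))
        combined = cv2.addWeighted(resized, 0.5, pose_map, 0.5, 0.0)  # fuse image + pose
        frames.append(run_ugatit(combined))                           # avatar-style frame
    cap.release()
    return frames
```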
[0055] The server 300 further includes the synchronization unit 308 configured to
synchronize the modulated user voice data over the animated representation of the user.
In some embodiments, the synchronization unit 308 may be configured to synchronize
the facial expressions by varying the characteristics of facial components such as eyes, lips, nose and so forth in accordance with the modulated user voice data. An example of synchronizing facial expressions is lip synchronization based on the lyrics of a song identified from the user voice data. In some other embodiments, the synchronization unit 308 may be configured to synchronize the body and the associated components of the animated representation of the user to illustrate dancing of the animated representation of the user based on the modulated user voice data. The synchronization
unit 308 may be configured to receive and/or extract parameters such as, but not limited
to, tone, beats, bass, lyrics and so forth associated with the modulated user voice data
to synchronize the animated representation of the user. In an exemplary embodiment,
the synchronization unit 308 may animate the static animated representation of the user
by adding dynamic features to the animated representation of the user. The
synchronization unit 308 may include any suitable image processing hardware components and/or an associated set of instructions which, when executed by the hardware components, are configured to synchronize the modulated user voice data over the animated representation of the user.
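As a hedged illustration of how the synchronization unit 308 might extract beat information from the modulated voice to drive dance-like body motion (one of several parameters the paragraph mentions; the extraction method itself is an assumption), the snippet below uses librosa's beat tracker to obtain the tempo and the timestamps at which body keyframes could be placed.

```python
import librosa

def beat_times_for_animation(audio_path: str):
    """Return (tempo in BPM, list of beat timestamps in seconds) from the modulated voice."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    return float(tempo), beat_times.tolist()

# Hypothetical usage: place a dance keyframe of the avatar at every detected beat.
# tempo, beats = beat_times_for_animation("modulated.wav")
# for t in beats:
#     avatar.add_dance_keyframe(at_seconds=t)   # placeholder animation API
```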
[0056] The server 300 also includes the background generation unit 310 configured to
generate a background based on the modulated voice data. In an embodiment, the
background generation unit 310 may be configured to generate the background using
one or more background parameters. The one or more background parameters include
at least one of genre of song, tempo of song and one or more prestored semantic maps.
In alternative embodiments, the one or more background parameters may include
information extracted from the user device 102. In an embodiment, the background
generation unit 310 may determine the one or more background parameters from the
modulated voice data. In another embodiment, the background generation unit 310 may
receive one or more background parameters from the user device 102 and/or any other
unit of the server 300. The one or more semantic maps used for generating the background may be stored in the one or more memory units 316. The background generation unit 310 may be configured to retrieve the one or more semantic maps from the one or more memory units 316 to generate the background.
[0057] In some embodiments, the one or more background parameters may also include mood, situation, timing and season. In an alternative embodiment, the background generation unit 310 may receive one or more inputs from the user to generate the background. The one or more inputs from the user may comprise information such as, but not limited to, a name of a location. In an embodiment, the background may be a static image generated based on the one or more background parameters. In another embodiment, the background may be a set of images representing a dynamic image and/or a video, generated based on the one or more background parameters. The background generation unit 310 may be configured to change the background based on the semantic map given in the video in accordance with the one or more background parameters. An example of background generation may include determining a genre of a song, such as disco, in which case the generated background may represent a disco environment with lots of lights. The background generation unit 310 may include any suitable image processing hardware components and/or an associated set of instructions which, when executed by the hardware components, are configured to generate the background, in accordance with the embodiments of the present disclosure. In an embodiment, the background generation unit 310 may utilize a 3D convolution-based classification approach to generate the background.
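The following hedged sketch shows one plausible way the background generation unit 310 could map the parameters it determines (genre and tempo here) onto a prestored semantic map; the genre labels, file names, and tempo thresholds are illustrative assumptions rather than values taken from the disclosure.

```python
# Hypothetical catalogue of prestored semantic maps (paths are placeholders).
SEMANTIC_MAPS = {
    "disco": "maps/disco_lights.png",
    "classical": "maps/concert_hall.png",
    "folk": "maps/open_field.png",
    "default": "maps/plain_stage.png",
}

def select_background(genre: str, tempo_bpm: float) -> str:
    """Pick a prestored semantic map from the genre, falling back on tempo."""
    key = genre.lower() if genre.lower() in SEMANTIC_MAPS else None
    if key is None:
        # Assumption: fast songs get the energetic map, slow songs the calm one.
        key = "disco" if tempo_bpm >= 120 else "classical"
    return SEMANTIC_MAPS.get(key, SEMANTIC_MAPS["default"])

# e.g. select_background("Disco", 128.0) -> "maps/disco_lights.png"
```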
[0058] The server 300 may further include the video generation unit 312 configured to generate a video based on the synchronized animated representation of the user and the generated background. The video generation unit 312 may superimpose the animated representation of the user over the generated background to generate the desired video. The video generation unit 312 may receive one or more inputs from the one or more other units (304, 306, 308 and 310) of the server 300 to generate the video. The video may include a set of images and synchronized audio components. The video generated by the video generation unit 312 includes an automatically generated animated representation and background, and therefore includes a number of entertaining components required to engage the users. The video generation unit 312 may include any suitable hardware components and/or an associated set of instructions which, when executed by the hardware, are configured to generate the video, in accordance with the embodiments of the present
disclosure. The generated video may be stored in the one or more memory units 316.
In another embodiment, the generated video may be transmitted to the user device 102
via the transceiver unit 302.
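As a hedged sketch of the superimposition step performed by the video generation unit 312 (the disclosure does not fix a particular compositing method), the snippet below alpha-blends each avatar frame over the background image with OpenCV and writes the result to a video file; muxing the modulated audio onto that file is left to a separate tool and noted as an assumption.

```python
from typing import List
import numpy as np
import cv2

def compose_video(avatar_frames: List[np.ndarray], background_path: str,
                  out_path: str, fps: int = 25, alpha: float = 0.85) -> None:
    """Superimpose avatar frames over the generated background and write a video."""
    h, w = avatar_frames[0].shape[:2]
    background = cv2.resize(cv2.imread(background_path), (w, h))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in avatar_frames:
        composite = cv2.addWeighted(frame, alpha, background, 1.0 - alpha, 0.0)
        writer.write(composite)
    writer.release()
    # Assumption: the modulated voice track is added afterwards, e.g. with ffmpeg:
    #   ffmpeg -i out.mp4 -i modulated.wav -c:v copy -shortest final.mp4
```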
[0059] The server 300 further includes a score generation unit 314 configured to
determine a score for the user based on modulation required to match the user voice
data to the target voice. In an embodiment, the score generation unit 314 may be
configured to receive one or more inputs from the one or more other units (304, 308,
306, 310 and 312) of the server 300 to determine the score of the users. The score
generation unit 314 may provide the score to the user via the transceiver unit 302. In
an alternative embodiment, the score generation unit 314 may store the generated score
in the one or more memory units 316. In an embodiment, the score generation unit 314
may utilize an AI-based model such as the RankNet method to generate scores and ranks of multiple users. The score generation unit 314 may include any suitable hardware components and/or an associated set of instructions which, when executed by the hardware components, are configured to compare the user voice data with the modulated voice data to determine the modulation required to match the user voice to the target voice. In some embodiments, the score generation unit 314 may be configured to determine a score for
multiple users based on modulation required to match user voice data to the target voice
for each user. The score generation unit 314 may match the scores of the multiple users against each other. The score generation unit 314 may provide a rank to the multiple users based on the matching of scores. The ranking may be provided to the users in the form of a list or any graphical representation.
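As a hedged sketch of the multi-user ranking described above (simple score sorting is used here; the disclosure only requires that scores be matched against each other and mentions RankNet as one AI-based option), the snippet below ranks users by the modulation-based score from the earlier scoring sketch.

```python
from typing import Dict, List, Tuple

def rank_users(scores: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Return (rank, user, score) tuples, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, user, score) for rank, (user, score) in enumerate(ordered, start=1)]

# Example with hypothetical users and scores:
# rank_users({"asha": 92, "ravi": 78, "meera": 85})
# -> [(1, "asha", 92), (2, "meera", 85), (3, "ravi", 78)]
```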
[0060] The embodiments illustrated above are exemplary in nature, and the server 300 may include any additional units required to perform the desired operations of the server 300. Further, embodiments of the present disclosure cover or intend to cover any one or more of the operations illustrated above being performed at the user device 102. The user device 102 may include any of the corresponding units required to perform said operations of the server 300.
[0061] Fig. 4 is a flowchart of exemplary method 400 for generating a video, in
accordance with an embodiment of the present disclosure. This flowchart is provided
for illustration purposes, and embodiments are intended to include or otherwise cover
any methods or procedures for generating video. Fig. 4 is described in reference to Figs.
1-3.
[0062] At block 402, the server 106 may be configured for receiving the source data
from a user via the transceiver unit 302. The source data may comprise user voice data
and one or more parameters related to a target voice. In an embodiment, the one or
more parameters related to the target voice comprise at least a name of a singer in the
target voice.
[0063] At step 404, the modulation unit 304 of the server 106 may be configured for
modulating the user voice data based on the one or more parameters related to the target
voice. In some embodiments, the modulation unit 304 may modulate the voice related
parameters such as pitch, tone, frequency etc. in accordance with the corresponding
parameters of the target voice. In an embodiment, the one or more parameters related
to the target voice may be used to identify the target voice and the corresponding
parameters are extracted from the one or more memory units 316.
[0064] At step 406, the synchronization unit 308 may be configured for synchronizing
the modulated user voice data over an animated representation of the user. In an
embodiment, the avatar generation unit 306 may be configured to receive at least one
of an image from the user and one or more inputs from the user to generate the animated
representation of the user. The animated representation of the user at least comprises a facial representation of the user. The synchronization unit 308 may be configured to synchronize the facial expressions of the animated representation of the user based on the modulated user voice data.
[0065] At step 408, the background generation unit 310 may be configured for
generating a background based on the modulated voice data using one or more
background parameters. The one or more background parameters may include at least
one of genre of song, tempo of song, mood, timing, situation and one or more prestored
semantic maps.
[0066] At step 410, the video generation unit 312 may be configured for generating a
video based on the synchronized animated representation and the generated
background. In an embodiment, the generated video may be displayed to the user via
the user device 102.
[0067] The flowchart and block diagrams in the Figures illustrate the architecture,
functionality, and operation of possible implementations of devices, methods, and
computer program products according to various embodiments of the present
invention. In this regard, each block in the flowchart or block diagrams may represent
a module, unit, segment, or portion of instructions, which comprises one or more
executable instructions for implementing the specified logical function(s). In some
alternative implementations, the functions noted in the block may occur out of the order
noted in the figures. For example, two blocks shown in succession may, in fact, be
executed substantially concurrently, or the blocks may sometimes be executed in the
reverse order, depending upon the functionality involved. It will also be noted that each
block of the block diagrams and/or flowchart illustration, and combinations of blocks
in the block diagrams and/or flowchart illustration, can be implemented by special
purpose hardware-based systems that perform the specified functions or acts or carry
out combinations of special purpose hardware and computer instructions.
[0068] The foregoing description of the various embodiments is provided to enable any
person skilled in the art to make or use the present invention. Various modifications to
these embodiments will be readily apparent to those skilled in the art, and the generic
principles defined herein may be applied to other embodiments without departing from
the spirit or scope of the invention. Thus, the present invention is not intended to be
limited to the embodiments shown herein, and instead the claims should be accorded
the widest scope consistent with the principles and novel features disclosed herein.
[0069] While the invention has been described with reference to a preferred
embodiment, it is apparent that variations and modifications will occur without departing from the spirit and scope of the invention. It is therefore contemplated that the
present disclosure covers any and all modifications, variations or equivalents that fall
within the scope of the basic underlying principles disclosed above.
We Claim:
1. A method comprising:
receiving source data from a user, wherein the source data comprises
user voice data and one or more parameters related to a target voice;
modulating the user voice data based on the one or more parameters
related to the target voice;
synchronizing the modulated user voice data over an animated
representation of the user;
generating a background based on the modulated voice data, wherein
the background is generated using one or more background parameters; and
generating a video using the synchronized animated representation and
the background.
2. The method according to claim 1, wherein the one or more parameters related
to the target voice comprises at least name of singer in the target voice.
3. The method according to claim 1, further comprising:
receiving at least one of:
an image from the user; and
one or more inputs from the user; and
generating the animated representation of the user based on at least one
of the received image and the one or more inputs from the user, wherein the
animated representation of the user at least comprises facial representation of
the user.
4. The method according to claim 1, wherein the one or more background
parameters include at least one of: genre of song, tempo of song and one or
more prestored semantic maps.
5. The method according to claim 1, further comprising:
determining a score for the user based on modulation required to match
the user voice data to the target voice; and
providing the score to the user.
6. The method according to claims 1 and 5, further comprising:
determining a score for multiple users based on modulation required to
match user voice data to the target voice for each user;
matching the scores of the multiple users against each other; and
providing a rank to the multiple users based on the matching of scores.
7. The method according to claim 1, further comprising:
identifying a song and associated parameters from the source data.
8. A system comprising:
one or more memories; and
one or more processors operatively coupled to the one or more
memories, the one or more processors configured to:
receive source data from a user, wherein the source data comprises user
voice data and one or more parameters related to a target voice;
modulate the user voice data based on the one or more parameters
related to the target voice;
synchronize the modulated user voice data over an animated
representation of the user;
generate a background based on the modulated voice data, wherein the
background is generated using one or more background parameters; and
generate a video using the synchronized animated representation and the
background; and
a display device configured to display the video to the user.
9. The system according to claim 8, wherein the one or more parameters related
to the target voice comprises at least name of singer in the target voice.
10. The system according to claim 8, wherein the one or more processors are
further configured to:
receive at least one of:
an image from the user; and
one or more inputs from the user; and
generate the animated representation of the user based on at least one of
the received image and the one or more inputs from the user, wherein the
animated representation of the user at least comprises facial representation of
the user.
11. The system according to claim 8, wherein the one or more background
parameters include at least one of: genre of song, tempo of song and one or
more prestored semantic maps.
12. The system according to claim 8, wherein the one or more processors are
further configured to:
determine a score for the user based on modulation required to match
the user voice data to the target voice; and
provide the score to the user.
13. The system according to claims 8 and 12, wherein the one or more processors
are further configured to:
determine a score for multiple users based on modulation required to
match the user voice data to the target voice for each user;
match the scores of the multiple users against each other; and
provide a rank to the multiple users based on the matching of scores.
14. The system according to claim 8, wherein the one or more processors are
further configured to:
identify a song and associated parameters from the source data.