Abstract: The present disclosure relates to techniques for automatically generating a video. The techniques include receiving at least one user input, the at least one user input comprising at least one word and a context, generating lyrical data based on the received user input using a Generative Pretrained Transformer 2 (GPT-2) model, and displaying a plurality of genres of music corresponding to the lyrical data. The techniques further include selecting a specific genre of music from the plurality of genres of music based on user input, generating an audio based on the selected genre of music and the generated lyrical data, retrieving an image of the user, and generating a video at least based on the retrieved image and the generated audio.
FIELD OF THE INVENTION:
[0001] The present disclosure generally relates to communication. More specifically, the present disclosure relates to automatically generating a video based on at least one word and a context.
BACKGROUND OF THE INVENTION:
[0002] Music has become a crucial element in people's everyday lives. Music is a formidable therapy that relaxes people in moments of sorrow and makes them cheerful in moments of happiness. People spend hours listening to it and billions of dollars buying it.
[0003] Presently, there are several mobile applications that provide audio and video music surfing online. Generally, a user of a mobile device spends a lot of time searching for an audio or a video that best suits his or her mood and present context. Most of the time, the user ends up not liking the lyrics or theme of the music.
[0004] There is no service or application available that generates a music video based on a context and mood of the user and generates lyrics of the music video based on a word or phrase provided by the user. Further, there is no such application that generates the background of the video based on the user's mood and context.
[0005] Thus, there is a need for a technique that overcomes the above-mentioned drawbacks, fulfils the above needs, and generates a video based on a context and at least one word provided by the user.
OBJECTS OF THE INVENTION:
[0006] An object of the present disclosure is to generate lyrical data of the music video based on a word or phrase provided by a user.
[0007] Another object of the present disclosure is to generate the music video based on a context provided by the user.
[0008] Another object of the present disclosure is to generate a background of the music video based on the user's mood and context.
SUMMARY OF THE INVENTION:
[0009] The present disclosure overcomes one or more shortcomings of the prior art and provides additional advantages discussed throughout the present disclosure. Additional features and advantages are realized through the techniques of the present disclosure. Other embodiments and aspects of the disclosure are described in detail herein and are considered a part of the claimed disclosure.
[0010] In one non-limiting embodiment of the present disclosure, a system for automatically generating a video is disclosed. The system comprises a memory, a user interface in communication with the memory, and at least one processor in communication with the memory and the user interface. The at least one processor is configured to receive at least one user input. The at least one user input comprises at least one word and a context. The at least one processor is configured to generate lyrical data based on the received user input using a Generative Pretrained Transformer 2 (GPT-2) model, display a plurality of genres of music corresponding to the lyrical data, select a specific genre of music from the plurality of genres of music based on user input, and retrieve an image of the user. The system further comprises an audio generation unit in communication with the at least one processor and configured to generate an audio based on the selected genre of music and the generated lyrical data, and a video generation unit configured to generate a video at least based on the retrieved image and the generated audio.
[0011] In one non-limiting embodiment of the present disclosure, the at least one processor is configured to receive a plurality of words and one or more contexts and train the GPT-2 model with the plurality of words and the associated one or more contexts to generate the lyrical data.
[0012] In still another non-limiting embodiment of the present disclosure, to generate the lyrical data, the at least one processor is configured to predict one or more words based on the at least one word and the context and arrange the predicted one or more words to generate the lyrical data.
[0013] In yet another non-limiting embodiment of the present disclosure, to generate the video the video generation unit is configured to classify a background class for the video based on the context and generate the video based on the classified background class, the retrieved image, and the generated audio.
[0014] In yet another non-limiting embodiment of the present disclosure, to receive the at least one user input the at least one processor is configured to receive features associated with the at least one user input at a server, and to retrieve the image the at least one processor is configured to retrieve features associated with the image.
[0015] In yet another non-limiting embodiment of the present disclosure, a method for automatically generating a video is disclosed. The method comprises receiving at least one user input, the at least one user input comprising at least one word and a context, generating lyrical data based on the received user input using a Generative Pretrained Transformer 2 (GPT-2) model, displaying a plurality of genres of music corresponding to the lyrical data, selecting a specific genre of music from the plurality of genres of music based on user input, generating an audio based on the selected genre of music and the generated lyrical data, retrieving an image of the user, and generating a video at least based on the retrieved image and the generated audio.
[0016] In yet another non-limiting embodiment of the present disclosure, the processing unit is further configured to generate an animation data file for the audio message, wherein the animation data file comprises time-aligned functions, process the animation data file line by line to generate a plurality of frames, wherein each frame comprises time-aligned functions associated with a corresponding line of the animation data file, and generate the superimposed visual content by combining the plurality of frames.
[0017] In yet another non-limiting embodiment of the present disclosure, the method further comprises receiving a plurality of words and one or more contexts and training the GPT-2 model with the plurality of words and the associated one or more contexts to generate the lyrical data.
[0018] In yet another non-limiting embodiment of the present disclosure, the step of generating the lyrical data using the GPT-2 model comprises predicting one or more words based on the at least one word and the context and arranging the predicted one or more words to generate the lyrical data.
[0019] In yet another non-limiting embodiment of the present disclosure, the step of generating the video comprises classifying a background class for the video based on the context and generating the video based on the classified background class, the retrieved image, and the generated audio.
[0020] In yet another non-limiting embodiment of the present disclosure, the step of receiving the at least one user input comprises receiving features associated with the at least one user input at a server, and the step of retrieving the image comprises retrieving features associated with the image.
[0021] The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF DRAWINGS:
[0022] The features, nature, and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. Some embodiments of systems and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figures, in which:
[0023] Fig. 1(a) shows an exemplary environment 100a in a communication network, in accordance with an embodiment of the present disclosure;
[0024] Fig. 1(b) shows another exemplary environment 100b in a communication network, in accordance with an embodiment of the present disclosure;
[0025] Fig. 2 illustrates an exemplary data flow for automatically generating a video, in accordance with an embodiment of the present disclosure;
[0026] Fig. 3(a) illustrates a block diagram of a network for automatically generating a video, in accordance with another embodiment of the present disclosure;
[0027] Fig. 3(b) illustrates a block diagram of a video generation unit, in accordance with another embodiment of the present disclosure;
[0028] Fig. 4 illustrates a flowchart of an exemplary method for automatically generating a video, in accordance with an embodiment of the present disclosure;
[0029] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flowcharts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF DRAWINGS:
[0030] In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
[0031] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0032] The terms "comprises", "comprising", "include(s)", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, system or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or system or method. In other words, one or more elements in a system or apparatus preceded by "comprises... a" does not, without more constraints, preclude the existence of other elements or additional elements in the system or apparatus.
[0033] In the following detailed description of the embodiments of the disclosure, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.
[0034] Fig. 1(a) shows an exemplary environment in a communication network, in accordance with an embodiment of the present disclosure.
[0035] In an embodiment of the present disclosure, the environment 100a may comprise a user device 101 and a server 103 in communication with each other. The user device 101 may be operated by a user 110. The user 110 may provide at least one input to the user device 101. The at least one input may comprise at least one word, a context, and an image. The user device 101 may forward or transmit the at least one input to the server 103.
[0036] In another embodiment of the present disclosure, the user device 101 may forward only features associated with the at least one input to the server 103. The features associated with the at least one input may be extracted and shared with the server 103 using a Federated Learning technique.
[0037] Federated Learning allows for smarter models, lower latency, and less power consumption, all while ensuring privacy, as only the features associated with the at least one user input are transmitted to the server 103 and the raw data is retained at the user device 101. In one non-limiting embodiment, the data sharing is not limited to the above-mentioned technique, and any other technique known to a person skilled in the art may be applied.
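By way of a non-limiting illustration, the feature extraction and sharing described above may be sketched as follows. This is a minimal sketch only: the embedding model (sentence-transformers) and the server endpoint are assumptions introduced here for illustration and are not prescribed by the present disclosure.

```python
# On-device feature extraction sketch: only a feature vector leaves the device.
import requests
from sentence_transformers import SentenceTransformer  # assumed feature extractor

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def send_features(word: str, context: str, server_url: str) -> None:
    """Encode the raw input locally and transmit only the resulting features.

    The raw word/context never leaves the device; only the embedding is sent,
    mirroring the federated-style sharing described above.
    """
    features = encoder.encode(f"{word} | {context}").tolist()
    requests.post(server_url, json={"features": features})

# Hypothetical endpoint, for illustration only:
# send_features("rain", "nostalgic evening", "https://example.com/api/features")
```

Because only the fixed-length feature vector is transmitted, the raw word and context remain local, which is the privacy property relied upon above.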
[0038] The server 103 may be configured to receive the features associated with the at least one input from the user device 101 and rebuild or reconstruct the user input based on the federated learning technique. The server 103 may be configured to generate the lyrical data based on the at least one word and the context. The server 103 may then be configured to generate the audio based on the lyrical data and a selected genre of music. In one non-limiting embodiment of the present disclosure, the genre of music may be automatically selected based on the user's interest or a previously selected genre.
[0039] The server 103 may then be configured to generate a video based on the generated audio and the image provided by the user 110 of the user device 101. The server 103 may then transmit the generated video to the user device 101. In an embodiment of the present disclosure, the image may be customized as per the choice or preference of the user 110. The image may or may not be representative of the user 110.
[0040] In one non-limiting embodiment, the image may be retrieved from the user device 101. The retrieved image and the generated audio may be used by the server 103 to generate the video, thereby facilitating generation of video based on context and at least one word provided by the user.
[0041] Fig. 1(b) shows an exemplary environment 100b in a communication network, in accordance with an embodiment of the present disclosure.
[0042] In an embodiment of the present disclosure, the environment 100b may comprise a user 110 and a user device 101. The user 110 may provide at least one word, a context, and an image to the user device 101. In one non-limiting embodiment of the present disclosure, the user device 101 may allow the user 110 to select at least one image stored in the memory of the user device 101. In another non-limiting embodiment, the user 110 may capture the at least one image using the user device 101.
[0043] The user device 101 may be configured to generate the lyrical data based on the at least one word and the context provided by the user 110. The user device 101 may then allow the user 110 to select a genre of music via a user interface of the user device 101. Alternatively, the user device 101 may automatically select a genre of music previously selected by the user or a genre of music based on an interest of the user 110.
[0044] The user device 101 may then be configured to generate an audio based on the lyrical data and the genre of music. The user device 101 may then be configured to generate a music video based on the at least one image and the generated audio. The background of the video may be generated based on a mood of the user 110 and the context provided by the user 110. Thus, the user device 101 facilitates generation of a video based on the choice of the user 110.
[0045] Fig. 2 illustrates an exemplary data flow for automatically generating a video, in accordance with an embodiment of the present disclosure.
[0046] In an embodiment of the present disclosure, a Generative Pretrained Transformer 2 (GPT-2) model may receive the features associated with the user input, i.e., the at least one word and the context provided by the user. The GPT-2 model may predict one or more words based on the at least one word and the context provided by the user. The one or more words may be arranged to form the lyrical data.
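A minimal sketch of this prediction step is given below using the publicly available GPT-2 checkpoint from the Hugging Face transformers library. The prompt format and the sampling parameters are assumptions for illustration and do not reflect a specific implementation of the present disclosure.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def generate_lyrical_data(word: str, context: str, max_new_tokens: int = 120) -> str:
    # Hypothetical prompt format: the seed word(s) and the context condition the model.
    prompt = f"Context: {context}\nWords: {word}\nLyrics:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,            # sampling so repeated calls yield varied lyrics
        top_p=0.92,
        temperature=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly predicted tokens and arrange them as the lyrical data.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Example: generate_lyrical_data("rain", "missing an old friend")
```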
[0047] The lyrical data may then be provided to a music generation unit along with a genre of music selected by the user. The genre of music may comprise song, storytelling, verse telling, ghazal, etc. The music generation unit may then generate an audio based on the lyrical data and the genre of music selected by the user.
[0048] The generated audio and the at least one image may then be provided to a video generation unit. The video generation unit may then be configured to generate a video based on the audio and the at least one image provided by the user. In one non-limiting embodiment, the image may be retrieved from a memory of the user device. The features of the retrieved image may then be transmitted to the server according to the federated learning technique as discussed above.
[0049] In an embodiment of the present disclosure, a current mood of the user and context may be provided to a background generation unit. The background generation unit may be configured to classify a background class of the video based on the mood and context provided by the user. The background generation unit may operate on a residual neural network (ResNet) model. The background class classified by the background generation unit may be provided to the video generation unit for generating the background of the video.
[0050] In one non-limiting embodiment, the background generation is not limited to the technique discussed above, and any other technique known to a person skilled in the art may be applied.
[0051] Fig. 3(a) illustrates a block diagram of a network 300 for automatically generating a video and Fig. 3(b) illustrates a block diagram of a video generation unit 309, in accordance with another embodiment of the present disclosure.
[0052] In an embodiment of the present disclosure, the network 300 may include one or more elements such as, but not limited to, the internet, a local area network, a wide area network, a peer-to-peer network, and/or other similar technologies for connecting various entities as discussed below. In an aspect, various elements/entities such as a system 310 and a user device 320 of the network 300 as shown in fig. 3(a) may communicate within the network 300 through web presence (not shown). In one non-limiting embodiment the system 310 may be a server.
[0053] The user device 320 may represent a desktop computer, a laptop computer, a mobile device (e.g., smart phone or personal digital assistant), a tablet device, or another type of computing device, which has computing, messaging and networking capabilities. The user device 320 may be equipped with one or more computer storage devices (e.g., RAM, ROM, PROM, SRAM, etc.), a communication unit and one or more processing devices (e.g., central processing units) that are capable of executing computer program instructions.
[0054] The user device 320 may be operated by a user. The user device 320 may comprise a user interface (not shown) configured to receive at least one user input. The at least one user input may comprise at least one word and a context. The user device 320 may be configured to extract the features associated with the at least one input based on the federated learning technique.
[0055] The user device 320 may be configured to transmit the features associated with the at least one user input to the server 310. In one non-limiting embodiment, the user device 320 may be configured to transmit the at least one user input to the server 310 without applying the federated learning technique.
[0056] In an embodiment of the present disclosure, the server 310 may comprise various components such as a memory 301, a user interface 303, at least one processor 305, an audio generation unit 307, a video generation unit 309, and a transceiver 311 communicatively coupled with each other over a wired or wireless link. The at least one processor 305 may be configured to receive, via the transceiver 311, at least one user input from the user device 320.
[0057] The at least one user input may comprise at least one of a context, and at least one word. In one non-limiting embodiment, to receive the at least one user input the at least one processor 305 may be configured to receive features associated with the at least one user input at the server 310.
[0058] In an embodiment of the present disclosure, the server 310 may comprise a Generative Pretrained Transformer 2 (GPT-2) model. The at least one processor 305 may be configured to receive a plurality of words and one or more contexts from the user and train the GPT-2 model with the plurality of words and the associated one or more contexts for the lyrics generation task. During training, the model is fine-tuned with a few words and a context as the input and lyrics as the output.
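A minimal fine-tuning sketch for this lyrics generation task is shown below. The prompt layout, the placeholder training triples and the hyper-parameters are assumptions for illustration only and are not taken from the present disclosure.

```python
# Fine-tuning sketch: each example concatenates the seed words, the context and the
# target lyrics into one sequence, trained with the standard language-modelling loss.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

train_examples = [  # (words, context, lyrics) triples -- placeholder data
    ("rain", "missing an old friend", "The rain keeps falling on my window pane..."),
]

model.train()
for epoch in range(3):
    for words, context, lyrics in train_examples:
        text = f"Context: {context}\nWords: {words}\nLyrics:\n{lyrics}{tokenizer.eos_token}"
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        # Labels equal the input ids: the usual causal language-modelling objective.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```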
[0059] The at least one processor 305 may be configured to generate lyrical data based on the received user input using the GPT-2 model. To generate the lyrical data, the at least one processor 305 may be configured to predict one or more words based on the at least one word and the context, based on the training done at the initial phase. The at least one processor 305 may then be configured to arrange the predicted one or more words to generate the lyrical data.
[0060] In an embodiment of the present disclosure, the at least one processor 305 may be configured to display a plurality of genres of music corresponding to the lyrical data on the user interface of the user device 320. The user device 320 may be configured to receive a selection of a genre of music from the user of the user device 320. The selection may be transmitted as user input from the user device 320 to the server 310.
[0061] The at least one processor 305 may be configured to select a specific genre of music from the plurality of genres of music, based on the user input received from the user device 320. The at least one processor 305 may then be configured to retrieve at least one image of the user from the memory of the user device 320. In one non-limiting embodiment, the at least one processor 305 may have the at least one image stored in the memory 301 with respect to the user of the user device 320.
[0062] In an embodiment of the present disclosure, the audio generation unit 307 may be configured to generate an audio based on the selected genre of music and the generated lyrical data. The audio generation unit 307 may comprise a plurality of templates of music for the plurality of genres of music. The audio generation unit 307 may sync one of the templates of music with the lyrical data based on the genre of music selected by the user. The audio generation unit 307 may comprise a memory and one or more processors for generating the audio output.
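The present disclosure does not specify how the template and the lyrical data are synced. The sketch below therefore assumes a plain text-to-speech rendering of the lyrics (pyttsx3) overlaid on a per-genre backing template (pydub), purely as a stand-in for whatever vocal-synthesis engine an implementation would actually use; the template file paths are hypothetical.

```python
import pyttsx3
from pydub import AudioSegment

GENRE_TEMPLATES = {          # hypothetical template files, one per genre
    "song": "templates/song_backing.wav",
    "storytelling": "templates/storytelling_backing.wav",
}

def generate_audio(lyrical_data: str, genre: str, out_path: str = "audio.wav") -> str:
    # 1. Render the lyrics as a voice track (plain TTS stand-in for vocal synthesis).
    engine = pyttsx3.init()
    engine.save_to_file(lyrical_data, "vocals.wav")
    engine.runAndWait()

    # 2. Overlay the voice track on the template selected for the chosen genre.
    backing = AudioSegment.from_file(GENRE_TEMPLATES[genre])
    vocals = AudioSegment.from_file("vocals.wav")
    mixed = backing.overlay(vocals)

    mixed.export(out_path, format="wav")
    return out_path
```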
[0063] In an embodiment of the present disclosure, the video generation unit 309 may be configured to generate a video at least based on the retrieved image and the generated audio. The video generation unit 309 may comprise an audio to video generator 313 and a background generation unit 315 in communication with each other. The background generation unit 315 may operate on a residual neural network (ResNet) model.
[0064] The ResNet model of the background generation unit 315 may be trained with a plurality of background classes and the associated context and mood of the user. The background classes may comprise mountains, beaches, stage performance, etc. The background generation unit 315 may be configured to classify a background class for the video based on the context provided in the at least one user input. The background generation unit 315 may then be configured to generate a background of the video based on the classified background class.
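Since the disclosure names a ResNet model but does not fix how the mood and context are represented, the sketch below assumes the mood/context text is first embedded with a sentence encoder and then classified by a small residual network standing in for the ResNet model of the background generation unit 315; the class list is taken from the examples above, while everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

BACKGROUND_CLASSES = ["mountains", "beaches", "stage performance"]

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return torch.relu(x + self.net(x))   # skip connection, ResNet-style

class BackgroundClassifier(nn.Module):
    def __init__(self, dim: int = 384, n_classes: int = len(BACKGROUND_CLASSES)):
        super().__init__()
        self.blocks = nn.Sequential(ResidualBlock(dim), ResidualBlock(dim))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        return self.head(self.blocks(x))

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
classifier = BackgroundClassifier()                  # would be trained on labelled data

def classify_background(mood: str, context: str) -> str:
    features = torch.tensor(encoder.encode(f"{mood}. {context}")).unsqueeze(0)
    logits = classifier(features)
    return BACKGROUND_CLASSES[int(logits.argmax(dim=-1))]
```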
[0065] The audio to video generator 313 of the video generation unit 309 may be configured to generate the video based on the generated background, the retrieved image, and the generated audio. The generated video may then be transmitted to the user device 320 through the transceiver 311 and displayed to the user via a user interface of the user device 320.
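A minimal composition sketch is shown below using the moviepy library, which is an assumption here (the disclosure does not name a video library): the generated background serves as the canvas, the retrieved user image is overlaid, and the generated audio sets the duration of the clip.

```python
from moviepy.editor import AudioFileClip, CompositeVideoClip, ImageClip

def generate_video(background_path: str, user_image_path: str,
                   audio_path: str, out_path: str = "video.mp4") -> str:
    audio = AudioFileClip(audio_path)
    background = ImageClip(background_path).set_duration(audio.duration)
    user_image = (ImageClip(user_image_path)
                  .set_duration(audio.duration)
                  .resize(height=int(background.h * 0.6))   # hypothetical framing choice
                  .set_position("center"))

    video = CompositeVideoClip([background, user_image]).set_audio(audio)
    video.write_videofile(out_path, fps=24)
    return out_path
```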
[0066] In one non-limiting embodiment of the present disclosure, the user device 320 may comprise all the components of the server 310, and the user device 320 may be configured to generate the video based on the at least one user input. Thus, the system 310 facilitates accurately generating a video based on a context and mood of the user. Further, the system facilitates generation of lyrical data based on at least one word and a context provided by the user.
[0067] Fig. 4 illustrates a flowchart of an exemplary method 400 for automatically generating a video, in accordance with an embodiment of the present disclosure.
[0068] At block 401, the method 400 discloses receiving at least one user input from the user device. The user device may capture the input received from the user and transmit it to the server. The at least one user input comprises at least one word and a context. In an embodiment of the present disclosure, receiving the at least one user input comprises receiving features associated with the at least one user input at a server.
[0069] The user device may extract features associated with the at least one input based on the federated learning technique and transmit the features associated with the at least one user input to the server. The present disclosure is not limited to the data-sharing technique discussed above, and any other data-sharing technique known to a person skilled in the art may be applied.
[0070] At block 403, the method 400 discloses generating lyrical data based on the received user input using a Generative Pretrained Transformer 2 (GPT-2) model. The GPT-2 model may be trained with a plurality of words and one or more contexts from the user for the lyrics generation task. During training, the model is fine-tuned with a few words and a context as the input and lyrics as the output.
[0071] In an embodiment of the present disclosure, the step of generating the lyrical data using the GPT-2 model may comprise predicting one or more words based on the at least one word and the context and arranging the predicted one or more words to generate the lyrical data. The one or more words may be predicted based on the training of the GPT-2 model.
[0072] At block 405, the method 400 discloses displaying a plurality of genres of music corresponding to the lyrical data. The plurality of genres of music corresponding to the lyrical data may be displayed on the user interface of the user device. The user device may receive a selection of a genre of music from the user and transmit it as at least one user input to the server.
[0073] At block 407, the method 400 discloses selecting a specific genre of music from the plurality of genres of music, based on the user input received from the user device. At block 409, the method 400 discloses generating an audio based on the selected genre of music and the generated lyrical data. The audio may be generated as per the procedure discussed above.
[0074] At block 411, the method 400 discloses retrieving at least one image of the user from the memory of the user device. In one non-limiting embodiment, the at least one image may be stored in a memory of the server with respect to the user of the user device. The image may be directly retrieved from the memory of the server without any user input.
[0075] At block 413, the method 400 discloses generating a video at least based on the retrieved image and the generated audio at step 409. The step of generating the video may comprise classifying a background class for the video based on the context, and generating the video based on the classified background class, the retrieved image, and the generated audio.
[0076] In an embodiment of the present disclosure, a ResNet model may be used for generating the background of the video. The ResNet model may be trained with a plurality of background classes and the associated context and mood of the user. The background classes may comprise mountains, beaches, stage performance, etc. The background of the video may be generated based on the classified background class.
[0077] In an embodiment of the present disclosure, the video at step 413 may be generated based on the generated background, the retrieved image, and the generated audio. The generated video may be then transmitted to the user device and may be displayed to the user via the user interface of the user device.
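Tying the blocks of the method 400 together, an end-to-end sketch might look as follows. It chains the illustrative helper functions sketched earlier (generate_lyrical_data, generate_audio, classify_background, generate_video), and the per-class background image mapping is hypothetical; none of these names are prescribed by the present disclosure.

```python
# End-to-end sketch of method 400; helpers are the illustrative functions above.
BACKGROUND_IMAGES = {                       # hypothetical mapping, class -> image file
    "mountains": "backgrounds/mountains.png",
    "beaches": "backgrounds/beaches.png",
    "stage performance": "backgrounds/stage.png",
}

def method_400(word: str, context: str, mood: str, genre: str, user_image: str) -> str:
    lyrics = generate_lyrical_data(word, context)              # block 403
    audio = generate_audio(lyrics, genre)                      # block 409
    background_class = classify_background(mood, context)      # background classification
    return generate_video(BACKGROUND_IMAGES[background_class], # block 413
                          user_image, audio)

# video_path = method_400("rain", "missing an old friend", "nostalgic", "song", "me.png")
```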
[0078] Thus, the method 400 facilitates accurately generating video based on a context and mood of the user. Further, the method 400 facilitates generation of lyrical data based on at least one word and context provided by the user.
[0079] In one non-limiting embodiment of the present disclosure, the user device may perform all the steps of the method 400 instead of the server. In an embodiment of the present disclosure, the steps of the method 400 may be performed in an order different from the order described above.
[0080] The user interface 303 may include at least one of a key input means, such as a keyboard or keypad, a touch input means, such as a touch sensor or touchpad, and a gesture input means. Further, the user interface 303 may include all types of input means that are currently in development or are to be developed in the future. The user interface 303 may receive information from the user through the touch panel of the display and transfer it to the at least one processor 305.
[0081] The at least one processor 305 may comprise a memory and a communication interface. The memory may store software maintained and/or organized in loadable code segments, modules, applications, programs, etc., which may be referred to herein as software modules. Each of the software modules may include instructions and data that, when installed or loaded on a processor and executed by the processor, contribute to a run-time image that controls the operation of the processor. When executed, certain instructions may cause the processor to perform functions in accordance with certain methods and processes described herein.
[0082] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0083] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term "computer-readable medium" should be understood to include tangible items and exclude carrier waves and transient signals, i.e., are non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0084] Suitable processors include, by way of example, a special purpose processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine.
[0085] Also, the words "comprising," "having," "containing," and "including," and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims,
the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise.
We Claim:
1. A system for automatically generating a video, the system comprising:
a memory;
a user interface in communication with the memory;
at least one processor in communication with the memory and the user interface,
wherein the at least one processor is configured to:
receive at least one user input, wherein the at least one user input comprises at least one word and a context;
generate lyrical data based on the received user input using a Generative Pretrained Transformer 2 (GPT-2) model;
display a plurality of genres of music corresponding to the lyrical data;
select a specific genre of music from the plurality of genres of music, based on user input; and
retrieve an image of the user; wherein the system further comprises:
an audio generation unit in communication with the at least one processor and configured to generate an audio based on the selected genre of music and the generated lyrical data; and
a video generation unit configured to generate a video at least based on the retrieved image and the generated audio.
2. The system as claimed in claim 1, wherein the at least one processor is configured to:
receive a plurality of words and one or more contexts; and
train the GPT-2 model with the plurality of words and the associated one or more contexts to generate the lyrical data.
3. The system as claimed in claim 1, wherein to generate the lyrical data the at least one
processor is configured to:
predict one or more words based on the at least one word and the context; and arrange the predicted one or more words to generate the lyrical data.
4. The system as claimed in claim 1, wherein to generate the video the video generation unit
is configured to:
classify a background class for the video based on the context; and generate the video based on the classified background class, the retrieved image, and the generated audio.
5. The system as claimed in claim 1, wherein to receive the at least one user input the at least one processor is configured to receive features associated with the at least one user input at a server, and wherein to retrieve the image the at least one processor is configured to retrieve features associated with the image.
6. A method for automatically generating a video, the method comprising:
receiving at least one user input, wherein the at least one user input comprises at least one word and a context;
generating lyrical data based on the received user input using a Generative Pretrained Transformer 2 (GPT-2) model;
displaying a plurality of genres of music corresponding to the lyrical data;
selecting a specific genre of music from the plurality of genres of music, based on user input;
generating an audio based on the selected genre of music and the generated lyrical data;
retrieving an image of the user; and
generating a video at least based on the retrieved image and the generated audio.
7. The method as claimed in claim 6, further comprising:
receiving a plurality of words and one or more contexts; and
training the GPT-2 model with the plurality of words and the associated one or more contexts to generate the lyrical data.
8. The method as claimed in claim 6, wherein generating the lyrical data using the GPT-2 model
comprises:
predicting one or more words based on the at least one word and the context; and
arranging the predicted one or more words to generate the lyrical data.
9. The method as claimed in claim 6, wherein generating the video comprises:
classifying a background class for the video based on the context; and
generating the video based on the classified background class, the retrieved image, and the generated audio.
10. The method as claimed in claim 6, wherein receiving the at least one user input comprises
receiving features associated with the at least one user input at a server, and wherein retrieving the
image comprises retrieving features associated with the image.