
System And Method For Real Time Generation Of Realistic Talking Videos On Edge Devices

Abstract: A system (100) for real-time generation of realistic talking videos (218, 420) on an edge device (102) is presented. The system (100) includes an acquisition subsystem (110) and a processing subsystem (112) including a realistic talking video generating platform (114) configured to process the audio input (104, 402), select essential phonemes (410) from the audio input (104, 402), for each essential phoneme (410), retrieve a pre-generated viseme (318) and a blink sequence viseme (322) corresponding to the source (106, 402) from the edge device (102), generate intermediate image frames for smooth transition between pre-generated visemes (318), dynamically render the pre-generated visemes (318) and the blink sequence visemes (322) in synchronization with live audio input (104, 402) to generate a realistic animated talking video (218, 422), and an interface unit (116, 118) configured to provide, in real-time on the edge device (102), the realistic animated talking video (218, 422). FIG. 1


Patent Information

Application #:
Filing Date: 14 December 2023
Publication Number: 51/2024
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

Myelin Foundry Private Limited
A-202/203, Miraya Rose, 66/1, Siddapura Village, Varthur Road, Whitefield, Bengaluru 560066, Karnataka, India

Inventors

1. Gopichand Katragadda
C/O Myelin Foundry Private Limited, A-202/203, Miraya Rose, 66/1, Siddapura Village, Varthur Road, Whitefield, Bengaluru 560066, Karnataka, India
2. Anshal Singh
C/O Myelin Foundry Private Limited, A-202/203, Miraya Rose, 66/1, Siddapura Village, Varthur Road, Whitefield, Bengaluru 560066, Karnataka, India
3. Harish L
C/O Myelin Foundry Private Limited, A-202/203, Miraya Rose, 66/1, Siddapura Village, Varthur Road, Whitefield, Bengaluru 560066, Karnataka, India

Specification

BACKGROUND
[0001] Embodiments of the present specification relate generally to real-time video processing and animation, and more particularly to systems and methods for generating in real-time realistic talking or animated videos from singular static images on edge devices with constrained computational resources.
[0002] With the advent of digital communication, individuals and businesses often communicate with each other via various types of networks, including the internet. Various devices and services, such as personal computers, handheld devices, personal digital assistants (PDAs), cell phones, e-mail, instant messaging services, video conferencing, audio and video streaming, gaming, and the like, are employed to convey information between users. Currently, information is communicated in both animated and text-based formats having video and audio content. There is an increasing need for animated avatars of human beings that are capable of adequately representing the human being in conversation, such as visual desktop agents, digital actors, virtual avatars, and the like.
[0003] In recent years, various systems and methods for video processing, and specifically systems for generating realistic talking videos from static images, have garnered increasing attention and interest. In one example, photographic images of a human being have been used to generate animated videos with motion. However, these videos exhibit low quality due to artifacts that blur the video image when compressed to reduce download time, and they exhibit poor lip synchronization. In addition, traditionally, generating lifelike talking or animated video from a static image requires significant computational power and resources and often relies on cloud-based services. Further, these techniques entail the use of advanced hardware and require high-bandwidth internet connectivity. Hence, the currently available methods are unable to operate offline, making them unsuitable for scenarios where connectivity is unavailable or intermittent. Therefore, these presently available approaches fall short of offering a comprehensive solution optimized for edge devices and are hence unsuitable for environments requiring low-resource processing, offline capabilities, and scalability for diverse applications like virtual assistants, online education, and entertainment. More particularly, the currently available methods are not feasible for edge devices such as smartphones, tablets, or embedded systems having limited computational processing capabilities and memory resources. Also, the existing methods lack efficiency and cannot operate independently without internet connectivity. Hence, there is a growing need for a technique that is optimized for generating realistic talking videos on edge devices and that overcomes the shortcomings of the currently available approaches.

BRIEF DESCRIPTION
[0004] In accordance with aspects of the present specification, a system for real-time generation of realistic talking videos on an edge device is presented. The system includes an acquisition subsystem configured to receive an audio input and a source of the audio input. Moreover, the system includes a processing subsystem in operative association with the acquisition subsystem and including a realistic talking video generating platform, where the realistic talking video generating platform is, on the edge device, configured to: if the audio input includes a text, process the audio input to generate an audio corresponding to the text; select one or more essential phonemes from the audio input, where the essential phonemes facilitate accurate lip synchronization; for each essential phoneme, retrieve a pre-generated viseme corresponding to the source and a pre-generated blink sequence viseme corresponding to the source, where the pre-generated viseme and the pre-generated blink sequence viseme corresponding to the source are retrieved from a data repository on the edge device; generate one or more intermediate image frames for smooth transition between pre-generated visemes; and dynamically render the pre-generated visemes and the blink sequence visemes in synchronization with live audio input to generate a realistic animated talking video that is synchronized with the live audio input. The system further includes an interface unit configured to provide, in real-time on the edge device, the realistic animated talking video that is synchronized with the live audio input.
[0005] In accordance with another aspect of the present specification, a method for real-time generation of realistic talking videos on an edge device is presented. The method includes (a) receiving an audio input and a source of the audio input, (b) if the audio input includes a text, processing the text to generate an audio, (c) selecting one or more essential phonemes from the audio input, where the essential phonemes facilitate accurate lip synchronization, (d) for each essential phoneme, retrieving a pre-generated viseme corresponding to the source and a pre-generated blink sequence viseme corresponding to the source, (e) generating one or more intermediate image frames for smooth transition between visemes, (f) dynamically rendering the pre-generated visemes and the pre-generated blink sequence visemes in synchronization with live audio input to generate a realistic animated talking video that is synchronized with the live audio input, and (g) providing, on the edge device, the realistic animated talking video that is synchronized with the live audio input.
[0006] In accordance with yet another aspect of the present specification, a processing system for real-time generation of realistic talking videos on an edge device is presented. The processing system includes a realistic talking video generating platform including a landmark generation unit configured to receive one or more images from one or more sources and, for each of the one or more sources, process a corresponding image to identify facial landmarks; a viseme creation unit configured, for each of the one or more sources, to receive a set of distinguishable phonemes, receive a set of viseme landmarks corresponding to each of the distinguishable phonemes, and, for each distinguishable phoneme, modify the facial landmarks based on the distinguishable phoneme and the viseme landmarks corresponding to the distinguishable phoneme to generate a viseme data map corresponding to the source; a blink sequence viseme generation unit configured, for each of the one or more sources, to adjust the facial landmarks to generate a blink sequence viseme, where the blink sequence viseme is configured to simulate natural eye movement; an intermediate frame generation unit configured to interpolate intermediate frames to smoothen transition between visemes; a phoneme selection unit configured to select one or more essential phonemes from the audio input; and a real-time playback unit configured, for each of the one or more sources, to dynamically render the one or more visemes in synchronization with live audio input, in real-time.

DRAWINGS
[0007] These and other features and aspects of embodiments of the present specification will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
[0008] FIG. 1 is a schematic representation of an exemplary system for real-time generation of realistic talking videos on an edge device, in accordance with aspects of the present specification;
[0009] FIG. 2 is a flow chart illustrating an exemplary method for real-time generation of realistic talking videos on an edge device, in accordance with aspects of the present specification;
[0010] FIG. 3 is a schematic illustration of some steps of an exemplary method for generation of visemes for use in the method of FIG. 2, in accordance with aspects of the present specification;
[0011] FIG. 4 is a schematic illustration of one embodiment of the method for real-time generation of realistic talking videos on an edge device of FIG. 2, in accordance with aspects of the present specification;
[0012] FIG. 5 is a diagrammatical illustration of one embodiment of a realistic talking video generating platform for use in the system of FIG. 1, in accordance with aspects of the present specification; and
[0013] FIG. 6 is a schematic representation of one embodiment of a digital processing system implementing a realistic talking video generating platform for use in the system of FIG. 1, in accordance with aspects of the present specification.
DETAILED DESCRIPTION
[0014] The following description presents exemplary systems and methods for real-time generation of realistic talking videos on an edge device. In particular, the systems and methods described herein present an exemplary approach for generating realistic talking videos that are suitably synchronized with live audio input on edge devices. Embodiments described hereinafter present exemplary systems and methods that employ a live audio stream and a static image to animate lip-sync videos efficiently, optimizing the use of limited computational and memory resources of the edge device. Specifically, the systems and methods presented herein entail the use of pre-generated visemes corresponding to a source such as a person and other visemes such as blink sequence visemes to convert a static image to a talking video that is synchronized with live audio input. These methods and systems are designed to be highly efficient and robust, thereby ensuring a high-quality output in the form of a realistic talking video that is synchronized with live audio input. Use of the present systems and methods presents significant advantages in revolutionizing the generation of realistic talking videos on edge devices. More particularly, the optimized resource utilization, offline capability, and real-time processing offered by the systems and methods described hereinafter present a groundbreaking solution in lip-sync animation and real-time video processing. The systems and methods for real-time generation of realistic talking videos on an edge device address the challenges of efficient lip-sync video generation in resource-constrained environments, enabling real-time applications without reliance on cloud-based infrastructure, network availability, or bandwidth constraints, thereby overcoming the drawbacks of currently available methods of generating talking videos.
[0015] For ease of understanding, the exemplary embodiments of the present systems and methods are described in the context of real-time generation of talking videos on an edge device. However, use of the exemplary embodiments illustrated hereinafter in other systems and applications such as broadcast transmission, virtual assistants, online education, gaming, and entertainment is also contemplated. An exemplary environment that is suitable for practising various implementations of the present systems and methods is discussed in the following sections with reference to FIG. 1.
[0016] As used herein, the term “user” or “source” or “contact” refers to a person using an edge device or the system of FIG. 1 for streaming multimedia content. Also, in one example, the “user” or “source” or “contact” may represent one of a plurality of contacts listed or saved on an edge device that provides an input. The terms “user,” “viewer,” “consumer,” “source,” “contact,” “end user,” and “end consumer” may be used interchangeably. It may be noted that in some examples, the source of the input may be a non-human source such as an automobile. Further, as used herein, the term “input” is used to refer to any input provided by the source. For example, the input may include a text sent by the source. The input may also include an audio or live audio provided by the source.
[0017] Also, as used herein, the term “edge device” refers to a device that is a part of a distributed computing topology in which information processing is performed close to where things and/or people produce or consume information. Some non-limiting examples of the edge device include a mobile phone, a tablet, a laptop, a smart television (TV), and the like. Additionally, the term “edge device” may also be used to encompass a device that is operatively coupled to an edge device noted hereinabove. Some non-limiting examples of such a device include a streaming media player that is connected to a viewing device such as a TV and allows a user to stream video and/or music, a gaming device/console, and the like. Other examples of the edge device also include networking devices such as a router, a modem, and the like.
[0018] Moreover, as used herein, the term “phoneme” is used to refer to one of the smallest units of speech or smallest distinct sound unit in a language that distinguishes between words. Phonemes are the smallest units of sound that carry meaning and are distinguishable by a number of characteristics, including place and manner of articulation, voicing properties, and degree of aspiration. Different languages use different phonemes. For example, the English language uses 44 distinguishable English phonemes. As used herein, the term “distinguishable phonemes” may be used to refer to a minimal set of phonemes used to represent a given language.
[0019] Further, as used herein, the term “viseme” is used to represent a position of the face and mouth of a person when saying a word. The viseme is a visual equivalent of a phoneme in a spoken language and defines the position of the face and mouth while a person is speaking. Each viseme depicts the key facial poses for a specific set of phonemes.
[0020] As used herein, the term “facial landmark” is used to refer to specific key localized points on the frontal face, such as eye contours, eyebrow contours, nose, mouth corners, lip, chin, pupils, nostrils, and the like, that can be detected and tracked. Traditionally, a set of facial landmarks includes 68 landmarks that are used to represent the face.
[0021] Furthermore, as used herein, the term “real-time” is used to refer to imperceptible delays in user experience of multimedia content. By way of example, “real-time” processing entails continuous processing of at least 25 frames per second of video and aural content. The real-time processing is typically dependent upon the application. Further, the term “real-time” processing may also be used to encompass “near real-time” processing.
[0022] Referring now to the drawings, FIG. 1 illustrates an exemplary system 100 for real-time generation of talking videos on an edge device 102. In particular, the system 100 is configured to leverage pre-generated visemes, efficient phoneme selection, and optimized resource utilization to enable real-time generation of realistic talking videos on the edge device 102. The system 100 generates the realistic talking videos by enabling real-time video animation synchronized with live audio input. Also, the system 100 is configured to enhance quality of experience for a user or viewer on an edge device by facilitating real-time generation of talking videos that are synchronized with audio on the edge device. The system 100 employs a live audio stream and a static image to animate lip-sync videos efficiently, optimizing the use of limited computational and memory resources of the edge device 102. The system 100 is specifically designed for low-resource environments, allowing for offline functionality and seamless real-time playback.
[0023] In accordance with aspects of the present specification, the system 100 is configured to receive an audio input 104. The input audio 104 may include a text sent by a source or contact 106. Additionally or alternatively, the input 104 may include an audio such as a live audio sent by the source 106. It may be noted that in certain embodiments the input 104 may be received from the source 106 that is non-human such as an automobile.
[0024] In a presently contemplated configuration, the system 100 includes an edge device 102. As previously noted, the edge device 102 may be a mobile phone, a smart TV, a tablet, a laptop, a streaming media player, a gaming device/console, a router, a modem, and the like. In accordance with aspects of the present specification, to facilitate the real-time generation of the talking videos, the edge device 102 includes a talking video generating system 108. The talking video generating system 108 is configured to generate realistic talking videos from singular static images on the edge device 102 with limited computational resources. More particularly, using a single static image corresponding to the source or contact 106, the talking video generating system 108 facilitates real-time generation of realistic talking videos that are synchronized with live audio on the edge device 102. The talking video generating system 108 employs a live audio stream and a static image of the source 106 to animate lip-sync videos efficiently, optimizing the use of limited computational and memory resources of the edge device 102.
[0025] In the embodiment depicted in FIG. 1, the talking video generating system 108 is shown as including an acquisition subsystem 110 that is configured to receive the input audio 104 sent by the source 106. Additionally, the acquisition subsystem 110 is also configured to receive information associated with the source 106. By way of example, the information associated with the source 106 may include a single static image of the source 106, their contact number, and other information.
[0026] Further, the talking video generating system 108 includes a processing subsystem 112 that is operatively associated with the acquisition subsystem 110. The processing subsystem 112 is configured to receive the input audio 104 and the information about the source 106 from the acquisition subsystem 110 and process the input audio 104 and the single static image of the source 106 to generate a realistic talking video of the source 106 that is synchronized with live audio on the edge device 102. In one embodiment, to facilitate the generation of the talking video, the processing subsystem 112 includes a realistic talking video generating platform 114.
[0027] In accordance with aspects of the present specification, the realistic talking video generating platform 114 is configured to facilitate the real-time generation of realistic talking videos from static images on the edge device 102 by efficiently combining audio processing, landmark-based facial animation, and optimized computational techniques that are specifically designed for low-resource environments, allowing for offline functionality and seamless real-time playback.
[0028] To that end, the realistic talking video generating platform 114 is configured to pre-generate a set of visemes corresponding to a set of distinguishable phonemes associated with each of one or more sources 106. As previously noted, the sources may include one or more contacts stored in the edge device 102. Additionally, the realistic talking video generating platform 114 is configured to process the input audio 104 to generate a talking video corresponding to the source 106 that is synchronized with live audio using the pre-generated visemes. In particular, the realistic talking video generating platform 114 is configured to employ the input audio 104 such as a live audio stream and a static image corresponding to the source 106 to animate lip-sync videos efficiently, optimizing the use of limited computational and memory resources of the edge device 102.
[0029] As previously noted, the realistic talking video generating platform 114 is configured to pre-generate a set of visemes corresponding to a set of distinguishable phonemes associated with each of one or more sources 106. Accordingly, for each source, the realistic talking video generating platform 114 is configured to receive a reference image corresponding to a source 106. The realistic talking video generating platform 114 may be configured to process the reference image to identify one or more facial landmarks. In one embodiment, the realistic talking video generating platform 114 may employ a face landmarks model to identify the facial landmarks. As previously noted, the facial landmarks refer to specific key localized points on the frontal face, such as eye contours, eyebrow contours, nose, mouth corners, lip, chin, pupils, nostrils, and the like, that can be detected and tracked.
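By way of a non-limiting illustration only, the landmark-identification step may be sketched as follows with an off-the-shelf 68-point predictor. The dlib-based pipeline and the predictor file name are assumptions made for this sketch and are not the face landmarks model prescribed by the present specification.

```python
# Illustrative sketch only: extract 68 facial landmarks from a reference image
# using dlib's publicly available predictor (an assumed choice, not the
# platform's prescribed model).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def extract_facial_landmarks(image_path):
    """Return a (68, 2) array of (x, y) landmark coordinates for the first detected face."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        raise ValueError("No face detected in the reference image")
    shape = predictor(gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
```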
[0030] Subsequently, for each distinguishable phoneme, the realistic talking video generating platform 114 may be configured to modify the facial landmarks corresponding to the source 106 based on the distinguishable phoneme and corresponding viseme landmarks. Each viseme landmark may include the positions of facial landmarks corresponding to a given distinguishable phoneme. Accordingly, the realistic talking video generating platform 114 may retrieve or receive a set of distinguishable phonemes and a set of viseme landmarks corresponding to each distinguishable phoneme associated with the input reference image of the source 106. In one embodiment, the realistic talking video generating platform 114 may retrieve the set of distinguishable phonemes and a set of viseme landmarks corresponding to each distinguishable phoneme from a data storage such as a data repository 120 on the edge device 102 corresponding to the source 106. However, in other embodiments, a set of distinguishable phonemes and a set of viseme landmarks corresponding to each distinguishable phoneme may also be retrieved from other storage means such as, but not limited to, physical storage devices such as local or remote hard disks, CDs, DVDs, Blu-ray disks, and the like. Use of other means of storage is also envisaged.
[0031] For each distinguishable phoneme, once the realistic talking video generating platform 114 receives the distinguishable phoneme and corresponding viseme landmarks associated with the input reference image of the source 106, the realistic talking video generating platform 114 is configured to modify or morph the facial landmarks based on the distinguishable phoneme and the corresponding viseme landmarks. In particular, the realistic talking video generating platform 114 may be configured to map the facial landmarks on the reference image based on the viseme landmarks corresponding to the distinguishable phoneme to generate a viseme data map. This process may be repeated for each distinguishable phoneme in the set of distinguishable phonemes to generate a set of viseme data maps corresponding to the reference image of the source 106.
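The mapping of facial landmarks to viseme data maps may be illustrated, in a simplified and non-limiting form, by the sketch below. The assumption that viseme landmarks are stored as target positions for the mouth-region points of the 68-point scheme is made only for this illustration; the actual data layout used by the platform 114 is not specified here.

```python
# Illustrative sketch: build one viseme data map per distinguishable phoneme by
# replacing the mouth-region landmarks of the reference face with the viseme
# landmark positions for that phoneme (data layout assumed for illustration).
import numpy as np

MOUTH_IDX = list(range(48, 68))  # mouth points in the common 68-landmark scheme

def build_viseme_data_maps(face_landmarks, viseme_landmarks_by_phoneme):
    """face_landmarks: (68, 2) array; viseme_landmarks_by_phoneme: {phoneme: (20, 2) array}."""
    viseme_data_maps = {}
    for phoneme, mouth_target in viseme_landmarks_by_phoneme.items():
        data_map = face_landmarks.astype(float).copy()
        data_map[MOUTH_IDX] = mouth_target   # morph the mouth pose toward the viseme
        viseme_data_maps[phoneme] = data_map
    return viseme_data_maps
```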
[0032] In accordance with further aspects of the present specification, the realistic talking video generating platform 114 may also be configured to generate one or more blink sequence visemes. These blink sequence visemes are configured to simulate natural eye movement, adding realism to the animation via natural eye blinks. In one embodiment, the realistic talking video generating platform 114 may be configured to adjust the facial landmarks to generate the blink sequence visemes. By way of example, the adjustments of the facial landmarks are driven by intermediate predictions of three-dimensional (3D) landmark displacements. Use of the blink sequence visemes enhances the realism of the generated talking video by simulating involuntary human behavior.
[0033] Moreover, for each phoneme, the realistic talking video generating platform 114 may be configured to modify the reference image of the source 106 based on an associated viseme data map to generate a corresponding viseme. In one embodiment, an image-to-image generation model may be used to modify the reference image of the source 106 based on the associated viseme data map to generate the corresponding viseme. This process may be repeated to modify the reference image of the source 106 based on each of the viseme data maps to generate corresponding visemes. The visemes so generated are tailored to accurately depict facial movements corresponding to the distinguishable phonemes/speech sounds. Consequently, for a reference image associated with each source 106 a set of visemes corresponding to the distinguishable phonemes is generated. Furthermore, the set of visemes and the set of blink sequence visemes may be locally stored in the data repository 120 on the edge device 102.
[0034] The process of generating the set of visemes and the set of blink sequence visemes may be repeated for each source 106 in the list of sources saved on the edge device 102. Consequent to this processing by the realistic talking video generating platform 114, a set of visemes and a set of blink sequence visemes may be pre-generated for each source 106 and stored locally in the data repository 120 on the edge device 102 to minimize processing overhead during real-time playback. It may be noted that the realistic talking video generating platform 114 is configured to create a minimal set of discrete visemes that is optimized to represent essential speech movements, thereby reducing computational complexity.
[0035] In accordance with exemplary aspects of the present specification, the realistic talking video generating platform 114 is configured to generate realistic talking videos on the edge device 102 by leveraging the pre-generated visemes, the pre-generated blink sequence visemes, efficient phoneme selection, and optimized resource utilization of the edge device 102, thereby enabling real-time video animation synchronized with live audio input and providing a groundbreaking solution for edge-based lip-sync animation.
[0036] Accordingly, the realistic talking video generating platform 114 is configured to receive an audio input 104. The audio input 104 may be a text or a live audio input. The received audio input 104 may be processed or analyzed in real-time. By way of example, if the audio input 104 includes a text, the received text may be processed by the realistic talking video generating platform 114 to generate an audio output that is representative of live audio input. In one example, a text-to-speech model may be used to convert the text to a corresponding audio output. However, if an audio 104 representative of live audio input is directly received, then the step of processing the text may be omitted.
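By way of a non-limiting illustration, an offline text-to-speech step may resemble the following sketch; pyttsx3 is an assumed stand-in chosen here only because it runs without network access, and it is not the text-to-speech model referenced by the specification.

```python
# Illustrative sketch only: convert received text to an audio file fully
# offline using pyttsx3 (an assumed library choice).
import pyttsx3

def text_to_audio(text, out_path="tts_output.wav"):
    engine = pyttsx3.init()
    engine.save_to_file(text, out_path)   # synthesize speech without network access
    engine.runAndWait()
    return out_path
```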
[0037] Moreover, continuing the audio analysis, the live audio input 104 may be processed by the realistic talking video generating platform 114 in real-time to identify or detect one or more essential phonemes. It may be noted that a set of essential phonemes represents a minimal set of distinguishable phonemes that is optimized to represent essential speech movements and is required for accurate lip synchronization in an animated or talking video. This set of essential phonemes may be sufficient to optimally process the live audio input 104 to generate corresponding visemes, which in turn aids in the generation of a realistic animated or talking video. The selection of the essential phonemes aids in reducing computational complexity. In certain embodiments, a lightweight algorithm may be employed to facilitate the efficient detection of the essential phonemes with minimal latency. Also, in one example, a phoneme recognition model may be utilized to efficiently detect the essential phonemes.
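The notion of selecting essential phonemes may be illustrated by the simplified sketch below. It assumes that a phoneme stream has already been produced by a recognition model and collapses consecutive phonemes that map to the same visible mouth shape; the ARPAbet grouping shown is illustrative only and is not the platform's actual phoneme-to-viseme assignment.

```python
# Illustrative sketch: keep only phonemes that change the visible mouth shape.
# The grouping below is an assumption made for this example.
PHONEME_TO_VISEME_CLASS = {
    "AA": "open", "AE": "open", "AH": "open",
    "P": "closed", "B": "closed", "M": "closed",
    "F": "lip_teeth", "V": "lip_teeth",
    "OW": "rounded", "UW": "rounded",
    # remaining phonemes would be mapped to a small number of classes
}

def select_essential_phonemes(phoneme_stream):
    essential, last_class = [], None
    for phoneme in phoneme_stream:
        viseme_class = PHONEME_TO_VISEME_CLASS.get(phoneme, "neutral")
        if viseme_class != last_class:       # skip phonemes that repeat the mouth shape
            essential.append(phoneme)
            last_class = viseme_class
    return essential
```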
[0038] Additionally, to facilitate the generation of realistic talking videos, the realistic talking video generating platform 114 may be configured to receive contact information associated with the source 106 of the input audio 104. In one example, the realistic talking video generating platform 114 may be configured to receive a static image. The static image may represent the source 106, in one example. If an image of the source 106 is not available, the realistic talking video generating platform 114 may be configured to generate an avatar or cartoon image of the source 106 and use the generated avatar/cartoon image for further processing. In one embodiment, the static image may be processed by the realistic talking video generating platform 114 to extract facial landmarks. These landmarks may be mapped to the pre-generated visemes that are locally stored on the edge device 102.
[0039] Subsequent to the receipt of the static image of the source 106 and the input audio 104, an animation pipeline may be initiated. Accordingly, for each of the selected essential phonemes associated with the source 106 of the input audio 104, one or more corresponding pre-generated visemes may be retrieved from the local storage on the edge device 102 such as the data repository 120. More particularly, for each detected essential phoneme, a corresponding set of one or more pre-generated visemes associated with the source 106 is retrieved from the data repository 120. Additionally, one or more blink sequence visemes associated with the source 106 may also be retrieved from the data repository 120. The blink sequence visemes are employed to add realism to the animation. Consequent to this processing, a pre-generated set of visemes and a pre-generated set of blink sequence visemes associated with the source 106 and corresponding to the essential phonemes may be generated. It may be noted that the realistic talking video generating platform 114 is configured to create a minimal set of discrete visemes corresponding to the essential phonemes that is optimized to represent essential speech movements, thereby reducing computational complexity.
[0040] As will be appreciated, any transitions between visemes may include abrupt movements. To alleviate the issue of abrupt transitions, in accordance with aspects of the present specification, one or more intermediate frames may be generated to smooth any transition between the visemes. In one embodiment, the realistic talking video generating platform 114 may employ a landmark-based interpolation to generate one or more intermediate frames between selected visemes to ensure smooth and realistic transitions without reliance on exhaustive viseme morphing or large datasets. In one example, the realistic talking video generating platform 114 is configured to interpolate intermediate frames by averaging the positions of facial landmarks to ensure smooth transitions between visemes. Use of the intermediate frames prevents abrupt movements and creates fluid animations.
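A minimal sketch of such landmark-based interpolation is given below; it assumes each viseme carries the landmark array from which it was rendered and simply produces linearly blended landmark sets between two poses, the midpoint being the plain average of positions.

```python
# Illustrative sketch: generate intermediate landmark sets between two viseme
# poses by linear blending of landmark positions.
import numpy as np

def interpolate_landmarks(landmarks_a, landmarks_b, steps=2):
    """Return `steps` intermediate (68, 2) landmark arrays between two poses."""
    ts = np.linspace(0.0, 1.0, steps + 2)[1:-1]          # interior blend factors only
    return [landmarks_a + (landmarks_b - landmarks_a) * t for t in ts]
```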
[0041] In addition, the realistic talking video generating platform 114 may be configured to play back, in real-time, the visemes that are dynamically synchronized with the live audio input 104 to generate a realistic animated talking video. In particular, the realistic talking video generating platform 114 is configured to dynamically render the visemes in synchronization with the phonemes of the live audio input 104, thereby ensuring accurate lip synchronization and natural animation. Moreover, the realistic talking video generating platform 114 is also configured to dynamically align the selected visemes with the phonemes of the live audio input 104, while adapting to variations in speech pace and tone. In accordance with further aspects of the present specification, the realistic talking video generating platform 114 may also be configured to export the generated animated videos in a format compatible with online education, virtual assistants, and entertainment platforms.
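A highly simplified sketch of real-time playback scheduling is shown below. The frame schedule format and the `display_frame` callback are assumptions made only for this illustration; frames would typically be spaced about 40 ms apart to meet the 25 frames-per-second definition of real-time given earlier.

```python
# Illustrative sketch: present viseme frames against the live audio clock.
import time

def play_synchronized(frame_schedule, display_frame):
    """frame_schedule: iterable of (audio_timestamp_seconds, frame) pairs,
    typically spaced 40 ms apart for 25 frames per second."""
    start = time.monotonic()
    for timestamp, frame in frame_schedule:
        delay = timestamp - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)        # wait until the audio clock reaches this viseme
        display_frame(frame)         # a late frame could instead be dropped
```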
[0042] In a non-limiting example, the processing subsystem 112 may include one or more application-specific processors, digital signal processors, microcomputers, graphical processing units, microcontrollers, Application Specific Integrated Circuits (ASICs), Programmable Logic Arrays (PLAs), Field Programmable Gate Arrays (FPGAs), and/or any other suitable processing devices. In some embodiments, the processing subsystem 112 may also be configured to retrieve the set of visemes and the set of blink sequence visemes from the data repository 120. The data repository 120 may include a hard disk drive, a floppy disk drive, a read/write CD, a DVD, a Blu-ray disc, a flash drive, a solid-state storage device, a local database, and the like.
[0043] In addition, the examples, demonstrations, and/or process steps performed by certain components of the system 100 such as the processing subsystem 112 may be implemented by suitable code on a processor-based system, where the processor-based system may include a general-purpose computer or a special-purpose computer. Also, different implementations of the present specification may perform some or all of the steps described herein in different orders or substantially concurrently.
[0044] With continuing reference to FIG. 1, in certain embodiments, the talking video generating system 108 may be integrated with the edge device 102. However, in other embodiments, the talking video generating system 108 may be a standalone unit and may be communicatively coupled to the edge device 102. In a presently contemplated configuration depicted in FIG. 1, the talking video generating system 108 is depicted as being integrated with the edge device 102. In some embodiments, the talking video generating system 108 may include a display 116 and a user interface 118. However, in some other embodiments, the display 116 and the user interface 118 may be representative of the display and user interface of the edge device 102.
[0045] The display 116 and the user interface 118 may overlap in some embodiments such as a touch screen. Further, in some embodiments, the display 116 and the user interface 118 may include a common area. The display 116 may be configured to visualize or present any relevant information to the user of the edge device 102 or the source 106. Additionally, the realistic talking videos generated by the talking video generating system 108 and other information may be provided or visualized on the display 116 and/or the user interface 118.
[0046] Implementing the system 100 and the talking video generating system 108 that includes the realistic talking video generating platform 114 as described hereinabove aids in generating realistic talking videos from static images on edge devices 102 with limited computational resources. In particular, the system 100 is designed to effectively address the challenges of efficient lip-synchronization video generation in resource-constrained environments, enabling real-time applications without reliance on cloud-based infrastructure. Moreover, by leveraging pre-generated visemes, pre-generated blink sequence visemes, efficient phoneme selection, and optimized resource utilization, the system 100 enables real-time video animation that is synchronized with live audio input, thereby providing a groundbreaking solution for edge-based lip-sync animation. Additionally, the system 100 facilitates the real-time generation of realistic talking videos from static images on edge devices 102 by efficiently combining audio processing, landmark-based facial animation, and optimized computational techniques. The system 100 provides a robust framework for low-resource environments, allowing for offline functionality and seamless real-time playback. Furthermore, the system 100 is designed to provide offline operation functionality without internet connectivity by leveraging locally stored viseme data and real-time processing.
[0047] The overall operation of the talking video generating system 108 is designed to be highly efficient and robust, thereby ensuring a reliable and high-quality output. Furthermore, the talking video generating system 108 is designed to be highly flexible and customizable, thereby allowing the system 100 to be adapted for a wide range of applications and settings. Moreover, the system 100 revolutionizes the field of lip-sync video animation by addressing the limitations of existing technologies. The optimized design for edge devices, offline functionality, and real-time processing capabilities provide a robust solution for creating realistic talking videos in resource-constrained environments. The working of the system 100 may be better understood with reference to FIGs. 2-6.
[0048] Embodiments of the exemplary methods of FIGs. 2-4 may be described in a general context of computer executable instructions on computing systems or a processor. Generally, computer executable instructions may include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types.
[0049] Moreover, the embodiments of the exemplary methods may be practised in a distributed computing environment where optimization functions are performed by remote processing devices that are linked through a wired and/or wireless communication network. In the distributed computing environment, the computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
[0050] Additionally, in FIGs. 2-4 the exemplary methods are illustrated as a collection of blocks in a logical flow chart, which represents operations that may be implemented in hardware, software, firmware, or combinations thereof. It may be noted that the various operations are depicted in the blocks to illustrate the functions that are performed. In the context of software, the blocks represent computer instructions that, when executed by one or more processing subsystems, perform the recited operations.
[0051] Moreover, the order in which the exemplary methods are described is not intended to be construed as a limitation, and any number of the described blocks may be combined in any order to implement the exemplary methods disclosed herein, or equivalent alternative methods. Further, certain blocks may be deleted from the exemplary methods or augmented by additional blocks with added functionality without departing from the spirit and scope of the subject matter described herein.
[0052] Referring to FIG. 2, a flow chart 200 illustrating an exemplary method for real-time generation of realistic talking videos on an edge device, in accordance with aspects of the present specification, is presented. The method 200 may be described with reference to the components of FIG. 1. Also, the exemplary method 200 for the real-time generation of realistic talking videos on an edge device may be performed by the realistic talking video generating platform 114.
[0053] In accordance with aspects of the present specification, the method 200 includes pre-generating and storing a set of one or more visemes corresponding to a set of distinguishable phonemes for each source or contact 106 stored on the edge device 102, as indicated by step 202. As previously noted, a viseme is a visual equivalent of a phoneme in a spoken language and defines the position of the face and mouth while a person is speaking. Additionally, at step 202, a set of blink sequence visemes may be pre-generated and stored. The blink sequence viseme is configured to simulate natural eye movement. At step 202, a set of visemes and a set of blink sequence visemes may be generated for each source 106 and stored on the edge device 102. The sets of visemes and sets of blink sequence visemes corresponding to the one or more sources 106 may be stored in the data repository 120 on the edge device 102. In accordance with aspects of the present specification, the pre-generation and storing of the sets of visemes and the sets of blink sequence visemes corresponding to a plurality of contacts/sources 106 of step 202 may be performed offline. The pre-generation of the sets of visemes and sets of blink sequence visemes corresponding to the one or more sources will be described in greater detail with reference to FIG. 3.
[0054] Further, as indicated by step 204, an audio input and information related to a source 106 of the audio input may be received. The source information may include an image of the source 106, for example. The audio input may include a text or a live audio input. Moreover, at step 206, the audio input may be processed. By way of example, if the audio input includes a text, the text may be processed to generate an audio output. However, if the audio input includes a live audio input, the processing step 206 may be omitted.
[0055] Subsequently, at step 208, the audio input 104 may be processed to select one or more essential phonemes from the audio input 104. As previously noted, the essential phonemes represent a minimal set of distinguishable phonemes that is optimized to represent essential speech movements and facilitate accurate lip synchronization. The selection of the essential phonemes aids in reducing computational complexity.
[0056] Moreover, as depicted by step 210, for each essential phoneme, one or more pre-generated visemes corresponding to the source 106 are retrieved. In one example, the one or more pre-generated visemes corresponding to the source 106 are retrieved from the data repository 120 on the edge device 102. Similarly, for each essential phoneme, one or more pre-generated blink sequence visemes corresponding to the source 106 are retrieved. By way of example, the one or more pre-generated blink sequence visemes corresponding to the source 106 are retrieved from the data repository 120 on the edge device 102.
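A simplified, non-limiting sketch of the retrieval of step 210 is shown below. The on-device directory layout keyed by contact and phoneme, and the file names, are assumptions made purely for illustration and are not the format of the data repository 120 prescribed by the specification.

```python
# Illustrative sketch: look up pre-generated visemes and blink frames for a
# contact from an assumed on-device directory layout.
from pathlib import Path
import cv2

REPOSITORY = Path("/data/viseme_repository")   # hypothetical local store

def retrieve_visemes(source_id, essential_phonemes):
    visemes = {}
    for phoneme in essential_phonemes:
        viseme_path = REPOSITORY / source_id / f"viseme_{phoneme}.png"
        if viseme_path.exists():
            visemes[phoneme] = cv2.imread(str(viseme_path))
    blink_paths = sorted((REPOSITORY / source_id / "blink").glob("*.png"))
    blink_sequence = [cv2.imread(str(p)) for p in blink_paths]
    return visemes, blink_sequence
```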
[0057] Furthermore, at step 214, one or more intermediate frames may be generated to smooth any transition between the visemes. In one embodiment, a landmark-based interpolation may be employed to generate one or more intermediate frames between selected visemes to ensure smooth and realistic transitions without reliance on exhaustive viseme morphing or large datasets. In one example, the intermediate frames may be interpolated by averaging the positions of facial landmarks to ensure smooth transitions between visemes. Use of the intermediate frames prevents abrupt movements and creates fluid animations.
[0058] Additionally, as indicated by step 216, the visemes and the blink sequence visemes are dynamically rendered in synchronization with phonemes of the live audio input in real-time on the edge device 102 to generate a realistic animated talking video 218 that is synchronized with the live audio input. Generating the realistic talking video 218 as described with reference to step 216 ensures accurate lip synchronization and natural animation. The generated realistic talking video 218 may be played back, in real-time. Moreover, the selected visemes and blink sequence visemes may also be dynamically aligned with the live audio input to adapt, in real-time, to variations in speech pace and tone to generate a realistic animated talking video 218 that is synchronized with the live audio input.
[0059] Consequent to the processing of steps 202-216, a realistic animated talking video 218 that is synchronized with the live audio input 104 is generated. Furthermore, at step 220, the realistic animated talking video 218 that is synchronized with the live audio input 104 may be provided to the user in real-time on the edge device 102. In one example, the realistic animated talking video 218 that is synchronized with the live audio input 104 may be displayed on the display 116 and/or the user interface 118 of the edge device 102. Moreover, in accordance with further aspects of the present specification, the realistic talking video generating platform 114 may also be configured to export the generated realistic animated talking videos 218 in a format compatible with online education, virtual assistants, and entertainment platforms.
[0060] Implementing the method 200 as described hereinabove aids in generating realistic talking videos from static images on edge devices with limited computational resources. The method 200 is designed to be highly efficient and robust, thereby ensuring a reliable and high-quality output. Also, the method 200 is highly flexible and customizable, thereby allowing the method 200 to be adapted for a wide range of applications and settings. The method is designed to leverage pre-generated visemes, pre-generated blink sequence visemes, efficient phoneme selection, and optimized resource utilization, to facilitate real-time video animation that is synchronized with live audio input, thereby providing a groundbreaking solution for edge-based lip-sync animation. Additionally, the method 200 supports the real-time generation of realistic talking videos from static images on edge devices by efficiently combining audio processing, landmark-based facial animation, and optimized computational techniques. The details of the method 200 may be better understood with reference to FIGs. 3-6.
[0061] Turning now to FIG. 3, a schematic illustration 300 of some steps of an exemplary method for generation of visemes and blink sequence visemes corresponding to a plurality of sources or contacts 106 for use in the method of FIG. 2, in accordance with aspects of the present specification, is presented. The schematic illustration 300 may be described with reference to the components of FIGs. 1-2.
[0062] It may be noted that in accordance with aspects of the present specification, the visemes and the blink sequence visemes may be pre-generated and stored in an offline operation in the data repository 120 on the edge device 102. Also, the exemplary method 300 for the pre-generation of visemes and blink sequence visemes corresponding to a plurality of sources or contacts 106 for use in the method of FIG. 2 may be performed by the realistic talking video generating platform 114.
[0063] In accordance with aspects of the present specification, the realistic talking video generating platform 114 is configured to pre-generate a set of visemes corresponding to a set of distinguishable phonemes associated with each of one or more sources 106. Moreover, the realistic talking video generating platform 114 is also configured to pre-generate a set of blink sequence visemes corresponding to each of one or more sources 106. As previously noted, the sources 106 may include one or more contacts stored in the edge device 102. For ease of explanation, the pre-generation and storing of a set of visemes corresponding to a set of distinguishable phonemes and a set of blink sequence visemes is described with reference to a single source or contact 106. This process may be repeated for each of the sources 106, and the results stored in the data repository 120.
[0064] For a source 106, to pre-generate a set of visemes corresponding to a set of distinguishable phonemes, a reference image 302 corresponding to the source 106 is received by the realistic talking video generating platform 114. At step 304, the reference image 302 may be processed by the realistic talking video generating platform 114 to identify one or more facial landmarks 306. In one embodiment, the realistic talking video generating platform 114 may employ a face landmarks model to process the input reference image 302 to identify the facial landmarks 306.
[0065] Further, the realistic talking video generating platform 114 is configured to receive or retrieve as input a set of distinguishable phonemes 310 and a set of viseme landmarks 312 corresponding to each distinguishable phoneme 310. In one embodiment, the set of distinguishable phonemes 310 and the set of viseme landmarks 312 corresponding to each distinguishable phoneme 310 may be retrieved from the data repository 120 on the edge device 102. However, in other embodiments, a set of distinguishable phonemes 310 and a set of viseme landmarks 312 corresponding to each distinguishable phoneme 310 may also be retrieved from other storage means such as, but not limited to, physical storage devices such as local or remote hard disks, CDs, DVDs, Blu-ray disks, and the like. Use of other means of storage is also envisaged.
[0066] Moreover, as indicated by step 308, for each distinguishable phoneme 310, the realistic talking video generating platform 114 is configured to modify the facial landmarks 306 corresponding to the source 106 based on the distinguishable phoneme 310 and corresponding viseme landmarks. As previously noted, each viseme landmark associated with the source 106 may include the positions of facial landmarks 306 corresponding to each distinguishable phoneme 310.
[0067] Accordingly, subsequent to receiving a distinguishable phoneme 310 and corresponding viseme landmarks 312 associated with the input reference image 302 of the source 106, the realistic talking video generating platform 114 is configured to modify or morph the facial landmarks 306 based on the distinguishable phoneme 310 and the corresponding viseme landmarks 312. Specifically, the realistic talking video generating platform 114 may be configured to map the facial landmarks 306 on the reference image 302 based on the viseme landmarks 312 corresponding to the distinguishable phoneme 310 to generate a viseme data map 314 that is associated with that distinguishable phoneme 310. This process may be repeated for each distinguishable phoneme 310 in the set of distinguishable phonemes 310 to generate a set of viseme data maps 314 corresponding to the reference image 302 of the source 106, where the set of viseme data maps 314 includes viseme data maps corresponding to each distinguishable phoneme 310.
[0068] Additionally, the realistic talking video generating platform 114 may also be configured to generate one or more blink sequence visemes 322, as depicted by step 320. In certain embodiments, the realistic talking video generating platform 114 may be configured to adjust the facial landmarks 306 to generate the blink sequence visemes 322 associated with the source 106. In one non-limiting example, the adjustments of the facial landmarks are driven by intermediate predictions of three-dimensional (3D) landmark displacements. As previously noted, these blink sequence visemes 322 are configured to simulate natural eye movement, adding realism to the animation via natural eye blinks. Also, use of the blink sequence visemes 322 enhances the realism of the generated talking video by simulating involuntary human behavior.
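By way of a non-limiting illustration, the eyelid adjustment of step 320 may be sketched as below. The 68-point eyelid indices and the linear closure schedule are assumptions made only for this example; the specification itself contemplates adjustments driven by predicted 3D landmark displacements.

```python
# Illustrative sketch: produce a blink sequence by moving the upper-eyelid
# landmarks toward the lower-eyelid landmarks and back (indices follow the
# common 68-point scheme; the schedule is assumed for illustration).
import numpy as np

UPPER_LIDS = [37, 38, 43, 44]    # upper eyelid points, right and left eye
LOWER_LIDS = [41, 40, 47, 46]    # corresponding lower eyelid points

def blink_sequence_landmarks(face_landmarks, closing_frames=4):
    sequence = []
    for t in np.linspace(0.0, 1.0, closing_frames):      # 0 = open, 1 = closed
        lm = face_landmarks.astype(float).copy()
        lm[UPPER_LIDS] += (lm[LOWER_LIDS] - lm[UPPER_LIDS]) * t
        sequence.append(lm)
    return sequence + sequence[-2::-1]                    # close, then reopen
```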
[0069] Further, as depicted by step 316, for each distinguishable phoneme 310, the realistic talking video generating platform 114 may be configured to generate a corresponding viseme. In one example, to generate a viseme corresponding to the distinguishable phoneme 310, the realistic talking video generating platform 114 may be configured to retrieve a viseme data map 314 that corresponds to the distinguishable phoneme 310 from the data repository 120 on the edge device 102. Subsequently, the reference image 302 of the source 106 may be modified or morphed based on an associated viseme data map 314 to generate a viseme that corresponds to a distinguishable phoneme 310. In one embodiment, an image-to-image generation model may be used to modify the reference image 302 of the source 106 based on the associated viseme data map 314 to generate the corresponding viseme.
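The specification relies on an image-to-image generation model for this step; purely for illustration, the sketch below substitutes a landmark-driven piecewise affine warp from scikit-image as a lightweight stand-in, which is not the model contemplated by the disclosure. The image corners are added as anchor points so the background is preserved while the facial landmarks move to the viseme data map positions.

```python
# Illustrative stand-in only: warp the reference image so that its landmarks
# move to the positions in the viseme data map (a piecewise affine warp is
# substituted here for the image-to-image generation model of the disclosure).
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def render_viseme(reference_image, face_landmarks, viseme_data_map):
    h, w = reference_image.shape[:2]
    corners = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]], dtype=float)
    # warp() expects a transform from output coordinates back to input coordinates
    tform = PiecewiseAffineTransform()
    tform.estimate(np.vstack([viseme_data_map, corners]),
                   np.vstack([face_landmarks, corners]))
    return warp(reference_image, tform)
```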
[0070] The process of step 316 may be repeated to modify the reference image 302 of the source 106 based on each of the viseme data maps 314 to generate a viseme that corresponds to each of the distinguishable phonemes 310. The visemes 318 are tailored to accurately depict facial movements corresponding to phonemes/speech sounds. Consequently, for a reference image 302 corresponding to each source 106, a set of visemes 318 corresponding to the set of distinguishable phonemes 310 is generated. The process of generating the set of visemes 318 and the set of blink sequence visemes 322 may be repeated for each source 106 in the list of sources/contacts 106 saved on the edge device 102.
[0071] Furthermore, as indicated by step 324, the set of visemes 318 and the set of blink sequence visemes 322 corresponding to the plurality of source/contacts 106 may be stored locally in the data repository 120 on the edge device 102. Locally storing the set of visemes 318 and the set of blink sequence visemes 322 on the edge device 102 aids in minimizing processing overhead during real-time playback.
[0072] Consequent to the processing of steps 302-324 by the realistic talking video generating platform 114, the set of visemes 318 and the set of blink sequence visemes 322 may be pre-generated for each source 106. It may be noted that the realistic talking video generating platform 114 is configured to create a minimal set of discrete visemes 318 that is optimized to represent essential speech movements, thereby reducing computational complexity.
[0073] Referring to FIG. 4, a schematic illustration 400 of one embodiment of the method 200 for real-time generation of realistic talking videos on an edge device of FIG. 2, in accordance with aspects of the present specification, is presented. The schematic illustration 400 may be described with reference to the components of FIGs. 1-3. Also, the steps of the method 400 may be performed by the realistic talking video generating platform 114.
[0074] The method 400 enhances quality of experience for a user or viewer on the edge device 102 by facilitating real-time generation of talking videos that are synchronized with live audio on the edge device 102. In particular, the method 400 employs a live audio stream and a static image to animate lip-sync videos efficiently, optimizing the use of limited computational and memory resources of the edge device 102.
[0075] In accordance with exemplary aspects of the present specification, the realistic talking video generating platform 114 is configured to generate realistic talking videos on the edge device 102 by leveraging the pre-generated visemes 318, the pre-generated blink sequence visemes 322, efficient phoneme selection, and optimized resource utilization of the edge device 102, thereby enabling real-time video animation synchronized with live audio input 402 and providing a groundbreaking solution for edge-based lip-sync animation.
[0076] To generate realistic talking videos on the edge device 102, the realistic talking video generating platform 114 is configured to receive an audio input 402. The audio input 402 may be a text or a live audio input. Additionally, contact information 404 associated with the source of the input audio 402 may also be received by the realistic talking video generating platform 114.
[0077] In some embodiments, the contact information 404 associated with the source 106 may be extracted from the contact list stored on the edge device 102. In some embodiments, source/contact information 404 may be retrieved from the data repository 120 on the edge device 102. By way of example, the contact information 404 may include an image of the source/contact 106 and may be retrieved from the edge device 102. It may be noted that if an image of the source 106 is not readily available, the realistic talking video generating platform 114 may be configured to generate an avatar or a cartoon character to represent the source 106.
[0078] Furthermore, the received audio input 402 may be processed or analyzed in real-time, as indicated by step 406. By way of example, if the audio input 402 includes text, the received text may be processed by the realistic talking video generating platform 114 to generate an audio that is representative of a live audio input. In one example, a text-to-speech model may be used to convert the text to a corresponding audio output. However, if the received audio input 402 is already representative of a live audio input, the text processing of step 406 may be omitted.
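By way of illustration only, the optional text-to-audio conversion may be performed with any offline text-to-speech engine; the sketch below uses pyttsx3 as one off-the-shelf example, which is not mandated by the present specification.

```python
# Minimal sketch: convert a text input into speech audio before phoneme
# selection. pyttsx3 is one off-the-shelf offline engine; the specification
# does not tie the platform to any particular text-to-speech model.
import pyttsx3

def text_to_audio(text, out_path="generated_audio.wav"):
    engine = pyttsx3.init()
    engine.save_to_file(text, out_path)   # synthesize offline, no cloud call
    engine.runAndWait()
    return out_path
```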
[0079] Moreover, at step 408, the live audio input 402 may be processed by the realistic talking video generating platform 114 in real-time to identify or select one or more essential phonemes 410. As previously noted, a set of essential phonemes 410 represents a minimal set of distinguishable phonemes 310 that is optimized to characterize essential speech movements and is required for accurate lip synchronization in an animated or talking video. This set of essential phonemes 410 may be sufficient to optimally process the live audio input 402 to generate corresponding visemes 318, which in turn aids in the generation of a realistic animated or talking video. The selection of the essential phonemes 410 aids in reducing computational complexity. In certain embodiments, a lightweight algorithm may be employed to facilitate the efficient detection of the essential phonemes 410 with minimal latency. Also, in one example, a phoneme recognition model may be utilized to efficiently detect the essential phonemes.
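A non-limiting sketch of the phoneme selection of step 408 is given below: recognized phonemes are collapsed into a small set of essential groups that drive distinct lip shapes. The grouping shown is a common illustrative mapping rather than the exact set used by the platform, and recognise_phonemes stands in for any lightweight phoneme recognition model.

```python
# Minimal sketch of step 408: collapse recognised phonemes into a small set of
# "essential" phonemes that drive distinct lip shapes. The grouping below is an
# illustrative mapping only, not the exact set used by the platform.
ESSENTIAL_GROUPS = {
    "P": {"P", "B", "M"},            # closed lips
    "F": {"F", "V"},                 # lower lip to upper teeth
    "TH": {"TH", "DH"},
    "S": {"S", "Z", "T", "D", "N"},
    "CH": {"CH", "JH", "SH", "ZH"},
    "AA": {"AA", "AE", "AH"},        # open mouth
    "O": {"AO", "OW", "UW", "W"},    # rounded lips
    "E": {"EH", "IY", "IH", "Y"},
    "SIL": {"SIL"},                  # silence / mouth at rest
}
PHONEME_TO_ESSENTIAL = {p: g for g, members in ESSENTIAL_GROUPS.items() for p in members}

def select_essential_phonemes(audio_chunk, recognise_phonemes):
    """Map each (phoneme, start, end) from the recogniser to its essential group."""
    return [(PHONEME_TO_ESSENTIAL.get(p, "SIL"), start, end)
            for p, start, end in recognise_phonemes(audio_chunk)]
```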
[0080] Subsequently, an animation pipeline may be initiated. Accordingly, at step 412, one or more pre-generated visemes 318 corresponding to the selected essential phonemes 410 associated with the source 106 of the input audio 402 may be retrieved from the local storage on the edge device 102 such as the data repository 120. In particular, for each selected essential phoneme 410, a corresponding set of one or more pre-generated visemes 318 associated with the source 106 may be retrieved from the data repository 120 on the edge device 102.
[0081] Additionally, one or more blink sequence visemes 322 associated with the source 106 may also be retrieved from the data repository 120 on the edge device 102, as depicted by step 414. The blink sequence visemes 322 are employed to add realism to the animation. Consequent to this processing, a set of visemes 318 and a set of blink sequence visemes 322 associated with the source 106 and corresponding to the essential phonemes 410 may be created. This group of visemes may be generally represented by reference numeral 416. It may be noted that the realistic talking video generating platform 114 is configured to create a minimal set of discrete visemes 318 corresponding to the essential phonemes 410 that is optimized to represent essential speech movements, thereby reducing computational complexity.
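As a non-limiting illustration of steps 412 and 414, and assuming the hypothetical per-source cache layout sketched earlier, the lookup may be as simple as the following:

```python
# Minimal sketch of steps 412-414: fetch the cached viseme frame for each
# selected essential phoneme of the source, together with the blink sequence.
# The cache layout and file naming are assumptions of this sketch.
import numpy as np

def retrieve_frames(source_id, essential_phonemes, cache_dir="viseme_cache"):
    """essential_phonemes: list of (phoneme, start_s, end_s) tuples."""
    data = np.load(f"{cache_dir}/{source_id}.npz")
    blink_sequence = data["blink"]
    selected = [(data[f"viseme_{p}"], start, end)
                for p, start, end in essential_phonemes
                if f"viseme_{p}" in data.files]
    return selected, blink_sequence
```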
[0082] As will be appreciated, there may be some abrupt movements in a generated talking video. Traditionally, large datasets or exhaustive viseme morphing are required to produce smooth transitions. To alleviate this issue, in accordance with aspects of the present specification, one or more intermediate frames may be generated to smooth any transition between the visemes 318, as indicated by step 418. In one embodiment, the realistic talking video generating platform 114 may employ a landmark-based interpolation to generate one or more intermediate frames between selected visemes to ensure smooth and realistic transitions without reliance on exhaustive viseme morphing or large datasets. In one example, the realistic talking video generating platform 114 is configured to interpolate intermediate frames by averaging the positions of facial landmarks 306 to ensure smooth transitions between visemes 318. Use of the intermediate frames prevents abrupt movements and creates fluid animations.
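A minimal sketch of the landmark-based interpolation of step 418 is shown below, assuming each viseme is described by an (N, 2) array of facial landmark coordinates; each interpolated landmark set may then be rendered with the same warping routine used for the pre-generated visemes.

```python
# Minimal sketch of step 418: landmark-based interpolation between two visemes.
# Intermediate landmark sets are weighted averages of the endpoint landmarks.
import numpy as np

def intermediate_landmarks(landmarks_a, landmarks_b, num_frames=3):
    """Yield num_frames landmark sets blending viseme A into viseme B."""
    for i in range(1, num_frames + 1):
        t = i / (num_frames + 1)              # interpolation weight in (0, 1)
        yield (1.0 - t) * landmarks_a + t * landmarks_b
```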
[0083] Moreover, at step 420, the realistic talking video generating platform 114 may be configured to create and play back, in real-time, the visemes 318 that are dynamically synchronized with the live audio input 402 to generate a realistic animated talking video. In particular, the realistic talking video generating platform 114 is configured to dynamically render the visemes 318 in synchronization with the phonemes of the live audio input 402, thereby ensuring accurate lip synchronization and natural animation. Furthermore, the realistic talking video generating platform 114 is also configured to dynamically align the selected visemes 318 with the phonemes of the live audio input 402, while adapting to variations in speech pace and tone. Consequent to the processing of steps 402-420, a realistic talking video/animation 422 that is synchronized with the live audio input 402 is generated. It may be noted that the realistic talking video/animation 422 that is synchronized with the live audio input 402 may be created based on the static image of the source 106 and the pre-generated visemes 318 and blink sequence visemes 322. Furthermore, in accordance with further aspects of the present specification, the realistic talking video generating platform 114 may also be configured to export the generated realistic animated talking videos 422 in a format compatible with online education, virtual assistants, and entertainment platforms.
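By way of a non-limiting illustration of step 420, the renderer may be driven by the audio clock rather than by a fixed frame counter, so that the animation naturally adapts to variations in speech pace; display_frame and audio_clock in the sketch below are placeholders for the edge device's rendering surface and audio playback position.

```python
# Minimal sketch of step 420: schedule frames against the audio clock so the
# animation keeps pace with the live audio input. display_frame() and
# audio_clock() are placeholders for device-specific rendering and audio APIs.
import time

def render_synchronised(frames, audio_clock, display_frame, fps=25):
    """frames: list of (frame, start_s, end_s) aligned to the audio timeline."""
    frame_period = 1.0 / fps
    idx = 0
    while idx < len(frames):
        now = audio_clock()                     # seconds of audio played so far
        # Skip frames whose window has already passed (speech sped up) and hold
        # the current frame if the audio lags behind (speech slowed down).
        while idx < len(frames) - 1 and frames[idx][2] < now:
            idx += 1
        frame, start_s, end_s = frames[idx]
        if start_s <= now:
            display_frame(frame)
        time.sleep(frame_period)
```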
[0084] FIG. 5 is a diagrammatical illustration of one embodiment 500 of the realistic talking video generating platform 114 for use in the system 100 of FIG. 1, in accordance with aspects of the present specification. The working of FIG. 5 is described with reference to the components of FIGs. 1-4.
[0085] In a presently contemplated configuration, the realistic talking video generating platform 114 may include a landmark generation unit 502. The landmark generation unit 502 is configured to process a static image 302 to identify one or more facial landmarks 306. The facial landmarks 306 represent key points on the face, such as, but not limited to, the eyes, nose, mouth, and chin, and serve as the foundation for facial animations.
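As one non-limiting realization of the landmark generation unit 502, an on-device landmark detector such as MediaPipe Face Mesh may be used, as in the following sketch; the specification does not tie the unit to any particular library.

```python
# Minimal sketch of the landmark generation unit 502 using MediaPipe Face Mesh,
# one widely available on-device landmark detector (an assumption of this sketch).
import cv2
import mediapipe as mp

def detect_facial_landmarks(image_bgr):
    """Return a list of (x, y) pixel landmarks for the first detected face."""
    h, w = image_bgr.shape[:2]
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as mesh:
        result = mesh.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return []
    return [(lm.x * w, lm.y * h) for lm in result.multi_face_landmarks[0].landmark]
```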
[0086] Furthermore, the realistic talking video generating platform 114 may include a viseme creation unit 504. The viseme creation unit 504 is configured to modify the identified facial landmarks 306 to create visemes 318 corresponding to a set of distinguishable phonemes 310. As previously noted, the visemes 318 are visual representations of the distinguishable phonemes 310. The visemes 318 are tailored to accurately depict facial movements corresponding to speech sounds.
[0087] In addition, the realistic talking video generating platform 114 may include a blink sequence viseme generation unit 506. The blink sequence viseme generation unit 506 is configured to adjust the facial landmarks 306 to generate a set of blink sequence visemes 322. Further, the blink sequence visemes 322 represent natural eye movements. Also, the blink sequence visemes 322 are added to the animation to enhance the realism of the generated talking video 422 by simulating involuntary human behavior.
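A minimal sketch of the blink sequence viseme generation unit 506 is given below: the upper-eyelid landmarks are moved toward the lower-eyelid landmarks and back over a few frames. The eyelid indices shown are example values for a 68-point landmark model and are assumptions of this sketch.

```python
# Minimal sketch of the blink sequence viseme generation unit 506: close and
# reopen the eyes by blending upper-eyelid landmarks toward the lower eyelids.
# The index lists below are example values for a 68-point landmark model.
import numpy as np

UPPER_LID_IDX = [37, 38, 43, 44]   # hypothetical upper-eyelid landmark indices
LOWER_LID_IDX = [41, 40, 47, 46]   # matching lower-eyelid landmark indices

def blink_landmark_sequence(landmarks, num_steps=3):
    """Yield landmark sets that close and reopen the eyes around the neutral pose."""
    closures = list(np.linspace(0.0, 1.0, num_steps)) + list(np.linspace(1.0, 0.0, num_steps))
    for c in closures:
        frame = landmarks.copy()
        frame[UPPER_LID_IDX] = (1.0 - c) * landmarks[UPPER_LID_IDX] + c * landmarks[LOWER_LID_IDX]
        yield frame
```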
[0088] Moreover, the realistic talking video generating platform 114 may include an intermediate frame generation unit 508. The intermediate frame generation unit 508 is configured to generate intermediate frames to ensure smooth transitions between visemes 318. In one example, the intermediate frame generation unit 508 interpolates intermediate frames by averaging the positions of facial landmarks 306. Use of the intermediate frames prevents abrupt movements and creates fluid animations.
[0089] Additionally, the realistic talking video generating platform 114 may include a phoneme selection unit 510. The phoneme selection unit 510 is configured to process the audio input 402 to extract one or more essential phonemes 410. As previously noted, the essential phonemes 410 represent a minimal set of distinguishable phonemes 310 required for accurate lip synchronization. This set of essential phonemes 410 may be sufficient to optimally process the live audio input 402 to generate corresponding visemes 318. The selection of the essential phonemes 410 aids in reducing computational complexity while maintaining accurate lip synchronization.
[0090] Also, the realistic talking video generating platform 114 may include a real-time playback unit 512. The real-time playback unit 512 is configured to dynamically render the visemes 318 in synchronization with the phonemes of the live audio input 402, ensuring accurate lip-sync and natural animation.
[0091] Referring now to FIG. 6, a schematic representation 600 of one embodiment 602 of a digital processing system implementing the realistic talking video generating platform 114 (see FIG. 1), in accordance with aspects of the present specification, is depicted. Also, FIG. 6 is described with reference to the components of FIGs. 1-5.
[0092] It may be noted that while the realistic talking video generating platform 114 is shown as being a part of the talking video generating system 108, in certain embodiments, the realistic talking video generating platform 114 may also be integrated into other end user systems. Moreover, the example of the digital processing system 602 presented in FIG. 6 is for illustrative purposes. Other designs are also anticipated.
[0093] The digital processing system 602 may contain one or more processors, such as a central processing unit (CPU) 604, together with a random access memory (RAM) 606, a secondary memory 608, a graphics controller 610, a display unit 612, a network interface 614, and an input interface 616. It may be noted that the components of the digital processing system 602, except the display unit 612, may communicate with each other over a communication path 618. In certain embodiments, the communication path 618 may include several buses, as is well known in the relevant arts.
[0094] The CPU 604 may execute instructions stored in the RAM 606 to provide several features of the present specification. Moreover, the CPU 604 may include multiple processing units, with each processing unit potentially being designed for a specific task. Alternatively, the CPU 604 may include only a single general-purpose processing unit.
[0095] Furthermore, the RAM 606 may receive instructions from the secondary memory 608 using the communication path 618. Also, in the embodiment of FIG. 6, the RAM 606 is shown as including software instructions constituting a shared operating environment 620 and/or other user programs 622 (such as other applications, DBMS, and the like). In addition to the shared operating environment 620, the RAM 606 may also include other software programs such as device drivers, virtual machines, and the like, which provide a (common) run time environment for execution of other/user programs.
[0096] With continuing reference to FIG. 6, the graphics controller 610 is configured to generate display signals (e.g., in RGB format) for display on the display unit 612 based on data/instructions received from the CPU 604. The display unit 612 may include a display screen to display images defined by the display signals. Furthermore, the input interface 616 may correspond to a keyboard and a pointing device (e.g., a touchpad, a mouse, and the like) and may be used to provide inputs. In addition, the network interface 614 may be configured to provide connectivity to a network (e.g., using Internet Protocol), and may be used to communicate with other systems connected to a network, for example.
[0097] Moreover, the secondary memory 608 may include a hard drive 626, a flash memory 628, and a removable storage drive 630. The secondary memory 608 may store data generated by the system 100 (see FIG. 1) and software instructions (for example, for implementing the various features of the present specification), which enable the digital processing system 602 to provide several features in accordance with the present specification. The code/instructions stored in the secondary memory 608 may either be copied to the RAM 606 prior to execution by the CPU 604 for higher execution speeds or may be directly executed by the CPU 604.
[0098] Some or all of the data and/or instructions may be provided on a removable storage unit 632, and the data and/or instructions may be read and provided by the removable storage drive 630 to the CPU 604. Further, the removable storage unit 632 may be implemented using a medium and storage format compatible with the removable storage drive 630 such that the removable storage drive 630 can read the data and/or instructions. Thus, the removable storage unit 632 includes a computer readable (storage) medium having stored therein computer software and/or data. However, the computer (or machine, in general) readable medium can also be in other forms (e.g., non-removable, random access, and the like).
[0099] It may be noted that as used herein, the term “computer program product” is used to generally refer to the removable storage unit 632 or a hard disk installed in the hard drive 626. These computer program products are means for providing software to the digital processing system 602. The CPU 604 may retrieve the software instructions and execute the instructions to provide various features of the present specification.
[0100] Also, the term “storage media/medium” as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may include non-volatile media and/or volatile media. Non-volatile media include, for example, optical disks, magnetic disks, or solid-state drives, such as the secondary memory 608. Volatile media include dynamic memory, such as the RAM 606. Common forms of storage media include, for example, a floppy disk, a flexible disk, a hard disk, a solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, an NVRAM, or any other memory chip or cartridge.
[0101] Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participate in transferring information between storage media. For example, the transmission media may include coaxial cables, copper wire, and fiber optics, including the wires that form the communication path 618. Moreover, the transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
[0102] Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present specification. Thus, appearances of the phrases “in one embodiment,” “in an embodiment” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
[0103] Furthermore, the described features, structures, or characteristics of the specification may be combined in any suitable manner in one or more embodiments. In the description presented hereinabove, numerous specific details are provided such as examples of programming, software modules, user selections, network transactions, database queries, database structures, hardware modules, hardware circuits, hardware chips, and the like, to provide a thorough understanding of embodiments of the specification.
[0104] The aforementioned components may be dedicated hardware elements such as circuit boards with digital signal processors or may be software running on a general-purpose computer or processor such as a commercial, off-the-shelf personal computer (PC). The various components may be combined or separated according to various embodiments of the invention.
[0105] Furthermore, the foregoing examples, demonstrations, and process steps such as those that may be performed by the system may be implemented by suitable code on a processor-based system, such as a general-purpose or special-purpose computer. It should also be noted that different implementations of the present specification may perform some or all of the steps described herein in different orders or substantially concurrently, that is, in parallel. Furthermore, the functions may be implemented in a variety of programming languages, including but not limited to C++, Python, and Java. Such code may be stored or adapted for storage on one or more tangible, machine readable media, such as on data repository chips, local or remote hard disks, optical disks (that is, CDs or DVDs), memory or other media, which may be accessed by a processor-based system to execute the stored code. Note that the tangible media may include paper or another suitable medium upon which the instructions are printed. For instance, the instructions may be electronically captured via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in the data repository or memory.
[0107] Embodiments of the systems and methods for real-time generation of realistic talking videos synchronized with audio on edge devices described hereinabove advantageously present a robust framework for generating realistic talking videos from static images on edge devices with limited computational resources. The system employs a live audio stream and a static image to animate lip-sync videos efficiently, optimizing the use of limited computational and memory resources. Furthermore, the system is designed to effectively address the challenges of efficient lip-synchronized video generation in resource-constrained environments, enabling real-time applications without reliance on cloud-based infrastructure. Moreover, by leveraging pre-generated visemes, pre-generated blink sequence visemes, efficient phoneme selection, and optimized resource utilization, the system enables real-time video animation that is synchronized with live audio input, thereby providing a groundbreaking solution for edge-based lip-sync animation. Additionally, the system facilitates the real-time generation of realistic talking videos from static images on edge devices by efficiently combining audio processing, landmark-based facial animation, and optimized computational techniques. Moreover, the system provides a robust framework for low-resource environments, allowing for offline functionality and seamless real-time playback. In particular, the system is designed to provide offline operation functionality without requiring internet connectivity by leveraging locally stored viseme data and real-time processing. Pre-generated visemes and stored configurations allow for uninterrupted performance in offline scenarios. The overall operation of the talking video generating system and method are designed to be highly efficient and robust, thereby ensuring a reliable and high-quality output. Furthermore, the talking video generating system is designed to be highly flexible and customizable, thereby allowing the system to be adapted for a wide range of applications and settings.
[0108] The system and method described hereinabove are designed for efficient resource optimization. In particular, the system is specifically designed for edge devices with limited computational and memory resources. Also, the system employs efficient algorithms and pre-generated assets to minimize processing overhead. Furthermore, the systems and methods support offline capability. For example, unlike existing solutions, the system and method function without relying on cloud services, making them ideal for use in remote or disconnected environments. Moreover, the systems and methods provide real-time performance. By way of example, the system processes audio and generates synchronized video frames within a latency of 40 milliseconds, ensuring seamless real-time animation. In addition, the system and method are highly versatile. For example, the system and method support multiple applications, including virtual assistants, educational platforms, gaming systems, and entertainment systems, where edge devices are predominantly used. Furthermore, the viseme library generated by the system and method is extensible to accommodate multiple languages and dialects, thereby making the system highly adaptable for diverse user bases and providing language and accent adaptability.
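For context, a 40 millisecond per-frame budget corresponds to a playback rate of at least 25 frames per second. The following illustrative check, in which process_frame is a placeholder for the phoneme-selection, retrieval, and rendering steps, verifies that one pass of the pipeline stays within such a budget on a given edge device.

```python
# Illustrative budget check: confirm that one pass of the per-frame pipeline
# (phoneme selection, viseme lookup, rendering) fits within a 40 ms budget,
# i.e. at least 25 frames per second. process_frame() is a placeholder.
import time

FRAME_BUDGET_S = 0.040   # 40 ms per frame, equivalent to 25 fps

def within_budget(process_frame, audio_chunk):
    start = time.perf_counter()
    process_frame(audio_chunk)
    elapsed = time.perf_counter() - start
    return elapsed <= FRAME_BUDGET_S, elapsed
```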
[0109] Additionally, the systems and methods described hereinabove may find application in various fields such as, but not limited to, virtual assistants and conversational artificial intelligence (AI), online education platforms requiring engaging, human-like tutors, entertainment and gaming applications for creating realistic avatars, real-time video communication on mobile devices, and training and simulation systems in low-resource environments.
[0110] Moreover, the systems and methods revolutionize the field of lip-sync video animation by addressing the limitations of existing technologies. The optimized design for edge devices, offline functionality, and real-time processing capabilities provide a robust solution for creating realistic talking videos in constrained environments.
[0111] Although specific features of embodiments of the present specification may be shown in and/or described with respect to some drawings and not in others, this is for convenience only. It is to be understood that the described features, structures, and/or characteristics may be combined and/or used interchangeably in any suitable manner in the various embodiments.
[0112] While only certain features of the present specification have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. It is, therefore, to be understood that the present specification is intended to cover all such modifications and changes as fall within the true spirit of the invention.
CLAIMS:
1. A system (100) for real-time generation of realistic talking videos (218, 420) on an edge device (102), the system (100) comprising:
an acquisition subsystem (110) configured to receive an audio input (104, 402) and a source (106, 404) of the audio input (104, 402);
a processing subsystem (112) in operative association with the acquisition subsystem (110) and comprising a realistic talking video generating platform (114), wherein the realistic talking video generating platform (114) is, on the edge device (102), configured to:
if the audio input (104, 402) comprises a text, process the audio input (104, 402) to generate an audio corresponding to the text;
select one or more essential phonemes (410) from the audio input (104, 402), wherein the essential phonemes (410) facilitate accurate lip synchronization;
for each essential phoneme (410), retrieve a pre-generated viseme (318) corresponding to the source (106, 404) and a pre-generated blink sequence viseme (322) corresponding to the source (106, 404), wherein the pre-generated viseme (318) and the pre-generated blink sequence viseme (322) corresponding to the source (106, 404) are retrieved from a data repository (120) on the edge device (102);
generate one or more intermediate image frames for smooth transition between pre-generated visemes (318);
dynamically render the pre-generated visemes (318) and the blink sequence visemes (322) in synchronization with live audio input (104, 402) to generate a realistic animated talking video (218, 422) that is synchronized with the live audio input (104, 402); and
an interface unit (116, 118) configured to provide, in real-time on the edge device (102), the realistic animated talking video (218, 422) that is synchronized with the live audio input (104, 402).

2. The system (100) of claim 1, wherein the realistic talking video generating platform (114) is configured to perform the generation of the realistic animated talking videos (218, 422) in real-time, on the edge device (102).

3. The system (100) of claim 1, wherein the realistic talking video generating platform (114) is configured to:
generate one or more visemes (318) corresponding to the source (106, 404);
generate one or more blink sequence visemes (322) corresponding to the source (106, 404); and
store the one or more visemes (318) and the one or more blink sequence visemes (322) in the data repository (120) on the edge device (102).

4. The system (100) of claim 3, wherein to generate the one or more visemes (318) corresponding to the source (106, 404) the realistic talking video generating platform (114) is configured to:
identify facial landmarks (306) from a reference image (302) corresponding to the source (106, 404);
receive a set of distinguishable phonemes (310);
receive a set of viseme landmarks (312) corresponding to each of the distinguishable phonemes (310);
for each distinguishable phoneme (310), modify the facial landmarks (306) based on the distinguishable phoneme (310) and viseme landmarks (312) corresponding to the distinguishable phoneme (310) to generate a viseme data map (314) associated with the source (106, 404);
for each distinguishable phoneme (310), modify the reference image (302) based on a viseme data map (314) to generate a corresponding viseme (318);
adjust the facial landmarks (306) to produce one or more blink sequence visemes (322) corresponding to the source (106, 404); and
store the visemes (318) and blink sequence visemes (322) on the edge device (102).

5. The system (100) of claim 1, wherein the realistic talking video generating platform (114) comprises:
a landmark generation unit (502) configured to:
receive one or more images (302) from one or more sources (106, 404);
for each of the one or more sources (106, 404), process a corresponding image (302) to identify facial landmarks (306);
a viseme creation unit (504), for each of the one or more sources (106, 404), configured to:
receive a set of distinguishable phonemes (310);
receive a set of viseme landmarks (312) corresponding to each of the distinguishable phonemes (310);
for each distinguishable phoneme (310), modify the facial landmarks (306) based on the distinguishable phoneme (310) and viseme landmarks (312) corresponding to the distinguishable phoneme (310) to generate a viseme data map (314) corresponding to the source (106, 404);
a blink sequence viseme generation unit (506) for each of the one or more sources (106, 404) configured to adjust the facial landmarks (306) to generate a blink sequence viseme (322), wherein the blink sequence viseme (322) is configured to simulate natural eye movement;
an intermediate frame generation unit (508) configured to interpolate intermediate frames to smooth transitions between visemes (318);
a phoneme selection unit (510) configured to select one or more essential phonemes (410) in the audio input (402); and
a real-time playback unit (512), for each of the one or more sources (106, 404), configured to dynamically render the one or more visemes (318) in synchronization with live audio input (402), in real-time.

6. The system (100) of claim 5, wherein the viseme creation unit (504) is configured to leverage the pre-generated visemes (318) to reduce computational overhead on the edge device (102).

7. The system (100) of claim 5, wherein the intermediate frame generation unit (508) uses a weighted averaging algorithm to average facial landmarks (306) between frames to ensure smooth transitions between visemes (318).

8. The system (100) of claim 5, wherein the real-time playback unit (512) is configured to dynamically adapt to variations in live audio input (104, 402) to maintain synchronization between the live audio input (104, 402) and animation.

9. The system (100) of claim 1, wherein the realistic talking video generating platform (114) is configured to pre-generate and store the set of visemes (318) and the set of blink sequence visemes (322) corresponding to each source (106, 404) offline.

10. The system (100) of claim 1, wherein the realistic talking video generating platform (114) is configured to operate the generation and playback of realistic animated talking videos (218, 422) without an active internet connection.

11. The system (100) of claim 1, wherein the realistic talking video generating platform (114) is configured to export the generated animated videos (218, 422) in a format compatible with online education, virtual assistants, and entertainment platforms.

12. The system (100) of claim 1, wherein the realistic talking video generating platform (114) is configured to reduce memory and processing requirements to fit the constraints of the edge device (102).

13. A method (200) for real-time generation of realistic talking videos (216, 420) on an edge device (102), the method (200) comprising:
(a) receiving (204) an audio input (104, 402) and a source (106, 404) of the audio input (104, 402);
(b) if the audio input (104, 402) comprises a text, processing (206) the text to generate an audio;
(c) selecting (208, 408) one or more essential phonemes (410) from the audio input (104, 402), wherein the essential phonemes (410) facilitate accurate lip synchronization;
(d) for each essential phoneme (410), retrieving (210, 212, 412, 414) a pre-generated viseme (318, 416) corresponding to the source (106, 404) and a pre-generated blink sequence viseme (322, 416) corresponding to the source (106, 404);
(e) generating (214, 418) one or more intermediate image frames for smooth transition between visemes (318, 416);
(f) dynamically rendering (216, 420) the pre-generated visemes (318, 416) and the pre-generated blink sequence visemes (322, 416) in synchronization with live audio input (104, 402) to generate a realistic animated talking video (218, 422) that is synchronized with the live audio input (104, 402); and
(g) providing (220), on the edge device (102), the realistic animated talking video (218, 422) that is synchronized with the live audio input (104, 402).

14. The method (200) of claim 13, wherein steps (a)-(g) are performed in real-time on the edge device (102).

15. The method (200) of claim 14, further comprising pre-generating and storing a set of visemes (318, 416) and a set of blink sequence visemes (322, 416) corresponding to each source (106, 404) offline.

16. The method (200) of claim 15, wherein pre-generating and storing the set of visemes (318, 416) corresponding to each of the one or more sources (106, 404) offline comprises:
generating (300) one or more visemes (318, 416) corresponding to one or more sources (106, 404); and
storing (324) the one or more visemes (318) on the edge device (102).

17. The method (200) of claim 16, wherein generating (300) the one or more visemes (318, 416) corresponding to each of the one or more sources (106, 404) comprises:
receiving one or more images (302) from the one or more sources (106, 404);
for each of the one or more sources (106, 404), processing (304) a corresponding image (302) to identify facial landmarks (306);
receiving a set of distinguishable phonemes (310);
receiving a set of viseme landmarks (312) corresponding to each of the distinguishable phonemes (310);
for each distinguishable phoneme (310), modifying (308) the facial landmarks (306) based on the distinguishable phoneme (310) and viseme landmarks (312) corresponding to the distinguishable phoneme (310) to generate a viseme data map (314) associated with the source (106, 404);
for each distinguishable phoneme (310), modifying (316) the reference image (302) based on a viseme data map (314) to generate a corresponding viseme (318); and
storing (324) the visemes (318) in a data repository (120) on the edge device (102).

18. The method (200) of claim 15, wherein pre-generating (320) and storing the one or more blink sequence visemes (322) corresponding to each of the one or more sources (106, 404) offline comprises:
adjusting (320) the facial landmarks (306) to produce one or more blink sequence visemes (322) corresponding to the source (106, 404), wherein the blink sequence viseme is configured to simulate natural eye movement; and
storing (324) the blink sequence visemes (322) in a data repository (120) on the edge device (102).

19. The method (200) of claim 14, further comprising leveraging the pre-generated visemes (318) to reduce computational overhead on the edge device (102).

20. The method (200) of claim 14, wherein generating (214, 418) the one or more intermediate image frames for smooth transition between visemes (318, 416) comprises using a weighted averaging algorithm to average facial landmarks (306) between frames to ensure smooth transitions between visemes (318).

21. The method (200) of claim 14, further comprising operating the generation and playback of the realistic animated talking videos (218, 422) without an active internet connection.

22. The method (200) of claim 14, further comprising dynamically adapting the rendering of the visemes (318) to variations in live audio input (104, 402) to maintain synchronization between the live audio input (104, 402) and animation.

23. The method of claim 14, further comprising exporting the generated animated videos in a format compatible with online education, virtual assistants, and entertainment platforms.

24. A processing system (112) for real-time generation of realistic talking videos (218, 422) on an edge device (102), the processing system (112) comprising:
a realistic talking video generating platform (114) comprising:
a landmark generation unit (502) configured to:
receive one or more images (302) from one or more sources (106, 404);
for each of the one or more sources (106, 404), process a corresponding image (302) to identify facial landmarks (306);
a viseme creation unit (504), for each of the one or more sources (106, 404), configured to:
receive a set of distinguishable phonemes (310);
receive a set of viseme landmarks (312) corresponding to each of the distinguishable phonemes (310);
for each distinguishable phoneme (310), modify the facial landmarks (306) based on the distinguishable phoneme (310) and viseme landmarks (312) corresponding to the distinguishable phoneme (310) to generate a viseme data map (314) corresponding to the source (106, 404);
a blink sequence viseme generation unit (506) for each of the one or more sources (106, 404) configured to adjust the facial landmarks (306) to generate a blink sequence viseme (322), wherein the blink sequence viseme (322) is configured to simulate natural eye movement;
an intermediate frame generation unit (508) configured to interpolate intermediate frames to smooth transitions between visemes (318);
a phoneme selection unit (510) configured to select one or more essential phonemes (410) in the audio input (402); and
a real-time playback unit (512), for each of the one or more sources (106, 404), configured to dynamically render the one or more visemes (318) in synchronization with live audio input (402), in real-time.

Documents

Application Documents

# Name Date
1 202341085458-PROVISIONAL SPECIFICATION [14-12-2023(online)].pdf 2023-12-14
2 202341085458-POWER OF AUTHORITY [14-12-2023(online)].pdf 2023-12-14
3 202341085458-FORM FOR SMALL ENTITY(FORM-28) [14-12-2023(online)].pdf 2023-12-14
4 202341085458-FORM 1 [14-12-2023(online)].pdf 2023-12-14
5 202341085458-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [14-12-2023(online)].pdf 2023-12-14
6 202341085458-EVIDENCE FOR REGISTRATION UNDER SSI [14-12-2023(online)].pdf 2023-12-14
7 202341085458-Request Letter-Correspondence [20-12-2023(online)].pdf 2023-12-20
8 202341085458-Power of Attorney [20-12-2023(online)].pdf 2023-12-20
9 202341085458-Form 1 (Submitted on date of filing) [20-12-2023(online)].pdf 2023-12-20
10 202341085458-Covering Letter [20-12-2023(online)].pdf 2023-12-20
11 202341085458-FORM 3 [14-06-2024(online)].pdf 2024-06-14
12 202341085458-DRAWING [11-12-2024(online)].pdf 2024-12-11
13 202341085458-CORRESPONDENCE-OTHERS [11-12-2024(online)].pdf 2024-12-11
14 202341085458-COMPLETE SPECIFICATION [11-12-2024(online)].pdf 2024-12-11
15 202341085458-FORM-9 [13-12-2024(online)].pdf 2024-12-13
16 202341085458-STARTUP [16-12-2024(online)].pdf 2024-12-16
17 202341085458-FORM28 [16-12-2024(online)].pdf 2024-12-16
18 202341085458-FORM 18A [16-12-2024(online)].pdf 2024-12-16
19 202341085458-FER.pdf 2025-04-15
20 202341085458-FORM 3 [16-05-2025(online)].pdf 2025-05-16

Search Strategy

1 202341085458_SearchStrategyNew_E_202341085458(1)E_03-02-2025.pdf