Abstract: SYSTEM FOR AUTOMATED AUDIO-VISUAL LOCALIZATION AND METHOD THEREOF
Disclosed are a system (100) for automated audio-visual localization and a method thereof. The system (100) and the method facilitate less time-consuming and inexpensive automatic audio-visual localization. The system (100) and the method provide translation in many different languages and also facilitate human intervention for improving quality. The system (100) and the method allow a user to extract an output at each step of the method/processing based on his/her requirement. Ref. Fig.: Figure 1
SYSTEM FOR AUTOMATED AUDIO-VISUAL LOCALIZATION AND METHOD THEREOF
Field of the invention
The present invention relates to a system and a method for audio-visual localization and, more particularly, to an automated platform for translating the voice language of a video.
Background of the invention
Human voice recordings are used in many places such as online training modules, advertisements, websites, message machines and the like. If someone wants to publish that voice recording or video internationally, there is a need to translate the voice recordings into the languages of that specific region. The prior-art audio-visual localization process is a multi-stage process and is considered highly person-dependent, as it requires a transcriptor, a translator, a reviewer, voice-over artist(s), an audio engineer, a studio manager and a project manager. Further, this prior-art process of translating recordings into each language is time-consuming and complicated, as it requires separate price lists, separate processes, and separate people to talk to for each project and language. Moreover, revisions are particularly difficult, as they require going back over every previous stage and there is no structure or standardization in this process.
Efforts are seen in the art to provide automatic translation of the language of a video. Reference may be made to US9552807B2, which discloses a system and a method for automatically dubbing a video in a first language into a second language. The system comprises an audio/video pre-processor configured to provide separate original audio and video files of the same media; a text analysis unit configured to receive a first text file of the video's subtitles in the first language and a second text file of the video's subtitles in the second language, and re-divide them into text sentences; a text-to-speech unit configured to receive the text sentences in the first and second languages from the text analysis unit and produce therefrom first and second standard TTS spoken sentences; a prosody unit configured to receive the first and second spoken sentences, the separated audio file and timing parameters and produce therefrom dubbing recommendations; and a dubbing unit configured to receive the second spoken sentence and the recommendations and produce therefrom an automatically dubbed sentence in the second language. However, this invention works on an audio/video stream and not on saved media files, and does not have a subtitle extraction module. Further, the invention does not disclose the addition of background music to an original audio. Furthermore, the invention does not mention human intervention and does not provide details of how the solution is installed.
US20200211565A1 discloses a system and a method to perform dubbing automatically for multiple languages at the same time using speech-to-text transcriptions, language translation, and artificial intelligence engines to perform the actual dubbing in the voice likeness of the original speaker. However, this invention does not disclose the addition of background music to an original audio. Further, the invention does not mention human intervention and does not provide details of how the solution is installed.
Accordingly, there is a need for an automated audio-visual localization platform for quick voice-language translation.
Objects of the invention
An object of the present invention is to provide an end-to-end automated voice translation platform.
Another object of the present invention is to provide a simplified voice translation process that reduces translation time and cost.
Summary of the invention
Accordingly, the present invention provides a system for automated audio-visual localization. The system comprises a pre-processing module, a translation module and a post-processing module.
The pre-processing module is used by a user for attaching/detaching a media file from a subtitle file, and for deleting and downloading the subtitle file and the media file. The pre-processing module includes an upload unit, a subtitle generating unit, a segmentation unit and a subtitle editor. The upload unit is used by the user for uploading any one of the media files and the subtitle files. The subtitle generating unit automatically generates the subtitle file on non-availability thereof through transcription, finding speech regions and generating subtitles. To generate subtitles, the input media file is first converted into a wave format. The time coded data from the transcription, and the start and end times from the speech regions, are combined to create the subtitle segments. The segmentation unit segments a final subtitle file generated by the subtitle generating unit. Specifically, each time coded entry in the subtitle file is considered as one segment. The subtitle editor is used by the user for editing the generated subtitle file.
The translation module is operably connected to the pre-processing module for receiving the pre-processed files therefrom. The translation module includes a subtitle editor, a voiceover unit, an adjusting unit, an audio generating unit, a background music unit and a memory unit. The subtitle editor is used by the user to edit a target subtitle file. Each segment is translated by the user using either a machine translation engine or human translation. The segments are translated using any one of a translation logic implemented and configured to be executed from the memory unit or through the machine translation engine. The voiceover unit is used by the user for giving voiceover to each individual segment. Specifically, the voiceover is given using any of a machine/digital voice, a human voice or by uploading a high-quality audio file created from external sources.
The adjusting unit automatically adjusts the timestamp or length of the video. The length of the segment is elongated to match the time duration of the recorded voice. The audio generating unit generates a final audio file. The background music unit is used by the user to attach background music to an output audio file.
The post-processing module is operably connected to the translation module for receiving the translated/voiceover files therefrom. The post-processing module includes a mixing unit and an export unit. The mixing unit is used by the user to create a final video file by selecting a plurality of files and mixing all the selected files. The plurality of files includes an original subtitle file, a translated subtitle file, a system generated translated audio file, an externally uploaded high-quality audio file and an original audio file. The export unit is used by the user for extracting the output at each step of the processing.
In another aspect, the present invention provides for a method for automated audio-visual localization.
Brief description of the drawings
The detailed description is described with reference to the accompanying figures.
The objects and advantages of the present invention will become apparent when the disclosure is read in conjunction with the following figures, wherein
Figure 1 shows a block diagram of a system for automated audio-visual localization, in accordance with the present invention;
Figure 2 shows an overview of a method for automated audio-visual localization, in accordance with the present invention;
Figure 3 shows a flowchart of a pre-processing step of the method for automated audio-visual localization, in accordance with the present invention;
Figure 4 shows a flowchart of a translation step of the method for automated audio-visual localization, in accordance with the present invention; and
Figure 5 shows a flowchart of a post-processing step of the method for automated audio-visual localization, in accordance with the present invention.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present invention. Similarly, it will be appreciated that any flowcharts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
Detailed description of the invention
The foregoing objects of the invention are accomplished, and the problems and shortcomings associated with prior art techniques and approaches are overcome by the present invention described in the present embodiments.
The embodiments herein provide a system for automated audio-visual localization and a method thereof. Further, the embodiments may be easily implemented in data, information communication and management structures. Embodiments may also be implemented as one or more applications performed by standalone or embedded systems.
The systems and methods described herein are explained using examples with specific details for better understanding. However, the disclosed embodiments can be worked on by a person skilled in the art without the use of these specific details.
Throughout this application, with respect to all reasonable derivatives of such terms, and unless otherwise specified (and/or unless the particular context clearly dictates otherwise), each usage of:
“a” or “an” is meant to read as “at least one.”
“the” is meant to be read as “the at least one.”
References in the specification to “one embodiment” or “an embodiment” mean that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Hereinafter, embodiments will be described in detail. For clarity of the description, known constructions and functions will be omitted.
Parts of the description may be presented in terms of operations performed by at least one electrical / electronic circuit, a computer system, using terms such as data, state, link, fault, packet, and the like, consistent with the manner commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. As is well understood by those skilled in the art, these quantities take the form of data stored/transferred in the form of non-transitory, computer-readable electrical, magnetic, or optical signals capable of being stored, transferred, combined, and otherwise manipulated through mechanical and electrical components of the computer system; and the term computer system includes general purpose as well as special purpose data processing machines, switches, and the like, that are standalone, adjunct or embedded. For instance, some embodiments may be implemented by a processing system that executes program instructions so as to cause the processing system to perform operations involved in one or more of the methods described herein. The program instructions may be computer-readable code, such as compiled or non-compiled program logic and/or machine code, stored in a data storage that takes the form of a non-transitory computer-readable medium, such as a magnetic, optical, and/or flash data storage medium. Moreover, such processing system and/or data storage may be implemented using a single computer system or may be distributed across multiple computer systems (e.g., servers) that are communicatively linked through a network to allow the computer systems to operate in a coordinated manner.
The present invention provides a system and a method for automatic audio-visual localization. The invention also facilitates human intervention for improving quality. The system and the method of the present invention provide translation in many different languages. The present invention provides a complete end to end process with automation and human intervention, wherever required.
The present invention is now illustrated with reference to the accompanying drawings, throughout which reference numbers indicate corresponding parts in the various figures. These reference numbers are shown in bracket in the following description.
Referring to figure 1, a system (100) for automated audio-visual localization in accordance with the present invention is shown. Specifically, the system (100) is a cloud/web based system. The system (100) allows a user to access/modify every step. The system (100) comprises a pre-processing module (10), a translation module (30) and a post-processing module (40).
As shown in figure 1, the pre-processing module (10) includes an upload unit (2), a subtitle generating unit (4), a segmentation unit (6) and a subtitle editor (8). Each of the upload unit (2), the subtitle generating unit (4), the segmentation unit (6) and the subtitle editor (8) comprises at least one processor, at least one memory unit communicatively coupled to the processor, and a data communication unit to communicate data and information with other modules/units in communication.
In another implementation of the system (100) of the present invention, the pre-processing module (10) comprises at least one processor, at least one memory unit communicatively coupled to the processor, and a data communication unit to communicate data and information with other modules/units in communication. The processor of the pre-processing module (10) is configured to have the upload unit (2), the subtitle generating unit (4), the segmentation unit (6) and the subtitle editor (8) embedded therein.
In another implementation of the system of the present invention, the processor(s) of the pre-processing module (10) can be cloud-computing platform based processors/processing units.
The upload unit (2) is utilised by the user to upload single/bulk media files and / or subtitle files for quick voice language translation. The system (100) allows the user to upload only subtitle files or only media files or both. The various file formats generally supported by the upload unit (2) are as follows:
However, it is understood here that the upload unit (2) can be configured to accept any other file format other than the above-mentioned formats in other alternative embodiments of the present invention.
The subtitle generating unit (4) automatically generates a subtitle file if a subtitle file is not available. In the context of the present invention, to generate subtitles, the input media file is first converted into a wave format. The subtitle generation involves three steps, namely transcription, finding speech regions and generating subtitles.
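By way of a non-limiting illustration, the following minimal sketch shows how such a conversion into a wave format could be carried out, assuming the ffmpeg command-line tool is available on the host; the function name and parameters are illustrative assumptions and not part of the claimed system.

```python
import subprocess

def media_to_wav(media_path: str, wav_path: str, sample_rate: int = 16000) -> str:
    """Convert an uploaded media file into a mono WAV file prior to subtitle generation."""
    subprocess.run(
        [
            "ffmpeg", "-y",           # overwrite the output file if it already exists
            "-i", media_path,         # input video or audio file
            "-vn",                    # drop any video stream; keep audio only
            "-ac", "1",               # mix down to a single (mono) channel
            "-ar", str(sample_rate),  # resample to a rate commonly expected by STT engines
            wav_path,
        ],
        check=True,
    )
    return wav_path
```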
The segmentation unit (6) segments a final subtitle file generated by the subtitle generating unit (4). Specifically, each time coded entry in the subtitle file is considered as one segment. The subtitle editor (8) allows the user to edit the generated subtitle file. The user can modify the subtitles while simultaneously watching the video and listening to the audio. The subtitle editor (8) also allows the user to choose the playback speed of the media file and to listen to the audio of a segment and keep it playing in a loop, which helps while curating or reviewing the subtitles. The user, as per the requirement, sets a subtitle profile/subtitle guidelines, and the subtitle editor (8) shows any deviations from the set subtitle profile/subtitle guidelines. Further, the subtitle editor (8) allows the user to watch the timeline of the audio waveform and use the simple and intuitive interface to merge/break the subtitles, to edit the start time/end time, if required, and also to add/delete the subtitles. Through this, the user downloads the output as either a transcription file or a subtitle file.
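By way of illustration, the segmentation described above may be sketched as follows, assuming the subtitle file is in the common SubRip (.srt) form; the helper simply treats each time-coded entry as one segment. The function and field names are illustrative assumptions rather than a prescribed implementation.

```python
import re

_TIME_LINE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")

def parse_srt(srt_text: str) -> list[dict]:
    """Treat each time-coded SRT entry as one segment, as the segmentation unit (6) does."""
    segments = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue
        match = _TIME_LINE.search(lines[1])
        if not match:
            continue
        segments.append({
            "index": int(lines[0].strip()),
            "start": match.group(1),   # "HH:MM:SS,mmm"
            "end": match.group(2),
            "text": " ".join(lines[2:]).strip(),
        })
    return segments
```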
In accordance with the present invention, the pre-processing module (10) allows the user to attach/detach the media file from the subtitle file, delete the subtitle/ the media file and download the subtitle/ the media file. Once this process is done, the user can choose to download the files or continue with the further processing of the file.
The translation module (30) is operably connected to the pre-processing module (10) for receiving the pre-processed files therefrom. Once the user accesses the translation module (30), the pre-processing step is blocked by the system (100). In the translation module (30), the segmentation of the input subtitle file is carried out. If the user wants to change the source segments, then the system (100) allows the user to delete the file from this stage and move back to the previous stage.
The translation module (30) includes a subtitle editor (12), a voiceover unit (14), an adjusting unit (16), an audio generating unit (18), a background music unit (20) and a memory unit (22).
In an implementation according to an embodiment of the present invention, the translation module (30) comprises at least one processor communicatively coupled to at least one memory unit, the at least one processor configured to embed the subtitle editor (12), the voiceover unit (14), the adjusting unit (16), the audio generating unit (18) and the background music unit (20). The at least one memory unit communicatively coupled to the processor is configured to embed the memory unit (22).
In another implementation of the system (100) of the present invention, each of the subtitle editor (12), the voiceover unit (14), the adjusting unit (16), the audio generating unit (18) and the background music unit (20) comprises at least one processor, at least one memory unit communicatively coupled to the processor, and a data communication unit to communicate data and information with other modules/units in communication.
In another implementation of the system of the present invention, the processor(s) of the translation module (30) can be cloud-computing platform based processors/processing units.
The subtitle editor (12) allows the user to edit the target subtitle file. Each segment is translated by the user using either a machine translation engine or human translation. The segments are translated using any one of the translation logic implemented and configured to be executed from the memory unit (22) or through the machine translation engine. Further, the subtitle editor (12) allows the user to translate the subtitles while simultaneously watching the video and listening to the audio, to merge/undo merge the segments, to edit the start time/end time, if required, to set the alignment of the subtitle, to apply any formatting, to filter too big/too small audio files so as to reconsider the translation or change the recording, and also to set the subtitle profile/subtitle guidelines for each target language. The subtitle editor (12) shows any deviations from the set subtitle profile/subtitle guidelines.
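A minimal sketch of this per-segment translation step is given below. The `mt_engine` callable is a hypothetical stand-in for any machine translation engine, and `human_edits` represents reviewer-supplied translations; both names are assumptions made for illustration only.

```python
def translate_segments(segments: list[dict], target_lang: str,
                       mt_engine=None, human_edits: dict | None = None) -> list[dict]:
    """Fill a "target_text" field for each segment.

    mt_engine is a hypothetical callable (text, target_lang) -> str wrapping any
    machine translation API; human_edits maps a segment index to reviewer-provided
    text and takes priority over the machine output.
    """
    human_edits = human_edits or {}
    out = []
    for seg in segments:
        if seg["index"] in human_edits:          # human translation/review wins
            target = human_edits[seg["index"]]
        elif mt_engine is not None:              # otherwise call the MT engine
            target = mt_engine(seg["text"], target_lang)
        else:
            target = seg["text"]                 # nothing available; keep the source text
        out.append({**seg, "target_text": target})
    return out
```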
The voiceover unit (14) allows the user to give voiceover to each individual segment. In the context of the present invention, the voiceover is given using any of a machine/digital voice and a human voice. The system (100) supports multiple text to speech (machine voice) engines. The user can do hands-free recording of the segments for continuous recording of all the segments. Alternatively, the user can also upload the recording from an external source. The voiceover unit (14) allows the user to delete the existing recording and record a new one, to edit the audio to remove certain parts, for example unwanted leading, trailing or in-between silences, and to auto trim the leading and trailing silence for all or filtered segments. The user is also allowed to retain the original sound, such as background music, for desired segments.
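The per-segment voiceover step may be sketched as follows. The `tts_engine` callable is a hypothetical wrapper around any supported text to speech engine, and a `human_clip` entry stands for a human recording or an externally uploaded high-quality audio file; these names are illustrative assumptions.

```python
from pathlib import Path

def record_segment_voiceovers(segments: list[dict], out_dir: str, tts_engine=None) -> list[dict]:
    """Produce one voiceover audio clip per segment.

    tts_engine is a hypothetical callable (text, path) -> None wrapping any
    text-to-speech service; a segment may instead carry "human_clip", the path
    of a human recording or an externally produced high-quality audio file.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out = []
    for seg in segments:
        clip_path = seg.get("human_clip")          # human/uploaded recording, if any
        if clip_path is None and tts_engine is not None:
            clip_path = str(Path(out_dir) / f"segment_{seg['index']:04d}.wav")
            tts_engine(seg["target_text"], clip_path)   # machine/digital voice
        out.append({**seg, "clip": clip_path})
    return out
```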
The adjusting unit (16) automatically adjusts the timestamp or length of the video in case the translated voiceover of a segment does not fit in the same time duration as that of the source voice. Specifically, the length of such a segment is elongated to match the time duration of the recorded voice (human/machine). This way, the output file places the translated voice at the same position as the source voice.
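A minimal sketch of this timestamp adjustment is shown below: where a recorded clip overruns its original slot, the segment is elongated and all following segments are shifted by the same amount. The helper names and dictionary fields are illustrative assumptions.

```python
def _to_ms(t: str) -> int:
    """Parse "HH:MM:SS,mmm" into milliseconds."""
    hms, ms = t.split(",")
    h, m, s = hms.split(":")
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def adjust_segment_times(segments: list[dict], clip_durations_ms: dict) -> list[dict]:
    """Elongate each segment whose recorded voiceover is longer than its original slot,
    shifting the following segments so the translated voice stays aligned with the source.

    clip_durations_ms maps a segment index to the recorded clip's duration in milliseconds.
    """
    adjusted, shift = [], 0
    for seg in segments:
        start = _to_ms(seg["start"]) + shift
        end = _to_ms(seg["end"]) + shift
        needed = clip_durations_ms.get(seg["index"], end - start)
        if needed > end - start:                 # recorded voice overruns the slot
            shift += needed - (end - start)      # later segments move by the same amount
            end = start + needed
        adjusted.append({**seg, "start_ms": start, "end_ms": end})
    return adjusted
```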
The audio generating unit (18) generates a final audio file. The audio generating unit (18) allows the user to decide to generate the final audio file of the entire input subtitles. The audio generating unit (18) also allows the user to increase the volume of the audio and to choose to compress an audio segment to fit into the segment duration if the original segment audio overruns the available time; for example, disclaimer audio at the start of a video is played very fast to save time. The user can upload an externally processed audio file, and this file can also be downloaded, if required.
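By way of illustration, the assembly of the final audio file could be sketched as follows, assuming the pydub library for audio handling; time-compressing an overrunning clip (for example with ffmpeg's atempo filter) is mentioned in a comment but omitted from the sketch.

```python
from pydub import AudioSegment

def build_final_audio(adjusted_segments: list[dict], total_ms: int,
                      out_path: str, gain_db: float = 0.0) -> str:
    """Lay each segment's voiceover clip onto a silent track at its adjusted start time.

    Clips longer than their slot could additionally be time-compressed before
    overlaying (e.g. via ffmpeg's atempo filter); that step is omitted here.
    """
    track = AudioSegment.silent(duration=total_ms)
    for seg in adjusted_segments:
        if not seg.get("clip"):
            continue
        clip = AudioSegment.from_file(seg["clip"])
        track = track.overlay(clip, position=seg["start_ms"])
    if gain_db:
        track = track + gain_db          # optional overall volume increase
    track.export(out_path, format="wav")
    return out_path
```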
The background music unit (20) allows the user to attach the background music to an output audio file. The background music is applied to the entire output audio file. However, it is understood here that the background music unit (20) can be customized to attach the background music to a selected segment of the output audio file in other alternative embodiments of the present invention. The background music unit (20) allows the user to set the volume of the background music as well as download the audio file with the background music, if required.
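A minimal sketch of attaching background music to the entire output audio file, again assuming pydub, is given below; the attenuation value is an illustrative default and would in practice be the volume chosen by the user.

```python
from pydub import AudioSegment

def attach_background_music(voice_path: str, music_path: str, out_path: str,
                            music_gain_db: float = -18.0) -> str:
    """Overlay background music, attenuated by music_gain_db, under the whole output audio."""
    voice = AudioSegment.from_file(voice_path)
    music = AudioSegment.from_file(music_path) + music_gain_db   # lower the music volume
    if len(music) < len(voice):
        music = music * (len(voice) // len(music) + 1)           # loop the music to cover the voice
    mixed = voice.overlay(music[: len(voice)])
    mixed.export(out_path, format="wav")
    return out_path
```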
The post-processing module (40) is operably connected to the translation module (30) to receive the translated/voiceover files therefrom. The post-processing module (40) includes a mixing unit (32) and an export unit (34). Each of the mixing unit (32) and the export unit (34) comprises at least one processor, at least one memory unit communicatively coupled to the processor, and a data communication unit to communicate data and information with other modules/units in communication.
In another implementation of the system (100) of the present invention, the post-processing module (40) comprises at least one processor, at least one memory unit communicatively coupled to the processor, and a data communication unit to communicate data and information with other modules/units in communication. The processor of the post-processing module (40) is configured to have the mixing unit (32) and the export unit (34) embedded therein.
In another implementation of the system (100) of the present invention, the processor(s) of the post-processing module (40) can be cloud-computing platform based processors/processing units.
The mixing unit (32) is used by the user to create a final video file. The user selects a plurality of files for making the final video. The final video is prepared by mixing all the selected files. The plurality of files includes an original subtitle file, a translated subtitle file, a system generated translated audio file, an externally uploaded high-quality audio file, an original audio file and the like. The mixing unit (32) allows the user to choose to generate an output file in a single language or in multiple languages. In case the user chooses multiple languages, the mixing unit (32) allows the user to select different subtitles and audios to be played on different tracks. If the user needs multiple language output files, the files for each language can be packaged into a single video file.
In accordance with the present invention, the mixing unit (32) also allows the user to set the font size, font colour and background colour if not set at the segment level, to set the opacity of the colour to make the subtitle font/background colour transparent, to set an opaque box for the subtitles, and to choose to attach the subtitles or burn them into the output video. However, it is understood here that only one subtitle (source/target) can be burned at a time. The user can attach the source subtitle file and one or all target subtitle files. In case of reprocessing of the mixed video for changes made in the translation or audio files, the previous selections are pre-populated in a mixing video dialog box, which saves the user's time. The export unit (34) allows the user to extract the output at each step of the processing based on his/her requirement.
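By way of illustration, the mixing of a final video from the selected files could be sketched as follows, assuming ffmpeg and an MP4 output; the subtitle file is either attached as a soft track or burned into the picture, reflecting that only one subtitle can be burned at a time. The function name and chosen options are illustrative assumptions.

```python
import subprocess

def mix_final_video(video_path: str, audio_path: str, subtitle_path: str,
                    out_path: str, burn_subtitles: bool = False) -> str:
    """Replace the original audio with the translated track and either attach the
    subtitle file as a soft track or burn it into the picture."""
    if burn_subtitles:
        cmd = ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
               "-map", "0:v", "-map", "1:a",
               "-vf", f"subtitles={subtitle_path}",   # re-encodes video to burn the subtitles
               "-c:a", "aac", out_path]
    else:
        cmd = ["ffmpeg", "-y", "-i", video_path, "-i", audio_path, "-i", subtitle_path,
               "-map", "0:v", "-map", "1:a", "-map", "2",
               "-c:v", "copy", "-c:a", "aac",
               "-c:s", "mov_text",                    # soft subtitle track for MP4 output
               out_path]
    subprocess.run(cmd, check=True)
    return out_path
```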
Referring to figures 2-5, in another aspect, the present invention provides a method for automated audio-visual localization. Specifically, the method is described herein below in conjunction with the system (100) of figure 1. In accordance with the present invention, each step of the method is triggered by the user. The user has the choice of whether or not to carry out the next step. In the context of the present invention, the method allows the user to extract an output at each step of the method based on his/her requirement.
As shown in figure 2, the method in accordance with the present invention involves three steps, namely a pre-processing step, a translation step and a post-processing step.
Figure 3 shows a detailed flowchart of the pre-processing step according to the method of the present invention. The user uploads single/bulk subtitle files and/or media files by using the upload unit (2). The user can automatically group multiple files while working on a large project. If the user uploads only media files, then the system (100) generates at least one subtitle file automatically. The user is required to provide an input as a source file and a target language. The method of the present invention allows the user to select multiple target languages for multiple language outputs. Table 1 below lists the languages generally supported by the system (100):
Table 1
Arabic Dutch Hungarian Maithili Portuguese Tamil
Assamese English Indonesian Malay Punjabi Telugu
Bengali Estonian Irish Malayalam Romanian Thai
Bulgarian Finnish Italian Maltese Russian Turkish
Burmese French Japanese Manipuri Sanskrit Urdu
Chinese (Simplified) German Kannada Marathi Sinhala Vietnamese
Chinese (Traditional) Greek Konkani Nepali Slovak
Croatian Gujarati Korean Norwegian Slovenian
Czech Hebrew Latvian Oriya Spanish
Danish Hindi Lithuanian Polish Swedish
However, it is understood here that the system (100) can be configured to support any other languages in other alternative embodiments of the present invention.
Once the files are uploaded by the user, the subtitle generating unit (4) automatically generates the subtitle file(s) if the user uploads only media files without any corresponding subtitle file(s). In the context of the present invention, to generate the subtitles, the input media file is first converted into a wave format. The subtitle generation involves three sub-steps, namely transcription, finding speech regions and generating subtitles.
In the transcription according to the method of the present invention, the audio is converted to text using cloud-based Speech to Text engines. The transcription contains the time coded data for each word. The system (100) is configured to support multiple Speech to Text engines so that, based on the language and dialect, the best possible option can be selected.
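A minimal sketch of this transcription sub-step is given below. The `stt_client` object is a hypothetical wrapper around whichever cloud Speech to Text engine is selected for the language and dialect; its `recognize` method and the returned field names are assumptions made purely for illustration.

```python
def transcribe_with_word_times(wav_path: str, language_code: str, stt_client) -> list[dict]:
    """Call a cloud Speech to Text engine and keep the per-word time codes.

    stt_client is a hypothetical wrapper whose recognize() yields, for each word,
    its text and its start/end times in seconds.
    """
    words = []
    for w in stt_client.recognize(wav_path, language=language_code, word_time_offsets=True):
        words.append({"word": w["word"], "start_s": w["start_s"], "end_s": w["end_s"]})
    return words
```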
For finding speech regions according to the method of the present invention, the speech regions are extracted from the audio file. The speech regions are those where sound is present. These speech regions give the time encoding of the subtitle segment, i.e., the start and end time. Each speech region is then converted into text. In the subtitle generation according to the method of the present invention, the time coded data from the transcription, and the start and end times from the speech regions, are combined to create the subtitle segments. Any boundary conditions are also handled in this process. The user can edit the generated subtitle file, if required.
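The speech-region and subtitle-generation sub-steps may be sketched together as follows, assuming pydub's silence detection as a simple speech-region detector and the per-word time codes produced by the transcription sub-step; the thresholds and field names are illustrative assumptions rather than the exact logic of the system (100).

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def _srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp "HH:MM:SS,mmm"."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def generate_srt(wav_path: str, words: list[dict], srt_path: str) -> str:
    """Find speech regions (non-silent ranges) and combine them with the word-level
    time codes from the transcription to produce time-coded subtitle segments."""
    audio = AudioSegment.from_wav(wav_path)
    regions = detect_nonsilent(audio, min_silence_len=500, silence_thresh=audio.dBFS - 16)
    entries = []
    for start_ms, end_ms in regions:
        # keep the words whose midpoints fall inside this speech region
        text = " ".join(
            w["word"] for w in words
            if start_ms <= (w["start_s"] + w["end_s"]) / 2 * 1000 <= end_ms
        )
        if text:
            entries.append(
                f"{len(entries) + 1}\n{_srt_time(start_ms)} --> {_srt_time(end_ms)}\n{text}\n"
            )
    with open(srt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(entries))
    return srt_path
```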
Once the final subtitle file is generated, the segmentation is done by the segmentation unit (6). Specifically, each time coded entry in the subtitle file is considered as one segment. Although the subtitle file is automatically generated from the uploaded media file, there are some cases where the user needs to edit the generated file. The subtitle editor (8) allows the user to edit the generated subtitle file. The user can modify the subtitles while simultaneously watching the video and listening to the audio. The subtitle editor (8) also allows the user to choose the playback speed of the media file and to listen to the audio of a segment and keep it playing in a loop, which helps while curating or reviewing the subtitles. The user, as per the requirement, sets a subtitle profile/subtitle guidelines, and the subtitle editor (8) shows any deviations from the set subtitle profile/subtitle guidelines. Further, the subtitle editor (8) allows the user to watch the timeline of the audio waveform and use the simple and intuitive interface to merge/break the subtitles, to edit the start time/end time, if required, and also to add/delete the subtitles. Through this, the user is allowed to download the output as a transcription file or a subtitle file. Thereafter, the files are moved to the translation step.
Figure 4 shows a detailed flowchart of the translation step according to the method of the present invention. Once the files are transferred to the translation step, the pre-processing step is blocked by the system (100). However, if the user wants to change the source segments, the user can delete the file from the translation step and move back to the pre-processing step.
In the translation step in accordance with the method of the present invention, the system (100) segments the input subtitle file.
The user edits the target subtitle file using the subtitle editor (12) in accordance with the method of the present invention. Each segment is translated by the user using either a machine translation engine or human translation. Specifically, the segments are translated using any one of the translation logic implemented and configured to be executed from the memory unit (22) or through the machine translation engine. The subtitle editor (12) also allows the user to translate the subtitles while simultaneously watching the video and listening to the audio, to merge/undo merge the segments, to edit the start time/end time, if required, to set the alignment of the subtitle, to apply any formatting, to filter too big/too small audio files so as to reconsider the translation or change the recording, and also to set the subtitle profile/subtitle guidelines for each target language. The subtitle editor (12) shows any deviations from the set subtitle profile/subtitle guidelines.
After editing the target subtitle file, the user provides voiceover to each segment using the voiceover unit (14) in accordance with the method of the present invention. In the context of the present invention, the voiceover is given using any of a machine/digital voice and a human voice. The user can also upload a high-quality audio file created from external sources in place of the voiceover. The individual voiceover audio clips are generated. The user then selects the adjusting unit (16) to automatically adjust the timestamp or length of the video in case the translated voiceover of a segment does not fit in the same time duration as that of the source voice. Specifically, the length of such a segment is elongated to match the time duration of the recorded voice (human/machine). This way, the output file places the translated voice at the same position as the source voice.
The user then generates the final audio file of the entire input subtitles using the audio generating unit (18) in accordance with the method of the present invention. The audio generating unit (18) also allows the user to increase the volume of the audio and to choose to compress an audio segment to fit into the segment duration if the original segment audio overruns the available time; for example, disclaimer audio at the start of a video is played very fast to save time.
The user then attaches the background music to the output audio file using the background music unit (20) in accordance with the method of the present invention. The background music is applied to the entire output audio file. However, it is understood here that the background music unit (20) can be customized to attach the background music to a selected segment of the output audio file in other alternative embodiments of the present invention.
After the translation step, the files move to the post-processing step. Figure 5 shows a detailed flowchart of the post-processing step according to the method of the present invention. In the post-processing step, the user selects a plurality of files for making a final video. The final video is prepared by mixing all the selected files by the user using the mixing unit (32). The plurality of files includes an original subtitle file, a translated subtitle file, a system generated translated audio file, an externally uploaded high-quality audio file, an original audio file and the like. If the user needs multiple language output files, the files for each language can be packaged into a single video file. The final video is then exported by the user using the export unit (34). The export unit (34) also allows the user to extract the output at each step of the method based on his/her requirement.
Advantages of the invention
1. The system (100) and the method facilitate less time-consuming and inexpensive automatic audio-visual localization.
2. The system (100) and the method provide translation in many different languages.
3. The system (100) and the method provide access to every stage for human interactions.
4. The system (100) and the method allow the user to extract the output at each step of the method/processing based on his/her requirement.
The foregoing objects of the invention are accomplished and the problems and shortcomings associated with prior art techniques and approaches are overcome by the present invention described in the present embodiment. Detailed descriptions of the preferred embodiment are provided herein; however, it is to be understood that the present invention may be embodied in various forms. Therefore, specific details disclosed herein are not to be interpreted as limiting, but rather as a basis for the claims and as a representative basis for teaching one skilled in the art to employ the present invention in virtually any appropriately detailed system, structure, or matter. The embodiments of the invention as described above and the methods disclosed herein will suggest further modifications and alterations to those skilled in the art. Such further modifications and alterations may be made without departing from the scope of the invention.
CLAIMS:
We claim:
1. A system (100) for automated audio-visual localization, the system (100) comprising:
a pre-processing module (10) for attaching/detaching a media file from a subtitle file, and for deleting and downloading the subtitle file and the media file by a user, the pre-processing module (10) having,
an upload unit (2) for uploading the media files and / or the subtitle files,
a subtitle generating unit (4) for automatically generating the subtitle file on non-availability thereof through transcription, finding speech regions and generating subtitles, wherein to generate subtitles an input media file is first converted into a wave format,
a segmentation unit (6) for segmenting a final subtitle file generated by the subtitle generating unit (4), wherein each time coded entry in the subtitle file is considered as one segment, and
a subtitle editor (8) to be used by the user for editing the generated subtitle file;
a translation module (30) operably connected to the pre-processing module (10) for receiving the pre-processed files therefrom, the translation module (30) having,
a subtitle editor (12) to be used by the user to edit a target subtitle file, wherein each segment is translated by the user using any one of a machine translation engine and using humans,
a voiceover unit (14) for giving voiceover to each individual segment,
an adjusting unit (16) for automatically adjusting timestamp or length of the video, wherein length of the segment is elongated to match with the time duration of a recorded voice,
an audio generating unit (18) for generating a final audio file, and
a background music unit (20) to be used by the user to attach background music to an output audio file; and
a post-processing module (40) operably connected to the translation module (30) to receive translated/ voiceover files therefrom, the post-processing module (40) having,
a mixing unit (32) to be used by the user to create a final video file by selecting a plurality of files and mixing all the selected files, and
an export unit (34) for extracting an output at each step of the processing.
2. The system (100) as claimed in claim 1, wherein the segments are translated using any one of a translation logic implemented and configured to get executed from a memory unit (22) or through the machine translation engine.
3. The system (100) as claimed in claim 1, wherein the time coded data from the transcription, and start and end time from the speech regions are combined to create the subtitle segments.
4. The system (100) as claimed in claim 1, wherein the voiceover is given using any of a machine/ digital voice, a human voice and by uploading a high-quality audio file created from the external sources.
5. The system (100) as claimed in claim 1, wherein the plurality of files includes an original subtitle file, a translated subtitle file, a system generated translated audio file, an externally uploaded high-quality audio file and an original audio file.
6. A method for automated audio-visual localization, the method comprising the steps of:
uploading, through an upload unit (2), subtitle files and / or media files, wherein the user is required to provide an input as a source file and a target language;
generating, by a subtitle generating unit (4), at least one subtitle file on non-availability thereof through transcription, finding speech regions and generating subtitles;
segmenting, by a segmentation unit (6), a final subtitle file, wherein each time coded entry in the subtitle file is considered as one segment;
editing, by the user through a subtitle editor (8), the generated subtitle file;
editing, through a subtitle editor (12), a target subtitle file, wherein each segment is translated by the user using any one of a machine translation engine and using humans;
providing, through a voiceover unit (14), voiceover to each segment to generate individual voiceover audio clips;
adjusting, by selecting an adjusting unit (16), timestamp or length of the video, wherein the length of the segment is elongated to match with the time duration of a recorded voice;
generating, through an audio generating unit (18), a final audio file of entire input subtitles;
attaching, through a background music unit (20), a background music to an output audio file;
mixing, through a mixing unit (32), a plurality of files, the plurality of files selected by the user for making a final video; and
exporting, through an export unit (34), the final video.
7. The method as claimed in claim 6, wherein to generate the subtitles the input media file is first converted into a wave format.
8. The method as claimed in claim 6, wherein in the transcription an audio is converted to text using a cloud-based Speech to Text engine and the transcription includes time coded data for each word.
9. The method as claimed in claim 6, wherein for finding the speech regions the speech regions are extracted from the audio file and each speech region is converted into text to give start and end time of the subtitle segment.
10. The method as claimed in claim 6, wherein the time coded data from the transcription, and start and end time from the speech regions are combined to create the subtitle segments.
11. The method as claimed in claim 6, wherein the segments are translated using any one of a translation logic implemented and configured to get executed from a memory unit (22) or through the machine translation engine.
12. The method as claimed in claim 6, wherein the plurality of files includes an original subtitle file, a translated subtitle file, a system generated translated audio file, an externally uploaded high-quality audio file and an original audio file.
13. The method as claimed in claim 6, wherein the voiceover is given using any of a machine/digital voice, a human voice and by uploading a high-quality audio file created from the external sources.
14. The method as claimed in claim 6, wherein editing of the generated subtitle file includes
choosing the speed while playing the media file and choosing to listen to the audio for segment and keep on playing in loop;
setting a subtitle profile/ subtitle guidelines;
watching the timeline of audio waveform and using the simple and intuitive interface to merge/ break the subtitles; and
editing the start time/ end time, if required and choosing to add/delete the subtitles.
15. The method as claimed in claim 6, wherein editing of the target subtitle file using the subtitle editor (12) includes
translating the subtitles while simultaneously watching the video and listening to audio;
merging / undo merge the segments;
editing the start time/end time, if required;
setting the alignment of subtitle; and
applying any formatting, filtering too big / small audio files to reconsider the translation or change the recording and setting the subtitle profile/ subtitle guidelines for each target language.
16. The method as claimed in claim 6, wherein for generating the final audio file the user increases the volume of the audio and chooses to compress the audio segment to fit into the segment duration if the original segment audio overruns the available time.