
A Method To Insert Visual Subtitles In Videos

Abstract: A system and method to insert visual subtitles in videos is described. The method comprises segmenting an input video signal to extract the speech segments and music segments. Next, a speaker representation is associated with each speech segment corresponding to a speaker visible in the frame. Further, the speech segments are analysed to compute the phones and the duration of each phone. The phones are mapped to a corresponding viseme, and a viseme based language model is created with a corresponding score. The most relevant viseme is selected for each speech segment by computing a total viseme score. Further, a speaker representation sequence is created such that phones and emotions in the speech segments are represented as reconstructed lip movements and eyebrow movements. The speaker representation sequence is then integrated with the music segments and superimposed on the input video signal to create the subtitles.
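Read end to end, the abstract describes a pipeline that can be summarised in the following minimal Python sketch. The helpers are passed in as callables because neither their names nor their signatures come from the patent; everything here is a hypothetical stand-in, not the patented implementation.

    def create_visual_subtitles(video, segment_audio, speaker_representation,
                                recognise_phones, select_viseme, animate,
                                detect_emotion, integrate, superimpose):
        # 1. Split the soundtrack into speech segments and music segments.
        speech_segments, music_segments = segment_audio(video)
        sequences = []
        for seg in speech_segments:
            face = speaker_representation(video, seg)     # face for this segment
            phones = recognise_phones(seg)                # phones with durations
            visemes = [select_viseme(p) for p in phones]  # most relevant viseme per phone
            # 2. Reconstruct lip and eyebrow movement for phones and emotion.
            sequences.append(animate(face, visemes, detect_emotion(seg)))
        # 3. Integrate with the music segments and superimpose on the video.
        return superimpose(video, integrate(sequences, music_segments))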


Patent Information

Application #: 201621011523
Filing Date: 31 March 2016
Publication Number: 46/2017
Publication Type: INA
Invention Field: COMMUNICATION
Status:
Email: iprdel@lakshmisri.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2022-11-17
Renewal Date:

Applicants

TATA CONSULTANCY SERVICES LIMITED
Nirmal Building, 9th Floor, Nariman Point, Mumbai - 400021, Maharashtra, India

Inventors

1. BHAT, Chitralekha
Tata Consultancy Services Limited, Yantra Park (STPI), 2nd Pokharan Road, Subash Nagar, Unit No. 6, Thane - 400601, Maharashtra, India
2. KOPPARAPU, Sunil Kumar
Tata Consultancy Services Limited, Yantra Park (STPI), 2nd Pokharan Road, Subash Nagar, Unit No. 6, Thane - 400601, Maharashtra, India
3. PANDA, Ashish
Tata Consultancy Services Limited, Yantra Park (STPI), 2nd Pokharan Road, Subash Nagar, Unit No. 6, Thane - 400601, Maharashtra, India

Specification

Claims:
1. A computer implemented system for creation of visually subtitled videos, wherein said system comprises:
a memory storing instructions;
a processor coupled to the memory, wherein the memory has a plurality of modules stored therein that are executable by the processor, the plurality of modules comprising:
a segmenting module to segment an audio signal with at least one speaker in a video frame from an input video signal, and a music segment from the input video signal;
an analysis module to determine the most relevant viseme from the segmented audio signal;
a synthesizer module to generate a speaker representation sequence of the segmented audio signal; and
an integration module to integrate the segmented audio signal and viseme sequence to generate visual subtitles.
2. The system as claimed in claim 1, wherein the segmenting module further comprises creating a speaker representation by separating speech segments for at least one speaker visible in the video frame, and separating music segments from an input video frame.
3. The system as claimed in claim 2, wherein the speaker representation further comprises extracting a speaker face of the at least one speaker visible in the video frame and associating the speaker face with the speech segment.
4. The system as claimed in claim 3, wherein the speaker representation for the speaker not visible in the video frame comprises creating a generic speaker face, wherein gender of the speaker is identified from the speech segment.
5. The system as claimed in claim 1, wherein the analysis module further comprises:
mapping a plurality of recognised phones to a single viseme or to multiple visemes using a language model;
determining a likelihood score of the recognised phones;
determining a viseme based language model (VLM) score for the at least one mapped viseme; and
computing a total viseme score for the at least one mapped viseme.
6. The system as claimed in claim 5, wherein the computing a total viseme score further comprises:
determining the sum of the products of the likelihood score and the VLM score when more than one phone maps to the same viseme, or determining the product of the likelihood score and the VLM score when one phone maps to one single viseme; and
selecting the most relevant viseme further comprises comparing the total viseme score and selecting the viseme with the highest total score.
7. The system as claimed in claim 1, wherein the synthesizer module further comprises:
removing the lip region and mouth region of the speaker representation; and reconstructing a lip movement and an eyebrow movement sequence using a series of facial animation points executed through an interpolation simulation corresponding to the most relevant viseme for the computed time duration of the viseme.
8. The system as claimed in claim 1, wherein the integration module further comprises integrating the music segment and speaker representation sequence in time synchrony with the input video signal.
9. A computer implemented method for creation of visually subtitled videos, wherein the method comprises:
segmenting at least one audio signal from an input video frame of a video signal by an audio segmenting module;
analysing the segmented audio signal for the input video frame to select the most relevant viseme by an analysis module;
generating a speaker representation sequence subsequent to audio segmentation by a synthesizer module; and
integrating the video signal with the speaker representation sequence to generate visual subtitles by an integration module.
10. The method as claimed in claim 9, wherein the segmenting of the at least one audio signal from an input video frame further comprises:
separating speech segments for at least one speaker visible in the video frame, and separating music segments from an input video frame; and
associating the at least one acoustic model with a speaker representation.
11. The method as claimed in claim 10, wherein the speaker representation comprises extracting a speaker face of the at least one speaker visible in the video frame and associating the speaker face with the speech segment.
12. The method as claimed in claim 10, wherein the speaker representation for the speaker not visible in the video frame comprises creating a generic speaker face, wherein gender of the speaker is identified from the speech segment.
13. The method as claimed in claim 9, wherein the selecting the most relevant viseme further comprises:
mapping a plurality of recognised phones to a single viseme or to multiple visemes using a language model;
determining a likelihood score of the recognised phones;
determining a viseme based language model (VLM) score for the at least one mapped viseme; and
computing a total viseme score for the at least one mapped viseme.
14. The method as claimed in claim 13, wherein the computing a total viseme score further comprises determining the sum of the products of the likelihood score and the VLM score when more than one phone maps to the same viseme, or determining the product of the likelihood score and the VLM score when one phone maps to one single viseme.
15. The method as claimed in claim 13, wherein selecting the most relevant viseme further comprises comparing the total viseme score and selecting the viseme with the highest total score.
16. The method as claimed in claim 9, wherein generating the speaker representation sequence further comprises removing the lip region and mouth region of the speaker representation and reconstructing a lip movement and an eyebrow movement sequence.
17. The method as claimed in claim 16, wherein generating the speaker representation sequence further comprises a series of facial animation points to form a lip shape or an eyebrow shape, executed through an interpolation simulation corresponding to the most relevant viseme for the computed time duration of the viseme.
18. The method as claimed in claim 17, wherein the eyebrow movement and the lip movement correspond to the emotion detected in the speech segment of the segmented audio.
19. The method as claimed in claim 9, wherein the integrating the video signal further comprises integrating the viseme sequence and notational representation of the music segment in time synchrony with the input video signal.
Description: As attached.
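To make the claimed steps concrete, the sketches below illustrate them in Python; every function name, signature, and parameter is a hypothetical stand-in chosen for illustration, not the patented implementation. First, the speaker-representation step of claims 2-4 and 10-12: the face of a speaker visible in the frame is associated with the speech segment, and a generic face, gendered from the audio, stands in when no speaker is visible.

    def speaker_representation(frame, speech_segment, detect_face,
                               classify_gender, generic_face):
        # Claims 3/11: associate the visible speaker's extracted face
        # with the speech segment.
        face = detect_face(frame)                  # hypothetical face detector
        if face is not None:
            return face
        # Claims 4/12: no speaker visible, so create a generic speaker face
        # whose gender is identified from the speech segment itself.
        gender = classify_gender(speech_segment)   # hypothetical audio classifier
        return generic_face(gender)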
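Next, the viseme selection of claims 5-6 and 13-15: recognised phones are mapped to visemes, the total score of a viseme is the sum of the likelihood * VLM-score products of all phones mapping to it (which reduces to a single product for a one-to-one mapping), and the highest-scoring viseme wins. A minimal sketch, with toy phone and viseme labels assumed for illustration:

    from collections import defaultdict

    def total_viseme_scores(phones, phone_to_viseme, likelihood, vlm_score):
        # Claims 6/14: pool likelihood * VLM-score products over all
        # phones that map to the same viseme.
        scores = defaultdict(float)
        for p in phones:
            v = phone_to_viseme[p]
            scores[v] += likelihood[p] * vlm_score[v]
        return scores

    def most_relevant_viseme(scores):
        # Claims 6/15: compare total scores and keep the highest.
        return max(scores, key=scores.get)

    # Toy example: /p/ and /b/ share one lip shape, so their evidence pools:
    s = total_viseme_scores(["p", "b"], {"p": "V_PB", "b": "V_PB"},
                            {"p": 0.6, "b": 0.3}, {"V_PB": 0.8})
    assert abs(s["V_PB"] - 0.72) < 1e-9   # 0.6*0.8 + 0.3*0.8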
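The synthesis step of claims 7 and 16-17 replaces the lip and mouth region and reconstructs lip and eyebrow movement from facial animation points, interpolated over the computed duration of the viseme. The claims require only an "interpolation simulation"; linear blending at an assumed 25 fps is used here purely for illustration:

    import numpy as np

    def interpolate_faps(start_pts, target_pts, duration_s, fps=25):
        # Morph a set of facial animation points from the current
        # lip/eyebrow shape to the target viseme's shape over the
        # viseme's computed duration (linear blend, 25 fps assumed).
        n_frames = max(1, round(duration_s * fps))
        t = np.linspace(0.0, 1.0, n_frames)[:, None, None]
        return (1.0 - t) * start_pts + t * target_pts   # (frames, points, 2)

    # Toy two-point lip contour opening over 120 ms (3 frames at 25 fps):
    closed = np.array([[0.0, 0.0], [1.0, 0.0]])
    opened = np.array([[0.0, -0.2], [1.0, -0.2]])
    frames = interpolate_faps(closed, opened, duration_s=0.12)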
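Finally, the integration of claims 8 and 19 keeps the speaker-representation sequence and the notational representation of the music in time synchrony with the input video. A sketch in which overlays carry explicit start/end times and compose is a hypothetical compositing callable:

    def integrate_overlays(frames, overlays, compose, fps=25):
        # Composite each overlay (speaker-face animation or music notation)
        # onto exactly those frames that fall inside its [start, end)
        # interval, keeping the overlays time-synchronous with the video.
        out = []
        for i, frame in enumerate(frames):
            t = i / fps                            # frame timestamp in seconds
            for start_s, end_s, image in overlays:
                if start_s <= t < end_s:
                    frame = compose(frame, image)  # hypothetical compositor
            out.append(frame)
        return out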

Documents

Application Documents

# Name Date
1 Form 5 [31-03-2016(online)].pdf 2016-03-31
2 Form 3 [31-03-2016(online)].pdf 2016-03-31
3 Form 18 [31-03-2016(online)].pdf 2016-03-31
4 Drawing [31-03-2016(online)].pdf 2016-03-31
5 Description(Complete) [31-03-2016(online)].pdf 2016-03-31
6 201621011523-FORM 1-(21-04-2016).pdf 2016-04-21
7 201621011523-CORRESPONDENCE-(21-04-2016).pdf 2016-04-21
8 201621011523-POWER OF AUTHORITY-(28-04-2016).pdf 2016-04-28
9 201621011523-CORRESPONDENCE-(28-04-2016).pdf 2016-04-28
10 REQUEST FOR CERTIFIED COPY [03-04-2017(online)].pdf 2017-04-03
11 201621011523-CORRESPONDENCE(IPO)-(CERTIFIED LETTER)-(21-04-2017).pdf 2017-04-21
12 201621011523-CORRESPONDENCE(IPO)-(DISPATCH LETTER)-(24-04-2017).pdf 2017-04-24
13 Form 3 [19-05-2017(online)].pdf 2017-05-19
14 Abstract.jpg 2018-08-11
15 201621011523-FER.pdf 2019-01-04
16 201621011523-FORM 3 [22-05-2019(online)].pdf 2019-05-22
17 201621011523-OTHERS [23-05-2019(online)].pdf 2019-05-23
18 201621011523-FER_SER_REPLY [23-05-2019(online)].pdf 2019-05-23
19 201621011523-DRAWING [23-05-2019(online)].pdf 2019-05-23
20 201621011523-COMPLETE SPECIFICATION [23-05-2019(online)].pdf 2019-05-23
21 201621011523-CLAIMS [23-05-2019(online)].pdf 2019-05-23
22 201621011523-US(14)-HearingNotice-(HearingDate-18-10-2022).pdf 2022-08-25
23 201621011523-Correspondence to notify the Controller [27-08-2022(online)].pdf 2022-08-27
24 201621011523-FORM-26 [14-10-2022(online)].pdf 2022-10-14
25 201621011523-Written submissions and relevant documents [01-11-2022(online)].pdf 2022-11-01
26 201621011523-PatentCertificate17-11-2022.pdf 2022-11-17
27 201621011523-IntimationOfGrant17-11-2022.pdf 2022-11-17

Search Strategy

1 searchStrategy_16-11-2018.pdf
2 D1-NPL_16-11-2018.pdf

ERegister / Renewals

3rd: 08 Dec 2022 (31/03/2018 to 31/03/2019)
4th: 08 Dec 2022 (31/03/2019 to 31/03/2020)
5th: 08 Dec 2022 (31/03/2020 to 31/03/2021)
6th: 08 Dec 2022 (31/03/2021 to 31/03/2022)
7th: 08 Dec 2022 (31/03/2022 to 31/03/2023)
8th: 08 Dec 2022 (31/03/2023 to 31/03/2024)
9th: 14 Mar 2024 (31/03/2024 to 31/03/2025)
10th: 27 Mar 2025 (31/03/2025 to 31/03/2026)