
System And Method For Generating Highlights Of Media Content

Abstract: A system and associated method for accurately generating summaries of media content belonging to different types of genres is provided. The method includes receiving a media file (109) including an entire media content and selecting a first segment of the media content from an opening segment of the media file (109) and a second segment of the media content from a closing segment of the media file (109). Further, the method includes determining text, audio, and video parameters from the first segment and the second segment. Additionally, the method employs a genre identification system (116) that identifies a genre of the media content as a reality show or as one of other genres based on whether the determined text, audio, and video parameters adhere to a first set of rules. A new media file including the summary of the media content is generated based on the identified genre.


Patent Information

Application #:
Filing Date: 31 March 2022
Publication Number: 15/2022
Publication Type: INA
Invention Field: COMMUNICATION
Status:
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2025-01-30
Renewal Date:

Applicants

TATA ELXSI LIMITED
ITPB Road, Whitefield, Bangalore – 560048, India

Inventors

1. LIPIKA SREEDHARAN
TATA ELXSI LIMITED, ITPB Road, Whitefield, Bangalore – 560048, India
2. JAGAN SESHADRI
TATA ELXSI LIMITED, ITPB Road, Whitefield, Bangalore – 560048, India
3. ANUP SRIMANGALAM SOMASEKHARAN NAIR
TATA ELXSI LIMITED, ITPB Road, Whitefield, Bangalore – 560048, India
4. BISWAJIT BISWAS
TATA ELXSI LIMITED, ITPB Road, Whitefield, Bangalore – 560048, India

Specification

Claims:
We claim:

1. A method for generating a summary of media content, comprising:
receiving a media file (109) comprising an entire media content and audio associated with the media content by a highlights generation system (100) from a content server (102);
selecting a first segment of the media content from an opening segment of the media file (109) and a second segment of the media content from a closing segment of the media file (109);
determining one or more text parameters from the first segment and the second segment, wherein the one or more text parameters comprise a character density and a duration of scrolling text in the first segment and the second segment;
determining one or more audio parameters from audio associated with the first segment and the second segment, wherein the one or more audio parameters comprise a duration of music sequence in the first segment and the second segment, an amount of speech in the first segment and the second segment, and an average audio intensity associated with the first segment and the second segment;
determining one or more video parameters from frames associated with the first segment and the second segment, wherein the one or more video parameters comprise attributes of apparels worn by individuals in the associated frames, and a percentage of frames in a magnified mode in the associated frames;
identifying a genre of the media content as a reality show by a genre identification system (116) when the determined text parameters, audio parameters, and video parameters adhere to a first set of rules, wherein the first set of rules defines that the duration of music sequence is within a first threshold range, character density is less than a second threshold, duration of scrolling text is less than a third threshold range, average audio intensity is less than a fourth threshold, percentage of frames in the magnified mode is less than a fifth threshold, attributes of apparels worn by individuals in the associated frames are different, and amount of speech is less than a sixth threshold;
identifying the genre of the media content as one of sports, television serial, movie, education, and news when the determined text parameters, audio parameters, and video parameters fail to adhere to the first set of rules; and
generating a new media file comprising the summary of the media content based on the identified genre of the media content.

2. The method as claimed in claim 1, wherein the one or more audio parameters further comprise a duration of monologue in the first segment and the second segment, and wherein the one or more video parameters further comprise a percentage of subset of frames in the associated frames comprising a facial image of a particular person, and a total number of users in the first segment and the second segment.

3. The method as claimed in claim 1, wherein the genre identification system (116) identifies a sub-genre of the media content as:
a singing reality show when the determined text parameters, audio parameters, and video parameters adhere to the first set of rules and when an offset distance of a particular pixel from one frame to another frame in the associated frames is less than a designated threshold; and
a dancing reality show when the determined text parameters, audio parameters, and video parameters adhere to the first set of rules and when the offset distance of the particular pixel from one frame to another frame in the associated frames is greater than the designated threshold.

4. The method as claimed in claim 2, wherein the genre identification system (116) identifies the genre of the media content as sports when the duration of music sequence is less than the first threshold range, character density is less than the second threshold, duration of scrolling text is less than the third threshold range, percentage of frames in the magnified mode is greater than the fifth threshold, attributes of apparels worn by individuals in the associated frames are same, amount of speech is less than the sixth threshold, and total number of users is greater than a seventh threshold;
wherein the genre identification system (116) identifies the genre of the media content as a serial when the duration of music sequence is less than the first threshold range, character density is less than the second threshold, duration of scrolling text is less than the third threshold range, percentage of frames in the magnified mode is less than the fifth threshold, attributes of apparels worn by individuals in the associated frames are different, amount of speech is greater than the sixth threshold, and total number of users is less than the seventh threshold;
wherein the genre identification system (116) identifies the genre of the media content as an educational video when the duration of music sequence is less than the first threshold range, character density is less than the second threshold, duration of scrolling text is less than the third threshold range, percentage of frames in the magnified mode is less than the fifth threshold, amount of speech is greater than the sixth threshold, attributes of apparels worn by individuals in the associated frames are different, percentage of subset of frames in the associated frames comprising the facial image of the particular person is greater than an eighth threshold, and duration of monologue is greater than a ninth threshold; and
wherein the genre identification system (116) identifies the genre of the media content as news when the character density is greater than the second threshold, duration of scrolling text is greater than the third threshold range, amount of speech is greater than the sixth threshold, percentage of subset of frames in the associated frames comprising the facial image of the particular person is less than the eighth threshold, and duration of monologue is greater than the ninth threshold.

5. The method as claimed in claim 2, wherein the genre identification system (116) identifies the genre of the media content as movie when the duration of music sequence is less than the first threshold range, character density is greater than the second threshold, duration of scrolling text is greater than the third threshold range, percentage of frames in the magnified mode is less than the fifth threshold, attributes of apparels worn by individuals in the associated frames are different, amount of speech is less than the sixth threshold, and total number of users is less than the seventh threshold.

6. The method as claimed in claim 5, further comprising:
extracting a plurality of segments comprising the first segment and the second segment from the media content upon identifying that the genre of the media content is movie;
extracting audio associated with the plurality of segments from the media file (109);
determining a level of noise and an amount of speech comprising monologues and dialogues in the plurality of segments by analyzing the extracted audio;
selecting a pixel from a frame associated with a particular segment selected from the plurality of segments and determining an offset distance of the selected pixel from one frame to another frame in the particular segment;
determining one or more segments in the plurality of segments comprising a sequence of video frames whose frame-to-frame pixel difference is continuously greater than a particular threshold;
identifying a sub-genre of the media content as a drama movie when the level of noise is less than a noise threshold, amount of speech segments is greater than a speech threshold, offset distance of the selected pixel is less than a distance threshold, and a total number of the determined segments is less than a particular threshold; and
identifying the sub-genre of the media content as an action movie when the level of noise is greater than the noise threshold, offset distance of the selected pixel is greater than the distance threshold, and total number of the determined segments is equal to or greater than the particular threshold.

7. The method as claimed in claim 1, further comprising:
identifying a plurality of shots in the media content with corresponding confidence values when the genre of the media content is identified to be one of sports, television serial, movie, education, and news;
selecting values of configuration parameters that are specific to the genre of the media content from a database of the highlights generation system (100), wherein the configuration parameters comprise a minimum shot duration and a confidence threshold;
selecting a set of shots from the plurality of shots whose associated shot duration is equal to or greater than the minimum shot duration;
selecting only a subset of shots from the set of shots whose associated confidence values are equal to or greater than the confidence threshold;
extracting audios corresponding to the subset of shots from the media file (109);
determining an average audio intensity of each of the subset of shots by analyzing the extracted audios; and
selecting candidate shots from the subset of shots whose associated average audio intensities are greater than a defined threshold, wherein the candidate shots selected from the subset of shots are candidates for generating the summary of the media content.

8. The method as claimed in claim 7, further comprising:
estimating a level of noise in an audio segment associated with each of the candidate shots using a first machine learning system in an audio processing system (112);
reducing the level of noise in the audio segment using a second machine learning system in the audio processing system (112);
verifying if speech in the audio segment associated with each of the candidate shots is complete using a third machine learning system in the audio processing system (112);
identifying that speech in a specific audio segment associated with a specific candidate shot selected from the plurality of candidate shots is incomplete using the third machine learning system, wherein the specific audio segment comprises an associated start timestamp and an associated end timestamp;
modifying the selected candidate shot by adding a first audio of a designated length to the specific audio segment prior to the associated start timestamp and a second audio of the designated length after the associated end timestamp to extend a length of the specific audio segment and to complete speech in the specific audio segment; and
generating the summary of the media content by stitching one or more of the candidate shots and the modified candidate shot.

9. The method as claimed in claim 8, wherein the media content corresponds to a television serial, wherein the method comprises stitching the summary of a particular episode of the television serial with a subsequent episode of the television serial prior to an opening segment of the subsequent episode to obtain a stitched media content, and automatically transmitting the stitched media content to a plurality of user devices (130A-N), wherein the summary corresponds to a recap of the particular episode.

10. The method as claimed in claim 8, further comprising:
identifying emotions of one or more users in each of the candidate shots by processing image frames associated with the candidate shots;
assigning ranks to the candidate shots based on associated average audio intensities and emotions of the users in the candidate shots;
generating a highlights of the media content of a desired length based on the ranks assigned to the candidate shots; and
uploading one or more of the highlights and the summary of the media content to one or more of a content delivery server, an over-the-top platform, a video-on-demand platform, and a social media platform via a communication link (104).

11. The method as claimed in claim 10, wherein the media content corresponds to a television serial, wherein the method comprises stitching the highlights of a subsequent episode of the television serial with a present-day episode of the television serial post to a closing segment of the present-day episode to obtain a stitched media content, and automatically transmitting the stitched media content to a plurality of user devices (130A-N), wherein the highlights corresponds to a preview of the subsequent episode.

12. The method as claimed in claim 1, further comprising:
converting the audio associated with the media content into a spectrogram when the genre of the media content is identified to be the reality show;
identifying a plurality of performance segments and a plurality of speech segments in the audio associated with the media content using a machine learning system in an audio processing system (112);
storing the plurality of performance segments along with their corresponding start times and corresponding end times in a database associated with the highlights generation system (100);
extracting videos associated with the plurality of performance segments from the media file (109) based on the corresponding start times and corresponding end times, wherein each video selected from the videos comprises a performance of a contestant, wherein the performance corresponds to one of a singing performance and a dancing performance;
generating the summary of the media content by stitching one or more of the videos and corresponding audios associated with the plurality of performance segments; and
adding an additional video segment of a designated length at the end of each of the videos to extend lengths of the videos.

13. The method as claimed in claim 12, further comprising:
detecting one or more non-moveable objects in a video selected from the videos by processing image frames associated with the video, wherein the one or more detected non-moveable objects comprise chairs, a table, and a microphone;
identifying a set of users who are in proximity to the non-moveable objects in the video by processing image frames associated with the video and recognizing the set of users as judges of the reality show;
identifying emotions of the judges in each of the videos by processing associated image frames;
processing image frames associated with each of the videos to identify if each of the videos comprises an event of standing ovation from the judges;
assigning ranks to the videos based on emotions of the judges in the videos and presence or absence of standing ovations in the videos; and
generating a highlights of the media content of a desired length based on the ranks assigned to the videos.

14. The method as claimed in claim 13, wherein the method comprises uploading one or more of the highlights and the summary of the media content to one or more of a content delivery server, an over-the-top platform, a video-on-demand platform, and a social media platform via a communication link (104).

15. A system for generating a summary of media content, comprising:
a content server (102) comprising a media content storage system (106) that stores a plurality of media files (109);
a highlights generation system (100) that is communicatively coupled to the content server (102) via a communication link (104), wherein the highlights generation system (100):
receives a media file (109) selected from the plurality of media files (109) from the content server (102) via the communication link (104), wherein the received media file (109) comprises an entire media content and audio associated with the media content;
selects a first segment of the media content from an opening segment of the media file (109) and a second segment of the media content from a closing segment of the media file (109), wherein the highlights generation system (100) comprises:
a text processing system (110) that determines one or more text parameters from the first segment and the second segment, wherein the one or more text parameters comprise a character density and a duration of scrolling text in the first segment and the second segment;
an audio processing system (112) that determines one or more audio parameters from audio associated with the first segment and the second segment, wherein the one or more audio parameters comprise a duration of music sequence in the first segment and the second segment, an amount of speech in the first segment and the second segment, and an average audio intensity associated with the first segment and the second segment;
a video processing system (114) that determines one or more video parameters from frames associated with the first segment and the second segment, wherein the one or more video parameters comprise attributes of apparels worn by individuals in the associated frames, and a percentage of frames in a magnified mode in the associated frames;
a genre identification system (116) that:
identifies a genre of the media content as a reality show when the determined text parameters, audio parameters, and video parameters adhere to a first set of rules, wherein the first set of rules defines that the duration of music sequence is within a first threshold range, character density is less than a second threshold, duration of scrolling text is less than a third threshold range, average audio intensity is less than a fourth threshold, percentage of frames in the magnified mode is less than a fifth threshold, attributes of apparels worn by individuals in the associated frames are different, and amount of speech is less than a sixth threshold; and
identifies the genre of the media content as one of sports, television serial, movie, education, and news when the determined text parameters, audio parameters, and video parameters fail to adhere to the first set of rules, and
a summary generation system (118) that generates a new media file comprising the summary of the media content based on the identified genre of the media content.

16. The system as claimed in claim 15, wherein the system comprises one or more of a broadcast system, an over-the-top system, and a surveillance system.
Description:

SYSTEM AND METHOD FOR GENERATING HIGHLIGHTS OF MEDIA CONTENT

RELATED ART

[0001] Embodiments of the present disclosure relate generally to summarization of media content. More particularly, the present disclosure relates to a system and method for automatically generating highlights of the media content for previews.
[0002] Content distributors such as television (TV) channel providers, over-the-top (OTT) media content providers, and video-on-demand providers create highlights of media content for different purposes. For example, a TV channel provider creates highlights that include the best moments or scenes of a TV serial episode, and uses the highlights to provide a quick recap of a previously telecast episode or a preview of an upcoming episode. In another example, an OTT media content provider creates a highlights reel showcasing the exciting moments from a sports match for televised audiences.
[0003] Presently, certain content distributors employ human operators who create such highlights by manually reviewing the entire media content, and selecting and stitching specific segments that they personally find important or interesting. However, manually reviewing the entire media content, for example movies and cricket matches that are multiple hours long, to select specific interesting segments requires a great deal of time, resources, and manual effort from an army of human operators, leading to prohibitive costs and management complexity. Additionally, manually created highlights are subjective in nature, as different people find different segments of media content interesting. Accordingly, highlights of the media content created manually by one person may not be the same as highlights of the same media content created manually by another person.
[0004] Therefore, certain present day approaches have been explored to aid content providers to automatically generate highlights of media content. For example, US patent US10455297B1 describes a system that automatically categorizes scenes from a video and generates a customized summary of the video based on user preference. Specifically, the system analyzes frames associated with various scenes in the video and interprets meaning or context of those scenes. For example, the system analyzes a set of frames associated with a particular scene in the video using an object recognition algorithm and identifies that the particular scene includes a car chasing another car. In this example, the system categorizes the scene in an action category.
[0005] In another example, the system analyzes a set of frames associated with a different scene in the video using image recognition algorithms and identifies that the scene includes male and female characters embracing each other. In this example, the system categorizes the scene into a romance category. Subsequently, the system provides a customized summary of the video to a user based on his or her preferences. For example, the system provides the car chase scene in the customized summary to the user when the action category is highly preferred by the user.
[0006] Though the system described in the US patent US10455297B1 generates a summary of the video based on user preference, the system must perform an enormous amount of processing to understand and interpret the meaning of the thousands of scenes in the video, and therefore requires significant computational resources and time. This increased demand for computational resources and time, in turn, leads to a prohibitively expensive, overly complex, and non-scalable system. Moreover, such a system cannot be easily extended to cover other types of media content such as educational videos, news programs, and reality shows that lack such categorizations.
[0007] Accordingly, there is a need for an improved system and associated method for automatically generating highlights of such media content accurately and efficiently with minimal utilization of computation resources.

BRIEF DESCRIPTION

[0008] It is an objective of the present disclosure to provide a method for generating a summary of media content. The method includes receiving a media file including an entire media content and audio associated with the media content by a highlights generation system from a content server. The method further includes selecting a first segment of the media content from an opening segment of the media file and a second segment of the media content from a closing segment of the media file. Further, the method includes determining one or more text parameters from the first segment and the second segment. The one or more text parameters include a character density and a duration of scrolling text in the first segment and the second segment. Furthermore, the method includes determining one or more audio parameters from audio associated with the first segment and the second segment. The one or more audio parameters include a duration of music sequence in the first segment and the second segment, an amount of speech in the first segment and the second segment, and an average audio intensity associated with the first segment and the second segment.
[0009] In addition, the method includes determining one or more video parameters from frames associated with the first segment and the second segment. The one or more video parameters include attributes of apparels worn by individuals in the associated frames, and a percentage of frames in a magnified mode in the associated frames. Moreover, the method includes identifying a genre of the media content as a reality show by a genre identification system when the determined text parameters, audio parameters, and video parameters adhere to a first set of rules. The first set of rules defines that the duration of music sequence is within a first threshold range, character density is less than a second threshold, duration of scrolling text is less than a third threshold range, and average audio intensity is less than a fourth threshold. The first set of rules defines that the percentage of frames in the magnified mode is less than a fifth threshold, attributes of apparels worn by individuals in the associated frames are different, and amount of speech is less than a sixth threshold. Further, the method includes identifying the genre of the media content as one of sports, television serial, movie, education, and news when the determined text parameters, audio parameters, and video parameters fail to adhere to the first set of rules. Furthermore, the method includes generating a new media file including the summary of the media content based on the identified genre of the media content.
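For illustration only, the reality-show rule check described above can be thought of as a single predicate over the measured parameters. The following Python sketch is not part of the specification; the field names and threshold keys are hypothetical placeholders, and the values would come from the text, audio, and video processing systems:

```python
from dataclasses import dataclass

@dataclass
class SegmentParameters:
    # Assumed to be measured over the combined first and second segments.
    music_duration_s: float        # duration of music sequences (seconds)
    character_density: float       # on-screen text characters per frame
    scrolling_text_s: float        # duration of scrolling text (seconds)
    avg_audio_intensity: float     # average audio intensity
    magnified_frames_pct: float    # percentage of frames in a magnified mode
    apparel_differs: bool          # True if apparel attributes differ across individuals
    speech_s: float                # amount of speech (seconds)

def is_reality_show(p: SegmentParameters, t: dict) -> bool:
    """Return True when every rule in the first set of rules is satisfied."""
    return (t["music_min"] <= p.music_duration_s <= t["music_max"]  # first threshold range
            and p.character_density < t["char_density"]             # second threshold
            and p.scrolling_text_s < t["scroll_text"]               # third threshold
            and p.avg_audio_intensity < t["audio_intensity"]        # fourth threshold
            and p.magnified_frames_pct < t["magnified_pct"]         # fifth threshold
            and p.apparel_differs                                    # apparel attributes differ
            and p.speech_s < t["speech"])                            # sixth threshold
```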
[0010] The one or more audio parameters further include a duration of monologue in the first segment and the second segment. The one or more video parameters further include a percentage of subset of frames in the associated frames including a facial image of a particular person, and a total number of users in the first segment and the second segment. The genre identification system identifies a sub-genre of the media content as a singing reality show when the determined text parameters, audio parameters, and video parameters adhere to the first set of rules and when an offset distance of a particular pixel from one frame to another frame in the associated frames is less than a designated threshold. The genre identification system identifies the sub-genre of the media content as a dancing reality show when the determined text parameters, audio parameters, and video parameters adhere to the first set of rules and when the offset distance of the particular pixel from one frame to another frame in the associated frames is greater than the designated threshold.
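The singing/dancing distinction above reduces to comparing a per-frame pixel displacement against a designated threshold. A minimal sketch, assuming the pixel has already been tracked (for example by optical flow) and its per-frame coordinates are available; names are hypothetical:

```python
import math

def mean_pixel_offset(positions):
    """positions: list of (x, y) coordinates of the tracked pixel, one per frame.
    Returns the mean Euclidean displacement between consecutive frames."""
    if len(positions) < 2:
        return 0.0
    offsets = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    return sum(offsets) / len(offsets)

def reality_show_subgenre(positions, distance_threshold):
    """Small displacement suggests a singing show; large displacement a dancing show."""
    return "singing" if mean_pixel_offset(positions) < distance_threshold else "dancing"
```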
[0011] The genre identification system identifies the genre of the media content as sports when the duration of music sequence is less than the first threshold range, character density is less than the second threshold, duration of scrolling text is less than the third threshold range, and percentage of frames in the magnified mode is greater than the fifth threshold. Further, the genre identification system identifies the genre of the media content as sports only when the attributes of apparels worn by individuals in the associated frames are same, amount of speech is less than the sixth threshold, and total number of users is greater than a seventh threshold. The genre identification system identifies the genre of the media content as a serial when the duration of music sequence is less than the first threshold range, character density is less than the second threshold, duration of scrolling text is less than the third threshold range, and percentage of frames in the magnified mode is less than the fifth threshold. Further, the genre identification system identifies the genre of the media content as the serial only when the attributes of apparels worn by individuals in the associated frames are different, amount of speech is greater than the sixth threshold, and total number of users is less than the seventh threshold.
[0012] The genre identification system identifies the genre of the media content as an educational video when the duration of music sequence is less than the first threshold range, character density is less than the second threshold, and duration of scrolling text is less than the third threshold range. Further, the genre identification system identifies the genre of the media content as the educational video only when the percentage of frames in the magnified mode is less than the fifth threshold, amount of speech is greater than the sixth threshold, and attributes of apparels worn by individuals in the associated frames are different. Furthermore, the genre identification system identifies the genre of the media content as the educational video only when the percentage of subset of frames in the associated frames including the facial image of the particular person is greater than an eighth threshold, and duration of monologue is greater than a ninth threshold. The genre identification system identifies the genre of the media content as news when the character density is greater than the second threshold, duration of scrolling text is greater than the third threshold range, and amount of speech is greater than the sixth threshold. Further, the genre identification system identifies the genre of the media content as news only when the percentage of subset of frames in the associated frames including the facial image of the particular person is less than the eighth threshold, and duration of monologue is greater than the ninth threshold.
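The sports, serial, education, and news rules in the two paragraphs above amount to a small decision table. A hedged sketch of how such a table might be coded, with both the parameter dictionary and the threshold keys being hypothetical:

```python
def classify_other_genre(p: dict, t: dict):
    """Return a genre label, or None when no rule set matches."""
    common = (p["music_s"] < t["music_min"]
              and p["char_density"] < t["char_density"]
              and p["scroll_text_s"] < t["scroll_text"])

    if (common and p["magnified_pct"] > t["magnified_pct"]
            and not p["apparel_differs"] and p["speech_s"] < t["speech"]
            and p["user_count"] > t["user_count"]):
        return "sports"
    if (common and p["magnified_pct"] < t["magnified_pct"]
            and p["apparel_differs"] and p["speech_s"] > t["speech"]
            and p["user_count"] < t["user_count"]):
        return "television serial"
    if (common and p["magnified_pct"] < t["magnified_pct"]
            and p["speech_s"] > t["speech"] and p["apparel_differs"]
            and p["face_pct"] > t["face_pct"]
            and p["monologue_s"] > t["monologue"]):
        return "education"
    if (p["char_density"] > t["char_density"] and p["scroll_text_s"] > t["scroll_text"]
            and p["speech_s"] > t["speech"] and p["face_pct"] < t["face_pct"]
            and p["monologue_s"] > t["monologue"]):
        return "news"
    return None
```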
[0013] The genre identification system identifies the genre of the media content as movie when the duration of music sequence is less than the first threshold range, character density is greater than the second threshold, duration of scrolling text is greater than the third threshold range, and percentage of frames in the magnified mode is less than the fifth threshold. Further, the genre identification system identifies the genre of the media content as movie only when the attributes of apparels worn by individuals in the associated frames are different, amount of speech is less than the sixth threshold, and total number of users is less than the seventh threshold. The method includes extracting a plurality of segments including the first segment and the second segment from the media content upon identifying that the genre of the media content is movie. The method further includes extracting audio associated with the plurality of segments from the media file. Further, the method includes determining a level of noise and an amount of speech including monologues and dialogues in the plurality of segments by analyzing the extracted audio.
[0014] Furthermore, the method includes selecting a pixel from a frame associated with a particular segment selected from the plurality of segments and determining an offset distance of the selected pixel from one frame to another frame in the particular segment. In addition, the method includes determining one or more segments in the plurality of segments including a sequence of video frames whose frame-to-frame pixel difference is continuously greater than a particular threshold. Moreover, a sub-genre of the media content is identified as a drama movie when the level of noise is less than a noise threshold, amount of speech segments is greater than a speech threshold, offset distance of the selected pixel is less than a distance threshold, and a total number of the determined segments is less than a particular threshold. Alternatively, the sub-genre of the media content is identified as an action movie when the level of noise is greater than the noise threshold, offset distance of the selected pixel is greater than the distance threshold, and total number of the determined segments is equal to or greater than the particular threshold.
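As an illustrative aid, the drama/action sub-genre decision and the counting of high-motion segments can be sketched as follows. This is not the claimed implementation; the threshold names and the way frame differences are supplied are assumptions:

```python
def count_high_motion_segments(frame_diffs, diff_threshold):
    """Count runs of consecutive frame-to-frame pixel differences above diff_threshold."""
    runs, in_run = 0, False
    for d in frame_diffs:
        if d > diff_threshold and not in_run:
            runs, in_run = runs + 1, True
        elif d <= diff_threshold:
            in_run = False
    return runs

def movie_subgenre(noise_level, speech_s, pixel_offset, high_motion_segments, t):
    """Paraphrase of the drama/action rules; t holds hypothetical thresholds."""
    if (noise_level < t["noise"] and speech_s > t["speech"]
            and pixel_offset < t["distance"]
            and high_motion_segments < t["motion_segments"]):
        return "drama"
    if (noise_level > t["noise"] and pixel_offset > t["distance"]
            and high_motion_segments >= t["motion_segments"]):
        return "action"
    return "unclassified"
```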
[0015] The method includes identifying a plurality of shots in the media content with corresponding confidence values when the genre of the media content is identified to be one of sports, television serial, movie, education, and news. Further, the method includes selecting values of configuration parameters that are specific to the genre of the media content from a database of the highlights generation system. The configuration parameters include a minimum shot duration and a confidence threshold. Furthermore, the method includes selecting a set of shots from the plurality of shots whose associated shot duration is equal to or greater than the minimum shot duration. In addition, the method includes selecting only a subset of shots from the set of shots whose associated confidence values are equal to or greater than the confidence threshold. Moreover, the method includes extracting audios corresponding to the subset of shots from the media file, determining an average audio intensity of each of the subset of shots by analyzing the extracted audios, and selecting candidate shots from the subset of shots whose associated average audio intensities are greater than a defined threshold. The candidate shots selected from the subset of shots are candidates for generating the summary of the media content.
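The shot-filtering chain in the preceding paragraph (minimum duration, then confidence, then audio intensity) could be sketched as below. The shot representation, configuration keys, and example values are assumptions for illustration only:

```python
def select_candidate_shots(shots, config, audio_intensity_of):
    """shots: iterable of dicts with 'duration' and 'confidence' fields.
    config: genre-specific values, e.g. {'min_shot_duration': 4.0,
    'confidence_threshold': 0.7, 'intensity_threshold': -20.0} (illustrative only).
    audio_intensity_of: callable returning the average audio intensity of a shot."""
    long_enough = [s for s in shots if s["duration"] >= config["min_shot_duration"]]
    confident = [s for s in long_enough if s["confidence"] >= config["confidence_threshold"]]
    return [s for s in confident
            if audio_intensity_of(s) > config["intensity_threshold"]]
```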
[0016] The method includes estimating a level of noise in an audio segment associated with each of the candidate shots using a first machine learning system in an audio processing system. Further, the method includes reducing the level of noise in the audio segment using a second machine learning system in the audio processing system. Furthermore, the method includes verifying if speech in the audio segment associated with each of the candidate shots is complete using a third machine learning system in the audio processing system. In addition, the method includes identifying that speech in a specific audio segment associated with a specific candidate shot selected from the plurality of candidate shots is incomplete using the third machine learning system. The specific audio segment includes an associated start timestamp and an associated end timestamp. The method further includes modifying the selected candidate shot by adding a first audio of a designated length to the specific audio segment prior to the associated start timestamp and a second audio of the designated length after the associated end timestamp to extend a length of the specific audio segment and to complete speech in the specific audio segment. Moreover, the method includes generating the summary of the media content by stitching one or more of the candidate shots and the modified candidate shot.
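Only the timestamp arithmetic of the padding step lends itself to a short sketch; the speech-completeness check itself is assumed to come from the third machine learning system mentioned above. Names are hypothetical:

```python
def pad_incomplete_shot(start_s, end_s, pad_s, media_duration_s):
    """Extend an audio segment by a designated length on both sides so that clipped
    speech can be completed, clamped to the bounds of the media file."""
    new_start = max(0.0, start_s - pad_s)
    new_end = min(media_duration_s, end_s + pad_s)
    return new_start, new_end

# Example: pad a 12 s segment by 2 s on each side within a 1-hour file (values illustrative).
# pad_incomplete_shot(600.0, 612.0, 2.0, 3600.0)  -> (598.0, 614.0)
```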
[0017] The media content corresponds to a television serial. The method includes stitching the summary of a particular episode of the television serial with a subsequent episode of the television serial prior to an opening segment of the subsequent episode to obtain a stitched media content, and automatically transmitting the stitched media content to a plurality of user devices. The summary corresponds to a recap of the particular episode. The method includes identifying emotions of one or more users in each of the candidate shots by processing image frames associated with the candidate shots. Further, the method includes assigning ranks to the candidate shots based on associated average audio intensities and emotions of the users in the candidate shots, and generating a highlights of the media content of a desired length based on the ranks assigned to the candidate shots. Furthermore, the method includes uploading one or more of the highlights and the summary of the media content to one or more of a content delivery server, an over-the-top platform, a video-on-demand platform, and a social media platform via a communication link.
[0018] The media content corresponds to a television serial. The method includes stitching the highlights of a subsequent episode of the television serial with a present-day episode of the television serial post to a closing segment of the present-day episode to obtain a stitched media content, and automatically transmitting the stitched media content to a plurality of user devices. The highlights corresponds to a preview of the subsequent episode. The method includes converting the audio associated with the media content into a spectrogram when the genre of the media content is identified to be the reality show, and identifying a plurality of performance segments and a plurality of speech segments in the audio associated with the media content using a machine learning system in an audio processing system. Further, the method includes storing the plurality of performance segments along with their corresponding start times and corresponding end times in a database associated with the highlights generation system. Furthermore, the method includes extracting videos associated with the plurality of performance segments from the media file based on the corresponding start times and corresponding end times. Each video selected from the videos includes a performance of a contestant. The performance corresponds to one of a singing performance and a dancing performance.
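Of the steps above, the spectrogram conversion and the grouping of classified frames into performance segments can be illustrated with a short sketch. The classifier itself stands in for the machine learning system and is assumed to be provided; the SciPy calls and the one-second framing are illustrative choices, not part of the specification:

```python
from scipy.io import wavfile
from scipy.signal import spectrogram

def performance_segments(wav_path, classify_frame, hop_s=1.0):
    """Convert audio to a spectrogram and group frames labelled 'performance'
    into (start_time, end_time) segments.

    classify_frame: stand-in for the machine learning system; takes one
    spectrogram column and returns 'performance' or 'speech'."""
    sr, samples = wavfile.read(wav_path)
    if samples.ndim > 1:                     # mix down to mono
        samples = samples.mean(axis=1)
    freqs, times, spec = spectrogram(samples, fs=sr, nperseg=int(sr * hop_s))

    segments, start = [], None
    for t, column in zip(times, spec.T):     # one column per time step
        if classify_frame(column) == "performance":
            if start is None:
                start = t
        elif start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, times[-1]))
    return segments
```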
[0019] In addition, the method includes generating the summary of the media content by stitching one or more of the videos and corresponding audios associated with the plurality of performance segments, and adding an additional video segment of a designated length at the end of each of the videos to extend lengths of the videos. Moreover, the method includes detecting one or more non-moveable objects in a video selected from the videos by processing image frames associated with the video. The one or more detected non-moveable objects include chairs, a table, and a microphone. A set of users who are in proximity to the non-moveable objects in the video is identified by processing image frames associated with the video and recognizing the set of users as judges of the reality show. The method includes identifying emotions of the judges in each of the videos by processing associated image frames, and processing image frames associated with each of the videos to identify if each of the videos includes an event of standing ovation from the judges. Further, the method includes assigning ranks to the videos based on emotions of the judges in the videos and presence or absence of standing ovations in the videos, and generating a highlights of the media content of a desired length based on the ranks assigned to the videos. Furthermore, the method includes uploading one or more of the highlights and the summary of the media content to one or more of a content delivery server, an over-the-top platform, a video-on-demand platform, and a social media platform via a communication link.
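The ranking-and-assembly step at the end of the paragraph above could look roughly like the sketch below. The per-video fields, the scoring weight for a standing ovation, and the greedy length budget are all illustrative assumptions:

```python
def rank_performances(videos):
    """videos: list of dicts such as
    {'id': 'perf_01', 'duration_s': 95.0, 'judge_emotion_score': 0.8, 'standing_ovation': True}.
    Higher score means higher rank; an ovation adds an assumed fixed bonus."""
    def score(v):
        return v["judge_emotion_score"] + (1.0 if v["standing_ovation"] else 0.0)
    return sorted(videos, key=score, reverse=True)

def build_highlights(ranked_videos, desired_length_s):
    """Greedily take top-ranked clips until the desired highlights length is reached."""
    selected, total = [], 0.0
    for v in ranked_videos:
        if total + v["duration_s"] > desired_length_s:
            break
        selected.append(v)
        total += v["duration_s"]
    return selected
```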
[0020] It is another objective of the present disclosure to provide a system for generating a summary of media content. The system includes a content server and a highlights generation system. The content server includes a media content storage system that stores a plurality of media files. The highlights generation system is communicatively coupled to the content server via a communication link. The highlights generation system receives a media file selected from the plurality of media files from the content server via the communication link. The received media file includes an entire media content and audio associated with the media content. The highlights generation system selects a first segment of the media content from an opening segment of the media file and a second segment of the media content from a closing segment of the media file. The highlights generation system includes a text processing system, an audio processing system, a video processing system, a genre identification system, and a summary generation system. The text processing system determines one or more text parameters from the first segment and the second segment. The one or more text parameters include a character density and a duration of scrolling text in the first segment and the second segment. The audio processing system determines one or more audio parameters from audio associated with the first segment and the second segment. The one or more audio parameters include a duration of music sequence in the first segment and the second segment, an amount of speech in the first segment and the second segment, and an average audio intensity associated with the first segment and the second segment.
[0021] The video processing system determines one or more video parameters from frames associated with the first segment and the second segment. The one or more video parameters include attributes of apparels worn by individuals in the associated frames, and a percentage of frames in a magnified mode in the associated frames. The genre identification system identifies a genre of the media content as a reality show when the determined text parameters, audio parameters, and video parameters adhere to a first set of rules. The first set of rules defines that the duration of music sequence is within a first threshold range, character density is less than a second threshold, duration of scrolling text is less than a third threshold range, average audio intensity is less than a fourth threshold, and percentage of frames in the magnified mode is less than a fifth threshold. The first set of rules further defines that the attributes of apparels worn by individuals in the associated frames are different, and amount of speech is less than a sixth threshold. The genre identification system identifies the genre of the media content as one of sports, television serial, movie, education, and news when the determined text parameters, audio parameters, and video parameters fail to adhere to the first set of rules. The summary generation system generates a new media file including the summary of the media content based on the identified genre of the media content. The system includes one or more of a broadcast system, an over-the-top system, and a surveillance system.

BRIEF DESCRIPTION OF DRAWINGS

[0022] These and other features, aspects, and advantages of the claimed subject matter will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
[0023] FIG. 1 illustrates a block diagram depicting an exemplary highlights generation system for automatically generating summaries of different types of media content, in accordance with aspects of the present disclosure;
[0024] FIGS. 2A-C illustrate a flow diagram depicting an exemplary method for identifying a genre of a particular media content using the highlights generation system of FIG. 1, in accordance with aspects of the present disclosure;
[0025] FIG. 3 illustrates a graphical representation depicting movement of a pixel across various frames of the particular media content, in accordance with aspects of the present disclosure;
[0026] FIGS. 4A-B illustrate a flow diagram depicting an exemplary method for generating a summary of the particular media content that belongs to one of a first set of genres using the highlights generation system of FIG. 1, in accordance with aspects of the present disclosure; and
[0027] FIGS. 5A-B illustrate a flow diagram depicting an exemplary method for generating a summary of the particular media content that belongs to one of a second set of genres using the highlights generation system of FIG. 1, in accordance with aspects of the present disclosure.

DETAILED DESCRIPTION

[0028] The following description presents an exemplary system and associated method for generating highlights of the media content for preview. Particularly, embodiments described herein disclose a highlights generation system that first identifies a genre of the media content by processing text, audio, and video data associated with only specific segments of the media content. Subsequently, the highlights generation system generates the highlights by applying custom processing steps tailored to the identified genre of the content.
[0029] As noted previously, certain conventional systems need to process the entire media content and interpret meanings of thousands of scenes using enormous amounts of computational time and resources to identify associated categories and generate a corresponding summary. In contrast, the present highlights generation system identifies the genre of the media content by processing only small segments, for example, the first and last five minutes of the media content, instead of analyzing the entire media content. Thus, the present highlights generation system quickly and efficiently identifies the genre of the media content with minimal utilization of computation resources to facilitate more accurate highlights generation. In particular, the highlights generation system employs customized processing steps specifically suited to generate highlights of different types of media content including, but not limited to, reality shows, sports programs, television (TV) serials, movies, education programs, and news programs.
[0030] It may be noted that different embodiments of the present highlights generation system may be used in many application areas. For example, when used in a broadcast system, the highlights generation system generates highlights including the best scenes of a TV serial episode, and uses the highlights for providing a quick recap of a previously telecast episode or a preview of an upcoming episode. Similarly, the highlights generation system generates highlights of upcoming segments of a reality show that is currently under broadcast, and provides the highlights prior to advertisement breaks to retain audience interest and to increase viewership for the show.
[0031] When used in an over-the-top (OTT) system, the highlights generation system generates a trailer for a movie for an OTT application that includes notable highlights to attract new audiences. The highlights generation system may also share the generated trailer across various social platforms to publicize the movie. In another OTT application, the highlights generation system generates a highlights reel showcasing exciting moments from a sports match for televised audiences. The highlights generation system may also extract song sequences from the movie or show to generate a music album or playlist that is made available on Spotify and other media platforms.
[0032] Similarly, the present highlights generation system generates highlights of online tutorials that can be used by students to revise various topics before their assessments. When used as a surveillance system, the highlights generation system processes the entire video feed captured by a surveillance camera to identify specific video segments capturing occurrences of unusual incidents such as theft and violence. The highlights generation system may similarly generate a highlight for a news program and present the generated highlight to viewers to provide a quick preview of the entire news program. Furthermore, the present highlights generation system may be used to generate highlights of interviews, speeches given by eminent personalities, or technology, entertainment, and design (TED) talks, and share the generated highlights across social platforms for user consumption. The highlights generation system can be used to automatically generate highlights of different movies, where the generated highlights are displayed as part of an OSCAR event. In addition, the highlights generation system identifies intense moments of a particular actor in a particular movie, where the identified moments are displayed as part of the OSCAR event.
[0033] When used for summarizing a video associated with a ceremony such as a wedding ceremony or a birthday ceremony, the highlights generation system may identify important moments based on greetings from people and actions including gifts provided by people, and generate the summary of the ceremony based on the moments identified as important. When used in an automotive application, the highlights generation system may be used to process the entire video feed recorded by a camera mounted to a dashboard of a vehicle, and identify specific video segments capturing occurrences of abnormal events such as accidents. When used to provide a premium media service, the highlights generation system may be used to process a pre-recorded media program that includes both the actual media program and advertisements inserted in between. The highlights generation system processes the pre-recorded media program and removes the advertisements from it to enable a user to watch advertisement-free content. Though the highlights generation system can be used in many application areas, for clarity, an embodiment of the present highlights generation system will be described herein in greater detail with reference to a media content broadcasting application.
[0034] FIG. 1 illustrates a block diagram depicting an exemplary highlights generation system (100) for automatically generating summaries of different types of media content including, but not limited to, singing reality shows, dancing reality shows, TV serials, movies, education related programs, and news programs. To that end, the highlights generation system (100) is communicatively coupled to a content server (102) via a communication link (104). Examples of the communication link (104) include a satellite-based communications system, a cable system, an over-the-top system, and the internet. Additionally, the content server (102) corresponds to a server owned by a content distributor such as a TV channel provider, an over-the-top (OTT) media content provider, or a video-on-demand (VOD) service provider.
[0035] In one embodiment, the content server (102) includes a media content storage system (106) that stores different media content for broadcast. Examples of the media content storage system (106) include a hard drive, a universal serial bus (USB) flash drive, a secure digital card, and a solid-state drive. In one embodiment, the media content storage system (106) stores the media content in one or more designated formats and one or more bitrates. Examples of the designated formats include moving picture experts group-4 (MPEG-4) part 14 format, matroska multimedia container (MKV) format, and/or audio video interleave (AVI) format.
[0036] In certain embodiments, the content server (102) further includes a media formatting system (108) that identifies media content stored in the media content storage system (106) in formats other than the designated formats. The media formatting system (108) converts that media content into one of the designated formats, for example, using the FFmpeg library. Further, the media formatting system (108) stores the media content converted into one of the designated formats back in the media content storage system (106) for broadcast and for generating summaries using the highlights generation system (100).
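Since the specification mentions FFmpeg as one way to perform the conversion, a minimal sketch of such a conversion step is shown below. The codec choices and file paths are illustrative assumptions, not prescribed by the specification:

```python
import subprocess

def convert_to_mp4(src_path, dst_path):
    """Convert a media file into MPEG-4 Part 14 using the ffmpeg command-line tool.
    Re-encoding flags are illustrative; practical codec choices depend on the source."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path,   # -y: overwrite destination if it exists
         "-c:v", "libx264",                # re-encode video to H.264
         "-c:a", "aac",                    # re-encode audio to AAC
         dst_path],
        check=True,
    )

# Example (hypothetical paths):
# convert_to_mp4("/media/raw/episode_12.mkv", "/media/formatted/episode_12.mp4")
```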
[0037] As noted previously, conventional highlights generation systems such as the system described in the US patent US10455297B1 require significant amounts of computation resources to understand and interpret meaning of the thousands of various scenes in the video in order to generate highlights. Conventionally, such systems employ metadata information and/or extensive video processing to generate highlights from a particular type of media content, for example, a movie or a TV show. However, such systems would fail to generate highlights for media content of different genres such as a news program or an education program, as these may lack specific categorizations such as violence or romance. Therefore, such systems cannot be used universally to generate highlights for all different types of genres of media content as every genre has its own peculiar characteristics.
[0038] The present highlights generation system (100) addresses such specific peculiarities in media content by first identifying the specific genre of the input media content before applying custom processing tailored for that specific genre to generate highlights of greater relevance. Particularly, the present highlights generation system (100) is capable of identifying different types of genres such as a singing reality show, a dancing reality show, a sports program, a TV serial, an action movie, a drama movie, an education related program, and a news program. Unlike conventional systems that require significant processing of the entire media content, the highlights generation system (100) automatically identifies genres of media content by processing text, audio, and video data associated with small segments selected from opening and closing segments of the media content, thus needing significantly fewer processing and storage resources.
[0039] To that end, in one embodiment, the highlights generation system (100) receives a particular media file (109), including a video to be summarized along with audio associated with the video, as an input from the content server (102) via the communication link (104). Upon receiving the media file (109), the highlights generation system (100) selects a first segment of a first length from an opening segment of the media file (109), and further selects a second segment of a second length from a closing segment of the media file (109). For example, the highlights generation system (100) selects the first segment that corresponds to first five minutes of the media file (109), and further selects the second segment that corresponds to last five minutes of the media file (109).
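For illustration, extracting an opening and a closing segment of this kind could be done with a simple FFmpeg cut, as in the sketch below. The five-minute lengths, file names, and copy-mode flags are assumptions; exact frame-accurate cuts may require re-encoding:

```python
import subprocess

def extract_segment(src_path, dst_path, start_s, length_s):
    """Cut a segment of length_s seconds starting at start_s using the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_s), "-i", src_path,
         "-t", str(length_s), "-c", "copy",   # stream copy: fast, no re-encoding
         dst_path],
        check=True,
    )

# First and last five minutes (total duration assumed known, e.g. from ffprobe):
# extract_segment("episode.mp4", "opening.mp4", 0, 300)
# extract_segment("episode.mp4", "closing.mp4", total_duration_s - 300, 300)
```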
[0040] Subsequently, the highlights generation system (100) inputs the selected first and second segments to an associated text processing system (110), an audio processing system (112), and a video processing system (114), which respectively determine text, audio, and video parameters from the first and second segments. Further, a genre identification system (116) in the highlights generation system (100) identifies or predicts a genre of the video with certain confidence based on the determined text, audio, and video parameters. In one embodiment, the genre identification system (116) identifies the genre of the video as one of a singing or dancing reality show, a TV serial, a movie, an education related program, and a news program based on the determined text, audio, and video parameters, as described in detail with reference to FIGS. 2A-C.
[0041] When the identified genre of the video is one of the TV serial, movie, education related program, and news program, the highlights generation system (100) generates a summary of the video using custom steps specifically suited to the identified genre. To that end, the highlights generation system (100) configures values associated with a set of configuration parameters predefined for each specific genre for dividing the video into a plurality of shots. In certain embodiments, the term “shot,” as used herein throughout the various embodiments of the present disclosure, refers to a series of interrelated pictures taken consecutively using one or more cameras and represents a continuous action in time and space. In one embodiment, the video is divided into the plurality of shots based on values of the configuration parameters that are predefined for the identified genre of the video. Examples of the configuration parameters include a confidence threshold and a minimum shot duration.
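Purely as an illustration of what such genre-specific configuration might look like, a small lookup table is sketched below. The numeric values are invented placeholders; the actual per-genre values would reside in the highlights generation system's database:

```python
# Illustrative values only, not taken from the specification.
GENRE_CONFIG = {
    "television serial": {"min_shot_duration": 4.0, "confidence_threshold": 0.6},
    "movie":             {"min_shot_duration": 6.0, "confidence_threshold": 0.7},
    "news":              {"min_shot_duration": 3.0, "confidence_threshold": 0.5},
    "education":         {"min_shot_duration": 8.0, "confidence_threshold": 0.5},
    "sports":            {"min_shot_duration": 2.0, "confidence_threshold": 0.8},
}

def config_for(genre):
    """Return the shot-division parameters predefined for the identified genre."""
    return GENRE_CONFIG[genre]
```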
[0042] Upon dividing the video into the plurality of shots, the audio processing system (112) extracts the audio associated with the plurality of shots from the media file (109) and processes the extracted audio to filter and identify a subset of shots that are candidates for generating a summary of the video, as described in detail with reference to FIGS. 4A-B. The candidate shots identified by the audio processing system (112) include occurrences of the most significant, exciting, and/or interesting moments in the video. Further, the highlights generation system (100) ranks the candidate shots based on their audio intensity level and emotions of a protagonist in the shots identified using the video processing system (114). The highlights generation system (100) then generates the summary of the video based on ranks assigned to the candidate shots.
[0043] When the identified genre of the video is a singing or dancing reality show, the audio processing system (112) and video processing system (114) process audio and video data associated with the reality show to extract individual performances from the reality show, as described in detail with reference to FIGS. 5A-B. In one embodiment, the highlights generation system (100) includes a performance ranking system (120) that ranks the extracted performances based on one or more of emotions of judges of the reality show and standing ovation from the judges identified using the video processing system (114). The highlights generation system (100) then generates a summary of the reality show based on the performances that are ranked highly by the performance ranking system (120).
[0044] In one embodiment, the highlights generation system (100) and associated text processing, audio processing, video processing, genre identification, summary generation, and performance ranking systems (110, 112, 114, 116, 118, and 120) may be implemented by suitable code on a processor-based system, such as a general-purpose or a special-purpose computer. Accordingly, the highlights generation system (100) and associated text processing, audio processing, video processing, genre identification, summary generation, and performance ranking systems (110, 112, 114, 116, 118, 120), for example, include one or more general-purpose processors, specialized processors, graphical processing units, microprocessors, programmable logic arrays, field programmable gate arrays, integrated circuits, systems on chips, and/or other suitable computing devices.
[0045] In certain embodiments, the highlights generation system (100) transmits the generated summary of the video to a content management system (122) via the communication link (104). In one embodiment, the content management system (122) is another content server owned by the content distributor such as the TV channel provider, OTT service provider, or VOD service provider. The content management system (122) stores the summary of the video in a highlights storage system (124). Examples of the highlights storage system (124) include storage devices such as a hard drive, a universal serial bus (USB) flash drive, a secure digital card, and a solid-state drive.
[0046] Further, the content management system (122) shares or uploads the summary of the video to various platforms including a content delivery server, social media platforms (126A-N), OTT platforms (128A-N), and to user devices (130A-N) via the communication link (104) for providing users with a preview of the video. In certain embodiments, the highlights generation system (100) identifies a genre of a particular media content prior to generating a summary of the particular media content, as noted previously. An exemplary methodology by which the highlights generation system (100) identifies a genre of the particular media content is described in detail with reference to FIGS. 2A-C.
[0047] FIGS. 2A-C illustrate a flow diagram depicting an exemplary method (200) for identifying a genre of a particular media content using the highlights generation system (100) of FIG. 1. The order in which the exemplary method (200) is described is not intended to be construed as a limitation, and any number of the described blocks may be combined in any order to implement the exemplary method disclosed herein, or an equivalent alternative method. Additionally, certain blocks may be deleted from the exemplary method or augmented by additional blocks with added functionality without departing from the claimed scope of the subject matter described herein.
[0048] At step (202), the highlights generation system (100) receives the media file (109) that includes a video to be summarized and audio associated with the video from the content server (102) via the communication link (104). At step (204), the highlights generation system (100) selects a first segment of a first length from an opening segment of the media file (109) and a second segment of a second length from a closing segment of the media file (109). For example, the highlights generation system (100) selects the first segment that corresponds to first five minutes of the media file (109), and further selects the second segment that corresponds to last five minutes of the media file (109).
[0049] Generally, the first segment corresponding to first five minutes and the second segment corresponding to last five minutes of media content include peculiar characteristics that vary from one genre to another genre. For example, the first five and last five minutes of media content belonging to a genre ‘movie’ generally include a substantial amount of scrolling texts. In another example, the first five and last five minutes of media content belonging to a genre ‘sports’ generally include a substantial amount of shots captured in a magnified mode. Similarly, the first five and last five minutes of media content belonging to other genres such as ‘reality show’, ‘TV serial’, ‘education program’, and ‘news program’ have their own peculiar characteristics. Identifying these peculiar characteristics from the first and second segments corresponding to the first five and last five minutes of the media file (109) would assist in identifying the genre of the video contained in the media file (109).
[0050] Accordingly, at step (206), the text processing system (110) determines one or more text parameters including a character density from the first and second segments selected from the media file (109). In certain embodiments, the text processing system (110) implements an optical character recognition (OCR) algorithm for identifying the number of characters in each of the frames in the first and second segments. An example of the OCR algorithm used by the text processing system (110) includes EasyOCR, which is capable of recognizing characters associated with multiple languages and accurately counting the number of characters in each frame of the first and second segments. Further, the text processing system (110) determines an average number of characters per frame in the first and second segments, where the determined average corresponds to the character density.
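By way of illustration only, a minimal sketch of the character-density computation using the EasyOCR library is given below; the function name and the assumption that frames are supplied as OpenCV/numpy images are illustrative.

```python
# Minimal sketch of character-density estimation with EasyOCR, assuming frames
# sampled from the first and second segments are provided as numpy image arrays.
import easyocr

reader = easyocr.Reader(['en'])  # additional languages can be listed here

def character_density(frames):
    """Return the average number of recognized characters per frame."""
    total_chars = 0
    for frame in frames:
        # readtext returns a list of (bounding_box, text, confidence) tuples
        for _, text, _ in reader.readtext(frame):
            total_chars += len(text)
    return total_chars / max(len(frames), 1)
```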
[0051] Generally, media content such as movies would have substantial amount of text in their first and last five minutes when compared to other types of media content such as reality shows, sports programs, TV serials, and education related programs. Hence, identifying the character density, for example, from the first and last five minutes of a media content would indicate whether the media content is a movie or is of another program type.
[0052] At step (208), the text processing system (110) determines a duration of scrolling text from the first and second segments of the media file (109). Generally, a media content such as a movie includes opening and closing credits sections that have scrolling texts displaying the title and names of various key production and cast members. Similarly, a media content such as a news program generally includes a news ticker that has scrolling texts dedicated to present headlines or other news items. In certain embodiments, the text processing system (110) uses the OCR algorithm to process the first and second segments of the media file (109) to identify if the first and second segments include such scrolling texts. When the first and second segments are identified to have the scrolling texts, the text processing system (110) identifies a duration of the scrolling texts using the OCR algorithm.
[0053] At step (210), the audio processing system (112) determines one or more audio parameters including an average audio intensity associated with the first and second segments selected from the media file (109). For example, the audio processing system (112) extracts a first audio from the first segment of the media file (109). Similarly, the audio processing system (112) extracts a second audio from the second segment of the media file (109). Further, the audio processing system (112) divides the first audio into 300 acoustic frames, and the second audio into another 300 acoustic frames. The audio processing system (112) then determines an audio intensity for the first acoustic frame, for example, based on a sum of square of amplitude of the first acoustic frame in the time domain. Similarly, the audio processing system (112) determines audio intensities of all other 599 acoustic frames, and determines the average audio intensity by averaging audio intensities determined for all 600 acoustic frames.
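The following is a minimal sketch of this computation, assuming the audio of a segment has already been loaded as a mono numpy array of samples (for example with librosa or soundfile); the function name and the helper comments are illustrative only.

```python
# Minimal sketch of the average audio intensity computation.
import numpy as np

def average_audio_intensity(samples, n_frames=300):
    """Split the samples into n_frames acoustic frames and average the per-frame
    intensity, where intensity is the sum of squared amplitudes in the time domain."""
    frames = np.array_split(samples, n_frames)
    intensities = [float(np.sum(np.square(f))) for f in frames]
    return float(np.mean(intensities))

# The overall average over all 600 acoustic frames of both segments could then be:
# avg = np.mean([average_audio_intensity(first_audio), average_audio_intensity(second_audio)])
```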
[0054] At step (212), the audio processing system (112) determines a duration of music sequence in the first and second segments of the media file (109). Specifically, the audio processing system (112) uses, for example, the inaSpeechSegmenter or WebRTC algorithm to identify a first duration of music sequence in the first segment, and to identify a second duration of music sequence in the second segment. The audio processing system (112) then determines an average of the first duration and the second duration, where the determined average corresponds to the duration of music sequence in the first and second segments of the media file (109).
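For illustration only, a minimal sketch of music-duration estimation with the inaSpeechSegmenter package follows; the file names and the averaging of the two segment durations mirror the example above, while the helper function name is an assumption.

```python
# Minimal sketch of music-duration estimation with inaSpeechSegmenter, assuming the
# first and second segments have been exported to separate audio files.
from inaSpeechSegmenter import Segmenter

segmenter = Segmenter()

def music_duration(path):
    """Sum the lengths of all sections labelled 'music' in the given file."""
    # the segmenter returns a list of (label, start_seconds, end_seconds) tuples
    return sum(end - start
               for label, start, end in segmenter(path)
               if label == 'music')

duration_of_music_sequence = (music_duration("first_segment.wav")
                              + music_duration("second_segment.wav")) / 2
```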
[0055] Generally, the first and second segments, for example, the first and last five minutes of a singing or dancing reality show include one or more contestants singing or dancing, respectively, with a song in the background. Hence, the first and last five minutes of the singing or dancing reality show naturally include more music when compared to the music present in the first and last five minutes of other types of media content such as a sports program, a movie, a TV serial, an education program, and a news program. Therefore, determining the duration of musical sequences in the first and last few minutes of a particular media content would indicate whether the particular media content is a reality show.
[0056] At step (214), the audio processing system (112) determines an amount of speech, for example including monologues and dialogues, in the first and second segments of the media file (109). To that end, the audio processing system (112) includes a machine learning system that identifies start and end times of different individual speech sections in the first and second segments. Examples of the machine learning system include WebRTC and inaSpeechSegmenter. For example, the machine learning system processes audio extracted from the first segment of the media file (109), and identifies that a first speech section starts at the 60th second and ends at the 68th second of the media file (109). In this example, the audio processing system (112) determines a length of the first speech section as 8 seconds.
[0057] Similarly, the audio processing system (112) determines lengths of other speech sections in the first segment based on their corresponding start time and end time identified by the machine learning system. In addition, the audio processing system (112) determines lengths of all speech sections in the second segment of the media file (109). Further, the audio processing system (112) determines a sum of lengths of all speech sections in the first and second segments. The audio processing system (112) then determines the amount of speech in the first and second segments, for example, by computing a ratio between the sum of lengths of all speech sections and a total length of the first and second segments and by multiplying the ratio with hundred.
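A minimal sketch of this percentage computation is given below, assuming the speech sections have already been located (for example by WebRTC VAD or inaSpeechSegmenter) as (start, end) tuples; the function name is illustrative.

```python
# Minimal sketch of the amount-of-speech computation from located speech sections.
def amount_of_speech(speech_sections, total_segment_length):
    """Percentage of the first and second segments that contains speech."""
    speech_length = sum(end - start for start, end in speech_sections)
    return 100.0 * speech_length / total_segment_length

# e.g. a single 8-second speech section (60 s to 68 s) within ten minutes of segments:
# amount_of_speech([(60, 68)], 600) -> approximately 1.3 percent
```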
[0058] Generally, media content such as TV serials, drama movies, comedy movies, cartoon movies, educational videos, and news videos include substantial amount of speech in their first and last five minutes when compared to speech in the first and last five minutes of other types of media content such as a sports program. Hence, determining that the amount of speech in the first and second segments of the media file (109) is above a particular threshold would indicate that the genre of the video in the media file (109) is one of a TV serial, a drama movie, a comedy movie, a cartoon movie, an educational video, and a news video.
[0059] At step (216), the audio processing system (112) determines a duration of monologue in the first and second segments of the media file (109). In one embodiment, the audio processing system (112) extracts audio from the first and second segments of the media file (109). The audio processing system (112) then processes the extracted audio and identifies a duration of the extracted audio that includes voice of only a particular person using a machine learning system, for example, using inaSpeechSegmenter. Subsequently, the audio processing system (112) determines the duration of monologue, for example, by computing a ratio between the identified duration and a total duration of the first and second segments and by multiplying the computed ratio with hundred.
[0060] Generally, media content such as an educational video or a news video includes a particular person speaking throughout various segments of the video. Hence, determining that the duration of monologue in the first and second segments of the media file (109) is above a particular threshold would indicate that the genre of the video in the media file (109) is one of an educational video and a news video.
[0061] At step (218), the audio processing system (112) determines an average level of noise across various segments including the first segment, second segment, and other segments of the media file (109). In one embodiment, the average level of noise determined by the audio processing system (112) refers to an average of audio intensities of various segments including the first segment, second segment, and other segments of the media file (109). In certain embodiments, the audio processing system (112) uses a machine learning system such as a Deep Xi or inaSpeechSegmenter to identify the average level of noise across multiple segments of the media file (109). Generally, media content such as an action movie includes substantial amount of noise when compared to noise present in other types of media content such as reality shows, sports programs, TV serials, drama movies, educational videos, and news videos. Hence, determining the average level of noise across multiple segments of the media file (109) would indicate whether the genre of the video in the media file (109) is an action movie or is of another program type.
[0062] At step (220), the video processing system (114) determines one or more video parameters including a percentage of subset of frames in the first and second segments of the media file (109) that include a human face. In one embodiment, the video processing system (114) includes one or more machine learning systems that detect and recognize the human face present in each frame associated with the first and second segments of the media file (109). Examples of the machine learning systems include one or more of multi-task cascaded convolutional networks (MTCNN), DLib, and FaceNet. Further, the machine learning systems identify the subset of frames that include the human face from the frames associated with the first and second segments of the media file (109). The video processing system (114) then determines a ratio between a total number of the subset of frames and a total number of frames in the first and second segments, and multiplies the ratio by hundred.
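For illustration only, a minimal sketch of this face-presence percentage using the MTCNN detector from the facenet-pytorch package follows; the function name and the assumption that frames are supplied as PIL images are illustrative.

```python
# Minimal sketch of the face-presence computation with an MTCNN face detector.
from facenet_pytorch import MTCNN

detector = MTCNN(keep_all=True)

def percentage_of_frames_with_faces(frames):
    """Percentage of sampled frames that contain at least one detected face."""
    with_face = 0
    for frame in frames:
        boxes, _ = detector.detect(frame)  # boxes is None when no face is found
        if boxes is not None and len(boxes) > 0:
            with_face += 1
    return 100.0 * with_face / max(len(frames), 1)
```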
[0063] Generally, an educational video or a news video has the same person appearing in most segments of the video. Hence, performing facial recognition and determining that a substantial part of the video has a facial image of the same person would indicate that a type of the video is the educational video or the news video.
[0064] At step (222), the video processing system (114) determines attributes of apparels worn by individuals in the frames associated with the first and second segments of the media file (109). In one embodiment, the video processing system (114) includes a machine learning system, for example, YoloV3, that identifies individuals present in the frames associated with the first and second segments of the media file (109). In addition, the video processing system (114) includes another machine learning system, for example, an OpenCV-based system, that identifies colors of apparels worn by the identified individuals. Generally, in a sports match such as a football match or a cricket match, the players wear apparels such as jerseys that are of the same color. Hence, determining that the attributes of apparels worn by different individuals appearing in different segments of a particular media content are the same would indicate that the particular media content corresponds to a sports video. Accordingly, the video processing system (114) identifies a potential genre of the video in the media file (109) as sports when the machine learning system identifies that the colors of the apparels worn by all or a significant proportion of the identified individuals are the same.
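By way of illustration only, the sketch below shows one possible dominant-colour step with OpenCV k-means; it assumes person bounding boxes have already been produced by a detector such as YOLOv3, and the function name and the choice of three clusters are assumptions.

```python
# Minimal sketch of apparel colour extraction inside a detected person box.
import cv2
import numpy as np

def dominant_color(frame, box):
    """Return the dominant BGR colour inside a person bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    crop = frame[y1:y2, x1:x2]
    pixels = crop.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, 3, None, criteria, 5, cv2.KMEANS_RANDOM_CENTERS)
    # the cluster containing the most pixels approximates the apparel colour
    dominant = centers[np.bincount(labels.flatten()).argmax()]
    return dominant.astype(int)

# Comparing dominant colours across detected individuals would then indicate whether
# the apparel attributes are the same (sports) or different (other genres).
```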
[0065] At step (224), the video processing system (114) determines a percentage of frames in a magnified mode in the first and second segments of the media file (109). In one embodiment, the video processing system (114) uses, for example, the TransNet V2 or FFProbe algorithm to identify a subset of frames in the first and second segments of the media file (109) that are in a zoom-in mode or a zoom-out mode. The video processing system (114) then determines the percentage of frames in the magnified mode by computing a ratio between the number of frames in the identified subset and a total number of frames in the first and second segments, and by multiplying the ratio with hundred.
[0066] Generally, a number of frames in the magnified mode is higher for a sports video when compared to a number of frames in the magnified mode for other types of media content including a singing or dancing reality show, a TV serial, a movie, an educational video, and a news video. Hence, determining the number of frames in the magnified mode in a media content would indicate whether the media content is a sports video or is of another program type.
[0067] At step (226), the video processing system (114) determines a total number of users in the first and second segments of the media file (109). In one embodiment, the video processing system (114) uses a machine learning system, for example, a YoloV3 to identify unique users in frames of the first segment and in frames of the second segment. The video processing system (114) then determines a total number of users in the first and second segments by adding a total number of unique users identified from the first segment and a total number of unique users identified from the second segment.
[0068] Generally, a sports video captures audiences cheering from a gallery. Hence, the sports video naturally includes a higher number of unique users when compared to a number of unique users present in an educational video, a news video, a TV serial, and a movie. Hence, determining the total number of users in the first and second segments would indicate whether the media content is a sports video or is of another program type.
[0069] At step (228), the video processing system (114) determines an offset distance that indicates a distance by which a pixel has moved from one frame to another frame in the first or second segment of the media file (109). For example, the video processing system (114) identifies that a location of a particular pixel (302) in a first frame (304) selected from the first segment of the media file (109) is (X1, Y1), as depicted in FIG. 3. Further, the video processing system (114) identifies that the pixel (302) has moved from the location (X1, Y1) and is currently positioned at a location (X2, Y2) in a second frame (306) associated with the first segment of the media file (109). In this example, the video processing system (114) determines a horizontal distance (308) between the X1 coordinate of the pixel (302) in the first frame (304) and the X2 coordinate of the pixel (302) in the second frame (306). Similarly, the video processing system (114) determines a vertical distance (310) between the Y1 coordinate of the pixel (302) in the first frame (304) and the Y2 coordinate of the pixel (302) in the second frame (306). Subsequently, the video processing system (114) determines the offset distance, for example, using equation (1):

OD = √((X2 - X1)^2 + (Y2 - Y1)^2)        (1)

where, ‘OD’ corresponds to an offset distance that indicates a distance by which a pixel has moved from one frame to another frame.

[0070] In certain embodiments, the video processing system (114) identifies the determined offset distance as high when a value of the determined offset distance is equal to or above a designated threshold, for example, 40. Generally, the offset distance, indicating the distance by which a pixel has moved from one frame to another frame, is above 40 for media content such as a dance reality show, an action movie, and a sports program. However, the offset distance would be less than 40 for other types of media content such as a singing reality show and a drama movie. Hence, determining that the offset distance is above 40 for a particular media content would indicate that the particular media content corresponds to one of a dance reality show, an action movie, and a sports program.
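For illustration only, the sketch below evaluates equation (1) for a pixel between two frames; approximating the pixel's motion with dense Farneback optical flow is an assumption of this sketch (the specification only requires the displacement between frames), and grayscale frames are assumed.

```python
# Minimal sketch of the offset-distance computation of equation (1).
import cv2
import numpy as np

def offset_distance(prev_gray, next_gray, x1, y1):
    """Offset distance of the pixel at (x1, y1) between two grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[y1, x1]            # estimated displacement of the tracked pixel
    x2, y2 = x1 + dx, y1 + dy
    return np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)  # equation (1)

# an offset distance of 40 or more would be treated as 'high' motion, per the text
```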
[0071] At step (230), the genre identification system (116) identifies a genre of the video in the media file (109) as a reality show when a set of determined text, audio, and video parameters adhere to a first set of rules.
[0072] Table 1 – Rules associated with identification of genres of a video

PAR     DRS        SRS        Sports     TVS        Movie      AM         DM         EDU        NEWS
CD      <100       <100       <100       <100       >100       >100       >100       <100       >100
DOST    <30-120    <30-120    <30-120    <30-120    =30-120    =30-120    =30-120    <30-120    =30-120
DOMS    =150-240   =150-240   <150       <150       <150       -          -          <150       -
OD      >40        <40        -          -          -          >40        <40        -          -
AAI     <120       <120       -          -          -          -          -          -          -
AOS     <80%       <80%       <80%       >80%       <80%       -          >80%       >80%       >80%
PFMM    <30%       <30%       >30%       <30%       <30%       -          -          <30%       -
AOA     D          D          S          D          D          -          -          D          -
NOAS    -          -          -          -          -          ≥8         -          -          -
LON     -          -          -          -          -          ≥90%       <90%       -          -
FIPP    -          -          -          -          -          -          -          ≥90%       ≥40%
DOM     -          -          -          -          -          -          -          >80%       >80%
TNU     -          -          >20        =2-20      =2-20      -          -          -          -

[0073] In Table 1, ‘PAR’ corresponds to text, audio, and video parameters, ‘DRS’ corresponds to a dancing reality show, ‘SRS’ corresponds to a singing reality show, ‘TVS’ corresponds to a TV serial, ‘AM’ corresponds to an action movie, ‘DM’ corresponds to a drama movie, ‘EDU’ corresponds to an educational video, and ‘NEWS’ corresponds to a news video. Further, ‘CD’ corresponds to a character density, ‘DOST’ corresponds to a duration of scrolling text, ‘DOMS’ corresponds to a duration of music sequence, ‘OD’ corresponds to an offset distance, ‘AAI’ corresponds to an average audio intensity, and ‘AOS’ corresponds to an amount of speech including a monologue and a dialogue in the first and second segments of a media file. In addition, ‘PFMM’ corresponds to a percentage of frames in the magnified mode, ‘AOA’ corresponds to attributes of apparels, ‘NOAS’ corresponds to a number of action sequences, and ‘LON’ corresponds to a level of noise. Furthermore, ‘FIPP’ corresponds to a percentage of frames in the first and second segments of a media file that include a facial image of a particular person, ‘DOM’ corresponds to a duration of monologue, ‘S’ corresponds to an indication that attributes of apparels are all same, and ‘D’ corresponds to another indication that attributes of apparels are all different. ‘TNU’ corresponds to a total number of users in the first and second segments.
[0074] As noted previously, the genre identification system (116) identifies the genre of the video as a reality show when the associated duration of music sequence is in between an exemplary first threshold range of 150 to 240 seconds, character density is lesser than an exemplary second threshold of 100, duration of scrolling text is lesser than an exemplary third threshold range of 30 to 120 seconds, and average audio intensity is lesser than an exemplary fourth threshold of 120. Additionally, the genre identification system (116) identifies the genre of the video as a reality show only when the associated percentage of frames in the magnified mode is lesser than an exemplary fifth threshold of 30, attributes of apparels worn by individuals are all different, and amount of speech in the first and second segments of the video is lesser than an exemplary sixth threshold of 80.
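A minimal sketch of this first set of rules follows, for illustration only; the parameter key names are assumptions, the exemplary thresholds mirror the values above, and "lesser than the 30 to 120 second range" is interpreted here as a scrolling-text duration below 30 seconds.

```python
# Minimal sketch of the first set of rules (Table 1) for identifying a reality show.
def is_reality_show(p):
    """p is a dict of determined text, audio, and video parameters."""
    return (150 <= p["duration_of_music_sequence"] <= 240   # first threshold range
            and p["character_density"] < 100                 # second threshold
            and p["duration_of_scrolling_text"] < 30         # below the third threshold range
            and p["average_audio_intensity"] < 120           # fourth threshold
            and p["percent_frames_magnified"] < 30           # fifth threshold
            and p["apparel_attributes_all_different"]        # 'D' in Table 1
            and p["amount_of_speech"] < 80)                  # sixth threshold
```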
[0075] In certain embodiments, at step (232), the genre identification system (116) identifies a subgenre of the video in the media file (109) as one of a dancing reality show and a singing reality show when the genre of the video is identified to be a reality show. Specifically, the genre identification system (116) identifies the subgenre as a dancing reality show or a singing reality show based on an offset distance between a location of a pixel in a first video frame and a location of the same pixel in a second video frame.
[0076] Generally, a first performance in the dancing reality show starts within the first five minutes of the show. Further, a speed at which the pixel moves from one frame to another frame when the dancing performance is ongoing would be higher, and accordingly, the offset distance would also be higher for the dancing reality show. However, a speed at which the pixel moves from one frame to another frame when a singing performance is ongoing would be comparatively lesser, and accordingly, the offset distance would also be lesser for the singing reality show. Accordingly, the genre identification system (116) identifies the subgenre of the video in the media file (109) as the dancing reality show when the associated offset distance is equal to or above, for example, 40. Alternatively, the genre identification system (116) identifies the subgenre of the video in the media file (109) as the singing reality show when the associated offset distance is lesser than 40.
[0077] At step (234), the genre identification system (116) identifies a genre of the video in the media file (109) as sports when the determined text, audio, and video parameters adhere to a second set of rules. Specifically, the genre identification system (116) identifies the genre of the video as sports when the percentage of frames in the magnified mode is above 30, attributes of apparels worn by individuals are all same, duration of music sequence is lesser than 150 to 240 seconds, and character density is lesser than 100. Additionally, the genre identification system (116) identifies the genre of the video as sports only when the associated duration of scrolling text is lesser than 30 to 120 seconds.
[0078] At step (236), the genre identification system (116) identifies a genre of the video as a TV serial when the determined text, audio, and video parameters adhere to a third set of rules. Specifically, the genre identification system (116) identifies the genre of the video as a TV serial when the amount of speech in the first and second segments of the media file (109) is above 80, duration of music sequence is lesser than 150 to 240 seconds, and character density is lesser than 100. Additionally, the genre identification system (116) identifies the genre of the video as the TV serial only when the associated duration of scrolling text is lesser than 30 to 120 seconds, percentage of frames in the magnified mode is lesser than 30, and attributes of apparels worn by individuals are all different.
[0079] At step (238), the genre identification system (116) identifies a genre of the video as a movie when the determined text, audio, and video parameters adhere to a fourth set of rules. Specifically, the genre identification system (116) identifies the genre of the video as the movie when the associated character density is above 100, duration of scrolling text is in between 30 to 120 seconds, duration of music sequence is lesser than 150 to 240 seconds, and percentage of frames in the magnified mode is lesser than 30. Additionally, the genre identification system (116) identifies the genre of the video as the movie only when the attributes of apparels worn by individuals are all different, and amount of speech in the first and second segments of the media file (109) is lesser than 80.
[0080] In certain embodiments, at step (240), the genre identification system (116) further identifies a subgenre of the video as one of an action movie and a drama movie. Specifically, upon identifying the genre of the video to be a movie, the highlights generation system (100) divides the media file (109) into a set of segments, where each segment is, for example, of 10-minute length. Subsequently, the highlights generation system (100) selects a subset of segments from the set of segments, including every alternate 10-minute segment of the media file (109), for processing related audio and video data.
[0081] Specifically, the audio processing system (112) extracts audio associated with the selected segments from the media file (109). The audio processing system (112) then determines the amount of speech including a monologue and a dialogue in the selected segments using the machine learning system, as noted previously. In addition, the audio processing system (112) determines the level of noise in the selected segments using the machine learning system such as Deep Xi or inaSpeechSegmenter.
[0082] In addition, the video processing system (114) determines the offset distance indicating a distance by which a pixel has moved across frames in one or more of the selected segments of the media file (109). The video processing system (114) then identifies the determined offset distance to be high when a value of the determined offset distance is equivalent to or above an exemplary distance threshold of 40. The video processing system (114) identifies the determined offset distance to be low when a value of the determined offset distance is below 40. In one embodiment, the video processing system (114) tracks movement of a set of pixels across multiple video frames. However, for the sake of simplicity, tracking of movement of only one pixel by the video processing system (114) is described herein. Further, the video processing system (114) randomly selects the pixel whose movement across multiple video frames has to be tracked. Additionally, the video processing system (114) recognizes the selected pixel in various video frames based on an associated pixel intensity value.
[0083] Furthermore, the video processing system (114) determines a number of action sequences in the video in the media file (109). Specifically, the video processing system (114) identifies if the first segment in the selected segments of the media file (109) includes a sequence of frames with high frame-to-frame pixel difference. For example, the video processing system (114) identifies if an offset distance between corresponding locations of a pixel in the first and second frames is above an exemplary particular threshold of 40, the offset distance between corresponding locations of the pixel in the second and third frames is above 40, and so on. The video processing system (114) then identifies that the first segment of the media file (109) includes an action sequence when the identified offset distance is continuously above 40 across a sequence of frames in the first segment. Similarly, the video processing system (114) identifies presence of action sequences in other segments in the selected segments of the media file (109). Further, the video processing system (114) identifies the subgenre of the video as an action movie, for example, when a number of action sequences identified from segments of the media file (109) is equivalent to or above an exemplary particular threshold of 8.
[0084] In certain embodiments, the genre identification system (116) identifies the subgenre of the video as an action movie when the level of noise in the selected segments of the media file (109) is above an exemplary noise threshold of 90, number of action sequences in the selected segments is equivalent to or above 8, and offset distance of the pixel is above 40. Alternatively, the genre identification system (116) identifies the subgenre of the video as a drama movie when the level of noise in the selected segments of the media file (109) is lesser than 90, amount of speech in the selected segments is above an exemplary speech threshold of 80, and offset distance of the pixel is below 40.
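For illustration only, a minimal sketch of the action/drama decision follows; it assumes per-frame offset distances, noise level, and amount of speech have been computed as described above, and the minimum run length used to confirm an action sequence is an assumption since the text does not specify one.

```python
# Minimal sketch of action-sequence counting and the action/drama subgenre rules.
def count_action_sequences(offset_distances, threshold=40, min_run=2):
    """Count runs of consecutive frames whose offset distance stays above threshold."""
    sequences, run = 0, 0
    for d in offset_distances:
        run = run + 1 if d > threshold else 0
        if run == min_run:          # a new high-motion run has just been confirmed
            sequences += 1
    return sequences

def movie_subgenre(noise_level, num_action_sequences, offset_high, amount_of_speech):
    if noise_level > 90 and num_action_sequences >= 8 and offset_high:
        return "action movie"
    if noise_level < 90 and amount_of_speech > 80 and not offset_high:
        return "drama movie"
    return "undetermined"
```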
[0085] At step (242), the genre identification system (116) identifies the genre of the video as an educational video when the determined text, audio, and video parameters adhere to a fifth set of rules. Specifically, the genre identification system (116) identifies the genre of the video as the educational video when the amount of speech in the first and second segments is above 80, the percentage of frames in the first and second segments that include a facial image of a particular person is above an exemplary eighth threshold of 90, and duration of monologue is above 80. Additionally, the genre identification system (116) identifies the genre as the educational video when the duration of music sequence is lesser than 150-240 seconds, character density is lesser than 100, duration of scrolling text is lesser than 30-120 seconds, percentage of frames in the magnified mode is lesser than 30, and attributes of apparels worn by individuals are different.
[0086] At step (244), the genre identification system (116) identifies the genre of the video as a news video when the determined text, audio, and video parameters adhere to a sixth set of rules. Specifically, the genre identification system (116) identifies the genre of the video as the news video when the character density identified from the first and second segments is above 100, and duration of scrolling text is in between 30 to 120 seconds. In addition, the genre identification system (116) identifies the genre of the video as the news video only when the amount of speech is above 80, the percentage of frames in the first and second segments that include a facial image of a particular person is less than 90, and duration of monologue is above an exemplary ninth threshold of 80.
[0087] In certain embodiments, post identifying the genre of the video in the media file (109), the highlights generation system (100) generates a highlights or summary of the video by applying custom processing steps tailored to the identified genre of the video. An example of the custom processing steps is depicted and described with reference to FIGS. 4A-B when the identified genre of the video is one of TV serial, sports, action movie, drama movie, education, and news. Alternatively, the highlights generation system (100) generates a summary of the video by applying an alternative set of custom processing steps, as depicted and described with reference to FIGS. 5A-B, when the identified genre of the video is one of a singing reality show and a dancing reality show.
[0088] FIGS. 4A-B illustrate a flow diagram depicting an exemplary method (400) for generating a summary of a video, where the associated genre of the video corresponds to one of sports, TV serial, movie, education, and news. Specifically, at step (402), the highlights generation system (100) receives the media file (109) that includes the video to be summarized and the audio associated with the video from the content server (102). At step (404), the genre identification system (116) identifies that a genre of the video in the media file (109) is one of sports, TV serial, movie, education, and news, for example, using the method described previously with reference to FIGS. 2A-C.
[0089] At step (406), the video processing system (114) identifies a plurality of shots in the video with a corresponding confidence value for each of the shots. Specifically, the video processing system (114) implements a machine learning system that identifies the shots in the video with the corresponding confidence value for each of the shots. Examples of the machine learning system used for identifying the shots and corresponding confidence values include Transnet V2, ffprobe, shotdetect, and pyscenedetect. In one embodiment, the confidence value identified by the machine learning system varies between the range of 0 to 1. For example, a particular segment in the video identified as a shot with the confidence value of ‘1’ indicates that the particular segment is actually one of the shots in the video. In yet another example, the particular segment in the video identified as a shot with the confidence value of ‘0.5’ indicates that there is only 50% possibility for the particular segment to be one of the shots in the video.
[0090] In one embodiment, the machine learning system in the video processing system (114) identifies the confidence values of the shots in the video based on a training previously provided to the machine learning system. Specifically, during a training phase, a plurality of sample videos are provided as an input to the machine learning system. Subsequently, the machine learning system learns to identify boundaries including start times and end times of shots in each of the sample videos based on similarities between frames in the shots. Once trained, the machine learning system identifies similarities among frames in the received media file (109) and further identifies boundaries of the shots in the received media file (109) with corresponding confidence values based on patterns learnt during the training phase. For example, the machine learning system identifies a particular shot in the media file (109) with a high confidence value, for example, 0.8 when features of a key transition frame have significant amount of dissimilarity with features of a set of previous frames. In one embodiment, the key transition frame refers to a complete image that is more than 50% different from a set of previous frames indicating beginning of a set of frames that describe a new scene. Alternatively, the machine learning system identifies the particular shot with a less confidence value, for example, 0.3 when features of the key transition frame have significant amount of similarity with features of the set of previous frames.
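For illustration only, the sketch below uses PySceneDetect, one of the tools named above, to obtain shot boundaries; note that this library reports detected scene boundaries rather than per-shot confidence values, so confidences as described above would have to come from a model such as TransNet V2, and the file name is illustrative.

```python
# Minimal sketch of shot-boundary detection with PySceneDetect.
from scenedetect import detect, ContentDetector

scene_list = detect("media_file_109.mp4", ContentDetector())
for start, end in scene_list:
    print(f"shot from {start.get_seconds():.1f}s to {end.get_seconds():.1f}s")
```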
[0091] At step (408), the highlights generation system (100) filters a subset of shots from the shots identified from the video based on values of configuration parameters that are predefined and specific to the identified genre of the video. Examples of the configuration parameters include a confidence threshold and a minimum shot duration, whose associated values vary from one genre to another genre and are stored in a database of the highlights generation system (100). For example, in one implementation, values of confidence threshold and minimum shot duration for a genre ‘movie’ correspond to 0.5 and 45 seconds, respectively.
[0092] Accordingly, in an exemplary implementation, the genre identification system (116) identifies that the genre of the video is ‘movie’, and the video processing system (114) identifies 50 shots in the video. Out of the 50 shots, the video processing system (114) identifies 40 shots whose individual shot duration exceeds the minimum shot duration of 45 seconds, and another 10 shots whose individual shot duration is equal to or less than 45 seconds. In this example, the highlights generation system (100) performs a first level filtering by selecting only the set of 40 shots whose individual shot duration is above 45 seconds.
[0093] Further, the highlights generation system (100) performs a second level filtering by selecting, from the 40 shots, the subset of shots whose associated confidence values are equal to or above 0.5, which is the confidence threshold predefined for the genre ‘movie’. For example, the highlights generation system (100) determines that only 10 shots out of the 40 shots have associated confidence values that are equal to or greater than 0.5. In this example, the highlights generation system (100) performs the second level filtering by selecting only those 10 shots as the subset of shots.
[0094] Though the previously noted example is specific to filtering the subset of shots from the video belonging to the genre ‘movie,’ it is to be understood that the highlights generation system (100) similarly filters the subset of shots from the video belonging to other genres such as ‘sports’, ‘TV serial’, ‘education’, and ‘news’.
[0095] As noted previously, the values of the configuration parameters including the confidence threshold and minimum shot duration vary from one genre to another genre. Exemplary values of the confidence threshold and minimum shot duration for the genre ‘sports’ may correspond to 0.7 and 15 seconds, respectively. Exemplary values of the confidence threshold and minimum shot duration for a genre ‘TV serial’ may correspond to 0.5 and 45 seconds, respectively. Further, exemplary values of the confidence threshold and minimum shot duration for the genre ‘education’ may correspond to 0.4 and 60 seconds, respectively. Exemplary values of the confidence threshold and minimum shot duration for the genre ‘news’ may correspond to 0.6 and 45 seconds, respectively.
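A minimal sketch of this genre-specific two-level filtering is shown below for illustration only; the configuration values mirror the exemplary values above, and the assumption that each shot is represented as a dict with 'duration' (in seconds) and 'confidence' keys is illustrative.

```python
# Minimal sketch of the genre-specific shot filtering.
GENRE_CONFIG = {
    "movie":     {"confidence_threshold": 0.5, "min_shot_duration": 45},
    "sports":    {"confidence_threshold": 0.7, "min_shot_duration": 15},
    "tv_serial": {"confidence_threshold": 0.5, "min_shot_duration": 45},
    "education": {"confidence_threshold": 0.4, "min_shot_duration": 60},
    "news":      {"confidence_threshold": 0.6, "min_shot_duration": 45},
}

def filter_shots(shots, genre):
    cfg = GENRE_CONFIG[genre]
    # first level: keep shots longer than the minimum shot duration
    long_enough = [s for s in shots if s["duration"] > cfg["min_shot_duration"]]
    # second level: keep shots whose confidence meets the genre's threshold
    return [s for s in long_enough if s["confidence"] >= cfg["confidence_threshold"]]
```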
[0096] In one embodiment, the values of the confidence threshold and minimum shot duration are manually selected for each genre based on trial and error. For example, for selecting the values of the confidence threshold and minimum shot duration for the genre ‘sports’, the video processing system (114) may input one or more sports videos, for example, a video of a football match to the machine learning system in the video processing system (114). Subsequently, the machine learning system, for example, TransNet V2 identifies a plurality of shots including the first shot, second shot, and third shot from the input video along with corresponding confidence values and corresponding shot duration, as indicated subsequently in Table 2.
[0097] Table 2 – A plurality of shots identified from an input video

Shot       Content Captured by Shot                                                     Confidence Value   Duration
1st Shot   Player A running towards a ball and kicking the ball                         0.5                5 seconds
2nd Shot   Passing of the ball up in the air and receiving of the ball by a player B    0.3                3 seconds
3rd Shot   Player A passes the ball to player B who scores a goal                       0.9                16 seconds

[0098] For instance, the machine learning system may identify a confidence value of the first shot as 0.5 and a duration of the first shot as 5 seconds, where the first shot captures a player ‘A’ running towards a ball and kicking the ball. Further, the machine learning system may identify a confidence value of the second shot as 0.3 and a duration of the second shot as 3 seconds, where the second shot captures passing of the ball up in the air and receiving of the ball by another player ‘B’. Furthermore, the machine learning system may identify a confidence value of the third shot as 0.9 and a duration of the third shot as 16 seconds, where the third shot captures the whole event of passing of the ball from the player ‘A’ to the player ‘B’ and scoring of a goal by the player ‘B’.
[0099] In the previously noted example, a user may initially set values of the confidence threshold and minimum shot duration as 0.2 and 3 seconds. Accordingly, the highlights generation system (100) selects all the first, second, and third shots for generating summary of the football match. However, the generated summary needs to include only the third shot as the third shot is a most significant shot capturing scoring of the goal by the player ‘B’, and needs to exclude the first and second shots that are non-significant shots merely capturing passing of the ball from the player ‘A’ to the player ‘B’. Accordingly, the user may refine the values of the confidence threshold and minimum shot duration as 0.5 and 5 seconds. In this instance, the highlights generation system (100) filters and selects the first and third shots for generating the summary of the football match, where the selected first shot needs to be excluded from the summary. Therefore, the user may further refine the values of the confidence threshold and minimum shot duration as 0.7 and 15 seconds. In this instance, the highlights generation system (100) selects only the third shot, which is the most significant shot, for generating the summary of the football match. Subsequently, values 0.7 and 15 seconds, respectively are stored as final values of the confidence threshold and minimum shot duration for ‘sports’ genre. Similarly, the highlights generation system (100) enables the user to identify and customize the values of the confidence threshold and minimum shot duration for other genres using trial and error.
[0100] In certain embodiments, defining the values of the confidence threshold and minimum shot duration as same for different genres may lead to inclusion of a larger number of non-significant shots in the subset of shots. Considering one of the previously noted examples, the highlights generation system (100) would include a non-significant shot such as the first shot in the summary of the football match when the confidence threshold for the genre ‘sports’ is set as 0.5 that is same as the confidence threshold associated with the genre ‘movie.’ Accordingly, the values of the confidence threshold and minimum shot duration are different for different genres.
[0101] At step (410), the audio processing system (112) determines an average audio intensity associated with each of the filtered subset of shots. For example, the audio processing system (112) extracts audio associated with a first shot in the subset of shots from the media file (109). Further, the audio processing system (112) divides the extracted audio into a plurality of acoustic frames. The audio processing system (112) then determines an audio intensity for the first acoustic frame in the plurality of acoustic frames, for example, based on a sum of square of amplitude of the first acoustic frame in the time domain. Similarly, the audio processing system (112) determines audio intensities of all other remaining acoustic frames in the plurality of acoustic frames, and determines the average audio intensity of the first shot by averaging audio intensities determined for all of the acoustic frames corresponding to the first shot. Likewise, the audio processing system (112) determines a corresponding average audio intensity for each of other shots in the subset of shots.
[0102] At step (412), the audio processing system (112) identifies candidate shots from the subset of shots whose corresponding average audio intensities are equal to or above a defined threshold. In one embodiment, the candidate shots identified by the audio processing system (112) are assumed to include occurrence of most significant, exciting, and/or interesting moments in the video. For example, the video corresponding to a football match generally includes multiple shots. However, the average audio intensity of each of the shots in the video would not be above the defined threshold. For example, the average audio intensities of the shots capturing only normal play of the game with no goals scored by the players would generally be less than the defined threshold as audiences may not be cheering and commentators may not be speaking in exciting voices during normal play of the game. However, the average audio intensities of the shots capturing goals scored by the players would generally be more than the defined threshold as audiences would be cheering and commentators would be speaking in exciting voices during and immediately after the players score the goals. Thus, the audio processing system (112) identifies the shots having associated audio intensities that are above the defined threshold from the subset of shots as candidates for generating a summary of the video.
[0103] When the video corresponds to a movie, the audio processing system (112) identifies specific shots whose average audio intensities are above the defined threshold from the subset of shots. Examples of such identified shots include shots that have sudden spikes in their background music and/or have high energy action sequences. The audio processing system (112) then uses the identified shots as candidates for generating a summary of the movie. When the video corresponds to a TV serial, the audio processing system (112) identifies specific shots in the subset of shots having high audio intensities. Examples of such identified shots include shots having intense argument scenes, happy and emotional conversation scenes, and celebratory moments. The audio processing system (112) then uses the identified shots as candidates for generating a summary of the TV serial. When the video corresponds to an educational or news video, the audio processing system (112) identifies specific shots in the subset of shots having high audio intensities. Examples of such identified shots include shots having high tonality variations. The audio processing system (112) then uses the identified shots as candidates for generating a summary of the educational or news video.
[0104] In step (414), the audio processing system (112) estimates a level of noise in audio associated with each of the candidate shots. Specifically, the audio processing system (112) extracts audios corresponding to the candidate shots from the media file (109). The audio processing system (112) then converts the extracted audios into corresponding mono-channel audios. Subsequently, the audio processing system (112) estimates a level of noise in each of the mono-channel audios, for example, using a machine learning system implementing DeepXI algorithm, which generally outperforms masking and mapping-based deep learning approaches for enhancing speech segments in the mono-channel audios.
[0105] In certain embodiments, the level of noise in the audio associated with each of the candidate shots may be high such that speech segments in the audio may not be very clear. In such scenarios, the level of noise in the audio needs to be reduced to improve clarity and quality of speech segments in the audio. To that end, at step (416), the audio processing system (112) reduces the level of noise in the audio associated with each of the candidate shots. Specifically, the audio processing system (112), for example, uses a machine learning system implementing the NoiseReduce algorithm for reducing the level of noise in the audio using noise thresholds and a spectral gating technique.
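For illustration only, a minimal sketch of this reduction step with the noisereduce package follows; it assumes the audio of a candidate shot has been loaded as a mono numpy array, and the function name is illustrative.

```python
# Minimal sketch of spectral-gating noise reduction on a mono-channel candidate-shot audio.
import noisereduce as nr

def denoise_candidate_shot(mono_audio, sample_rate):
    """Return a noise-reduced copy of the candidate shot's mono audio."""
    return nr.reduce_noise(y=mono_audio, sr=sample_rate)
```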
[0106] At step (418), the audio processing system (112) verifies if speech detected in the audio associated with each of the candidate shots is complete. For example, the audio processing system (112) implements a machine learning system, such as WebRTC, that extracts speech segments from the audio associated with a first candidate shot selected from the candidate shots. Further, the machine learning system in the audio processing system (112) analyzes the extracted speech segments to identify if the speech occurring in the first candidate shot is complete. Similarly, the audio processing system (112) analyzes speech segments extracted from each of other candidate shots and identifies if the speech occurring in each of the other candidate shots is complete.
[0107] At step (420), the audio processing system (112) completes incomplete speech occurring in audios associated with a subset of candidate shots selected from the candidate shots. For example, the audio processing system (112) identifies that speech occurring in a particular candidate shot is incomplete. Accordingly, the audio processing system (112) identifies that the particular candidate shot begins at 40th second and ends at 70th second of the media file (109). In this example, the audio processing system (112) adds first audio and video data associated with a segment starting at 30th second and ending at 39th second of the media file (109) at the beginning of the particular candidate shot. Similarly, the audio processing system (112) adds second audio and video data associated with another segment starting at 71st second and ending at 80th second of the media file (109) at the end of the particular candidate shot to obtain a modified candidate shot. Thus, post adding additional video and audio data, the modified candidate shot includes video and audio data associated with 30th second to 80th second of the media file (109).
[0108] Subsequently, the audio processing system (112) verifies if the speech occurring in the audio of the modified candidate shot is complete post adding additional video and audio data to the particular candidate shot. For instance, the audio processing system (112), using the machine learning system such as WebRTC, identifies that the speech occurring in the audio of the modified candidate shot is complete post inclusion of additional video and audio data to the particular candidate shot. Further, the audio processing system (112) identifies that the speech in the modified candidate shot specifically starts at 32nd second and ends at 75th second. In the previously noted example, the audio and video processing systems (112 and 114) include only video and audio data corresponding to 32nd second to 75th second of the media file (109) as part of the modified candidate shot. Similarly, the audio processing system (112) modifies other candidate shots by completing incomplete speech occurring in the other candidate shots by adding additional video and audio data to the candidate shots and by verifying if the speech occurring in the modified candidate shots is complete post addition of additional video and audio data. In one embodiment, the highlights generation system (100) uses one or more of the modified candidate shots for generating a summary of the video in the media file (109).
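One possible, non-limiting sketch of this boundary adjustment is given below; it assumes speech sections of the whole media file are available as (start, end) tuples from a voice activity detector such as WebRTC, and the 10-second padding and function name are assumptions taken from the example above.

```python
# Minimal sketch of widening a candidate shot so that speech at its boundaries is complete.
def complete_speech(shot_start, shot_end, speech_sections, pad=10.0):
    new_start, new_end = shot_start, shot_end
    for s_start, s_end in speech_sections:
        if s_start < shot_start < s_end:        # speech cut at the opening boundary
            new_start = max(s_start, shot_start - pad)
        if s_start < shot_end < s_end:          # speech cut at the closing boundary
            new_end = min(s_end, shot_end + pad)
    return new_start, new_end

# e.g. a shot spanning 40 s to 70 s whose surrounding speech runs from 32 s to 75 s:
# complete_speech(40, 70, [(32, 75)]) -> (32, 75), matching the 32nd-to-75th-second example
```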
[0109] Subsequently, at step (422), the highlights generation system (100) stores the candidate shots and associated audios in an associated database (not shown in FIGS) post identifying all the candidate shots and completing speeches in those candidate shots.
[0110] At step (424), the video processing system (114) identifies one or more protagonists who prominently appear in most of the candidate shots. Specifically, the video processing system (114) employs a machine learning system that extracts facial images of various users appearing in the candidate shots. Examples of the machine learning system include multi-task cascaded convolutional networks (MTCNN), Dlib, FaceNet, ArcFace, and EigenFaces. Further, using the machine learning system, the video processing system (114) identifies one or more users from the various users whose facial images appear prominently across various candidate shots as the one or more protagonists. Examples of the one or more protagonists identified from the candidate shots include leading actor and actress of a movie, a lecturer who provides lecture on a particular education topic, a newsreader reading the news, or players who are all part of a particular sport.
[0111] At step (426), the video processing system (114) identifies emotions of the one or more protagonists. Specifically, the video processing system (114) uses a first machine learning system that extracts facial features from the facial images of the one or more protagonists. Further, the video processing system (114) uses a second machine learning system that identifies emotions of the one or more protagonists based on the facial features extracted from the facial images of the one or more protagonists. An example of the first machine learning system and second machine learning system includes FaceNet and a support vector machine, respectively. Examples of the identified emotions include happy, sad, angry, surprised, disgust, and neutral.
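For illustration only, a minimal sketch of this two-stage approach follows; the training arrays are hypothetical placeholders (in practice the classifier would be trained on a labelled emotion dataset), and the 512-dimensional embedding size and function names are assumptions.

```python
# Minimal sketch of emotion identification from face embeddings with a support vector machine.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["happy", "sad", "angry", "surprised", "disgust", "neutral"]

# hypothetical training data: face embeddings (e.g., from FaceNet) with emotion labels
train_embeddings = np.random.rand(600, 512)
train_labels = np.random.choice(EMOTIONS, size=600)

classifier = SVC(kernel="linear", probability=True)
classifier.fit(train_embeddings, train_labels)

def protagonist_emotion(face_embedding):
    """Predict the emotion label for a single face embedding."""
    return classifier.predict(face_embedding.reshape(1, -1))[0]
```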
[0112] At step (428), the highlights generation system (100) assigns ranks to each candidate shot selected from the candidate shots based on a corresponding average audio intensity and identified emotions of the one or more protagonists in that particular candidate shot. Specifically, in certain embodiments, the highlights generation system (100) stores predefined rules for ranking the candidate shots. For example, one of the predefined rules may include ranking a candidate shot that has the highest average audio intensity and that includes one or more protagonists who are surprised as 1 by the highlights generation system (100). Another predefined rule may lead the highlights generation system (100) to rank a candidate shot that has the second highest average audio intensity and that includes one or more protagonists who are happy as 2. Similarly, the highlights generation system (100) may assign a rank 3 to a candidate shot that has the third highest average audio intensity and that includes one or more protagonists who are angry.
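A minimal sketch of one possible rank assignment is given below; the emotion preference order is only one way of encoding the predefined rules described above, and the dict keys are assumptions.

```python
# Minimal sketch of ranking candidate shots by average audio intensity and protagonist emotion.
EMOTION_PRIORITY = {"surprised": 0, "happy": 1, "angry": 2,
                    "sad": 3, "disgust": 4, "neutral": 5}

def rank_candidate_shots(shots):
    """shots: list of dicts with 'avg_intensity' and 'emotion'; ranks are added in place."""
    ordered = sorted(shots,
                     key=lambda s: (-s["avg_intensity"], EMOTION_PRIORITY[s["emotion"]]))
    for rank, shot in enumerate(ordered, start=1):
        shot["rank"] = rank
    return ordered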
[0113] At step (430), the summary generation system (118) generates a new media file including a summary of the video in the media file (109). In one embodiment, the summary generation system (118) generates the summary of the video by stitching the candidate shots in a particular sequence based on the order of the candidate shots in the media file (109). For example, the highlights generation system (100) identifies that the video includes three candidate shots. The first candidate shot starts at the 10th minute and ends at the 11th minute of the media file (109), the second candidate shot starts at the 19th minute and ends at the 21st minute, and the third candidate shot starts at the 29th minute and ends at the 31st minute. In this example, the summary generation system (118) generates the summary by stitching all three candidate shots in a particular sequence such that the generated summary starts with the first candidate shot, followed by the second candidate shot, and ends with the third candidate shot. In one embodiment, the summary generation system (118) tailors a total length of the summary based on the corresponding ranks assigned to the candidate shots.
[0114] At step (432), the summary generation system (118) also generates highlights of the video in the media file (109). In one embodiment, the summary generation system (118) generates the highlights of the video based on the ranks assigned to the candidate shots. Specifically, the summary generation system (118) tailors a total length of the highlights to a desired length based on the corresponding ranks assigned to the candidate shots. For example, when the total length of the highlights needs to be 1.5 minutes, the summary generation system (118) selects the first ranked candidate shot that is 60 seconds long, the second ranked candidate shot that is 15 seconds long, and the third ranked candidate shot that is another 15 seconds long. Accordingly, the summary generation system (118) generates the highlights of the video by stitching all the selected shots, including the first, second, and third ranked candidate shots, for example, using the ffmpeg library or the MoviePy library. However, in another example, the summary generation system (118) selects only the first ranked candidate shot that is 60 seconds long when the total length of the highlights needs to be only 1 minute.
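A duration-budgeted assembly of ranked shots could look like the following sketch, which uses the MoviePy library mentioned above; the file paths, shot timings, and greedy selection policy are illustrative assumptions.

    from moviepy.editor import VideoFileClip, concatenate_videoclips

    def build_highlights(media_path, ranked_shots, max_seconds, out_path):
        """ranked_shots: list of (rank, start_sec, end_sec) tuples.
        Adds shots in rank order until the duration budget is exhausted,
        then stitches the selected shots into a single highlights file."""
        source = VideoFileClip(media_path)
        selected, total = [], 0.0
        for _rank, start, end in sorted(ranked_shots):
            length = end - start
            if total + length > max_seconds:
                continue
            selected.append(source.subclip(start, end))
            total += length
        if selected:
            concatenate_videoclips(selected).write_videofile(out_path)

    # Illustrative call: a 90-second budget accommodating 60 s + 15 s + 15 s shots.
    # build_highlights("media_109.mp4",
    #                  [(1, 600.0, 660.0), (2, 1140.0, 1155.0), (3, 1740.0, 1755.0)],
    #                  90, "highlights.mp4")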
[0115] In certain embodiments, the summary generation system (118) uses the generated summary and the generated highlights of the video in different ways. In case of the genre ‘TV serial’, the summary generation system (118) generates highlights of a particular episode, for example, the 101st episode of a TV serial, by stitching only the first and second ranked candidate shots identified from the 101st episode. The summary generation system (118) then stitches the highlights of the 101st episode after the closing segment of the present-day episode, i.e., the 100th episode, to provide audiences with a preview of the 101st episode. Subsequently, the highlights generation system (100) automatically transmits the stitched media content, including content related to the 100th episode and the highlights of the 101st episode, to the user devices (130A-N) at a specified date and time. In one embodiment, the highlights generation system (100) may be configured to ensure that the highlights stitched to the 100th episode do not include shots that reveal climax or suspense scenes in the 101st episode.
[0116] In certain embodiments, once the 101st episode is broadcast, the summary generation system (118) generates a summary of the 101st episode by stitching together all candidate shots, including shots that reveal climax and/or suspense scenes in that particular episode. The summary generation system (118) then stitches the summary of the 101st episode before the opening segment of the subsequent episode, i.e., the 102nd episode, to provide audiences with a recap of the 101st episode. Subsequently, the highlights generation system (100) automatically transmits the stitched media content, including the recap of the 101st episode and content related to the 102nd episode, to the user devices (130A-N) at a specified date and time. Therefore, the same summary generation system (118) uses the generated highlights for showing a preview of a particular TV serial episode and uses the generated summary for showing a recap of that episode.
[0117] Similarly, in case of the genre ‘movie’, the summary generation system (118) generates highlights or a trailer of the movie before its release. The generated highlights of the movie may include candidate shots that are ranked, for example, above 5, and exclude the candidate shots that are ranked from 1 to 5 to ensure that such top-ranked shots are not revealed to audiences before the release of the movie. Further, the highlights generation system (100) automatically uploads or transmits the highlights of the movie to one or more of the content delivery server, OTT platforms, VOD platforms, and social media platforms based on a particular date and time previously set by a user.
[0118] In addition to generating the highlights of the movie, the summary generation system (118) generates a summary of the movie by stitching together all of the candidate shots identified from the movie. The generated highlights are made available as a teaser or a trailer on a particular OTT platform prior to the release of the movie. Subsequently, both the movie and the summary of the movie are made available on the particular OTT platform at the time of release of the movie such that subscribers can make an informed decision on whether to watch the movie based on the summary. In addition, it is to be understood that the summary generation system (118) similarly generates highlights of media content belonging to other genres such as ‘sports’, ‘news program’, and ‘education program’ by selectively stitching top-ranked candidate shots, and generates summaries by stitching together all of the candidate shots.
[0119] In case of ‘sports’ genre, post completion of a sports match, the highlights generation system (100) uploads a summary of the sports match across one or more of the content delivery server, OTT platforms, VOD platforms, and social media platforms for enabling users to watch the summary of the sports match. In case of ‘news’ genre, the highlights generation system (100) summarizes a news program and provides the summary at the end of the news program. In case of ‘education’ genre, the highlights generation system (100) summarizes the entire education program and provides the summary of the education program to students for their revision.
[0120] As noted previously, the processing steps associated with generating the summary of a TV serial, sports, movie, education, or news program are different from those associated with generating the summary of a dance or singing reality show. Using the processing steps described previously with reference to FIGS. 4A-B to generate the summary of a dance or singing reality show would lead to a very long summary, as shots associated with singing or dancing reality shows are generally lengthy. Accordingly, custom processing steps associated with generating the summary of the media file (109) that corresponds to a dance or singing reality show are described subsequently with reference to FIGS. 5A-B.
[0121] FIGS. 5A-B illustrate a flow diagram depicting another exemplary method (500) for generating a summary of a video in the media file (109) whose associated genre corresponds to one of a singing reality show and a dancing reality show. Specifically, at step (502), the highlights generation system (100) receives the media file (109) that includes a video of a reality show and an audio associated with the video from the content server (102). At step (504), the genre identification system (116) identifies that the genre of the video is one of the singing reality show and the dancing reality show, as noted previously with reference to FIGS. 2A-C.
[0122] At step (506), the audio processing system (112) extracts the audio from the media file (109). At step (508), the audio processing system (112) converts the extracted audio into a spectrogram. At step (510), the audio processing system (112) identifies a plurality of performance segments and a plurality of speech segments from the extracted audio. In one embodiment, the performance segments refer to segments in the extracted audio that capture only singing performances, or dancing performances with songs in the background. The speech segments refer to segments that do not include performances of contestants of the reality show. For example, the speech segments include a welcome note and introduction provided by a show host and judges of the reality show, conversations between contestants and the judges, conversations among the judges, and comments passed by the judges on the contestants’ performances.
[0123] To identify the performance and speech segments from the extracted audio, the audio processing system (112) selects values of audio parameters for dividing the extracted audio into multiple audio segments. Examples of the audio parameters include a hop length, a window size, and a sampling rate. The hop length corresponds to a number of audio samples in between successive audio frames, the window size corresponds to an amount of time over which a waveform is sampled, and the sampling rate corresponds to a number of samples of audio recorded every second.
[0124] Based on the audio parameters, the audio processing system (112) divides the extracted audio from the reality show, for example, of 40 minutes length into 8 audio segments, where each of the 8 audio segments is of 5 minutes length. Subsequently, the audio processing system (112) processes the first audio segment that is of 5 minutes length using a machine learning system, for example, VocalRemover.
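The spectrogram conversion and the fixed-length segmentation can be sketched with the librosa library as follows. The sampling rate, window size, hop length, and segment length below are illustrative values rather than those used by the audio processing system (112), and the vocal/music separation performed by a tool such as VocalRemover is not reproduced here.

    import numpy as np
    import librosa

    SAMPLING_RATE = 16000      # samples of audio recorded every second (assumed)
    WINDOW_SIZE = 1024         # samples over which each waveform window is analysed
    HOP_LENGTH = 512           # samples between successive audio frames
    SEGMENT_SECONDS = 5 * 60   # 5-minute audio segments, as in the example above

    def audio_segments_and_spectrograms(audio_path):
        """Yield (start_time_sec, waveform, spectrogram_dB) for each fixed-length
        segment of the extracted audio."""
        y, sr = librosa.load(audio_path, sr=SAMPLING_RATE, mono=True)
        step = SEGMENT_SECONDS * sr
        for start in range(0, len(y), step):
            segment = y[start:start + step]
            spectrogram = librosa.amplitude_to_db(
                np.abs(librosa.stft(segment, n_fft=WINDOW_SIZE,
                                    hop_length=HOP_LENGTH)))
            yield start / sr, segment, spectrogram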
[0125] For example, the first 4 minutes of the first audio segment include speech segments corresponding to a welcome note and introduction by a show host and judges of the reality show. The last 1 minute of the first audio segment includes audio associated with a first performance in the reality show. The first performance may be a singing performance in which a contestant sings a song, or a dancing performance in which the contestant dances along with a song or music. In this example, using the machine learning system, the audio processing system (112) represents the first 4 minutes including the speech segments as “VVVVVSSVVVVV…”, and the last 1 minute including the audio associated with the first performance as “MMMMM…”, as indicated in equation (2):

First audio segment = {VVVVVSSVVVVV…MMMMM…} (2)

where, V corresponds to a human voice, S corresponds to silence, and M corresponds to the song or music.

[0126] Post representing the first audio segment as indicated previously using equation (2), the audio processing system (112) uses the machine learning system to generate a speech segment and a performance segment from the first audio segment, where each of the speech segment and the performance segment is also of 5 minutes length. Specifically, the machine learning system generates the speech segment by retaining the human voice (‘V’) and silence (‘S’) portions in equation (2) as they are, and by modifying the music (‘M’) portions in equation (2) into silence (‘S’) portions, as noted subsequently in equation (3).

Speech segment = {VVVVVSSVVVVV…SSSSS…} (3)

Performance segment = {SSSSSSSSSSSS…MMMMM…} (4)

[0127] Similarly, the machine learning system generates the performance segment by retaining the music (‘M’) portions in equation (2) as they are, and by modifying the human voice (‘V’) and silence (‘S’) portions in equation (2) into silence (‘S’) portions, as noted previously in equation (4).
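Given a per-frame label sequence in the V/S/M notation of equation (2), the derivation of equations (3) and (4) reduces to a simple substitution, as the following sketch shows; representing the labels as a character string is an assumption made for illustration.

    def split_speech_and_performance(labels):
        """labels: per-frame labels such as 'VVVVVSSVVVVVMMMMM', where V is human
        voice, S is silence, and M is music, as in equation (2).  Returns the
        speech and performance versions of the same sequence, mirroring
        equations (3) and (4)."""
        speech = ''.join('S' if c == 'M' else c for c in labels)
        performance = ''.join(c if c == 'M' else 'S' for c in labels)
        return speech, performance

    # split_speech_and_performance('VVVVVSSVVVVVMMMMM')
    # -> ('VVVVVSSVVVVVSSSSS', 'SSSSSSSSSSSSMMMMM')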
[0128] Post generating the speech and performance segments from the first audio segment, the audio processing system (112) discards the speech segment and identifies that the first performance starts from the 4th minute of the reality show, for example, based on timestamp information associated with the performance segment. Post processing the first audio segment, the audio processing system (112) similarly processes the second audio segment selected from the extracted audio. For example, the initial section of the second audio segment may include music (‘M’) portions and the later section of the second audio segment may include human voice (‘V’) portions, where the judges may be providing feedback on the first performance, as noted in equation (5):

Second audio segment = {MMMMM…VVVVVVVVVVVV} (5)

[0129] In the previously noted example, the audio processing system (112) generates a subsequent speech segment represented using equation (6) and a subsequent performance segment represented using equation (7) from the second audio segment, as described previously.

Speech segment = {SSSSS…VVVVVVVVVVVV} (6)

Performance segment = {MMMMM…SSSSSSSSSSSS} (7)

[0130] Post generating the subsequent speech and performance segments from the second audio segment, the audio processing system (112) discards the speech segment and identifies, based on timestamp information associated with the performance segment, that the portion of the performance captured in the second audio segment starts, for example, at the 5th minute and ends at the 7th minute of the reality show. Further, in the previously noted examples, the audio processing system (112) identifies that the performance segment generated from the second audio segment is actually a continuation of the performance segment generated from the first audio segment. Accordingly, the audio processing system (112) identifies that the first performance in the reality show starts at the 4th minute and ends at the 7th minute of the reality show. Similarly, the audio processing system (112) processes the other audio segments in the extracted audio and identifies a plurality of other performances in the reality show along with their corresponding start times and end times.
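Reconciling a performance that spans two adjacent audio segments amounts to merging touching time spans. A minimal sketch follows, assuming each span is given in seconds; the gap tolerance is an arbitrary assumption.

    def merge_performance_spans(spans, gap_tolerance=1.0):
        """spans: chronological (start_sec, end_sec) pairs of music detected in
        each fixed-length audio segment.  Spans that touch across a segment
        boundary are merged so that one performance spanning two segments is
        reported once."""
        merged = []
        for start, end in sorted(spans):
            if merged and start - merged[-1][1] <= gap_tolerance:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        return [tuple(span) for span in merged]

    # merge_performance_spans([(240, 300), (300, 420)]) -> [(240, 420)], i.e. one
    # performance running from the 4th to the 7th minute of the reality show.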
[0131] At step (512), the highlights generation system (100) stores the plurality of performance segments identified from the extracted audio along with their corresponding start times and corresponding end times in an associated database (not shown in FIGS). At step (514), the video processing system (114) extracts videos associated with the performance segments from the media file (109), where each extracted video selected from the extracted videos would include a performance such as a singing or dancing performance of a contestant.
[0132] In certain embodiments, the extracted videos only include singing or dancing performances of contestants and may not capture the judges’ reactions to the performances, which are required for ranking the performances. Accordingly, at step (516), the video processing system (114) extends lengths of the extracted videos capturing performances of the contestants by a designated length such that extended videos capture both the performances and the judges’ reactions to the performances. For example, an extracted video capturing a singing or dancing performance of the first contestant is of 3 minutes length, the extracted video starting at the 4th minute and ending at the 7th minute of the reality show. In the previously noted example, the video processing system (114) extends the length of the extracted video from 3 minutes to 7 minutes by adding additional video segments of a designated length corresponding to 7th minute to 11th minute of the reality show at the end of the extracted video. The additional video segments added to the extracted video would include the judges’ comments and reactions to the singing or dancing performance of the first contestant. Similarly, the video processing system (114) extends lengths of other extracted videos to capture the judges’ reactions to the performances.
[0133] At step (518), the video processing system (114) detects emotions of the judges of the reality show from each of the extended videos. To that end, the video processing system (114) implements a machine learning system, for example, YoloV4 that identifies objects such as chairs, tables, and/or microphones in the extended video that captures both the singing and dancing performance of the first contestant and the judges’ reactions to the performance. Subsequently, the video processing system (114) implements another set of machine learning systems, for example, MTCNN and FaceNet that identify and recognize human faces that are close to a particular combination of the identified objects, for example chairs, tables, and microphones, as judges of the reality show. Further, the video processing system (114) implements yet another machine learning system, for example, a support vector machine that analyzes facial information of the judges and detects an emotion of each of the judges as happy, upset, sad, angry, surprised, disgust, or neutral. Similarly, the video processing system (114) detects emotions of the judges from each of the other extended videos.
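Once the object detector and the face detector have produced bounding boxes, locating the judges can be reduced to a geometric proximity check. The heuristic below is purely illustrative and not the described system's method; the bounding-box format and the max_gap margin are assumptions.

    def faces_of_judges(face_boxes, object_boxes, max_gap=50):
        """face_boxes: [(x1, y1, x2, y2), ...] from the face detector.
        object_boxes: {'chair': [...], 'table': [...], 'microphone': [...]}
        from the object detector.  A face is kept as a judge's face when its
        centre lies within max_gap pixels of a chair, table, or microphone box."""
        def near(face, obj):
            fx, fy = (face[0] + face[2]) / 2, (face[1] + face[3]) / 2
            ox, oy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
            return (abs(fx - ox) <= max_gap + (obj[2] - obj[0]) / 2
                    and abs(fy - oy) <= max_gap + (obj[3] - obj[1]) / 2)
        anchors = [box for boxes in object_boxes.values() for box in boxes]
        return [face for face in face_boxes
                if any(near(face, box) for box in anchors)]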
[0134] At step (520), the video processing system (114) identifies if there is a standing ovation from the judges after each performance. For example, the video processing system (114) implements an algorithm, for example, OpenPose or AlphaPose, which identifies body poses of the judges from the extended video that captures both the singing or dancing performance of the first contestant and the judges’ reactions to the performance. The video processing system (114) then identifies from the identified body poses whether the judges in the extended video are providing a standing ovation. Similarly, the video processing system (114) identifies whether the judges in the other extended videos, capturing performances of other contestants, are providing a standing ovation based on their identified body poses.
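One simple way to decide standing versus seated from pose keypoints is to compare the vertical hip-to-knee distance with the knee-to-ankle distance. The sketch below assumes keypoints in image pixel coordinates (y growing downward) and an arbitrary ratio threshold; it is a heuristic stand-in for the pose analysis performed by OpenPose or AlphaPose, not the described system's rule.

    def is_standing(keypoints, ratio_threshold=0.7):
        """keypoints: dict with pixel coordinates for 'hip', 'knee', and 'ankle'
        of one judge (e.g. averaged left/right joints from the pose estimator).
        Image y grows downward, so a standing judge has the hip well above the
        knee, whereas a seated judge's hip is nearly level with the knee."""
        hip_to_knee = keypoints['knee'][1] - keypoints['hip'][1]
        knee_to_ankle = keypoints['ankle'][1] - keypoints['knee'][1]
        if knee_to_ankle <= 0:
            return False
        return hip_to_knee / knee_to_ankle >= ratio_threshold

    def standing_ovation(all_judges_keypoints):
        """True when every detected judge is standing after the performance."""
        return all(is_standing(kp) for kp in all_judges_keypoints)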
[0135] At step (522), the performance ranking system (120) assigns a rank to each of the performances based on standing ovations from the judges and the emotions of the judges. For example, the performance ranking system (120) assigns the 1st rank to a particular performance when the judges provide a standing ovation after the particular performance and the emotions of all the judges are happy. In another example, the performance ranking system (120) assigns the 2nd rank to the particular performance when the judges do not provide a standing ovation but the emotions of all the judges are happy. In yet another example, the performance ranking system (120) assigns the 3rd rank to the particular performance when only one of the judges provides a standing ovation and the emotions of some of the judges are neutral.
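The example ranking rules can be captured in a small scoring function. The sketch below is one possible encoding under the assumption that each performance record carries the number of standing judges and one detected emotion per judge; the field names and tie handling are assumptions.

    def rank_performances(performances):
        """performances: list of dicts with 'ovation_count' (number of judges
        standing) and 'emotions' (one detected emotion string per judge).
        Encodes the example rules: full ovation with all judges happy ranks
        first, all judges happy without an ovation ranks next, and a partial
        ovation ranks after that."""
        def score(performance):
            all_happy = all(e == 'happy' for e in performance['emotions'])
            full_ovation = performance['ovation_count'] == len(performance['emotions'])
            if full_ovation and all_happy:
                return 0
            if all_happy:
                return 1
            if performance['ovation_count'] >= 1:
                return 2
            return 3
        ordered = sorted(performances, key=score)
        for rank, performance in enumerate(ordered, start=1):
            performance['rank'] = rank
        return ordered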
[0136] At step (524), the summary generation system (118) generates a new media file including a summary of the reality show. In one embodiment, the summary generation system (118) generates the summary of the reality show by stitching all the performances identified from the reality show. Further, the highlights generation system (100) uploads or transmits the summary of the reality show across one or more of the content delivery server, OTT platforms, VOD platforms, and social media platforms for enabling audiences to watch the performances of their favorite contestants.
[0137] At step (526), the summary generation system (118) generates highlights of the reality show based on the rankings assigned to the performances of the contestants. Specifically, the summary generation system (118) tailors a total length of the highlights of the reality show to a desired length based on the ranks assigned to the performances identified from the reality show. For example, the summary generation system (118) selects sub-segments only from the first and second ranked performances when the total length of the highlights has a predefined limit of 1 to 2 minutes, where an average audio intensity of each of the selected sub-segments is above a particular threshold. The summary generation system (118) then stitches the selected sub-segments to generate the highlights of the reality show. In another example, the summary generation system (118) selects the sub-segments from the first, second, and third ranked performances when the total length of the highlights needs to be 2 to 4 minutes, and then stitches the selected sub-segments to generate the highlights of the reality show. The highlights generation system (100) then uploads or shares the highlights of the reality show across one or more of the content delivery server, OTT platforms, VOD platforms, and social media platforms as a trailer of the reality show or for promoting the reality show.
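Selecting the loud sub-segments from a top-ranked performance can be approximated by windowed RMS intensity, as in the following sketch; the window length and intensity threshold are arbitrary assumptions, and the waveform is presumed to be a NumPy array of audio samples.

    import numpy as np

    def loud_subsegments(waveform, sample_rate, window_sec=5.0, intensity_threshold=0.1):
        """waveform: audio samples of one top-ranked performance.  Returns the
        (start_sec, end_sec) windows whose root-mean-square intensity exceeds the
        threshold; these are the sub-segments stitched into the highlights."""
        hop = int(window_sec * sample_rate)
        spans = []
        for start in range(0, len(waveform), hop):
            window = waveform[start:start + hop]
            if len(window) and float(np.sqrt(np.mean(window ** 2))) > intensity_threshold:
                spans.append((start / sample_rate,
                              min(len(waveform), start + hop) / sample_rate))
        return spans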
[0138] As noted previously, conventional highlights generation systems require significant amounts of computational resources to understand and interpret the meaning of the thousands of scenes in a video in order to generate highlights. Conventionally, such systems employ metadata information and/or extensive video processing to generate highlights from a particular type of media content, for example, a movie or a TV show. However, such systems fail to generate highlights for media content of other genres, such as a news program or an education program, as such content may lack specific categorizations such as violence or romance. Therefore, such systems cannot be used universally to generate highlights for all the different types of genres of media content, as every genre has its own peculiar characteristics.
[0139] The present highlights generation system (100) addresses such genre-specific peculiarities in media content by first identifying the specific genre of the input media content and then applying custom processing tailored to that genre to generate highlights of greater relevance. Particularly, the present highlights generation system (100) is capable of identifying different types of genres such as a singing reality show, a dancing reality show, a sports program, a TV serial, an action movie, a drama movie, an education-related program, and a news program. Unlike conventional systems that require significant processing resources for processing the entire media content, the highlights generation system (100) automatically identifies genres of media content by processing text, audio, and video data associated with small segments selected from the opening and closing segments of the media content, thus needing significantly less processing and storage resources. Further, unlike conventional highlights generation systems that generate highlights only for one or two types of media content, the present highlights generation system (100) is capable of generating highlights for different types of media content including, but not limited to, reality shows, sports programs, television (TV) serials, movies, education programs, and news programs.
[0140] Although specific features of various embodiments of the present systems and methods may be shown in and/or described with respect to some drawings and not in others, this is for convenience only. It is to be understood that the described features, structures, and/or characteristics may be combined and/or used interchangeably in any suitable manner in the various embodiments shown in the different figures.
[0141] While only certain features of the present systems and methods have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes.

LIST OF NUMERAL REFERENCES:

100 Highlights Generation System
102 Content Server
104 Communication Link
106 Media Content Storage System
108 Media Formatting System
109 Media File
110 Text Processing System
112 Audio Processing System
114 Video Processing System
116 Genre Identification System
118 Summary Generation System
120 Performance Ranking System
200-244 Steps of a method for identifying a genre of a particular media content
302 Pixel in Video Frames
304, 306 Video Frames
308 Horizontal Pixel Distance
310 Vertical Pixel Distance
400-432 Steps of a method for generating a summary of media content belonging to one of a first set of genres
500-526 Steps of a method for generating a summary of media content belonging to one of a second set of genres

Documents

Application Documents

# Name Date
1 202241019683-POWER OF AUTHORITY [31-03-2022(online)].pdf 2022-03-31
2 202241019683-FORM-9 [31-03-2022(online)].pdf 2022-03-31
3 202241019683-FORM 3 [31-03-2022(online)].pdf 2022-03-31
4 202241019683-FORM 18 [31-03-2022(online)].pdf 2022-03-31
5 202241019683-FORM 1 [31-03-2022(online)].pdf 2022-03-31
6 202241019683-FIGURE OF ABSTRACT [31-03-2022(online)].jpg 2022-03-31
7 202241019683-DRAWINGS [31-03-2022(online)].pdf 2022-03-31
8 202241019683-COMPLETE SPECIFICATION [31-03-2022(online)].pdf 2022-03-31
9 202241019683-FORM-26 [08-04-2022(online)].pdf 2022-04-08
10 202241019683-FER.pdf 2022-10-20
11 202241019683-FORM-26 [13-04-2023(online)].pdf 2023-04-13
12 202241019683-FORM 3 [13-04-2023(online)].pdf 2023-04-13
13 202241019683-FER_SER_REPLY [13-04-2023(online)].pdf 2023-04-13
14 202241019683-CORRESPONDENCE [13-04-2023(online)].pdf 2023-04-13
15 202241019683-CLAIMS [13-04-2023(online)].pdf 2023-04-13
16 202241019683-PatentCertificate30-01-2025.pdf 2025-01-30
17 202241019683-IntimationOfGrant30-01-2025.pdf 2025-01-30

Search Strategy

1 202241019683SearchE_19-10-2022.pdf
