
Method And System For Enabling Skipping Of Selected Portions Of Media Content

Abstract: A method for skipping selected portions of a digital content is provided. The method includes accessing digital content from a content database (204) and extracting transcripts (218), visual parameters and diarized audio (228) from the digital content. One or more textual cues are identified based on the transcripts (218), visual cues based on the visual parameters and audio cues based on the diarized audio (228), where the identified cues are indicative of the selected portions. A start time of each selected portion is determined based on the cues and an end time based on a comparison of frame properties of a frame corresponding to the start time and subsequent frames. The method includes displaying a widget (502) on a graphical user interface (500) of a user device (106) at the start time of each selected portion and skipping playback of the selected portion upon activation of the widget (502).


Patent Information

Filing Date: 20 September 2024
Publication Number: 40/2024
Publication Type: INA
Invention Field: COMPUTER SCIENCE

Applicants

TATA ELXSI LIMITED
ITPB Road, Whitefield, Bangalore – 560048, India

Inventors

1. ARUN PARTHASARATHY
TATA ELXSI LIMITED, ITPB Road, Whitefield, Bangalore – 560048, India
2. MADHUR SHANKAR DHURGA SHANKAR
TATA ELXSI LIMITED, ITPB Road, Whitefield, Bangalore – 560048, India

Specification

Description:

METHOD AND SYSTEM FOR ENABLING SKIPPING OF SELECTED PORTIONS OF MEDIA CONTENT

RELATED ART

[0001] Embodiments of the present disclosure relate generally to content delivery, and more particularly to managing playback of selected sections of digital media content based on user preferences.
[0002] Digital content consumption has become ubiquitous in recent years with the availability of multiple content viewing platforms such as Netflix, Prime, Hulu and YouTube. Present-day content delivery platforms have changed the way a user consumes content. These platforms give users the ultimate choice of what to watch and when to watch it, instead of the traditional way of watching regularly scheduled programs interspersed with ad breaks. These platforms offer an ever-expanding variety of content such as movies, episodic and serialized shows, sports, documentaries, educational and animated content to cater to continuously changing consumer interests and preferences.
[0003] Furthermore, these platforms spend a huge amount of resources to innovate and build the best content recommendation algorithms tailored to users' viewing patterns and to cater to their ever-diminishing patience for finding and consuming content. For example, these platforms provide users with the ability to skip commonly non-preferred portions of content such as introductions, recaps, opening credits, closing credits, and ad breaks, among others. These non-preferred portions interrupt viewing momentum and break the continuity of the storyline of the content, which degrades the content viewing experience for users. Therefore, these platforms provide users with the ability to choose what not to watch via thumbs down or unlike, skip intro, skip outro, skip recap, skip ad, and skip survey options to ensure the user spends more time on the platform.
[0004] Typically, in content such as series and movies, certain non-preferred portions such as ad breaks, recaps and introductions may occur at predefined times in the content. Accordingly, these non-preferred portions may be tagged by a content owner or distributor who has access to the source media files of the content, and who inserts content-skipping features in relation to the tagged portions while processing and uploading the content. However, preferred and non-preferred portions may differ for different types of content, for different types of users, at different times.
[0005] For example, educational content such as training videos, tutorials, conference proceedings and panel discussions, among others, may include different types of non-preferred portions, such as question-and-answer sessions among participants and significant silent or dead times. Typically, users scrub the progress bar, that is, fast-forward and reverse multiple times through training videos, to identify and consume portions of the content of their specific preference. Attempting to tag such content, spanning multiple hours, to suit the different preferences of different users would involve exorbitant computing effort and waste significant network storage. This places an undue burden on the content owner or distributor to tag the likely preferred portions of the source media files and insert content-seeking features into the content while processing and uploading it. However, failing to provide users the ability to choose relevant portions of the video hinders the viewing experience and may lead to loss of viewers and associated sponsorship and subscription revenue.
[0006] Accordingly, there is a need for a solution that enables users to quickly identify and navigate to sections of the content that are of interest to them, while skipping irrelevant or unwanted segments, without requiring time- and resource-consuming analysis and tagging of source videos.

BRIEF DESCRIPTION

[0007] It is an objective of the present disclosure to provide a method for enabling skipping of selected portions of a digital content. The method includes accessing digital content from a content database by an intelligent content customization system. Further, the method includes extracting one or more of textual, visual and audio information from the digital content by the intelligent content customization system. Extracting the textual information comprises extracting one or more transcripts associated with an audio of the content, extracting the visual information comprises extracting one or more visual parameters from a video associated with the content and extracting the audio information comprises diarizing the audio. Further, the method includes identifying one or more textual cues based on the extracted transcripts, one or more visual cues based on the extracted visual parameters, and one or more audio cues based on the diarized audio. The one or more of the textual cues, visual cues and the audio cues are indicative of one or more selected portions in the content. The one or more selected portions are non-preferred portions of the content.
[0008] Further, the method includes determining a start time corresponding to each of the one or more selected portions of the content based on one or more timestamps in the digital content that correspond to an onset of one or more of the identified textual cues, visual cues and the audio cues. Each of the one or more selected portions comprises a corresponding end time determined by identifying a subsequent frame comprising one or more frame properties that are different from the frame properties associated with a frame corresponding to the determined start time. Furthermore, the method includes displaying a widget on a graphical user interface of a user device at the determined start time of each of the one or more selected portions when the digital content is played on the user device. The method includes further skipping playback of the one or more selected portions of the content upon activation of the widget.
[0009] Further, in the method, identifying the one or more audio cues comprises determining one or more segments of the diarized audio that comprise speech from more than one participating speaker exceeding a threshold duration of time. Further, determining the start time of each of the one or more selected portions of the digital content based on the one or more timestamps that correspond to the onset of the identified audio cues comprises identifying a number of speakers in the diarized audio based on one or more of an audio intensity parameter, an audio histogram parameter, and an audio frequency parameter corresponding to the diarized audio. Further, determining the start time of each of the one or more selected portions includes identifying a timestamp in the diarized audio that corresponds to an absence of speech in the audio for a predefined duration of time.
[0010] Further, the method includes determining the start time corresponding to each of the one or more selected portions of the digital content based on one or more timestamps in the digital content that corresponds to the onset of one or more of the textual cues, visual cues and the audio cues comprising determining if the one or more timestamps corresponding to the onset of two or more of the textual cues, visual cues and audio cues coincide. Further, the method includes identifying the one or more timestamps in the digital content that corresponds to the onset of the two or more of the textual cues, visual cues and the audio cues that coincide as the corresponding start time of each of the one or more selected portions.
[0011] Further, the method includes determining the start time corresponding to each of the one or more selected portions of the digital content based on one or more timestamps in the digital content that corresponds to the onset of one or more of the textual cues, visual cues and the audio cues comprising determining if the one or more timestamps corresponding to the onset of two or more of the textual cues, visual cues and the audio cues coincide. Further the method includes identifying one or more lags between the one or more timestamps corresponding to the two or more of the textual cues, visual cues and audio cues upon identifying that the one or more timestamps corresponding to the onset of the two or more of the textual cues, visual cues and audio cues do not coincide and determining if the lags between the timestamps are within a predefined range. The method further includes identifying an earliest timestamp from the one or more timestamps corresponding to the two or more of the textual cues, visual cues and audio cues as the start time of each of the one or more selected portions when the one or more timestamps do not coincide.
[0012] Further, in the method, extracting the visual parameters further comprises extracting one or more machine-encoded texts from image frames corresponding to the content via optical character recognition (OCR), wherein the one or more machine-encoded texts comprise optical character recognition data.
[0013] Further, the method includes determining the end time of each of the one or more selected portions comprising comparing one or more corresponding frame properties determined from a currently displayed frame of the digital content with one or more subsequent frames until identifying a difference in the corresponding frame properties. The currently displayed frame corresponds to the start time of a selected portion from the one or more selected portions in the digital content. A subsequent frame comprising the identified difference corresponds to the end time of the selected portion. The method includes determining a timestamp in the digital content corresponding to a transition to the subsequent frame and tagging the determined timestamp as the end time of the selected portion.
[0014] Further, determining the end time corresponding to each of the one or more selected portions includes determining if the extracted transcripts indicate absence of the textual cues corresponding to a currently displayed frame. The method further includes determining if the diarized audio indicates a change from more than one speaker to one speaker corresponding to an audio associated with the currently displayed frame. Further, determining the end time corresponding to each of the one or more selected portions further includes determining if the extracted transcripts indicate instances of resumption of discussion relevant to a topic of interest corresponding to the currently displayed frame.
[0015] Further, displaying the widget on a graphical user interface of the user device includes receiving a master file comprising the textual cues, visual cues and audio cues that are indicative of the one or more selected portions in the digital content, and the start time and end time of each of the one or more selected portions. Further, the method includes displaying the widget at the start time corresponding to each of the one or more selected portions determined from the master file when the digital content is played on the user device. Further, the method includes receiving an activation input corresponding to the displayed widget and skipping the one or more selected portions until the end time upon receiving the activation input.
[0016] In the method, skipping the one or more selected portions until the end time upon receiving the activation input comprises skipping one or more of the selected portions from an activation timestamp until the end time. The activation timestamp corresponds to a point in time in the digital content at which the activation input is received by the user device. Further, skipping the one or more selected portions includes skipping one or more portions of the digital content corresponding to one or more categories associated with educational content, wherein the one or more categories comprise one or more of introduction sessions, question and answer sessions, silence, and break sessions.
[0017] Further, identifying one or more of the textual cues, visual cues, and audio cues includes processing one or more of the digital content, the master file, and one or more user files using a pre-trained Large Language Model.
[0018] It is another objective of the present disclosure to provide a system enabling skipping of selected portions of a digital content. The system includes an intelligent content customization system communicatively coupled to one or more of a content database associated with a content viewing platform and a user device. Further, the intelligent content customization system is configured to access the digital content from the content database. The intelligent content customization system further extracts textual and audio information from the digital content by an audio analytics subsystem and extracts visual information from the digital content by a video analytics subsystem. The textual information includes one or more transcripts associated with an audio of the digital content, wherein the visual information comprises one or more visual parameters associated with a video of the digital content, and wherein the audio information comprises diarized audio. The intelligent content customization system identifies one or more textual cues based on the extracted transcripts, one or more visual cues based on the extracted visual parameters and one or more audio cues based on the diarized audio by a large language model subsystem using a large language model. One or more of the textual cues, visual cues and audio cues are indicative of one or more selected portions in the content, wherein the one or more selected portions correspond to non-preferred portions in the digital content.
[0019] The intelligent content customization system determines a start time corresponding to each of the one or more selected portions of the content based on one or more timestamps in the digital content that correspond to an onset of one or more of the identified textual cues, visual cues and audio cues using the large language model. Each of the one or more selected portions comprises a corresponding end time determined by identifying a subsequent frame comprising one or more frame properties that are different from the frame properties associated with a frame corresponding to the determined start time. Further, the intelligent content customization system includes a storage subsystem configured to store a master file that stores one or more of the textual cues, audio cues and visual cues indicating the one or more selected portions in the content, along with the start time and end time corresponding to each of the one or more selected portions identified by the large language model.
[0020] Further, the user device is configured to download the master file and the digital content. The user device displays a widget at a graphical user interface of the user device at the start time of each of the one or more selected portions of the digital content during playback of the digital content. The intelligent content customization system skips the one or more selected portions upon receiving an activation input corresponding to the displayed widget at the graphical user interface.
[0021] Further, the intelligent content customization system is one of a standalone system communicating with the content viewing platform and the user device via a communication link and a system integrated into one of the content viewing platform and the user device.

BRIEF DESCRIPTION OF DRAWINGS

[0022] These and other features, aspects, and advantages of the claimed subject matter will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
[0023] FIG. 1 illustrates a block diagram depicting an exemplary system for enabling skipping of selected portions while consuming a digital content, in accordance with aspects of the present disclosure;
[0024] FIG. 2 illustrates a block diagram depicting an exemplary intelligent content customization system that enables skipping of unwanted portions in the digital content, in accordance with aspects of the present disclosure;
[0025] FIGS. 3A-B illustrate a flowchart depicting an exemplary method for enabling skipping of unwanted portions while consuming a digital content, in accordance with aspects of the present disclosure;
[0026] FIG. 4 illustrates a flowchart depicting an exemplary method for integrating a skip widget into the digital content when the digital content is being played by a media player, in accordance with aspects of the present disclosure;
[0027] FIG. 5 illustrates an exemplary user interface displayed on a user device depicting the skip widget of FIG. 4 when the digital content is being played by the media player, in accordance with aspects of the present disclosure; and
[0028] FIG. 6 shows a flowchart depicting a method of training a large language Model (LLM) subsystem to identify relevant audio and textual cues indicating unwanted portions in the digital content, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION

[0029] The following description presents an exemplary system and a method for enabling skipping of selected portions of a digital content consumed by a user using a respective user device. The selected portions may pertain to unwanted or non-preferred portions of the digital content. Particularly, embodiments described herein disclose an intelligent content customization system configured to enable skipping of non-preferred portions that accompany preferred portions in the digital content. The intelligent content customization system enables skipping of the non-preferred portions in the digital content by identifying relevant audio, textual and visual cues that indicate non-preferred or unwanted portions in the content. Further, the intelligent content customization system provides users with an option or a control feature, which allows skipping past the unwanted portions. Activation of the control feature allows the users to continue viewing the portions that the user prefers without hindering the continuity or flow of a storyline offered by the content, which improves users’ content viewing experience.
[0030] As used herein, the term "content" may refer to all kinds of content that a content viewer can consume at any given time or device on any content delivery platform. In one or more embodiments of this disclosure, content particularly embodies educational videos, training videos, audio and/or video recordings of classroom lectures, tutorials and panel discussions, among others. Further, as used herein, the terms "unwanted" or "non-preferred" portions may refer to portions or segments of the content that may not be relevant to a storyline or theme of the content. Additionally, as used herein, the terms unwanted or non-preferred portions may also refer to portions or sections of the content that may be relevant to the storyline or theme of the content but may not be preferred for consumption by one or more viewers. Examples of selected, unwanted, or non-preferred portions that may not be relevant to the storyline or theme of content such as movies, series, news, sports, documentaries and reality shows, among others, include, but are not limited to, ad breaks, series or movie intros and outros, episode recaps, opening and closing credits, songs, and silence. Further, examples of selected, unwanted, or non-preferred portions include, but are not limited to, judges' speech in reality shows, conversations between show hosts, and question and answer (Q&A) sessions in conference proceedings or educational videos, among others. It may further be noted that the terms "selected," "unwanted," and "non-preferred" are used interchangeably in one or more embodiments of this disclosure.
[0031] Conventional content skipping approaches provide users with features such as thumbs down, dislike, ratings, skip, and the like to skip past the unwanted sections of digital content such as movies and series. Users’ interaction with these features on the content delivery platform allow these platforms to understand user preferences, and thereby recommend more personalized or customized content. The conventional content skipping methods, however, are designed for skipping only certain predefined selected portions, such as, ad breaks, episode intros and recaps, which typically occur at predefined times within the content.
[0032] Moreover, the conventional content skipping methods are only provided with specific types of content such as pre-recorded movies, series, documentaries, reality shows or sport shows. These portions can be tagged with various types of pre-generated tags based on the predefined times at which they occur in the content for easy identification by a human or a machine. Based on the tags, a skip feature can be incorporated at appropriate points in time in the content to enable skipping of the unwanted portions. Subsequently, when a user selects a content to be played on the platform, the skip feature can be displayed at appropriate times during playback of the content. The user can choose to activate the skip feature to skip non-preferred portions while viewing the content.
[0033] However, non-preferred portions of content may not always be of a predefined format and may differ for different types of content and for different users. For example, content such as educational videos may include different types or formats of unwanted and non-preferred portions. The different formats of unwanted portions may include, for example, introductions, Q&A sessions, breaks, general conversations among the participants, silence, and dead time. Non-preferred portions associated with educational content, such as topics a user is already familiar with and thus skips, may not be easily distinguishable from the preferred portions of the content, and therefore may not be tagged with pre-generated tags. Attempting to customize conventional content skipping methods to tag such educational videos spanning a few hours, and atypical content other than movies, series, documentaries, reality and sport shows, involves a significant amount of cost and effort, and may still be error-prone.
[0034] The present system and method mitigate the aforementioned issues with conventional content skipping approaches by accurately identifying untagged unwanted and non-preferred portions in content, which are not easily distinguishable from the preferred portions of the content. In particular, the present disclosure describes an intelligent content customization system that utilizes audio and video analytics, for example using a pre-trained large language model (LLM), to identify textual, visual and audio cues that indicate unwanted or non-preferred portions in content. Once the unwanted and non-preferred portions are identified, the intelligent content customization system determines a start time and an end time corresponding to the unwanted and non-preferred portions in the recorded content based on the audio, visual and textual cues and one or more frame characteristics.
[0035] Additionally, the intelligent content customization system enables integration of a widget in a graphical user interface (GUI) of a user device to be displayed at the start time corresponding to the unwanted and non-preferred portions of the digital content when the digital content is played on the user device. When the widget is activated by the user, the unwanted and non-preferred portions of the content are skipped during playback of the content. The intelligent content customization system and associated method disclosed in the present disclosure, thus, eliminate the need to fast forward a content or make multiple adjustments to consume the preferred portions of the content, which improves users’ content consuming experience. An embodiment of the present intelligent content customization system is described in greater detail with reference to FIG. 1.
[0036] FIG. 1 shows a block diagram depicting an exemplary intelligent content customization system (102) for enabling skipping of selected portions of digital content such as audio, video, visual, or mixed media content. The selected portions may be non-preferred or unwanted portions of the digital content. In one embodiment, the intelligent content customization system (102) is communicatively coupled to a content delivery platform (104) and a user device (106) via a communication link (108) to enable access to the digital content. Examples of the communication link (108) include a satellite-based communications system, an over-the-top (OTT) system, and the internet, among other generally available communication systems. The content viewing platform (104) may correspond to a server owned by a content distributor such as an over-the-top (OTT) media content provider or a video-on-demand (VOD) service provider. Some non-limiting examples of the content viewing platforms (104) include YouTube, Netflix, Prime, Hulu, Hotstar, Twitch, Discord, Facebook, Instagram, Spotify, Udemy, Coursera, and similar platforms. Further, in some embodiments, the content viewing platform (104) may correspond to a platform or server owned by an educational institution, such as a university, that imparts education and training to users via an online and/or an offline mode. A user may select any content from the content viewing platform (104), such as YouTube, using his or her user device (106), such as a smartphone, a laptop or a desktop, and play the selected content.
[0037] To that end, the content viewing platform (104) may include a content database (shown in FIG. 2) that stores a variety of media content for selection and consumption. For example, a selected digital content may correspond to a recorded lecture delivered by an educator using a PowerPoint presentation comprising multiple slides. In an embodiment, the content viewing platform (104) is configured to transform and package the digital content into a format suitable for viewing across multiple platforms on different types of user devices having different properties, including screen sizes, operating systems, compatible bit rates, download speeds and resolutions. The intelligent content customization system (102) is configured to identify and share the associated features of the digital content, such as its size and format, among others, with the media player (103) and the user device (106) to allow for seamless playback of the digital content at desired bit rates and resolutions.
[0038] In some embodiments, the intelligent content customization system (102) may be integrated as a part of the content viewing platform (104). Alternatively, the intelligent content customization system (102) may be integrated as a part of the user device (106). In certain other embodiments, the intelligent content customization system (102) may be implemented as a standalone system remotely communicating with the content viewing platform (104) and the user device (106) via the communication link (108). In some embodiments, the user device (106) may include a media player (103) that plays the content selected by the user from the content viewing platform (104). The media player (103) may alternatively be integrated as a part of the content viewing platform (104). In certain embodiments, the media player (103) may be a standalone unit remotely communicating with one or more content viewing platforms, one or more user devices and the intelligent content customization system (102) via the communication link (108).
[0039] Unlike conventional content skipping approaches that employ tags for skipping predefined unwanted portions of the content during playback, the intelligent content customization system (102) is configured to first extract the audio, visual and textual information associated with the content to identify unwanted and non-preferred portions of the content. Subsequently, the extracted audio, visual and textual information is processed for identifying the presence of audio, visual and textual cues that may indicate unwanted or non-preferred portions of the content using one or more LLMs. The intelligent content customization system (102) uses one or more LLMs to identify, and in turn, skip selected or non-preferred sections of content, which may not possess distinguishable video properties or tags, and that occur at random times in the content.
[0040] In one embodiment, the intelligent content customization system (102) is configured to access recorded content available in one or more content databases of the content viewing platform (104). Particularly, the intelligent content customization system (102) is configured to request permission from the content viewing platform (104) to access selected content. The content viewing platform (104) may provide permission and authorization to the intelligent content customization system (102) based on certain information that identifies the intelligent content customization system (102) as a verified requestor. Post verification, the intelligent content customization system (102) may be listed as an authorized system by the content viewing platform (104).
[0041] Subsequently, upon requesting a digital content from the content viewing platform (104), the intelligent content customization system (102) receives a master file (shown in FIG. 2) associated with the digital content from the content viewing platform (104). When a user selects the digital content to play on the user device (106), the media player (103) residing on the user device (106) downloads the master file associated with the digital content along with the digital content to the user device (106).
[0042] In particular, the downloaded digital content may be temporarily stored, with metadata associated with the digital content, in a storage subsystem (shown in FIG. 2) of the intelligent content customization system (102). Subsequently, the intelligent content customization system (102) may process the digital content to extract at least a transcript of the associated audio by employing one or more audio-to-text conversion techniques using well-known transcription tools, including but not limited to, Otter.AI, Rev, Sonix, Trint and WhisperTranscribe, and extract one or more visual parameters by applying optical character recognition (OCR) to an associated video. Additionally, the intelligent content customization system (102) may process the digital content to diarize the audio by employing certain known techniques or algorithms, including but not limited to, Deep Neural Networks (DNN) and Hidden Markov Models (HMM), for segmenting the audio. The intelligent content customization system (102) also identifies voices or speech of individual participating speakers in the audio using one or more available diarization libraries or tools such as Nvidia NeMo, PyAnnote, Kaldi, AssemblyAI, and SpeechBrain. The intelligent content customization system (102) then identifies one or more relevant textual cues from the transcript, and visual cues obtained by applying the OCR technique to the videos, based on a pre-trained AI model or an LLM. Further, the intelligent content customization system (102) also obtains one or more audio cues from the diarized audio using the pre-trained AI model or the LLM.
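By way of a non-limiting illustration, the extraction stage described above may be sketched in Python using two tools of the kind named, Whisper (here via the openai-whisper package) for transcript extraction and pyannote.audio for diarization; the file name, model choices and record layout are editorial assumptions, not part of the disclosure:

    import whisper
    from pyannote.audio import Pipeline

    AUDIO_FILE = "lecture_audio.wav"  # hypothetical audio channel extracted from the content

    # Transcript extraction (218): timestamped text segments from the audio.
    asr_model = whisper.load_model("base")
    transcript = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        for seg in asr_model.transcribe(AUDIO_FILE)["segments"]
    ]

    # Diarized audio (228): speaker-labelled time segments (who speaks when).
    # The pretrained pipeline typically requires a Hugging Face access token.
    diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    diarized_audio = [
        {"start": turn.start, "end": turn.end, "speaker": speaker}
        for turn, _, speaker in diarizer(AUDIO_FILE).itertracks(yield_label=True)
    ]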
[0043] In one embodiment, the LLM or the pre-trained AI model is trained and/or fine-tuned to identify the various textual, visual and audio cues that are indicative of one or more non-preferred portions of the digital content. For example, the LLM or the pre-trained AI model may be trained and/or fine-tuned to identify the words "Introduction" and "Break" as textual cues indicative of a non-preferred portion of the digital content. Similarly, the visual cues indicative of a non-preferred portion of the digital content may correspond to "Agenda" and a blank screen. Furthermore, the audio cues indicative of a non-preferred portion of the digital content may correspond to silence. In certain embodiments, the LLM or the pre-trained AI model is trained and/or fine-tuned to identify different types of textual, visual and/or audio cues to identify different types of non-preferred portions of the digital content as per specific user preferences. For example, the LLM or the pre-trained AI model may be trained and/or fine-tuned to identify textual, visual and/or audio cues to selectively identify portions, such as a Q&A session in a digital content such as a panel discussion, that may be relevant to the theme of the digital content but may not be preferred for consumption by one or more viewers. Subsequently, the intelligent content customization system (102) determines a start time and an end time corresponding to each of the one or more unwanted or non-preferred portions in the digital content based on the identified textual, visual and/or audio cues. Subsequently, the intelligent content customization system (102) transmits the digital content, along with the start time, end time, identified textual, audio and visual cues and metadata associated with the digital content, to the media player (103).
[0044] The intelligent content customization system (102) may further configure the media player (103), via one or more suitable control instructions, to generate and integrate a skip widget at the appropriate start times corresponding to the unwanted portions when the digital content is being played by the media player (103). When a user plays the digital content in the media player (103), the media player (103) displays the widget at the appropriate start time, which, upon activation, precisely seeks the digital content to the identified end time of the non-preferred portion, thereby skipping the unwanted or non-preferred portions of the content. An embodiment depicting certain exemplary components of the intelligent content customization system (102) that enable intelligent skipping of unwanted or non-preferred portions of media content streamed from the content viewing platform (104) to the user device (106) using the media player (103) is described in greater detail with reference to FIG. 2.
[0045] FIG. 2 illustrates a block diagram depicting an embodiment of the exemplary intelligent content customization system (102) of FIG. 1 that enables skipping of unwanted portions in the digital content. In one embodiment, the intelligent content customization system (102), and the associated functions performed by the intelligent content customization system (102), may be implemented by suitable code on a processor-based system, such as a general-purpose or a special-purpose computer. Accordingly, the intelligent content customization system (102) may include one or more general-purpose processors, specialized processors, graphical processing units, microprocessors, programmable logic arrays, field-programmable gate arrays, integrated circuits, systems on chips, and/or other suitable computing devices.
[0046] In certain embodiments, the intelligent content customization system (102) may include one or more sub-components including, but not limited to, an audio analysis subsystem (202), a video analysis subsystem (206), a storage subsystem (208), an LLM subsystem (234), a training subsystem (214) and a widget generating subsystem (230). In an embodiment, each of the one or more sub-components of the intelligent content customization system (102) may be implemented as independent microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or a separate hardware device that performs designated functionality based on commands from the intelligent content customization system (102). Among other capabilities, the intelligent content customization system (102) may be configured to fetch and execute computer-readable instructions stored in the storage subsystem (208).
[0047] In an embodiment, the storage subsystem (208) may store various forms of data, including, but not limited to, the extracted transcripts (218), diarized audio (228) and OCR data (238) obtained from processing the selected digital content. Additionally, in one or more embodiments, the storage subsystem (208) may also store the master file (220). The master file (220) may include data generated while processing the digital content to identify the audio, visual and textual cues, and thereby determine the start and end times associated with the unwanted portions. The master file (220) may also include various information associated with the digital content, such as information corresponding to the frames of the content, bit rate, format of associated videos and associated metadata. The master file (220) may further include one or more categories under which the unwanted portions are grouped. In an example, the categories under which the unwanted or non-preferred portions may be grouped include "Q&A", "Introduction" and "Break."
[0048] As previously noted, the intelligent content customization system (102) may access digital content of different genres stored in a content database (204) associated with the content viewing platform (104). Examples of genres may include movies, series, sports, trainings, lectures and documentaries, among others. Additionally, the content database (204) stores the digital content in one or more designated formats and one or more bitrates. Examples of the designated formats may include MPEG-4 part 14 format, MKV format, and AVI format. Upon receiving a request for the digital content from the user device (106), the content viewing platform (104) transmits a copy of the requested digital content and associated master file (220) stored in the content database (204) to the intelligent content customization system (102). Particularly, the storage subsystem (208) associated with the intelligent content customization system (102) stores the digital content and associated master file (220) received from the content viewing platform (104) for further processing by the audio analytics subsystem (202) and the video analytics subsystem (206).
[0049] Subsequently, in certain embodiments, the intelligent content customization system (102) is configured to process the digital content received from the content viewing platform (104) to extract an associated video channel and an audio channel, yielding the associated video and audio content, respectively. The extracted audio is then passed to the audio analysis subsystem (202) to extract transcripts and information corresponding to one or more participating speakers whose voices may be present in the audio. To that end, in one embodiment, the audio analysis subsystem (202) includes one or more sub-components including, but not limited to, a transcript extraction subsystem (212) and a diarization subsystem (222). The audio analysis subsystem (202) receives the audio extracted from the digital content and shares it with the transcript extraction subsystem (212) and the diarization subsystem (222), for example, for simultaneous processing. The transcript extraction subsystem (212) may include one or more audio-to-text conversion units configured to identify and analyze audio in various languages and generate transcripts in a desired language.
[0050] Further, the diarization subsystem (222) is configured to identify the participating speakers in the audio by segmenting the audio into speech and non-speech segments, and subsequently detect persons speaking or points where a speaker change occurs. Upon identifying segments corresponding to a particular participating speaker, the diarization subsystem (222) groups these segments into speaker homogeneous clusters. In an embodiment, the speech segments and the speaker information associated with each segment of the audio obtained as output from the diarization subsystem (222) may be stored as diarized audio (228) within the storage subsystem (208). Subsequently, the extracted transcripts (218) and the diarized audio (228) are provided as inputs to the LLM subsystem (234) to identify the audio and textual cues in certain sections of the content that may indicate that these sections include unwanted or non-preferred portions.
[0051] Similar to the extracted audio, in certain embodiments, the extracted video is passed to the video analysis subsystem (206) to obtain machine-encoded texts from image frames present in the extracted video. To that end, the video analysis subsystem (206) includes one or more sub-components including, but not limited to, an OCR subsystem (216). The extracted video is processed by the video analysis subsystem (206) to obtain visual information from the frames of the extracted video. The visual information may include, for example, a page number or a slide number and other textual parameters. The visual information may be processed by the OCR subsystem (216) to generate machine encoded texts. The texts obtained as output from the OCR subsystem (216) may be stored as OCR data (238) within the storage subsystem (208). For example, the texts obtained as output from the OCR subsystem (216) may include the slide or page numbers, table of contents, other information associated with the topic of lecture, conclusion, and so on. The OCR data (238) may be provided as input to the LLM subsystem (234) to identify relevant visual cues in certain sections of the content that may indicate that these sections include unwanted portions or non-preferred portions identified based on user preferences.
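A minimal sketch of the OCR pass over the extracted video is shown below, assuming OpenCV for frame access and pytesseract as the OCR engine; the disclosure does not mandate a specific engine, and the one-frame-per-second sampling rate is an illustrative choice:

    import cv2
    import pytesseract

    def extract_ocr_data(video_path):
        # Sample roughly one frame per second and OCR each sampled frame;
        # each non-empty result becomes one OCR data (238) record.
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        ocr_data, frame_idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % int(fps) == 0:
                text = pytesseract.image_to_string(frame).strip()
                if text:
                    ocr_data.append({"timestamp": frame_idx / fps, "text": text})
            frame_idx += 1
        cap.release()
        return ocr_data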
[0052] In one or more embodiments, the LLM subsystem (234) receives the extracted transcripts (218), diarized audio (228) and OCR data (238), along with the master file (220) associated with the digital content, and generates a list of relevant audio cues based on the diarized audio (228), textual cues based on the extracted transcripts (218) and visual cues based on the OCR data (238). To that end, the LLM subsystem (234) uses one or more LLMs pre-trained using a large and diverse dataset, such as videos and audios including various formats of unwanted or non-preferred portions, available in the training dataset (224). The videos and audios in this training dataset (224) may be collected from various sources and are cleaned and standardized for further processing. The training of the LLM subsystem (234) is carried out by the training subsystem (214) and will be explained in greater detail with reference to FIG. 6. Once trained, the LLM subsystem (234) aids the intelligent content customization system (102) in identifying unwanted portions of the content by identifying relevant textual cues, audio cues and visual cues that indicate the unwanted portions. To that end, the LLM subsystem (234) uses one or more LLMs such as Claude 3.5, Macaw-LLM, SALMONN, and Meta ImageBind that are trained and/or fine-tuned by the training subsystem (214) to identify the textual, visual and/or audio cues using the training dataset (224).
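Purely for illustration, the LLM subsystem (234) might be invoked along the following lines; call_llm is a placeholder for whichever model client is used, and the prompt wording and JSON contract are editorial assumptions:

    import json

    def identify_cues(transcript, ocr_data, diarized_audio, call_llm):
        # Ask a pre-trained LLM to label cues indicative of non-preferred
        # portions, returning a list of {type, category, onset} records.
        prompt = (
            "You are labelling non-preferred portions of an educational video.\n"
            "Given the timestamped transcript, on-screen OCR text and speaker\n"
            "turns below, return a JSON list of cues, each with: type\n"
            "(textual|visual|audio), category (Introduction|Q&A|Break) and\n"
            "onset (seconds).\n\n"
            f"TRANSCRIPT: {json.dumps(transcript)}\n"
            f"OCR: {json.dumps(ocr_data)}\n"
            f"SPEAKER TURNS: {json.dumps(diarized_audio)}\n"
        )
        return json.loads(call_llm(prompt))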
[0053] In one embodiment, the relevant textual cues identified from the extracted transcripts (218) by the LLM subsystem (234) may include keywords and phrases related to chapter introduction. Examples of such keywords and phrases that can refer to an unwanted portion such as an “introduction” may include phrases such as, “Hello class, let’s start…” and “Let us wait for others to join”, among others. Further, visual cues indicative of the unwanted introduction portion may include machine-encoded texts that can be obtained from an image of the whiteboard, the blackboard, and/or the PowerPoint presentation that are used as materials for teaching the topic to an audience. Such cues may include the name of the topic or a chapter, table of contents, page number, slide numbers and other information associated with the topic, among other visual cues.
[0054] Further, the audio cues identified by the LLM subsystem (234) may include segments of the audio information where the diarized audio indicates the presence of more than one individual speaking, each individual referred to as a speaker throughout this disclosure. For example, if the diarized audio indicates more than one speaker for a particular section of the audio associated with the content for longer than a threshold duration of time, that section may be identified as a Q&A portion. Further, if the diarized audio indicates more than one speaker for a particular section of the audio within the threshold duration of time, that section may be identified as a preferred portion. An example threshold duration of time is 30 seconds. This way, the intelligent content customization system (102) is able to differentiate between actual Q&A sessions and portions that may only last a few seconds where, for example, the educator seeks some confirmation from other participants, such as students, related to a specified question. Thus, the intelligent content customization system (102) is configured to prevent false positives.
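A sketch of this Q&A rule follows, assuming the diarized audio is available as possibly overlapping speaker turns in the record layout sketched earlier; the 30-second threshold follows the example above:

    QA_THRESHOLD_S = 30.0  # example threshold from the description

    def find_qa_sections(diarized_audio, threshold=QA_THRESHOLD_S):
        # Flag stretches where more than one speaker is active for longer
        # than the threshold; shorter overlaps are treated as preferred.
        events = []
        for seg in diarized_audio:
            events.append((seg["start"], +1))
            events.append((seg["end"], -1))
        events.sort()  # ends sort before starts at equal timestamps

        qa_sections, active, multi_start = [], 0, None
        for t, delta in events:
            active += delta
            if active > 1 and multi_start is None:
                multi_start = t          # a second speaker became active
            elif active <= 1 and multi_start is not None:
                if t - multi_start >= threshold:
                    qa_sections.append({"start": multi_start, "end": t})
                multi_start = None       # short overlap: not a Q&A session
        return qa_sections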
[0055] Additionally, the LLM subsystem (234) also determines the start and end times of the true unwanted portions based on start timestamps and end timestamps corresponding to the identified cues that indicate unwanted or non-preferred portions of the content. In one or more embodiments, the intelligent content customization system (102) determines the start time of the one or more unwanted or non-preferred portions of the content based at least on a timestamp in the digital content that corresponds to an instance of onset of at least one cue, that is, a textual cue, a visual cue or an audio cue. For certain unwanted portions, the intelligent content customization system (102) determines the start time of the one or more non-preferred portions based on more than one cue and/or more than one type of cue. Additionally, the start times of certain unwanted portions can be determined based on one or more frame properties, including, but not limited to, a change or transition of frames and a change in frame properties, among others.
[0056] In an example, the intelligent content customization system (102) may determine the start time of an introduction session in an educational video based on the timestamp when phrases such as "Hello class, let's start…", "Today we will discuss" and "Let us wait for others to join" appear in the content. The intelligent content customization system (102) then tags the timestamp corresponding to the instance of onset of such phrases as the start time of a relevant textual cue that indicates an introduction session in the lecture video. In another example, the intelligent content customization system (102) may determine the start time of an introduction session in the video based on a timestamp when a current frame displays the name of a topic or a chapter or a table of contents. The name of a topic and the table of contents can be identified from the visual information obtained from the OCR subsystem (216). Additionally, the intelligent content customization system (102) also determines the start time of the introduction session based on visual cues such as slide or page numbers associated with the slides. In an example, if the intelligent content customization system (102) infers that the slide or page number in the currently displayed frame is 1 or 2, then the timestamp corresponding to the onset of the currently displayed frame is determined as the start time of the introduction session.
[0057] In an embodiment, the intelligent content customization system (102) may determine the start time of an unwanted portion such as a Q&A session of an educational video based on at least an audio cue. In an example, the intelligent content customization system (102) notes a point in the timeline of the audio associated with the content where the speech changes from one speaker to more than one speaker as the start timestamp of a relevant audio cue. The intelligent content customization system (102) may utilize other audio indicators, such as audio intensity parameters, audio histogram parameters and audio frequency parameters, in addition to the diarized audio (228), that may indicate a change in speaker by identifying the number of participating speakers.
[0058] In another embodiment, the intelligent content customization system (102) may determine the start time of a break session of the educational video based on at least an audio cue. In an example, the intelligent content customization system (102) notes the start timestamp of a relevant audio cue wherein the diarized audio (228) indicates an absence of speech, or no audio at all, for a predefined duration of time. An example predefined duration of time is 60 seconds. Additionally, the intelligent content customization system (102) may determine the start time of a break session based on at least a visual cue, for example, a timer appearing on a currently displayed frame showing the duration remaining for the break session to end. The timestamp corresponding to the first appearance of the timer on the current frame may be determined as the start time for the break session.
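The break-session audio cue may be sketched similarly: any gap with no speech for at least the predefined duration (60 seconds in the example above) is tagged as a break, with the gap's start used as the break's start time:

    SILENCE_THRESHOLD_S = 60.0  # example predefined duration

    def find_break_sections(diarized_audio, threshold=SILENCE_THRESHOLD_S):
        # Flag gaps between speech segments that last at least `threshold`.
        turns = sorted(diarized_audio, key=lambda s: s["start"])
        breaks, last_speech_end = [], 0.0
        for seg in turns:
            if seg["start"] - last_speech_end >= threshold:
                breaks.append({"start": last_speech_end, "end": seg["start"]})
            last_speech_end = max(last_speech_end, seg["end"])
        return breaks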
[0059] As previously noted, for certain unwanted portions, the intelligent content customization system (102) determines corresponding start times based on more than one cue. In such scenarios, the intelligent content customization system (102) determines whether the timestamps corresponding to the onset of two or more of the cues coincide or match. If the timestamps match, then the coinciding timestamp corresponding to the onset of the two or more cues is considered the start time of the one or more unwanted portions. However, in certain other scenarios, the start time of one cue indicating an unwanted portion may not coincide with the start time of another cue indicating the same unwanted portion. For example, the start time of the textual cue indicating an unwanted portion may not coincide with the start time of the audio cue indicating the unwanted portion. Likewise, in certain scenarios, the start time of the audio cue indicating an unwanted portion may not coincide with the start time of the visual cue indicating the unwanted portion. Such a mismatch in the start time may occur for various reasons, including the type of device on which the content is played and the type of application, such as the media player (103), used to play the content. Such a mismatch may also occur when the intelligent content customization system (102) and the content viewing platform (104) have compatibility issues. Mismatches in the start times may also occur as a result of transforming the digital content at the content viewing platform (104) to make it suitable for viewing across multiple platforms on different devices having different screen sizes, bit rates, download speeds and resolutions.
[0060] Consequently, the onsets of two or more cues may exhibit lags relative to one another, resulting in the cues not appearing at matching points in the timeline of the digital content when the digital content is played by the media player (103). In such scenarios, the intelligent content customization system (102) determines that the timestamps corresponding to the onset of two or more of the cues do not coincide or match. Further, the intelligent content customization system (102) determines the lag associated with the timestamp corresponding to the onset of each of the cues. If the lag between the timestamps of two or more cues is determined to be within a predefined range, then the intelligent content customization system (102) considers the timestamp corresponding to the earliest cue, or the first identified cue, as the start time of the unwanted portion. The predefined range, for example, may be selected to be within 2 seconds to 10 seconds. In one embodiment, the timestamp of onset of a textual cue may not match the timestamp of onset of an audio cue. Likewise, in certain scenarios, the timestamp of onset of a textual cue may not match the timestamp of onset of a visual cue. If an audio cue appears between 5 seconds and 25 seconds of the digital content but the visual cue appears between 7 seconds and 30 seconds, then the start time corresponding to the unwanted portion is identified as 5 seconds. In this way, the intelligent content customization system (102) uses more than one cue, or a combination of cues, to verify that the correct unwanted portions are identified and false positives are avoided, thereby improving the accuracy of identifying an unwanted portion in the digital content.
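The start-time reconciliation described in the preceding two paragraphs may be sketched as follows; collapsing the coinciding case and the lagged case into a single tolerance, with the upper end of the example 2-to-10-second range used as that tolerance, is a simplification assumed here:

    MAX_LAG_S = 10.0  # upper end of the example predefined range

    def reconcile_start_time(cue_onsets, max_lag=MAX_LAG_S):
        # cue_onsets: onset timestamps (seconds) of the textual, visual and
        # audio cues believed to mark the same selected portion. Returns the
        # start time, or None if the onsets disagree by more than max_lag.
        earliest, latest = min(cue_onsets), max(cue_onsets)
        if latest - earliest <= max_lag:  # onsets coincide or lag within range
            return earliest               # the earliest identified cue wins
        return None                       # too far apart to be the same portion

For the audio/visual example above, reconcile_start_time([5.0, 7.0]) yields 5 seconds.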
[0061] The intelligent content customization system (102) further determines the end times of the unwanted portions at least by comparing the visual and audio parameters of a currently displayed frame, corresponding to the start time of at least one non-preferred portion, with those of each subsequent frame in the timeline of the digital content until a difference is identified. Specifically, the intelligent content customization system (102) identifies a subsequent frame whose visual and audio parameters differ from those of the currently displayed frame. The intelligent content customization system (102) then determines a timestamp that coincides with the transition to the identified subsequent frame and tags the determined timestamp as the end time of the non-preferred portion of the digital content. In addition to frame properties, the end time can also be determined based on one or more of the identified cues.
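A minimal sketch of this frame-property comparison is given below, using OpenCV and a grayscale-histogram correlation as the illustrative frame property; the similarity floor, histogram size and visual-only comparison (the description also contemplates audio parameters) are assumptions:

    import cv2

    def find_end_time(video_path, start_s, similarity_floor=0.90):
        # Scan forward from the frame at start_s until a frame's properties
        # differ from the start frame; return that transition's timestamp.
        cap = cv2.VideoCapture(video_path)
        cap.set(cv2.CAP_PROP_POS_MSEC, start_s * 1000.0)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0

        def histogram(frame):
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            return cv2.calcHist([gray], [0], None, [64], [0, 256])

        ok, start_frame = cap.read()
        if not ok:
            cap.release()
            return None
        reference = histogram(start_frame)

        end_time, frames_read = None, 1
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames_read += 1
            sim = cv2.compareHist(reference, histogram(frame), cv2.HISTCMP_CORREL)
            if sim < similarity_floor:  # frame properties differ: transition found
                end_time = start_s + frames_read / fps
                break
        cap.release()
        return end_time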
[0062] For example, for an introduction session, in addition to noting the timestamp corresponding to a frame transition, the end time is determined based on a timestamp that corresponds to when the transcript (218) indicates instances of resumption of discussion relevant to a topic of interest. Furthermore, the end time associated with the introduction session may correspond to a time when the OCR data (238) indicates that a slide in display includes texts relevant to a topic or a chapter related to the lecture or the training. Further, in case of a Q&A session, the intelligent content customization system (102) determines the end time based on an audio cue in addition to a corresponding frame transition. For example, the end time associated with the Q&A session corresponds to a timestamp when the diarized audio indicates a change from more than one speaker to one speaker. Further, in case of a break session, the intelligent content customization system (102) determines the end time of the break session based on a timestamp when the diarized audio indicates presence of a speaker or resumption of the audio after a pause.
[0063] In an embodiment, the intelligent content customization system (102) determines the end time for certain unwanted portions based on more than one cue, in addition to frame transitions, similar to the method for determining the start time. In certain scenarios, the end times of the unwanted portions indicated by the extracted transcripts (218) may not match the end times indicated by the diarized audio. Likewise, in certain scenarios, the end time indicated by the diarized audio and the end time indicated by the OCR data (238) may exhibit a mismatch. Consequently, one or more of the textual cues identified based on the extracted transcripts (218), the audio cues identified based on the diarized audio (228) and the visual cues identified based on the OCR data (238) may exhibit lags relative to one another, resulting in the cues not appearing at matching points in the timeline of the digital content when the digital content is played by the media player (103). In such scenarios, the intelligent content customization system (102) configures the media player (103) to identify the last identified cue as the end time of the unwanted portion. For example, if an audio cue indicates that an unwanted portion appears between 5 seconds and 25 seconds of the digital content, but the visual cue indicates that the same unwanted portion appears between 7 seconds and 30 seconds, then the end time corresponding to the unwanted portion is identified as 30 seconds.
[0064] Subsequently, the identified audio, visual and textual cues and the start and end times associated with the unwanted portions are added to the master file (220), for example, by the LLM subsystem (234) in the intelligent content customization system (102). Additionally, the LLM subsystem (234) may also group the identified textual and audio cues under appropriate categories of unwanted portions for storage in one or more of the master file (220) and the storage subsystem (208). For example, relevant textual cues including texts such as “Hello class, let’s start…,” “Today we will discuss” and “Let us wait for others to join” may be stored under the category “Introduction”. Likewise, a segment of the diarized audio (228) indicating a change from one speaker to more than one speaker for longer than a threshold duration may be stored under a category marked “Q&A” in the master file (220).
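A minimal sketch of how such categorized cues and timings might be serialized into the master file (220) follows. The specification does not define a schema, so the JSON field names, the content identifier and the example values are assumptions (the Q&A timings mirror the example given later in paragraph [0076]).

```python
import json

master_file = {
    "content_id": "lecture-001",  # hypothetical identifier
    "metadata": {"title": "Sample lecture", "duration_s": 3600},
    "unwanted_portions": [
        {
            "category": "Introduction",
            "textual_cues": ["Hello class, let's start...",
                             "Today we will discuss",
                             "Let us wait for others to join"],
            "start_s": 0,
            "end_s": 95,
        },
        {
            "category": "Q&A",
            "audio_cues": ["speaker count changed from 1 to >1 "
                           "for longer than threshold"],
            "start_s": 1508,   # 25 min 08 s
            "end_s": 1874,     # 31 min 14 s
        },
    ],
}

with open("master_file.json", "w") as f:
    json.dump(master_file, f, indent=2)
```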
[0065] Subsequently, when the user device (106) requests the digital content, the media player (103) residing on the user device (106) downloads the modified master file (220) associated with the digital content. Additionally, the media player (103) may also receive user files including, for example, the associated user’s preferences, the user’s engagement with various content and user profiles created and managed by the content viewing platform (104). Upon downloading the modified master file (220) and the user files, the media player (103) has access to the identified audio, visual and textual cues, the start and end times associated with the unwanted portions, and the metadata of the digital content. The media player (103) uses the downloaded information to integrate a skip widget into the digital content and display the skip widget at the appropriate start times of the unwanted portions during playback.
[0066] In an embodiment, the intelligent content customization system (102) further includes a widget generating subsystem (230) configured to generate the skip or dislike widget and integrate the generated widget with the master file (220). In another embodiment, the media player (103) is configured to generate and integrate the widget with the master file (220) before displaying the widget at the appropriate times corresponding to the unwanted portions when the digital content is played on the user device (106). To that end, in certain embodiments, the media player (103) includes one or more sub-components including, but not limited to, a widget integrating subsystem (213), a video player subsystem (223) and a time manipulator (233). The media player (103), the widget integrating subsystem (213), the video player subsystem (223) and the time manipulator (233) may be implemented as independent microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or separate hardware devices.
[0067] Subsequently, when a user selects a digital content for viewing, the digital content is displayed using the video player subsystem (223) of the media player (103) on a GUI of the user device (106). When the content plays on the GUI of the user device (106), the skip widget is displayed at the determined start time of the unwanted portion. In certain embodiments, the skip widget may be displayed at the start times of only certain non-preferred portions of the digital content for certain users based on user preferences determined from received user files. For example, the skip widget may be displayed at the start time of a segment corresponding to judges’ speech in a reality show for a user who does not prefer such content, but will not be displayed during playback of the same reality show requested by certain other users who may prefer to watch the segment, as identified using their respective user files.
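The per-user display decision described above reduces to a simple preference check. The sketch below uses hypothetical dictionaries for the portion and the user file, since the specification does not fix their structure.

```python
def should_display_skip_widget(portion, user_file):
    """Decide whether the skip widget is shown for a given portion
    based on the non-preferred categories listed in the user file."""
    non_preferred = user_file.get("non_preferred_categories", [])
    return portion["category"] in non_preferred

# One user dislikes judges' speeches in a reality show; another does not.
portion = {"category": "judges_speech", "start_s": 600, "end_s": 780}
print(should_display_skip_widget(
    portion, {"non_preferred_categories": ["judges_speech"]}))  # True
print(should_display_skip_widget(
    portion, {"non_preferred_categories": []}))                 # False
```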
[0068] Once displayed on the GUI, the user can choose to activate or ignore the skip widget. When the user activates the skip widget, the media player (103) automatically skips to the end time of the unwanted portion and resumes playing the digital content from that point. In an embodiment, in the absence of any user activation input, the skip widget remains displayed until the end time of the non-preferred portion. In another example scenario, a user may activate the skip widget after a certain time has elapsed since the skip widget appeared on the GUI at the start time of the unwanted portion. The time manipulator (233) then skips the unwanted portion from the timestamp at which the user activation input is received until the determined end time and resumes playback. An exemplary method for skipping unwanted portions of a digital content using the intelligent content customization system (102) is described in greater detail with reference to FIGS. 3A-B.
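A minimal sketch of the time manipulator’s skip behaviour is given below. The `player` object with a `seek(seconds)` method, and the resume point of one second past the end time (matching the 31 min 15 s resume point in the later example), are assumptions for illustration.

```python
class TimeManipulator:
    """Sketch of the time manipulator (233): when the user activates the
    skip widget inside an unwanted portion, playback jumps from the
    activation timestamp to just past the portion's end time."""

    def __init__(self, player, unwanted_portions):
        self.player = player
        # List of (start_s, end_s) tuples read from the master file.
        self.portions = unwanted_portions

    def on_skip_activated(self, activation_s):
        for start_s, end_s in self.portions:
            if start_s <= activation_s <= end_s:
                self.player.seek(end_s + 1)      # resume just past the end
                return end_s + 1 - activation_s  # seconds skipped
        return 0  # activation fell outside any unwanted portion
```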
[0069] FIGS. 3A-B illustrate a flowchart depicting an exemplary method (300) for enabling skipping of unwanted portions while consuming a digital content. The order in which the exemplary method (300) is described is not intended to be construed as a limitation, and any number of the described blocks may be combined in any order to implement the exemplary method disclosed herein, or an equivalent alternative method. Additionally, certain blocks may be deleted from the exemplary method or augmented by additional blocks with added functionality without departing from the claimed scope of the subject matter described herein.
[0070] At step (302), the intelligent content customization system (102) accesses content from the content database (204). The intelligent content customization system (102) may momentarily store a copy of the accessed content in the storage subsystem (208) before separating the audio and video channels and processing the separated channels to extract the audio and the video from the content. Subsequently, the intelligent content customization system (102) shares the extracted audio with the audio analysis subsystem (202) and the extracted video with the video analysis subsystem (206).
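The channel-separation step can be sketched with the ffmpeg command-line tool, which is an assumption on our part; the specification does not name an extraction tool.

```python
import subprocess

def separate_channels(content_path, audio_out="audio.wav",
                      video_out="video.mp4"):
    """Separate the audio and video channels of the accessed content
    before routing them to the audio and video analysis subsystems."""
    # -vn drops the video stream, leaving PCM audio for ASR/diarization.
    subprocess.run(["ffmpeg", "-y", "-i", content_path, "-vn",
                    "-acodec", "pcm_s16le", audio_out], check=True)
    # -an drops the audio stream, leaving video for OCR/frame analysis.
    subprocess.run(["ffmpeg", "-y", "-i", content_path, "-an",
                    "-c:v", "copy", video_out], check=True)
    return audio_out, video_out
```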
[0071] At step (304), the video analysis subsystem (206) extracts OCR data (238) from the extracted video, while the audio analysis subsystem (202) extracts the transcripts and diarized audio from the extracted audio. In particular, the audio analysis subsystem (202) receives the extracted audio and sends it to the transcript extraction subsystem (212) and the diarization subsystem (222) for simultaneous processing. The transcripts of the extracted audio may be stored as extracted transcripts (218), and the speaker information associated with each segment of the audio, obtained as output from the diarization subsystem (222), may be stored as diarized audio (228) within the storage subsystem (208). Similarly, the extracted video is processed by the video analysis subsystem (206) to obtain visual parameters from visual information such as the image frames of the video. The visual parameters include machine-encoded texts obtained as output from the OCR subsystem (216) based on application of OCR techniques to the video/image frames. The extracted transcripts (218), the diarized audio (228) and the visual parameters, referred to as OCR data (238), may be provided as input to the LLM subsystem (234) to identify relevant cues in sections of the content that may include unwanted portions.
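A minimal sketch of the OCR extraction follows, sampling frames at a fixed interval and running them through the Tesseract engine via pytesseract. The engine choice and sampling interval are assumptions; the specification does not name a specific OCR technique.

```python
import cv2
import pytesseract

def extract_ocr_data(video_path, sample_interval_s=2.0):
    """Sample image frames from the extracted video and apply OCR to
    obtain machine-encoded texts, i.e., the OCR data (238)."""
    cap = cv2.VideoCapture(video_path)
    ocr_data = []
    t = 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV frames are BGR; convert to RGB before OCR.
        text = pytesseract.image_to_string(
            cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).strip()
        if text:
            ocr_data.append((t, text))  # (timestamp, recognized text)
        t += sample_interval_s
    cap.release()
    return ocr_data
```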
[0072] At step (306), the LLM subsystem (234) identifies the textual cues, visual cues and audio cues that indicate the unwanted portions in the content. The relevant textual cues are identified from the extracted transcripts (218) and may include keywords and phrases that fall within the meaning or definition of a topic or chapter introduction. The LLM subsystem (234) identifies the visual cues based on the extracted OCR data (238). For the audio cues, the LLM subsystem (234) first identifies one or more speakers based on the diarized audio (228) and then identifies the audio cues indicative of the unwanted portions based on the identified speakers. At step (308), the intelligent content customization system (102) determines the start time and the end time corresponding to one or more of the identified textual, audio and visual cues that are indicative of the start times and end times of the unwanted portions.
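One concrete audio-cue rule mentioned in the description is a change from one speaker to more than one speaker sustained for longer than a threshold. A hedged sketch is shown below; it assumes the diarized audio (228) is available as a chronological list of (start_s, end_s, speaker) turns, and the threshold and run-closing heuristic are illustrative.

```python
def find_qa_cues(diarized_segments, threshold_s=10.0, solo_s=30.0):
    """Flag Q&A-like audio cues: stretches where the diarized audio
    alternates between more than one speaker for longer than the
    threshold. A single speaker holding the floor for `solo_s`
    seconds ends the multi-speaker stretch."""
    cues, run_start, prev = [], None, None
    for start_s, end_s, speaker in diarized_segments:
        if prev and speaker != prev[2]:
            # Speaker changed: a multi-speaker stretch begins or continues.
            if run_start is None:
                run_start = start_s
        elif run_start is not None and end_s - start_s >= solo_s:
            # One speaker has held the floor long enough: close the run.
            if start_s - run_start >= threshold_s:
                cues.append((run_start, start_s))
            run_start = None
        prev = (start_s, end_s, speaker)
    if run_start is not None and prev and prev[1] - run_start >= threshold_s:
        cues.append((run_start, prev[1]))  # run extends to end of audio
    return cues
```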
[0073] At step (310), the intelligent content customization system (102) customizes the digital content to include a skip widget at the determined start time such that the skip widget appears on a GUI of a user device (106) when the digital content is played. The integration and provision of the skip widget is explained in greater detail with reference to FIG. 4. Further, at step (312), it is determined whether the skip widget is activated by the user. If the widget is not activated, then at step (314), the skip widget remains displayed until the determined end time corresponding to one or more of the identified textual, audio and visual cues is reached, after which it is removed from the GUI, and the content plays with the unwanted portion included. If the widget is activated, then at step (316), the unwanted portion is skipped from the corresponding start time up to the determined end time and the content resumes playing where the unwanted portion ends. An exemplary method of integrating the skip widget into the digital content to enable its display at the appropriate start time of an unwanted portion is illustrated with reference to FIG. 4.
[0074] Particularly, FIG. 4 illustrates a flowchart depicting an exemplary method (400) for integrating the skip widget into the digital content when the digital content is being consumed on a user device (106). In one embodiment, the steps of the method (400) are performed by the media player (103) associated with the user device (106). In certain other embodiments, the method (400) may be performed by the intelligent content customization system (102). In yet another embodiment, the method (400) may also be performed at the content viewing platform (104) based on appropriate contracts executed between the intelligent content customization system (102) and the content viewing platform (104).
[0075] At step (402), the media player (103) receives or downloads the master file (220) from the intelligent content customization system (102). As previously noted, the master file (220) includes information associated with the digital content, the relevant cues, and the metadata associated with the digital content. At step (404), the media player (103) identifies the start time and the end time corresponding to the one or more relevant cues indicating the unwanted portions of the digital content from the information received from the master file (220).
[0076] At step (406), the media player (103) displays the skip widget on the GUI of the user device (106) at the determined start time of the unwanted portion when the content is played on the user device (106). Subsequently, at step (408), the media player (103) receives the user activation input indicating the user’s engagement with the displayed skip widget. When the user activates the skip widget immediately after it is displayed at the determined start time of the unwanted portion, playback automatically skips from the activation timestamp until the end time of the unwanted portion and resumes right after the end time is reached. The activation timestamp corresponds to the point in time in the digital content at which the user activation input is received. As an example, digital content playing on the user device (106) may have an unwanted portion, such as a Q&A session, starting at 25 mins 08 seconds of the content and ending at 31 mins 14 seconds. The skip widget is displayed exactly at 25 mins 08 seconds of the content, or even a few milliseconds earlier. If the user activates the skip widget at 25 mins 08 seconds, the video player subsystem (223) skips the subsequent 6 mins and 6 seconds of the content and resumes playback from 31 mins and 15 seconds.
[0077] In another example scenario, a user may choose to activate the skip widget after a certain time has elapsed since the skip widget appeared on the GUI at the start time of the unwanted portion. Accordingly, at step (410), the video player subsystem (223) skips the unwanted portion from the time at which the user activation input is received until the identified end time. Continuing the previous example, where the Q&A session spans 25 mins 08 seconds to 31 mins 14 seconds, if the user activates the skip widget at any time later than 25 mins 08 seconds, say at 26 mins 08 seconds, the time manipulator (233) of the media player (103) configures the video player subsystem (223) to skip the remaining 5 mins and 6 seconds of the unwanted portion instead of 6 mins and 6 seconds and resume playback from 31 mins and 15 seconds. An example of the skip widget displayed at the determined start time on a graphical user interface of a user device (106) is illustrated in FIG. 5.
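The arithmetic in both activation scenarios can be verified directly; the following snippet reproduces the figures from the two examples above.

```python
def mmss(m, s):
    """Convert a minutes:seconds pair to seconds for readability."""
    return m * 60 + s

qa_start = mmss(25, 8)   # unwanted Q&A portion begins
qa_end = mmss(31, 14)    # and ends here; playback resumes at 31:15

for activation in (qa_start, mmss(26, 8)):
    skipped = qa_end - activation
    print(f"activated at {activation}s -> skip {skipped // 60} min "
          f"{skipped % 60} s, resume at {qa_end + 1}s")

# activated at 1508s -> skip 6 min 6 s, resume at 1875s
# activated at 1568s -> skip 5 min 6 s, resume at 1875s
```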
[0078] More specifically, FIG. 5 illustrates an exemplary GUI (500) of the user device (106) depicting the skip widget (502) displayed while the digital content is played on the user device (106) using the media player (103). In one embodiment, the digital content corresponds to a PowerPoint presentation being used by a lecturer to teach a topic to a number of students. As seen in FIG. 5, the PowerPoint presentation is at slide no. 2, which corresponds to a table of contents for the topic being taught. The audio transcripts on the right side of the GUI show text such as “Hello class…”, which indicates an introduction session for the topic. Further, the information shown on the slide of the PowerPoint presentation may be captured as images. The video analysis subsystem (206) may then process these images using the OCR subsystem (216) to determine machine-encoded texts that may indicate, for example, a slide number and a table of contents.
[0079] As can be seen from FIG. 5, the skip widget (502) is accurately displayed at the determined start time of the introduction portion of the content during playback. As previously noted, the introduction portion and other non-preferred portions are identified based on relevant audio, textual and visual cues by the trained LLM subsystem (234). An example of training the LLM subsystem (234) to identify the relevant cues from the digital content is described in detail with reference to FIG. 6.
[0080] FIG. 6 shows a flowchart depicting a method (600) of training the LLM subsystem (234) to identify relevant audio and textual cues indicating unwanted portions. The LLM subsystem (234) may use self-supervised or semi-supervised learning techniques for neural networks capable of general-purpose language generation and other natural language processing tasks, based on statistical relationships learned from the training dataset (224), which includes vast amounts of text. The LLM used by the LLM subsystem (234) may also be trained for text generation, a form of generative AI.
[0081] At step (602), the training subsystem (214) collects data for compiling the training dataset (224). The data may be collected from various sources, including but not limited to content or transcripts belonging to online classrooms, trainings, tutorials, panel discussions and interviews. Other relevant data sources may include transcripts obtained from content such as movies and series, as well as books, e-books, articles, web content, and images and videos of online or virtual classrooms. The training dataset (224) also includes educational video and audio content containing various formats of unwanted portions. It shall be noted that the data in the training dataset (224) may be cleaned and standardized to remove noise and irrelevant information.
[0082] Subsequently, at step (604), the training subsystem (214) trains the LLM using the training dataset (224) and fine-tunes the LLM using known techniques to accurately identify the cues. Fine-tuning may involve refinements such as modifications to the LLM’s architecture and prepares the LLM for specific use cases. This cycle is repeated until the LLM achieves a desired performance. Once the training and fine-tuning conclude, the LLM subsystem (234) deploys the fine-tuned LLM for run-time identification of relevant audio, visual and textual cues based on the extracted transcripts (218), diarized audio (228) and extracted OCR data (238) received from new incoming videos.
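The specification fine-tunes an LLM on the training dataset (224); as a lighter-weight stand-in, the sketch below fine-tunes a small transformer to classify transcript snippets into unwanted-portion categories using the Hugging Face libraries. The base checkpoint, labels and example texts are illustrative assumptions, not values from the specification.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["Preferred", "Introduction", "Q&A", "Break"]
data = Dataset.from_dict({
    "text": ["Hello class, let's start...",
             "Any questions before we move on?",
             "Let's take a ten minute break.",
             "Gradient descent updates the weights iteratively."],
    "label": [1, 2, 3, 0],  # tiny illustrative sample, not real data
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length"), batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cue-classifier", num_train_epochs=3),
    train_dataset=data,
)
trainer.train()  # repeated with feedback until desired performance
```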
[0083] At step (606), the LLM subsystem (234) receives incoming data at run-time in the form of videos containing unwanted portions. At step (608), the LLM subsystem (234) uses the LLM to identify the unwanted portions based on the training. The LLM subsystem (234) receives feedback on the identification results from a human operator, such as an engineer or supervisor responsible for generating, modifying and providing feedback to the LLM. The feedback may be used to further fine-tune the LLM, which may further improve the accuracy of identifying unwanted portions in the digital content.
[0084] At step (610), the LLM subsystem (234) labels and classifies the unwanted portions into one or more categories among ‘Introduction’, ‘Q&A’ and ‘Break’ based on the training and fine-tuning. The relevant audio, visual and textual cues, their start and end times and their respective categories, along with the metadata of the corresponding content, are stored in the master file (220). The master file (220) is then sent to the media player (103) for integrating the skip widget into the digital content at the appropriate start times of the identified unwanted portions.
[0085] As noted previously, conventional content skipping methods are designed for skipping predefined formats of unwanted portions such as ad breaks, series introductions and recaps, which typically occur at predefined points of time within the content. These unwanted portions are generally tagged with pre-generated tags for easy identification by a human or a machine. Based on the pre-generated tags and their predefined formats, the skip widget is integrated into the content to be displayed at appropriate points of time. However, non-preferred portions associated with certain content may not occur at predefined times within the content, may not be easily distinguishable from the preferred portions of the content, and may not be tagged with pre-generated tags. As such, the conventional content skipping methods may fail to distinguish non-preferred portions such as topic or chapter introduction sessions and Q&A sessions from the preferred portions of the digital content.
[0086] The present system (100) addresses such limitations of conventional content skipping methods by providing an intelligent content customization system (102) configured to identify relevant audio and textual cues that indicate non-preferred portions of the content. Unlike conventional content skipping methods, which require predefined tags and are suitable only for skipping sections of typical OTT content identifiable from such tags and distinguishable video or frame properties, the system (100) identifies non-preferred portions based on relevant audio and textual cues identified by a pre-trained LLM subsystem. The system (100) thus enables skipping of long, monotonous sections of content that may lack distinguishable video properties or tags and that occur at arbitrary points of time in the digital content, with significantly less processing.
[0087] Although specific features of various embodiments of the present systems and methods may be shown in and/or described with respect to some drawings and not in others, this is for convenience only. It is to be understood that the described features, structures, and/or characteristics may be combined and/or used interchangeably in any suitable manner in the various embodiments shown in the different figures.
[0088] While only certain features of the present systems and methods have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes.

LIST OF NUMERAL REFERENCES:

102 Intelligent content customization system
103 Media player
104 Content viewing platform
106 User device
108 Communication link
202 Audio analysis subsystem
204 Content database
206 Video analysis subsystem
208 Storage subsystem
212 Transcript extraction subsystem
213 Widget integrating subsystem
214 Training subsystem
216 OCR subsystem
218 Extracted transcript
220 Master file
222 Diarization subsystem
223 Video player subsystem
224 Training dataset
228 Diarized audio
230 Widget generating subsystem
234 LLM subsystem
238 OCR data
300 Method for enabling skipping of selected portions while consuming a digital content
302-316 Steps of method for enabling skipping of selected portions while consuming a digital content
400 Method for integrating the skip widget into the digital content
402-410 Steps of method for integrating the skip widget into the digital content
500 Graphical user interface of user device
502 Skip widget
600 Method of training LLM subsystem
602-610 Steps of method of training LLM subsystem to identify relevant audio and textual cues indicating unwanted portions


Claims:

We claim:

1. A method for enabling skipping of selected portions of a digital content, the method comprising:
accessing digital content from a content database (204) by an intelligent content customization system (102);
extracting one or more of textual, visual and audio information from the digital content by the intelligent content customization system (102), wherein extracting the textual information comprises extracting one or more transcripts associated with an audio of the content, wherein extracting the visual information comprises extracting one or more visual parameters from a video associated with the content, and wherein extracting the audio information comprises diarizing the audio;
identifying one or more textual cues based on the extracted transcripts (218), one or more visual cues based on the extracted visual parameters, and one or more audio cues based on the diarized audio (228), wherein one or more of the textual cues, visual cues and audio cues are indicative of one or more selected portions in the content, wherein the one or more selected portions are non-preferred portions of the content;
determining a start time corresponding to each of the one or more selected portions of the content based on one or more timestamps in the digital content that correspond to an onset of one or more of the identified textual cues, visual cues and audio cues, wherein each of the one or more selected portions comprises a corresponding end time determined by identifying a subsequent frame comprising one or more frame properties that are different from the frame properties associated with a frame corresponding to the determined start time; and
displaying a widget (502) on a graphical user interface (500) of a user device (106) at the determined start time of each of the one or more selected portions when the digital content is played on the user device (106), further skipping playback of the one or more selected portions of the content upon activation of the widget (502).

2. The method as claimed in claim 1, wherein identifying the one or more audio cues comprises determining one or more segments of the diarized audio (228) that comprise speech from more than one participating speaker for longer than a threshold duration of time.

3. The method as claimed in claim 1, wherein determining the start time of each of the one or more selected portions of the digital content based on the one or more timestamps that correspond to the onset of the identified audio cues comprises one or more of:
identifying a number of speakers in the diarized audio (228) based on one or more of an audio intensity parameter, an audio histogram parameter, and an audio frequency parameter corresponding to the diarized audio (228) for identifying a timestamp in the diarized audio (228) that corresponds to a change from one participating speaker to more than one participating speaker; and
identifying a timestamp in the diarized audio (228) that corresponds to absence of speech in the diarized audio (228) for a predefined duration of time.

4. The method as claimed in claim 1, wherein determining a start time corresponding to each of the one or more selected portions of the digital content based on the one or more timestamps in the digital content that correspond to the onset of one or more of the textual cues, visual cues and audio cues comprises:
determining if the one or more timestamps corresponding to the onset of two or more of the textual cues, visual cues and audio cues coincide; and
identifying the one or more timestamps in the digital content that corresponds to the onset of the two or more of the textual cues, visual cues and audio cues that coincide as the corresponding start time of each of the one or more selected portions.

5. The method as claimed in claim 3, wherein determining a start time corresponding to each of the one or more selected portions of the content based on one or more timestamps in the digital content that correspond to the onset of one or more of the textual cues, visual cues and audio cues comprises:
determining if the one or more timestamps corresponding to the onset of two or more of the textual cues, visual cues and audio cues coincide;
identifying one or more lags between the one or more timestamps corresponding to the two or more of the textual cues, visual cues and audio cues upon identifying that the one or more timestamps corresponding to the onset of the two or more of the textual cues, visual cues and the audio cues do not coincide;
determining if the lags between the timestamps are within a predefined range; and
identifying an earliest timestamp from the one or more timestamps corresponding to the two or more of the textual cues, visual cues and audio cues as the start time of each of the one or more selected portions when the one or more timestamps do not coincide.

6. The method as claimed in claim 1, wherein extracting the visual parameters further comprises extracting one or more machine-encoded texts from image frames corresponding to the digital content via optical character recognition, wherein the one or more machine-encoded texts comprise optical character recognition data (238).

7. The method as claimed in claim 1, wherein determining the end time of each of the one or more selected portions comprises comparing one or more corresponding frame properties determined from a currently displayed frame of the digital content with one or more subsequent frames until identifying a difference in the corresponding frame properties, wherein the currently displayed frame corresponds to the start time of a selected portion from the one or more selected portions in the digital content and wherein a subsequent frame comprising the identified difference corresponds to the end time of the selected portion; and
determining a timestamp in the digital content corresponding to a transition to the subsequent frame and tagging the determined timestamp as the end time of the selected portion.

8. The method as claimed in claim 7, wherein determining the end time corresponding to each of the one or more selected portions further comprises one or more of:
determining if the extracted transcripts (218) indicate absence of the textual cues corresponding to a currently displayed frame;
determining if the diarized audio (228) indicates a change from more than one speaker to one speaker corresponding to an audio associated with the currently displayed frame; and
determining if the extracted transcripts (218) indicate resumption of discussion relevant to a topic of interest corresponding to the currently displayed frame.

9. The method as claimed in claim 1, wherein displaying a widget (502) on a graphical user interface of the user device (106) comprises:
receiving a master file comprising one or more of the textual cues, visual cues and audio cues that are indicative of the one or more selected portions in the digital content and the start time and end time of each of the one or more selected portions;
displaying the widget (502) at the start time corresponding to each of the one or more selected portions determined from the master file (220) when the digital content is played on the user device (106);
receiving activation input corresponding to the displayed widget (502); and
skipping the one or more selected portions until the end time upon receiving the activation input.

10. The method as claimed in claim 9, wherein skipping the one or more selected portions until the end time upon receiving the activation input comprises skipping one or more of the selected portions from an activation timestamp until the end time, wherein the activation timestamp corresponds to a point in time in the digital content at which the activation input is received by the user device (106).

11. The method as claimed in claim 1, wherein skipping the one or more selected portions comprises skipping one or more portions of the digital content corresponding to one or more categories associated with educational content, wherein the one or more categories comprise one or more of introduction sessions, questions and answer sessions, silence, and break sessions.

12. The method as claimed in claim 1, wherein identifying one or more of the textual cues, visual cues, and audio cues comprises processing one or more of the digital content, the master file, and one or more user files using a pre-trained Large Language Model.

13. A system for enabling skipping of selected portions of a digital content, the system comprising:
an intelligent content customization system (102) communicatively coupled to one or more of a content database (204) associated with a content viewing platform (104) and a user device (106), wherein the intelligent content customization system (102) is configured to:
access the digital content from the content database (204);
extract textual and audio information from the digital content by an audio analysis subsystem (202) and extract visual information from the digital content by a video analysis subsystem (206), wherein the textual information comprises one or more transcripts associated with an audio of the digital content, wherein the visual information comprises one or more visual parameters associated with a video of the digital content, and wherein the audio information comprises diarized audio;
identify one or more textual cues based on the extracted transcripts (218), one or more visual cues based on the extracted visual parameters and one or more audio cues based on the diarized audio (228) by a large language model subsystem (234) using a large language model, wherein one or more of the textual cues, visual cues and the audio cues are indicative of one or more selected portions in the content, wherein the one or more selected portions correspond to non-preferred portions in the digital content;
determine a start time corresponding to each of the one or more selected portions of the content based on one or more timestamps in the digital content that correspond to an onset of one or more of the identified textual cues, visual cues and audio cues using the large language model, wherein each of the one or more selected portions comprises a corresponding end time determined by identifying a subsequent frame comprising one or more frame properties that are different from the frame properties associated with a frame corresponding to the determined start time;
a storage subsystem (208) configured to store a master file (220) that stores one or more of the textual cues, audio cues and visual cues indicating the one or more selected portions in the content, the start time and end time corresponding to each of one or more of the selected portions identified by the large language model; and
wherein the user device (106) is configured to download the master file (220) and the digital content, display a widget (502) at a graphical user interface (500) of the user device (106) at the start time of each of the one or more selected portions of the digital content during playback of the digital content, and skip playback of the one or more selected portions upon receiving an activation input corresponding to the displayed widget (502) at the graphical user interface (500).

14. The system as claimed in claim 13, wherein the intelligent content customization system (102) is one of a standalone system communicating with the content viewing platform (104) and the user device (106) via a communication link (108) and a system integrated into one of the content viewing platform (104) and the user device (106).

Documents

Application Documents

# Name Date
1 202441071196-POWER OF AUTHORITY [20-09-2024(online)].pdf 2024-09-20
2 202441071196-FORM-9 [20-09-2024(online)].pdf 2024-09-20
3 202441071196-FORM 18 [20-09-2024(online)].pdf 2024-09-20
4 202441071196-FORM 1 [20-09-2024(online)].pdf 2024-09-20
5 202441071196-FIGURE OF ABSTRACT [20-09-2024(online)].pdf 2024-09-20
6 202441071196-DRAWINGS [20-09-2024(online)].pdf 2024-09-20
7 202441071196-COMPLETE SPECIFICATION [20-09-2024(online)].pdf 2024-09-20
8 202441071196-FORM-26 [27-09-2024(online)].pdf 2024-09-27
9 202441071196-FER.pdf 2025-10-10

Search Strategy

1 202441071196_SearchStrategyNew_E_SearchHistory(1)E_07-10-2025.pdf