Abstract: “A SYSTEM FOR GENERATING HIGHLIGHT MOVIE USING MULTIPLE RAW VIDEOS” A system (100) for generating a highlight movie from a plurality of raw video files, optionally with a flow document and/or background music, comprises an input device (102), a processing unit (106), memory (104), and an output device (108). The processing unit (106) includes: a single frame feature extraction module (10) using a region-based CNN; an identification module (12) for key entity identification using attention and categorization; an aggregation module (14) for frame representation; a video feature extraction module (16) incorporating frame-level information; an audio processing module (18); a selection module (22) for video segment selection based on multi-modal matching, considering scene change timestamps; a video cut identification module (24) for precise start/end times; a color correction module (26); and a creation module (28) for the final highlight movie. This system automatically generates engaging highlight movies by integrating visual, temporal, semantic, and audio data, enabling personalized output based on user preferences and/or background music synchronization. Figures 1A and 1B
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
COMPLETE SPECIFICATION
(Section 10, Rule 13)
“SYSTEM FOR GENERATING HIGHLIGHT MOVIE USING MULTIPLE RAW VIDEOS”
REJIG AI RESEARCH PRIVATE LIMITED
An Indian
Having address at
24, Sahajanand Palace, Near Sukruti Bunglows
Sindhu Bhavan – Thaltej Road, Thaltej
Ahmedabad, Gujarat-380059, India.
The following specification particularly describes the invention and the manner in which it is to be performed.
FIELD OF THE INVENTION:
This invention relates to the field of generating creative highlight movie(s), specifically a system for generating highlight movies from multiple raw videos and images, optionally including a flow document specifying user preferences and/or a background music file. The system processes the raw video data, extracts visual and temporal features, identifies key entities and salient frames, and selects and sequences video segments based on multi-modal matching that integrates visual, semantic (from the flow document), and audio information. This automated process creates engaging highlight movies with minimal manual intervention, while still allowing for user customization through the flow document and/or background music synchronization.
BACKGROUND OF THE INVENTION:
Current methods for generating creative highlight movies predominantly rely on human editors manually operating video editing software. This involves reviewing all the raw footage manually, selecting the best shots, sequencing them, performing colour correction, and adding effects—all of which are time-consuming and labor-intensive processes. While there are technological solutions aimed at automating this process, they generally create summary videos that attempt to include most of the content from the raw footage. However, these solutions fail to focus on the key moments or segments of importance and do not adequately synchronize the audio and video sentiments. Additionally, these automated systems lack the capability to imbue the final output with a human touch, often resulting in less engaging and emotionally resonant videos.
Due to these limitations, video editors are still required to perform repetitive and laborious tasks manually. Existing technology in this domain does not sufficiently assist editors in their workflow. The invention presented here addresses this gap by taking into account the various steps involved in the video editing process and introducing dedicated AI agent systems designed to understand and replicate these workflows. This ensures that the invention can take over these tasks while accurately interpreting the instructions provided by the human editor. As a result, the invention significantly reduces the manual effort and time required by human editors.
Creative highlight movies are not mere summaries of raw videos and images; they must capture and emphasize important events, arrange them in a visually appealing and creative manner, and synchronize the video’s sentiment with the background music, if provided. Aesthetic enhancements, such as colour correction and special effects, are also essential for making the videos more engaging for viewers. Repeating these artistic tasks across different raw content often leads to creative burnout for human editors, adversely affecting the quality of the output.
Hence, there is a pressing need for a system that can address this technical problem by generating creative highlight movies from multiple raw videos and images in a manner that closely resembles the work of a human editor. Such a system must seamlessly and accurately understand the human editor’s instructions, thereby providing a solution that enhances efficiency without compromising the quality and emotional impact of the output.
PRIOR ART AND ITS DISADVANTAGES:
A patent application no. US20100183280A1, titled “Creating a new video production by intercutting between multiple video clips”, discloses a method in which multiple video clips are temporally aligned based on the content of their audio tracks and then edited to create a new video production incorporating material from two or more of those video clips.
However, said prior art fails to consider any visual or motion features extracted from the non-audio content of the given videos. This results in only the better-sounding clips being picked for the final created video. Moreover, choosing video segments that include specific people and objects of interest cannot be achieved by analyzing audio features alone. Though multiple video clips are considered and analyzed, there is no module to ensure that the same or closely related clips are not picked repeatedly for the final created video, which makes it uninteresting and mundane. Further, visually attractive clips are never considered on the basis of their visual features. The method is therefore prone to missing what could have been great additions to the final created video.
A patent number US10650245B2, titled “Generating digital video summaries utilizing aesthetics, relevancy, and generative neural networks” discloses systems, methods, and non-transitory computer-readable media for generating digital video summaries based on analyzing a digital video utilizing a relevancy neural network, an aesthetic neural network, and/or a generative neural network. For example, the disclosed systems can utilize an aesthetics neural network to determine aesthetics scores for frames of a digital video and a relevancy neural network to generate importance scores for frames of the digital video. Utilizing the aesthetic scores and relevancy scores, the disclosed systems can select a subset of frames and apply a generative reconstructor neural network to create a digital video reconstruction. By comparing the digital video reconstruction and the original digital video, the disclosed systems can accurately identify representative frames and flexibly generate a variety of different digital video summaries.
However, in the said prior art, the relevancy neural network module selects a set of frames based on the described selection scores. In the context of event movie and teaser creation, there are specific time segments and objects of particularly significant interest, such as the main rituals in a wedding, and these must appear in the final movie for it to be acceptable, relevant, and engaging for viewers. This module has no specific mechanism to account for this; it simply assigns scores to frames and is not trained on a dataset that covers such practical use cases. The video trailer module in said art can cause abrupt cuts at irrelevant timestamps at scene changes in the final created video, frustrating the viewer and omitting important details of relevant raw video segments. The aesthetic neural network is trained on a general-purpose dataset, uses simple pre-trained convolutional neural networks to extract visual features of the frames, and aggregates these to compute the visual features of a segment for segment selection in the final created video. For highlight movie creation, this kind of aesthetic feature analysis cannot give sufficient scores to cinematic and drone shots, which are relevant and highly viewer-engaging, for them to be picked in the final video. Additionally, such aesthetic visual scores do not work well when there are multiple human subjects or many colours in a frame; in the context of an event, this can cause very relevant clips, such as family pictures and family rituals, to be omitted. The generative neural network module reconstructs the digital video features and uses an LSTM (Long Short-Term Memory) encoder and decoder along with the knapsack algorithm to generate the video trailer. This module can help identify representative frames that reflect the whole video in the trailer, but it has no attention mechanism to focus on the essential video segments required in the final movie. This technology also fails to provide any mechanism for incorporating human feedback and a human touch to construct a creative flow for the created movie. Further, said art fails to provide any way to incorporate background music into the video and to match video segment transitions to relevant audio cuts in the music, or to match the intensity and sentiment of the video clips to the background music.
A patent number US10192584B1, titled “Cognitive dynamic video summarization using cognitive analysis enriched feature set”, discloses an accurate and concise summarization of a media production achieved using cognitive analysis, which groups segments of the production into clusters based on extracted features, selects a representative segment for each cluster, and combines the representative segments to form a summary. The production is separated into a video stream, a speech stream, and an audio stream, from which the cognitive analysis extracts visual features, textual features, and aural features. The clustering groups together segments whose visual and textual features most closely match. Selection of the representative segments derives a score for each segment based on factors including a distance to the centroid of the cluster, an emotion level, audio uniqueness, and video uniqueness. Each of these factors can be weighted, and the weights can be adjusted by user input. The factors can have initial weights based on statistical attributes of historical media productions.
However, said prior art, which clusters video segments and stitches cluster heads into the final movie, has no attention mechanism to put more weight on the more relevant video clips. It can also cause abrupt cuts at irrelevant timestamps in the final created video, frustrating the viewer and omitting important details of relevant raw video segments. It does not provide a way to pick video segments that are aesthetically good to watch: any segment chosen as the cluster head based on the extracted features is put in the final video, and this cluster-head segment might not be the best segment to watch and can be aesthetically much worse. Said art also fails to provide any mechanism for incorporating human feedback and a human touch to construct a creative flow for the created movie. Though multiple video clips are considered and analyzed, there is no module to ensure that the same or closely related clips are not picked repeatedly for the final created video, which makes it uninteresting and mundane.
An application number US 2010/0183280 A1, titled “Creating a new video production by intercutting between multiple video clips”, discloses a method in which multiple video clips are temporally aligned based on the content of their audio tracks and then edited to create a new video production incorporating material from two or more of those video clips.
However, said prior art does not consider any visual or motion features extracted from the non-audio content of the given videos, so only the better-sounding clips are picked for the final created video. Choosing video segments that include specific people and objects of interest cannot be achieved by analyzing audio features alone. Though multiple video clips are considered and analyzed, there is no module to ensure that the same or closely related clips are not picked repeatedly for the final created video, which makes it uninteresting and mundane. Visually attractive clips are never considered on the basis of their visual features, so the method is prone to missing what could have been great additions to the final created video. Further, all the selected clips must be aligned with a common audio track, the clips are always sequenced in temporal order, and only the amplitude of the background audio from a video clip is used for decision-making. Hence, said prior art fails to suggest a system for generating highlight movie(s) using multiple raw videos and pictures that captures the key moments and embeds background music and special effects to produce more captivating highlight movie(s) with a human touch.
A patent application no. US 2016/0365114 A1, titled “Video editing system and method using machine learning”, and international application no. WO 20211150282 A1, titled “Selection of video frames using a machine learning predictor”, suggest an editing approach that is a means of assisting manual editing and is not fully automated. The system uses machine learning to create signals which are provided to the human editor to assist in editing. Said prior arts nowhere suggest creating the edited highlight movie automatically, and no audio or video analysis of the kind required here is performed using machine learning. Further, said art focuses on how a video will be played on a device and has a module for gesture recognition which uses image features, but it fails to suggest a system for generating highlight movie(s) using multiple raw videos and pictures that captures the key moments and embeds background music and special effects to produce more captivating highlight movie(s) with a human touch.
A patent application number US 2018/0132006 A1, titled “Highlight-based movie navigation, editing and sharing”, discloses methods and apparatuses for highlight-based movie navigation, editing, and sharing. In one embodiment, the method for processing media comprises: playing back a movie on a display of a media device; performing gesture recognition to recognize one or more gestures made with respect to the display; and navigating through the media on a per-highlight basis in response to recognizing the one or more gestures.
However, said prior art relates to finding interesting parts of an image and cropping it accordingly. In one of its applications, it mentions automating the selection of good video frames, but it does not automatically select good video segments that can be stitched together into a highlight movie; it merely computes a score for each video frame and selects the frame with the highest score. Said art is related to cropping images only; it does not in any way find interesting parts of a video or help identify video sub-clip start and end frames from which a highlight movie could be created. Said art nowhere suggests audio feature extraction, audio-video matching, color correction, and so forth. Hence, said prior art fails to suggest a system for generating highlight movie(s) using multiple raw videos and pictures that captures the key moments and embeds background music and special effects to produce more captivating highlight movie(s) with a human touch.
DISADVANTAGES OF THE PRIOR ART:
Said prior art suffers from all or any of the following disadvantages:
• Most prior art approaches fail to consider visual or motion features extracted from the non-audio content of videos, resulting in the selection of clips based primarily on audio quality rather than overall content quality.
• Prior art solutions do not ensure the inclusion of video segments containing specific people or objects of interest, which may be critical for relevance and personalization.
• The creation of visually appealing clips, as suggested by prior art, is not based on any analysis or evaluation of visual features, leading to arbitrary or suboptimal results.
• Abrupt cuts at irrelevant timestamps in the final created videos are common in the prior art, which can frustrate viewers and omit important details from the raw video segments.
• The prior art fails to ensure that selected video segments are aesthetically pleasing, potentially leading to visually unengaging final outputs.
• Most prior art lacks a mechanism to incorporate human feedback or a creative touch, resulting in a rigid and non-customizable flow in the constructed videos.
• Many prior art approaches fail to include a module for detecting and eliminating duplicate or overly similar clips, leading to redundancy and monotony in the final video.
• The prior art does not provide a mechanism to incorporate background music in a way that aligns video segment transitions with relevant audio cuts in the music, reducing the overall synchronization and cohesiveness of the video.
• Matching the intensity and sentiment of video clips to the background music is absent in prior art, resulting in a mismatch between audio and visual elements, which diminishes the viewer's emotional engagement.
Therefore, the aforementioned drawbacks and limitations of the prior art are efficiently and effectively overcome by the present invention.
OBJECTS OF THE INVENTION:
It is an objective of the present invention to provide a system and method for generating creative highlight movies from multiple raw video files, and optionally, images/pictures, a flow document, and/or a background music file, enabling the production of professional-quality video highlights with minimal manual effort.
A further objective of the invention is to automatically identify and prioritize key moments and significant video segments, emphasizing important people, objects, and events, guided by user preferences specified in an optional flow document.
Another objective of the invention is to synchronize background music (if provided) with the video content, dynamically adjusting the selection and pacing of video segments to complement the music's sentiment and intensity for an emotionally engaging viewing experience.
A further objective of the invention is to ensure smooth transitions and continuity between selected video segments, avoiding abrupt or irrelevant cuts and enhancing the overall aesthetic quality of the highlight movie.
Another objective of the invention is to automate the inclusion of color correction and other aesthetic enhancements to ensure a visually appealing and professionally polished final product.
A further objective of the invention is to provide a mechanism for incorporating user feedback, allowing for a creative touch that reflects individual preferences and artistic direction.
Another objective of the invention is to provide a scalable and efficient system that reduces the time and manual effort required for video editing by automating key processes, including feature extraction, entity identification, multi-modal matching, and video cut selection.
A further objective of the invention is to avoid redundant or similar video segments in the final output, ensuring a dynamic and engaging viewing experience.
Another objective of the invention is to provide a versatile solution applicable to various domains, including weddings, corporate events, sports events, travel vlogs, and other video content.
A further objective of the invention is to generate highlight movies that capture the essence of the original content while overcoming the drawbacks of prior art video editing methods, such as manual effort, time consumption, and the difficulty of achieving professional-quality results.
BRIEF DESCRIPTION OF THE DRAWINGS:
Various other objects, features and attendant advantages of the present invention will become fully appreciated as the same becomes better understood when considered in conjunction with the accompanying drawings, and wherein:
Figure 1A illustrates the overall system architecture (100) for generating highlight movies, showing the interaction between the processing unit (106), input device (102), memory (104), and output device (108) according to embodiments of the present invention.
Figure 1B illustrates the internal architecture of the processing unit (106), detailing the interconnected modules responsible for processing video, audio, and text data to generate the highlight movie, according to embodiments of the present invention.
Figure 2 illustrates the architecture of the single frame features extraction module (10) according to embodiments of the present invention.
Figure 3 illustrates the architecture of the identification module (12), comprising the attention module and categorization module according to embodiments of the present invention.
Figure 4 illustrates the architecture of the aggregation module (14) according to embodiments of the present invention.
Figure 5 illustrates the data flow between the single frame features extraction module (10), identification module (12), and aggregation module (14), according to embodiments of the present invention.
Figure 6 illustrates the architecture of the video features extraction module (16). It depicts the CNN-based processing of a video segment *S* through convolutional layers, spatio-temporal feature representation generation, and regional processing to extract visual and temporal features. It highlights how frame-level features and attention outputs are incorporated into the video segment representation, according to embodiments of the present invention.
Figure 7 further illustrates the video features extraction module (16), emphasizing the attention mechanism used to determine the importance of each frame within the video segment and the subsequent aggregation of frame-level features into a 1-dimensional vector *v_S* representing the video segment, according to embodiments of the present invention.
Figure 8 illustrates the overall operation of the selection module (22). It shows the processing of the flow document (if provided) using the BERT model, the processing of raw videos, and the processing of audio features (from a separate music file or video tracks), according to embodiments of the present invention.
Figure 9 illustrates the feature vector scaling and high-dimensional mapping performed within the selection module (22). It shows how feature vectors from different modalities (text, video, audio) are scaled to similar dimensions and mapped into a high-dimensional space for comparison, according to embodiments of the present invention.
Figure 10 illustrates the multi-modal matching process within the selection module (22), specifically when a flow document is provided. It shows how the matching process considers the flow document's embeddings and scene change timestamps to select the most relevant video segments, according to embodiments of the present invention.
Figure 11 illustrates the matching process within the selection module (22) when only a background music file is provided. It shows how the matching relies primarily on comparing audio and video features, using scene change timestamps derived from the music, according to embodiments of the present invention.
Figure 12 illustrates the matching process within the selection module (22) when only a flow document is provided. It shows how the matching considers both the textual information from the flow document and the audio characteristics of the video segments to select the most appropriate segments, according to embodiments of the present invention.
Figure 13 illustrates the architecture of the video cuts identification module (24). It shows how the module receives frame-level feature vectors and attention outputs, uses an attention mechanism to identify salient frames, and then employs a neural network to determine the precise start and end times for the selected video segments, according to embodiments of the present invention.
Figure 14 illustrates the architecture of the colour correction module (26), according to embodiments of the present invention.
SUMMARY OF THE INVENTION:
A system for generating a highlight movie from a plurality of raw video files, optionally with a flow document and/or background music, comprises an input device, a processing unit, memory, and an output device. The processing unit includes: a single frame feature extraction module using a region-based CNN; an identification module for key entity identification using attention and categorization; an aggregation module for frame representation; a video feature extraction module incorporating frame-level information; an audio processing module; a selection module for video segment selection based on multi-modal matching, considering scene change timestamps; a video cut identification module for precise start/end times; a color correction module; and a creation module for the final highlight movie. This system automatically generates engaging highlight movies by integrating visual, temporal, semantic, and audio data, enabling personalized output based on user preferences and/or background music synchronization.
LIST OF REFERENCE NUMERALS:
System (100)
Input Device (102)
Memory (104)
Processing Unit (106)
Output device (108)
Single frame features extraction module (10)
Identification module (12)
Aggregation module (14)
Video features extraction module (16)
Audio processing module (18)
Selection module (22)
Video cuts identification module (24)
Colour correction module (26)
Creation module (28)
DETAILED DESCRIPTION OF THE INVENTION:
The following description is presented to enable any person skilled in the art to make and use the invention and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
It is to be understood that the terms "comprising" or "comprises" as used in the specification and claims are intended to mean that the enumerated elements are present, but do not exclude the presence of other elements. For example, an invention comprising elements X, Y, and Z may also include elements A, B, and/or C. Further, the terms "photos," “photographs”, "images," and "pictures" are used interchangeably herein.
According to the embodiments illustrated in Figure 1A, a system (100) for generating a highlight movie utilizing multiple raw videos is disclosed. The system (100) comprises an input device (102), a processing unit (106), a memory (104), and an output device (108).
The input device (102) is configured to receive a plurality of raw video files, and optionally, a flow document, a background music file, and/or reference video material from a user. The output device (108) is configured to generate the final highlight movie.
In one aspect of the invention, the flow document is a structured document configured to allow users to specify particular instructions and preferences for the highlight movie output. Such a document may include, but is not limited to, information such as objects or moments of interest, a desired storyline, energy levels, time of day/year, background sentiments, and lighting conditions.
The memory (104) is operably coupled to the input device (102) and the processing unit (106). It is configured to store the received raw video files, and optionally, the flow document, the background music file, and/or the reference video material.
The processing unit (106) is the central component of the system, comprising several modules for processing the input data and generating the highlight movie. These modules are described in detail below.
The received data are stored in the memory (104), and the processing unit (106) may access the information or data to further analyze the data, as described below.
In other embodiments illustrated in Figure 1B, the processing unit (106) is equipped with various modules, including:
• a single frame features extraction module (10) configured to extract visual features from individual frames of the raw video files using a region-based Convolutional Neural Network (CNN) model. In one embodiment, this CNN is a ResNet architecture, although other architectures like VGG or EfficientNet could also be used. The CNN comprises:
- a plurality of convolutional layers to generate a feature map for each frame;
- means for partitioning the feature map into a plurality of partially overlapping regions of varying sizes; and
- a plurality of convolutional channels, each channel processing a respective region to generate a 1-dimensional feature vector for each region;
• an identification module (12) configured to identify key entities within each frame, comprising:
- an attention module configured to receive the 1-dimensional feature vectors from the single frame features extraction module (10) and assign attention weights to each region; and
- a categorization module configured to receive the 1-dimensional feature vectors from the single frame features extraction module (10) and identify the presence and type of entities within each region;
• an aggregation module (14) configured to receive the 1-dimensional feature vectors from the single frame features extraction module (10) and the entity identification information from the identification module (12), and generate a single integer representation for each frame;
• a video features extraction module (16) configured to extract visual and temporal features from video segments, comprising:
- means for grouping a plurality of frames to form a video segment;
- a plurality of convolutional layers to generate a spatial and temporal feature representation for the video segment;
- means for partitioning the spatial and temporal feature representation into regions, each region corresponding to a frame in the segment;
- a plurality of convolutional channels, each channel processing a respective region to generate regional video features;
- means for appending the frame-level feature vectors from the aggregation module (14) and attention outputs from the identification module (12) to their corresponding regional video features;
- an attention mechanism to determine the importance of each frame within the video segment; and
- fully connected layers to generate a 1-dimensional vector representing the video segment;
• an audio processing module (18) configured to extract and process audio features from the input background music file or from the audio tracks embedded within the raw video files, comprising:
- means for generating a spectrum of the audio file using a Short-Time Fourier Transform (STFT) or a similar technique; and
- a recurrent neural network (RNN) model to process the spectrum sequentially, capturing temporal dependencies in the audio signal, to extract audio features and identify segments with similar audio characteristics;
• a selection module (22) configured to select video segments for inclusion in the highlight movie, comprising:
- means for processing a received flow document (if provided) using a Bidirectional Encoder Representations from Transformers (BERT) model to generate contextualized word/sentence embeddings;
- means for performing a multi-modal matching and ranking process that integrates:
o the 1-dimensional vectors representing the video segments (from the video features extraction module (16));
o optionally, the contextualized word/sentence embeddings (from the flow document); and
o optionally, the extracted audio features (from the audio processing module (18));
- a defined matching algorithm (e.g., cosine similarity) to compare feature vectors in a high-dimensional space, considering scene change timestamps (derived from audio processing or the flow document). Weights may be applied to different modalities.
- means for ranking the video segments based on the results of the multi-modal matching and ranking process.
• a video cuts identification module (24) configured to identify precise start and end times for the selected video segments, comprising:
- an attention mechanism to identify salient frames within each selected video segment, receiving as input the frame-level feature vectors (from the aggregation module (14)) and attention outputs (from the identification module (12)); and
- a neural network with fully connected layers to predict the start and end times for each selected video segment based on the attended frame features;
• a colour correction module (26) configured to adjust the color of each frame in the selected video segments using a defined color correction algorithm based on analyzing histograms of color channels; and
• a creation module (28) configured to combine the selected video segments, using the start and end times, and optionally the background music, to generate the highlight movie. The creation module (28) optimizes transitions between selected video segments to ensure visual smoothness. The module receives input from the color correction module (26).
According to the embodiments depicted in Figure 2, the single frame features extraction module (10) is configured to extract visual features from individual images or video frames using a region-based Convolutional Neural Network (CNN) model. The input image *I* is first processed by a series of initial convolutional layers, denoted as *CL_initial(I; W_initial, b_initial)*, where *W_initial* represents the weights and *b_initial* represents the biases of these layers. These initial convolutional layers perform a series of convolutional operations, extracting hierarchical features and generating a feature map *F*.
This feature map *F* is then partitioned into *n* partially overlapping rectangular regions of varying sizes, denoted as *R = {r_1, r_2, ..., r_n}*. The degree of overlap between regions is a configurable parameter. Each region *r_i* is then processed by a separate set of regional convolutional layers, denoted as *CL_regional_i(r_i; W_regional_i, b_regional_i)*, where *W_regional_i* and *b_regional_i* represent the weights and biases of the convolutional layers for the *i*-th region. These regional convolutional layers extract features specific to each region, producing a 1-dimensional feature vector *f_i* for each region.
The output of the single frame features extraction module (10) is a set of *n* 1-dimensional feature vectors, *F_regional = {f_1, f_2, ..., f_n}*. This architecture, employing parallel convolutional channels for different regions, allows the module to capture both local and global visual information within the input frame. The use of partially overlapping regions ensures that boundary information is not lost and provides contextual information between adjacent regions. The resulting feature vectors, *F_regional*, are then passed to subsequent modules for further processing.
The specific architecture of the CNN (number of layers, filter sizes, stride, activation functions, etc.) and the parameters of the region partitioning (size, overlap) are configurable and can be optimized during training to achieve the desired performance for the specific application. The CNN is trained using a suitable dataset of images and associated labels, employing a loss function appropriate for the task (e.g., a multi-class classification loss if the features are used for object recognition). Backpropagation is used to update the weights and biases of the network during training.
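By way of a non-limiting illustration, the following is a minimal sketch of the single frame features extraction module (10), assuming a PyTorch implementation; the layer sizes, the 3x3 region grid, the 50% overlap, and the 128-dimensional regional vectors are illustrative choices rather than fixed parameters of the design.

```python
# Illustrative sketch only: region-based per-frame feature extraction.
import torch
import torch.nn as nn

class SingleFrameFeatureExtractor(nn.Module):
    def __init__(self, n_rows=3, n_cols=3, region_dim=128):
        super().__init__()
        # CL_initial: shared convolutional layers producing the feature map F.
        self.initial = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.n_rows, self.n_cols = n_rows, n_cols
        # CL_regional_i: one convolutional channel per region, each ending in a
        # pooling and linear head that yields a 1-dimensional feature vector f_i.
        self.regional = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, region_dim),
            )
            for _ in range(n_rows * n_cols)
        ])

    def _regions(self, fmap):
        # Partition the feature map into partially overlapping rectangles
        # of varying sizes (regions at the borders are clipped).
        _, _, h, w = fmap.shape
        rh, rw = h // self.n_rows, w // self.n_cols
        oh, ow = rh // 2, rw // 2          # roughly 50% overlap with neighbours
        regions = []
        for r in range(self.n_rows):
            for c in range(self.n_cols):
                y0, x0 = r * rh, c * rw
                regions.append(fmap[:, :, y0:min(h, y0 + rh + oh),
                                          x0:min(w, x0 + rw + ow)])
        return regions

    def forward(self, image):
        fmap = self.initial(image)          # feature map F
        return [conv(r) for conv, r in zip(self.regional, self._regions(fmap))]

# Usage: nine 128-dimensional regional vectors F_regional for one RGB frame.
frame = torch.randn(1, 3, 224, 224)
f_regional = SingleFrameFeatureExtractor()(frame)
```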
According to the embodiments depicted in Figure 3, the identification module (12) is configured to identify key entities within each frame, such as people, objects, and events, by analyzing the visual features extracted by the single frame features extraction module (10). This module employs a machine learning model comprising two sub-modules: an attention module and a categorization module. The attention module and the categorization module both utilize neural networks. In one embodiment, both the attention module and the categorization module employ multilayer perceptrons (MLPs). The specific architecture of the MLPs (number of layers, neurons per layer, activation functions) is configurable and optimized during training. For example, the attention module could utilize a single hidden layer with ReLU activation, while the categorization module might use multiple hidden layers followed by a softmax layer for multi-class classification.
The attention module receives as input the set of 1-dimensional feature vectors (E1:01) generated by the single frame features extraction module (10). Each vector in (E1:01) corresponds to a specific region within the frame. The attention module processes these input vectors to identify regions of interest within the frame. It assigns an attention weight, *a_i*, to each region *r_i*. This weight represents the importance or relevance of that specific region for the task of entity identification. The output of the attention module is a set of attention weights *A = {a_1, a_2, ..., a_n}*.
The categorization module also receives the same set of 1-dimensional feature vectors (E1:01) as input. Its function is to classify these input vectors and, based on this classification, identify the presence and type of entities within each region. The categorization module employs a neural network architecture, which in some embodiments includes fully connected layers (as shown in Figure 3). This neural network is trained to recognize and categorize different entities of interest (e.g., people, objects, events). The output of the categorization module is a set of entity labels *E = {e_1, e_2, ..., e_n}*, where each *e_i* represents the identified entities in region *r_i*. In some embodiments, *e_i* can be represented as a 1-D one-hot vector (E2:02), where each entry in the vector corresponds to an entity type, with a value of 1 indicating the presence of that entity in the corresponding region and 0 indicating its absence.
The combined architecture of the attention module and the categorization module, trained on a dataset of labeled images, allows for accurate and efficient identification of key entities within the visual data. The attention module focuses the processing on the most relevant regions of the image, while the categorization module identifies the specific entities present in those regions. The outputs of both modules, the attention weights *A* (E2:01) and the entity labels *E* (E2:02), are then passed on to the aggregation module (14) for further processing.
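By way of a non-limiting illustration, the following is a minimal sketch of the MLP embodiment of the identification module (12); the hidden widths and the five entity classes are illustrative placeholders, not fixed by the design.

```python
# Illustrative sketch only: attention weights A and entity labels E per region.
import torch
import torch.nn as nn

class IdentificationModule(nn.Module):
    def __init__(self, region_dim=128, n_entity_types=5):
        super().__init__()
        # Attention module: one scalar weight a_i per regional vector f_i.
        self.attention = nn.Sequential(
            nn.Linear(region_dim, 64), nn.ReLU(), nn.Linear(64, 1),
        )
        # Categorization module: entity-type distribution e_i per region.
        self.categorize = nn.Sequential(
            nn.Linear(region_dim, 64), nn.ReLU(),
            nn.Linear(64, n_entity_types),
        )

    def forward(self, f_regional):
        # f_regional: list of n tensors of shape (batch, region_dim) (E1:01).
        stacked = torch.stack(f_regional, dim=1)                   # (batch, n, region_dim)
        a = torch.softmax(self.attention(stacked).squeeze(-1), 1)  # A = {a_i} (E2:01)
        e = torch.softmax(self.categorize(stacked), dim=-1)        # E = {e_i} (E2:02)
        return a, e
```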
According to the embodiments depicted in Figure 4, the aggregation module (14) is configured to receive the 1-dimensional feature vectors (E1:01) from the single frame features extraction module (10) and the entity identification information from the identification module (12). The entity identification information comprises the attention weights (E2:01) from the attention module and the entity labels (E2:02) from the categorization module. The aggregation module (14) processes these inputs to generate a single integer representation (E3:01) for each frame.
Specifically, the module (14) receives:
* The 1-dimensional feature vectors (E1:01), representing local visual features extracted from different regions of the frame.
* The attention weights (E2:01), indicating the relative importance of each region for entity identification.
* The entity labels (E2:02), specifying the type of entities detected in each region.
These inputs are then processed by a neural network within the aggregation module (14). In some embodiments, this neural network comprises a sequence of fully connected layers forming a multilayer perceptron (MLP). In one embodiment, this MLP consists of three fully connected layers with ReLU activation functions in the hidden layers and a linear activation function in the output layer. The number of neurons in each layer is a design parameter chosen based on the complexity of the features being aggregated. Other architectures, such as a convolutional neural network (CNN) or a recurrent neural network (RNN), could also be employed depending on the nature of the frame-level information being aggregated. The neural network is trained to learn a mapping that combines and compresses this information (local visual features, regional importance, and entity categorization) into a single integer representation (E3:01) for the frame. This aggregated integer encapsulates the key visual and semantic information for that frame, providing a concise and informative representation suitable for subsequent processing in the highlight movie generation pipeline.
The training process utilizes a dataset of images and associated labels. The labels would typically include information about the dominant entities or events present in the image, which the network uses to learn to aggregate the regional features, attention, and categorization outputs effectively. The specific architecture of this neural network, including the number of layers, the number of neurons in each layer, and the activation functions used, can be varied and optimized during training to achieve effective feature aggregation for the specific task of highlight movie generation. The loss function used during training would be appropriate for a regression task, as the output is a single integer. For example, mean squared error could be used if the integer represents a score related to the frame's importance.
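By way of a non-limiting illustration, the following is a minimal sketch of the three-layer MLP embodiment of the aggregation module (14); the per-region concatenation of f_i, a_i and e_i, the hidden widths, and the nine-region assumption are illustrative.

```python
# Illustrative sketch only: fuse regional features, attention weights and
# entity labels into one scalar per frame.
import torch
import torch.nn as nn

class AggregationModule(nn.Module):
    def __init__(self, region_dim=128, n_entity_types=5, n_regions=9):
        super().__init__()
        per_region = region_dim + 1 + n_entity_types     # [f_i, a_i, e_i]
        self.mlp = nn.Sequential(
            nn.Linear(per_region * n_regions, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),                            # linear output layer
        )

    def forward(self, f_regional, a, e):
        # f_regional: (batch, n, region_dim), a: (batch, n), e: (batch, n, n_types)
        combined = torch.cat([f_regional, a.unsqueeze(-1), e], dim=-1)
        s = self.mlp(combined.flatten(start_dim=1)).squeeze(-1)
        # s is the frame score; it is rounded to the single integer
        # representation (E3:01) at inference time.
        return s
```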
Referring to Figure 5, the single frame features extraction module (10), the identification module (12), and the aggregation module (14) are interconnected to form a processing pipeline for analyzing individual frames. This interconnected structure enables efficient and comprehensive analysis of each frame, combining low-level visual features with higher-level entity identification and contextual information.
Specifically, the single frame features extraction module (10) receives the input image *I* and extracts a set of 1-dimensional feature vectors *F_regional = {f_1, f_2, ..., f_n}* (E1:01), as described previously. These feature vectors *F_regional* (E1:01) are then passed as input to the identification module (12).
Within the identification module (12), the attention module processes the input feature vectors *F_regional* (E1:01) and generates a set of attention weights *A = {a_1, a_2, ..., a_n}* (E2:01), where each *a_i* corresponds to the attention weight for region *r_i*. Simultaneously, the categorization module within the identification module (12) also processes the input feature vectors *F_regional* (E1:01) and generates a set of entity labels *E = {e_1, e_2, ..., e_n}* (E2:02), where each *e_i* represents the identified entities in region *r_i*.
The aggregation module (14) receives three sets of inputs: the 1-dimensional feature vectors *F_regional* (E1:01) from the single frame features extraction module (10), the attention weights *A* (E2:01) from the attention module within the identification module (12), and the entity labels *E* (E2:02) from the categorization module within the identification module (12). The aggregation module (14) then processes these inputs using a neural network, as described previously, to generate a single integer representation *s* (E3:01) for the frame. This single integer representation *s* (E3:01) encapsulates the key visual and semantic information for the frame, integrating the low-level features, attention weights, and entity labels.
According to the embodiments depicted in Figure 6, the video features extraction module (16) is configured to extract visual and temporal features from video segments and encode them into a 1-dimensional vector *v_S*. A video segment, *S*, consists of a sequence of *m* frames, *S = {I_1, I_2, ..., I_m}*. The module (16) employs a convolutional neural network (CNN) architecture to process the video segment.
The module (16) first processes the video segment *S* through a series of convolutional layers *CL_video(S; W_CL_video, b_CL_video)*, where *W_CL_video* and *b_CL_video* represent the weights and biases of these convolutional layers. These layers generate a spatio-temporal feature representation, *F_video*, capturing both spatial and temporal information within the video segment.
The spatio-temporal feature representation, *F_video*, is then conceptually partitioned into *m* regions, where each region is associated with a frame *I_i* in the video segment. It's important to note that the partitioning here is not a literal splitting of the *F_video* tensor but rather a way to conceptually associate parts of the spatio-temporal representation with individual frames. Each of these regions is then processed by a separate set of convolutional layers, *CL_regional_i(F_video_i; W_CL_regional_i, b_CL_regional_i)*, where *i = 1, 2, ..., m*, and *F_video_i* represents the portion of *F_video* associated with frame *I_i*. *W_CL_regional_i* and *b_CL_regional_i* represent the weights and biases of the convolutional layers for the *i*-th frame. These frame-specific convolutional layers extract features specific to each frame *I_i* within the context of the entire video segment, generating regional video features, *vf_i*.
Crucially, the frame-level feature vectors (generated by the aggregation module (14) and denoted as *f_i* in Figure 4) and the attention model outputs from the identification module (12) (shown in Figure 3 and denoted as *a_i*) corresponding to each frame *I_i* in the video segment are then appended to their respective regional video features *vf_i*, creating augmented feature vectors *vf_i_augmented = [vf_i, f_i, a_i]*. This augmentation allows the video segment representation to incorporate frame-level information, including entity identification and attention weights.
These augmented vectors *vf_i_augmented* are then processed by an attention mechanism, *Att_video(vf_i_augmented; W_Att_video, b_Att_video)*. This attention mechanism analyzes the importance of each frame *I_i* within the context of the entire video segment *S*, considering the frame-level features and attention weights.
Finally, the outputs of the attention mechanism are passed through a neural network consisting of fully connected layers, denoted as *FC_video(.; W_FC_video, b_FC_video)*. In one embodiment, this is a multilayer perceptron (MLP). The MLP architecture, including the number of layers and neurons per layer, is optimized during training. Other architectures, such as an RNN or a transformer network, could also be employed depending on the desired temporal processing of the video segment features. This neural network aggregates the frame-level features, weighted by their importance as determined by the attention mechanism, into a single 1-dimensional vector, *v_S*, representing the complete video segment *S*.
This 1-dimensional vector, *v_S*, encapsulates the essential visual and temporal characteristics of the video segment and is suitable for further processing in the highlight movie generation pipeline. While not explicitly depicted in Figure 7, the module (16) may further consider the temporal alignment and emotional congruence of video segments with the background music (if provided) during this process, ensuring a synchronized and engaging viewing experience.
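By way of a non-limiting illustration, the following is a minimal sketch of the video features extraction module (16), assuming a 3-D convolutional backbone; the kernel sizes, the 256-dimensional *v_S*, and the assumption of one aggregated score and nine regional attention weights per frame follow the illustrative modules above.

```python
# Illustrative sketch only: segment-level vector v_S from a spatio-temporal CNN
# augmented with frame-level scores and attention outputs.
import torch
import torch.nn as nn

class VideoFeatureExtractor(nn.Module):
    def __init__(self, frame_feat_dim=1, attn_dim=9, out_dim=256):
        super().__init__()
        # CL_video: spatio-temporal feature representation F_video.
        self.backbone = nn.Conv3d(3, 32, kernel_size=(3, 7, 7),
                                  stride=(1, 4, 4), padding=(1, 3, 3))
        # CL_regional_i: per-frame processing of the slice of F_video.
        self.per_frame = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64), nn.ReLU(),
        )
        aug = 64 + frame_feat_dim + attn_dim   # vf_i_augmented = [vf_i, f_i, a_i]
        self.frame_attn = nn.Linear(aug, 1)    # Att_video: frame importance
        self.head = nn.Sequential(nn.Linear(aug, out_dim), nn.ReLU(),
                                  nn.Linear(out_dim, out_dim))  # FC_video

    def forward(self, segment, frame_scores, frame_attn):
        # segment: (batch, 3, m, H, W); frame_scores: (batch, m, 1);
        # frame_attn: (batch, m, attn_dim)
        fv = self.backbone(segment)                                   # F_video
        b, c, m, h, w = fv.shape
        vf = self.per_frame(fv.permute(0, 2, 1, 3, 4).reshape(b * m, c, h, w))
        vf = vf.view(b, m, -1)                                        # vf_i
        aug = torch.cat([vf, frame_scores, frame_attn], dim=-1)       # vf_i_augmented
        weights = torch.softmax(self.frame_attn(aug), dim=1)          # frame importance
        return self.head((weights * aug).sum(dim=1))                  # v_S
```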
Now, referring to another embodiment of the present invention, said audio processing module (18) extracts and processes audio features from the input background music file or from the audio tracks embedded within the raw video files. It comprises the following steps:
- A spectrum of the audio is generated using a Short-Time Fourier Transform (STFT) or a similar technique. This spectrum represents the frequency content of the audio signal over time.
- The spectrum is processed sequentially by a recurrent neural network (RNN) model. In one embodiment, a Long Short-Term Memory (LSTM) network is used for the RNN. The LSTM network processes the audio spectrum sequentially, capturing temporal dependencies in the audio signal. Other RNN architectures, such as Gated Recurrent Units (GRUs), could also be used. The specific architecture of the RNN (number of layers, hidden units) is a design parameter optimized during training. The RNN captures temporal dependencies in the audio signal, extracting audio-level features. These features can include Mel-Frequency Cepstral Coefficients (MFCCs), chroma features, spectral centroid, spectral rolloff, and other relevant audio descriptors.
- The extracted audio features are used to segment and classify the audio. A clustering algorithm (e.g., k-means, hierarchical clustering, or a Gaussian Mixture Model) is applied to the feature vectors to group audio segments with similar characteristics (tone, frequency, pitch, timbre, etc.). The number of clusters can be determined empirically or based on characteristics of the music. This clustering process identifies distinct audio events or musical phrases. The boundaries between these clusters define the audio cuts (timestamps for potential scene changes). Optionally, a classification step can be added to label each cluster with a musical genre, mood, or other relevant descriptor.
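By way of a non-limiting illustration, the following is a minimal sketch of the audio processing module (18), using librosa for the STFT, a single-layer LSTM for the sequential features, and k-means for the grouping; the hop length, hidden size, and number of clusters are illustrative, and in practice the LSTM weights would come from training as described above.

```python
# Illustrative sketch only: STFT spectrum -> RNN features -> clustered audio cuts.
import librosa
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def audio_cuts(music_path, n_clusters=8, hop_length=512, hidden=64):
    y, sr = librosa.load(music_path, sr=None)
    # Spectrum of the audio via Short-Time Fourier Transform (magnitude only).
    spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=hop_length))   # (freq, time)
    frames = torch.tensor(spec.T, dtype=torch.float32).unsqueeze(0)     # (1, time, freq)
    # RNN (LSTM) pass capturing temporal dependencies in the audio signal;
    # shown untrained here purely to indicate the data flow.
    lstm = nn.LSTM(input_size=frames.shape[-1], hidden_size=hidden, batch_first=True)
    with torch.no_grad():
        feats, _ = lstm(frames)                                         # (1, time, hidden)
    feats = feats.squeeze(0).numpy()
    # Group time steps with similar audio characteristics; the boundaries
    # between clusters become candidate audio cuts (scene-change timestamps).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    change_points = np.flatnonzero(np.diff(labels)) + 1
    return librosa.frames_to_time(change_points, sr=sr, hop_length=hop_length)
```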
The selection module (22), as depicted in Figures 8-12, selects video segments for inclusion in the final highlight movie. This process utilizes several inputs: the raw video files, a flow document (if provided), and a background music file (if provided).
Now, if a flow document is provided, it is processed using a Bidirectional Encoder Representations from Transformers (BERT) model. The pre-trained BERT model is fine-tuned on a dataset of flow documents and corresponding video segments to optimize its performance for the task of generating contextualized sentence embeddings that are relevant for video selection. This pre-trained natural language processing model generates contextualized word/sentence embeddings that capture the semantic meaning and intent expressed in the document. These embeddings represent the user's preferences for storyline, focus, and other thematic elements.
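By way of a non-limiting illustration, the following is a minimal sketch of the flow-document embedding step, using the Hugging Face transformers library with a generic pre-trained BERT checkpoint; in the system, this model would additionally be fine-tuned on flow documents paired with video segments, and the example sentence is hypothetical.

```python
# Illustrative sketch only: contextualized sentence embedding from a flow document.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed_flow_sentence(sentence):
    """Return a mean-pooled contextualized embedding for one flow-document sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state      # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)               # (768,)

# e.g. embed_flow_sentence("Open with the sunrise drone shots of the venue.")
```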
If a background music file is provided, audio features are extracted and processed by the audio processing module (18), as described earlier. This includes generating a spectrum using STFT (or a similar technique) and using an RNN to extract audio features and identify segments with similar audio characteristics. These audio segments are timestamped.
If no separate background music file is provided, the audio tracks within the raw video files are used. These audio tracks are processed similarly to the background music, using STFT and an RNN to extract features and identify segments with similar audio characteristics within each video file. These video-associated audio segments are also timestamped.
The raw video files are sorted temporally based on their timestamps. This chronological organization facilitates the selection and sequencing of video segments.
The selection module (22) then performs a multi-modal matching and ranking process. Scene change timestamps are derived from the processed audio (either from the background music file or the audio tracks of the raw videos) or, if provided, from the flow document. These timestamps define the segments of the final highlight movie.
For each scene defined by the timestamps, the module (22) identifies the most suitable video segment. The matching process compares the following features in a high-dimensional space:
- the 1-dimensional vectors representing the video segments, generated by the video features extraction module (16).
- the contextualized word/sentence embeddings from the flow document.
- the extracted audio features from either the background music file or the raw video tracks.
A defined matching algorithm, such as cosine similarity, is used to calculate the similarity between these feature vectors. The specific implementation of the matching algorithm may involve weighting the different modalities (video, text, audio) based on their relative importance. For instance, if the user emphasizes the storyline in the flow document, the text embeddings might be given a higher weight.
The raw video segments are then ranked based on their similarity scores. The video segment with the highest similarity score is initially selected for the current scene.
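By way of a non-limiting illustration, the following is a minimal sketch of the multi-modal matching and ranking step, assuming all modality embeddings have already been scaled to a common dimension as described above; the modality weights shown are illustrative defaults.

```python
# Illustrative sketch only: cosine-similarity ranking of candidate segments
# against one scene's text and audio embeddings.
import numpy as np

def rank_segments(video_vecs, text_vec=None, audio_vec=None,
                  w_text=1.0, w_audio=1.0):
    """Rank candidate v_S vectors for one scene; highest score first."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scores = []
    for i, v in enumerate(video_vecs):             # one v_S per raw video segment
        num, den = 0.0, 0.0
        if text_vec is not None:                   # flow-document (BERT) embedding
            num += w_text * cos(v, text_vec); den += w_text
        if audio_vec is not None:                  # audio features for this scene
            num += w_audio * cos(v, audio_vec); den += w_audio
        scores.append((num / den if den else 0.0, i))
    return [i for _, i in sorted(scores, reverse=True)]
```

The highest-ranked segment is tried first for the scene; as described next, it is then checked against the flow document's specifications before being committed.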
If a flow document is provided, the system verifies whether the initially selected video segment aligns with the flow document's specifications (e.g., desired focus on specific entities, mood, or storyline elements). If the alignment is satisfactory, the video segment is chosen for the scene. If the alignment is not satisfactory, the system iteratively assesses additional video segments from the ranked list until a suitable video segment that meets the flow document's criteria is identified. This iterative process ensures that the selected video segments are both relevant to the scene and consistent with the user's overall vision.
If only a background music file is provided (and no flow document), the selection process focuses on matching audio and video features. For each scene defined by the background music's timestamps, the module (22) compares the audio features of the corresponding music segment with the video features of each raw video segment. The video segment with the highest audio-video similarity is selected.
If only a flow document is provided (and no background music), the selection process matches video segments based on both the textual information from the flow document and the audio characteristics of the video segments themselves. The system ranks the video segments based on their alignment with the flow document’s narrative and how well the grouped audio within the videos matches the text descriptions in the flow document.
This multi-modal matching and ranking process, combined with the logic for integrating the flow document (if provided), ensures that the selected video segments create a coherent and engaging highlight movie that reflects the user's input and preferences.
The video cuts identification module (24) determines the precise start and end times for each selected video segment, as depicted in Figure 13. For each selected video segment *S = {I_1, I_2, ..., I_m}*, the module (24) receives:
- A sequence of frame-level feature vectors *F_frame = {f_1, f_2, ..., f_m}* (E3:01) from the aggregation module (14), where *f_i* corresponds to frame *I_i*.
- A sequence of attention weights *A_frame = {a_1, a_2, ..., a_m}* corresponding to each frame, derived from the attention module of the identification module (12).
These inputs are processed through a frame-level attention mechanism *Att_cuts(F_frame, A_frame; W_Att_cuts, b_Att_cuts)* (where *W_Att_cuts* and *b_Att_cuts* are the weights and biases, respectively) to identify the most important or salient frames within the video segment. This attention mechanism weights the frame-level features *f_i* based on their corresponding attention weights *a_i*. The output of this attention layer is a set of attended frame features *F_attended = {f'_1, f'_2, ..., f'_m}*, where *f'_i = Att_cuts(f_i, a_i)*.
The attended frame features *F_attended* are then passed through a neural network with fully connected layers *FC_cuts(F_attended; W_FC_cuts, b_FC_cuts)* (where *W_FC_cuts* and *b_FC_cuts* are the weights and biases, respectively). In one embodiment, this neural network is a multilayer perceptron (MLP). The MLP takes as input the attended frame features and predicts the start and end times of the video cut. The architecture of the MLP (number of layers, neurons per layer, activation functions) is optimized during training. The loss function used for training could be a combination of L1 or L2 loss on the start and end times, as well as a term to encourage temporal smoothness of cuts. This neural network maps the attended frame features to the start time *t_start* and end time *t_end* for the video cut. The network may be trained using a loss function that considers both the accuracy of the predicted start and end times and potentially other factors, such as the duration of the cut or its alignment with audio events. In some embodiments, the neural network may predict offsets from salient frames rather than absolute times.
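By way of a non-limiting illustration, the following is a minimal sketch of the MLP embodiment of the video cuts identification module (24); predicting the start and end as normalised fractions of the segment length, and the hidden width, are illustrative assumptions.

```python
# Illustrative sketch only: attended frame features -> (t_start, t_end).
import torch
import torch.nn as nn

class VideoCutsIdentifier(nn.Module):
    def __init__(self, frame_feat_dim=1, hidden=64):
        super().__init__()
        # Att_cuts: re-weight frame-level features f_i by their attention a_i.
        self.attn_proj = nn.Linear(frame_feat_dim + 1, 1)
        # FC_cuts: map attended frame features to the cut boundaries.
        self.mlp = nn.Sequential(
            nn.Linear(frame_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Sigmoid(),      # fractions of segment length
        )

    def forward(self, frame_feats, frame_attn):
        # frame_feats: (batch, m, frame_feat_dim); frame_attn: (batch, m)
        logits = self.attn_proj(torch.cat([frame_feats,
                                           frame_attn.unsqueeze(-1)], dim=-1))
        weights = torch.softmax(logits, dim=1)             # salient-frame weights
        attended = (weights * frame_feats).sum(dim=1)      # pooled F_attended
        bounds = self.mlp(attended)                        # (t_start, t_end) in [0, 1]
        return bounds.sort(dim=-1).values                  # enforce start <= end
```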
Further, the color correction module (26) adjusts the color of each frame in the selected video segments, as depicted in Figure 14. It employs a color correction algorithm based on analyzing histograms of the red, green, blue, and luminance (or other color space components) channels of the frames. Specifically, the module (26) may perform the following, a non-limiting illustrative example of which is sketched after this list:
- Calculate the average color and luminance values for each channel in the selected video segments.
- Compare these average values to target values (which could be based on a reference video, user preferences, or a global average across all videos).
- Apply color transfer techniques, such as histogram matching or color mapping, to adjust the exposure, saturation, white balance, and other color properties of each frame to match the target values, ensuring consistency across videos captured with different cameras and lighting conditions. Specific color correction means, such as those based on 3D Look-Up Tables (LUTs) or other color grading techniques, may be employed.
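By way of a non-limiting example, per-channel histogram matching of a frame to a reference may be sketched as follows, assuming scikit-image (≥ 0.19); the use of a single reference frame, rather than a reference video or a global average, and the helper names are illustrative assumptions.

```python
# Illustrative sketch of histogram-matching colour transfer for module (26).
import numpy as np
from skimage.exposure import match_histograms

def color_correct_frame(frame: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Match the per-channel histograms of `frame` to those of `reference`.

    frame, reference : H x W x 3 RGB arrays.
    """
    return match_histograms(frame, reference, channel_axis=-1)

def average_channel_values(frames) -> np.ndarray:
    """Per-channel mean over a list of frames, usable as simple target values."""
    return np.mean([f.reshape(-1, 3).mean(axis=0) for f in frames], axis=0)
```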
According to a further embodiment of the present invention, the creation module (28) combines the selected video segments, using the determined start and end times *t_start* and *t_end*, and optionally combines them with the background music (from the audio processing module (18)), to generate the final highlight movie. The creation module (28) optimizes transitions between selected video segments to ensure visual smoothness. This may involve techniques like crossfades, dissolves, or other transition effects. The module receives input from the color correction module (26), ensuring that the color-corrected frames are used in the final movie. The output of this module is the final highlight movie.
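A non-limiting sketch of this assembly step, assuming the MoviePy 1.x library, is given below; the file paths, the one-second crossfade duration, and the output settings are illustrative assumptions rather than a definitive implementation.

```python
# Illustrative sketch of the creation module (28): trim each selected segment to
# its predicted (t_start, t_end), join the segments with crossfade transitions,
# and optionally lay the background music underneath.
from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips

def build_highlight(cuts, music_path=None, out_path="highlight.mp4", fade=1.0):
    """cuts: list of (video_path, t_start, t_end) tuples from module (24)."""
    clips = []
    for i, (path, t_start, t_end) in enumerate(cuts):
        clip = VideoFileClip(path).subclip(t_start, t_end)
        if i > 0:
            clip = clip.crossfadein(fade)       # smooth transition between segments
        clips.append(clip)
    movie = concatenate_videoclips(clips, method="compose", padding=-fade)
    if music_path is not None:
        music = AudioFileClip(music_path).subclip(0, movie.duration)
        movie = movie.set_audio(music)          # synchronise background music
    movie.write_videofile(out_path)
```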
According to other embodiments of the present invention, there is provided a method for generating a highlight movie from a plurality of raw video files, and optionally, a flow document and/or a background music file, using a system comprising an input device, a processing unit, a memory, and an output device, the method comprising the steps of:
1. Receiving, via the input device, a plurality of raw video files, and optionally, a flow document and/or a background music file.
2. Storing the received raw video files, and optionally, the flow document and/or the background music file in the memory.
3. For each frame of the raw video files, extracting visual features using a region-based Convolutional Neural Network (CNN) model within the single frame features extraction module, the CNN model comprising:
a. Processing the frame through a plurality of convolutional layers to generate a feature map.
b. Partitioning the feature map into a plurality of partially overlapping regions of varying sizes.
c. Processing each region through a separate channel of convolutional layers to generate a 1-dimensional feature vector for each region.
4. For each frame, identifying key entities within the identification module by:
a. Inputting the 1-dimensional feature vectors from step 3 into an attention module to assign attention weights to each region, the attention weights representing the importance of each region for entity identification.
b. Inputting the 1-dimensional feature vectors from step 3 into a categorization module to classify the regions and identify the presence and type of entities within each region.
5. For each frame, aggregating the 1-dimensional feature vectors from step 3 and the entity identification information from step 4 within the aggregation module using a neural network to produce a single integer representation for each frame.
6. For each video segment (a group of consecutive frames), extracting visual and temporal features within the video features extraction module by:
a. Processing the video segment through a plurality of convolutional layers to generate a spatio-temporal feature representation.
b. Partitioning the spatio-temporal feature representation into regions, each region corresponding to a frame in the segment.
c. Processing each region through separate convolutional channels to generate regional video features.
d. Appending the frame-level feature vectors from step 5 to their corresponding regional video features.
e. Applying an attention mechanism to the appended vectors to determine the importance of each frame within the video segment.
f. Fusing the attended vectors using fully connected layers to produce a 1-dimensional vector representing the video segment.
7. If a background music file is received, extracting and processing audio features within the audio processing module by:
a. Generating a spectrum of the audio file using a Short-Time Fourier Transform (STFT) or a similar technique.
b. Processing the spectrum sequentially using a recurrent neural network (RNN) to extract audio features and identify segments with similar audio characteristics (a non-limiting illustrative example of this processing is provided after step 13).
8. If no background music file is received, extracting and processing audio features from the audio tracks of the raw video files using the same method as in step 7.
9. If a flow document is received, processing the flow document using a Bidirectional Encoder Representations from Transformers (BERT) model to generate contextualized word or sentence embeddings representing the semantic content of the document (a non-limiting illustrative example is provided after step 13).
10. Selecting video segments for inclusion in the highlight movie within the selection module by:
a. Defining scene change timestamps based on the processed audio (from step 7 or 8) or the flow document (if provided).
b. For each scene defined by the timestamps:
i. Performing a multi-modal matching process that integrates the 1-dimensional vectors representing the video segments (from step 6), optionally the text embeddings (from step 9), and optionally the audio features (from step 7 or 8). The matching process uses a defined algorithm (e.g., cosine similarity) to compare the feature vectors in a high-dimensional space, considering the scene change timestamps. Weights may be applied to different modalities.
ii. Ranking the video segments based on the results of the multi-modal matching process.
iii. Selecting the highest-ranked video segment for the current scene.
11. Identifying precise start and end times for the selected video segments within the video cuts identification module by:
a. Inputting the frame-level feature vectors from step 5 and the attention outputs from step 4 for each selected video segment into an attention mechanism to identify salient frames.
b. Inputting the output of the attention mechanism into a neural network with fully connected layers to predict the start and end times for each selected video segment.
12. Adjusting the color of each frame in the selected video segments within the color correction module using a defined color correction algorithm based on analyzing histograms of color channels.
13. Combining the selected video segments, using the start and end times identified in step 11, and optionally combining with the background music file (if provided), to generate the final highlight movie within the creation module, and outputting the final highlight movie via the output device.
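By way of a non-limiting example, steps 7a and 7b above may be sketched as follows, assuming the librosa and PyTorch libraries; the STFT parameters and the GRU hidden size are illustrative assumptions, and any clustering of segments with similar audio characteristics is omitted for brevity.

```python
# Illustrative sketch of steps 7a-7b: STFT spectrum plus a recurrent network
# over its frames to obtain sequential audio features.
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_features(path: str, n_fft: int = 2048, hop: int = 512, hidden: int = 128):
    y, sr = librosa.load(path, sr=None)                             # load audio (step 7)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))     # 7a: STFT spectrum
    frames = torch.tensor(spec.T, dtype=torch.float32).unsqueeze(0) # (1, T, freq)
    rnn = nn.GRU(input_size=frames.shape[-1], hidden_size=hidden, batch_first=True)
    feats, _ = rnn(frames)                                          # 7b: sequential RNN features
    return feats.squeeze(0)                                         # (T, hidden) per-frame features
```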
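Similarly, step 9 may be sketched, by way of a non-limiting example, using the Hugging Face transformers library; mean pooling over token embeddings is one illustrative choice for obtaining sentence-level embeddings and is not mandated by the method.

```python
# Illustrative sketch of step 9: BERT embeddings for flow document sentences.
import torch
from transformers import AutoTokenizer, AutoModel

def flow_document_embeddings(sentences, model_name="bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state           # contextualized token embeddings
    mask = enc["attention_mask"].unsqueeze(-1)
    # Mean-pool over real tokens to obtain one embedding per sentence/scene.
    return (out * mask).sum(dim=1) / mask.sum(dim=1)
```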
The foregoing is considered as illustrative only of the principles of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact configuration and operation shown and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.
ADVANTAGES OF THE INVENTION:
The present invention, a system for generating highlight movies from multiple raw videos, offers several key advantages:
1. The system automates the process of identifying and combining key moments from multiple raw video files, including event footage (weddings, corporate events, sports events, etc.), significantly reducing the time and effort required for highlight movie creation. This automation translates to lower production costs and increased efficiency compared to traditional manual editing, addressing the limitations of resource-intensive manual processes.
2. By leveraging Artificial Intelligence (AI)-powered analysis, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and a Bidirectional Encoder Representations from Transformers (BERT) model, the system prioritizes essential and significant video segments for inclusion in the final highlight movie. This intelligent content selection ensures that important moments (e.g., key wedding rituals, crucial plays in a sporting event) and points of interest (e.g., specific individuals, objects, or actions) are prominently featured, resulting in a more impactful and engaging narrative. This overcomes the challenge of sifting through large volumes of footage to identify key content.
3. The system intelligently determines optimal start and end times for selected video segments, avoiding abrupt or awkward transitions that can disrupt the viewing experience. This automated cut selection results in a polished and professional final product, addressing the limitations of manual cut selection which can be subjective and time-consuming.
4. The system accommodates user input through a flow document, enabling users to specify preferences for storyline, focus on particular entities, desired mood, and other parameters. This customization allows for personalized highlight movies tailored to individual needs and tastes, overcoming the limitations of generic automated editing.
5. The system automatically integrates background music and synchronizes it with the selected video segments. The audio processing module identifies segments with similar audio characteristics and can match them to segments in the background music, creating a more immersive and emotionally resonant viewing experience. This automated synchronization addresses the challenges of manual audio editing and synchronization.
6. The automated nature of the system drastically reduces the time required for highlight movie creation, from weeks or months with manual editing to minutes. This increased efficiency allows for rapid turnaround times and quicker delivery of the final product. Furthermore, the system's software-based implementation enables easy scalability. Multiple instances of the system can be deployed (e.g., on a cloud platform) to handle numerous editing tasks concurrently, a capability that is difficult and costly to replicate with human editors. This addresses the scalability limitations of manual editing workflows.
7. Automated editing ensures consistent quality and avoids the creative burnout that human editors can experience, especially when working on repetitive tasks. The system can operate continuously without breaks or fatigue, maintaining a high level of performance. This addresses the variability and potential for errors associated with human editors, particularly in repetitive tasks.
CLAIMS:
We Claim:
1. A system (100) for generating a highlight movie from a plurality of raw video files, comprising:
a. an input device (102) configured to receive the plurality of raw video files, and optionally, a flow document and/or a background music file;
b. a memory (104) configured to store the raw video files, and optionally, the flow document and/or the background music file;
c. a processing unit (106) comprising:
i. a single frame features extraction module (10) configured to extract, for each frame of the raw video files, 1-dimensional feature vectors for a plurality of partially overlapping regions of the frame using a region-based Convolutional Neural Network (CNN);
ii. an identification module (12) configured to identify, for each frame, key entities within the frame using the 1-dimensional feature vectors, and to generate attention weights for the regions and entity labels for each region;
iii. an aggregation module (14) configured to generate, for each frame, a single integer representation by aggregating the 1-dimensional feature vectors and the entity identification information from the identification module (12);
iv. a video features extraction module (16) configured to extract, for each video segment comprising a plurality of frames, a 1-dimensional vector representing the video segment, wherein the module (16) appends the single integer representations and attention weights for each frame to regional video features before generating the 1-dimensional vector;
v. an audio processing module (18) configured to extract and process audio features from the background music file, if provided, or from audio tracks within the raw video files, comprising generating a spectrum of the audio and processing the spectrum using a recurrent neural network (RNN) to extract audio features and identify segments with similar audio characteristics;
vi. a selection module (22) configured to select video segments for inclusion in the highlight movie by performing a multi-modal matching and ranking process that integrates the 1-dimensional vectors representing the video segments, and optionally, contextualized word/sentence embeddings from the flow document (if provided) and/or the extracted audio features, wherein the matching process considers scene change timestamps derived from the audio processing or the flow document;
vii. a video cuts identification module (24) configured to identify precise start and end times for the selected video segments by using an attention mechanism that receives the single integer representations and attention weights for each frame, and a neural network to predict the start and end times;
viii. a color correction module (26) configured to adjust the color of each frame in the selected video segments using a defined color correction algorithm based on analyzing histograms of color channels; and
ix. a creation module (28) configured to generate the highlight movie by combining the selected video segments using the start and end times, and optionally, the background music; and
d. an output device (108) configured to output the generated highlight movie.
2. The system (100) as claimed in claim 1, wherein the single frame features extraction module (10) comprises:
a. a plurality of convolutional layers to generate a feature map for each frame;
b. means for partitioning the feature map into a plurality of partially overlapping regions of varying sizes; and
c. a plurality of convolutional channels, each channel processing a respective region to generate the 1-dimensional feature vector for the region.
3. The system (100) as claimed in claim 1, wherein the identification module (12) comprises:
a. an attention module configured to receive the 1-dimensional feature vectors from the single frame features extraction module (10) and assign the attention weights to each region; and
b. a categorization module configured to receive the 1-dimensional feature vectors from the single frame features extraction module (10) and identify the presence and type of entities within each region to generate the entity labels.
4. The system (100) as claimed in claim 1, wherein the video features extraction module (16) comprises:
a. means for grouping a plurality of frames to form a video segment;
b. a plurality of convolutional layers to generate a spatial and temporal feature representation for the video segment;
c. means for partitioning the spatial and temporal feature representation into regions, each region corresponding to a frame in the segment;
d. a plurality of convolutional channels, each channel processing a respective region to generate regional video features;
e. means for appending the single integer representations from the aggregation module (14) and attention outputs from the identification module (12) to their corresponding regional video features;
f. an attention mechanism to determine the importance of each frame within the video segment; and
g. fully connected layers to generate the 1-dimensional vector representing the video segment.
5. The system (100) as claimed in claim 1, wherein the selection module (22) comprises:
a. means for processing the received flow document (if provided) using a Bidirectional Encoder Representations from Transformers (BERT) model to generate contextualized word/sentence embeddings; and
b. a defined matching algorithm to compare feature vectors in a high-dimensional space, wherein the matching algorithm considers scene change timestamps derived from the audio processing or the flow document.
6. The system (100) as claimed in claim 1, wherein the video cuts identification module (24) comprises:
a. an attention mechanism to identify salient frames within each selected video segment, receiving as input the single integer representations and the attention weights for each frame; and
b. a neural network with fully connected layers to predict the start and end times for each selected video segment based on the attended frame features.
7. The system (100) as claimed in claim 1, wherein the color correction module (26) uses a color correction algorithm that analyzes histograms of color channels to adjust exposure, saturation, and white balance.
8. The system (100) as claimed in claim 1, wherein the audio processing module (18) uses a clustering algorithm to identify segments with similar audio characteristics.
9. The system (100) as claimed in claim 1, wherein the selection module (22) weights different modalities (video, text, audio) based on their relative importance during the multi-modal matching and ranking process.
10. The system (100) as claimed in claim 1, wherein the video cuts identification module (24) predicts offsets from salient frames to determine the start and end times.
Dated this 13th day of February, 2025.
___________________
GOPI JATIN TRIVEDI
IN/PA-993
Authorized Agent of Applicant
To,
The Controller of Patents,
The Patent Office,
At Mumbai.