Abstract: SYSTEMS AND METHODS FOR LOCALIZING MOMENTS, IN SURVEILLANCE VIDEOS, USING NATURAL LANGUAGE QUERIES. This invention discloses a system for localizing moments, in surveillance videos, using natural language queries, said localized moment being a video span as an output of a query-video pair, said localized moment having a start time for said video and an end time for said video. [[FIGURE 1]]
FIELD OF THE INVENTION:
This invention relates to the field of computer engineering and video technologies.
Particularly, this invention relates to systems and methods for localizing moments, in surveillance videos, using natural language queries.
BACKGROUND OF THE INVENTION:
According to the prior art, machine learning-based anomaly detection algorithms are pervasive in CCTV surveillance systems, but these are limited to detecting the presence of a predefined set of activities.
Additionally, these anomaly detection systems are only conditioned to detect frames where the activity is taking place and do not account for a cause or a method in which the unusual activity took place.
Furthermore, these prior art systems can, effectively, detect unusual activities, but none of them can ‘localize’ moments, in video streams, based on human queries.
As per prior art, such a task requires human operators to manually inspect footage for many hours at a stretch to find a particular activity in the video.
The goal of moment detection is to refer to a particular temporal segment in the video based on a free-form natural language query. A human user who relies on manual screening finds this task extremely arduous, particularly for long continuous video streams.
Additionally, almost all prior art has concentrated its efforts on designing systems for short video clips from the internet.
Therefore, there is a need to provide a solution to these problems.
PRIOR ART:
In a prior art document, the research paper QV HIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries, a large-scale multimodal transformer was trained on a high-quality dataset. This approach uses high-quality videos with easily discernible activities and actors.
FIGURE 1 illustrates this prior art’s architecture (QV HIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries).
The architecture is simple, with a transformer encoder-decoder and three prediction heads for predicting saliency scores, fore-/back-ground scores, and moment coordinates. For brevity, the video and text feature extractors are not shown in this figure.
Given a natural language query q of Lq tokens, and a video v comprised of a sequence of Lv clips, they aim to localize one or more moments { mi } (a moment is a consecutive subset of clips in v), as well as predict clip-wise saliency scores S ∈ ℝ^(Lv) (the highest scored clips are selected as highlights).
The overall architecture of Moment-DETR is given below:
Input representations: The input to the transformer encoder is the concatenation of projected video and query text features. For video, they use SlowFast and the video encoder (ViT-B/32) of CLIP to extract features every 2 seconds. They then normalize the two features and concatenate them at the hidden dimension. The resulting video feature is denoted as Ev ∈ ℝ^(Lv × 2816). For query text, they use the CLIP text encoder to extract token-level features, Eq ∈ ℝ^(Lq × 512). Next, they use separate 2-layer perceptrons with layernorm and dropout to project the video and query features into a shared embedding space of size d. The projected features are concatenated at the length dimension as the input to the transformer encoder, denoted as Einput ∈ ℝ^(L × d), where L = Lv + Lq.
Transformer encoder-decoder: The video and query input sequence is encoded using a stack of T transformer encoder layers. Each encoder layer has the same architecture as in previous work, with a multi-head self-attention layer and a feed-forward network (FFN). Since the transformer architecture is permutation-invariant, fixed positional encodings are added to the input of each attention layer. The output of the encoder is Eenc ∈ ℝ^(L × d). The transformer decoder is a stack of T transformer decoder layers. Each decoder layer consists of a multi-head self-attention layer, a cross-attention layer (that allows interaction between the encoder outputs and the decoder inputs), and an FFN. The decoder input is a set of N trainable positional embeddings of size d, referred to as moment queries. These embeddings are added to the input of each attention layer as in the encoder layers. The output of the decoder is Edec ∈ ℝ^(N × d).
Prediction heads: Given the encoder output Eenc, they use a linear layer to predict saliency scores S ∈ ℝ^(Lv) for the input video. Given the decoder output Edec, they use a 3-layer FFN with ReLU to predict the normalized moment center coordinate and width w.r.t. the input video. They also follow DETR to use a linear layer with softmax to predict class labels. In DETR, this layer is trained with object class labels. In their task, since class labels are not available, for a predicted moment, they assign it a foreground label if it matches the ground truth, and background otherwise.
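For illustration only, the prediction heads described above can be sketched in PyTorch as follows. The hidden size of 256 is taken from the later discussion of the prior art; the exact layer ordering and the use of a sigmoid to keep span outputs normalized are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Illustrative heads mirroring the Moment-DETR description above."""
    def __init__(self, d: int = 256):
        super().__init__()
        # Linear head on the encoder output -> one saliency score per clip.
        self.saliency = nn.Linear(d, 1)
        # 3-layer FFN with ReLU on the decoder output -> normalized (center, width).
        self.span = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 2), nn.Sigmoid(),
        )
        # Linear layer (with softmax) -> foreground / background label per moment query.
        self.cls = nn.Linear(d, 2)

    def forward(self, enc_out: torch.Tensor, dec_out: torch.Tensor):
        saliency = self.saliency(enc_out).squeeze(-1)   # (L,) clip-wise saliency scores
        spans = self.span(dec_out)                      # (N, 2) center and width in [0, 1]
        probs = self.cls(dec_out).softmax(dim=-1)       # (N, 2) foreground / background
        return saliency, spans, probs

heads = PredictionHeads()
saliency, spans, probs = heads(torch.randn(40, 256), torch.randn(10, 256))
```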
According to another prior art document, in an approach used in the Localizing Moments in Video with Natural Language research paper, a latent context variable is used for better reasoning over temporal context but it requires short high-quality video clips of a fixed length.
According to another prior art document, the approach in the research paper Real-world Anomaly Detection in Surveillance Videos uses multiple instance learning (MIL). It also presents a new 128-hour video dataset, which is the first of its kind at a large scale. However, it fails to detect an anomalous part because of the darkness of the scene. It also fails to identify normal group activity.
According to another prior art document, an approach used in the research paper Anomaly Event Detection in Security Surveillance Using Two-Stream Based Model considers the fusion of two streams with the same or a different number of layers respectively. However, it lacks video segment-level annotations.
According to another prior art document, an approach used in the research paper Deep anomaly detection through visual attention in surveillance videos introduces the concept of visual attention to help pinpoint the region of interest (ROI). Its disadvantage is that it only highlights the region of interest, and the users still have to scan those frames to find the crime scene or the scene of interest.
According to another prior art document, an approach used in the research paper Real-time anomaly recognition through CCTV using neural networks identifies and classifies levels of high-intensity movement in the frames but does not classify or identify them as abnormal or suspicious.
Therefore, there is a need to provide a solution to these problems.
OBJECTS OF THE INVENTION:
An object of the invention is to automate the process of localizing moments in surveillance grade video streams based on human language queries.
Another object of the invention is to provide a system and method for analyzing surveillance grade videos without the need for the video to have easily discernible objects and high-quality video scenes.
Yet another object of the invention is to provide a system and method for analyzing videos without the need for the video to be short and/or high quality.
Still another object of the invention is to provide a system and method for analyzing videos without the need for the video to be bright.
An additional object of the invention is to provide a system and method for analyzing any type of activity in videos.
An additional object of the invention is to provide a system and method for analyzing videos without the need for manually scanning frames in order to find a crime scene or a scene of interest or activity.
Another object of the invention is to provide a system and method for analyzing videos and to classify and/or identify the videos to be abnormal or suspicious in correlation with the activity in the video.
SUMMARY OF THE INVENTION:
According to this invention, there are provided systems and methods for localizing moments, in surveillance videos, using natural language queries.
This invention provides the application of moment localization to surveillance videos from real-world CCTV cameras where this issue is most pressing.
The current invention defines a multi-modal system that can efficiently localize moments from CCTV video streams when provided with a natural-language query. Due to the absence of a dedicated dataset for this task, the current invention presents a novel dataset, complete with videos and annotations (referred to as “CamPark-Captions”). The system localizes moments in videos using natural language queries. The goal is to predict a span as a tuple of times, measured in seconds from the start of the video, represented as [start, end].
The current invention also defines two large-scale multimodal neural network architectures to solve this task.
According to this invention, there is provided a system for localizing moments, in surveillance videos, using natural language queries, said localized moment being a video span as an output of a query-video pair, said localized moment having a start time for said video and an end time for said video, said system comprising:
- a dataset consisting, essentially, of videos with predefined textual and temporal annotations;
- a video input module configured to allow input of videos to be analyzed for providing a first component of said query-video pair;
- a query input module configured to allow input of query by a user in relation to videos for providing a second component of said query-video pair;
- a trained state-of-the-art multimodal transformer model, using data items from the dataset, for purposes of benchmarking, said transformer model responsive to each of said query-video pairs to predict a defined number of moments ranked by confidence score, said model consisting, essentially, of a computer processor, with a memory to store instructions, the instructions being:
o providing a transformer having stack of transformer encoders followed by a stack of transformer decoders, wherein decoder output is passed through feed-forward network layers to predict said video span of a moment;
o converting said input videos to video feature vectors;
o converting said input text to text feature vectors;
o concatenating, said video feature vectors and said text feature vectors, along a hidden dimension;
o feeding said concatenated vectors to said transformer; and
o determining an output, of said localized moment, based on peak mean average precision (mAP) at a defined threshold.
In at least an embodiment, said confidence score being selected from (0.0 - 1.0) with an empirically set threshold of 0.55.
In at least an embodiment, said step of converting said input videos to video feature vectors comprising a step of using a SlowFast Network.
In at least an embodiment, said step of converting said input text to text feature vectors comprising a step of using a CLIP-textual encoder to obtain token-level feature vectors of size 512.
In at least an embodiment, said model comprising a preprocessor configured to preprocess videos by splitting said videos and by merging said videos, in that, videos having a pre-defined time duration being merged with videos having a time span larger than said pre-defined time duration in order to ensure removal of redundant parts of said video.
In at least an embodiment, said transformer decoder being passed through a feed-forward network activated by ReLU to predict moment coordinates and foreground-background class labels.
In at least an embodiment, said videos being annotated videos, having annotated fields, said annotated fields selected from a group of annotations consisting of a query relevant to said video, a name of said video, an identity of a query relevant to said video, a start time for said video, and an end time for said video.
In at least an embodiment, said peak mAP being encountered near a 30th epoch of training.
In at least an embodiment, said dataset, of videos, comprising an altered dataset, said alteration being an effect alteration of darkening said videos by adding -100 value to all pixel intensities.
In at least an embodiment, said dataset, of videos, comprising an altered dataset, said alteration being an effect alteration of blurring said videos by adding gaussian blur to all videos.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS:
FIGURE 1 illustrates this prior art’s architecture (QV HIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries).
This invention will now be described in relation to the accompanying drawings, in which:
FIGURE 2 illustrates a schematic block diagram of the system of this invention;
FIGURE 3 illustrates an input visual representation for the system of this invention;
FIGURE 4 illustrates an input textual representation for the system of this invention;
FIGURE 5 illustrates joint input representation for the system of this invention;
FIGURE 6 illustrates a transformer model architecture for the system of this invention;
FIGURES 7, 8, and 9 indicate the mean average precision (mAP) during the training process at different IOU thresholds;
FIGURES 10 and 11 indicate the plot of Recall@1 which measures the presence of the ground truth in the first rank of a given ranked output;
FIGURE 12 illustrates performance comparison between the base and robust model on altered videos; and
FIGURE 13 illustrates performance comparison between base and robust model on paraphrased queries.
DETAILED DESCRIPTION OF THE ACCOMPANYING DRAWINGS:
According to this invention, there are provided systems and methods for localizing moments, in surveillance videos, using natural language queries.
In at least an embodiment of this invention, there is provided a dataset, complete with videos and annotations. This dataset comprises videos with predefined textual and temporal annotations.
In at least an embodiment of this invention, there is provided a video input module configured to allow input of videos to be analyzed.
In at least an embodiment of this invention, there is provided a query input module configured to allow input of query by a user in relation to videos.
In at least an embodiment of this invention, there is provided a trained state-of-the-art multimodal transformer model, using data items from the dataset, for the purpose of benchmarking. There are defined, according to this invention, two large-scale multimodal neural network architectures.
MODEL PREDICTION
Prediction: For each query-video pair, the model predicts 10 possible moments ranked by their confidence score. The system and method, of this invention, is configured to select the top-most prediction for evaluation and inference. To discard irrelevant queries, the system and method compares the confidence score (0.0 - 1.0) of the topmost prediction with an empirically set threshold of 0.55. All predictions below the threshold are rejected and the query is deemed unfit for the video. Additionally, the system and method, of this invention, is configured to curate negative samples while training the model by cross-matching queries and videos. The model is rewarded if a span of [0,0] is predicted for such irrelevant queries and is penalized otherwise.
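A minimal sketch of the thresholding logic described above, assuming the model returns its ten (span, confidence) pairs already ranked by confidence; the 0.55 value is the empirically set threshold mentioned in this embodiment, and the function name is illustrative.

```python
from typing import List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.55  # empirically set threshold, per the embodiment above

def select_moment(predictions: List[Tuple[Tuple[float, float], float]]
                  ) -> Optional[Tuple[float, float]]:
    """Pick the top-ranked predicted span, or reject the query entirely.

    `predictions` is assumed to be the 10 (span, confidence) pairs ranked by
    confidence; a returned value of None means the query is deemed unfit
    for the video (analogous to the [0, 0] span used during training).
    """
    if not predictions:
        return None
    (start, end), confidence = predictions[0]
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # all predictions below the threshold are rejected
    return (start, end)

# Example: the top prediction localizes the moment at 4.0 s - 11.5 s with confidence 0.81.
print(select_moment([((4.0, 11.5), 0.81), ((0.0, 30.0), 0.40)]))
```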
In at least an embodiment, moments are localized, in videos, using natural language queries. The goal is to predict a span, of video, as a tuple of time (preferably, measured in seconds) from a start point of a video, represented as [start, end]. The logic of this invention is built on, and enhances, the Moment-DETR model, which is prior art for moment localization and highlight detection. Typically, only parts related to moment localization are retained and parts responsible for highlight detection are pruned. Moment-DETR is an end-to-end transformer-based architecture that predicts moment spans given visual and textual features.
In at least an embodiment, the Moment-DETR architecture hosts two sets of linear layers which process the output of the encoder and decoder respectively. The first linear layer takes the encoder output to predict saliency scores for each video. In the current invention's task, there is primary interest in predicting moment spans, hence this particular linear layer is masked. Consequently, the system and method, of this invention, does not need to optimize for hinge loss; instead, the system and method, of this invention, focuses on optimizing the L1 + IoU loss (Absolute Error Loss L1 + Intersection-over-Union (IoU) loss) and the Cross-Entropy loss.
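The loss terms referred to above (absolute-error L1 on the span, a temporal IoU term, and cross-entropy on the foreground/background labels) can be sketched as follows. This is a simplified single-pair illustration with assumed loss weights, not the matched-set formulation used in DETR-style training.

```python
import torch
import torch.nn.functional as F

def temporal_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between 1-D spans given as (..., 2) tensors of (start, end)."""
    inter = (torch.min(pred[..., 1], gt[..., 1])
             - torch.max(pred[..., 0], gt[..., 0])).clamp(min=0)
    union = (pred[..., 1] - pred[..., 0]) + (gt[..., 1] - gt[..., 0]) - inter
    return inter / union.clamp(min=1e-6)

def moment_loss(pred_span, gt_span, class_logits, class_target,
                w_l1=1.0, w_iou=1.0, w_ce=1.0):
    l1 = F.l1_loss(pred_span, gt_span)                    # absolute error on (start, end)
    iou = 1.0 - temporal_iou(pred_span, gt_span).mean()   # IoU loss term
    ce = F.cross_entropy(class_logits, class_target)      # foreground / background
    return w_l1 * l1 + w_iou * iou + w_ce * ce

loss = moment_loss(torch.tensor([[0.20, 0.50]]), torch.tensor([[0.25, 0.55]]),
                   torch.tensor([[2.0, -1.0]]), torch.tensor([0]))
```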
In at least an embodiment, of this invention, the saliency prediction head, which is present in the Moment-DETR architecture of FIGURE 1, is pruned. The goal of the current invention’s task is to predict moments conditioned on language queries and predicting highlights is irrelevant to this goal. Correspondingly, the system and method, of this invention, is configured to also remove the saliency loss term associated with this head. In the prior art architecture, the transformer visual feature length is set as 150 which is, now, reduced to 30 owing to the size and dimensionality of the clips used by the system and method of this invention. The system and method, of this invention, additionally, extends the Feed Forward Network by another hidden dimension.
This invention's multimodal transformer encoder-decoder architecture concerns only the prediction of moment spans. Hence, as explained above, the system and method, of this invention, removes the parts of the model contributing to saliency prediction. Additionally, the current invention's architecture changes the hidden layer dimension in the transformer to 512 (from 256) to account for longer video sequences.
Typically, a standard video feature extraction technique employs two sets of vision encoders to extract action recognition features. The current invention utilizes this method as is, and this is a standard procedure followed in most video recognition methods. Typically, a pretrained CLIP (Contrastive Language Image Pre-training) model, particularly the ViT-B/32 architecture (ViT - Vision Transformer), is used and finetuned on this invention's dataset during training. Following established convention, the system and method, of this invention, feeds projected video and text features as input to the transformer encoder. Every 2 seconds, features from the last hidden dimension of the relevant network are extracted for video using the SlowFast network and the video encoder (ViT-B/32) of CLIP. These features are then independently normalized and concatenated along the hidden dimension to get a 2816-D vector. The textual features are prepared by using the CLIP textual encoder to obtain token-level feature vectors of size 512. Finally, the system and method, of this invention, projects the video and textual features to a common dimension and concatenates them. These final features are used as input to the transformer encoder.
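Assuming the SlowFast and CLIP features have already been extracted at 2-second intervals, the projection to a common dimension and the concatenation described above can be sketched as follows. The feature sizes (2816-D video, 512-D text) and the hidden size of 512 follow the text; the exact ordering of LayerNorm, dropout, and linear layers is an illustrative assumption.

```python
import torch
import torch.nn as nn

class InputProjection(nn.Module):
    """Project video (2816-D) and text (512-D) features to a shared space
    and concatenate them along the sequence (length) dimension."""
    def __init__(self, d: int = 512, dropout: float = 0.5):
        super().__init__()
        # 2-layer perceptrons with LayerNorm and dropout, as described above.
        self.video_proj = nn.Sequential(
            nn.LayerNorm(2816), nn.Dropout(dropout),
            nn.Linear(2816, d), nn.ReLU(), nn.Linear(d, d))
        self.text_proj = nn.Sequential(
            nn.LayerNorm(512), nn.Dropout(dropout),
            nn.Linear(512, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor):
        v = self.video_proj(video_feats)   # (Lv, d)
        q = self.text_proj(text_feats)     # (Lq, d)
        return torch.cat([v, q], dim=0)    # (Lv + Lq, d) joint encoder input

proj = InputProjection(d=512)
video = torch.randn(15, 2816)   # a 30 s clip sampled every 2 s -> 15 video tokens
query = torch.randn(9, 512)     # 9 query tokens
joint = proj(video, query)      # (24, 512)
```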
Transformer Layers: The joint features are fed to a stack of transformer encoder layers. The transformer follows the same encoder-decoder structure and interleaving pattern that has previously been applied to language-only tasks. The decoder conditions on the moment queries to generate its own output. Finally, the decoder output is passed through a feed-forward network activated by ReLU to predict moment coordinates and foreground-background class labels.
The architecture is diagrammatically represented in FIGURE 6. The gray and black boxes below the transformer encoder block indicate visual and textual embeddings respectively. Visual embeddings are computed on a sampled frame-by-frame basis and textual embeddings are computed for each token in the textual query. The boxes on top of the encoder are the encoder representations for each of these embeddings. The blue boxes indicate the moment queries, which serve as decoder input. Finally, the decoder output is passed through a feed-forward network (FFN) to get the span and class predictions.
FIGURE 2 illustrates a schematic block diagram of the system of this invention.
This invention defines a stack of transformer encoders followed by a stack of transformer decoders. The decoder output is passed through feed-forward network layers to predict the span of a moment (in seconds) along with its class (foreground/background). The video and text inputs are converted to feature vectors by large-scale extractors like SlowFast and CLIP. The input to the transformer encoder is a joint representation of these video and text features, which are concatenated along the hidden dimension.
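A minimal PyTorch sketch of the encoder-decoder flow described above, with trainable moment queries and FFN heads. The hidden size of 512 and the 10 moment queries follow the text; the number of layers and attention heads are illustrative assumptions, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class MomentLocalizer(nn.Module):
    """Illustrative encoder-decoder with moment queries and FFN prediction heads."""
    def __init__(self, d=512, n_heads=8, n_layers=2, n_queries=10):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.moment_queries = nn.Parameter(torch.randn(n_queries, d))
        # FFN heads: span (start, end) normalized to [0, 1], and fg/bg class logits.
        self.span_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                       nn.Linear(d, 2), nn.Sigmoid())
        self.class_head = nn.Linear(d, 2)

    def forward(self, joint_feats: torch.Tensor):
        # joint_feats: (B, Lv + Lq, d) concatenated video and text features.
        memory = self.encoder(joint_feats)
        queries = self.moment_queries.unsqueeze(0).expand(joint_feats.size(0), -1, -1)
        dec_out = self.decoder(queries, memory)
        return self.span_head(dec_out), self.class_head(dec_out)

model = MomentLocalizer()
spans, classes = model(torch.randn(1, 24, 512))  # 10 candidate spans per query-video pair
```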
FIGURE 3 illustrates an input visual representation for the system of this invention.
FIGURE 4 illustrates an input textual representation for the system of this invention.
In at least an embodiment, projected video and text features are provided as input to a transformer encoder. As shown in FIGURE 3, for video, in preferred embodiments, a SlowFast Network and the CLIP video encoder (ViT-B/32) are used to extract features from the last hidden dimension of the respective network every 2 seconds. These features are, then, independently normalized and concatenated along a hidden dimension to get a 2816-D vector. Textual features are prepared using the CLIP textual encoder to obtain token-level feature vectors of size 512, as depicted in FIGURE 4. The extracted video and textual features are projected onto a common dimension and concatenated as shown in FIGURE 5. These final features are used as input to the transformer encoder.
FIGURE 5 illustrates joint input representation for the system of this invention.
FIGURE 6 illustrates a transformer model architecture for the system of this invention.
As seen in FIGURE 6, joint features are fed to a stack of transformer encoder layers. The transformer encoder follows the same structure and interleaving pattern as disclosed in prior art of FIGURE 1. The decoder conditions the moment queries to generate its own output. Finally, the decoder output is passed through a feed-forward network activated by ReLU to predict moment coordinates and foreground-background class labels.
In at least an embodiment, the dataset, of this invention, consists of CCTV surveillance videos from different sources. One such source is the ViSOR (Video Surveillance Online Repository) [Lamberto Ballan, Marco Bertini, Alberto Del Bimbo, Lorenzo Seidenari, and Giuseppe Serra. Effective codebooks for human action categorization. pages 506 – 513, 11 2009.] dataset. The ViSOR dataset is composed of 130 videos showing 8 different human activities: running, getting into a car, getting out of a car, leaving an object, giving an object, sitting, standing and people shaking hands. These videos were captured by a stationary camera and contain a different number of actors and activities in various locations. Another dataset that we have used is the VIRAT(Video and Image Retrieval and Analysis Tool) [Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, J. K. Aggarwal, Hyungtae Lee, Larry Davis, Eran Swears, Xioyang Wang, Qiang Ji, Kishore Reddy, Mubarak Shah, Carl Vondrick, Hamed Pirsiavash, Deva Ramanan, Jenny Yuen, Antonio Torralba, Bi Song, Anesco Fong, Amit Roy-Chowdhury, and Mita Desai. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR 2011, pages 3153–3160, 2011.] in which data was recorded in natural scenarios depicting people performing typical tasks in conventional contexts, with uncontrolled, cluttered backgrounds. There are numerous examples of various sorts of human actions and human-vehicle interactions, with a huge number of examples per action class. Data was gathered at a variety of locations across the United States, including university campuses and parking lots. Both ground camera videos and aerial videos are available as part of the VIRAT Dataset. We have used a total of 329 ground camera videos for our purpose. The UFPArk [Ingrid Nascimento, Pedro Castro, Sofia Klautau, Luan Goncalves, Carnot Filho, Flavio Brito, Aldebaro Klautau, and Silvia Lins. Public dataset of parking lot videos for computational vision applied to surveillance. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 61–64, 2020.] dataset provides video footage of the daily activities of a parking lot located in Brazil. Data samples were divided according to the period of the day. The current invention’s system and method have used the morning dataset samples for the invention’s purpose which consists of 55 videos in total.
In the pre-processing stage, the videos were split or merged.
Original videos were taken and, if the duration was too short (i.e., less than 30 s), the video was merged with longer videos; the resultant videos were then split into clips of 30 s each. In this process, it was ensured that the redundant parts of the video, i.e., the parts where no actions or events take place, were removed. It was also ensured that, while the events are taking place, the background of the videos remains the same and no abrupt scene shift occurs.
According to a non-limiting exemplary embodiment, a total of 995 videos were generated after this stage. The duration of the videos is 30s, frame rate is 30 frames/s and resolution is 224px*224px. The final size of our dataset with all the videos is 33.8 GB.
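The split-and-merge logic described above can be illustrated as a boundary computation. The sketch below only plans 30-second clip boundaries from per-video durations; the merge policy (shortest videos first) and the function name are illustrative assumptions, and the actual video cutting would be done with a separate video toolchain.

```python
from typing import List, Tuple

CLIP_LEN = 30.0  # target clip duration in seconds, per the embodiment above

def plan_clips(durations: List[float]) -> List[List[Tuple[int, float, float]]]:
    """Merge short videos with longer material and split the result into 30 s
    clips, returning (video_index, start, end) segment triples for each clip."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])  # shortest first
    clips, current, acc = [], [], 0.0
    for idx in order:
        pos, end = 0.0, durations[idx]
        while pos < end:
            take = min(CLIP_LEN - acc, end - pos)   # fill the current 30 s clip
            current.append((idx, pos, pos + take))
            acc += take
            pos += take
            if acc >= CLIP_LEN:
                clips.append(current)
                current, acc = [], 0.0
    if current:
        clips.append(current)  # trailing partial clip, if any
    return clips

print(plan_clips([12.0, 45.0, 70.0]))  # lists of segments, each totalling ~30 s
```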
While annotating the dataset videos, distinct moments were identified in the videos and manually labeled using natural language descriptions. Actions like running, walking, cycling, driving a car, sitting inside a car, opening the door of a car, talking, and shaking hands, amongst others, have been identified by the annotators while generating the annotations.
In all these actions, peculiarities like the colour of clothes, the type of vehicles, and others have also been mentioned in the annotations for detail and diversity. The annotation file of the videos is stored in JSON format. A total of 3569 annotations were generated and the average number of annotations per video is 3.6. The average length of the annotations is 9.3 words. Also, three separate humans were involved in annotating the videos. After an annotator describes a moment, the other two annotators validate it. The following describes the different fields in the JSON file, with an illustrative example given after the list:
- qid: Annotation ID for a query.
- query: Description for a particular moment in the video.
- vid: Name of the video.
- relevant windows: Start and end times of that particular moment in the video, in seconds.
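An illustrative (hypothetical) annotation entry using the fields listed above; the values and the video name shown here are made up for demonstration and do not come from the dataset itself.

```python
import json

annotation = {
    "qid": 101,                                               # annotation ID for the query
    "query": "A man in a red shirt gets out of a white car",  # moment description
    "vid": "campark_parkinglot_0042",                         # video name (hypothetical)
    "relevant_windows": [[4.0, 11.5]],                        # [start, end] in seconds
}
print(json.dumps(annotation, indent=2))
```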
This invention is extremely useful in the surveillance provider industry, as providers can couple their monitoring feeds with this application to allow users to track activity easily.
According to a non-limiting exemplary embodiment, the current invention's dataset captions were divided into train, validation, and test splits comprising 70%, 15%, and 15% mutually exclusive fractions of the entire dataset. A three-stage training schedule was followed:
Firstly, weakly supervised pretraining was performed on the QVHighlights dataset [Jie Lei, Tamara L. Berg, and Mohit Bansal. QVHighlights: Detecting moments and highlights in videos via natural language queries, 2021.] via Automatic Speech Recognition (ASR) captions. The system performs 50 epochs of this pretraining using a batch size of 256.
Secondly, pretraining was performed on the actual train split of QVHighlights, using a batch size of 64 for 200 epochs. The system utilized an early stopping criterion on the validation set to choose the best-performing checkpoint.
Finally, the best-performing checkpoint was trained on a CCTV dataset, created by the inventors, for 200 epochs with a batch size of 32.
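For reference, the three-stage schedule above can be summarized as a small configuration listing; the epoch counts, batch sizes, and dataset names come from the text, while the field names are illustrative.

```python
TRAINING_STAGES = [
    {"stage": "weakly supervised pretraining", "data": "QVHighlights ASR captions",
     "epochs": 50, "batch_size": 256},
    {"stage": "supervised pretraining", "data": "QVHighlights train split",
     "epochs": 200, "batch_size": 64, "early_stopping": True},
    {"stage": "domain-specific training", "data": "CamPark-Captions train split",
     "epochs": 200, "batch_size": 32},
]

for cfg in TRAINING_STAGES:
    print(f'{cfg["stage"]}: {cfg["epochs"]} epochs, batch size {cfg["batch_size"]}')
```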
A V100 GPU was utilized to perform all experiments.
FIGURES 7, 8, and 9 indicate the mean average precision (mAP) during the training process at different IOU thresholds.
It was observed that peak mAP was encountered near the 30th epoch of training. These metrics were calculated on the validation set. Through early stopping, the 30th checkpoint was chosen for test-set benchmarking.
FIGURES 10 and 11 indicate the plot of Recall@1, which measures the presence of the ground truth in the first rank of a given ranked output. Similar to mAP, the inventors plotted Recall@1 against the number of training epochs and observed the 30th checkpoint to perform best on the validation set for the R1 @ 0.5 metric, but the trend differed for the R1 @ 0.7 metric, where the score seemed to increase with the number of epochs.
Table 1, below, lists results of the Moment-DETR model for three different training configurations:
TABLE 1
Row 1 indicates results for the model which was only pretrained on QVHighlights ASR and training splits.
Row 2 indicates results for the model which was trained only on the training split of CamPark dataset.
Row 3 indicates results for the model which underwent both pretraining (PT) and domain-specific training (DST) on the QVHighlights and CamPark-Captions datasets respectively.
The reporting technique was the same as in Jie Lei, Tamara L. Berg, and Mohit Bansal, QVHighlights: Detecting moments and highlights in videos via natural language queries, 2021: the Recall@1 and mean Average Precision (mAP) metrics are reported at certain Intersection-over-Union (IOU) thresholds.
IOU is a standard metric used while measuring the extent of overlap between a prediction and its corresponding ground truth. In this case, the prediction and ground truths are represented as a temporal span: [start, end], which means a tuple of starting and ending times in seconds. The IOU metric captures the extent of overlap as a ratio between the intersection and union of the prediction and ground truth span. Here, the inventors measure the recall and precision metrics for IOU thresholds of 0.5 and 0.7 for Recall@1 and 0.5, 0.75 for mAP. Recall@1 is a standard metric used in single-moment retrieval. Here, for an IOU of 0.7, the inventors define a prediction to be positive if the IOU between the predicted and ground truth moment is more than 0.7. For average mAP, the inventors calculate the average of mAPs over multiple IOU thresholds within the range 0.5 to 0.95 with increments of 0.05.
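A sketch of the temporal IoU and Recall@1 computations described above. The IoU thresholds of 0.5 to 0.95 in steps of 0.05 for average mAP are only listed schematically, since full mAP requires ranked predictions over the whole test set; function names are illustrative.

```python
def span_iou(pred, gt):
    """IoU between two [start, end] spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, iou_threshold=0.7):
    """Fraction of queries whose rank-1 prediction overlaps the ground truth
    with an IoU at or above the threshold."""
    hits = sum(span_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, ground_truths))
    return hits / len(ground_truths)

iou_thresholds = [0.5 + 0.05 * i for i in range(10)]    # 0.5 ... 0.95 for average mAP
print(span_iou([4.0, 11.5], [5.0, 12.0]))                # ~0.81
print(recall_at_1([[4.0, 11.5]], [[5.0, 12.0]], 0.7))    # 1.0
```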
By comparing the results across the three rows, the inventors observed that their three-stage training scheme of pretraining on the ASR and training splits of QVHighlights, followed by the training split of CamPark, yields the best results, as given by Row 3. The performance depicted by Row 1 is quite poor as compared to Rows 2 and 3, which emphasizes the importance of the CamPark dataset in implementing surveillance-grade video moment retrieval. Finally, the gap between validation and test sets for all three rows appears to be marginal and can be minimized further with some degree of hyperparameter fine-tuning.
Currently, for the sake of simplicity, the inventors did not perform any hyperparameter finetuning apart from early-stopping. Overall, the inventors observed encouraging results by following the above detailed training scheme. These results serve as empirical evidence indicating the need and importance of a specialized dataset like “CamPark” as well as the success of the inventors’ model and training strategy on this complex real-world task.
According to a non-limiting exemplary embodiment, an uploading module allows a user to upload a video and a free-form natural language query as input and get back the trimmed video from the start timestamp to the end timestamp where the event mentioned in the query is taking place. The user can see a video frame on an output page which shows the trimmed video and also shows a timestamp of the event taking place in the original video. The query that was input by the user is also visible at the top of the page. The video gets uploaded and stored inside a database, from where it is then fetched by the system and method, of this invention, and results are displayed. The video and user-provided query are passed through this invention's trained model. The model predictions are then displayed on the success page. The model produces accurate results in real time. It is worth noting that the model does not require a GPU for this purpose and can run on a simple CPU. This allows the project to run on any system which has Python, Django, and PyTorch installed.
According to a non-limiting exemplary embodiment, the system and method of this invention are tested for linguistic robustness and visual robustness.
The current invention's system and method achieves a high accuracy on a representative test set, confirming the efficacy of the current invention's system and method. However, traditional learning paradigms assume data to be drawn from a single distribution, which is not always the case in the real world. The data distribution changes rapidly with changes in the environment and presents problems to statically trained models. Hence, care must be taken to prevent the model from collapsing in the real world.
VISUAL ROBUSTNESS:
In the visual space, considerable variation is possible which depends on the condition of the CCTV camera, the environment it is installed in along with the noise that creeps in between recording and transmission of the videos. Hence, to simulate such disturbances in the visual space, the system and method, of this invention, generates synthetic clips from original clips through the use of appropriate transformations.
To turn the input clips into a new, much bigger collection of slightly altered videos, the inventors used the vidaug Python library for augmenting videos. The effects used for altering the videos are the Add effect and the Blur effect. The Add effect adds a value to all pixel intensities in a video; the value used for the dataset is -100, thus darkening the overall video. The Blur effect uses a Gaussian blur to blur the overall video.
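The two effects described above (Add with a value of -100, and a Gaussian blur) were applied with the vidaug library; the sketch below reproduces the same per-frame effects with NumPy and OpenCV as a library-agnostic illustration, with an assumed blur kernel size.

```python
import cv2
import numpy as np

def darken(frames: np.ndarray, value: int = -100) -> np.ndarray:
    """Add `value` to all pixel intensities (the 'Add' effect with -100)."""
    return np.clip(frames.astype(np.int16) + value, 0, 255).astype(np.uint8)

def blur(frames: np.ndarray, ksize: int = 7) -> np.ndarray:
    """Apply a Gaussian blur to every frame (the 'Blur' effect); ksize is assumed."""
    return np.stack([cv2.GaussianBlur(f, (ksize, ksize), 0) for f in frames])

video = np.random.randint(0, 256, size=(30, 224, 224, 3), dtype=np.uint8)  # 1 s at 30 fps
dark_video, blurred_video = darken(video), blur(video)
```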
A total of 1990 videos were generated after augmentation, which, when combined with the original 995 videos, came to 2985 videos. The final size of the dataset with all the videos is 65 GB. Since annotations were required for the videos with the two effects as well, additional annotations were added to the original annotations file, with only the vid fields changed to the names of the new videos. A total of 10707 annotations were generated in the end for the original videos and the two effects.
Finally, the inventors test the performance of the current invention’s system and method on these altered clips. It was observed that the model performs inadequately due to the distribution shift. The inventors, then, trained a robust model by augmenting these altered clips with the original dataset. The robust model which is trained on this entire collection performs much better and is able to predict accurately even on blurry / altered videos.
FIGURE 12 illustrates performance comparison between the base and robust model on altered videos.
In FIGURE 12, the second and the third images are the blurred and darkened versions of the first image. The predictions made by the base model are shown in solid blue boxes or lines, ground truth is indicated by dashed green lines, and the predictions made by the robust model are shown in solid orange boxes.
LINGUISTIC ROBUSTNESS:
FIGURE 13 illustrates performance comparison between base and robust model on paraphrased queries.
The variations in the language space are governed by syntactical changes in the input query. Since the system and method, of this invention, places no restriction on the structure of the query, it is bound to assume various syntactical forms conforming to the same meaning. For example, a query “Two men are jumping and dancing on the sidewalk” can be restructured to “Two males are seen skipping and swaying on the footpath”. In other words, there exist many paraphrases for each query, and the current invention's system and method should be able to predict the same output for all these queries. The inventors focused on generating such paraphrases in order to evaluate and train the current invention's system and method.
The inventors followed the setup of backtranslation, which has been used for data augmentation in various natural language processing tasks. Particularly, the current invention utilizes a MarianMT model as the translation backbone. MarianMT is a neural machine translation model built on top of a transformer encoder-decoder architecture. Pretrained MarianMT models were obtained through Huggingface, providing bi-directional translation models between English and 77 other languages. To perform backtranslation, the inventors utilize a translation-reverse-translation approach with these pretrained models. The translation model uses “English” as the source language and one of the 77 languages as the target; some examples of the target are “German”, “French”, “Spanish”, etc. Symmetrically, for the reverse-translation model, the inventors chose one of the 77 languages as the source language and “English” as the target. Hence, a particular English language query is first translated into a pivot language from the 77 available options and then backtranslated to English. Through this approach, the current invention's system and method is able to generate many paraphrases for a single query. Using the above-described approach, 77 paraphrases were generated for each query from the dataset.
It was observed that the performance of the base model on these paraphrased queries was lower than on the original queries. Hence, to improve the robustness of the current invention's system and method and to close this performance gap, the inventors engaged in robust training on the original dataset augmented with the corresponding paraphrases. The inventors visualized the effect of these paraphrased queries on the base model and the robust model in FIGURE 13. It was observed that the robust model was able to match the original prediction, whereas the base model was seen to be mispredicting.
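A minimal backtranslation sketch using pretrained MarianMT models from the Hugging Face hub, with German as the pivot language. The Helsinki-NLP checkpoint names below are publicly available opus-mt models; any of the other pivot languages mentioned above could be substituted, and the helper names are illustrative.

```python
from transformers import MarianMTModel, MarianTokenizer

def load(model_name: str):
    return MarianTokenizer.from_pretrained(model_name), MarianMTModel.from_pretrained(model_name)

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def backtranslate(query: str, pivot: str = "de") -> str:
    """English -> pivot language -> English, producing a paraphrase of the query."""
    fwd_tok, fwd_model = load(f"Helsinki-NLP/opus-mt-en-{pivot}")
    bwd_tok, bwd_model = load(f"Helsinki-NLP/opus-mt-{pivot}-en")
    pivot_text = translate([query], fwd_tok, fwd_model)
    return translate(pivot_text, bwd_tok, bwd_model)[0]

print(backtranslate("Two men are jumping and dancing on the sidewalk"))
```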
The TECHNICAL ADVANCEMENT of this invention lies in a multi-modal neural network in the form of an end-to-end system, that takes a video clip, as one input, and a natural language query, as another input, and conditions over frames of the video to predict the moment/s of interest in the form of a temporal span [start, end].
While this detailed description has disclosed certain specific embodiments for illustrative purposes, various modifications will be apparent to those skilled in the art which do not constitute departures from the scope of the invention as defined in the following claims, and it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.
CLAIMS: WE CLAIM,
1. A system for localizing moments, in surveillance videos, using natural language queries, said localized moment being a video span as an output of a query-video pair, said localized moment having a start time for said video and an end time for said video, said system comprising:
- a dataset consisting, essentially, of videos with predefined textual and temporal annotations;
- a video input module configured to allow input of videos to be analyzed for providing a first component of said query-video pair;
- a query input module configured to allow input of query by a user in relation to videos for providing a second component of said query-video pair;
- a trained state-of-the-art multimodal transformer model, using data items from the dataset, for purposes of benchmarking, said transformer model responsive to each of said query-video pairs to predict a defined number of moments ranked by confidence score, said model consisting, essentially, of a computer processor, with a memory to store instructions, the instructions being:
o providing a transformer having stack of transformer encoders followed by a stack of transformer decoders, wherein decoder output is passed through feed-forward network layers to predict said video span of a moment;
o converting said input videos to video feature vectors;
o converting said input text to text feature vectors;
o concatenating, said video feature vectors and said text feature vectors, along a hidden dimension;
o feeding said concatenated vectors to said transformer; and
o determining an output, of said localized moment, based on peak mean average precision (mAP) at a defined threshold.
2. The system as claimed in claim 1 wherein, said confidence score being selected from (0.0 - 1.0) with an empirically set threshold of 0.55.
3. The system as claimed in claim 1 wherein, said step of converting said input videos to video feature vectors comprising a step of using a SlowFast Network.
4. The system as claimed in claim 1 wherein, said step of converting said input text to text feature vectors comprising a step of using a CLIP-textual encoder to obtain token-level feature vectors of size 512.
5. The system as claimed in claim 1 wherein, said model comprising a preprocessor configured to preprocess videos by splitting said videos and by merging said videos, in that, videos having a pre-defined time duration being merged with videos having a time span larger than said pre-defined time duration in order to ensure removal of redundant parts of said video.
6. The system as claimed in claim 1 wherein, said transformer decoder being passed through a feed-forward network activated by ReLU to predict moment coordinates and foreground-background class labels.
7. The system as claimed in claim 1 wherein, said videos being annotated videos, having annotated fields, said annotated fields selected from a group of annotations consisting of a query relevant to said video, a name of said video, an identity of a query relevant to said video, a start time for said video, and an end time for said video.
8. The system as claimed in claim 1 wherein, said peak mAP being encountered near a 30th epoch of training.
9. The system as claimed in claim 1 wherein, said dataset, of videos, comprising an altered dataset, said alteration being an effect alteration of darkening said videos by adding -100 value to all pixel intensities.
10. The system as claimed in claim 1 wherein, said dataset, of videos, comprising an altered dataset, said alteration being an effect alteration of blurring said videos by adding gaussian blur to all videos.
Dated this 22nd day of August, 2023
CHIRAG TANNA
of INK IDÉE
APPLICANT’S PATENT AGENT
REGN. NO. IN/PA - 1785