Abstract: A control unit for enriching object annotations in a video sequence and a method thereof. The control unit 10 comprises a tracking module 14 and a detecting module 16. The control unit 10 inputs a video sequence 12 having multiple frames and object annotations 11 in the video sequence and builds an initial appearance model for each of the object annotations present in the multiple frames. The control unit initiates tracking for each of the object annotations from a first frame (t0) of the video sequence via the tracking module. The control unit matches at least one parameter of the tracking module and the detection module and updates the initial appearance model of the object annotations based on the match. The control unit adds new object annotations to an existing detection set to form an object annotation super set. The control unit updates a state of the tracking module when the new object annotations are added to the object annotation super set. Figure 1 & 2
Description: Complete Specification:
The following specification describes and ascertains the nature of this invention and the manner in which it is to be performed.
Field of the invention
[0001] This invention relates to a control unit for enriching object annotations in a video sequence and a method thereof.
Background of the invention
[0002] Annotations can take the form of object location, type, and identity. In many scenarios, annotations are sparse, for example, incomplete annotations in a frame (only a few objects are annotated) or annotations that exist only sporadically. This mode of annotation is practically useless in the context of training and validating modern camera-based AI functions. As significant effort, knowledge and cost has already been incurred in generating a plethora of such informative annotations, it is important to build upon the existing sparse annotations by enriching them further and making them usable for modern AI-based systems. In other words, the current scenario of annotation cost and effort can be significantly improved by developing an automated way to densify the already existing sparse annotations.
[0003] US patent 8693842 discloses systems and methods for enriching audio/video recordings using annotation data provided by attendees of a presentation, in which annotations from attendees are received by a server which merges and synchronizes the annotation data, performs text data mining to identify key messages and temporal segments of the audio/video data, and constructs an enriched audio/video recording including audio and/or video data as well as segment data and key message data for ease of user navigation.
Brief description of the accompanying drawings
[0004] Figure 1 illustrates a control unit for enriching object annotations in a video sequence, in accordance with an embodiment of the invention; and
[0005] Figure 2 illustrates a flowchart for a method for enriching object annotations in the video sequence in accordance with the present invention.
Detailed description of the embodiments
[0006] Figure 1 illustrates a control unit for enriching object annotations 11 in a video sequence 12, in accordance with an embodiment of the invention. The control unit 10 comprises a tracking module 14 and a detecting module 16. The control unit 10 inputs a video sequence 12 having multiple frames 13 and object annotations 11 in the video sequence 12 and builds an initial appearance model for each of the object annotations 11 present in the multiple frames 13 using a backbone architecture. The control unit 10 initiates tracking for each of the object annotations 11 from a first frame (t0) of the video sequence 12 via the tracking module 14. The control unit 10 matches at least one parameter of the tracking module 14 and the detection module 16 and updates the initial appearance model of the object annotations 11 based on the match. The control unit 10 adds new object annotations 11 to an existing detection set to form an object annotation super set. The new object annotations 11 are obtained when there is no match between the tracking module 14 and the detection module 16 parameters. The control unit 10 updates a state of the tracking module 14 when the new object annotations are added to the object annotation super set and creates a new track for the added object annotation 11 if there is no match between the tracking module 14 and the detection module 16 parameters.
[0007] Further, the construction of the control unit 10 and the manner of working of the control unit are explained in detail. The control unit 10 is chosen from a group of control units such as a microprocessor, a microcontroller, a digital circuit, an integrated chip and the like. The video sequence 12 comprises multiple frames 13 having the reference or sparse object annotations 11, which are predefined manually.
[0008] Figure 2 illustrates a flowchart for a method for enriching annotations in the video sequence in accordance with the present invention. The control unit 10 comprises a tracking module 14 and a detecting module 16. In step S1, a video sequence 12 having multiple frames 13 and object annotations 11 in the video sequence 12 is inputted. In step S2, an initial appearance model is built for each of the object annotations 11 present in the multiple frames 13 using a backbone architecture. In step S3, tracking is initiated for each of the object annotations 11 from a first frame (t0) of the video sequence 12 via the tracking module 14.
[0009] In step S4, at least one parameter of the tracking module 14 and the detection module 16 is matched and the initial appearance model of the object annotations 11 is updated based on the match. In step S5, at least one new object annotation 11 is added to an existing detection set to form an object annotation super set. The at least one new object annotation 11 is obtained when there is no match between said tracking module 14 and the detection module 16 parameters. In step S6, a state of the tracking module 14 is updated when said at least one new object annotation 11 is added to the object annotation super set, and a new track for the added object annotation 11 is created if there is no match between the tracking module 14 and the detection module 16 parameters.
[0010] The method of enriching the annotations 11 in the video sequence 12 is explained in detail. The object annotations 11 can take the form of object location, type, identity and the like. However, it is to be understood that the object annotations 11 are not limited to the above disclosed examples, but can be of any other type that is known to a person skilled in the art. In many scenarios, annotations 11 are sparse, for instance, incomplete annotations 11 in a frame 13 of the video sequence 12 (only a few objects are annotated) or annotations 11 that exist only sporadically.
[0011] This mode of annotation is practically useless in the context of training and validating modern camera-based artificial intelligence (AI) functions. It becomes vital to build upon the existing sparse annotations by enriching them further and making them usable for modern AI-based systems. In other words, the current scenario of annotation cost and effort can be significantly improved by developing an automated way of densifying the already existing sparse annotations with the method disclosed in this invention.
[0012] The video sequence 12 comprising multiple frames 13 with the sparse object annotations 11 is provided as an input to the control unit 10. The control unit 10, upon receiving the video sequence 12 and the sparse object annotations 11, builds an initial appearance model for each of the object annotations 11 present in the multiple frames 13. That is, for each of the object annotations 11 present in each frame 13 of the video sequence 12, the control unit 10 builds a corresponding initial appearance model. According to one embodiment of the invention, the appearance model is built using a backbone architecture. However, the appearance model can be built using any other technique as known in the state of the art. The initial appearance model is built for the object annotations present from the first frame (i.e., t0).
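A minimal sketch of how such an initial appearance model might be built is given below. The backbone architecture is represented by a placeholder extract_features function (here a coarse histogram stand-in, where a real system could use any pretrained network), and the (x, y, w, h) bounding-box format and the dictionary layout of the annotations are illustrative assumptions, not part of the specification.

```python
import numpy as np

def extract_features(crop: np.ndarray) -> np.ndarray:
    # Placeholder for a backbone network; a real system would feed the crop
    # through a pretrained CNN or transformer. A coarse intensity histogram
    # serves here as a fixed-length stand-in feature.
    hist, _ = np.histogram(crop, bins=32, range=(0, 255))
    vec = hist.astype(np.float32)
    return vec / (np.linalg.norm(vec) + 1e-8)

def build_initial_appearance_models(frame: np.ndarray, annotations: dict) -> dict:
    """Build one appearance model per annotated object in the first frame (t0).

    annotations maps an object identity to a bounding box (x, y, w, h);
    this format is an assumption for the sketch.
    """
    models = {}
    for obj_id, (x, y, w, h) in annotations.items():
        crop = frame[y:y + h, x:x + w]           # image patch for this annotation
        models[obj_id] = extract_features(crop)  # initial appearance feature
    return models
```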
[0013] The control unit 10 further reduces the dimensionality of the appearance features using offline-computed dimensionality reduction techniques. The control unit 10 initiates tracking for each of the object annotations 11 from a first frame (t0) of the video sequence 12 via the tracking module 14. At least one parameter of the tracking module 14 is matched with the corresponding parameter of the detection module 16. The parameter is chosen from a group of parameters comprising an appearance feature, a confidence value and a state of the tracking module 14 related to each of the object annotations 11. For example, if the parameter is the appearance feature, then the appearance feature is matched based on high cosine similarity when compared with an original appearance feature. The original appearance feature is determined from the initial appearance model. The control unit 10 stores the original appearance feature of each of the object annotations 11 from the corresponding initial appearance model and compares the current appearance features with the original appearance features.
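The cosine-similarity matching described above could be sketched as follows; the appearance features are assumed to be fixed-length vectors, and the similarity threshold of 0.8 is purely illustrative rather than a value from the specification.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two appearance feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def matches_original(current_feature: np.ndarray,
                     original_feature: np.ndarray,
                     threshold: float = 0.8) -> bool:
    # A detection is considered a match when its appearance feature has high
    # cosine similarity to the stored original feature (threshold is assumed).
    return cosine_similarity(current_feature, original_feature) >= threshold
```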
[0014] If there is a high cosine similarity between the above-mentioned features, then the control unit 10 updates the initial appearance model of the corresponding object annotation. The control unit 10 initialises the appearance feature of the model with the high-confidence detection at t=0, adapts it linearly with high-confidence detections over time, and also predicts the previous detections in the current frame of the video sequence 12. The current frame can be from t1 to the end of the video sequence 12.
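One plausible reading of this linear adaptation is an exponentially weighted blend of the stored appearance feature with each new high-confidence detection, as sketched below; the mixing factor alpha and the confidence threshold are assumptions made for illustration.

```python
import numpy as np

def update_appearance_model(model: np.ndarray,
                            detection_feature: np.ndarray,
                            detection_confidence: float,
                            alpha: float = 0.9,
                            conf_threshold: float = 0.7) -> np.ndarray:
    """Linearly blend the stored appearance feature with a new detection.

    Only high-confidence detections adapt the model, so the appearance
    drifts slowly with the object over time. alpha and conf_threshold
    are illustrative values, not taken from the specification.
    """
    if detection_confidence < conf_threshold:
        return model                      # ignore low-confidence detections
    updated = alpha * model + (1.0 - alpha) * detection_feature
    return updated / (np.linalg.norm(updated) + 1e-8)  # keep unit length
```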
[0015] The control unit 10 adds new object annotations 11 to an existing detection set to form an object annotation super set. The new object annotations 11 are obtained when there is no match between the tracking module 14 and the detection module 16 parameters.
[0016] The control unit 10 adds only high-confidence predictions to a current object annotation set which is present in the detection module 16 to form the object annotation superset. The control unit 10 further matches the object annotation superset with the state of the tracks of the tracking module 14, using the product likelihood of intersection over union and a cosine measure. The new object annotation 11 has a confidence value more than a predefined threshold value and high cosine similarity, and the new object annotation 11 is an existing object annotation with a corresponding updated appearance model. The control unit 10 updates a state of the tracking module 14 when the new object annotations 11 are added to the object annotation super set. The control unit 10 creates a new track for the added object annotation 11 if there is no match between the tracking module 14 and the detection module 16.
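A minimal sketch of the product likelihood of intersection over union and the cosine measure used for this association is given below; the (x, y, w, h) box format is an assumption carried over from the earlier sketches.

```python
import numpy as np

def iou(box_a, box_b) -> float:
    # Boxes are (x, y, w, h); returns intersection over union in [0, 1].
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def association_score(track_box, track_feature, det_box, det_feature) -> float:
    # Product of the spatial (IoU) and appearance (cosine) likelihoods used to
    # match the annotation superset against the track states.
    cos = float(np.dot(track_feature, det_feature) /
                (np.linalg.norm(track_feature) * np.linalg.norm(det_feature) + 1e-8))
    return iou(track_box, det_box) * max(cos, 0.0)
```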
[0017] The control unit 10 further comprises an automatic correction module 18 adapted to calibrate the updated initial appearance model when an error occurs between an original set of object annotations 11 and the formed/current object annotation super set. The automatic correction module 18 processes the video sequence 12 in both forward and backward directions. The control unit 10 deletes the object annotations 11 when the parameter of the tracking module 14 and the detection module 16 is not matched within a predefined time.
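The deletion of annotations that remain unmatched within a predefined time could look like the following sketch; expressing the predefined time as a frame count and the value of 30 frames are assumptions for illustration.

```python
def prune_stale_tracks(tracks: dict, current_frame: int,
                       max_unmatched_frames: int = 30) -> dict:
    """Drop tracks whose annotations have not been matched for too long.

    tracks maps a track id to a dict with at least 'last_matched_frame'.
    The predefined time is modelled here as a frame count; 30 is illustrative.
    """
    return {tid: t for tid, t in tracks.items()
            if current_frame - t['last_matched_frame'] <= max_unmatched_frames}
```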
[0018] With the above disclosed method of enriching the object annotations 11 and the control unit 10 performing the task, the existing annotations 11 are densified in the video sequence 12 using the multi-annotation propagation technique disclosed above. A unique appearance model for each object annotation 11 within the seed regions (existing sparse annotations) is established, assuming the identity of each object annotation 11 is maintained in the existing sparse annotations. The control unit 10 builds the appearance models in addition to pretrained models (wherever applicable) to localize and identify the respective object instances in subsequent frames 13. Consistency of the annotations 11 is checked by employing the object trackers in both forward and backward directions in the video sequence 12 and determining their misalignment with the existing annotations 11.
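One way this consistency check against the existing annotations might be expressed is sketched below; it reuses the iou helper from the earlier sketch, and the keying of annotations by (frame index, object id) as well as the overlap threshold of 0.5 are assumptions.

```python
def misalignment_errors(propagated: dict, reference: dict,
                        iou_threshold: float = 0.5) -> list:
    """Compare propagated annotations against existing reference annotations.

    Both arguments map (frame_index, object_id) to a bounding box (x, y, w, h).
    Pairs whose boxes overlap less than the threshold are reported as errors,
    which can then be used to tune the label propagation mechanism.
    """
    errors = []
    for key, ref_box in reference.items():
        if key in propagated and iou(propagated[key], ref_box) < iou_threshold:
            errors.append(key)
    return errors
```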
[0019] The misalignments/misclassifications with respect to the existing object annotations 11 in the video sequence 12 are treated as errors and used to further tune the underlying label propagation mechanism. Tracks are further pruned based on the membership of existing annotations 11 with newly created annotations 11 and the track confidence. In this manner, one is able to preserve existing annotations 11 and also introduce newer annotations 11 to obtain a much denser set of annotations 11.
[0020] Embodiments explained in the description above are only illustrative and do not limit the scope of this invention. Many such embodiments and other modifications and changes in the embodiments explained in the description are envisaged. The scope of the invention is only limited by the scope of the claims.
Claims: We Claim:
1. A control unit (10) for enriching object annotations in a video sequence (12), said control unit (10) comprising a tracking module (14) and a detecting module (16), said control unit (10) adapted to:
- input a video sequence (12) having multiple frames (13) and object annotations (11) in said video sequence (12);
- build an initial appearance model for each of said object annotation (11) present in said multiple frames (13) using a backbone architecture;
- initiate a tracking for each of said object annotation (11) from a first frame (t0) of said video sequence (12) via said tracking module (14);
- match at least one parameter of said tracking module (14) and said detection module (16) and update said initial appearance model of said object annotations (11) based on said match;
- add new object annotations (11) to an existing detection set to form an object annotation super set, said new object annotations (11) are obtained when there is no match between said tracking module (14) and said detection module (16) parameters;
- update a state of said tracking module (14) when said new object annotations (11) are added to said object annotation super set and create a new track for said added object annotation (11) if there is no match between said tracking module (14) and said detection module (16) parameters.
2. The control unit (10) as claimed in claim 1, wherein said control unit (10) comprises an automatic correction module (18) adapted to calibrate said updated appearance model, when an error occurs between an original set of object annotations (11) and said formed object annotation super set.
3. The control unit (10) as claimed in claim 2, wherein said automatic correction module (18) adapted to operate said video sequence (12) in both forward and backward directions.
4. The control unit (10) as claimed in claim 1, wherein said parameter of said detection module (16) and said tracking module (14) is chosen from a group of parameters comprising an appearance feature or a confidence value or a state of said tracking module (14) related to each of said object annotation (11).
5. The control unit (10) as claimed in claim 4, wherein said appearance feature is matched based on high cosine similarity with an original appearance feature, wherein said original appearance feature is determined from said initial appearance model.
6. The control unit (10) as claimed in claim 1, wherein said new object annotation (11) has confidence value more than a predefined threshold value and high cosine similarity.
7. The control unit (10) as claimed in claim 6, wherein said new object annotation (11) is an existing object annotation with a corresponding updated appearance model.
8. The control unit (10) as claimed in claim 1, wherein said object annotations (11) are deleted when said parameter of said tracking module (14) and said detection module (16) is not matched within a predefined time.
9. A method for enriching object annotations (11) in a video sequence (12) by a control unit (10), said control unit (10) comprising a tracking module (14) and a detecting module (16), said method involves the steps of:
- inputting a video sequence (12) having multiple frames (13) and object annotations (11) in said video sequence (12);
- building an initial appearance model for each of said object annotation (11) present in said multiple frames (13) using a backbone architecture;
- initiating a tracking for each of said object annotation (11) from a first frame (t0) of said video sequence (12) via said tracking module (14);
- matching at least one parameter of said tracking module (14) and said detection module (16) and updating said initial appearance model of said object annotations (11) based on said match;
- adding at least one new object annotation (11) to an existing detection set to form an object annotation super set, said at least one new object annotation (11) is obtained, when there is no match between said tracking module (14) and said detection module (16) parameters;
- updating a state of said tracking module (14) when said at least one new object annotation (11) is added to said object annotation super set and creating a new track for said added object annotation (11) if there is no match between said tracking module (14) and said detection module (16) parameters.