Abstract: A method and a system to detect an object in an image ABSTRACT Disclosed are techniques to detect a target object in an image. In an example, a processor (100) comprises a downstream model (103); a detector model (104); and a feature extractor network (105) with a plurality of convolutional layers. The downstream model (103) is developed by the processor by extracting and learning a plurality of features of target objects sliced out of images. The detector model (104) is built by extracting and learning the features of the actual images having the target object, and is trained by the downstream model through a knowledge distillation technique. A downstream model trained for a particular class of object trains the detector model to detect the same class of target object, thereby dividing the training method into two steps and simplifying the learning of features by the detector model. A method to detect an object in an image is further disclosed.
Description: Complete Specification:
The following specification describes and ascertains the nature of this invention and the manner in which it is to be performed
Field of the invention
[0001] The present disclosure relates to a method and a system to detect an object in an image.
Background of the invention
[0002] There are several approaches for deep neural network-based object detection in an image. However, these approaches require a large amount of data. Further, the present methods in this field are ineffective in producing good results where the target object forms only a small portion of the image.
[0003] An object detection machine learning (ML) model can be trained using a few target images with data augmentations or by generating synthetic data, where targets can be detected using the trained models. However, this takes a large number of training cycles, in addition to being time consuming and expensive.
[0004] To mitigate this expensive process of training cycles and overfitting, a feature-similarity based method, which collects data points that are similar to a target object, is adopted. There are many methods to train a feature model and retrieve an image using the trained feature model for "image level" targeted data collection; however, traditional image feature training cannot be used for "object level" feature training.
[0005] The paper by Bor-Chun Chen, Zuxuan Wu, Larry S. Davis, and Ser-Nam Lim, "Efficient object embedding for spliced image retrieval", uses a pretrained image level feature network and region of interest (ROI) pooling to generate embeddings for objects.
[0006] In the present disclosure, instead of learning the features of a box of a region of interest (ROI) of an image, object features are learned at 'crop level' to generate more robust features. This reduces the complexity of overall learning by the model, as all the irrelevant information is removed.
[0007] A teacher model or a downstream model learns features of cropped objects (at object level). The trained downstream model (teacher model) trains a detector model (student model), or an object level feature extractor model, through knowledge distillation. By dividing the training method into two different steps, the feature learnability for the object level feature model is simplified.
[0008] Optionally, multiple use-case (class) specific teacher models can be used to distill knowledge into the detector model (student model), thereby optimizing the disclosed feature extraction method.
Brief description of the accompanying drawings
An embodiment of the invention is described with reference to the following accompanying drawings:
[0009] Figure 1 depicts a system to detect an object in an image.
[0010] Figure 2 depicts the elimination of duplicate features between a query image and a feature map, detected by the detector model during an inference.
[0011] Figure 3 depicts a flowchart of the method to detect an object in an image.
Detailed description of the drawings
[0012] Referring to Figure 1, the same depicts a processor (100) for object detection in an image.
[0013] The processor (100) may be a computing system found in a wide range of electronic device types to process signals and/or states representative of diverse content types for a variety of purposes. Examples of the processor 100 may include, but are not limited to, a laptop, a notebook computer, a desktop computer, a server, a cellular phone, a digital camera, a personal digital assistant, and a wearable electronic device.
[0014] In an example, the processor 100 includes a storage device 101. The storage device may include any non-transitory computer-readable medium including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0015] In an example, the processor 100 includes interface(s) 102. The interface(s) 102 may include a variety of interfaces, for example, interface(s) 102 for users. The interface(s) 102 may include data output devices. The interface(s) 102 may facilitate the communication of the processor 100 with various communication and electronic devices.
[0016] In an example, the interface(s) 102 may enable wireless communications between the processor 100, such as a laptop, and one or more other computing devices (not shown).
[0017] In an example, the processor comprises a downstream model (103) and a detector model (104).
[0018] The downstream model (103), or teacher model, is built by the processor as a trained convolutional neural network (CNN). The processor takes training data as input, wherein target objects are cropped out (or sliced out) of a plurality of images and fed into a CNN to train the teacher model or downstream feature extractor model. Thus, the training data for the teacher model is prepared by cropping the target object out of a plurality of images.
[0019] In an example, the training data may be balanced by a data balancer (to balance the data in terms of inputs of each class, either by oversampling the less common classes or undersampling the common classes), and data augmentation (obtaining new data points from existing data) may be performed.
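As an illustration of the training-data preparation described above, the following is a minimal Python sketch. All names (`crop_objects`, `oversample_balance`) are hypothetical, and the `(x1, y1, x2, y2)` bounding-box convention is an assumption, since the disclosure does not fix a format:

```python
import numpy as np

def crop_objects(image, boxes):
    """Slice target objects out of a full image given bounding boxes.

    image: H x W x C array; boxes: iterable of (x1, y1, x2, y2) pixel
    coordinates. Returns one crop per box, as used to build the
    downstream (teacher) model's training data.
    """
    return [image[y1:y2, x1:x2] for (x1, y1, x2, y2) in boxes]

def oversample_balance(crops_by_class):
    """Naively balance the data by repeating crops of the rarer
    classes until every class has as many samples as the most
    common one (one of the two balancing options mentioned)."""
    target = max(len(v) for v in crops_by_class.values())
    balanced = {}
    for cls, crops in crops_by_class.items():
        reps = -(-target // len(crops))  # ceiling division
        balanced[cls] = (crops * reps)[:target]
    return balanced
```

For example, cropping box `(10, 20, 50, 60)` out of a 100x200 RGB image yields a 40x40x3 crop, and a class with one sample is repeated to match a class with three.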
[0020] While the downstream model learns the features of an object cropped out of an image, the feature learning method for the downstream model varies from task to task. It can be supervised or unsupervised, depending on the availability of class labels in the training data.
[0021] The detector model (104), or student model, is built to extract and learn the features of the entire image from which the target objects are cropped out to build the downstream model (103). The detector model (student model) is trained by the downstream model (teacher model) through a knowledge distillation technique.
[0022] In the knowledge distillation technique, the detector model (student model) is trained to mimic the behavior of the downstream model (teacher model) by minimizing the difference between the teacher model's output and the student model's output.
[0023] The trained downstream model (teacher model) captures knowledge of the data in the intermediate layers of its convolutional neural network. The intermediate layers learn to discriminate specific features, and this knowledge is used to train the detector model (student model). The purpose of this 'knowledge distillation' is for the detector model (student model) to learn the same feature activations as the teacher model, where feature activation is the process of extracting features of an input image by applying convolutional filters. The convolutional filters are used to detect patterns in the input image. The output of the convolutional filters is then used to create a 'feature map', which is used to identify the features in the input image. Feature maps are response images convolved by a trained kernel or filter and represent features extracted from the image.
[0024] In order for the detector model (104) to learn the same feature activations as the downstream model, a distillation loss function is used to minimize the difference between the feature activations of the teacher and student models.
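A minimal sketch of such a distillation loss follows, assuming the teacher emits one embedding per cropped object and the student emits a dense feature map over the full image. The function name and the mean-squared-error form are assumptions; the disclosure does not fix a particular loss function:

```python
import numpy as np

def distillation_loss(teacher_feats, student_map, centers):
    """Hypothetical distillation loss: mean squared difference between
    the teacher's per-crop embedding and the student's feature map
    sampled at each object's box center.

    teacher_feats: N x D embeddings of the cropped target objects.
    student_map:   H x W x D feature map of the full image.
    centers:       list of N (row, col) box centers in map coordinates.
    """
    diffs = [student_map[r, c] - t
             for (r, c), t in zip(centers, teacher_feats)]
    return float(np.mean(np.square(diffs)))
```

When the student's feature at an object's center exactly matches the teacher's embedding for that object, the loss is zero; training drives the student toward that state.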
[0025] The detector model's (104) (student model's) task is to learn the features from the downstream model (103) (teacher model), but at a larger scale and variability, without worrying about the robustness of the features. The student model is able to generate feature embeddings for each point (feature map) in the input image with the target object.
[0026] When a query image is introduced to the detector model (104) (student model) as input, the detector model (student model) locates the presence of objects with a bounding box and detects the classes of the located objects in these boxes. The center of each object box in the feature map is treated as the relevant feature for that object in the image. The detector model's (student model's) training is done in such a way that the downstream model's (teacher model's) feature for a particular object is as close as possible to the relevant feature from the detector model (student model) for the same object in the same image.
[0027] During targeted data collection (learning features of target objects), when a new image comes, it is passed through the detector model (student model) to generate a corresponding feature map for the image. Each of the features in this feature map is then matched with the target data feature using a similarity function. If the similarity of the closest feature point crosses a threshold, the image is accepted as a data point relevant to the target data. Due to the scalable nature of the matching functions (cosine and Euclidean similarity), this method is highly scalable for targeted data collection using multiple target data as well.
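The matching step above can be sketched as follows, using cosine similarity (one of the two functions named) and treating "crosses a threshold" as "similarity at or above a threshold". The names `cosine_sim` and `is_relevant` and the default threshold of 0.8 are assumptions for illustration:

```python
import numpy as np

def cosine_sim(target, feats):
    """Cosine similarity between one target feature vector and a
    set of feature vectors (rows of `feats`)."""
    target = target / np.linalg.norm(target)
    feats = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    return feats @ target

def is_relevant(feature_map, target_feat, threshold=0.8):
    """Accept a new image as a relevant data point if any feature in
    its H x W x D feature map is close enough to the target object's
    feature; the 0.8 default is an assumed value."""
    flat = feature_map.reshape(-1, feature_map.shape[-1])
    return float(cosine_sim(target_feat, flat).max()) >= threshold
```

To use multiple targets, the same check runs per target feature; as paragraph [0028] notes, similarity to negative targets can additionally be computed to reject near-misses.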
[0028] In an example, the similarity is also calculated between each feature in the feature map and the negative targets in the target data, to refine the collected data. This positive and negative target based method helps achieve further fine-grained results.
[0029] Optionally, the processor further includes a feature extractor network (105). The feature extractor network (or backbone architecture) may be a standard deep convolutional neural network (CNN) architecture with multiple convolutional layers, such as Visual Geometry Group (VGG).
[0030] The feature extractor network generates an embedding head of the input image. The feature extractor network performs feature extraction of the input image, extracting features of different dimensions through different layers of the convolutional neural network. The specific number of layers of the convolutional neural network depends on the specific use case.
[0031] Optionally, in cases where the target objects belong to multiple classes (multi-use-case targeted data collection), a single shared feature extractor network (backbone) is used to extract the feature maps for each class (use case) of target object. For each use case/class, a different embedding head is trained on top of this shared backbone for use-case/class specific features. Each of these use-case/class specific heads can be trained from a different use-case/class specific detector model. This training method brings the advantage of dividing the feature learning complexity of the multi-use-case model.
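The shared-backbone arrangement can be sketched as below. Simple linear projections stand in for the CNN backbone and heads, and all names (`backbone_w`, `heads`, `embed`), dimensions, and the two example class labels are illustrative assumptions:

```python
import numpy as np

FEAT_DIM, EMB_DIM = 16, 8
rng = np.random.default_rng(0)

# One shared backbone projection, plus one embedding head per
# use case/class, each trained separately on top of the shared
# backbone features.
backbone_w = rng.standard_normal((FEAT_DIM, FEAT_DIM))
heads = {cls: rng.standard_normal((FEAT_DIM, EMB_DIM))
         for cls in ("traffic_sign", "pedestrian")}

def embed(features, cls):
    """Shared backbone features -> class-specific embedding."""
    shared = features @ backbone_w   # computed once, shared by all classes
    return shared @ heads[cls]       # class-specific embedding head
```

The same backbone output feeds every head, so adding a new use case means training only a new head (and its class-specific detector model), not a new backbone.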
[0032] In an example, for an entire image with a traffic sign as a class of target object in it, a downstream model is trained from 'traffic signs' cropped out of the full image. The detector model extracts the features of the full image from which the traffic sign was cropped out. Based on the distillation loss between the two models, the downstream model (teacher model) and the detector model (student model), the detector model is trained. Once this 'knowledge' is transferred to the student model, it is deployed for inference.
[0033] Referring to Figure 1, disclosed is a processor(100) to detect at least one target object in at least one image. The processor comprises a downstream model (103) , a detector model (104) and a feature extractor network (105) with plurality of convolutional layers. The processor is adapted to receive at least one image with at least one target object as an input.
[0034] The processor is further adapted to receive at least one target object sliced out of at least one image as an input and to build a downstream model (103). The downstream model is developed by extracting and learning plurality of features of target objects sliced out of plurality of images.
[0035] The processor is further adapted to build a detector model (104) by extracting and learning the features of said at least one image having said at least one target object, said detector model trained by the downstream model (103) through a knowledge distillation technique.
[0036] The processor is further adapted to extract features of at least one image with plurality of target objects belonging to one of plurality of classes, using the feature extractor network (105).
[0037] The detector model (104) obtains an embedding head specific to a class of the target object and extracts features of said embedding head using a layer from the plurality of convolutional layers of the feature extractor network. The detector model (104) is trained by the downstream model built for said class of the target object, by training the embedding head specific to the class of the target object.
[0038] Referring to Figure 2, in an inference, the detector model extracts and compares the features of the input query image (40) (query features) with an extracted feature map (41) of the target object. During inference, simply calculating the similarity scores (scores representing a measure of similarity between the query features and the feature map) between the query features and the feature map generates a lot of duplicate matches; in other words, duplicate features are detected between the query features and the feature map.
[0039] Therefore, the present disclosure further discloses a method and a system to eliminate duplicate features detected between the query features and the feature map.
[0040] In an example, during inference (42), the detector model uses an adaptive pooling technique to eliminate duplicate detections between the query features and the feature map. The detector model obtains an actual matrix of a plurality of similarity scores (43). The similarity scores represent a measure of similarity between the query features and the feature map.
[0041] The detector model obtains a matrix (44) by shifting (padding) the matrix of similarity scores by half (W/2) of a predefined window size (W) of a maximum pooling layer. The detector model then obtains a matrix of maximum similarity scores (45) by passing the matrix of the plurality of similarity scores through the maximum pooling layer of the predefined window size (W).
[0042] The detector model compares the matrix of maximum similarity scores with the actual matrix of the plurality of similarity scores calculated for the query features. Thus, a matrix (46) is obtained that is indicative of the object location in the query image. The detector model then concludes that the maximum similarity score values not matching the actual plurality of similarity scores are duplicates and eliminates them.
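The pad-pool-compare procedure of paragraphs [0040]-[0042] can be sketched as follows: pad the score matrix by half the window, take the maximum over each window, and keep only scores that equal their local maximum (i.e. the peaks); everything else is a duplicate. This is a minimal NumPy sketch with an assumed window size of 3 and the hypothetical name `suppress_duplicates`:

```python
import numpy as np

def suppress_duplicates(scores, window=3):
    """Eliminate duplicate detections in a similarity-score matrix.

    The matrix is padded by window//2 so the pooled output aligns
    with the input; each score is then compared with the maximum
    over its window x window neighbourhood. Scores that are not the
    local maximum are treated as duplicates and zeroed out.
    """
    pad = window // 2
    padded = np.pad(scores, pad, mode="constant",
                    constant_values=-np.inf)
    h, w = scores.shape
    pooled = np.empty_like(scores)
    for i in range(h):
        for j in range(w):
            pooled[i, j] = padded[i:i + window, j:j + window].max()
    # Keep only entries matching the pooled maximum (the peaks).
    return np.where(scores == pooled, scores, 0.0)
```

The surviving nonzero entries form the matrix indicative of object locations in the query image; each cluster of near-duplicate matches collapses to its single highest-scoring point.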
[0043] Referring to Figure 3, the same depicts a flowchart of the method to detect an object in an image.
[0044] In Fig 3, disclosed is a method to detect at least one target object in at least one image. The method comprises the steps of receiving at least one image with at least one target object as an input (301), by a processor, and receiving at least one target object sliced out of at least one image as an input (302), by the processor.
[0045] The method further comprises the step of building a downstream model by the processor (303), wherein the downstream model is developed by extracting and learning a plurality of features of target objects sliced out of a plurality of images (303a).
[0046] The method further comprises the steps of building a detector model by extracting and learning the features of said at least one image having said at least one target object (304) and training the detector model by the downstream model through a knowledge distillation technique (305).
[0047] The method further comprises the step of extracting features of at least one image with plurality of target objects belonging to one of plurality of classes, using a feature extractor network with plurality of convolutional layers (306).
[0048] The method further comprises the step of obtaining an embedding head specific to a class of the target object and extracting features of the embedding head, by the detector model, using a layer from the plurality of convolutional layers of the feature extractor network (306a).
[0049] The method further comprises the step of training the detector model by the downstream model built for said class of the target object by training the embedding head specific to the class of the target object (307).
[0050] Referring to Figure 2, in the disclosed method, multiple features of an input query image (40) are extracted by the detector model and compared with an extracted feature map (41) of the target object. In order to eliminate the duplicate features detected between the multiple features of the input query image and the feature map, the detector model obtains an actual matrix of a plurality of similarity scores (43) representing a measure of similarity between the multiple features of the query image and the feature map. The matrix of the plurality of similarity scores (43) so obtained is shifted by half of a predefined window size of a maximum pooling layer (in an example, for a window size of 4x4, the matrix will be shifted by 2x2 through padding). A matrix of maximum similarity scores (45) is obtained by passing the matrix of the plurality of similarity scores through the maximum pooling layer of the predefined window size. The matrix of maximum similarity scores (45) is compared with the actual matrix of the plurality of similarity scores (43), and duplicates are eliminated by comparing the maximum similarity score values with the actual plurality of similarity scores.
[0051] Optionally, the disclosure further provides a provision for heatmap based training for an offline search. Typically, a heatmap is a type of graphical representation of data that consists of a set of cells, in which each cell is painted a specific color according to a specific value attributed to the cell (the term "heat" in this context is seen as a high concentration of geographical objects in a particular place).
[0052] During the inference, from the feature maps generated by the detector model, heat maps (visualizing a probability of existence of the target object at a location in an image) are generated by the detector model to detect target objects from full images.
[0053] Typically, to facilitate search, all feature maps generated by the detector model can be stored. However, this takes up too much space. Instead, only the foreground features (representative of the location of the target object) can be stored, while the background features can be ignored. These foreground features are stored as foreground heat maps, which visualize the areas most likely to contain the target objects. With this, when a new query image comes, model inference does not have to be done again, and a large number of query images can be searched in real time. Learnings from this foreground heat map based detection are not transferred to the feature extractor network (no gradient propagation), so as to generate more accurate and unbiased embeddings.
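The foreground-only storage described above can be sketched as below: features are kept only where the heatmap indicates a likely target location, and background positions are discarded. The function name `store_foreground` and the 0.5 threshold are illustrative assumptions:

```python
import numpy as np

def store_foreground(feature_map, heatmap, threshold=0.5):
    """Retain only foreground features for offline search.

    feature_map: H x W x D features from the detector model.
    heatmap:     H x W probabilities of the target object's presence.
    Returns the retained (row, col) coordinates and their feature
    vectors; background features (below threshold) are discarded
    to save storage space.
    """
    mask = heatmap >= threshold
    coords = np.argwhere(mask)
    return coords, feature_map[mask]
```

At query time, only these stored foreground features need to be matched against the query features, which is what makes searching a large set of pre-processed images feasible in real time.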
[0054] Therefore, the disclosed method also comprises the method of storing the features of the plurality of target objects located in the foreground of at least one image as plurality of foreground heat maps.
Claims: We Claim:
1. A method (300) to detect at least one target object in at least one image, the method comprising the steps of:
-receiving at least one image with at least one target object as an input (301), by a processor;
-receiving at least one target object sliced out of at least one image as an input (302) , by the processor;
-building a downstream model by the processor (303),
characterized in that:
developing the downstream model by extracting and learning plurality of features of target objects sliced out of plurality of images (303a);
-building a detection model by extracting and learning the features of said at least one image having said at least one target object (304); and
-training the detection model by the downstream model through a knowledge distillation technique (305).
2. The method (300) as claimed in Claim 1, wherein, extracting features of at least one image with plurality of target objects belonging to one of plurality of classes, using a feature extractor network with plurality of convolutional layers (306).
3. The method as claimed in Claim 2, wherein, obtaining an embedding head specific to a class of the target object and extracting features of the embedding head, by the detector model, using a layer from the plurality of convolutional layers of the feature extractor network (306a).
4. The method as claimed in Claim 3, wherein training the detector model by the downstream model built for said class of the target object by training the embedding head specific to the class of the target object (307) .
5. The method (300) as claimed in Claim 1, wherein , extracting multiple features of an input query image (40) by the detector model and comparing with an extracted feature map (41) of the target object.
6. The method (300) as claimed in Claim 5, wherein, eliminating the duplicate features detected between the multiple features of the input query image and the feature map, by the detector model,
the method comprising the further steps of:
- Obtaining an actual matrix of plurality of similarity scores (43) representing a measure of similarity between the multiple features of query image and the feature map;
- Shifting the matrix of plurality of similarity scores (43) obtained, by half of a predefined window size of a maximum pooling layer;
- obtaining a matrix of maximum similarity scores by passing the matrix of plurality of similarity scores through the maximum pooling layer of the pre-defined window size;
- Comparing the matrix of maximum similarity scores (45) with the actual matrix of plurality of similarity scores (43) calculated ; and
- Eliminating said duplicates by comparing the maximum similarity score values with the actual plurality of similarity scores.
7. The method (300) as Claimed in Claim 2, wherein , storing the features of the plurality of target objects located in the foreground of at least one image as plurality of foreground heat maps.
8. A processor (100) to detect at least one target object in at least one image, the processor comprising:
a downstream model (103);
a detector model (104);
a feature extractor network (105) with plurality of convolutional layers;
the processor (100) is adapted to:
-receive at least one image with at least one target object as an input;
-receive at least one target object sliced out of at least one image as an input;
-build a downstream model (103) by the processor,
in that:
the downstream model(103) is developed by extracting and learning plurality of features of target objects sliced out of plurality of images;
-build a detection model (104) by extracting and learning the features of said at least one image having said at least one target object, said detection model trained by the downstream model through a knowledge distillation technique;
- extract features of at least one image with plurality of target objects belonging to one of plurality of classes, using the feature extractor network (105).
9. The processor as claimed in Claim 8, wherein, the detector model (104) obtains an embedding head specific to a class of the target object and extracts features of said embedding head using a layer from the plurality of convolutional layers of the feature extractor network (105).
10. The processor as claimed in Claim 9 , the detector model is trained by the downstream model (103) built for said class of the target object, by training the embedding head specific to the class of the target object.