Abstract: Learning robust classifiers requires having sufficient data for training, especially when training is carried out with CNNs. Moreover, when a new vehicle model is released to the market, learning to recognize it means collecting many images of it and then training again. This is why we are interested here in using 3D models to learn a classifier that can subsequently classify images: to learn a class of vehicles, it would then suffice to use the corresponding 3D model. A CNN-based vehicle make and model classifier is learned on synthetic contour renderings generated from these 3D models. To classify an actual vehicle at test time, the contour map of the image is computed. This map is then fed to the classifier learned on synthetic data. Unfortunately, this type of approach does not work. The main reason for this failure lies in overfitting and domain adaptation: a CNN learned on synthetic contour data over-learns structures that are not present in actual contour images, so the learned CNN does not generalize to real data. To overcome this problem, we investigated a hybrid solution. More specifically, we first propose to learn discriminating visual characteristics for the recognition of vehicle makes and models using real data and a CNN. Secondly, these characteristics are used to learn a classifier of vehicle makes and models on synthetic contour images. This classifier can then be used to recognize the make and model of a vehicle in a real color image and to associate it with the corresponding 3D model. The originality of the approach lies in the properties of the visual characteristics learned from real images: they are invariant to color and to contours. In other words, the deep characteristics extracted from a color image and from its contour map are very similar.
Preamble to the Description
Discrimination of visual characteristics for the recognition of brands and models of vehicles using real data and CNN
In this work, we propose to deal with one of the issues of fine vehicle analysis: fine classification. Specifically, we want to build a system capable of returning the make and model of a vehicle from an image of it. Such a system involves creating classifiers capable of describing and separating classes that are often very similar in appearance. Indeed, vehicles of different makes and models may be very close visually, which makes it difficult to distinguish them. In addition to this inter-class similarity, the appearance of vehicles varies depending on several other factors. For example, the point of view of the vehicle affects its appearance: a car seen from the front or from the rear does not have the same visual characteristics. Color also changes the way a vehicle is perceived but is not a sufficient characteristic for fine classification: the same vehicle model can come in different colors. Finally, the environment can introduce noise to vehicle images due to their specular properties. The objective of fine classification is to find discriminant descriptors sufficiently invariant to all these disturbances. CNN-based approaches have been tested on this task and provide initial results. An interesting conclusion from that work is that CNNs can learn multi-view descriptors for fine recognition; in other words, the authors highlight the ability of CNNs to be viewpoint invariant for fine classification. However, learning robust classifiers involves having enough data for training, especially if training is done with CNNs. Also, when a new vehicle model is released, learning to recognize it involves collecting many images of it and then starting the learning again. This is why we are interested here in the use of 3D models to learn a classifier allowing, subsequently, to classify images. The use of 3D models for the fine classification of vehicles has several advantages. The first is the ability to generate large synthetic databases effortlessly with annotations to train robust classifiers. In addition, the use of 3D models makes it possible to overcome the ambiguity on the version and the year of the vehicle: all the images generated from the 3D model will correspond exactly to the make, the model, the version and the year of the vehicle. This is not the case in real databases, where the versions and years of the different makes and models are generally concatenated to form a single class. Using 3D models would allow going to an ultra-fine level of precision on the type of vehicles while guaranteeing a large amount of data.
Description
Discrimination of visual characteristics for the recognition of brands and models of vehicles using real data and CNN
1.1 Intuitive solution
As said before, we are interested in using 3D models to train a fine classifier that can be applied to real images. To learn a class of vehicles, it would then suffice to use the corresponding 3D model. In other words, given a real image as input, this classifier would make it possible to find the make and model of the car and thus the corresponding 3D model. We have a database of non-textured 3D models that allows us
to generate non-photorealistic contour renderings. It is a very simple rendering process that makes it possible to focus on the shape of the vehicle (its contours) without needing to generate complex lighting conditions or color renditions. The basic intuition of this work is that a significant part of the 3D contours can be found in contour maps computed on real images with gradient-based contour detectors. An intuitive approach (illustrated in Figure 1) is to generate 3D contour-rendering training examples from a database of 3D models. A CNN-based vehicle make and model classifier is then learned using these synthetic examples. To classify an actual vehicle at test time, the contour map of the image is computed. This map is then fed to the classifier learned on the synthetic data. Unfortunately, this type of approach does not work (as demonstrated in the experiments). The main reason for this failure lies in overfitting and domain adaptation: a CNN learned on synthetic contour data over-learns structures that are not present in actual contour images, and as such the learned CNN is not generalizable to real data.
1.2 Proposed solution
To overcome this problem, we investigated a hybrid solution. More specifically, we first
propose to learn discriminating visual characteristics for the recognition of vehicle makes and models using real data and a CNN. Secondly, these characteristics are used to learn a classifier of makes and models of vehicles on synthetic contour images. This classifier can then be used to recognize the make and model of a vehicle in a real color image and to associate the corresponding 3D model with it. The originality of the approach lies in the properties of the visual characteristics learned from real images. These characteristics are invariant to color and to contours. In other words, the deep characteristics extracted from a color image and from its contour map are very similar. These characteristics have several advantages:
• They make it possible to extract the contour structures that are present in a real image and discriminating for fine recognition. In this way, using these features to train a classifier on synthetic contour images circumvents the problem of overfitting to 3D contours.
• They produce a very rich description of the vehicles because they were learned on two very different domains (colors and contours). In other words, the combined use of color and contour images during training improves performance over a CNN learned only on contour images. Our learning is divided into two stages.
1. First, we learn color and contour invariant descriptors using a CNN and the CompCar database. Here we consider two domains: color images and contour images. Our method makes it possible to directly find a common representation for the two domains. This differs from domain adaptation methods, which consist in finding the transformation between the training data and the test data, generally by minimizing the distance between the distributions of the characteristics of the two domains. With this work, we show the ability of neural networks to find the common representation between two types of data that are naturally very distant from each other.
2. Secondly, these color and contour invariant descriptors are calculated on synthetic contour images and used to learn a linear synthetic classifier of vehicle makes and models. We show that this classifier gives very good results when applied to real color images.
Description with full specification
Discrimination of visual characteristics for the recognition of brands and models of vehicles using real data and CNN
1. Synthetic contours and contours of real images
The work presented in this section is based on (1) synthetic data obtained by non-photorealistic contour-type renderings and (2) contour maps obtained from real images.
1.1 Generation of synthetic contours by 3D rendering
The goal of contour-type 3D rendering is to draw certain edges of the 3D model in a rendering window. The edges to be drawn are the edges connecting two faces whose normals form an angle greater than a certain threshold (for example 30 degrees). These edges are called salient edges (or sharp edges). This type of rendering depends little on the quality of the 3D model because the salient contours are relatively invariant to the resolution of the mesh. Figure 2 shows the principle of the computation of salient contours: two faces of a 3D model are shown, with their normals in green. The angle θ is the angle between these two normals. If this angle is greater than a certain threshold, the edge between the two faces (in blue) is considered salient; in this case it is drawn in the 3D rendering window. Figure 3 shows an example of a non-photorealistic contour-type rendering.
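As an illustration, the salient-edge test can be sketched directly on an indexed triangle mesh. The following Python sketch is hypothetical (the array layout and helper names are ours, not part of the filing); it assumes `vertices` is a (V, 3) float array and `faces` an (F, 3) integer array:

```python
# Sketch: extracting the salient ("sharp") edges of a triangle mesh,
# i.e. edges shared by two faces whose normals differ by more than a
# threshold angle (30 degrees by default, as in the description above).
import numpy as np

def face_normals(vertices, faces):
    """Unit normal of each triangular face."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def sharp_edges(vertices, faces, angle_deg=30.0):
    """Return the list of salient edges as (vertex_index, vertex_index) pairs."""
    normals = face_normals(vertices, faces)
    edge_to_faces = {}
    for f_idx, face in enumerate(faces):
        for a, b in ((0, 1), (1, 2), (2, 0)):
            edge = tuple(sorted((face[a], face[b])))
            edge_to_faces.setdefault(edge, []).append(f_idx)
    threshold = np.cos(np.radians(angle_deg))
    result = []
    for edge, f_list in edge_to_faces.items():
        if len(f_list) == 2:
            cos_theta = np.clip(np.dot(normals[f_list[0]], normals[f_list[1]]), -1.0, 1.0)
            if cos_theta < threshold:  # angle between normals exceeds the threshold
                result.append(edge)
    return result
```

The returned edges are the ones a contour renderer would draw; boundary edges (owned by a single face) are ignored here for simplicity.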
1.2 Edge detection in a real image
There are many algorithms for detecting edges in an image and there are different ways of approaching this problem.
1. A first set of methods proposes image-processing approaches, and
2. A second set of methods proposes approaches based on machine learning.

The latter require an annotated training base in order to learn to classify which pixels of the image correspond to contours, for example with CNNs. In this work we are particularly interested in edge detection algorithms based on image processing. The best known of these is the Canny algorithm, which is based on the orientation and magnitude of the gradient in an image. In this algorithm, the image is first smoothed with a Gaussian kernel (in order to eliminate noise). The gradient is then computed using Sobel convolution filters, and a non-maximum suppression step is carried out in the direction of the gradient. Finally, a thresholding on the absolute value of the gradient magnitude is applied. The contour map returned by the Canny algorithm is a binary map. This algorithm is very sensitive to noise because the absolute value of the gradient magnitude is not sufficient to characterize a contour.
Based on this observation, a contour detector computes a saliency map from the gradient map, using its spatial distribution. This makes it possible to highlight pixels which do not necessarily have a very high gradient but which nevertheless belong to contours. Segments are created from the saliency map and are assigned a saliency score; thresholding is then carried out on these segments. In our work, we used this edge detector for its simplicity and its robustness to noise. Nevertheless, it might be interesting to investigate other algorithms.
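Although the filing relies on a saliency-based detector, the gradient pipeline described above is easy to reproduce with a standard Canny baseline. A minimal OpenCV sketch follows; the file names and thresholds are illustrative assumptions, not values from the filing:

```python
# Sketch: Canny edge map of a real vehicle image with OpenCV.
# cv2.Canny internally performs Sobel gradients, non-maximum suppression
# and hysteresis thresholding on the gradient magnitude.
import cv2

image = cv2.imread("car.jpg", cv2.IMREAD_GRAYSCALE)    # hypothetical input path
blurred = cv2.GaussianBlur(image, (5, 5), sigmaX=1.4)  # smooth to reduce noise
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)  # binary contour map
cv2.imwrite("car_edges.png", edges)
```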
2. Recognition model of vehicle brands and models
In this section the classifier of vehicle makes and models is presented. The goal is to learn a classifier on synthetic images that can be tested on real color images. We first built a database called 2D3DCar made up of training and testing data. The learning base (called 3DCar) consists of N 3D models of cars. The test base (called 2DCar) consists of real color images that exactly match 3DCar models. The 2D3DCar base is shown in Figure 4 (c).
The learning is divided into two phases.
1. First, we use the CompCar database, composed of M classes, in order to learn descriptors that are discriminative for fine recognition but also common to both domains (colors and contours).
2. Secondly, we use these descriptors to learn a classifier only on synthetic contour images generated with the 3DCar database.
The generation of these data is carried out by performing contour renderings of each 3D model of the 3DCar database according to several points of view and several intrinsic camera parameters. Figure 4 illustrates the entire system.
2.1 Descriptors common to color images and contours
The main objective of this step is to find a common representation in the descriptor space between contour maps and color images. In addition, these descriptors must be discriminating to recognize the make and model of vehicles. Using CompCar, we generate a contour map for each of the images using the contour detector described above. Thus, we obtain from CompCar two databases:
1. The color database {XColor, YColor} and
2. The contour database {XEdge, YEdge}
where X corresponds to the data and Y ∈ {1, …, M} to the class labels. Deep convolutional neural networks can be defined by two sets of parameters: the feature extraction parameters θ_f (typically the parameters of the convolutional layers) and the classification parameters θ_cl (typically the fully connected layers of a multilayer perceptron taking as input the characteristics extracted by the convolutional layers). Figure 4 (a) shows the learning of the descriptors common to colors and contours on the CompCar database: this yields the common feature extraction parameters θ_f and the classification parameters θ_cl1. Figure 4 (b) shows the learning of the fine synthetic classifier on synthetic contour images generated with the 3DCar database; in this step, the classification parameters θ_cl2 are learned while keeping θ_f fixed. Figure 4 (c) shows the 2D3DCar database, and Figure 4 (d) the network test on the images of the 2DCar base.
We define a neural network that is completely shared between the two types of data (the parameters of this network are the same for both domains). Figure 4 (a) illustrates this learning phase. The optimization of the network parameters (θ_f, θ_cl1) is carried out by minimizing the following cost function:

min over (θ_f, θ_cl1) of  L(XColor, YColor; θ_f, θ_cl1) + L(XEdge, YEdge; θ_f, θ_cl1)

where L corresponds to the classic Softmax cost function widely used for classification problems. No weighting is used so as not to favor one domain over the other. This approach can be likened to Siamese neural networks, where the objective is to minimize the distance (L2 norm) in descriptor space between two similar images. However, unlike Siamese networks, we reduce the distance between images representing the same vehicle (across both domains) while learning discriminant descriptors for fine recognition. Unlike domain adaptation methods, we do not try to map one data type onto the other; we propose instead to directly find the common representation that is relevant for both types of data.
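A minimal PyTorch sketch of this joint optimization, assuming paired CompCar loaders that yield a color image, its contour map (replicated to three channels) and the class label; the GoogLeNet handling and hyper-parameters are our assumptions, not the filing's exact settings:

```python
# Sketch: shared-weights optimization of (theta_f, theta_cl1) on both domains.
import torch
import torch.nn as nn
from torchvision import models

M = 430  # number of CompCar classes

net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
net.aux_logits = False            # ignore auxiliary heads for clarity (assumption)
net.aux1 = net.aux2 = None
net.fc = nn.Linear(net.fc.in_features, M)   # classification parameters theta_cl1

criterion = nn.CrossEntropyLoss()           # softmax cost L
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

def train_step(x_color, x_edge, y):
    """One unweighted step of L(XColor, .) + L(XEdge, .) with shared weights."""
    optimizer.zero_grad()
    loss = criterion(net(x_color), y) + criterion(net(x_edge), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the two loss terms share every parameter, each step pulls the descriptors of a color image and of its contour map toward the same class region, which is exactly the common-representation objective above.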
2.2 Synthetic classifier
Once the descriptors common to the colors and to the contours have been learned, we use them to train a classifier only on the synthetic contour images {Xsynth, Ysynth} generated with the 3D models of the 3DCar database, with Y ∈ {1, …, N}. We fix the descriptor extraction parameters θ_f learned previously and we train new classification parameters θ_cl2 corresponding to the N labels of our 3D models. The new cost function to be minimized is defined as follows:

min over θ_cl2 of  L(Xsynth, Ysynth; θ_f, θ_cl2),  with θ_f fixed.
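Continuing the previous sketch, stage 2 freezes θ_f and trains a fresh head θ_cl2 on synthetic contour renderings only. This is an assumption-level sketch reusing `net` and `criterion` from the block above:

```python
# Sketch: freeze theta_f, learn theta_cl2 on {Xsynth, Ysynth} only.
N = 35  # number of 3D models in 3DCar

for p in net.parameters():
    p.requires_grad = False                    # fix descriptor parameters theta_f
net.fc = nn.Linear(net.fc.in_features, N)      # fresh trainable head theta_cl2
net.eval()                                     # also freeze batch-norm statistics

opt2 = torch.optim.SGD(net.fc.parameters(), lr=1e-3, momentum=0.9)

def train_step_synth(x_synth, y_synth):
    """One step of L(Xsynth, Ysynth; theta_f, theta_cl2)."""
    opt2.zero_grad()
    loss = criterion(net(x_synth), y_synth)
    loss.backward()
    opt2.step()
    return loss.item()
```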
3 Experiments

3.1 Databases
In these experiments, we consider two databases:
1. CompCar and
2. 2D3DCar.
The CompCar database provides 36,431 training images and 15,627 test images corresponding to M = 430 classes. The 3DCar database is made up of N = 35 3D models and the 2DCar database of 402 images (approximately 10 images per make and model). We generated 3,452 contour renderings for each 3D model; the total number of examples in the synthetic database is therefore 124,413. To obtain these examples, we varied the following parameters:
• The azimuth of vehicles between 0 and 360 degrees with a step of 1 degree.
• The elevation of vehicles between 0 and 10 degrees with a step of 1 degree.
• The position of the vehicles relative to the camera: from 4 to 10 meters in z (with a step of 2 meters), from -1 to 1 meters in y (with a step of 0.25 meters), and from -0.5 to 0.5 meters in x (with a step of 0.5 meters).
As the resulting images are cropped to their 2D bounding boxes, we did not vary the camera's focal length settings: the perspective effects linked to the focal length value are simulated by the variation in z. Noise was added to the rendered images using background images from PASCAL VOC 2007. For this, the contour detector is applied to these background images and the resulting contour maps are randomly composited with the contour renderings of the 3D models. For added realism, noise is added in this way both outside and inside the vehicles.
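A possible implementation of this compositing step, under the assumption that both the rendering and the PASCAL VOC background have already been turned into binary edge maps of the same size; the random-thinning probability is our own illustrative choice:

```python
# Sketch: overlaying background-clutter edges onto a synthetic contour rendering.
import numpy as np

def composite_with_clutter(render_edges, background_edges, keep_prob=0.5, seed=None):
    """Overlay a randomly thinned background edge map on a contour rendering.

    Both inputs are 2D uint8/bool arrays of the same shape; the output is a
    binary edge map scaled to {0, 255}.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(background_edges.shape) < keep_prob   # random subset of clutter
    clutter = np.logical_and(background_edges > 0, keep)
    return (np.logical_or(render_edges > 0, clutter).astype(np.uint8)) * 255
```

Because the clutter is overlaid everywhere, noise lands both outside and inside the vehicle silhouette, as described above.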
3.2 Results
Seven deep networks were trained to evaluate, compare and validate our new approach.
We denote each network by N_dtf^dtc, where the superscript dtc and the subscript dtf correspond to the training data used for the classifier and for the computation of the descriptors respectively (colors + contours: ce; colors: c; contours: e; synthetic contours: s). All networks were learned using the GoogLeNet architecture. For the training of the contour and color invariant descriptors we initialized the network with the weights of GoogLeNet pre-trained on the ImageNet base (fine-tuning). The type of training data for each learned network is detailed in Table 1 and Table 2. The evaluation metric used is the standard performance metric for classification: the percentage of correctly classified test-base examples.
3.2.1 Visualization of common characteristics.
Colors / Synthetic. We propose here to visualize the descriptors invariant to colors and to contours. Using the N_ce^ce network, we extracted the descriptors on the synthetic contour images and on the real color images (coming from 3DCar and 2DCar) and plotted their distributions using the t-SNE (t-Distributed Stochastic Neighbor Embedding) dimensionality reduction in Figure 5. This algorithm reduces high-dimensional vectors so that they can be visualized in 2D; it is based on minimizing the Kullback-Leibler divergence between the sought low-dimensional distribution and the high-dimensional one. We can see that the two domains (real colors and synthetic contours) are very close in descriptor space while remaining discriminating for the recognition of vehicle makes and models. This shows that the descriptors learned on contours extracted from real images and on color images generalize to synthetic contour images. Moreover, it demonstrates their power of generalization for fine recognition: even if the descriptors are learned on a database with specific makes and models (the classes of CompCar), they can be applied to other makes and models of cars while remaining discriminating for these new classes (the 2D3DCar classes).
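The Figure 5 visualization can be reproduced with scikit-learn's t-SNE. The sketch below assumes the descriptors of each domain have already been extracted into (n, d) arrays; all names are hypothetical:

```python
# Sketch: joint t-SNE embedding of real-color and synthetic-contour descriptors.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats_color, labels_color, feats_synth, labels_synth):
    """Project both descriptor sets into 2D and plot them (cf. Figure 5)."""
    feats = np.vstack([feats_color, feats_synth])
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
    n = len(feats_color)
    plt.scatter(emb[:n, 0], emb[:n, 1], c=labels_color, marker="o", label="real colors")
    plt.scatter(emb[n:, 0], emb[n:, 1], c=labels_synth, marker="^", label="synthetic contours")
    plt.legend()
    plt.show()
```

Embedding both domains jointly (rather than separately) is what makes overlapping clusters meaningful: points of the same class from different domains land close together only if their descriptors are actually similar.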
Real colors / contours. We also qualitatively analyzed the outputs of the different descriptor extraction layers of the N_ce^ce network. To do so, we successively pass through this network a color image of CompCar and then its computed contour map. Figure 6 illustrates the outputs of different convolutional layers of the CNN. We observe that the deeper the observed layer, the more similar the outputs corresponding to these two images. This confirms that a common space of visual characteristics exists between the two domains.
3.2.2 Evaluation on CompCar
We evaluated and compared the performance of the N_ce^ce network (shown in Figure 7 (a)), learned over two domains (real colors and real contours), for recognizing the make and model of a vehicle. Table 1 gives the results for fine recognition on the CompCar test images (colors and contours). We can clearly see that our common descriptors learned with the N_ce^ce network perform very well on both data types, while N_c^c and N_e^e completely fail on the data type on which they have not been learned. In addition, it is very interesting to note that the N_ce^ce network provides better performance than the N_e^e network when these two networks are tested on real contour images (8% more for N_ce^ce). In other words, the N_e^e network, learned only on images of real contours, is less efficient on this same type of data. This proves the relevance of joint learning over the two domains: the color information makes it possible to better learn the parameters of the network and thus obtain a more efficient contour map classifier. Another possible explanation for this gain is the use of a network pre-trained on color images.
3.3 Evaluation on 2DCar
We also evaluated the performance of the N_ce^s synthetic classifier (section 2.2) on real color images. Table 2 gives the results for the recognition of vehicle makes and models using synthetic contour images for training. The test is carried out on the 2DCar base.
We can see that our N_ce^s classifier, which uses the N_ce^ce descriptors, significantly increases the classification score in comparison with the other networks. The results also show that learning a full network directly on synthetic images (N_s^s) is not effective on real images at test time (colors and contours): this network cannot generalize to this type of image because it focuses on synthetic structures that cannot be found in real images. We also learned classifiers using SVM and structured SVM, and the results are similar. Figure 7 shows the rank curves for these networks. The abscissa axis represents the rank [1, …, N] and the ordinate axis the recognition performance of vehicle makes and models. Curves of the same color are associated with the same network but tested on different types of data: contours (dotted lines) and colors (solid lines). The magenta curves correspond to the network learned with our approach and the black curve to chance. Figure 7 (a) shows the rank curves for CompCar and Figure 7 (b) the rank curves for 2DCar. Figure 8 illustrates results returned by our N_ce^s network: the convolutional neural network takes as input a real color image from the 2DCar base (first column) and produces a classification confidence score for each 3D model of the 3DCar base.
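The rank curves of Figure 7 are cumulative match accuracies, which can be computed from the classifier's confidence scores as follows; the array names are assumptions of this sketch:

```python
# Sketch: rank-k recognition rates (cumulative match curve) from scores.
import numpy as np

def rank_accuracies(scores, labels):
    """rank_accuracies(...)[k-1] = fraction of test images whose true class
    is among the k highest-scoring classes.

    scores: (n_images, n_classes) confidence scores, labels: (n_images,) ints.
    """
    order = np.argsort(-scores, axis=1)                  # classes sorted by score
    ranks = np.argmax(order == labels[:, None], axis=1)  # 0-based rank of true class
    n_classes = scores.shape[1]
    return np.array([(ranks < k).mean() for k in range(1, n_classes + 1)])
```

Rank 1 is the standard classification accuracy reported in the tables; the curve reaching 1.0 quickly indicates that even misclassified vehicles receive high scores for their true model.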
4 Conclusion
The approach proposed in this work makes it possible to compute descriptors common to color images and contour images. These descriptors are very discriminating for the recognition of vehicle makes and models in both domains. Once these descriptors have been learned, we use them to train a fine classifier on synthetic contour images that can be applied to real color images. The strength of the approach is that it is easy to generate many synthetic images for training without any labeling effort. We give results proving the effectiveness of the method. This work shows the ability of CNNs to extract domain-invariant descriptors and demonstrates their power of generalization for fine recognition across several domains.
Claims
1. Synthetic contours and contours of real images of Figure 1.
2. Recognition model of vehicle brands and models of Figure 4.