
A System And Method Of 3D Object Detection

Abstract: A method and a system (100) for classifying an object in a 3D image is disclosed. In an embodiment, a depth map of a query image having a target object is obtained by an active region generator (ARG) (106), wherein the query image is a 3D image captured by a 3D imaging device (102). A depth map of a background image is obtained by the ARG (106). A point cloud corresponding to an active region is generated by the ARG (106) based upon the depth maps of the query image and the background image, the active region indicating a region comprising the target object. An object embedding corresponding to the target object is generated by a 3D object embedding generator (114) based upon the point cloud corresponding to the active region. The target object is classified by a classifier (116) into an object class of at least one object class based upon the generated object embedding. Fig. 1


Patent Information

Application #
202321073421
Filing Date
27 October 2023
Publication Number
33/2024
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

e-Infochips Private Limited
Building No. 2, Aaryan Corporate Park Near Shilaj Railway Crossing, Thaltej-Shilaj Road, Ahmedabad Gujarat 380054 India

Inventors

1. TRIPATHI, Man Mohan
68, Annamayya Enclave, Hyderabad-502032, Telangana, India
2. PANSARE, Pallavi Genu
S. No. 62, Vinayaknagar, Pimple Nilakh, Aundh Post, Pune-411027, Maharashtra, India
3. BHAPKAR, Vishal Vilas
Flat no A11, Chandrabhaga Society, Suncity road, Anand Nagar, Pune, Maharashtra – 411051 India
4. GUPTA, Amit
302, Kenilworth Block (bdg 5), Grand Forte Society, Sector Sigma 4, Greater Noida, Uttar Pradesh 201310 India

Specification

FORM 2
THE PATENTS ACT, 1970 (39 of 1970) & THE PATENTS RULES, 2003
COMPLETE SPECIFICATION (Section 10 and Rule 13)

TITLE OF THE INVENTION: A SYSTEM AND METHOD OF 3D OBJECT DETECTION

APPLICANT: e-Infochips Private Limited, an Indian company of the address Building No. 2, Aaryan Corporate Park, Near Shilaj Railway Crossing, Thaltej-Shilaj Road, Ahmedabad, Gujarat 380054, India

The following specification particularly describes the invention and the manner in which it is to be performed:

FIELD OF INVENTION

The present disclosure relates to image processing. More particularly, the present disclosure relates to a system and method for 3D object detection.

BACKGROUND OF INVENTION

In computer vision, scene understanding is one of the important tasks. This task is accomplished by segmentation, localization, and classification. Significant advancements have been made in 2D computer vision algorithms due to the accessibility of large datasets. However, 2D cameras cannot capture a realistic depiction of a scene due to the lack of depth information. Further, their perspective varies depending on the object's size, the camera's distance from it and the camera's viewing angle. This renders 2D computer vision inefficient for essential measurements in critical applications such as robotics, industrial automation, autonomous driving, volumetric estimation, home automation, etc.

To overcome the drawbacks of 2D computer vision, 3D imaging devices are used. The 3D imaging devices retain depth information when capturing a real-world scene. However, creating and annotating 3D datasets for training object detection and classification algorithms are complex tasks. Further, 3D datasets are not easily available. Consequently, 3D computer vision has not been widely deployed. Existing 3D computer vision techniques are assisted by 2D-based models, which require RGB inputs. However, RGB cameras are not suitable for applications where safety and privacy are a concern. Further, artificial intelligence (AI) based detection and classification algorithms are heavily dependent on data and training datasets. They perform poorly for an unknown/unseen object, and adding a new class of objects requires retraining on the entire dataset. Thus, there arises a need for a system and method for 3D object detection that overcomes the problems associated with the conventional techniques.

SUMMARY OF INVENTION

Particular embodiments of the present disclosure are described herein below with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are mere examples of the disclosure, which may be embodied in various forms. Well-known functions or constructions are not described in detail to avoid obscuring the present disclosure in unnecessary detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present disclosure in virtually any appropriately detailed structure.

The present disclosure relates to a method and a system for classifying an object in a 3D image. In an embodiment, the method includes obtaining, by an active region generator (ARG), a depth map of a query image having a target object, wherein the query image is a 3D image captured by a 3D imaging device.
The method further includes obtaining, by the ARG, a depth map of a background image, wherein the background image is a 3D image captured by the 3D imaging device and corresponds to an image of a background. The method further includes generating, by the ARG, a point cloud corresponding to an active region based upon the depth map of the query image and the depth map of the background image, the active region indicating a region comprising the target object, wherein the point cloud corresponding to the active region comprises points in a three-dimensional space. The method further includes generating, by a 3D object embedding generator, an object embedding corresponding to the target object based upon the point cloud corresponding to the active region. The method further includes classifying, by a classifier, the target object into an object class of at least one object class based upon the generated object embedding.

In an embodiment, the system includes an active region generator (ARG), a 3D object embedding generator and a classifier. The ARG is coupled to a 3D imaging device and is configured to obtain a depth map of a query image having a target object, wherein the query image is a 3D image captured by the 3D imaging device. The ARG is further configured to obtain a depth map of a background image, wherein the background image is a 3D image captured by the 3D imaging device and corresponds to an image of a background. The ARG is further configured to generate a point cloud corresponding to an active region based upon the depth map of the query image and the depth map of the background image, the active region indicating a region comprising the target object. The 3D object embedding generator is coupled to the ARG and is configured to generate an object embedding corresponding to the target object based upon the point cloud corresponding to the active region. The classifier is coupled to the 3D object embedding generator and is configured to classify the target object into an object class of at least one object class based upon the generated object embedding.

BRIEF DESCRIPTION OF DRAWINGS

The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the disclosure is not limited to the specific methods and instrumentalities disclosed herein. Moreover, those skilled in the art will understand that the drawings are not to scale.

Fig. 1 depicts a system 100 for 3D object detection, according to an embodiment of the present disclosure.
Fig. 2 depicts a point cloud of a scene captured using a 3D imaging device 102, according to an embodiment of the present disclosure.
Fig. 3A depicts a depth map of the environment captured by the 3D imaging device 102, according to an embodiment of the present disclosure.
Fig. 3B depicts a point cloud of the environment processed by a salient points extraction engine 104, according to an embodiment of the present disclosure.
Fig. 3C depicts a point cloud after applying thresholding by the salient points extraction engine 104, according to an embodiment of the present disclosure.
Fig. 3D depicts a point cloud after plane removal and outlier removal by the salient points extraction engine 104, according to an embodiment of the present disclosure.
Fig. 3E depicts a 3D bounding box for an object as estimated by a 3D bounding box estimator 108, according to an embodiment of the present disclosure.
Fig. 3F illustrates object point clouds extracted by the salient points extraction engine 104 for various exemplary objects, according to an embodiment.
Fig. 4 depicts a schematic of a training engine 118, according to an embodiment of the present disclosure.
Fig. 5A depicts an exemplary static background for generating an active region, according to an embodiment of the present disclosure.
Fig. 5B depicts an exemplary current scene for generating an active region, according to an embodiment of the present disclosure.
Fig. 5C depicts exemplary Boolean vectors R and C, according to an embodiment of the present disclosure.
Fig. 5D depicts an exemplary active region, according to an embodiment of the present disclosure.
Fig. 6A depicts a depth map of a background of the environment, according to an embodiment of the present disclosure.
Fig. 6B depicts a depth map of the environment having an object, according to an embodiment of the present disclosure.
Fig. 6C depicts an object point cloud of the environment, according to an embodiment of the present disclosure.
Fig. 6D depicts a point cloud of the static background of the environment, according to an embodiment of the present disclosure.
Fig. 6E depicts a point cloud of the object with the static background of the scene, according to an embodiment of the present disclosure.
Fig. 7 depicts a schematic of a 3D object embedding generator 114, according to an embodiment of the present disclosure.
Fig. 8 depicts an example classification of an object by a classifier 116, according to an embodiment of the present disclosure.
Fig. 9 depicts a flowchart of a method 900 for a training mode of the system 100, according to an embodiment of the present disclosure.
Fig. 10 depicts a flowchart of a method 1000 for extracting a point cloud corresponding to an object in a training image, according to an embodiment of the present disclosure.
Fig. 11 depicts a flowchart of a method 1100 for an inference mode of the system 100, according to an embodiment of the present disclosure.
Fig. 12 depicts a flowchart of a method 1200 for identifying an active region from a depth map of a query image, according to an embodiment of the present disclosure.

DETAILED DESCRIPTION OF THE ACCOMPANYING DRAWINGS

Prior to describing the invention in detail, certain words and phrases used throughout this patent document are defined: the terms "include" and "comprise", as well as derivatives thereof, mean inclusion without limitation; the term "or" is inclusive, meaning and/or; the phrases "coupled with" and "associated therewith", as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have a property of, or the like. Definitions of certain words and phrases are provided throughout this patent document, and those of ordinary skill in the art will understand that such definitions apply in many, if not most, instances to prior as well as future uses of such defined words and phrases. Reference throughout this specification to "one embodiment," "an embodiment," or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment.
Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment, but mean “one or more but not all embodiments” unless expressly specified otherwise. The terms “including,” “comprising,” “having,” and variations thereof mean “including but not limited to” unless expressly specified otherwise. An enumerated listing of items does not imply that any or all of the items are mutually exclusive and/or mutually inclusive, unless expressly specified otherwise. The terms “a,” “an,” and “the” also refer to “one or more” unless expressly specified otherwise. Although the operations of exemplary embodiments of the disclosed method may be described in a particular, sequential order for convenient presentation, it should be understood that the disclosed embodiments can encompass an order of operations other than the particular, sequential order disclosed. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Further, descriptions and disclosures provided in association with one particular embodiment are not limited to that embodiment, and may be applied to any embodiment disclosed herein. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed system, method, and apparatus can be used in combination with other systems, methods, and apparatuses. The embodiments are described below with reference to block diagrams and/or data flow illustrations of methods, apparatus, systems, and computer program products. It should be understood that each block of the block diagrams and/or data flow illustrations, respectively, may be implemented in part by computer program instructions, e.g., as logical steps or operations executing on a processor in a computing system. These computer program instructions may be loaded onto a computer, such as a special purpose computer or other programmable data processing apparatus to produce a specifically-configured machine, such that the instructions which execute on the computer or other programmable data processing apparatus implement the functions specified in the data flow illustrations or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including computer-readable instructions for implementing the functionality specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the data flow illustrations or blocks. Accordingly, blocks of the block diagrams and data flow illustrations support various combinations for performing the specified functions, combinations of operations for performing the specified functions and program instructions for performing the specified functions. 
It should also be understood that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, can be implemented by special purpose hardware-based computer systems that perform the specified functions or operations, or combinations of special purpose hardware and computer instructions. Further, applications, software programs or computer readable instructions may be referred to as components or modules. Applications may be hardwired or hardcoded in hardware or take the form of software executing on a general-purpose computer such that when the software is loaded into and/or executed by the computer, the computer becomes an apparatus for practicing the disclosure, or they are available via a web service. Applications may also be downloaded in whole or in part through the use of a software development kit or a toolkit that enables the creation and implementation of the present disclosure. In this specification, these implementations, or any other form that the disclosure may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the disclosure. Furthermore, the described features, advantages, and characteristics of the embodiments may be combined in any suitable manner. One skilled in the relevant art will recognize that the embodiments may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments. These features and advantages of the embodiments will become more fully apparent from the following description and appended claims, or may be learned by the practice of embodiments as set forth hereinafter.

The present disclosure proposes a system and a method for 3D object detection. Unlike conventional systems which rely on 2D data for 3D object detection, the proposed system is capable of detecting and classifying one or more objects (or objects of interest) in a 3D image of a scene captured by a 3D imaging device. The system processes a point cloud of the scene to detect and classify the one or more objects. In an embodiment, the system includes a salient points extraction engine which extracts points corresponding to the one or more objects using a non-learning-based method in a training mode. This overcomes the lack of large 3D data for learning-based approaches used in conventional systems. In an embodiment, the salient points extraction engine performs thresholding to remove undesired points from a depth map of the scene, removes one or more planar surfaces and removes outlier noise to generate a point cloud corresponding to one or more objects in the scene. The salient points extraction engine performs the thresholding using a thresholding parameter. According to an embodiment, the thresholding parameter may be set based upon prior knowledge of an environment in which the system may be deployed. In an embodiment, the system includes an active region generator (ARG) which generates a point cloud corresponding to an active region in the scene based upon the point cloud of the scene and a point cloud of a static background of the scene in an inference mode. The ARG exploits the fact that in real-world applications, the 3D imaging devices are often installed at a fixed location.
Consequently, a field of view (FOV) of the 3D imaging devices is static, and a part of the FOV where a change has occurred often includes object(s) of interest and is an active region, which needs to be further analyzed. For example, in an application where a volume of a cardboard box moving on a conveyer is to be determined, only a frame in a real-time feed corresponding to when the cardboard box is present on the conveyer belt is to be processed, and a region in the frame that includes the cardboard box corresponds to an active region. The ARG, too, processes 3D data (in the form of depth maps), unlike conventional systems, which use 2D data. The point cloud corresponding to the active region functions as a localizer during the inference mode of the system. The system also includes a 3D object embedding generator for generating object embeddings using the point cloud generated by the ARG and a classifier for classifying the one or more objects in the scene using the generated object embeddings. The system employs a few-shot learning model. This overcomes the challenge of 3D training data scarcity encountered by the conventional systems. Further, due to the few-shot learning model, the system is able to classify new objects without requiring the 3D object embedding generator to be re-trained afresh using training data corresponding to the new object class. Thus, the proposed system is very efficient. Also, as the system reduces the number of points processed at each stage, the computational complexity decreases, leading to faster execution. Further, due to the reduced computational complexity, the system can also be deployed on edge devices, thereby leading to wider utility of the system. The proposed system may be deployed for various applications including, without limitation, home automation, autonomous vehicles, surveillance, robotics, industrial applications (such as object detection on a conveyer belt, volumetric estimation, targeted lighting, etc.), medical imaging, surgical robots, etc.

Fig. 1 depicts a system 100 for 3D object detection, according to an embodiment of the present disclosure. In an embodiment, the system 100 includes a 3D imaging device 102, a salient points extraction engine 104, an active region generator (ARG) 106, a 3D bounding box estimator 108, a database 110, an image selector 112, a 3D object embedding generator 114, a classifier 116, and a training engine 118. In an embodiment, the salient points extraction engine 104, the ARG 106, the 3D bounding box estimator 108, the image selector 112, the 3D object embedding generator 114 and the classifier 116 are executed by one or more computing devices. The one or more computing devices may include a general-purpose computer, a special-purpose computer, an application specific integrated circuit (ASIC), a cloud computing device, a server, a graphical processing unit (GPU), an edge device, etc. The one or more computing devices are coupled to the 3D imaging device 102 using a suitable interface or may communicate with the 3D imaging device 102 over a network. Examples of suitable interfaces include, without limitation, Gigabit Multimedia Serial Link (GMSL), Universal Serial Bus (USB), Camera Serial Interface (CSI), etc. The network may include, without limitation, a local area network, a wide area network, a private network, a wireless network (such as Wi-Fi), a cellular network (such as 2G, 3G, 4G, 5G, etc.), the Internet, or any combinations thereof.
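To make the component wiring described above concrete, the following is a minimal structural sketch of how the inference-path components of system 100 could be composed. The class and method names (ActiveRegionGenerator, EmbeddingGenerator, Classifier, run_inference) and all placeholder logic are assumptions for illustration only, not the actual implementation.

```python
import numpy as np


class ActiveRegionGenerator:
    """Stand-in for ARG 106: returns points of the region that changed
    relative to a stored background depth map (placeholder logic)."""
    def __init__(self, background_depth):
        self.background_depth = background_depth

    def generate(self, query_depth):
        ys, xs = np.nonzero(np.abs(query_depth - self.background_depth) > 0.05)
        return np.stack([xs, ys, query_depth[ys, xs]], axis=1)


class EmbeddingGenerator:
    """Stand-in for the 3D object embedding generator 114 (a learned network in the system)."""
    def embed(self, points):
        return np.concatenate([points.mean(axis=0), points.std(axis=0)])


class Classifier:
    """Stand-in for classifier 116: nearest reference embedding wins."""
    def __init__(self, references):
        self.references = references  # {class label: reference embedding}

    def classify(self, embedding):
        return min(self.references,
                   key=lambda label: np.linalg.norm(self.references[label] - embedding))


def run_inference(query_depth, arg, embedder, classifier):
    """Inference path of Fig. 1: depth map -> active region -> embedding -> object class."""
    return classifier.classify(embedder.embed(arg.generate(query_depth)))
```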
According to an embodiment, the system 100 may be integrated with a robot used for different domestic or industrial purposes. In an embodiment, the system 100 operates in two modes, namely, a training mode and an inference mode. In the training mode, the 3D object embedding generator 114 is trained using a set of training images for one or more object classes. In the inference mode, the 3D object embedding generator 114 and the classifier 116 are used to classify an object in a scene (hereinafter, interchangeably referred to as an image) captured by the 3D imaging device 102. Fig. 1 illustrates a schematic flow of how various components of the system 100 interact with each other in the training mode and the inference mode, according to an embodiment. The training mode and the inference mode are explained later.

The 3D imaging device 102 captures a 3D image of a scene. The 3D imaging device 102 may include an active 3D image sensor or a passive 3D image sensor. In an active 3D imaging sensor, a source illuminates an object using a light signal, for example, a modulated visible light or an infrared (IR) signal, and the reflection is captured by a receiver. The actual distance of the object is measured by calculating the time taken by the light signal to travel from the source to the object and back to the receiver (i.e., the round-trip time). Examples of active 3D imaging sensors include Light Detection and Ranging (LiDAR) sensors, Time of Flight (ToF) sensors, etc. The passive 3D imaging sensors typically include two imaging sensors, for example, a left and a right imaging sensor, to capture respective images of the object. The distance of the object is calculated using pixel disparity between the images captured by the two imaging sensors. Stereoscopic cameras are an example of the passive 3D image sensors. Other types of 3D imaging techniques that provide depth information of the object can also be used without deviating from the scope of the present disclosure.

In an embodiment, the 3D imaging device 102 outputs a point cloud for a scene (or a scene point cloud, represented herein by PC_scene). In another embodiment, the 3D imaging device 102 outputs a depth map which is processed by a computing device (not shown) to generate the point cloud for a scene. The point cloud of the scene (PC_scene) includes a plurality of data points in three-dimensional space representing each object, plane, etc. in the scene. Each data point represents a point in the scene and includes three spatial coordinates (namely, X, Y and Z coordinates). Each data point may also include one or more values indicating a color, an intensity, etc. The point cloud for an exemplary scene is shown in Fig. 2.

According to an embodiment, in the training mode, a set of training images having an object belonging to one or more object classes is captured by the 3D imaging device 102. In an example implementation, an object belonging to an object class of the one or more object classes is placed in front of the 3D imaging device 102. The environment where the object is placed in the training mode may be similar to an environment present during the inference mode of the system 100 (this may be done to improve the performance of the system 100) or may be a test environment. The 3D imaging device 102 captures a training image of a scene having the object and generates a corresponding depth map and/or point cloud.
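As noted above, a depth map output by the 3D imaging device 102 may be converted into the scene point cloud PC_scene. One common way to do this, assuming a pinhole camera model with known intrinsics (the fx, fy, cx, cy values below are illustrative, not taken from the specification), is to back-project each valid depth pixel:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metres) into an N x 3 point cloud.

    Assumes a pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    Pixels with zero/invalid depth are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32)
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)  # PC_scene: rows are (X, Y, Z)

# Illustrative usage with made-up intrinsics and a synthetic 480x640 depth map.
depth = np.full((480, 640), 2.0, dtype=np.float32)
pc_scene = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(pc_scene.shape)  # (307200, 3)
```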
This capture process may be repeated by placing the same object or different object(s) belonging to the same object class under different environmental conditions (e.g., change in the placement of the object, lighting condition, varying the size of the object, etc.) to generate one or more training images (and corresponding depth maps and/or point clouds) for the object class. Similarly, one or more training images are captured for other object classes of the one or more object classes. Without loss of generality, the point cloud for each training image may be represented by PC_scene.

In the training mode, the salient points extraction engine 104 is configured to extract a point cloud corresponding to the object (belonging to an object class) from the depth map and/or the point cloud corresponding to each training image. The salient points extraction engine 104 obtains the output of the 3D imaging device 102. The salient points extraction engine 104 is communicatively coupled to the 3D imaging device 102. In an embodiment, the salient points extraction engine 104 is configured to obtain the depth map of the training image from the 3D imaging device 102. The salient points extraction engine 104 is configured to identify points corresponding to object(s) of interest based upon the depth map of the training image and generate a point cloud corresponding to the object (hereinafter referred to as an object point cloud PC_obj). The object point cloud includes points in 3D space.

In an embodiment, the salient points extraction engine 104 is configured to remove points from the depth map of the training image based upon a first thresholding parameter t and generate a post-thresholding depth map. According to an embodiment, the first thresholding parameter t may include three thresholding parameters (t_x, t_y, t_z), with each value t_x, t_y, t_z representing a threshold in the X, Y and Z dimension, respectively. In an example implementation, the salient points extraction engine 104 removes all points having an X-coordinate greater than t_x. Similarly, the salient points extraction engine 104 removes all points having a Y-coordinate greater than t_y and a Z-coordinate greater than t_z. Thus, the thresholding is performed in all three dimensions based on the first thresholding parameter t. In an embodiment, the individual thresholding parameters (t_x, t_y, t_z) are estimated based on the prior knowledge of the environment. For example, if the object was placed within a distance of 2.5 m in the z-direction from the 3D imaging device 102 while capturing the training image, the user may set t_z to 2.5 m such that all points beyond 2.5 m may be removed from further processing. Similarly, t_x and t_y may be determined based upon such knowledge of the environment. Though it has been explained as an example that all points beyond the threshold (t_x, t_y, t_z) in each dimension are removed, in an embodiment, all points less than the threshold (t_x, t_y, t_z) in each dimension may be removed. In another embodiment, the threshold t may include (t_x1, t_x2, t_y1, t_y2, t_z1, t_z2) and all points outside the range defined by these values are removed. The salient points extraction engine 104 is configured to generate a point cloud after the thresholding step, hereinafter a post-thresholding point cloud (herein represented as PC_t1), based upon the post-thresholding depth map (i.e., the depth map after removing the points using the first thresholding parameter t).
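A minimal sketch of the thresholding just described, assuming PC_scene is an N x 3 array, is shown below; both the per-axis upper-bound variant and the range variant (t_x1..t_z2) are included, and the threshold values used are illustrative only.

```python
import numpy as np

def threshold_point_cloud(pc_scene, t_x, t_y, t_z):
    """Keep only points whose X, Y and Z coordinates are all within the thresholds."""
    keep = (pc_scene[:, 0] <= t_x) & (pc_scene[:, 1] <= t_y) & (pc_scene[:, 2] <= t_z)
    return pc_scene[keep]  # PC_t1

def threshold_point_cloud_range(pc_scene, x_range, y_range, z_range):
    """Range variant: keep points falling inside (min, max) per dimension."""
    keep = np.ones(len(pc_scene), dtype=bool)
    for dim, (lo, hi) in enumerate((x_range, y_range, z_range)):
        keep &= (pc_scene[:, dim] >= lo) & (pc_scene[:, dim] <= hi)
    return pc_scene[keep]

# Example: drop everything farther than 2.5 m from the camera in Z (illustrative values).
pc_scene = np.random.uniform(0.0, 4.0, size=(1000, 3)).astype(np.float32)
pc_t1 = threshold_point_cloud(pc_scene, t_x=1.5, t_y=1.5, t_z=2.5)
```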
Since points in the depth map are reduced, the thresholding results in a reduction of the search space from PC_scene to PC_t1. The first thresholding parameter t may be stored in the database 110 and retrieved by the salient points extraction engine 104. Though the exemplary implementation explained herein applies the first thresholding parameter t in three dimensions, the first thresholding parameter t may be applied in one dimension or two dimensions based upon the environment, and the same is within the scope of the present disclosure.

The salient points extraction engine 104 is configured to detect one or more planar surfaces in the post-thresholding point cloud (PC_t1). In an embodiment, the salient points extraction engine 104 uses the Random Sample Consensus (RANSAC) algorithm for detecting the one or more planar surfaces. The salient points extraction engine 104 may use any other suitable algorithm, such as plane segmentation or any variant of RANSAC, to detect the one or more planar surfaces in the point cloud PC_t1. The one or more planar surfaces may correspond to planar surfaces (other than the object) in the scene such as a ground plane, one or more walls, etc. As the number of points in PC_t1 used for detecting the one or more planar surfaces is smaller than in PC_scene due to the removal of points based upon the first thresholding parameter t, the algorithms (e.g., RANSAC) used for detecting the one or more planar surfaces run faster and are more accurate. Once the one or more planar surfaces are detected, the salient points extraction engine 104 is configured to remove the one or more planar surfaces from the post-thresholding point cloud (PC_t1). In an embodiment, the salient points extraction engine 104 removes the one or more planar surfaces by removing points corresponding to the one or more planar surfaces in the post-thresholding point cloud (PC_t1) to generate a point cloud after the one or more planar surfaces removal (hereinafter referred to as a post surface removal point cloud, represented by PC_t2).

The salient points extraction engine 104 is further configured to generate the object point cloud (PC_obj) based at least in part upon the post surface removal point cloud PC_t2. In an embodiment, the object point cloud (PC_obj) is the same as the post surface removal point cloud PC_t2. Optionally, or in addition, the salient points extraction engine 104 is configured to apply a 3D filter on the point cloud PC_t2 to remove noise and generate the object point cloud (PC_obj). The noise may correspond to one or more clusters that may be disjoint from points corresponding to the object in the point cloud PC_t2. The salient points extraction engine 104 may apply any known 3D filter such as, without limitation, radius outlier removal, statistical outlier removal, mode filter, etc. The 3D filter may be stored in the database 110 and retrieved by the salient points extraction engine 104. Further, the salient points extraction engine 104 may store various point clouds, such as PC_t1, PC_t2 and PC_obj, in the database 110. In an embodiment, the salient points extraction engine 104 is configured to annotate the PC_obj with a class label. The class label denotes a corresponding object class. In an embodiment, the class labels may be entered by a user.
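The plane-removal and noise-filtering steps described above can be sketched as follows. This is a bare-bones RANSAC plane fit plus a simple radius-style outlier filter written for illustration; the specification only names RANSAC and filters such as radius/statistical outlier removal, so the parameter values and helper names here are assumptions.

```python
import numpy as np

def ransac_plane(points, n_iters=200, dist_thresh=0.01, rng=None):
    """Return a boolean inlier mask for the dominant plane found by RANSAC."""
    rng = np.random.default_rng(rng)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(points), size=3, replace=False)
        p0, p1, p2 = points[idx]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        dists = np.abs((points - p0) @ normal)
        inliers = dists < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

def remove_planes(pc_t1, max_planes=2, min_inlier_ratio=0.2):
    """Iteratively remove dominant planar surfaces (e.g., ground, walls) from PC_t1."""
    pc = pc_t1
    for _ in range(max_planes):
        inliers = ransac_plane(pc)
        if inliers.sum() < min_inlier_ratio * len(pc):
            break
        pc = pc[~inliers]
    return pc  # PC_t2

def radius_outlier_filter(pc_t2, radius=0.05, min_neighbors=5):
    """Drop points with too few neighbours within `radius` (O(N^2); fine for small clouds)."""
    dists = np.linalg.norm(pc_t2[:, None, :] - pc_t2[None, :, :], axis=-1)
    neighbor_counts = (dists < radius).sum(axis=1) - 1   # exclude the point itself
    return pc_t2[neighbor_counts >= min_neighbors]       # PC_obj

# Illustrative usage on a random stand-in cloud.
pc_t1 = np.random.uniform(0.0, 2.0, size=(500, 3)).astype(np.float32)
pc_obj = radius_outlier_filter(remove_planes(pc_t1))
```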
For entering the class labels, the salient points extraction engine 104 may, for example, be configured to present a user interface on a user device for the user to enter a class label for the PC_obj and receive the user input denoting the class label. In an embodiment, the salient points extraction engine 104 may present the user interface to prompt the user to enter the class label before capturing the corresponding training image. In this case, the user enters the class label (say, 'cup') and then places the corresponding object for the 3D imaging device 102 to take the corresponding training image. In another embodiment, the salient points extraction engine 104 may present the user interface to enter the class label after generating the PC_obj. In this case, the salient points extraction engine 104 presents an image corresponding to the PC_obj and the user enters the corresponding class label. The salient points extraction engine 104 receives the class label entered by the user and annotates the corresponding PC_obj with the class label. Figs. 3A-3D illustrate the depth map and point clouds corresponding to different stages of the processing performed by the salient points extraction engine 104 for a training image, according to an embodiment. Fig. 3F illustrates the object point clouds extracted by the salient points extraction engine 104 for various exemplary objects, according to an embodiment.

The database 110 is configured to store the annotated point clouds for the set of training images. The annotated point clouds may be stored in any suitable structure and format. For example, the annotations may be saved in an XML format. Optionally, or in addition, a data augmenter (not shown) may be configured to augment the annotated point clouds using any known techniques for augmenting point cloud data. This may be done to introduce variation (e.g., rotation, flipping, etc.) and generate additional training data without capturing additional training images. The augmented point clouds and corresponding annotations may be stored in the database 110. The database 110 may be a hierarchical database, a relational database, a non-relational database, an object-oriented database, etc., or any combinations thereof.

In the training mode, the training engine 118 is configured to train the 3D object embedding generator 114 and/or the classifier 116 based upon the annotated point clouds stored in the database 110. The training engine 118 is communicatively coupled to the salient points extraction engine 104. The training engine 118 trains the 3D object embedding generator 114 to generate feature embeddings for the annotated point clouds. The feature embeddings may be multi-dimensional. The dimension of the feature embeddings depends upon the architecture of the 3D object embedding generator 114 and the target objects to be classified, and can be selected such that the feature embeddings are able to sufficiently represent features of the objects to be classified. The dimensions can be increased or decreased based upon the data complexity. In an embodiment, the training engine 118 trains the 3D object embedding generator 114 to generate 128-dimensional feature embeddings. The training engine 118 uses the object point clouds of an object set for each object class that needs to be detected in the inference mode. The object set may include the plurality of training images corresponding to the object class. Table 1 illustrates an exemplary object set used for training the 3D object embedding generator 114 according to an example implementation.
It should be noted that the number of training images for each object class and the types of objects of each class can be chosen based upon the object class, dataset size and dataset complexity.

Table 1: Exemplary Object Sets

Object Class     No. of training images    Types of objects
Mug              25                         5
Bottle           25                         5
Chair            25                         5
Paper roll       25                         5
Cardboard box    25                         5

In an embodiment, the training engine 118 includes an artificial neural network to be trained to generate the feature embeddings. The artificial neural network used in the training engine 118 includes two or more branches, with each branch having an instance of the 3D object embedding generator 114. The training engine 118 is configured to obtain an annotated point cloud for each of the two or more branches. The 3D object embedding generator 114 in each branch generates a corresponding training object embedding. The training engine 118 is configured to calculate a loss function based upon the training object embeddings of the two or more branches. The training engine 118 may consider one branch (e.g., a first branch without a loss of generality) as an anchor branch. The training engine 118 is configured to train the 3D object embedding generator 114 in the first branch based upon the loss function. The training engine 118 uses the loss function as follows. For example, when the first branch and another branch (e.g., a second branch without a loss of generality) of the two or more branches are provided with annotated object point clouds corresponding to objects belonging to the same object class, the distance between the training object embeddings of the first branch and the second branch is decreased. On the other hand, when the first branch and the second branch are provided with annotated object point clouds corresponding to objects belonging to different object classes, the distance between the training object embeddings of the first branch and the second branch is increased.

In one example implementation, the training engine 118 includes a Siamese Neural Network (SNN) having two branches, with each branch of the SNN including an instance of the 3D object embedding generator 114, as shown in Fig. 4. The SNN may be trained using a plurality of pairs of annotated point clouds corresponding to objects of the one or more object classes. Each pair may belong to the same object class or different object classes. The two training object embeddings are used to calculate a loss using a loss function. The value of the loss is then used for training the 3D object embedding generator 114 in a first branch (e.g., the top branch in Fig. 4). In an embodiment, a contrastive loss function may be used for training the SNN such that when both branches have point clouds corresponding to objects belonging to the same object class, the distance between the object embeddings is minimized, and when the branches have point clouds corresponding to objects belonging to different object classes, the distance between the object embeddings is increased. In another example implementation, the training engine 118 includes three branches. In this case, a plurality of triplets of annotated point clouds corresponding to objects of the one or more object classes is selected. The three training object embeddings are used to calculate losses as per a triplet-based loss function. These losses are used for training the 3D object embedding generator 114 in the first branch.
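A compact sketch of the pair-based training signal described above is given below, assuming PyTorch and a contrastive loss of the standard form; the margin value, batch size and the stand-in random embeddings are placeholders, since the specification does not fix them.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pull embeddings of same-class pairs together and push different-class pairs
    at least `margin` apart (same_class is 1.0 for matching pairs, 0.0 otherwise)."""
    d = F.pairwise_distance(emb_a, emb_b)
    loss = same_class * d.pow(2) + (1.0 - same_class) * torch.clamp(margin - d, min=0.0).pow(2)
    return loss.mean()

# Illustrative usage with random stand-in 128-dimensional embeddings
# (in the system these would come from the two branches of the SNN in Fig. 4).
emb_a = torch.randn(8, 128)
emb_b = torch.randn(8, 128)
same_class = torch.tensor([1., 1., 0., 0., 1., 0., 1., 0.])
pair_loss = contrastive_loss(emb_a, emb_b, same_class)

# Triplet variant (anchor, positive, negative) for the three-branch case.
triplet_loss = F.triplet_margin_loss(emb_a, emb_b, torch.randn(8, 128))
print(pair_loss.item(), triplet_loss.item())
```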
In the triplet case, for example, the loss between the first branch and each of the remaining branches is calculated and used for training the 3D object embedding generator 114 of the first branch so as to minimize the distance between the 3D embeddings of the first branch and the branch(es) provided with the same object class and to increase the distance between the 3D embedding of the first branch and the branch(es) having different object classes. For example, consider that the first branch is provided with an annotated point cloud of a first object of a first object class, a second branch is provided with an annotated point cloud of a second object of the first object class, and a third branch is provided with an annotated point cloud of a third object of a second object class. In this case, the 3D object embedding generator 114 in the first branch is trained such that the distance between the training object embeddings of the first branch and the second branch is minimized and the distance between the training object embeddings of the first branch and the third branch is increased.

The image selector 112, communicatively coupled with the training engine 118, is configured to randomly select appropriate point clouds for the two or more branches and send the selected point clouds and the corresponding class labels to the respective branches of the training engine 118. The image selector 112 may select the point clouds based upon similarities or dissimilarities of the objects in the point clouds. For example, in one iteration, the image selector 112 may select a point cloud representing a first chair and a point cloud representing a second chair (i.e., objects belonging to the same object class). In another iteration, the image selector 112 may select a point cloud representing a third chair and a point cloud representing a first mug (i.e., objects belonging to different object classes). Similarly, the image selector 112 selects the triplets of point clouds with different combinations of the same and different object classes for each of the two or more branches. The image selector 112 may ensure that the number of point clouds belonging to the same object class is approximately equal to the number of point clouds belonging to different object classes over the course of training. Once the learning converges, the 3D object embedding generator 114 is deployed for the inference mode.

According to an embodiment, the training engine 118 is configured to register a pre-defined number of object embeddings for each object class of the at least one object class. For example, once the 3D object embedding generator 114 converges, the training engine 118 may provide the pre-defined number of point clouds for each object class to the 3D object embedding generator 114, and the object embeddings generated by the 3D object embedding generator 114 are then registered. The pre-defined number of embeddings may be selected considering the dataset size, computational complexity, time needed for training, desired performance requirements, etc. In an example implementation, ten object embeddings are registered for each object class.

According to an embodiment, the system 100 detects and classifies one or more objects into the one or more object classes in the inference mode. As explained earlier, the 3D imaging device 102 captures a 3D image of a scene (hereinafter referred to as a query image) having a target object and generates a depth map of the query image.
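The registration step described above can be sketched as follows, together with one plausible way the registered embeddings could be used at inference: compare a query embedding against each class's registered embeddings and pick the closest. The minimum-Euclidean-distance rule and the stand-in embedding function are assumptions for illustration, not a quote of classifier 116's exact decision rule.

```python
import numpy as np

def register_embeddings(embed_fn, point_clouds_per_class, per_class=10):
    """Store `per_class` embeddings for every object class (the 'registration' step)."""
    registry = {}
    for label, clouds in point_clouds_per_class.items():
        registry[label] = np.stack([embed_fn(pc) for pc in clouds[:per_class]])
    return registry

def classify_by_registry(query_embedding, registry):
    """Assumed matching rule: return the class whose registered embeddings are
    closest (minimum Euclidean distance) to the query embedding."""
    best_label, best_dist = None, np.inf
    for label, embeddings in registry.items():
        dist = np.linalg.norm(embeddings - query_embedding, axis=1).min()
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

# Illustrative usage with a stand-in embedding function and random point clouds.
embed_fn = lambda pc: np.concatenate([pc.mean(axis=0), pc.std(axis=0)])
clouds = {"mug": [np.random.rand(200, 3) for _ in range(10)],
          "bottle": [np.random.rand(200, 3) for _ in range(10)]}
registry = register_embeddings(embed_fn, clouds)
label, dist = classify_by_registry(embed_fn(np.random.rand(200, 3)), registry)
```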
The active region generator (ARG) 106 is configured to obtain the output of the 3D imaging device 102, i.e., the depth map of the query image. The ARG 106 is communicatively coupled to the 3D imaging device 102. The ARG 106 is configured to identify an active region in the depth map of the query image and to generate an active region point cloud (herein represented by PC_ar). The active region may correspond to region(s) of the query image in which temporal changes may have occurred. In other words, the active region indicates a region having the target object. The point cloud corresponding to the active region includes points in a 3D space. In an embodiment, the ARG 106 is configured to obtain a depth map of a background image, to identify points corresponding to the active region, and to generate the active region point cloud based upon the depth maps of the query image and the background image. In an embodiment, the depth map of the background image may be obtained by capturing a 3D image of a static background using the 3D imaging device 102 during set-up of the system 100 and generating the corresponding depth map. The depth map of the background image may be stored in the database 110, and the ARG 106 is configured to retrieve it from the database 110. An exemplary static background is shown in Fig. 5A and an exemplary query image is shown in Fig. 5B.

According to an embodiment, the ARG 106 is configured to calculate a difference between the depth map of the query image (hereinafter denoted by FG) and the depth map of the background image (hereinafter denoted by BG). The ARG 106 is further configured to determine coordinates of the active region in the X and Y dimensions based upon the calculated difference between FG and BG. The coordinates in the X and Y dimensions define a 2D active region. The ARG 106 is configured to determine coordinates of the active region in the Z dimension based upon a depth map of the 2D active region. The ARG 106 is then configured to generate the active region point cloud (hereinafter denoted by PC_ar) based upon a depth map of the active region defined by the corresponding coordinates in the X, Y and Z dimensions.

An embodiment of calculating the difference between the depth maps FG, BG and determining the coordinates of the active region is now explained. The technique explained below is computationally more efficient and hence makes it suitable for cases where the system 100 is deployed on edge devices. The ARG 106 is configured to calculate a row-wise statistical average (hereinafter, row-wise average) for the depth maps FG and BG, represented by average(FG)_r and average(BG)_r, respectively. Similarly, the ARG 106 is configured to calculate a column-wise statistical average (hereinafter, column-wise average) for the depth maps FG and BG, represented by average(FG)_c and average(BG)_c, respectively. The statistical average may include a mean, a median, a mode or any other statistical descriptor. The ARG 106 is configured to calculate an absolute difference between the row-wise average value of the depth map FG and the row-wise average value of the depth map BG. The ARG 106 is configured to determine coordinates of the active region in the X dimension based upon the row-wise absolute difference. In an embodiment, the ARG 106 is configured to generate a Boolean vector R based upon the row-wise absolute difference and a second thresholding parameter T_2 as follows:

R = { 0, if abs(average(BG)_r - average(FG)_r) < T_2
      1, otherwise }
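A minimal sketch of this row/column-average differencing is given below, assuming FG and BG are depth maps of equal shape and using the mean as the statistical average. The T_2 value is illustrative, and the bounding-box extraction from the Boolean vectors is one straightforward way to turn R and C into the 2D active region (cf. Figs. 5C and 5D), not necessarily the exact rule used by the ARG 106.

```python
import numpy as np

def active_region_2d(fg, bg, t2=0.05):
    """Compute Boolean vectors R (rows) and C (columns) from the absolute difference
    of row-/column-wise averages of FG and BG, and return the bounding box of the
    rows/columns flagged as changed."""
    r = (np.abs(bg.mean(axis=1) - fg.mean(axis=1)) >= t2).astype(np.uint8)  # per row
    c = (np.abs(bg.mean(axis=0) - fg.mean(axis=0)) >= t2).astype(np.uint8)  # per column
    rows, cols = np.nonzero(r)[0], np.nonzero(c)[0]
    if rows.size == 0 or cols.size == 0:
        return r, c, None                      # no change detected
    return r, c, (rows.min(), rows.max(), cols.min(), cols.max())

def active_region_point_cloud(fg, bbox, fx, fy, cx, cy):
    """Back-project only the 2D active region of FG into a point cloud (PC_ar),
    assuming illustrative pinhole intrinsics."""
    r0, r1, c0, c1 = bbox
    crop = fg[r0:r1 + 1, c0:c1 + 1]
    v, u = np.meshgrid(np.arange(r0, r1 + 1), np.arange(c0, c1 + 1), indexing="ij")
    valid = crop > 0
    x = (u[valid] - cx) * crop[valid] / fx
    y = (v[valid] - cy) * crop[valid] / fy
    return np.stack([x, y, crop[valid]], axis=1)

# Illustrative usage: a flat background and a query frame with a closer object.
bg = np.full((480, 640), 2.0, dtype=np.float32)
fg = bg.copy()
fg[200:280, 300:380] = 1.4                     # object closer than the background
r, c, bbox = active_region_2d(fg, bg, t2=0.02)
pc_ar = active_region_point_cloud(fg, bbox, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```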

Documents

Application Documents

# Name Date
1 202321073421-STATEMENT OF UNDERTAKING (FORM 3) [27-10-2023(online)].pdf 2023-10-27
2 202321073421-PROVISIONAL SPECIFICATION [27-10-2023(online)].pdf 2023-10-27
3 202321073421-POWER OF AUTHORITY [27-10-2023(online)].pdf 2023-10-27
4 202321073421-FORM 1 [27-10-2023(online)].pdf 2023-10-27
5 202321073421-FIGURE OF ABSTRACT [27-10-2023(online)].pdf 2023-10-27
6 202321073421-DRAWINGS [27-10-2023(online)].pdf 2023-10-27
7 202321073421-DECLARATION OF INVENTORSHIP (FORM 5) [27-10-2023(online)].pdf 2023-10-27
8 202321073421-FORM-26 [28-10-2023(online)].pdf 2023-10-28
9 202321073421-Proof of Right [30-11-2023(online)].pdf 2023-11-30
10 202321073421-FORM 3 [14-03-2024(online)].pdf 2024-03-14
11 202321073421-ENDORSEMENT BY INVENTORS [14-03-2024(online)].pdf 2024-03-14
12 202321073421-DRAWING [14-03-2024(online)].pdf 2024-03-14
13 202321073421-CORRESPONDENCE-OTHERS [14-03-2024(online)].pdf 2024-03-14
14 202321073421-COMPLETE SPECIFICATION [14-03-2024(online)].pdf 2024-03-14
15 Abstract1.jpg 2024-05-24
16 202321073421-Form 1 (Submitted on date of filing) [31-07-2024(online)].pdf 2024-07-31
17 202321073421-Covering Letter [31-07-2024(online)].pdf 2024-07-31
18 202321073421-CERTIFIED COPIES TRANSMISSION TO IB [31-07-2024(online)].pdf 2024-07-31
19 202321073421-FORM-9 [08-08-2024(online)].pdf 2024-08-08
20 202321073421-FORM 18A [21-08-2024(online)].pdf 2024-08-21
21 202321073421-FER.pdf 2024-11-20
22 202321073421-Information under section 8(2) [07-01-2025(online)].pdf 2025-01-07
23 202321073421-FORM 3 [07-01-2025(online)].pdf 2025-01-07
24 202321073421-FER_SER_REPLY [06-05-2025(online)].pdf 2025-05-06
25 202321073421-CLAIMS [06-05-2025(online)].pdf 2025-05-06

Search Strategy

1 IN2024001770-SearchStrategyE_12-11-2024.pdf