Abstract: The present invention relates to a system and method for the generation and auto-labeling of a multi-modal sensor dataset. The system comprises a label prompt generator and tuning module that enables a processing device to receive user inputs pertaining to a text prompt and/or a visual prompt for an in-cabin system and/or a perception system of a vehicle and correspondingly generate rich text prompts and perception-based prompts. Further, a multi-modal sensor data generator module generates in-cabin image data, and short-range LiDAR data and short-range RADAR data of a cabin of the vehicle, and further generates perception image data, short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data of an exterior environment of the vehicle. Further, an auto-labeling module enables the processing device to annotate and label the data generated by the multi-modal sensor data generator module.
Description:
TECHNICAL FIELD
[0001] The present invention relates to the field of training of vision-based models, and in particular, relates to a system and method for the generation and auto labeling of multi-modal sensor temporal dataset for an in-cabin system (ICS) and a perception system associated with a vehicle.
BACKGROUND
[0002] The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
[0003] In recent years, the field of autonomous driving has witnessed remarkable advancements, yet it faces challenges in the realm of data generation for the training and evaluation of vision-based models. The existing landscape of autonomous driving solutions faces issues in efficiently producing diverse and labeled data, which is imperative for enhancing the capabilities of autonomous vehicles.
[0004] Existing solutions are predominantly tethered to real-time recording or simulation methodologies facilitated by available simulators. However, these approaches exhibit limitations in their ability to comprehensively cover the diverse range of scenarios that autonomous driving systems (ADS) or advanced driver assistance systems (ADAS) associated with vehicles may encounter in real-world settings. These constraints restrict the scope of data scenarios, affecting the development and evaluation of vision-based models for ADS or ADAS. There is thus a need to optimize the generation of multimodal sensor data for training and evaluation of vision-based models based on the specific needs of both in-cabin and perception systems associated with the ADS or ADAS.
[0005] Moreover, a technical gap exists in current solutions, which lack the flexibility to refine scenes and objects during multimodal sensor data generation and thereby to adapt the generated data to evolving requirements or to make modifications/corrections. The absence of such tools hampers the adaptability and versatility of the generated multimodal sensor data, posing a hurdle in the refinement and optimization of the vision-based models associated with the ADS or ADAS.
[0006] There is, therefore, a need to overcome the above-mentioned drawbacks, limitations, and shortcomings associated with existing training data generation solutions by providing an improved solution that can generate and auto-label multimodal sensor temporal data, with interactive prompt editing capability, for detection and segmentation use cases for both in-cabin and perception systems associated with the vehicle.
OBJECTS OF THE INVENTION
[0007] An object of the present invention is to address the drawbacks, limitations, and shortcomings associated with existing training data generation solutions for vision-based models.
[0008] Another object of the present invention is to generate labeled multimodal sensor temporal data for in-cabin and perception systems across all modalities of autonomous driving vehicles.
[0009] Another object of the present invention is to train and evaluate the vision-based models for in-cabin and perception systems associated with vehicles with labeled multimodal sensor temporal data pertaining to scenes that cannot be recorded in real-time or simulated using the existing simulators.
[0010] Yet another object of the present invention is to generate labeled multimodal sensor temporal data for both in-cabin and perception scenes that cannot be recorded in real-time or simulated using the existing simulators.
[0011] Yet another object of the present invention is to provide a system and method for the generation and auto-labeling of multimodal sensor temporal data for detection and segmentation use cases for both in-cabin and perception systems associated with vehicles.
[0012] Yet another object of the present invention is to provide a system and method for the generation and auto-labeling of multimodal sensor temporal data, which also provides interactive prompt editing capability to edit scenes or objects in the scene.
SUMMARY
[0013] The present invention relates to a system and method for the generation and auto-labeling of multimodal sensor temporal data, which also provides interactive prompt editing capability for detection and segmentation use cases for both in-cabin and perception systems associated with the vehicle.
[0014] According to an aspect, the system and method of the present disclosure involve a processing device, an input device, a label prompt generator and tuning module, a multi-modal sensor data generator module, an auto labeling module, an interactive prompt editor, a database, and one or more output devices. In addition, the label prompt generator and tuning module further comprises a template extractor, an auto refiner network, one or more large language learning model (LLM) networks, and one or more prompt expanders configured with a GPT network. Further, the multi-modal sensor data generator module comprises one or more vision foundation model (VFM) generator networks and one or more point bind LLM generators. Furthermore, the auto-labeling module comprises segment anything model (SAM)-based networks and point transformer-based networks. Furthermore, the interactive prompt editor comprises one or more click-based interactive SAM-based labeling networks.
[0015] The input device allows users to provide user inputs to the label prompt generator and tuning module. The user inputs pertain to a text prompt and/or a visual prompt provided by the users for the in-cabin system (ICS) and/or the perception system associated with the vehicle. These prompts particularly pertain to in-cabin and perception scenes that cannot be recorded in real-time or simulated using the available simulators. Further, the label prompt generator and tuning module enables the processing device to generate rich text prompts and perception-based prompts based on the text prompts and visual prompts received from the input device.
[0016] The multi-modal sensor data generator module further enables the processing device to generate in-cabin image data, and short-range LiDAR data and short-range RADAR data of a cabin associated with the vehicle based on the generated rich text prompts, and also generate perception image data, short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data of an exterior environment of the vehicle based on the generated visual prompts.
[0017] Further, the auto labeling module enables the processing device to annotate and label the cabin image data, the perception image data, and the LiDAR data and RADAR data generated by the multi-modal sensor data generator module to generate labeled multi-modal sensor temporal data for both in-cabin and perception systems.
[0018] In addition, the interactive prompt editor allows the users to manually annotate and label, using the input device, the multi-modal sensor data comprising any of the cabin image data, the perception image data, the LiDAR data, and the RADAR data. Further, the interactive prompt editor allows the users to manually edit or modify, using the input device, the labeled multi-modal sensor data.
[0019] Further, the generated labeled multi-modal sensor dataset comprising the multi-modal sensor data, the labeled multi-modal sensor data, and the corresponding prompts is stored in the database. This stored multi-modal sensor dataset is used by the processing device or a training and testing unit associated with the vehicle to train the in-cabin and the perception system of the vehicle.
[0020] Accordingly, the proposed invention (system and method) enables the generation and auto-labeling of multimodal sensor temporal data for detection and segmentation use cases for both in-cabin and perception systems associated with the vehicle. The invention further provides interactive prompt editing capability to edit scenes or objects in the scene, thereby improving and optimizing the generated multimodal sensor temporal data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] The accompanying drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
[0022] FIG. 1 illustrates an exemplary block diagram of the proposed system for the generation and auto-labeling of multimodal sensor temporal data, according to an embodiment of the present invention.
[0023] FIG. 2 illustrates an exemplary diagram depicting the functional modules of the proposed system, according to an embodiment of the present invention.
[0024] FIG. 3 illustrates an exemplary diagram depicting the functional modules of the label prompt generator and tuning module, according to an embodiment of the present invention.
[0025] FIG. 4 illustrates an exemplary diagram depicting the functional modules of the multi-modal sensor data generator module, according to an embodiment of the present invention.
[0026] FIG. 5 illustrates an exemplary diagram depicting the functional modules of the interactive prompt editor, according to an embodiment of the present invention.
[0027] FIG. 6 illustrates exemplary steps involved in the proposed method for the generation and auto-labeling of multimodal sensor temporal data, according to an embodiment of the present invention.
DETAILED DESCRIPTION
[0028] The present invention relates to an improved and reliable system and method, which addresses the drawbacks, limitations, and shortcomings associated with existing training data generation solutions by enabling the generation and auto-labeling of multimodal sensor temporal data and further providing interactive prompt editing capability for detection and segmentation use cases for both in-cabin and perception systems associated with the vehicle.
[0029] According to an aspect, the present disclosure elaborates upon a system for the generation and auto-labeling of multi-modal sensor dataset associated with a vehicle. The system comprises a processing device comprising one or more processors coupled to a memory storing instructions executable by the processors. The system further comprises a label prompt generator and tuning module that enables the processing device to receive, from an input device, a first set of data packets comprising user inputs pertaining to a text prompt and/or a visual prompt for an in-cabin system (ICS) and/or a perception system associated with the vehicle, and correspondingly generate rich text prompts and perception-based prompts. Further, a multi-modal sensor data generator module enables the processing device to generate in-cabin image data, and short-range LiDAR data and short-range RADAR data of a cabin associated with the vehicle based on the generated rich text prompts, and further generate perception image data and short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data of an exterior environment of the vehicle based on the generated visual prompts. Further, an auto-labeling module enables the processing device to annotate and label the cabin image data, the perception image data, and the LiDAR data and RADAR data generated by the multi-modal sensor data generator module.
[0030] In an embodiment, the label prompt generator and tuning module comprises a template extractor configured to refine the corresponding text prompt and/or visual prompt into a second set of data packets comprising any or a combination of location, time of day, scene, and environment, and an auto refiner network, configured with a zero-shot learning module, to generate a rich text description for the corresponding text prompt and/or visual prompt based on the second set of data packets. The label prompt generator and tuning module further comprises one or more large language learning model (LLM) networks configured to fine-tune the generated rich text description for the ICS and the perception system. Further, a first prompt expander configured with a first GPT network generates the rich text prompts for the ICS based on the fine-tuned rich text description, and a second prompt expander configured with a second GPT network generates the perception-based prompts for the perception system based on the fine-tuned rich text description.
[0031] In an embodiment, the multi-modal sensor data generator module comprises a first vision foundation model (VFM) generator network that generates temporal images of scenes of a cabin of the vehicle based on the generated rich text prompts and correspondingly generates the in-cabin image data. Further, a first point bind LLM generator generates a temporal point cloud for the short-range LiDAR data and short-range RADAR data of the corresponding scenes of the cabin based on the generated rich text prompts. In addition, the multi-modal sensor data generator module comprises a second VFM generator network that enables the processing device to generate temporal view data associated with an exterior environment of the vehicle based on the generated perception-based prompts and correspondingly generate the perception image data. Further, a second point bind LLM generator enables the processing device to generate a temporal point cloud for the short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data associated with the exterior environment based on the generated perception-based prompts.
[0032] In an embodiment, the auto-labeling module comprises a segment anything model (SAM)-based network that enables the processing device to annotate and label the cabin image data and the perception image data for the corresponding scenes generated by the first VFM generator network and the first point bind LLM generator, respectively. The auto-labeling module further comprises a point transformer-based network that enables the processing device to label LiDAR data and RADAR data generated by the second VFM generator network and the second point bind LLM generator.
[0033] In an embodiment, the system comprises an interactive prompt editor that is configured to allow one or more users to manually annotate and label, using the input device, multi-modal sensor data comprising any of the cabin image data, the perception image data, the LiDAR data, and the RADAR data, and also allow the one or more users to manually edit or modify, using the input device, the labeled multi-modal sensor data.
[0034] In an embodiment, the system comprises a database in communication with the processing device, wherein the database is configured to store a multi-modal sensor dataset comprising the multi-modal sensor data, the labeled multi-modal sensor data, and the corresponding prompts. Further, the processing device or a training and testing unit associated with the vehicle is configured to train the ICS and the perception system of the vehicle using the multi-modal sensor dataset.
[0035] According to another aspect, the present disclosure elaborates upon a method for generation and auto labeling of multi-modal sensor dataset associated with a vehicle. The method comprises the steps of receiving, by a processing device configured with a label prompt generator and tuning module, a first set of data packets comprising user inputs pertaining to a text prompt and/or a visual prompt for an in-cabin system (ICS) and/or a perception system associated with the vehicle, and correspondingly generating rich text prompts and perception-based prompts. The method further comprises the steps of generating, by the processing device configured with a multi-modal sensor data generator module, in-cabin image data, and short-range LiDAR data and short-range RADAR data of a cabin associated with the vehicle based on the generated rich text prompts. Further, the method comprises the steps of generating, by the multi-modal sensor data generator module, perception image data, short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data of an exterior environment of the vehicle based on the generated visual prompts. Furthermore, the method comprises the steps of annotating and labeling, by the processing device configured with an auto labeling module, the cabin image data, the perception image data, and the LiDAR data and RADAR data generated by the multi-modal sensor data generator module.
[0036] In an embodiment, the method of generating the rich text prompts and the perception-based prompts comprises the steps of: refining, using a template extractor, the corresponding text prompt and/or visual prompt into a second set of data packets comprising any or a combination of location, time of day, scene, and environment; generating, using an auto refiner network, a rich text description for the corresponding text prompt and/or visual prompt based on the second set of data packets; fine-tuning, using one or more large language learning model (LLM) networks, the generated rich text description for the ICS and the perception system; generating, by a first prompt expander configured with a first GPT network, the rich text prompts for the ICS based on the fine-tuned rich text description; and generating, by a second prompt expander configured with a second GPT network, the perception-based prompts for the perception system based on the fine-tuned rich text description.
[0037] In an embodiment, the method of generating the multi-modal sensor data comprises the steps of: generating, by a first vision foundation model (VFM) generator network, temporal images of scenes of a cabin of the vehicle based on the generated rich text prompts and correspondingly generating the in-cabin image data; generating, by a first point bind LLM generator, a temporal point cloud for the short-range LiDAR data and short-range RADAR data of the corresponding scenes of the cabin based on the generated rich text prompts; generating, by a second VFM generator network, temporal view data associated with an exterior environment of the vehicle based on the generated perception-based prompts and correspondingly generating the perception image data; and generating, by a second point bind LLM generator, a temporal point cloud for the short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data associated with the exterior environment based on the generated perception-based prompts.
[0038] In an embodiment, the method of annotation and labeling comprises the steps of annotating and labeling, using a segment anything model (SAM)-based network, the cabin image data, and the perception image data for the corresponding scenes generated by the first VFM generator network and the first point bind LLM generator, respectively, and labeling, using a point transformer-based network, LiDAR data and RADAR data generated by the second VFM generator network and the second point bind LLM generator.
[0039] In an embodiment, the method further comprises allowing, using the input device, one or more users to manually annotate and label multi-modal sensor data comprising any of the cabin image data, the perception image data, the LiDAR data, and the RADAR data, and allowing, using the input device, the one or more users to manually edit or modify the labeled multi-modal sensor data. The method further comprises storing, in a database, a multi-modal sensor dataset comprising the multi-modal sensor data, the labeled multi-modal sensor data, and the corresponding prompts.
[0040] Referring to FIGs. 1 and 2, the proposed system 100 for generation and auto labeling of a multi-modal sensor dataset associated with a vehicle is disclosed. The system involves a processing device 102, an input device 104, a database 106, and a training and testing unit 108 for the training of the in-cabin system (ICS) 110 and the perception system 112 associated with the vehicle.
[0041] The processing device 102 further comprises a label prompt generator and tuning module 202, a multi-modal sensor data generator module 204, an auto-labeling module 206, and an interactive prompt editor 208. In addition, the label prompt generator and tuning module 202 further comprises a visual text prompt template module 202-1, a prompt auto refiner network 202-2 comprising a template extractor, an auto refiner network, and a rich text description, a prompt guider network 202-3 comprising one or more large language learning model (LLM) networks configured with a GPT network, and one or more prompt expanders 202-4, 202-5. Further, the multi-modal sensor data generator module 204 comprises one or more vision foundation model (VFM) generator networks 204-1, 204-3 and one or more point bind LLM generators 204-2, 204-4. Furthermore, the auto-labeling module 206 comprises Segment Anything Model (SAM)-based networks and point transformer-based networks. For instance, the auto-labeling module 206 comprises an ICS vision labeling network 206-1, an ICS 3D point cloud labeling network 206-2, a perception vision labeling network 206-3, and a perception 3D point cloud labeling network 206-4. Furthermore, the interactive prompt editor 208 comprises one or more click-based interactive SAM-based labeling networks, which further include an ICS prompt editor 208-1 and a perception prompt editor 208-2. The detailed operation and functional blocks of the label prompt generator and tuning module 202, the multi-modal sensor data generator module 204, the auto-labeling module 206, and the interactive prompt editor 208 are described later in conjunction with FIGs. 3 to 5.
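For illustration only, the end-to-end data flow through these modules may be sketched in Python as follows. The class names, method names, and return structure below are assumptions made for this sketch and are not defined in the specification; each injected module object stands in for the corresponding component of FIG. 2.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserPrompt:
    text: Optional[str] = None      # free-form text prompt
    visual: Optional[bytes] = None  # encoded sketch, scribble, bounding box, or image
    target: str = "ICS"             # "ICS" and/or "PERCEPTION"

class ProcessingDevice:
    """High-level wiring of the modules hosted by the processing device (102)."""

    def __init__(self, prompt_module, data_generator, auto_labeler, prompt_editor):
        self.prompt_module = prompt_module    # label prompt generator and tuning module (202)
        self.data_generator = data_generator  # multi-modal sensor data generator module (204)
        self.auto_labeler = auto_labeler      # auto-labeling module (206)
        self.prompt_editor = prompt_editor    # interactive prompt editor (208)

    def run(self, user_prompt: UserPrompt) -> dict:
        prompts = self.prompt_module.generate(user_prompt)   # rich text / perception-based prompts
        sensor_data = self.data_generator.generate(prompts)  # images and LiDAR/RADAR point clouds
        labels = self.auto_labeler.label(sensor_data)        # annotations for every modality
        return {"prompts": prompts, "data": sensor_data, "labels": labels}
```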
[0042] In an embodiment, the input device 104 comprises a human-machine interface such as but not limited to touchscreen displays, electronic writing pads, tablets, mobile phones, cameras, and keyboards. The input device 104 allows users to provide user inputs to the label prompt generator and tuning module 202 or the processing device 102. The user inputs pertain to a text prompt and/or a visual prompt provided by the users for the ICS and/or the perception system associated with the vehicle. In an exemplary embodiment, the user inputs comprise but are not limited to texts, bounding boxes, sketches, scribbles, and images. These prompts particularly pertain to in-cabin and perception scenes that cannot be recorded in real-time or simulated using the available simulators. Further, the label prompt generator and tuning module 202 enables the processing device 102 to generate rich text prompts and perception-based prompts based on the text prompts and visual prompts received from the input device.
[0043] The multi-modal sensor data generator module 204 further enables the processing device 102 to generate in-cabin image data, and short-range LiDAR data and short-range RADAR data of a cabin associated with the vehicle based on the generated rich text prompts, and also to generate perception image data, short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data of an exterior environment of the vehicle based on the generated visual prompts. For instance, in one non-limiting example, the range for short-range RADAR data or short-range LiDAR data associated with the in-cabin system is 2-6 meters. Further, the range for short-range LiDAR data associated with the perception system is 5-12 meters, and the range for short-range RADAR data associated with the perception system is 10-12 meters. Furthermore, the range for long-range RADAR data or long-range LiDAR data associated with the perception system is 100-200 meters, although other ranges may be used.
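These example ranges can be captured in a small configuration; the dictionary and function names below are assumptions for this sketch, and the values simply restate the example figures above.

```python
# Illustrative configuration of the example sensor ranges (values in meters).
SENSOR_RANGE_M = {
    "ics": {                       # in-cabin system (ICS)
        "short_range_lidar": (2, 6),
        "short_range_radar": (2, 6),
    },
    "perception": {                # exterior perception system
        "short_range_lidar": (5, 12),
        "short_range_radar": (10, 12),
        "long_range_lidar": (100, 200),
        "long_range_radar": (100, 200),
    },
}

def in_range(system: str, sensor: str, distance_m: float) -> bool:
    """Check whether a generated point falls inside the configured range."""
    lo, hi = SENSOR_RANGE_M[system][sensor]
    return lo <= distance_m <= hi
```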
[0044] Further, the auto labeling module 206 enables the processing device 102 to annotate and label the cabin image data, the perception image data, and the LiDAR data and RADAR data generated by the multi-modal sensor data generator module 204 to generate labeled multi-modal sensor temporal data for both in-cabin 110 and perception systems 112.
[0045] In addition, the interactive prompt editor 208 allows the users to manually annotate and label, using the input device 104, the multi-modal sensor data comprising any of the cabin image data, the perception image data, the LiDAR data, and the RADAR data. Further, the interactive prompt editor 208 allows the users to manually edit or modify, using the input device, the labeled multi-modal sensor data.
[0046] Further, the generated labeled multi-modal sensor dataset comprising the multi-modal sensor data, the labeled multi-modal sensor data, and the corresponding prompts is stored in the database 106. This stored multi-modal sensor dataset is used by the processing device 102 or the training and testing unit 108 associated with the vehicle to train the in-cabin system 110 and the perception system 112 of the vehicle.
[0047] Referring to FIG. 3, the functional modules of the label prompt generator and tuning module 202 are disclosed. This module 202 is configured to generate the visual-textual prompts (rich text prompts and perception-based prompts) from the text prompts and/or visual prompts provided as input by the user. These input prompts are refined into a custom understandable format to enrich the prompting details for both the in-cabin system 110 and the perception system 112. In an embodiment, the label prompt generator and tuning module 202 is configured to process the received user inputs as a predefined template 302. The template extractor 304 then refines the corresponding predefined template 302, or the text prompt and/or visual prompt received from the input device, into a standard template comprising any or a combination of location, time of day, scene, environment, and the like. The label prompt generator and tuning module 202 comprises a visual text prompt template module 202-1 comprising a visual text prompt template 302, a prompt auto refiner network 202-2 comprising a template extractor 304, an auto refiner network 306, and a rich text description 308, and a prompt guider network 202-3 comprising one or more large language learning model (LLM) networks 310-1, 310-2 configured with GPT networks 312-1, 312-2.
[0048] Further, the auto refiner network 306, which is configured with a zero-shot learning module, generates rich text descriptions 308 for the corresponding text prompt and/or visual prompt based on the standard template. In addition, the label prompt generator and tuning module 202 comprises one or more large language learning model (LLM) networks comprising an ICS LLM network 310-1 to fine-tune the generated rich text description 308 for the ICS and a perception LLM network 310-2 to fine-tune the generated rich text description 308 for the perception system. Further, the module 202 comprises a first prompt expander 202-4 (shown in FIG. 2) configured with a first GPT network 312-1 to generate the rich text prompts for the ICS 110 based on the fine-tuned rich text description 308, and a second prompt expander 202-5 (shown in FIG. 2) configured with a second GPT network 312-2 to generate the perception-based prompts for the perception system 112 based on the fine-tuned rich text description 308. Accordingly, the label prompt generator and tuning module 202 enables the processing device 102 to generate rich text prompts and perception-based prompts based on the user inputs provided by the users.
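For illustration, the flow from template extraction to the two prompt expanders can be sketched as follows. The function names, the "key: value; key: value" template syntax, and the injected callables are assumptions for this sketch; the real auto refiner network, LLM networks, and GPT-configured expanders are simply represented as black-box functions.

```python
from typing import Callable, Dict, List

TEMPLATE_FIELDS = ("location", "time_of_day", "scene", "environment")

def extract_template(user_prompt: str) -> Dict[str, str]:
    """Template extractor (304): map the raw prompt onto the standard template.
    A trivial 'key: value; key: value' split stands in for the real extractor."""
    fields = {key: "" for key in TEMPLATE_FIELDS}
    for part in user_prompt.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            key = key.strip().lower().replace(" ", "_")
            if key in fields:
                fields[key] = value.strip()
    return fields

def generate_prompts(user_prompt: str,
                     refiner: Callable[[Dict[str, str]], str],
                     ics_llm: Callable[[str], str],
                     perception_llm: Callable[[str], str],
                     ics_expander: Callable[[str], List[str]],
                     perception_expander: Callable[[str], List[str]]) -> dict:
    template = extract_template(user_prompt)                   # second set of data packets
    rich_description = refiner(template)                       # auto refiner network (306), zero-shot
    ics_description = ics_llm(rich_description)                # ICS LLM network (310-1)
    perception_description = perception_llm(rich_description)  # perception LLM network (310-2)
    return {
        "rich_text_prompts": ics_expander(ics_description),                # expander 202-4 + GPT 312-1
        "perception_prompts": perception_expander(perception_description)  # expander 202-5 + GPT 312-2
    }
```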
[0049] Referring to FIG. 4, the functional modules of the multi-modal sensor data generator module 204 are disclosed. This module 204 generates data for both the ICS 110 and the perception system 112 for different sensor modalities such as camera, LiDAR, and RADAR using vision foundation model and LLM networks which can generate temporal description data of a given scene. In an embodiment, the multi-modal sensor data generator module 204 comprises a first vision foundation model (VFM) generator network 204-1 that generates temporal images of scenes of a cabin of the vehicle based on the rich text prompts generated by the first prompt expander 202-4 and correspondingly generates the in-cabin image data. This module 204 further comprises a first point bind LLM generator 204-2 that generates a temporal point cloud for the short-range LiDAR data and short-range RADAR data of the corresponding scenes of the cabin based on the generated rich text prompts.
[0050] In addition, the multi-modal sensor data generator module 204 comprises a second VFM generator network 204-3 that enables the processing device 102 to generate temporal view data associated with an exterior environment of the vehicle based on the perception-based prompts generated by the second prompt expander 202-5 and correspondingly generate the perception image data. This module 204 further comprises a second point bind LLM generator 204-4 that enables the processing device 102 to generate a temporal point cloud for the short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data associated with the exterior environment based on the generated perception-based prompts. Accordingly, the multi-modal sensor data generator module 204 generates the cabin image data, the perception image data, and the LiDAR data and RADAR data for further annotation and labeling.
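A minimal sketch of this generator fan-out is given below, assuming each generator network is available as an injected callable. The class name, dictionary keys, and tensor shapes mentioned in the comments are assumptions for illustration rather than definitions from the specification.

```python
class MultiModalSensorDataGenerator:
    """Fan-out over the four generator networks of FIG. 4 (names are illustrative)."""

    def __init__(self, vfm_ics, pointbind_ics, vfm_perception, pointbind_perception):
        self.vfm_ics = vfm_ics                            # first VFM generator network (204-1)
        self.pointbind_ics = pointbind_ics                # first point bind LLM generator (204-2)
        self.vfm_perception = vfm_perception              # second VFM generator network (204-3)
        self.pointbind_perception = pointbind_perception  # second point bind LLM generator (204-4)

    def generate(self, prompts: dict) -> dict:
        rich = prompts["rich_text_prompts"]
        percep = prompts["perception_prompts"]
        return {
            # temporal in-cabin frames, e.g. one (T, H, W, 3) array per prompt
            "cabin_images": [self.vfm_ics(p) for p in rich],
            # temporal point clouds, e.g. one (T, N, 4) array (x, y, z, intensity) per prompt
            "cabin_short_lidar_radar": [self.pointbind_ics(p) for p in rich],
            "perception_images": [self.vfm_perception(p) for p in percep],
            "perception_short_long_lidar_radar": [self.pointbind_perception(p) for p in percep],
        }
```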
[0051] In an embodiment, the auto-labeling module 206 is used to annotate the generated ICS data and perception data, using an interactive control network for the camera-based data and an automated data labeling network for the LiDAR- and RADAR-based data. In an embodiment, the auto-labeling module 206 comprises a SAM-based network that enables the processing device to annotate and label the cabin image data and the perception image data for the corresponding scenes generated by the first VFM generator network and the first point bind LLM generator, respectively. Further, this module 206 comprises a point transformer-based network that enables the processing device to label the LiDAR data and RADAR data generated by the second VFM generator network and the second point bind LLM generator. In an exemplary embodiment, as shown in FIG. 2, the auto-labeling module 206 comprises an ICS vision labeling network 206-1, an ICS 3D point cloud labeling network 206-2, a perception vision labeling network 206-3, and a perception 3D point cloud labeling network 206-4. Accordingly, the auto-labeling module 206 annotates and labels the cabin image data, the perception image data, and the LiDAR data and RADAR data generated by the multi-modal sensor data generator module 204. This annotated/labeled data (also referred to as multi-modal sensor data) may be further used for training or for performance evaluation of vision-based models for detection and segmentation use cases for both the in-cabin system 110 and the perception system 112 associated with vehicles.
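By way of illustration, the labeling fan-out over the generated modalities can be sketched as below; `sam_segment` and `point_transformer_label` are hypothetical stand-ins for the SAM-based and point transformer-based networks, and only the data flow mirrors the description.

```python
def auto_label(sensor_data: dict, sam_segment, point_transformer_label) -> dict:
    """Label every generated modality; the two callables are hypothetical stand-ins."""
    labels = {}
    # Camera modalities: per-frame masks and class labels from the SAM-based network.
    for key in ("cabin_images", "perception_images"):
        labels[key] = [sam_segment(frame) for frame in sensor_data[key]]
    # Point-cloud modalities: per-point semantic labels from the point transformer-based network.
    for key in ("cabin_short_lidar_radar", "perception_short_long_lidar_radar"):
        labels[key] = [point_transformer_label(cloud) for cloud in sensor_data[key]]
    return labels
```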
[0052] Referring to FIG. 5, the functional modules of the interactive prompt editor 208 are disclosed. The interactive prompt editor 208 enables users to edit the labeled multi-modal sensor data generated by the auto-labeling module 206 for different purposes such as, but not limited to, the removal of objects, the addition of objects, and modification of the contents of the objects. In an embodiment, the interactive prompt editor 208 comprises one or more click-based interactive SAM-based labeling networks 502, 504. The in-cabin labeled annotation data, which consists of images and a point cloud, is manually labeled or corrected for other classes using a click-based interactive prompt-based network 502 that modifies the existing labeled annotation data. This allows the user to modify the labeling, which increases the accuracy of the labeled annotation. Similarly, for perception data, another custom click-based interactive network 504 is used to modify the generated data along with the prompts.
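A minimal sketch of such a click-based correction loop is shown below, assuming `interactive_segmenter` is a stand-in for the click-based SAM-style labeling network and that each click carries a positive/negative (foreground/background) flag; these names and the argument layout are illustrative assumptions.

```python
from typing import List, Tuple

Click = Tuple[int, int, bool]  # (row, col, is_foreground)

def refine_label(image, initial_mask, clicks: List[Click], interactive_segmenter):
    """Re-run a click-based interactive segmenter to correct an auto-generated label."""
    mask = initial_mask
    for row, col, is_foreground in clicks:
        # Each click conditions the network on the current mask plus the new point.
        mask = interactive_segmenter(image, prior_mask=mask,
                                     point=(row, col), positive=is_foreground)
    return mask
```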
[0053] Referring back to FIGs. 1 and 2, the database 106 associated with the system 100 is configured to store the multi-modal sensor dataset comprising the multi-modal sensor data, the labeled multi-modal sensor data, and the corresponding prompts. The processing device 102 or the training and testing unit 108 associated with the vehicle is further configured to train and further evaluate the performance of the ICS 110 and the perception system 112 of the vehicle using this multi-modal sensor dataset.
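For illustration, one possible record layout for this stored dataset is sketched below; the field names are assumptions chosen to mirror the description rather than a schema defined in the specification, and a real deployment would use the database 106 rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class DatasetRecord:
    prompt: Dict[str, Any]             # the rich text or perception-based prompt used
    sensor_data: Dict[str, List[Any]]  # generated images and point clouds per modality
    labels: Dict[str, List[Any]]       # auto-generated or user-corrected annotations
    system: str = "ICS"                # "ICS" or "PERCEPTION"

def store_record(dataset: List[DatasetRecord], record: DatasetRecord) -> None:
    """Append a record; a real deployment would write to the database (106) instead."""
    dataset.append(record)
```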
[0054] Referring to FIG. 6, in another aspect, the present invention elaborates upon a method 600 for the generation and auto-labeling of a multi-modal sensor dataset associated with a vehicle. Method 600 involves the processing device 102, the input device 104, the database 106, and the training and testing unit 108 associated with the system 100 described in the above paragraphs, but is not limited thereto.
[0055] Method 600 comprises step 602 of receiving, by a processing device configured with a label prompt generator and tuning module, a first set of data packets comprising user inputs pertaining to a text prompt and/or a visual prompt for an in-cabin system (ICS) and/or a perception system associated with the vehicle, and correspondingly generating rich text prompts and perception-based prompts. Method 600 further comprises step 604 of generating, by the processing device configured with a multi-modal sensor data generator module, in-cabin image data, and short-range LiDAR data and short-range RADAR data of a cabin associated with the vehicle based on the generated rich text prompts. Method 600 further comprises step 606 of generating, by the multi-modal sensor data generator module, perception image data, and short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data of an exterior environment of the vehicle based on the generated visual prompts. Accordingly, method 600 further comprises step 608 of annotating and labeling, by the processing device configured with an auto labeling module, the cabin image data, the perception image data, and the LiDAR data and RADAR data generated by the multi-modal sensor data generator module.
[0056] In an embodiment, at step 602, the method of generating the rich text prompts and the perception-based prompts comprises the steps of refining, using a template extractor, the corresponding text prompt and/or visual prompt into a second set of data packets comprising any or a combination of location, time of day, scene, and environment. Further, step 602 comprises generating, using the auto refiner network, a rich text description for the corresponding text prompt and/or visual prompt based on the second set of data packets. Furthermore, step 602 comprises fine-tuning, using the large language learning model (LLM) networks, the generated rich text description for the ICS and the perception system, and generating, by the first prompt expander configured with the first GPT network, the rich text prompts for the ICS based on the fine-tuned rich text description, followed by another step of generating, by the second prompt expander configured with the second GPT network, the perception-based prompts for the perception system based on the fine-tuned rich text description.
[0057] In an embodiment, at step 604, the method of generating the multi-modal sensor data comprises generating, by the first VFM generator network, temporal images of scenes of a cabin of the vehicle based on the generated rich text prompts and correspondingly generating the in-cabin image data, followed by generating, by the first point bind LLM generator, a temporal point cloud for the short-range LiDAR data and short-range RADAR data of the corresponding scenes of the cabin based on the generated rich text prompts.
[0058] In addition, at step 606, the method of generating the multi-modal sensor data further comprises generating, by the second VFM generator network, temporal view data associated with an exterior environment of the vehicle based on the generated perception-based prompts and correspondingly generating the perception image data, followed by generating, by the second point bind LLM generator, a temporal point cloud for the short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data associated with the exterior environment based on the generated perception-based prompts.
[0059] In an embodiment, at step 608, the method of annotation and labeling comprises annotating and labeling, using the SAM-based network, the cabin image data and the perception image data for the corresponding scenes generated by the first VFM generator network and the first point bind LLM generator, respectively, followed by labeling, using the point transformer-based network, the LiDAR data and RADAR data generated by the second VFM generator network and the second point bind LLM generator.
[0060] In an embodiment, method 600 further comprises the steps of allowing, using the input device, the users to manually annotate and label multi-modal sensor data comprising any of the cabin image data, the perception image data, the LiDAR data, and the RADAR data. Method 600 further comprises the steps of allowing, using the input device, the users to manually edit or modify the labeled multi-modal sensor data.
[0061] In addition, method 600 comprises the steps of storing, in the database, the multi-modal sensor dataset comprising the multi-modal sensor data, the labeled multi-modal sensor data, and the corresponding prompts. This stored data is further used to train and evaluate the performance of the ICS and perception systems associated with the vehicle.
[0062] Accordingly, the present invention (system and method) not only addresses the problems associated with existing training data generation solutions, by providing an improved solution to generate and auto-label multimodal sensor temporal data for both in-cabin and perception systems associated with the vehicle, but also provides an interactive prompt editing capability for detection and segmentation use cases. Moreover, the invention facilitates training and performance evaluation of the vision-based models for in-cabin and perception systems associated with vehicles with the labeled multimodal sensor temporal data pertaining to scenes that cannot be recorded in real-time or simulated using the existing simulators.
[0063] In an embodiment, the processing device 102 comprises one or more processors 102-1 coupled to a memory 102-2 storing instructions executable by the processors 102-1, which causes the processing device 102 to perform one or more designated operations.
[0064] In an embodiment, the proposed system 100 and method 600 are implemented using any or a combination of hardware components and software components such as a cloud, a server, a computing system, a computing device, a network device, and the like. Further, the input device interacts with the processing device through an application or software that resides in the input device. In an implementation, the system 100 is accessed by an application that is configured with any operating system, comprising but not limited to Android™, iOS™, Windows, and the like. It will be understood that the system is implemented as any suitable computing system known in the art, such as a desktop, a laptop, a server, a web server, and the like.
[0065] In an embodiment, the input device 104, the database 106, and the ICS and perception systems 110, 112 of the vehicle may be in communication with the processing device 102 via a secured network. In an implementation, the network is a wireless network, a wired network, or a combination thereof that is implemented as one of the different types of networks, such as an Intranet, a Local Area Network (LAN), a Wide Area Network (WAN), the Internet, and the like. Further, the network is either a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like.
[0066] While the foregoing describes various embodiments of the invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. The scope of the invention is determined by the claims that follow. The invention is not limited to the described embodiments, versions, or examples, which are included to enable a person having ordinary skill in the art to make and use the invention when combined with information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE PRESENT INVENTION
[0067] The present invention addresses the drawbacks, limitations, and shortcomings associated with existing training data generation solutions for vision-based models.
[0068] The present invention generates labeled multimodal sensor temporal data for in-cabin and perception systems across all modalities of autonomous driving vehicles.
[0069] The present invention trains and evaluates the vision-based models for in-cabin and perception systems associated with vehicles with labeled multimodal sensor temporal data pertaining to scenes that cannot be recorded in real-time or simulated using the existing simulators.
[0070] The present invention generates labeled multimodal sensor temporal data for both in-cabin and perception scenes that cannot be recorded in real-time or simulated using the existing simulators.
[0071] The present invention provides a system and method for the generation and auto-labeling of multimodal sensor temporal data for detection and segmentation use cases for both in-cabin and perception systems associated with vehicles.
[0072] The present invention provides a system and method for the generation and auto-labeling of multimodal sensor temporal data, which also provides interactive prompt editing capability to edit scenes or objects in the scene.
Claims:
1. A system for generation and auto labeling of multi-modal sensor dataset associated with a vehicle, the system comprising:
a processing device comprising one or more processors coupled to a memory storing instructions executable by the processors;
a label prompt generator and tuning module that enables the processing device to receive, from an input device, a first set of data packets comprising user inputs pertaining to a text prompt and/or a visual prompt for an in-cabin system (ICS) and/or a perception system associated with the vehicle, and correspondingly generate rich text prompts and perception-based prompts;
a multi-modal sensor data generator module that enables the processing device to generate:
in-cabin image data, and short-range LiDAR data and short-range RADAR data of a cabin associated with the vehicle based on the generated rich text prompts; and
perception image data, short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data of an exterior environment of the vehicle based on the generated visual prompts; and
an auto-labeling module that enables the processing device to annotate and label the cabin image data, the perception image data, and the LiDAR data and RADAR data generated by the multi-modal sensor data generator module.
2. The system as claimed in claim 1, wherein the label prompt generator and tuning module comprises:
a template extractor configured to refine the corresponding text prompt and/or visual prompt into a second set of data packets comprising any or a combination of location, time of day, scene, and environment;
an auto refiner network configured with a zero shot learning module, to generate rich text description for the corresponding text prompt and/or visual prompt based on the second set of data packets;
one or more large language learning model (LLM) networks configured to fine-tune the generated rich text description for the ICS and the perception system;
a first prompt expander configured with a first GPT network, to generate the rich text prompts for the ICS based on the fine-tuned rich text description; and
a second prompt expander configured with a second GPT network, to generate the perception-based prompts for the perception system based on the fine-tuned rich text description.
3. The system as claimed in claim 2, wherein the multi-modal sensor data generator module comprises:
a first vision foundation model (VFM) generator network that generates temporal images of scenes of a cabin of the vehicle based on the generated rich text prompts and correspondingly generates the in-cabin image data;
a first point bind LLM generator that generates a temporal point cloud for the short-range LiDAR data and short-range RADAR data of the corresponding scenes of the cabin based on the generated rich text prompts;
a second VFM generator network that enables the processing device to generate temporal view data associated with an exterior environment of the vehicle based on the generated perception-based prompts and correspondingly generate the perception image data; and
a second point bind LLM generator that enables the processing device to generate a temporal point cloud for the short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data associated with the exterior environment based on the generated perception-based prompts.
4. The system as claimed in claim 3, wherein the auto labeling module comprises:
a segment anything model (SAM)-based network that enables the processing device to annotate and label the cabin image data and the perception image data for the corresponding scenes generated by the first VFM generator network and the first point bind LLM generator, respectively; and
a point transformer-based network that enables the processing device to label LiDAR data and RADAR data generated by the second VFM generator network and the second point bind LLM generator.
5. The system as claimed in claim 4, wherein the system comprises an interactive prompt editor that is configured to:
allow one or more users to manually annotate and label, using the input device, multi-modal sensor data comprising any of the cabin image data, the perception image data, the LiDAR data, and the RADAR data; and
allow the one or more users to manually edit or modify, using the input device, the labeled multi-modal sensor data.
6. The system as claimed in claim 5, wherein the system comprises a database in communication with the processing device, wherein the database is configured to store a multi-modal sensor dataset comprising the multi-modal sensor data, the labeled multi-modal sensor data, and the corresponding prompts, wherein the processing device or a training and testing unit associated with the vehicle is configured to train the ICS and the perception system of the vehicle using the multi-modal sensor dataset.
7. A method for generation and auto labeling of multi-modal sensor dataset associated with a vehicle, the method comprising:
receiving, by a processing device configured with a label prompt generator and tuning module, a first set of data packets comprising user inputs pertaining to a text prompt and/or a visual prompt for an in-cabin system (ICS) and/or a perception system associated with the vehicle, and correspondingly generating rich text prompts and perception-based prompts;
generating, by the processing device configured with a multi-modal sensor data generator module, in-cabin image data, and short-range LiDAR data and short-range RADAR data of a cabin associated with the vehicle based on the generated rich text prompts;
generating, by the multi-modal sensor data generator module, perception image data, short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data of an exterior environment of the vehicle based on the generated visual prompts; and
annotating and labeling, by the processing device configured with an auto labeling module, the cabin image data, the perception image data, and the LiDAR data and RADAR data generated by the multi-modal sensor data generator module.
8. The method as claimed in claim 7, wherein the method of generating the rich text prompts and the perception-based prompts comprises:
refining, using a template extractor, the corresponding text prompt and/or visual prompt into a second set of data packets comprising any or a combination of location, time of day, scene, and environment;
generating, using an auto refiner network, rich text description for the corresponding text prompt and/or visual prompt based on the second set of data packets;
fine-tuning, using one or more large language learning model (LLM) networks, the generated rich text description for the ICS and the perception system;
generating, by a first prompt expander configured with a first GPT network, the rich text prompts for the ICS based on the fine-tuned rich text description; and
generating, by a second prompt expander configured with a second GPT network, the perception-based prompts for the perception system based on the fine-tuned rich text description.
9. The method as claimed in claim 8, wherein the method of generating the multi-modal sensor data comprises:
generating, by a first vision foundation model (VFM) generator network, temporal images of scenes of a cabin of the vehicle based on the generated rich text prompts and correspondingly generate the in-cabin image data;
generating, by a first point bind LLM generator, temporal point cloud for the short-range LiDAR data and short-range RADAR data of the corresponding scenes of the cabin based on the generated rich text prompts;
generating, by a second VFM generator network, temporal view data associated with an exterior environment of the vehicle based on the generated perception-based prompts and correspondingly generate the perception image data; and
generating, by a second point bind LLM generator, temporal point cloud for the short-range LiDAR data and short-range RADAR data, and long-range LiDAR data and long-range RADAR data associated with the exterior environment based on the generated perception-based prompts.
10. The method as claimed in claim 9, wherein the method of annotation and labeling comprises:
annotating and labeling, using a SAM-based network, the cabin image data and the perception image data for the corresponding scenes generated by the first VFM generator network and the first point bind LLM generator, respectively; and
labeling, using a point transformer-based network, LiDAR data and RADAR data generated by the second VFM generator network and the second point bind LLM generator.
11. The method as claimed in claim 10, wherein the method further comprises:
allowing, using the input device, one or more users to manually annotate and label multi-modal sensor data comprising any of the cabin image data, the perception image data, the LiDAR data, and the RADAR data;
allowing, using the input device, the one or more users to manually edit or modify the labeled multi-modal sensor data; and
storing, in a database, a multi-modal sensor dataset comprising the multi-modal sensor data, the labeled multi-modal sensor data, and the corresponding prompts.