Enhanced Multimodal Object Detection Using U-Net Centric Feature Fusion

Abstract: The invention describes an enhanced multimodal object detection method using U-Net centric feature fusion to detect thermally radiant objects in low-light imagery, using multispectral images from the visible and infrared spectra. The proposed system uses U-Net to segment the images: the infrared image is used to predict a binary mask that separates the object/region of interest from the background, and this mask is then applied to the visible image to make object detection on it easier. The background of the low-light visible image is cleared and replaced with fully black pixels, which makes the objects clearer to the YOLO model that performs the detection. This methodology is applied only to detect thermally radiant objects in low-light environments.

Patent Information

Application #: 202541071003
Filing Date: 25 July 2025
Publication Number: 31/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email:
Parent Application:

Applicants

MLR Institute of Technology
Hyderabad

Inventors

1. Dr. K. Sivakrishna
Department of CSE – AI&ML, MLR Institute of Technology, Hyderabad
2. Mr. D. Bhanuteja Reddy
Department of CSE – AI&ML, MLR Institute of Technology, Hyderabad
3. Mr. D. Selvyram Vedha Viyas
Department of CSE – AI&ML, MLR Institute of Technology, Hyderabad
4. Mr. P. Phanidhar
Department of CSE – AI&ML, MLR Institute of Technology, Hyderabad
5. Mr. D. Chandu
Department of CSE – AI&ML, MLR Institute of Technology, Hyderabad

Specification

Description: Field of the Invention
The enhanced multimodal object detection using U-Net centric feature fusion belongs to the field of low-light object detection, surveillance, and automated driving. It uses modern technologies such as deep learning, multispectral imagery, image processing, and machine learning algorithms to analyse multispectral imagery with trained models, segmenting the region of interest from the background and detecting the objects. The invention uses deep learning models, namely U-Net for segmentation and YOLO for object detection, to improve the accuracy of detection of thermally radiant objects in challenging lighting conditions. It can be applied in surveillance, autonomous driving, security systems, and other scenarios requiring robust object detection in low-visibility settings.
Background of the Invention
Low-light object detection is a critical task for applications such as surveillance, automated driving, and security. It is also one of the main challenges in object detection, as detection is difficult in low-light, foggy, and occluded images: objects tend to blend with the background and confuse the detection model. Low light also makes noise in the image data more prominent and causes colour distortion.
Traditional approaches struggle to interpret data from such low-light imagery, resulting in inefficiencies in object classification and detection, which can cause problems especially in critical applications such as security and automated driving. There is an increasing demand for improved models that can effectively detect and classify objects in low-light imagery; this demand inspired the present work. CN112102324B discloses a remote sensing image sea-ice identification method based on a deep U-Net model, comprising the following steps: first constructing a remote sensing image sea-ice training data set by preprocessing the remote sensing images and annotating sea ice according to existing sea-ice data to obtain ground-truth images; slicing the remote sensing images and ground-truth images to obtain the training data set; and then constructing a sea-ice identification model based on the deep U-Net. US10445146B2 describes improved feature extraction that prioritizes significant parts of multispectral data with the help of a self-attention mechanism, making it effective at detection in challenging conditions. This approach refines feature selection through attention mechanisms, resulting in optimized detection in various environments.
Similarly, US10303860B2 outlines the importance of incorporating multispectral imaging to improve detection accuracy in low-light conditions with the help of multispectral object detection systems that rely on several spectral bands, emphasizing the necessity of a robust model. Expanding on these principles, MT-YOLO ensures precise object detection by including several image bands beyond RGB to gather complex and diverse scene information. US10643067B2 details research on attention mechanisms embedded in CNNs, ensuring that models dynamically focus on pertinent areas of the image. This supports MT-YOLO's implementation of a multispectral framework with self-attention, such that the model can recognize and detect objects accurately over a range of spectral inputs. Moreover, developments in YOLO-based object detection models, as detailed in WO2019023456A1, showcase optimizations of the YOLO architecture, particularly in recognizing small objects and handling variation in scale. MT-YOLO adapts these optimizations to YOLO's architecture for multispectral detection, yielding improved detection performance in difficult visual circumstances.
As described in US10096018B2, the use of deep learning for multispectral image classification shows how such data can be processed with neural networks for object detection and categorization; the MT-YOLO architecture builds on these existing deep learning methods, using multispectral inputs for reliable and precise identification. For multichannel imaging, US10924876B2 highlights the use of attention techniques to rank key channels according to input complexity, enabling models to dynamically choose pertinent features across multispectral bands. This reinforces MT-YOLO's design objective of emphasizing predominant spectral information, improving the model's capacity to identify objects precisely in intricate scenarios. Furthermore, US10198762B2 showcases strategies for object detection in low-visibility settings, including sensor fusion and image processing approaches that boost real-time detection capabilities. MT-YOLO tackles comparable issues by combining multispectral data to enhance identification in tricky situations, making it well suited to applications where conventional RGB-based models fall short.
Summary of the Invention
The proposed innovation introduces a multimodal object detection model based on U-Net and YOLO to segment and detect thermally radiant objects, such as people and running cars, in multispectral low-light imagery. The model is trained on multispectral imagery together with mask images, which are binary images with white pixels for objects and black pixels for the background.
The U-Net model is trained on the infrared images and the mask images so that it can predict a mask image when an infrared image is given. The predicted mask is then applied to the visible RGB image to clear the background, so that object detection becomes significantly easier. The YOLOv11 model is trained on low-light images to further improve detection.
When the detection pipeline is used, a pair of visible and infrared images is captured. The infrared image is given to the U-Net model to generate the predicted mask, and a bitwise_and operation is then applied to the visible image and the mask image. This turns all background pixels in the visible image black, and the resulting image, which retains only the objects' colour and intensity, is significantly easier to detect objects in. The two models that are each best suited to their task are used for those specific tasks in the proposed system: U-Net for segmentation and YOLOv11 for object detection and tracking. An illustrative sketch of this pipeline is given below.
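The following Python sketch illustrates, under stated assumptions, how this masking step can be implemented with OpenCV's bitwise_and and the Ultralytics YOLO package. The checkpoint file name, the 256x256 U-Net input size, and the 0.5 mask threshold are assumptions for illustration and are not taken from the specification.

import cv2
import torch
from ultralytics import YOLO

unet = torch.load("unet_ir_mask.pt", map_location="cpu").eval()   # hypothetical U-Net checkpoint (IR -> mask)
yolo = YOLO("yolo11n.pt")                                          # pretrained YOLOv11 detector

def detect_low_light(visible_bgr, ir_gray):
    """Mask the visible image with the U-Net prediction, then run YOLO."""
    h, w = visible_bgr.shape[:2]
    ir = cv2.resize(ir_gray, (256, 256)).astype("float32") / 255.0
    with torch.no_grad():
        pred = unet(torch.from_numpy(ir)[None, None])[0, 0].numpy()   # probability mask in [0, 1]
    mask = (cv2.resize(pred, (w, h)) > 0.5).astype("uint8") * 255     # binary object/background mask
    masked = cv2.bitwise_and(visible_bgr, visible_bgr, mask=mask)     # background pixels become black
    return yolo(masked)                                               # detections on the cleared image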
Brief Description of Drawings
The invention will be described in detail with reference to the exemplary embodiments shown in the figures wherein:
Figure-1: Flow chart representing the workflow of the system.
Figure-2: U-Net architecture diagram used for segmentation of objects and background.
Detailed Description of the Invention
This invention involves image segmentation and object detection on low-light multispectral images using two different types of models (U-Net and YOLO). Thermally radiant objects such as people and cars are detected in low light through the infrared thermal energy they emit, which is captured in infrared imagery. Each of the two models is used for the task it specializes in: U-Net is trained and used for segmentation, and YOLO is used to detect the objects in those segments. Instead of fusing the features from the infrared and visible images inside the architecture itself, as in some of the inventions mentioned above, the features of the infrared image are used only for segmentation in the U-Net model, and the features of the visible image are used only for the detection part of the pipeline using YOLO. Running this object detection pipeline therefore requires images from two spectra: infrared and visible. The U-Net architecture is prepared first, as shown in Fig. 2. U-Net is a very popular choice for image segmentation tasks such as medical imaging and object detection. It is an encoder-decoder convolutional neural network: convolutional layers with ReLU activations and max-pooling layers perform down-sampling, up-sampling layers reconstruct the data to its original dimensions after processing, and skip connections concatenate the feature maps from the encoder to the corresponding decoder layers, preserving spatial information. The encoder (contracting path) captures the context of the input image. Each encoder block has two consecutive convolutional layers with a 3x3 kernel, each followed by a ReLU activation function.
After each block, a max-pooling layer with a 2x2 kernel is applied to reduce the spatial dimensions while preserving depth, and the number of channels is doubled at each step to capture more complex features. The bottleneck contains two convolutional layers followed by ReLU; it is the narrowest part of the architecture and holds the most compressed representation of the image. Each decoder block starts with up-sampling using de-convolution (transposed convolution) to increase the spatial dimensions. As in the encoder, each decoder block then has two convolutional layers with ReLU activation. The feature maps from the corresponding encoder block are concatenated with the up-sampled feature maps, which helps the model retain fine details lost during down-sampling. Finally, the output layer is a 1x1 convolutional layer that maps the features to the desired number of output channels, here a single channel for binary segmentation of object and background. Since this is binary segmentation, a sigmoid activation function is used. A compact sketch of such a network is given below.
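As a concrete illustration of this architecture, the following PyTorch sketch shows a shallow two-level U-Net with the components described above: double 3x3 convolutions with ReLU, 2x2 max-pooling, channel doubling, transposed-convolution up-sampling, skip connections, and a 1x1 sigmoid output for the single-channel binary mask. The channel widths and depth are assumptions chosen for brevity, not the exact configuration used in the invention.

import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 convolutions, each followed by ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class UNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=1, base=64):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)                   # encoder blocks (channels double)
        self.enc2 = double_conv(base, base * 2)
        self.pool = nn.MaxPool2d(2)                            # 2x2 max-pooling for down-sampling
        self.bottleneck = double_conv(base * 2, base * 4)      # narrowest part of the network
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)   # transposed-conv up-sampling
        self.dec2 = double_conv(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)
        self.out = nn.Conv2d(base, out_ch, 1)                  # 1x1 conv to a single mask channel

    def forward(self, x):
        e1 = self.enc1(x)                                      # contracting path captures context
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))    # skip connection from encoder
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.out(d1))                     # probability mask for binary segmentation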
The infrared and mask images from the dataset are used to train the U-Net model to predict a mask image when an infrared image is given. The YOLO model is also trained on low-light images to better condition it for the specific application of this pipeline. To prevent overfitting, early stopping is used and the best model is saved during training (a sketch of this training loop is given after the detection steps below). After the U-Net model is trained, YOLOv11 is used alongside it, as it is one of the most accurate object detection models. It is the newest version of the Ultralytics YOLO series of real-time object detection models, which have consistently offered strong accuracy, speed, and efficiency, and the pretrained YOLOv11 model covers a total of 80 object classes. It detects objects through the following steps.
Input feature extraction: The input image travels through the layers of the CNN backbone, where the image features are extracted.
Feature fusion: The "neck" of the model receives the features extracted by the backbone; they are combined and examined at several scales to provide enough distinction of the image's content.
Grids: YOLOv11 divides the image into a grid. For each grid cell, bounding boxes are predicted along with class probabilities, which express the likelihood that an object exists in that box.
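The following sketch illustrates the training step described above under stated assumptions: the U-Net is fitted to map infrared images to binary masks with a binary cross-entropy loss, the best model by validation loss is saved, and early stopping halts training when validation loss stops improving. The dataset objects, batch size, learning rate, and patience are assumptions, not values from the specification; the commented lines at the end indicate how the YOLOv11 detector could likewise be fine-tuned on low-light images with the Ultralytics API.

import torch
from torch.utils.data import DataLoader

def train_unet(model, train_ds, val_ds, epochs=100, patience=10, lr=1e-3):
    # train_ds / val_ds yield (ir, mask) pairs: 1-channel IR images and float binary masks
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()                       # model already outputs sigmoid probabilities
    train_dl = DataLoader(train_ds, batch_size=8, shuffle=True)
    val_dl = DataLoader(val_ds, batch_size=8)
    best, stale = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for ir, mask in train_dl:
            opt.zero_grad()
            loss_fn(model(ir), mask).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(ir), mask).item() for ir, mask in val_dl) / len(val_dl)
        if val < best:
            best, stale = val, 0
            torch.save(model.state_dict(), "unet_best.pt")   # keep the best checkpoint
        else:
            stale += 1
            if stale >= patience:                            # early stopping
                break

# Possible YOLOv11 fine-tuning on a low-light dataset (hypothetical dataset YAML):
# from ultralytics import YOLO
# YOLO("yolo11n.pt").train(data="lowlight.yaml", epochs=50)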
Finally, the result is an image with bounding boxes around the detected objects, along with class names and confidence scores based on the predicted probabilities. After the training process is completed, as shown in Fig. 1, the input image is classified as low light or normal light using factors such as pixel intensity and average pixel brightness. Based on this classification, it is decided whether to use the U-Net model. If the image has low clarity, was taken in low light, contains occluded objects, or comes from a challenging environment, the infrared image of the captured pair is given to the U-Net model to generate the predicted mask, which is masked over the visible image using the bitwise_and operator. This clears the background in the visible image so that only the pixels covering the objects remain with their original colour and intensity, significantly improving the performance of the YOLO model by eliminating the background and bringing the objects into focus. A minimal sketch of this low-light check and branching is given below.
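Building on the detect_low_light sketch shown earlier, this minimal Python sketch illustrates the branching logic: the frame is classified by its average pixel brightness, and the U-Net masking branch is only invoked for low-light frames. The brightness threshold of 60 on a 0-255 scale is an assumption, not a value taken from the specification.

import cv2
import numpy as np
from ultralytics import YOLO

yolo = YOLO("yolo11n.pt")                              # pretrained YOLOv11 detector

def is_low_light(visible_bgr, threshold=60):
    """Classify a frame as low light from its average pixel brightness (assumed threshold)."""
    gray = cv2.cvtColor(visible_bgr, cv2.COLOR_BGR2GRAY)
    return float(np.mean(gray)) < threshold

def run_pipeline(visible_bgr, ir_gray):
    if is_low_light(visible_bgr):
        # Low light: clear the background with the U-Net mask first
        # (detect_low_light is the function sketched in the summary section above).
        return detect_low_light(visible_bgr, ir_gray)
    return yolo(visible_bgr)                           # normal light: run YOLO on the visible image directly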
3 Claims and 2 Figures
Equivalents
The present invention, a low-light multispectral image processing pipeline using U-Net and YOLO models, is designed for effective image segmentation and object detection in challenging environments. The U-Net model specializes in segmenting infrared images, while YOLO processes visible images for object detection. The pipeline can be adapted to different segmentation and detection models, as well as various imaging modalities beyond the infrared and visible spectra. Its versatility allows it to handle other applications requiring robust object detection and segmentation, making it suitable for diverse real-world scenarios beyond the specific low-light conditions described.

Claims: The scope of the invention is defined by the following claims:
1. An enhanced multimodal object detection system using U-Net centric feature fusion, comprising:
a) The system integrates a U-Net segmentation model configured to receive infrared images and generate the segmentation masks that separate background and thermal objects (region of interest).
b) The system also uses a YOLO object detection model configured to receive visible images for detecting objects.
c) The system employs a preprocessing module that classifies the input image as low-light or normal-light based on pixel intensity and other brightness factors.
d) The system applies the predicted segmentation mask from the trained U-Net model to the visible image using a bitwise_and operation before the image is given to the YOLO model, clearing the background of the RGB image to enhance detection performance.
2. According to claim 1, the neural networks U-Net and YOLO are used together to detect the objects in low light. The U-Net segmentation algorithm assists the YOLO detection algorithm by clearing the background of the image.
3. According to claim 1, instead of combining the features of both spectra within a single model architecture, the features are used separately in different models where they are needed. The U-Net model uses the infrared image to predict segmentation masks that separate the region of interest from the background, whereas the YOLO model uses the visible image, altered using the predicted mask, to detect the objects.

Documents

Application Documents

# Name Date
1 202541071003-REQUEST FOR EARLY PUBLICATION(FORM-9) [25-07-2025(online)].pdf 2025-07-25
2 202541071003-FORM-9 [25-07-2025(online)].pdf 2025-07-25
3 202541071003-FORM FOR STARTUP [25-07-2025(online)].pdf 2025-07-25
4 202541071003-FORM FOR SMALL ENTITY(FORM-28) [25-07-2025(online)].pdf 2025-07-25
5 202541071003-FORM 1 [25-07-2025(online)].pdf 2025-07-25
6 202541071003-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [25-07-2025(online)].pdf 2025-07-25
7 202541071003-EVIDENCE FOR REGISTRATION UNDER SSI [25-07-2025(online)].pdf 2025-07-25
8 202541071003-EDUCATIONAL INSTITUTION(S) [25-07-2025(online)].pdf 2025-07-25
9 202541071003-DRAWINGS [25-07-2025(online)].pdf 2025-07-25
10 202541071003-COMPLETE SPECIFICATION [25-07-2025(online)].pdf 2025-07-25