
A System For Pedestrian Detection And Tracking

Abstract: The invention discloses a multimodal Faster RCNN framework integrating Inception and ResNet V2 architectures for robust pedestrian tracking and detection. The system processes multimodal data, including RGB and infrared images, using a hybrid convolutional feature extraction approach combined with attention-based fusion to enhance detection accuracy under challenging environmental conditions. A region proposal network identifies candidate pedestrian regions, while classification and bounding box regression ensure precise localization. The model achieves high accuracy with reduced false detections and efficient tracking across video frames. Tested on the Penn-Fudan dataset, it demonstrates superior performance compared to conventional methods, offering an advanced solution for smart surveillance and real-time pedestrian detection applications.


Patent Information

Application # 202541098664
Filing Date
13 October 2025
Publication Number
46/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

SR UNIVERSITY
ANANTHSAGAR, HASANPARTHY (M), WARANGAL URBAN, TELANGANA - 506371, INDIA

Inventors

1. JOHNSON KOLLURI
SR UNIVERSITY, ANANTHSAGAR, HASANPARTHY (M), WARANGAL URBAN, TELANGANA - 506371, INDIA
2. SANDEEP KUMAR DASH
NIT MIZORAM, AIZAWL - 796012, INDIA
3. RANJITA DAS
NIT AGARTALA, TRIPURA - 799046, INDIA

Specification

Description: FIELD OF THE INVENTION
The present invention relates to the field of computer vision and artificial intelligence. More particularly, it pertains to an advanced multimodal image classification and detection system for pedestrian tracking within smart surveillance environments. The invention integrates multimodal Faster Region Convolutional Neural Networks (RCNN) with Inception and ResNet V2 architectures to accurately identify, track, and classify pedestrians in varying environmental conditions.
BACKGROUND OF THE INVENTION
Pedestrian detection and tracking within smart building surveillance systems present a significant challenge due to image distortions caused by various environmental factors. Traditional image classification methods and machine learning algorithms, such as histogram of oriented gradients (HOG) filters, struggle to perform efficiently with a large volume of pedestrian images under such conditions. These methods are often inadequate for handling the complexities of real-world surveillance footage, which can include low resolution and high compression.
Current solutions for pedestrian detection often rely on conventional filter-based image classification and traditional machine learning algorithms. While deep learning models like Faster R-CNN have been applied, they may not be optimized for the specific challenges of multimodal pedestrian detection in varying environmental conditions. Existing methods often face difficulties with image distortions and may not effectively discriminate pedestrian locations, especially in cases of occlusion.
A spatio-contextual deep network design has been proposed that effectively utilizes multimodal data. For extracting features from the two modalities, it comprises two distinct deformable ResNeXt-50 encoders. A multimodal feature embedding module (MuFEm), made up of a neural-network-based unit and several groups of Graph Attention Networks, performs the fusion of the two encoded feature sets. The output of MuFEm's final feature fusion unit is then passed to two CRFs for spatial refinement.
The multimodal data YOLOv3 (MDY) method has been used for detection and recognition on embedded devices. Using YOLOv3 as its base framework, the MDY method enhances pedestrian detection performance by optimizing anchor frames and adding small-target detection branches. The method is further accelerated with TensorRT to improve real-time performance on embedded devices.
An attention-guided multi-modal and multi-scale fusion (AMSF) module has been introduced to align complementary local features dispersed across multi-modal and multi-scale layers and to combine them flexibly with fine-grained attention, so that multiple modalities are properly exploited for superior multi-scale prediction results.
A cross-modal object tracking network based on Gaussian Cross Attention (GCANet) has been proposed that uses all available multi-modal characteristics to enhance detection capability. Important multi-modal features are highlighted through bidirectional coupling of local features from distinct modalities, increasing detection performance by realizing feature interaction and fusion between the modalities.
Using a separation-and-combination technique, a cross-modal feature learning (CFL) module has been proposed to systematically examine both the common and the modality-specific characteristics of paired RGB and infrared images. To learn cross-modal representations at various semantic levels, the CFL module is integrated into several layers of a two-branch pedestrian recognition network. The multimodal network is trained end-to-end by jointly optimizing a multi-task loss that includes a separation-based auxiliary task.
Other authors offer a single-stage detection system that uses multi-label learning to acquire input-state-aware features by tagging each input image pair with a different label based on its current state. They also describe a new augmentation technique that synthesizes unpaired multispectral images using geometric transformations.
The proposed MM Fast RCNN ResNet achieves higher precision, recall, and average precision compared to contemporary techniques. Specifically, it recorded a precision of 0.9057, a recall of 0.8629, and an average precision of 0.0943. Unlike traditional methods that are less effective with distorted images, this system is designed to perform robustly under various external environmental factors, making it highly suitable for smart building monitoring.
SUMMARY OF THE INVENTION
This summary is provided to introduce a selection of concepts, in a simplified format, that are further described in the detailed description of the invention.
This summary is neither intended to identify key or essential inventive concepts of the invention, nor is it intended to determine the scope of the invention.
This invention proposes a novel multimodal classifier-based pedestrian identification method called Multimodal Faster RCNN Inception and ResNet V2 (MM Fast RCNN ResNet). The system utilizes a regularized neural network where the feature representation is automatically adjusted for the detection task. By integrating Inception and ResNet V2 architectures with a Faster R-CNN framework, the model is designed to handle a large amount of image data and address tracking problems, forming a basis for various object recognition tasks. The proposed method was assessed using the Penn-Fudan dataset and demonstrated high accuracy.
To further clarify the advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
The present invention introduces a Multimodal Faster RCNN with Inception and ResNet V2 architecture (MM_FAST_RCNN_RESNET), a system specifically designed for robust pedestrian detection and tracking in complex environments. The proposed framework combines the spatial analysis capabilities of the Faster RCNN model with the deep representational strengths of Inception and ResNet V2 networks to form a unified multimodal detection system.
The system employs dual-channel multimodal inputs—such as RGB and infrared images—to improve detection under diverse lighting and occlusion conditions. Feature extraction is performed through hybrid convolutional layers that utilize the hierarchical feature learning capability of Inception modules and the residual learning capacity of ResNet V2. These extracted multimodal features are then integrated into a unified feature map for accurate pedestrian localization.
A region proposal network (RPN) identifies candidate pedestrian regions, while bounding box regression refines spatial localization. The classifier component, trained on multimodal data, distinguishes pedestrian objects from the background with high precision. The network further uses a regularized feature learning mechanism to dynamically adjust feature representation based on input conditions, ensuring stability and generalization.
Performance evaluation conducted on benchmark datasets, such as the Penn-Fudan pedestrian dataset, demonstrated significantly improved accuracy, achieving a precision of 0.9057, a recall of 0.8629, and an average precision of 0.0943, outperforming existing state-of-the-art models.
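By way of non-limiting illustration, the precision and recall figures above follow their standard definitions; the sketch below (assuming a Python implementation, with purely hypothetical counts rather than the actual Penn-Fudan results) shows how these metrics are computed:
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    # Precision: fraction of predicted pedestrians that are correct.
    precision = true_positives / (true_positives + false_positives)
    # Recall: fraction of ground-truth pedestrians that were detected.
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical counts for illustration only.
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=15)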
By combining multimodal data fusion with deep learning, the system exhibits exceptional adaptability to environmental noise, resolution degradation, and occlusion effects. It is particularly useful in surveillance systems, intelligent traffic monitoring, and autonomous navigation scenarios requiring accurate pedestrian detection and tracking.
BRIEF DESCRIPTION OF THE DRAWINGS
The illustrated embodiments of the subject matter will be understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of devices, systems, and methods that are consistent with the subject matter as claimed herein, wherein:
FIGURE 1: SYSTEM ARCHITECTURE
The figures depict embodiments of the present subject matter for the purposes of illustration only. A person skilled in the art will easily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION OF THE INVENTION
The detailed description of various exemplary embodiments of the disclosure is described herein with reference to the accompanying drawings. It should be noted that the embodiments are described herein in such details as to clearly communicate the disclosure. However, the amount of details provided herein is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the present disclosure as defined by the appended claims.
It is also to be understood that various arrangements may be devised that, although not explicitly described or shown herein, embody the principles of the present disclosure. Moreover, all statements herein reciting principles, aspects, and embodiments of the present disclosure, as well as specific examples, are intended to encompass equivalents thereof.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
In addition, the descriptions of “first”, “second”, “third”, and the like in the present invention are used for the purpose of description only, and are not to be construed as indicating or implying their relative importance or implicitly indicating the number of technical features indicated. Thus, features defining “first” and “second” may include at least one of the features, either explicitly or implicitly.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which example embodiments belong. It will be further understood that terms, e.g., those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
This invention proposes a novel multimodal classifier-based pedestrian identification method called Multimodal Faster RCNN Inception and ResNet V2 (MM Fast RCNN ResNet). The system utilizes a regularized neural network where the feature representation is automatically adjusted for the detection task. By integrating Inception and ResNet V2 architectures with a Faster R-CNN framework, the model is designed to handle a large amount of image data and address tracking problems, forming a basis for various object recognition tasks. The proposed method was assessed using the Penn-Fudan dataset and demonstrated high accuracy.
The primary novelty of this invention lies in the construction of a multimodal faster R-CNN that combines Inception and ResNet V2 architectures specifically for pedestrian tracking and detection. This approach allows for the automatic adjustment of feature representation to the detection task, leading to superior accuracy in identifying pedestrians even in challenging surveillance environments. The collected attributes from this method establish a foundation for more complex object recognition tasks.
The invention discloses a multimodal deep learning-based detection and tracking system comprising a Faster RCNN backbone integrated with Inception and ResNet V2 architectures. The system operates in stages including multimodal data acquisition, preprocessing, feature extraction, fusion, classification, and tracking.
The multimodal input unit acquires RGB and infrared images from surveillance sources. These modalities are synchronized temporally and spatially to ensure aligned data input. Each input channel undergoes preprocessing involving normalization, resizing, and noise reduction to improve data quality and reduce computational complexity.
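A non-limiting preprocessing sketch is given below, assuming a Python implementation with OpenCV and PyTorch; the function name, target resolution, and filter size are illustrative choices rather than fixed parameters of the invention:
import cv2
import numpy as np
import torch

def preprocess_pair(rgb_bgr: np.ndarray, ir_gray: np.ndarray, size=(640, 480)):
    """Resize, denoise, and normalize a temporally aligned RGB/infrared frame pair."""
    # Resize both modalities to a common spatial resolution (width, height).
    rgb = cv2.resize(rgb_bgr, size, interpolation=cv2.INTER_LINEAR)
    ir = cv2.resize(ir_gray, size, interpolation=cv2.INTER_LINEAR)
    # Light Gaussian smoothing as a simple noise-reduction step.
    rgb = cv2.GaussianBlur(rgb, (3, 3), 0)
    ir = cv2.GaussianBlur(ir, (3, 3), 0)
    # Scale to [0, 1] and convert to channel-first tensors for the network.
    rgb_t = torch.from_numpy(rgb).float().permute(2, 0, 1) / 255.0
    ir_t = torch.from_numpy(ir).float().unsqueeze(0) / 255.0
    return rgb_t, ir_t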
The feature extraction unit integrates Inception and ResNet V2 networks to form a hybrid deep feature extractor. The Inception module captures multi-scale spatial features by processing the input through parallel convolutional layers of varying kernel sizes. This design enables the extraction of fine-grained spatial details from both low- and high-resolution image patches. The ResNet V2 component introduces skip connections that allow gradient flow through deeper layers, reducing vanishing gradient problems and enabling the network to learn deeper feature representations without overfitting.
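The hybrid extractor may be understood through the following simplified sketch of a single block combining Inception-style parallel convolutions with a ResNet V2-style residual connection (assuming PyTorch; the actual Inception-ResNet V2 backbone is substantially deeper and this only conveys the idea):
import torch
import torch.nn as nn

class InceptionResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Parallel convolutions with different kernel sizes (Inception-style branches).
        self.branch1 = nn.Conv2d(channels, channels // 4, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, channels // 4, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels // 4, kernel_size=5, padding=2)
        # 1x1 projection back to the input width so the residual addition is valid.
        self.project = nn.Conv2d(3 * (channels // 4), channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        branches = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        # Residual skip connection eases gradient flow through deep stacks.
        return self.act(x + self.bn(self.project(branches)))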
The multimodal fusion layer combines extracted features from both RGB and infrared modalities. A weighted attention-based fusion mechanism is employed to assign adaptive importance to each modality based on environmental context—emphasizing infrared features in low-light conditions and RGB features in well-lit environments. The fusion process results in a high-dimensional multimodal feature map representing pedestrian-relevant visual cues.
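A simplified sketch of such an attention-weighted fusion layer is shown below (assuming PyTorch; the scoring network and its dimensions are illustrative placeholders for the adaptive mechanism described above):
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict two scalar weights (RGB, IR) from globally pooled features.
        self.scorer = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 2),
        )

    def forward(self, feat_rgb, feat_ir):
        # Global average pooling summarizes each modality's feature map.
        pooled = torch.cat([feat_rgb.mean(dim=(2, 3)), feat_ir.mean(dim=(2, 3))], dim=1)
        w = torch.softmax(self.scorer(pooled), dim=1)  # [B, 2], weights sum to 1
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_ir = w[:, 1].view(-1, 1, 1, 1)
        # Weighted sum emphasizes infrared in low light and RGB in well-lit scenes.
        return w_rgb * feat_rgb + w_ir * feat_ir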
The Region Proposal Network (RPN) identifies potential pedestrian regions using anchor boxes and applies convolutional feature mapping to predict objectness scores and bounding box coordinates. These proposals are filtered through non-maximum suppression to eliminate overlapping or redundant detections.
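The following non-limiting sketch illustrates an RPN head predicting objectness scores and box offsets per anchor, with proposals filtered by non-maximum suppression (assuming PyTorch and torchvision; anchor generation and box decoding are omitted for brevity):
import torch
import torch.nn as nn
from torchvision.ops import nms

class SimpleRPNHead(nn.Module):
    def __init__(self, in_channels: int, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, 1)      # score per anchor
        self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)  # dx, dy, dw, dh

    def forward(self, feature_map):
        t = torch.relu(self.conv(feature_map))
        return self.objectness(t), self.box_deltas(t)

def filter_proposals(boxes, scores, iou_thresh=0.7, top_n=300):
    """Suppress overlapping proposals and keep the highest-scoring ones."""
    keep = nms(boxes, scores, iou_thresh)[:top_n]
    return boxes[keep], scores[keep]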
The classification layer applies a softmax function to distinguish between pedestrian and non-pedestrian objects. Bounding box regression fine-tunes the detected region to maximize localization accuracy. The combination of the RPN and classifier components forms the core of the Faster RCNN framework, optimized through multimodal learning.
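A minimal sketch of the per-region head is given below (assuming PyTorch; two classes, pedestrian and background, with per-class box refinements as in a standard Faster RCNN head):
import torch
import torch.nn as nn

class PedestrianDetectionHead(nn.Module):
    def __init__(self, in_features: int, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_features, 1024), nn.ReLU(inplace=True))
        self.cls_score = nn.Linear(1024, num_classes)      # pedestrian / background
        self.bbox_pred = nn.Linear(1024, num_classes * 4)  # per-class box refinements

    def forward(self, pooled_region_features):
        x = self.fc(pooled_region_features)
        class_probs = torch.softmax(self.cls_score(x), dim=1)
        box_deltas = self.bbox_pred(x)
        return class_probs, box_deltas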
The tracking module integrates temporal consistency across video frames using a deep correlation filter and motion vector estimation. It maintains consistent pedestrian IDs across sequences, minimizing false re-identifications.
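By way of illustration only, the sketch below shows a greedy association step using motion-vector prediction and IoU matching; it is a simplified stand-in for the deep correlation filter described above, not the filter itself (assuming plain Python with NumPy):
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Match existing tracks to new detections; unmatched detections start new IDs."""
    assignments, used = {}, set()
    for track_id, (box, velocity) in tracks.items():
        predicted = np.asarray(box) + np.asarray(velocity)  # motion-vector prediction
        best, best_iou = None, iou_thresh
        for j, det in enumerate(detections):
            if j not in used and iou(predicted, det) > best_iou:
                best, best_iou = j, iou(predicted, det)
        if best is not None:
            assignments[track_id] = best
            used.add(best)
    return assignments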
The training module uses a multimodal dataset, where the loss function is defined as a weighted sum of classification and localization losses. Regularization techniques such as dropout and batch normalization are implemented to prevent overfitting. The model is trained end-to-end using stochastic gradient descent with momentum and adaptive learning rates.
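A non-limiting training-loop sketch follows (assuming PyTorch; model, train_loader, and the loss weight are hypothetical placeholders, and the learning-rate schedule is an illustrative choice):
import torch
import torch.nn as nn

def train(model, train_loader, epochs=20, loc_weight=1.0):
    # Classification (cross-entropy on logits) and localization (smooth L1) losses.
    criterion_cls = nn.CrossEntropyLoss()
    criterion_loc = nn.SmoothL1Loss()
    # Stochastic gradient descent with momentum, plus step-wise learning-rate decay.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.1)
    model.train()
    for _ in range(epochs):
        for images, labels, target_boxes in train_loader:
            class_logits, box_preds = model(images)
            # Weighted sum of classification and localization losses.
            loss = criterion_cls(class_logits, labels) + loc_weight * criterion_loc(box_preds, target_boxes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()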
The output visualization unit provides a user interface displaying real-time pedestrian bounding boxes, tracking trajectories, and detection confidence levels.
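A minimal visualization sketch is shown below (assuming OpenCV, which is an illustrative choice; the specification does not mandate a particular drawing library):
import cv2

def draw_detections(frame, detections):
    """detections: list of (track_id, (x1, y1, x2, y2), confidence)."""
    for track_id, (x1, y1, x2, y2), conf in detections:
        # Bounding box with the pedestrian ID and detection confidence.
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(frame, f"ID {track_id} {conf:.2f}", (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return frame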
The interconnection between the modules ensures that data flows sequentially from acquisition through processing and detection to visualization. Feedback loops between the classification and tracking modules allow continuous refinement of detection parameters based on temporal consistency.
The overall architecture ensures efficient utilization of computational resources while maintaining robustness under varying environmental factors. The invention can be implemented in both centralized server systems and embedded platforms for real-time applications in surveillance and traffic systems.
By leveraging multimodal learning and hybrid architectural design, the invention significantly enhances pedestrian detection accuracy, speed, and resilience against environmental distortions.

Claims: 1. A system for pedestrian detection and tracking comprising multimodal data input, feature extraction, fusion, and classification modules integrated into a Faster RCNN framework using Inception and ResNet V2 architectures.
2. The system as claimed in claim 1, wherein the feature extraction module utilizes parallel convolutional layers of Inception and residual blocks of ResNet V2 to obtain hierarchical spatial and semantic representations.
3. The system as claimed in claim 1, wherein multimodal fusion is achieved using an attention-based weighted mechanism that dynamically adjusts feature contributions from RGB and infrared inputs.
4. The system as claimed in claim 1, wherein the region proposal network generates pedestrian candidate regions and applies non-maximum suppression for spatial refinement.
5. The system as claimed in claim 1, wherein the classification layer employs softmax and bounding box regression for accurate pedestrian localization and identification.
6. The system as claimed in claim 1, wherein the tracking module maintains pedestrian identity across frames using correlation filters and motion vector analysis.
7. A method for multimodal pedestrian detection comprising the steps of acquiring multimodal image data, preprocessing, extracting features using hybrid CNN architectures, fusing multimodal features, and detecting pedestrians using region proposals.
8. The method as claimed in claim 7, wherein the extracted features from multimodal sources are fused using attention-based mechanisms to improve detection under varying environmental conditions.
9. The method as claimed in claim 7, wherein temporal tracking is achieved by integrating spatial detections across video frames to maintain consistent identification.
10. The system as claimed in claim 1, wherein the entire architecture is trained end-to-end using a multimodal dataset with adaptive loss functions balancing classification and localization accuracy.

Documents

Application Documents

# Name Date
1 202541098664-STATEMENT OF UNDERTAKING (FORM 3) [13-10-2025(online)].pdf 2025-10-13
2 202541098664-REQUEST FOR EARLY PUBLICATION(FORM-9) [13-10-2025(online)].pdf 2025-10-13
3 202541098664-POWER OF AUTHORITY [13-10-2025(online)].pdf 2025-10-13
4 202541098664-FORM-9 [13-10-2025(online)].pdf 2025-10-13
5 202541098664-FORM FOR SMALL ENTITY(FORM-28) [13-10-2025(online)].pdf 2025-10-13
6 202541098664-FORM 1 [13-10-2025(online)].pdf 2025-10-13
7 202541098664-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [13-10-2025(online)].pdf 2025-10-13
8 202541098664-EVIDENCE FOR REGISTRATION UNDER SSI [13-10-2025(online)].pdf 2025-10-13
9 202541098664-EDUCATIONAL INSTITUTION(S) [13-10-2025(online)].pdf 2025-10-13
10 202541098664-DRAWINGS [13-10-2025(online)].pdf 2025-10-13
11 202541098664-DECLARATION OF INVENTORSHIP (FORM 5) [13-10-2025(online)].pdf 2025-10-13
12 202541098664-COMPLETE SPECIFICATION [13-10-2025(online)].pdf 2025-10-13
13 202541098664-Proof of Right [17-11-2025(online)].pdf 2025-11-17