Abstract: The present disclosure relates to a system (102) and method (400) for gunshot detection using a hybrid deep learning architecture. The system (102) receives an audio signal from acoustic sensors (104) and converts it into a time-frequency spectrogram using a short-time Fourier transform (STFT). A first Vision Transformer (ViT) divides the spectrogram into patches of predefined size and applies positional encoding. The encoded patches are reshaped into a tensor and processed by a convolutional neural network (CNN) to extract local and multi-scale features and generate a global feature embedding. The embedding is flattened into a sequence of feature vectors, which is input into a second Vision Transformer (ViT) to capture temporal and contextual dependencies, and the output is classified into one of a plurality of firearm-related classes. The proposed system (102) ensures robust and accurate classification for real-time gunshot detection and can be deployed in surveillance, law enforcement, and public safety applications.
Description:
TECHNICAL FIELD
[0001] The present disclosure relates generally to the field of acoustic signal processing and machine learning. More specifically, it pertains to a system and method for gunshot detection using Vision Transformers (ViT) and convolutional neural networks (CNNs).
BACKGROUND
[0002] Background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
[0003] Gunshot incidents continue to pose serious risks across multiple sectors, including urban public safety, law enforcement, and wildlife conservation. In densely populated areas, gunfire endangers civilian lives and demands swift emergency responses to prevent escalation. In protected ecological zones, gunshots often indicate poaching, necessitating real-time detection to enable immediate intervention.
[0004] Traditional methods of gunshot detection rely on manual surveillance, basic sound thresholding, or rudimentary acoustic sensors. These approaches are limited by human perceptual biases, inconsistent reaction times, and poor adaptability to noisy or unpredictable acoustic environments. Consequently, such systems may fail to promptly or accurately identify gunfire, particularly in environments saturated with overlapping impulsive sounds like firecrackers, construction noise, or engine backfires.
[0005] To improve detection accuracy, signal processing techniques have been adopted, transforming raw audio into time-frequency representations such as spectrograms using Short-Time Fourier Transform (STFT). These visual forms of audio data facilitate the application of automated pattern recognition techniques.
[0006] With the advancement of machine learning, particularly deep learning, Convolutional Neural Networks (CNNs) have emerged as effective tools for processing spectrograms due to their ability to capture local patterns in image-like data. CNNs have been widely used in various acoustic and image classification applications. However, their capacity to understand long-range dependencies is limited, which may impact their performance in complex sound classification scenarios.
[0007] More recently, attention-based architectures like Vision Transformers (ViTs) have been introduced for image classification tasks. ViTs are designed to capture both local and global dependencies through self-attention mechanisms, making them highly suitable for identifying complex relationships within visual data. These models are increasingly being considered for tasks in acoustic event detection and sound classification.
[0008] Despite these technological advancements, several challenges remain. Current systems often struggle with high false-positive rates, low generalizability to varied environments, and poor performance in real-time detection scenarios. Additionally, the computational complexity and resource requirements of existing solutions may hinder their deployment in edge or resource-constrained settings.
[0009] Moreover, existing models may fail to distinguish gunshots from acoustically similar impulsive noises, especially in the presence of background disturbances. These shortcomings compromise the reliability and responsiveness of current detection systems used in real-world scenarios, including urban surveillance, security operations, and anti-poaching efforts.
[0010] Although a wide range of methods have been proposed for acoustic event classification, including deep learning-based approaches, many still exhibit limitations in accurately detecting impulsive sounds like gunshots amidst environmental noise. Issues such as latency, overfitting to training data, and lack of robustness in diverse conditions persist in practical deployments.
[0011] There is, therefore, a need for an improved, accurate, and efficient system for real-time gunshot detection that may operate reliably across diverse and noisy environments, minimizing false alarms and enhancing public safety, wildlife protection, and surveillance applications.
OBJECTS OF THE PRESENT DISCLOSURE
[0012] A general object of the present disclosure is to provide a system capable of detecting gunshots with high accuracy across varied acoustic environments.
[0013] An object of the present disclosure is to provide real-time gunshot detection to enhance emergency response and public safety.
[0014] An object of the present disclosure is to provide a noise-resilient detection approach effective in urban and natural surroundings.
[0015] An object of the present disclosure is to provide a solution that supports wildlife conservation by enabling the timely detection of poaching activities.
[0016] Another object of the present disclosure is to provide a solution adaptable to both indoor and outdoor settings for security monitoring.
[0017] Another object of the present disclosure is to provide an efficient solution for distinguishing gunshots from other impulsive acoustic events.
[0018] Another object of the present disclosure is to provide a scalable and automated system for continuous acoustic surveillance.
[0019] Another object of the present disclosure is to provide a system that contributes to law enforcement operations by improving situational awareness.
[0020] Another object of the present disclosure is to provide a solution that promotes sustainable urban development through enhanced safety infrastructure.
[0021] Another object of the present disclosure is to provide a framework suitable for integration into smart city and conservation technology ecosystems.
SUMMARY
[0022] Aspects of the present disclosure relate to the field of acoustic signal processing and machine learning. More specifically, it pertains to a system and method for gunshot detection using Vision Transformers (ViT) and convolutional neural networks (CNNs). This approach provides several advantages, including improved accuracy in detecting gunshots amidst background noise, enhanced ability to capture both local and global dependencies in acoustic signals, and reduced false positives. Additionally, the system and method are configured for real-time detection, making it suitable for both urban and rural environments, and offer scalability for integration into larger surveillance and security infrastructures.
[0023] An aspect of the present disclosure pertains to a system for gunshot detection. The system includes one or more processors and a memory coupled to the processors. The memory stores instructions that, when executed by the processors, cause the system to receive an audio signal from acoustic sensors deployed within a predefined area and convert the received audio signal into a time-frequency representation in the form of a spectrogram using a short-time Fourier transform (STFT) with a fixed frame length and overlap. A first Vision Transformer (ViT) is applied to the spectrogram to divide it into a plurality of patches of predefined size and encode positional information for each patch. The encoded patches are reshaped into a tensor format for convolutional processing. A convolutional neural network (CNN) processes the reshaped tensor to extract one or more local and multi-scale features and generate a global feature embedding. The global feature embedding is flattened into a sequence of feature vectors and input into a second Vision Transformer to learn temporal and contextual dependencies. The output of the second Vision Transformer is classified into one of a plurality of firearm-related classes using a fully connected layer followed by a softmax activation function.
[0024] In an aspect, the first Vision Transformer is configured to divide the spectrogram into patches and apply positional encoding to each patch to generate patch-wise embedding suitable for downstream processing.
[0025] In an aspect, the convolutional neural network (CNN) includes a GoogleNet architecture, which includes a 7×7 convolutional layer followed by a max pooling layer, at least two inception modules configured to extract the one or more local and multi-scale features, and a global average pooling layer configured to aggregate outputs of the inception modules and flatten them into the global feature embedding.
[0026] In an aspect, the second Vision Transformer includes a linear dimension-reduction layer configured to project each feature vector to a lower-dimensional embedding, a multi-head self-attention block configured to compute self-attention across the embedded feature sequence, a residual skip connection configured to add the input of each block to its corresponding output, and a fully connected layer followed by a softmax activation function to generate a probability distribution over the plurality of firearm-related classes.
[0027] In an aspect, the processors are configured to train the first Vision Transformer, the convolutional neural network, and the second Vision Transformer using an Adam optimizer and a sparse categorical cross-entropy loss function. The training is performed on data divided into training, validation, and test subsets.
[0028] Another aspect of the present disclosure relates to a method for detecting a gunshot. The method includes receiving, by one or more processors, an audio signal from one or more acoustic sensors positioned within a predefined area. The received audio signal is converted into a time-frequency representation in the form of a spectrogram using a short-time Fourier transform (STFT) with a fixed frame length and frame overlap. A first Vision Transformer (ViT) is applied to the spectrogram to divide it into a plurality of patches of a predefined size and to encode positional information for each patch. The encoded patches are reshaped into a tensor format suitable for convolutional processing. The reshaped tensor is processed using a convolutional neural network (CNN) to extract one or more local and multi-scale features and to generate a global feature embedding. The global feature embedding is flattened to form a sequence of feature vectors. The method further includes inputting the sequence of feature vectors into a second Vision Transformer to learn temporal and contextual dependencies. The output of the second Vision Transformer is classified into one of a plurality of firearm-related classes using a fully connected layer followed by a softmax activation function.
BRIEF DESCRIPTION OF DRAWINGS
[0029] The accompanying drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure. The diagrams are for illustration only, which is thus not a limitation of the present disclosure.
[0030] FIG. 1 illustrates an exemplary block diagram of the proposed system for gunshot detection using a Hybrid Vision Transformer and Convolutional Neural Network applied to a spectrogram, in accordance with an embodiment of the present invention.
[0031] FIG. 2 illustrates an exemplary flow chart illustrating the working of the proposed system for gunshot detection, in accordance with an embodiment of the present invention.
[0032] FIG. 3 illustrates an exemplary architecture of the proposed system for gunshot detection, in accordance with an embodiment of the present invention.
[0033] FIG. 4 illustrates an exemplary flow diagram of a method for gunshot detection, in accordance with an embodiment of the present invention.
[0034] FIG. 5 illustrates an exemplary computer system in which or with which embodiments of the present disclosure may be implemented, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF THE PRESENT INVENTION
[0035] The following is a detailed description of embodiments of the disclosure represented in the accompanying drawings. The disclosed embodiments are merely exemplary of the invention, which may be embodied in various forms. The embodiments are in such detail as to clearly communicate the disclosure. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure as defined by the appended claims.
[0036] Embodiments of the present disclosure relate to the field of acoustic signal processing and machine learning. More specifically, the disclosure pertains to a system and method for gunshot detection using Vision Transformers (ViT) and convolutional neural networks (CNNs).
[0037] An embodiment of the present disclosure pertains to a system and method for gunshot detection. The system includes one or more processors and a memory coupled to the processors. The memory stores instructions that, when executed by the processors, cause the system to receive an audio signal from acoustic sensors deployed within a predefined area and convert the received audio signal into a time-frequency representation in the form of a spectrogram using a short-time Fourier transform (STFT) with a fixed frame length and overlap. A first Vision Transformer (ViT) is applied to the spectrogram to divide it into a plurality of patches of predefined size and encode positional information for each patch. The encoded patches are reshaped into a tensor format for convolutional processing. A convolutional neural network (CNN) processes the reshaped tensor to extract one or more local and multi-scale features and generate a global feature embedding. The global feature embedding is flattened into a sequence of feature vectors and input into a second Vision Transformer to learn temporal and contextual dependencies. The output of the second Vision Transformer is classified into one of a plurality of firearm-related classes using a fully connected layer followed by a softmax activation function.
[0038] FIG. 1 illustrates an exemplary block diagram of the proposed system for gunshot detection using a Hybrid Vision Transformer and Convolutional Neural Network applied to a spectrogram, in accordance with an embodiment of the present invention.
[0039] FIG. 2 illustrates an exemplary flow chart illustrating the working of the proposed system for gunshot detection, in accordance with an embodiment of the present invention.
[0040] FIG. 3 illustrates an exemplary architecture of a proposed system for gunshot detection, in accordance with an embodiment of the present invention.
[0041] Referring to FIGs. 1, 2, and 3, a system (102) for gunshot detection is disclosed. The system (102) utilises advanced acoustic signal processing techniques combined with machine learning models to accurately detect gunshots in real-time across diverse environments. The system (102) includes one or more processors (106) that may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that manipulate data based on operational instructions. Among other capabilities, the processors (106) may be configured to fetch and execute computer-readable instructions stored in a memory (108). The memory (108) may store one or more computer-readable instructions or routines, which may be fetched and executed for processing audio signals. The memory (108) may include any non-transitory storage device, including, for example, volatile memory such as Random Access Memory (RAM), or non-volatile memory such as an Erasable Programmable Read-Only Memory (EPROM), flash memory, and the like.
[0042] In an embodiment, the processors (106) may also include an interface(s) (110). The interface(s) (110) may include a variety of interfaces, for example, interfaces for data input and output devices, referred to as Input/Output (I/O) devices, storage devices, and the like. The interface(s) (110) may provide a communication pathway for one or more components of the system (102). Examples of such components include, but are not limited to, processing engine(s) (112) and a database (126).
[0043] In an embodiment, the processing engine(s) (112) may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) (112). In other embodiments, the processing engine(s) (112) may be implemented by electronic circuitry. The database (126) may include data that is either stored or generated as a result of functionalities implemented by any of the components of the processing engine(s) (112). In some embodiments, the processing engine(s) (112) may include an audio signal reception module (114), a spectrogram generation module (116), a spectrogram encoding module (118), a feature extraction module (120), a temporal modelling and classification module (122), and other module(s) (124). The other module(s) (124) may implement functionalities that supplement applications/functions performed by the system (102).
[0044] In an embodiment, the audio signal reception module (114) is configured to receive an audio signal originating from a predefined area of interest. The audio signal reception module (114) acts as an input interface to capture ambient sound data using one or more acoustic sensors (104) or microphones. The acoustic sensors (104) are operatively coupled to the processors (106). The predefined area may include zones such as public spaces, urban streets, protected forests, or any monitored region where gunshot detection is required. The received audio signal serves as raw input that undergoes further processing for identifying and classifying potential gunshot events. For instance, a model used in the present system (102) is trained using a curated dataset sourced from Kaggle, which includes 851 audio samples categorised into nine firearm classes, including AK-12, AK-47, M4, and M16. The dataset enables assessment of the robustness of the model against changes in gunshot loudness. The dataset exhibits a slightly imbalanced class distribution, with the largest class containing around 100 samples and the smallest approximately 72 samples. In preparing the data for the CNN model, preprocessing steps such as contrast adjustment and intensity scaling are employed to enhance the robustness of the model to variations in signal amplitude and background noise levels in the spectrogram representations of audio signals.
[0045] In an embodiment, the spectrogram generation module (116) is configured to convert the received audio signal into a time-frequency spectrogram using a Short-Time Fourier Transform (STFT) with a fixed frame length and overlap. The resulting spectrogram is transformed into absolute values, and a channel dimension is added to standardise input shapes. For instance, the audio samples are normalised to a fixed duration of 1800 milliseconds by padding shorter signals and truncating longer ones, ensuring uniformity across the dataset. To address class imbalance, synthetic samples are generated through oversampling techniques for underrepresented classes. To simulate real-world scenarios, environmental sounds such as animal noises, wind, and other ambient audio are added to increase diversity and robustness. The preprocessing pipeline includes converting audio into consistent spectrogram shapes suitable for CNNs, followed by augmentations like random contrast adjustments and time shifts to enhance generalisation. Audio data is resampled to 16 kHz, and TensorFlow datasets are created with corresponding labels. Further, the dataset is divided into training, validation, and test sets, with caching, shuffling, and batching incorporated for optimised data loading during model training.
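By way of a non-limiting illustration, the spectrogram generation and duration normalisation described above may be sketched as follows, assuming a TensorFlow implementation (TensorFlow is referenced above for dataset creation). The 1800 ms duration and 16 kHz sampling rate follow the description, whereas the specific STFT frame length and frame step values are illustrative assumptions only.

import tensorflow as tf

def to_spectrogram(waveform, frame_length=255, frame_step=128):
    # STFT with a fixed frame length and overlap; these particular values are
    # illustrative assumptions and are not recited in the specification.
    stft = tf.signal.stft(waveform, frame_length=frame_length, frame_step=frame_step)
    spectrogram = tf.abs(stft)               # take absolute values
    return spectrogram[..., tf.newaxis]      # add a channel dimension

def preprocess(waveform, target_len=28800):
    # 1800 ms at 16 kHz = 28,800 samples: truncate longer clips, zero-pad shorter ones.
    waveform = tf.cast(waveform[:target_len], tf.float32)
    padding = tf.zeros([target_len] - tf.shape(waveform), dtype=tf.float32)
    return to_spectrogram(tf.concat([waveform, padding], axis=0))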
[0046] In an embodiment, the spectrogram encoding module (118) is configured to apply a first Vision Transformer (ViT) to the input spectrogram. This divides the spectrogram into a plurality of non-overlapping patches of a predefined size (e.g., 16×16 pixels). Each patch represents a localized region in the time-frequency domain. The module then applies positional encoding to each patch to retain spatial information lost during the patching process. This positional information allows the Vision Transformer to understand the relative locations of patches in the spectrogram. After positional encoding, the patches are linearly embedded and reshaped into a tensor format that is compatible with convolutional processing.
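By way of a non-limiting illustration, such a patch-and-encode stage may be sketched as follows, assuming a TensorFlow/Keras implementation; the 16×16 patch size follows the example above, while the projection dimension and the assumed number of patches are illustrative values only.

import tensorflow as tf
from tensorflow.keras import layers

class PatchEncoder(layers.Layer):
    # Splits a spectrogram into non-overlapping patches, linearly embeds each
    # patch, and adds a learned positional embedding (first ViT stage).
    def __init__(self, patch_size=16, num_patches=64, projection_dim=64):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = num_patches        # assumed patch count for the input size
        self.projection = layers.Dense(projection_dim)
        self.position_embedding = layers.Embedding(input_dim=num_patches,
                                                   output_dim=projection_dim)

    def call(self, images):
        patches = tf.image.extract_patches(
            images=images,                                    # (batch, H, W, C) spectrograms
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID")
        batch = tf.shape(patches)[0]
        patches = tf.reshape(patches, [batch, self.num_patches, -1])
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        # Patch-wise embedding = linear projection + positional encoding.
        return self.projection(patches) + self.position_embedding(positions)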
[0047] In an embodiment, the feature extraction module (120) is configured to process the reshaped spectrogram patches using a convolutional neural network (CNN) to extract one or more local and multi-scale features and to generate a global feature embedding. The convolutional neural network may be implemented using a GoogleNet architecture, which is particularly effective due to its inception modules. In alternative configurations, the convolutional neural network (CNN) may employ other backbone architectures such as ResNet, which incorporates residual connections to ease the training of deep networks and prevent vanishing gradient issues; AlexNet, which includes five convolutional layers followed by fully connected layers to enable hierarchical feature abstraction; and EfficientNet, which utilizes depthwise separable convolutions along with squeeze-and-excitation (SE) blocks to improve both accuracy and computational efficiency.
[0048] Initially, the reshaped tensor input is passed through a 7×7 convolutional layer followed by a max pooling layer, which helps in spatial downsampling and early feature extraction. The output of this layer is then forwarded to at least two inception modules, which are designed to capture features at multiple receptive fields (e.g., 1×1, 3×3, 5×5 convolutions) in parallel. These inception modules are instrumental in extracting multi-scale spatial features efficiently.
[0049] Subsequently, a global average pooling (GAP) layer aggregates the outputs of the inception modules to reduce the spatial dimensions and generate a compact global feature embedding. This global embedding is then flattened into a sequence of feature vectors, which serves as input to downstream temporal modeling components.
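A condensed sketch of such a GoogleNet-style feature extractor, assuming a TensorFlow/Keras implementation, is given below; the 7×7 convolution, max pooling, two inception modules, and global average pooling follow the description, while the filter counts and input shape are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fp):
    # Parallel 1x1, 3x3, and 5x5 convolutions plus a pooled projection,
    # concatenated along the channel axis to capture multi-scale features.
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

def build_feature_extractor(input_shape=(128, 128, 1)):
    # Filter counts and input shape are illustrative assumptions.
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    x = inception_module(x, 64, 128, 32, 32)
    x = inception_module(x, 128, 192, 96, 64)
    embedding = layers.GlobalAveragePooling2D()(x)   # compact global feature embedding
    return tf.keras.Model(inputs, embedding)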
[0050] Further, the flattened sequence of feature vectors is input into a second Vision Transformer (ViT), which is configured to learn temporal and contextual dependencies across the feature sequence. The second ViT processes the sequence using a series of transformer encoder layers. Each encoder layer includes a linear dimension-reduction component that projects the high-dimensional input into a lower-dimensional embedding space. This is followed by a multi-head self-attention mechanism, which enables the model to compute attention scores across the sequence and identify relationships between distant elements. A residual skip connection is integrated within each block to preserve input information and facilitate stable gradient flow during training. These layers collectively capture relevant temporal patterns embedded in the input data. The final output of the second ViT is passed through a fully connected layer followed by a softmax activation function, which generates a probability distribution over predefined firearm-related classes.
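The following non-limiting sketch, assuming a TensorFlow/Keras implementation, illustrates one such encoder layer together with the classification head; the embedding dimension, number of attention heads, layer normalisation, and sequence pooling before the dense head are conventional choices assumed for illustration rather than features recited above.

import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, embed_dim=64, num_heads=4):
    # Linear dimension reduction into a lower-dimensional embedding space.
    x = layers.Dense(embed_dim)(x)
    # Multi-head self-attention across the embedded feature sequence.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
    x = layers.Add()([x, attn])                 # residual skip connection
    return layers.LayerNormalization()(x)       # conventional addition, not recited above

def build_second_vit(seq_len, feat_dim, num_classes=9):
    # Nine firearm-related classes, matching the dataset described earlier.
    inputs = layers.Input(shape=(seq_len, feat_dim))
    x = encoder_block(inputs)
    x = layers.GlobalAveragePooling1D()(x)      # sequence pooling is an illustrative choice
    outputs = layers.Dense(num_classes, activation="softmax")(x)   # fully connected + softmax
    return tf.keras.Model(inputs, outputs)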
[0051] Alternatively, or in comparative configurations, once the feature vectors are generated by the convolutional neural network (CNN), they may also be passed into temporal sequence modeling layers designed to capture time-dependent patterns. These layers may consist of recurrent models such as Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), or standard Recurrent Neural Networks (RNN). Typically, these models are implemented using two stacked layers, with the first containing 512 hidden units and the second containing 256 hidden units. This layered configuration enables the system to learn both short- and long-range temporal dependencies within the input sequence. To prevent overfitting, dropout regularization is applied between the recurrent layers during training. The final output from the temporal model is further processed through a fully connected layer followed by a softmax activation function.
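A sketch of this comparative recurrent configuration, assuming a TensorFlow/Keras implementation, follows; the stacked 512- and 256-unit layers mirror the description above, while the dropout rate and default cell type are illustrative choices.

import tensorflow as tf
from tensorflow.keras import layers

def build_recurrent_classifier(seq_len, feat_dim, num_classes=9, cell=layers.LSTM):
    # Two stacked recurrent layers (512 then 256 hidden units) with dropout in
    # between; substitute layers.GRU or layers.SimpleRNN for `cell`, or wrap the
    # layers in layers.Bidirectional(...) to obtain the BiLSTM variant.
    inputs = layers.Input(shape=(seq_len, feat_dim))
    x = cell(512, return_sequences=True)(inputs)
    x = layers.Dropout(0.3)(x)                  # dropout rate is an illustrative assumption
    x = cell(256)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)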
[0052] In an embodiment, the temporal modelling and classification module (122) is configured to receive the output generated by the second Vision Transformer and classify it into one of several predefined firearm-related categories. This classification is carried out using a fully connected layer that transforms the learned feature representation into a vector of class scores. These scores are further passed through a softmax activation function, which converts them into a probability distribution across all firearm classes. The class associated with the highest probability value is selected as the final prediction. By leveraging the rich temporal and contextual features extracted by the second Vision Transformer, this module enables the system to accurately differentiate between various firearm types, such as AK-47, M4, or M16, based on the distinct acoustic signatures they produce, as learned during the model's training phase.
[0053] In an embodiment, the first Vision Transformer (ViT), the convolutional neural network (CNN), the temporal sequence model, and the second Vision Transformer (ViT), are jointly trained using an Adam optimizer, which is well-suited for handling sparse gradients and adaptive learning rates. The training process involves minimizing a sparse categorical cross-entropy loss function, which measures the error between the predicted class probabilities and the true firearm class labels. The dataset used for training is divided into three distinct subsets: a training set for learning model parameters, a validation set for tuning hyperparameters and preventing overfitting, and a test set for evaluating the model's generalization performance. This structured training pipeline ensures that each model component learns to extract and interpret meaningful acoustic features, ultimately improving the system’s accuracy in gunshot detection and classification.
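A minimal training sketch, assuming a TensorFlow/Keras implementation, is shown below; here model, train_ds, val_ds, and test_ds are hypothetical handles to the assembled hybrid model and the cached, shuffled, and batched dataset splits described earlier, and the learning rate and epoch budget are assumptions.

import tensorflow as tf

def train_and_evaluate(model, train_ds, val_ds, test_ds, epochs=100):
    # Adam optimizer with a sparse categorical cross-entropy loss, per the
    # description; the learning rate and epoch count are illustrative assumptions.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  metrics=["accuracy"])
    history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)
    test_loss, test_acc = model.evaluate(test_ds)   # held-out test subset
    return history, test_loss, test_acc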
[0054] In an exemplary embodiment, the performance of the model varies across different noise conditions, with classification accuracies of 96%, 95%, 89%, and 86% at gunshot intensity levels of 20%, 15%, 10%, and 5%, respectively. These results indicate that the model's performance declines under higher environmental noise levels or lower gunshot signal intensities.
[0055] Referring to FIG. 4, a method (400) for detecting a gunshot is disclosed. At step (402), the method (400) includes receiving, by one or more processors (106), an audio signal from one or more acoustic sensors (104) deployed within a predefined area. At step (404), the method includes converting, by the processors (106), the audio signal into a time-frequency representation in the form of a spectrogram using a short-time Fourier transform (STFT) with fixed frame length and frame overlap. At step (406), the method (400) includes applying, by the processors (106), a first Vision Transformer (ViT) to the spectrogram to divide the spectrogram into a plurality of patches of predefined size and encoding positional information for each patch.
[0056] At step (408), the method (400) includes reshaping, by the processors (106), the encoded patches into a tensor format suitable for convolutional processing.
[0057] At step (410), the method (400) includes applying a convolutional neural network (CNN) to the reshaped patches for extracting one or more local and multi-scale features and generating a global feature embedding. In one embodiment, the CNN may use a GoogleNet architecture that includes a 7×7 convolutional layer followed by a max pooling layer, at least two inception modules to extract fine-grained spatial features, and a global average pooling (GAP) layer to aggregate outputs and generate a compact embedding. Alternatively, other CNN architectures such as ResNet, AlexNet, or EfficientNet may be used, offering different capabilities in feature representation.
[0058] At step (412), the method (400) includes flattening, by the processors (106), the global feature embedding into a sequence of feature vectors.
[0059] At step (414), the sequence of feature vectors is input into a second Vision Transformer (ViT), which is configured to learn temporal and contextual dependencies across the feature sequence. The second ViT includes a linear dimension-reduction layer for embedding projection, a multi-head self-attention block to compute attention weights across the sequence, and residual skip connections to preserve input information and support stable training. The output of the second ViT is then passed to a fully connected layer followed by a softmax activation function to produce a probability distribution over predefined firearm-related classes.
[0060] In another embodiment, the method may further include processing the sequence of feature vectors using a temporal sequence model, selected from a long short-term memory (LSTM) network, a bidirectional LSTM (Bi-LSTM), or a gated recurrent unit (GRU) network, to capture sequential dependencies across time.
[0061] At step (416), the method includes classifying, by the processors (106), the output of the second Vision Transformer or temporal sequence model into one of a plurality of firearm classes based on the probability distribution generated by the final softmax layer.
[0062] The method (400) further includes training the first Vision Transformer, the CNN, the second Vision Transformer, and the temporal sequence model using an Adam optimizer and a sparse categorical cross-entropy loss function. The training process is conducted on a dataset split into training, validation, and testing subsets to ensure generalization and robustness of the model in gunshot detection tasks.
EXPERIMENTAL ANALYSIS
[0063] In an exemplary implementation, a comprehensive analysis is carried out to determine the optimal patch size for spectrogram segmentation, which directly affects classification performance. Multiple patch sizes—16, 32, 64, 128, and others—are evaluated. Among these, a patch size of 128 delivers the highest classification accuracy of 85.97%, offering a balanced trade-off between capturing both local and global features while retaining the temporal structure of the input spectrogram. In contrast, smaller patch sizes lead to increased computational cost without yielding significant accuracy gains. The results are summarised in Table 1.
Table 1: Performance Metrics of ViT for Different Patch Sizes
Patch Size Accuracy Loss
16 0.7876 0.5631
32 0.8297 0.4071
64 0.8211 0.3950
128 0.8597 0.4065
[0064] To enhance the model's performance further, a hybrid deep learning architecture combining GoogleNet and Vision Transformer (ViT) is developed to overcome the limitations of standalone models. In the first phase of experimentation, multiple CNN architectures, including GoogleNet, ResNet, AlexNet, and EfficientNet, are evaluated for feature extraction capabilities. ViT is employed as a common classifier across all combinations. Among these, GoogleNet+ViT achieves the highest accuracy and lowest loss, as shown in Table 2, demonstrating GoogleNet’s superior ability to extract high-quality spatial features. However, due to its inception modules, GoogleNet incurs higher computational overhead. AlexNet+ViT, though slightly less accurate, proves advantageous for resource-constrained applications due to its lower complexity.
Table 2: Model Performance Metrics with ViT as Classifier
Feature Extractor Train Acc Train Loss Test Acc Test Loss Epoch
GoogleNet 1.0000 0.0041 1.0000 0.0021 93
ResNet 0.9538 0.1522 0.9964 0.0258 144
AlexNet 0.9781 0.0893 0.9893 0.0447 82
EfficientNet 0.2331 1.9771 0.2911 1.8744 53
[0065] In the second phase, GoogleNet is fixed as the feature extractor, and various sequential models such as LSTM, BiLSTM, GRU, and RNN are tested as classifiers. Once again, GoogleNet+ViT delivers the best performance, achieving perfect classification accuracy and minimal loss (as shown in Table 3). This highlights the advantage of ViT’s multi-head self-attention in modeling long-range dependencies over traditional recurrent architectures.
Table 3: Model Performance Metrics with Different Classifiers and GoogleNet as Feature Extractor
Classifier Train Acc Train Loss Test Acc Test Loss Epoch
ViT 1.0000 0.0041 1.0000 0.0021 93
LSTM 0.9337 0.2190 0.9696 0.1175 142
BiLSTM 0.1112 2.1980 0.1616 2.1327 14
GRU 0.9262 0.2379 0.9612 0.1060 116
RNN 0.9475 0.1771 0.9591 0.1309 100
[0066] The combined architecture leverages GoogleNet’s strength in capturing fine-grained local features and ViT’s ability to analyse global contextual relationships in parallel, thus enhancing classification performance. Despite higher computational demands, the system's superior accuracy and robustness against noise make GoogleNet+ViT a suitable candidate for real-world gunshot detection deployments where accuracy is essential.
[0067] FIG. 5 illustrates a block diagram of an example computer system (500) in which or with which embodiments of the present disclosure may be implemented.
[0068] As shown in FIG. 5, the computer system (500) may include an external storage device (510), a bus (520), a main memory (530), a read-only memory (540), a mass storage device (550), communication port(s) (560), and a processor (570). A person skilled in the art will appreciate that the computer system (500) may include more than one processor and communication ports. The processor (570) may include various modules associated with embodiments of the present disclosure. The communication port(s) (560) may be any of an RS-232 port for use with a modem-based dial-up connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fibre, a serial port, a parallel port, or other existing or future ports. The communication port(s) (560) may be chosen depending on a network, such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer system (500) connects. The main memory (530) may be random access memory (RAM) or any other dynamic storage device commonly known in the art. The read-only memory (540) may be any static storage device(s), including, but not limited to, a Programmable Read Only Memory (PROM) chip for storing static information, e.g., start-up or basic input/output system (BIOS) instructions for the processor (570). The mass storage device (550) may be any current or future mass storage solution, which may be used to store information and/or instructions.
[0069] The bus (520) communicatively couples the processor (570) with the other memory, storage, and communication blocks. The bus (520) can be, e.g., a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), universal serial bus (USB), or the like, for connecting expansion cards, drives, and other subsystems, as well as other buses, such as a front side bus (FSB), which connects the processor (570) to the computer system (500).
[0070] Optionally, operator and administrative interfaces, e.g., a display, keyboard, and a cursor control device, may also be coupled to the bus (520) to support direct operator interaction with the computer system (500). Other operator and administrative interfaces may be provided through network connections connected through the communication port(s) (560). In no way should the aforementioned exemplary computer system (500) limit the scope of the present disclosure.
[0071] Thus, the present disclosure provides the system (102) and method (400) for accurate and efficient detection and classification of gunshot sounds using deep learning-based spectrogram analysis. This enables real-time identification of firearm types across diverse acoustic environments, enhancing security and surveillance capabilities.
[0072] While the foregoing describes various embodiments of the invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. The scope of the invention is determined by the claims that follow. The invention is not limited to the described embodiments, versions, or examples, which are included to enable a person having ordinary skill in the art to make and use the invention when combined with information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE PRESENT DISCLOSURE
[0073] The present disclosure provides a solution that accurately detects gunshots across diverse acoustic settings.
[0074] The present disclosure provides real-time detection capabilities to support rapid emergency response and enhance public safety.
[0075] The present disclosure provides a robust detection mechanism resilient to noise in both urban and natural environments.
[0076] The present disclosure provides a timely detection solution to aid wildlife protection and prevent poaching activities.
[0077] The present disclosure provides a versatile system adaptable for use in both indoor and outdoor security scenarios.
[0078] The present disclosure provides an effective method for differentiating gunshots from similar acoustic events.
[0079] The present disclosure provides a scalable and automated platform for uninterrupted acoustic monitoring.
[0080] The present disclosure provides an advanced tool to support law enforcement by enhancing environmental awareness.
[0081] The present disclosure provides a safety solution that aligns with the goals of sustainable urban development.
[0082] The present disclosure provides a compatible framework for integration with smart city infrastructure and conservation technologies.
Claims:
1. A system (102) for gunshot detection, the system (102) comprising:
one or more processors (106); and
a memory (108) coupled to the one or more processors (106), wherein the memory (108) stores instructions that, when executed by the one or more processors (106), cause the one or more processors (106) to:
receive an audio signal from one or more acoustic sensors (104) deployed within a predefined area, and convert the received audio signal into a time-frequency representation in the form of a spectrogram;
apply a first Vision Transformer (ViT) to the spectrogram to divide the spectrogram into a plurality of patches of a predefined patch size and encode positional information for each patch;
reshape the encoded patches into a tensor format for convolutional processing;
process the reshaped patches using a convolutional neural network (CNN) to extract one or more local and multi-scale features and generate a global feature embedding;
flatten the global feature embedding to form a sequence of feature vectors;
input the sequence of feature vectors into a second Vision Transformer (ViT) to learn temporal and contextual dependencies; and
classify an output of the second Vision Transformer into one of a plurality of firearm-related classes using a fully connected layer followed by a softmax activation function.
2. The system (102) as claimed in claim 1, wherein the spectrogram is generated by a short-time Fourier transform (STFT) with a fixed frame length and a fixed frame overlap.
3. The system (102) as claimed in claim 1, wherein the convolutional neural network (CNN) comprises a GoogleNet architecture, the GoogleNet architecture comprising:
a 7×7 convolutional layer followed by a max pooling layer;
at least two inception modules configured to extract the one or more local and multi-scale features; and
a global average pooling layer configured to aggregate outputs of the at least two inception modules and flatten the aggregated one or more local and multi-scale features to generate the global feature embedding.
4. The system (102) as claimed in claim 1, wherein the second Vision Transformer comprises:
a linear dimension-reduction layer configured to project each feature vector to a lower-dimensional embedding;
a multi-head self-attention block configured to compute self-attention across the embedded feature sequence;
a residual skip connection configured to add input of each block to the associated output; and
the fully-connected layer followed by the softmax activation function configured to generate a probability distribution over the plurality of firearm-related classes.
5. The system (102) as claimed in claim 1, wherein the one or more processors (106) are further configured to train the first Vision Transformer (ViT), the convolutional neural network, the temporal sequence model, and the second Vision Transformer (ViT) using an Adam optimizer and a sparse categorical cross-entropy loss function on data split into training, validation, and test subsets.
6. A method (400) for detecting a gunshot, the method (400) comprising:
receiving (402), by one or more processors, an audio signal from one or more acoustic sensors (104) disposed within a predefined area;
converting (404), by the one or more processors, the received audio signal into a time-frequency representation in the form of a spectrogram;
applying (406), by the one or more processors, a first Vision Transformer (ViT) to the spectrogram to divide the spectrogram into a plurality of patches of a predefined size and encoding positional information for each patch;
reshaping (408), by the one or more processors, the encoded patches into a tensor format for convolutional processing;
processing (410), by the one or more processors, the reshaped patches using a convolutional neural network (CNN) for extracting one or more local and multi-scale features and generating a global feature embedding;
flattening (412), by the one or more processors, the global feature embedding to form a sequence of feature vectors;
inputting (414), by the one or more processors, the sequence of feature vectors into a second Vision Transformer (ViT) to learn temporal and contextual dependencies; and
classifying (416), by the one or more processors, the output of the second Vision Transformer into one of a plurality of firearm-related classes using a fully connected layer followed by a softmax activation function.
7. The method (400) as claimed in claim 6, wherein the spectrogram is generated using a short-time Fourier transform (STFT) with a fixed frame length and frame overlap.
8. The method (400) as claimed in claim 6, wherein the convolutional neural network (CNN) comprises a GoogleNet architecture, the GoogleNet architecture comprising:
a 7×7 convolutional layer followed by a max pooling layer;
at least two inception modules configured to extract the one or more local and multi-scale features; and
a global average pooling layer configured to aggregate outputs of the at least two inception modules and flatten the aggregated one or more local and multi-scale features to generate the global feature embedding.
9. The method (400) as claimed in claim 6, wherein the second Vision Transformer comprises:
a linear dimension-reduction layer configured to project each feature vector to a lower-dimensional embedding;
a multi-head self-attention block configured to compute self-attention across the embedded feature sequence;
a residual skip connection configured to add input of each block to the associated output; and
the fully-connected layer followed by the softmax activation function configured to generate a probability distribution over the plurality of firearm-related classes.
10. The method (400) as claimed in claim 6, further comprises training the first Vision Transformer (ViT), the convolutional neural network (CNN), the temporal sequence model, and the second Vision Transformer (ViT) using an Adam optimizer and a sparse categorical cross-entropy loss function on data split into training, validation, and test subsets.