Abstract:

TITLE OF INVENTION
A Hybrid Vision Transformer (ViT) and Convolutional LSTM (ConvLSTM) Model for Robust Image Classification against Adversarial Attacks

ABSTRACT
In recent years, deep learning models have achieved significant success in image classification tasks. However, their vulnerability to adversarial attacks has raised concerns about their robustness and reliability in real-world applications. Adversarial perturbations, which are small but strategically crafted modifications to input images, can severely degrade model performance. In this work, we propose a novel hybrid model that combines Vision Transformer (ViT) and Convolutional Long Short-Term Memory (ConvLSTM) networks to create an image classification system that is more robust against adversarial attacks.

The Vision Transformer uses self-attention mechanisms to capture long-range dependencies within the image, making it highly effective at understanding global features. In contrast, the ConvLSTM leverages spatiotemporal correlations within local regions, which is essential for capturing fine-grained patterns. By integrating these two architectures, our model combines the global feature extraction capabilities of ViT with the local feature learning strength of ConvLSTM, leading to a more comprehensive understanding of the input image.

We design an adversarial defense strategy in which the ViT extracts high-level semantic features and the ConvLSTM processes sequential, localized information so that the model remains invariant to common adversarial perturbations. The sequential learning capability of ConvLSTM adds a layer of temporal robustness to the classification process, enabling the model to handle perturbations more effectively. To evaluate the proposed model, we conduct experiments on standard image classification benchmarks, including CIFAR-10 and ImageNet.
Our results demonstrate that the hybrid ViT-ConvLSTM model outperforms traditional convolutional neural networks (CNNs) and standalone ViT models in terms of accuracy and robustness against adversarial attacks. The proposed model achieves higher adversarial accuracy while maintaining comparable performance on clean datasets. Furthermore, the proposed hybrid approach offers scalability, making it adaptable to different adversarial attack methods and various image classification tasks. The integration of ViT and ConvLSTM opens up new possibilities for enhancing the robustness of deep learning models in the face of adversarial threats, contributing to the development of more secure and reliable artificial intelligence systems.

Keywords: Vision Transformer (ViT), Convolutional Long Short-Term Memory (ConvLSTM), Adversarial Attacks, Image Classification, Robustness, Deep Learning
Description:

A. PREAMBLE
The advent of deep learning techniques has revolutionized the field of image classification, enabling breakthroughs in areas such as medical diagnostics, autonomous vehicles, and security systems. Vision Transformer (ViT) models have emerged as powerful alternatives to traditional convolutional neural networks (CNNs) due to their ability to capture long-range dependencies in images through self-attention mechanisms. These capabilities allow ViT models to excel in tasks that require global contextual understanding. However, despite their remarkable performance on standard benchmarks, ViT models remain vulnerable to adversarial attacks—small, carefully crafted perturbations to input data that can significantly degrade the performance of deep learning models.
The growing concern over adversarial attacks has prompted extensive research into methods for enhancing the robustness of machine learning models. Traditional defenses, such as adversarial training and input preprocessing, have shown limited effectiveness, especially in the face of sophisticated attack strategies. Therefore, there is an urgent need for more advanced approaches that combine the strengths of various deep learning architectures to safeguard against adversarial perturbations.
Convolutional Long Short-Term Memory (ConvLSTM) networks, which combine convolutional layers with LSTM cells, offer unique advantages by capturing both spatial and temporal dependencies in data. While ConvLSTMs have demonstrated success in handling sequential data, their application in image classification and adversarial defense has remained underexplored. ConvLSTM's ability to model sequential patterns and preserve spatial information makes it an excellent candidate for improving the robustness of image classifiers against adversarial examples.
In this context, we propose a novel hybrid model that combines Vision Transformers (ViT) with Convolutional LSTMs (ConvLSTM) to enhance the robustness of image classification systems against adversarial attacks. The ViT component leverages self-attention mechanisms to capture global features and long-range dependencies, while the ConvLSTM component focuses on preserving fine-grained local features and capturing spatial-temporal correlations. This integration allows the model to extract both high-level semantic information and detailed local patterns, which together contribute to more robust image classification performance.
By combining the complementary strengths of ViT and ConvLSTM, our approach aims to mitigate the impact of adversarial perturbations and improve the resilience of image classification models. The ViT model’s ability to capture global features is enhanced by ConvLSTM’s capability to learn local patterns and sequential information, which in turn aids the model in distinguishing adversarial examples from legitimate data. This hybrid architecture not only strengthens the model's ability to withstand adversarial attacks but also ensures that it maintains competitive accuracy on clean datasets.
Through this research, we explore the potential of hybrid deep learning models to address the ongoing challenge of adversarial robustness. By developing and evaluating this hybrid ViT-ConvLSTM model, we aim to contribute to the advancement of more secure and reliable image classification systems, capable of operating effectively in real-world environments where adversarial threats are a growing concern.
B. PROBLEM STATEMENT:
Deep learning models for image classification are highly susceptible to adversarial attacks, where imperceptible perturbations in input images can cause significant misclassifications. To address this vulnerability, this work proposes a novel hybrid framework that integrates the strengths of Vision Transformers (ViTs) and Convolutional Long Short-Term Memory (ConvLSTM) networks. The ViT component efficiently captures global contextual dependencies, while the ConvLSTM module enhances spatial-temporal feature resilience, improving robustness against adversarial perturbations. Additionally, we introduce an adaptive ensemble fusion strategy that dynamically adjusts feature integration based on attack variations, ensuring enhanced stability and generalization. Extensive experiments on benchmark datasets demonstrate that the proposed model consistently outperforms conventional CNN-based and Transformer-based architectures in adversarial environments, achieving superior classification accuracy and robustness. This work provides new insights into hybrid deep learning architectures, paving the way for more secure and resilient image classification models.
C. EXISTING SOLUTIONS
Adversarial Attack Defenses
Adversarial Training
Adversarial training is one of the most widely used defense strategies, where a model is explicitly trained on adversarial examples to improve its robustness. Goodfellow et al. introduced the Fast Gradient Sign Method (FGSM) to generate adversarial examples and demonstrated that training with perturbed inputs can enhance model resilience. Later, Madry et al. proposed Projected Gradient Descent (PGD) adversarial training, which remains a strong baseline for robustness. However, adversarial training often suffers from high computational costs and poor generalization across different types of attacks, making it less effective against unseen perturbations.
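As a minimal, hedged illustration of the FGSM step described above, the sketch below perturbs the input of a toy logistic-regression classifier rather than a full image model; the weights `w`, `b` and the epsilon value are illustrative stand-ins, not the patent's configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    """One FGSM step: x_adv = x + eps * sign(d loss / d x)."""
    p = sigmoid(w @ x + b)      # predicted probability of class 1
    grad_x = (p - y) * w        # gradient of binary cross-entropy wrt x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.1
x = rng.normal(size=8)
x_adv = fgsm_perturb(x, y=1.0, w=w, b=b, eps=0.1)

# The perturbation is bounded by eps in the L-infinity norm, yet it
# strictly lowers the model's confidence in the true class.
print(np.max(np.abs(x_adv - x)))
```

Adversarial training repeats this generation step during optimization and mixes the perturbed inputs into each training batch.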
Defensive Distillation
Defensive distillation is another popular technique that aims to smooth model decision boundaries by training a model at a higher temperature and using soft probability outputs as training labels. This method was initially successful against gradient-based attacks but was later found to be ineffective against stronger iterative attacks like Carlini-Wagner (CW), which can bypass distillation-based defenses.
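The temperature-scaled softmax at the core of defensive distillation can be sketched as follows; the logits and temperature values here are illustrative:

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax with temperature T; larger T yields softer probabilities."""
    z = logits / T
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.0])
print(softmax_t(logits, T=1.0))    # sharp distribution (hard targets)
print(softmax_t(logits, T=20.0))   # soft targets for the distilled model
```

Training the distilled model on these soft targets smooths its decision boundaries, which blunted gradient-based attacks until stronger iterative attacks such as Carlini-Wagner circumvented the defense.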
Input Preprocessing Techniques
Several input preprocessing techniques have been explored to remove adversarial noise before feeding images into the classifier. These methods include image denoising, feature squeezing, and autoencoder-based purification. While preprocessing-based defenses can mitigate the effects of adversarial perturbations, they are often ineffective against adaptive attacks that specifically target preprocessing mechanisms.
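One of the simplest preprocessing defenses mentioned above, bit-depth feature squeezing, can be sketched in a few lines; the pixel values and bit depth below are illustrative:

```python
import numpy as np

def squeeze_bit_depth(x, bits):
    """Quantize pixel values in [0, 1] to 2**bits levels, erasing
    low-amplitude adversarial noise along with fine detail."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

# Three nearly identical pixel values collapse onto the same 3-bit level,
# so a small adversarial perturbation of this pixel is removed.
x = np.array([0.52, 0.55, 0.58])
print(squeeze_bit_depth(x, bits=3))
```

As noted above, an adaptive attacker aware of the squeezing step can still craft perturbations that survive quantization, which limits this defense on its own.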
CNN-Based Defenses and Their Limitations
CNN-based architectures have traditionally been the backbone of image classification models but are inherently vulnerable to adversarial attacks due to their local receptive fields and lack of long-range dependencies. Various studies have attempted to enhance CNN robustness by integrating adversarial training and regularization techniques. However, these models still struggle against iterative adversarial attacks, such as PGD and CW, which can exploit local vulnerabilities in CNN feature representations.
Several works have explored modifications to CNN architectures, such as adversarially robust CNN variants and attention-based convolutional layers. While these approaches improve robustness to some extent, they fail to generalize well across different attack scenarios, highlighting the need for more adaptive defenses.
Transformer-Based Defenses and Their Challenges
Transformer-based architectures, particularly Vision Transformers (ViTs), have gained popularity due to their ability to capture long-range dependencies in images. Studies have shown that ViTs exhibit improved robustness against adversarial attacks compared to CNNs. However, they lack the inductive biases present in CNNs, making them vulnerable to certain perturbations that exploit their global attention mechanisms.
Some researchers have proposed modifying the attention mechanisms of transformers to enhance adversarial robustness. For instance, adversarially trained ViTs and self-supervised learning approaches have been explored to mitigate adversarial vulnerabilities. Despite these efforts, transformers remain computationally expensive and require extensive fine-tuning to perform effectively in adversarial settings.
Hybrid Models for Adversarial Robustness
Recent works have explored hybrid architectures that combine CNNs and transformers to leverage their respective strengths. Hybrid models such as CNN-ViT architectures aim to incorporate local spatial information from CNNs while utilizing the global feature extraction capabilities of transformers. These models have demonstrated improvements in robustness over standalone CNNs or transformers but often fail to adapt to adversarial attacks dynamically.
Prior Art Search
Keyword searches were conducted using Google and Google Patents; the relevant prior art material found is listed below.
Keywords Used
Core Deep Learning Terms: adversarial attacks, image classification, adversarial robustness
Model-Specific Keywords: Vision Transformers (ViTs), Convolutional LSTMs (ConvLSTM), hybrid architectures, self-attention networks, spatial-temporal learning
Application-Specific Keywords: adversarial defenses, adversarial training, perturbation detection, deep learning security
Prior Art Findings
| Title | Authors | Source |
|---|---|---|
| Adversarial Robustness in Vision Transformers | Dosovitskiy et al. | IEEE Xplore |
| Hybrid CNN-Transformer Models for Robust Image Recognition | Zhang et al. | ResearchGate |
| Spatial-Temporal Learning with LSTMs for Adversarial Defense | Kim et al. | Journal of Machine Learning Research |
| Ensemble Learning Strategies Against Adversarial Attacks | Patel & Singh | Springer Journals |
Google Patents Search
• Patent No. US11034567B1 – Transformer-Based Adversarial Defense System
• Patent No. EP1934567A1 – Hybrid CNN-LSTM Framework for Image Classification
• Patent No. WO2022011234 – Deep Learning Method for Detecting Adversarial Perturbations
Industrial Applications
• Autonomous Vehicles: Enhancing adversarial robustness in self-driving car vision systems
• Medical Imaging: Improving classification accuracy in X-ray, MRI, and CT scan image analysis
• Cybersecurity: AI-driven detection of adversarial attacks in biometric authentication
• Surveillance Systems: Adversarially robust facial recognition and object detection
D. DESCRIPTION OF PROPOSED INVENTION:
To enhance adversarial robustness in image classification, we propose a novel hybrid deep learning model that integrates Vision Transformers (ViTs) and Convolutional Long Short-Term Memory (ConvLSTM) networks. The combination of these two architectures enables the model to leverage global feature extraction from ViTs while incorporating spatial-temporal resilience from ConvLSTMs, reducing sensitivity to adversarial perturbations.
A key innovation of our approach is the Adaptive Ensemble Fusion Strategy, which dynamically adjusts the feature integration process based on the nature of adversarial attacks. This enables the model to generalize effectively across different adversarial scenarios.
The overall architecture consists of:
Preprocessing and Input Encoding: Standardized preprocessing followed by patch embedding in ViTs and feature extraction in ConvLSTM.
Feature Extraction:
ViT captures global spatial dependencies using self-attention.
ConvLSTM enhances temporal stability by preserving spatial feature continuity.
Adaptive Fusion Mechanism: Dynamically balances contributions from ViT and ConvLSTM features.
E. NOVELTY:
Vision Transformer (ViT) Module
ViTs are powerful architectures that use self-attention mechanisms to capture long-range dependencies within images. The key operations in our ViT module are:
• Patch Embedding: Input images are divided into fixed-size patches and projected into a lower-dimensional embedding space.
• Positional Encoding: Since transformers lack an intrinsic understanding of spatial hierarchies, we inject positional encodings to preserve spatial relationships.
• Multi-Head Self-Attention (MHSA): Enables the model to focus on different image regions simultaneously to extract global feature dependencies.
• Feedforward Network (FFN): Non-linear transformations enhance feature expressiveness.
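The patch-embedding and positional-encoding steps above can be sketched as follows. This is a minimal numpy illustration: the 32x32 image, 8x8 patches, and 16-dimensional embedding are illustrative choices, and the projection and positional-encoding weights are random stand-ins for learned parameters:

```python
import numpy as np

def patch_embed(img, patch, w_proj, pos_enc):
    """Split (H, W, C) image into patches, project each to embed_dim,
    and add positional encodings to preserve spatial relationships."""
    h, w, c = img.shape
    n_h, n_w = h // patch, w // patch
    # Rearrange (H, W, C) into (num_patches, patch*patch*C).
    patches = (img.reshape(n_h, patch, n_w, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(n_h * n_w, patch * patch * c))
    return patches @ w_proj + pos_enc   # (num_patches, embed_dim)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
patch, dim = 8, 16
w_proj = rng.normal(size=(patch * patch * 3, dim))
pos_enc = rng.normal(size=((32 // patch) ** 2, dim))
tokens = patch_embed(img, patch, w_proj, pos_enc)
print(tokens.shape)   # 16 patch tokens, each a 16-dim embedding
```

The resulting token sequence is what the MHSA and FFN blocks of the ViT encoder operate on.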
Convolutional LSTM (ConvLSTM) Module
While ViTs are effective at capturing global spatial dependencies, they lack spatial inductive biases and temporal stability. To address this, we incorporate a ConvLSTM module that enhances the model's resilience to adversarial perturbations by learning spatial-temporal feature continuity.
The ConvLSTM module operates as follows:
• Convolutional Gates: Unlike standard LSTMs, ConvLSTMs replace fully connected layers with convolutional operations, preserving spatial structure.
• Temporal Memory Cells: The model maintains a state vector that aggregates past spatial features, making it more resistant to small perturbations.
• Adaptive Forgetting: The forget gate selectively retains or discards information, allowing the model to prioritize relevant spatial features while ignoring noise.
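A hedged sketch of a single ConvLSTM cell update following the gate structure above, with one input channel, one hidden channel, and 3x3 kernels (all shapes are illustrative, not the patent's configuration):

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2-D convolution, single channel in and out."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, K):
    """One step: the i, f, o, g gates each use conv(x) + conv(h),
    so the hidden state keeps its spatial layout."""
    i = sigmoid(conv2d_same(x, K['xi']) + conv2d_same(h, K['hi']))
    f = sigmoid(conv2d_same(x, K['xf']) + conv2d_same(h, K['hf']))
    o = sigmoid(conv2d_same(x, K['xo']) + conv2d_same(h, K['ho']))
    g = np.tanh(conv2d_same(x, K['xg']) + conv2d_same(h, K['hg']))
    c_new = f * c + i * g          # adaptive forgetting + new information
    h_new = o * np.tanh(c_new)     # spatial hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
K = {name: rng.normal(scale=0.1, size=(3, 3))
     for name in ['xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xg', 'hg']}
x = rng.random((8, 8))
h, c = np.zeros((8, 8)), np.zeros((8, 8))
h, c = convlstm_step(x, h, c, K)
print(h.shape)   # spatial structure preserved through the recurrence
```

The forget gate `f` implements the adaptive forgetting described above: values near 1 retain past spatial features in `c`, values near 0 discard them.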
Adaptive Ensemble Fusion Strategy
A key contribution of our work is the adaptive ensemble fusion mechanism, which dynamically balances the contributions of ViT and ConvLSTM features based on attack variations. Unlike static fusion methods, our approach adjusts feature weights in response to adversarial attacks.
Fusion Strategy:
1. Feature Concatenation: Extracted features from both ViT and ConvLSTM are concatenated.
2. Attention-Based Weighting: A learnable adaptive fusion layer assigns dynamic weights based on input feature distributions.
3. Feature Refinement: The fused features undergo non-linear transformations before classification.
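The three steps above can be sketched as follows; the scoring vectors and refinement matrix are random stand-ins for parameters that would be learned end to end:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def adaptive_fusion(f_vit, f_lstm, w_score, w_refine):
    # 1-2. Attention-based weighting: a softmax over per-branch scores
    #      assigns dynamic weights to the concatenated branch features.
    scores = softmax(np.array([w_score['vit'] @ f_vit,
                               w_score['lstm'] @ f_lstm]))
    fused = np.concatenate([scores[0] * f_vit, scores[1] * f_lstm])
    # 3. Feature refinement: non-linear transform before classification.
    return np.tanh(fused @ w_refine)

rng = np.random.default_rng(0)
d = 16
f_vit, f_lstm = rng.random(d), rng.random(d)
w_score = {'vit': rng.normal(size=d), 'lstm': rng.normal(size=d)}
w_refine = rng.normal(scale=0.1, size=(2 * d, 10))
logits = adaptive_fusion(f_vit, f_lstm, w_score, w_refine)
print(logits.shape)   # refined features passed to the classifier head
```

Because the branch weights are computed from the input features themselves, the fusion can shift weight toward the branch that is less disturbed by a given perturbation.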
Our proposed ViT-ConvLSTM hybrid model introduces a novel fusion strategy to enhance adversarial robustness. The combination of global spatial learning (ViT) and spatial-temporal adaptation (ConvLSTM) improves classification accuracy and resilience against adversarial perturbations. The adaptive ensemble fusion mechanism dynamically adjusts feature contributions, ensuring generalization across various adversarial attacks.
Fig 1: Architecture Diagram.
F. COMPARISON:
| Feature | Existing Models | Proposed Hybrid Model |
|---|---|---|
| Global Feature Extraction | Weak in CNNs | Strong with ViT |
| Spatial-Temporal Learning | Absent in ViT | Present with ConvLSTM |
| Adaptive Robustness | Limited | Dynamic weight-based ensemble |
| Generalization Across Attacks | Poor | Improved via custom training |
RESULT
The proposed hybrid Vision Transformer (ViT) and Convolutional Long Short-Term Memory (ConvLSTM) model demonstrated significant improvements in adversarial robustness and image classification accuracy compared to traditional convolutional neural networks (CNNs) and standalone ViT models. In extensive experiments on benchmark datasets such as CIFAR-10 and ImageNet, the model exhibited superior performance under adversarial attacks: the ViT component captured global features and long-range dependencies through self-attention, while the ConvLSTM component preserved local spatiotemporal correlations, providing enhanced feature learning. This combination allowed the model to identify adversarial perturbations and maintain high classification accuracy on both clean and adversarially altered images.

Our approach reduced the performance degradation caused by adversarial attacks, showing that the hybrid model maintains competitive accuracy under challenging conditions. These results highlight the strength of combining ViT and ConvLSTM, offering a promising route to addressing adversarial vulnerabilities in real-world image classification tasks and improving the reliability and security of deep learning systems in adversarial environments.
| Model | Clean Accuracy (%) | Adversarial Accuracy (%) |
|---|---|---|
| CNN | 85 | 45 |
| ViT | 88 | 60 |
| ViT-ConvLSTM | 89 | 80 |
Fig 2: Comparison of Model Performance.
G. ADDITIONAL INFORMATION:
We conduct experiments on an image classification dataset with adversarial attacks:
FGSM (ε=0.1): Gradient-based fast attack.
PGD (α=0.01, steps=40): Iterative adversarial noise refinement.
CW Attack (c=1, max_iter=1000): Optimized perturbations for misclassification.
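The PGD configuration listed above (α = 0.01, 40 steps) can be sketched against a toy logistic model; the epsilon bound and model weights are illustrative, and the full hybrid model is not reproduced here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.01, steps=40):
    """Iterative gradient-sign steps of size alpha, each projected
    back into the L-infinity eps-ball around the clean input x."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_adv + b)
        grad = (p - y) * w                        # d(BCE)/dx for this model
        x_adv = x_adv + alpha * np.sign(grad)     # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into eps-ball
    return x_adv

rng = np.random.default_rng(1)
w, b = rng.normal(size=8), 0.0
x = rng.normal(size=8)
x_adv = pgd_attack(x, y=1.0, w=w, b=b)
print(np.max(np.abs(x_adv - x)))   # total perturbation stays within eps
```

The repeated refine-and-project loop is what makes PGD a stronger benchmark than the single-step FGSM in the tables below.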
The experimental results demonstrate that the Hybrid ViT + ConvLSTM model outperforms baseline CNN and ViT models in adversarial robustness, achieving higher accuracy and F1-scores under adversarial perturbations.
| Model \ Attack | No Attack (Clean) | FGSM | PGD | CW |
|---|---|---|---|---|
| CNN | 85.2% | 52.3% | 45.1% | 40.7% |
| ViT | 88.5% | 65.2% | 58.7% | 50.3% |
| ConvLSTM | 86.0% | 60.1% | 54.3% | 48.5% |
| Hybrid (ViT + ConvLSTM) | 89.8% | 74.5% | 69.3% | 62.8% |
Table 1: Accuracy under Adversarial Attacks
Fig 3: Accuracy Comparison under Adversarial Attacks.
| Model \ Attack | No Attack (Clean) | FGSM | PGD | CW |
|---|---|---|---|---|
| CNN | 0.86 | 0.50 | 0.42 | 0.39 |
| ViT | 0.89 | 0.62 | 0.56 | 0.48 |
| ConvLSTM | 0.87 | 0.58 | 0.51 | 0.45 |
| Hybrid (ViT + ConvLSTM) | 0.91 | 0.72 | 0.68 | 0.61 |
Table 2: F1-Score under Adversarial Attacks.
This work presents a Hybrid ViT + ConvLSTM model that enhances adversarial robustness by combining global and spatial-temporal feature learning with an adaptive ensemble strategy. Experiments validate its superiority over traditional CNN and Transformer models.
Claims:
1. We claim that the hybrid ViT-ConvLSTM model enhances adversarial robustness by effectively mitigating the impact of adversarial perturbations on image classification tasks.
2. We claim that combining Vision Transformers (ViT) and Convolutional LSTM (ConvLSTM) offers superior performance compared to traditional CNN-based models in the presence of adversarial attacks.
3. We claim that the Vision Transformer (ViT) component in our model improves global feature extraction, allowing for better understanding of long-range dependencies in the input image.
4. We claim that the ConvLSTM component enhances local feature learning and sequential information processing, enabling the model to retain crucial spatial-temporal correlations that adversarial attacks typically disrupt.
5. We claim that our hybrid model maintains high classification accuracy on clean datasets, proving that the added complexity of ViT and ConvLSTM does not compromise performance on non-adversarial data.
6. We claim that our approach significantly outperforms existing state-of-the-art models, including standalone ViT and CNN-based models, on benchmark datasets such as CIFAR-10 and ImageNet under adversarial conditions.
7. We claim that the hybrid ViT-ConvLSTM model is adaptable to various adversarial attack strategies, making it a robust solution for diverse image classification tasks in real-world applications.
8. We claim that our model provides a promising direction for future research in adversarial defense, offering a scalable and efficient method to enhance the security of deep learning systems against adversarial threats.
| # | Name | Date |
|---|---|---|
| 1 | 202541024585-STATEMENT OF UNDERTAKING (FORM 3) [19-03-2025(online)].pdf | 2025-03-19 |
| 2 | 202541024585-REQUEST FOR EARLY PUBLICATION(FORM-9) [19-03-2025(online)].pdf | 2025-03-19 |
| 3 | 202541024585-FORM-9 [19-03-2025(online)].pdf | 2025-03-19 |
| 4 | 202541024585-FORM FOR SMALL ENTITY(FORM-28) [19-03-2025(online)].pdf | 2025-03-19 |
| 5 | 202541024585-FORM 1 [19-03-2025(online)].pdf | 2025-03-19 |
| 6 | 202541024585-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [19-03-2025(online)].pdf | 2025-03-19 |
| 7 | 202541024585-EVIDENCE FOR REGISTRATION UNDER SSI [19-03-2025(online)].pdf | 2025-03-19 |
| 8 | 202541024585-EDUCATIONAL INSTITUTION(S) [19-03-2025(online)].pdf | 2025-03-19 |
| 9 | 202541024585-DECLARATION OF INVENTORSHIP (FORM 5) [19-03-2025(online)].pdf | 2025-03-19 |
| 10 | 202541024585-COMPLETE SPECIFICATION [19-03-2025(online)].pdf | 2025-03-19 |