YoloNasS for real-time text detection and recognition using deep learning and super gradient techniques
Abstract: Text detection in natural scenes remains a challenging computer vision task due to variations in text appearance, complex backgrounds, and real-time processing requirements. We present YOLONASS-DeepNet, a novel framework that integrates YOLO-based detection with Neural Architecture Search and Super Gradients optimization for efficient text detection in both static images and videos. Our approach employs a text-specific NAS algorithm to craft an optimized architecture while utilizing Super Gradients to enhance training efficiency. Extensive evaluation on COCO-Text V2.0 demonstrates superior performance, achieving 0.823 mAP at 0.5 IoU and a 0.912 F-measure on ICDAR 2015. For video analysis, our model attains 0.891 temporal average precision and a 0.945 temporal consistency score on the ICDAR 2015 Video Text dataset. Operating at 43 frames per second with a compact 18.7 MB model size, YOLONASS-DeepNet outperforms existing methods while maintaining real-time performance. Our framework advances the state of the art in scene text detection and establishes new benchmarks for efficiency and accuracy in practical applications.
Description:
1. Introduction
Text detection in computer vision has become increasingly crucial for applications ranging from autonomous navigation to assistive technologies for the visually impaired. The challenge lies in accurately detecting and localizing text in unconstrained environments where variations in font, size, orientation, and lighting conditions significantly impact detection accuracy. Traditional approaches, while effective for controlled environments, often struggle with the complexity and diversity of text in natural scenes [1]. The rapid growth of multimedia content has intensified the need for efficient text detection systems that can process both images and videos in real-time while maintaining high accuracy. Current state-of-the-art methods either sacrifice speed for accuracy or vice versa, creating a significant gap in practical applications requiring both attributes [2].
Research Objectives are as follows:
• Develop a unified framework for efficient text detection in both static images and videos
• Create a lightweight model suitable for resource-constrained environments
• Achieve state-of-the-art accuracy while maintaining real-time processing capabilities
2. Related Work
Text detection in natural scenes has evolved significantly from traditional rule-based methods to advanced deep learning approaches. Early works focused on basic techniques like Maximally Stable Extremal Regions (MSER) for detecting text in controlled environments [1]. The introduction of the Connectionist Text Proposal Network (CTPN) by Tian et al. marked a significant shift towards deep learning-based approaches [2]. This evolution continued with methods like EAST (Efficient and Accurate Scene Text Detector) and TextBoxes++, which improved detection accuracy for multi-oriented text [3], [4]. However, these methods struggled with irregular text shapes and complex backgrounds.
Recent advancements have focused on handling arbitrarily shaped text and improving detection efficiency. TextSnake introduced a flexible representation for detecting text of various shapes [5], while PixelLink approached the problem through instance segmentation [6]. The CRAFT (Character Region Awareness for Text detection) method demonstrated improved performance by focusing on character-level detection [7]. These approaches showed significant improvements in handling complex scenarios but often required substantial computational resources, limiting their practical applications in real-time systems.
The field of text detection and recognition has undergone significant transformation with the advent of deep learning technologies. Recent research has focused on developing specialized architectures for handling various text detection challenges. The Supervised Pyramid Context Network has demonstrated improved performance in detecting scene text by utilizing contextual information at multiple scales [11]. Similarly, innovations in backbone architecture search, as demonstrated by AutoSTR [12], have led to more efficient and accurate text recognition systems. The integration of symmetry-constrained rectification networks has particularly enhanced the ability to handle curved and distorted text in natural scenes [13]. These advancements have been crucial in addressing real-world applications, such as electrical nameplate recognition [14] and handwritten text detection [15].
Despite these advancements, several limitations persist in current text detection systems. The challenge of detecting curved text in natural scenes remains significant, even with semi-supervised and weakly-supervised learning approaches [16]. Multi-line text with narrow spacing continues to pose difficulties for accurate detection and recognition [17]. Additionally, while Neural Architecture Search (NAS) has shown promise in optimizing text recognition models [18], the computational resources required for such optimization remain substantial. Current systems also struggle with maintaining consistent performance across different lighting conditions and text styles, particularly in complex real-world scenarios such as power text information systems [19]. The integration of these technologies into practical applications, especially in resource-constrained environments, remains a significant challenge that requires further research and development.
Beyond these limitations, several broader challenges persist in text detection. Current methods struggle with extreme lighting conditions, multilingual text, and maintaining consistent performance across different scenarios [8]. The trade-off between detection accuracy and computational efficiency remains a significant concern, particularly for real-time applications [9]. Additionally, most existing approaches face difficulties in handling text at multiple scales and orientations simultaneously, especially in video sequences where temporal consistency is crucial [10]. These limitations highlight the need for more efficient and robust text detection systems that can maintain high accuracy while operating under resource constraints.
3. Methodology
YOLONASS-DeepNet introduces several novel components that distinguish it from existing text detection frameworks. The core innovation lies in the synergistic integration of YOLO-based detection, Neural Architecture Search (NAS), and Super Gradients optimization, each specifically tailored for text detection challenges. Our framework processes input images through a modified YOLO architecture that incorporates text-specific anchor shapes and orientation-aware loss functions, enabling more accurate detection of varied text orientations and sizes in a single pass [20].
Figure 1: Proposed Methodology
The first key novelty is our text-aware YOLO modification. Unlike traditional YOLO architectures that use generic object detection anchors, we introduce adaptive text anchors that dynamically adjust based on the statistical distribution of text aspect ratios in the training data. We also incorporate a novel rotated IoU loss term that explicitly accounts for text orientation, significantly improving the detection of arbitrarily oriented text. The loss function is further enhanced with a text-line coherence component that ensures consistent detection of characters within the same text line [21].
Our second innovation lies in the specialized Neural Architecture Search strategy. We introduce a text-focused search space that incorporates novel operations specifically designed for text detection. These include hybrid convolution blocks that combine standard and dilated convolutions at multiple scales, enabling better capture of both local character details and global text context. The search process is guided by a multi-objective optimization framework that simultaneously considers text detection accuracy, model size, and inference speed. We implement a novel search algorithm that uses progressive architecture growth, starting with a minimal backbone and gradually expanding based on text-specific performance metrics [23].
The backbone is optimized using Neural Architecture Search (NAS) to capture essential multi-scale and high-level features specific to text regions. The hybrid convolution operation is defined as a weighted combination of a standard convolution and a dilated convolution. Given a feature map x, the operation is formulated as:
H_k(x) = α · Conv_k(x) + (1 − α) · DilConv_{k,d}(x)    …(1)
Where Conv_k(x) is a standard convolution with kernel size k, DilConv_{k,d}(x) is a dilated convolution with kernel size k and dilation rate d, and α is a learnable parameter that controls the weighting between the standard and dilated convolutions.
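A minimal PyTorch sketch of such a hybrid convolution block is given below. Passing the learnable mixing weight through a sigmoid so that it stays in (0, 1) is an assumption, not something stated above.

```python
import torch
import torch.nn as nn

class HybridConv(nn.Module):
    """Sketch of Eq. (1): a learnable blend of a standard and a dilated convolution."""

    def __init__(self, in_ch, out_ch, k=3, dilation=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.dil_conv = nn.Conv2d(in_ch, out_ch, k,
                                  padding=dilation * (k // 2), dilation=dilation)
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable mixing weight

    def forward(self, x):
        a = torch.sigmoid(self.alpha)              # keep the weight in (0, 1) -- assumption
        return a * self.conv(x) + (1.0 - a) * self.dil_conv(x)


# Example: a 3x3 hybrid block applied to a 64-channel feature map.
feats = torch.randn(1, 64, 80, 80)
out = HybridConv(64, 64)(feats)                    # shape: (1, 64, 80, 80)
```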
The attention mechanism applies focused weight adjustments to each feature map F: average pooling is performed on F, the pooled output is passed through a multi-layer perceptron (MLP) with a sigmoid activation to generate attention weights A(F), and the attention weights are applied by element-wise multiplication with the original feature map:
F′ = A(F) ⊙ F,   A(F) = σ(MLP(AvgPool(F)))    …(2)
Where σ denotes the sigmoid activation function and ⊙ denotes element-wise multiplication.
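The same attention step can be written as a small PyTorch module; the hidden-layer reduction factor of 8 in the MLP is an assumed detail.

```python
import torch
import torch.nn as nn

class TextChannelAttention(nn.Module):
    """Sketch of Eq. (2): global average pooling -> MLP with sigmoid -> re-weight features."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        a = self.mlp(self.pool(f).view(b, c)).view(b, c, 1, 1)  # attention weights A(F)
        return a * f                                            # element-wise re-weighting
```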
Multi-scale features are extracted by applying a feature pyramid structure. For each level l in the feature pyramid, a feature map P_l is constructed as a combination of backbone features from the current layer and upsampled features from the level above (l + 1). The final feature at level l is computed by combining the features from the current level with downsampled features from the level below (l − 1), controlled by a learnable parameter γ:
P̂_l = P_l + γ · Down(P_{l−1})    …(3)
Where Down(P_{l−1}) is the downsampled output from level l − 1 and γ is a learnable parameter adjusting the weight between the current and downsampled features.
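A short sketch of this pyramid fusion step follows; resizing the lower level to the level-l grid with bilinear interpolation is an assumption about how the downsampling is performed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """Sketch of Eq. (3): blend level l with a downsampled copy of the level below."""

    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight

    def forward(self, p_l, p_below):
        # Resize the finer level (l - 1) down to the level-l resolution (assumption).
        down = F.interpolate(p_below, size=p_l.shape[-2:], mode="bilinear",
                             align_corners=False)
        return p_l + self.gamma * down
```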
To emphasize text regions, a text-specific transformation is applied: a text mask M is generated by applying text-specific convolution filters to the feature map, followed by a sigmoid activation:
M = σ(Conv_text(F))    …(4)
The text-aware feature F_text is then computed by combining the masked feature map with the original feature map using a learnable weight λ:
F_text = λ · (M ⊙ F) + (1 − λ) · F    …(5)
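Eqs. (4) and (5) could be realized as below; the use of a single 3x3 convolution as the "text-specific" filter is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class TextAwareFeature(nn.Module):
    """Sketch of Eqs. (4)-(5): a sigmoid text mask gates the feature map,
    and a learnable lambda blends the masked and original features."""

    def __init__(self, channels):
        super().__init__()
        self.mask_conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.lam = nn.Parameter(torch.tensor(0.5))

    def forward(self, f):
        m = torch.sigmoid(self.mask_conv(f))                 # Eq. (4): text mask M
        return self.lam * (m * f) + (1.0 - self.lam) * f     # Eq. (5): F_text
```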
The YOLO detection head is modified for text regions by introducing adaptive text anchors and rotation-aware bounding boxes. To handle varying text dimensions, adaptive text anchor dimensions are calculated dynamically as follows:
w_a = w_base · s · √r    …(6)
h_a = h_base · s / √r    …(7)
Where w_base and h_base are base width and height values, and the scale factor s and aspect-ratio factor r are learnable parameters optimized by NAS based on text characteristics. Each bounding box is predicted as a rotated box parameterized by (x, y, w, h, θ), where (x, y) are the center coordinates, w and h are the width and height of the bounding box, and θ is the rotation angle. The rotation loss is computed as follows:
L_rot = L2(b_pred, b_gt) + (1 − CosineSim(θ_pred, θ_gt))    …(8)
Where L2(b_pred, b_gt) is the distance between the predicted and ground-truth bounding boxes and CosineSim(θ_pred, θ_gt) is the cosine similarity between the predicted and ground-truth rotation angles.
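A minimal sketch of this rotation loss is given below. Interpreting the cosine similarity of the angles as cos(θ_pred − θ_gt) and summing the two terms with unit weights are assumptions.

```python
import torch

def rotation_loss(pred_box, gt_box, pred_theta, gt_theta):
    """Sketch of Eq. (8): L2 distance on (x, y, w, h) plus an angular penalty."""
    l2 = torch.linalg.norm(pred_box - gt_box, dim=-1)   # distance between boxes
    cos_sim = torch.cos(pred_theta - gt_theta)          # angle similarity (assumption)
    return l2 + (1.0 - cos_sim)

# Example with a batch of two rotated boxes (x, y, w, h) and angles in radians.
pred = torch.tensor([[0.5, 0.5, 0.2, 0.1], [0.3, 0.7, 0.4, 0.1]])
gt = torch.tensor([[0.5, 0.5, 0.2, 0.1], [0.3, 0.6, 0.4, 0.1]])
loss = rotation_loss(pred, gt, torch.tensor([0.1, 0.3]), torch.tensor([0.0, 0.2]))
```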
The final loss function for optimizing the model includes several components tailored for text detection. The total loss is defined as:
L_total = L_conf + L_loc + L_rot + L_coh    …(9)
Where L_conf is the confidence loss, measured by binary cross-entropy for text/non-text classification; L_loc is the localization loss, using a Smooth L1 loss between predicted and ground-truth bounding boxes; L_rot is the rotation loss of Eq. (8); and L_coh is the coherence loss, which promotes alignment between adjacent text regions. Coherence between text regions is calculated by:
Coh(i, j) = exp(−‖f_i − f_j‖² / τ) · IoU_rot(b_i, b_j)    …(10)
Where f_i and f_j are the feature representations of text regions i and j, τ is a temperature parameter controlling sensitivity, and IoU_rot(b_i, b_j) is the Intersection over Union (IoU) of the rotated bounding boxes for adjacent text regions.
The third novel component is our Super Gradients optimization technique, which introduces several text-specific enhancements to traditional gradient-based optimization. We develop a density-aware gradient scaling mechanism that dynamically adjusts the learning process based on text distribution in the input images. This is complemented by a novel layer-wise adaptive rate scaling strategy that considers both the geometric properties of text regions and the semantic importance of different network layers [24]. The optimization process includes the following rules (a minimal code sketch follows the list):
1. Text-density aware gradient scaling:
gradient_scale = base_scale * (1 + β * (local_text_density - avg_text_density))
2. Adaptive learning rate adjustment:
learning_rate = base_rate * text_importance_factor * (||W_l|| / ||∇L(W_l)||)
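The two rules above could be applied per training step as in the sketch below. The default value of β and the per-layer text-importance factor are assumptions; the text-density estimate is assumed to be available from the detection targets.

```python
import torch

def text_density_gradient_scale(base_scale, local_density, avg_density, beta=0.5):
    """Rule 1: scale gradients more strongly for images whose text density
    exceeds the running average (beta = 0.5 is an assumed default)."""
    return base_scale * (1.0 + beta * (local_density - avg_density))

def layerwise_adaptive_lr(base_rate, text_importance, weight, grad, eps=1e-12):
    """Rule 2: a LARS-style trust ratio ||W_l|| / ||grad L(W_l)||, modulated by a
    per-layer text-importance factor."""
    return base_rate * text_importance * (weight.norm() / (grad.norm() + eps))

# Usage inside a training step (after loss.backward(), before optimizer.step()):
# for each parameter W_l, multiply W_l.grad by text_density_gradient_scale(...)
# and set that layer's learning rate with layerwise_adaptive_lr(...).
```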
For video text detection, we introduce a novel temporal coherence module that integrates motion prediction with text detection. This module employs a lightweight RNN structure that maintains text region consistency across frames while adapting to changes in text appearance and position. The system includes an innovative keyframe selection mechanism that adaptively determines when to perform full detection based on scene complexity and text motion patterns [25].
3.1 Datasets used for training and evaluation
Roboflow facilitated efficient import, organization, and preprocessing of the COCO-Text V2.0 dataset, enabling us to focus on model development and experimentation. The platform's integration capabilities allowed for seamless incorporation of the dataset into our training workflow, enhancing reproducibility and enabling rapid iterations on our experimental setup. In addition to COCO-Text V2.0, we supplemented our training data with the ICDAR 2015 Incidental Scene Text dataset to further enhance the model's generalization capabilities. This dataset provides 1,000 training images and 500 test images, featuring text in various languages and scripts, which complemented the diversity offered by COCO-Text V2.0.
Figure 2: Sample dataset of COCO-Text V2.0
3.2 Implementation and Results
For the experimental evaluation of YOLONASS-DeepNet, we primarily utilized the COCO-Text V2.0 dataset, a comprehensive benchmark for text detection in natural scenes. This dataset, an extension of the original COCO-Text, provides a diverse collection of images containing text in various forms, sizes, and orientations. COCO-Text V2.0 encompasses 63,686 images, with 239,506 text instances annotated across 3,645 vocabulary words, making it one of the largest and most challenging datasets in the field of scene text detection. To streamline the data management and preprocessing pipeline, we leveraged Roboflow, a robust platform for computer vision data operations.
Leveraging Roboflow's preprocessing capabilities, we implemented a comprehensive data preparation pipeline. Initially, we normalized all images to a consistent size of 640x640 pixels, maintaining aspect ratios through padding. This standardization ensured uniform input dimensions for the YOLONASS-DeepNet model while preserving the original text proportions and layouts. To enhance the model's robustness and generalization capabilities, we employed a series of data augmentation techniques, including random rotations (±15 degrees), random cropping (with a minimum 75% retention of the original image), and random adjustments to brightness (±25%) and contrast (±25%). Additionally, we implemented a novel text-aware augmentation strategy that applied more aggressive transformations to images with larger text instances while being more conservative with images containing smaller text, thereby addressing the scale variation challenge inherent in scene text detection. Furthermore, we utilized Roboflow's auto-orient feature to correct for any unintentional rotations in the source images, ensuring that the text orientations in our training data accurately reflected real-world scenarios. To mitigate class imbalance, we employed adaptive sampling, oversampling images with rare text instances or underrepresented languages.
The entire framework is implemented in PyTorch, with custom CUDA kernels for efficient processing of text-specific operations. We utilize mixed precision training and gradient accumulation to optimize memory usage and enable training with larger batch sizes. The model achieves real-time performance (43 FPS) while maintaining high accuracy, with a compact model size of 18.7 MB [26].
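As an illustration of the augmentation pipeline described above, the Albumentations-based sketch below reproduces the stated transforms (640x640 letterboxing, ±15° rotation, ±25% brightness/contrast). The sampling probabilities are assumptions, and the random-crop and text-aware (scale-dependent) policy are omitted for brevity.

```python
import albumentations as A
import cv2

# Hedged sketch of the stated preprocessing/augmentation steps.
train_transform = A.Compose(
    [
        A.LongestMaxSize(max_size=640),                          # resize, keep aspect ratio
        A.PadIfNeeded(min_height=640, min_width=640,
                      border_mode=cv2.BORDER_CONSTANT, value=0), # letterbox to 640x640
        A.Rotate(limit=15, p=0.5),                               # random rotation, ±15 degrees
        A.RandomBrightnessContrast(brightness_limit=0.25,
                                   contrast_limit=0.25, p=0.5),  # ±25% brightness/contrast
    ],
    bbox_params=A.BboxParams(format="coco", label_fields=["labels"]),
)

# augmented = train_transform(image=image, bboxes=boxes, labels=labels)
```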
We implemented YOLONASS-DeepNet using PyTorch 1.9.0, leveraging its dynamic computational graph capabilities for an efficient implementation of our Neural Architecture Search components. The model was trained on a distributed system comprising 4 NVIDIA A100 GPUs, utilizing mixed precision training to optimize memory usage and computational efficiency. The Neural Architecture Search phase was conducted using a population size of 100 architectures, evolved over 50 generations. The training parameters are shown in Table 1. We employed a multi-objective fitness function that balanced detection accuracy (mAP), model size, and inference speed. The search process was accelerated using progressive learning techniques, where promising architectures were initially evaluated on a subset of the data before full-scale training.
Table 1: Training Parameters
| Hyperparameter | Value |
|---|---|
| Silent mode | True |
| Average best models | True |
| Warmup mode | Linear epoch step |
| Warmup initial learning rate | 0.000001 |
| Learning rate warmup epochs | 3 |
| Initial learning rate | 0.0002 |
| Learning rate mode | Cosine |
| cosine_final_lr_ratio | 0.1 |
| Optimizer | Adam |
| Optimizer weight decay | 0.01 |
| EMA | True |
| EMA parameter decay type | threshold |
| EMA parameter decay | 0.9 |
| Max epochs | 10 |
| Loss | PPYoloLoss |
For the main training phase, we utilized the Adam optimizer with an initial learning rate of 1e-3, employing our Super Gradients optimization technique for adaptive learning rate adjustments. The model was trained for 100 epochs with a batch size of 64, using a cosine annealing learning rate schedule. We implemented early stopping with a patience of 10 epochs to prevent overfitting. The training schedule in Table 1 uses a maximum of 10 epochs, with optimization guided by the PPYoloLoss. For evaluation, the score threshold is set at 0.1, guaranteeing that only predictions with an acceptable level of confidence are considered, and only the top 300 predictions are retained for further analysis, emphasizing text detections with a high probability of being accurate. During training, the targets are normalized, which standardizes the outputs and helps improve the model's learning efficiency. This methodical approach, integrating YOLO-NAS Small with the optimization techniques described above, enables efficient text detection in a wide range of intricate and challenging visual settings.
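The hyperparameters in Table 1 map closely onto the super-gradients Trainer configuration. The sketch below shows one plausible mapping; the model variant name ("yolo_nas_s"), the loss class (PPYoloELoss, the library's name for the PP-YOLO-style loss listed in the table), the metric setup, and the dataloaders are assumptions rather than details taken from the specification.

```python
from super_gradients import Trainer
from super_gradients.training import models
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback

trainer = Trainer(experiment_name="yolonass_deepnet_text", ckpt_root_dir="checkpoints")
model = models.get("yolo_nas_s", num_classes=1, pretrained_weights="coco")  # single "text" class

train_params = {
    "silent_mode": True,
    "average_best_models": True,
    "warmup_mode": "linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 2e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "Adam",
    "optimizer_params": {"weight_decay": 0.01},
    "ema": True,
    "ema_params": {"decay_type": "threshold", "decay": 0.9},
    "max_epochs": 10,
    "loss": PPYoloELoss(use_static_assigner=False, num_classes=1, reg_max=16),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,               # confidence cut-off described in Section 3.2
            top_k_predictions=300,         # keep only the top 300 detections
            num_cls=1,
            normalize_targets=True,        # targets normalized during training
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01, nms_top_k=1000,
                max_predictions=300, nms_threshold=0.7,
            ),
        )
    ],
    "metric_to_watch": "mAP@0.50",
}

# trainer.train(model=model, training_params=train_params,
#               train_loader=train_loader, valid_loader=valid_loader)
```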
3.3 Evaluation metrics
To comprehensively evaluate YOLONASS-DeepNet's performance, we employed a diverse set of metrics that capture various aspects of text detection accuracy and efficiency. The primary metric was mean Average Precision (mAP), calculated at different Intersection over Union (IoU) thresholds (0.5, 0.75, and 0.5:0.95) to assess the model's localization accuracy across various precision levels. In addition to mAP, we utilized the Total-Text evaluation protocol, which is specifically designed for curved text detection. This protocol computes precision, recall, and F-measure using a polygon-based IoU calculation, providing a more nuanced evaluation for irregularly shaped text instances. To assess the model's performance in video text detection scenarios, we introduced a temporal consistency score (TCS), defined as:
TCS = (1 / (N(T − 1))) Σ_{i=1}^{N} Σ_{t=1}^{T−1} IoU(d_i^t, d_i^{t+1})
Where d_i^t represents the i-th tracked detection in frame t, T is the total number of frames, and N is the number of tracked text instances. This metric quantifies the stability of detections across consecutive frames. Lastly, to evaluate the efficiency aspects of our model, we measured the inference time (in milliseconds per image) and the model size (in megabytes). These metrics were crucial in assessing the model's suitability for deployment in resource-constrained environments, such as mobile devices or edge computing scenarios.
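Under the reading of the TCS given above (mean frame-to-frame overlap of tracked detections), it could be computed as in the sketch below; axis-aligned boxes are used for brevity, whereas the detector's boxes are rotated.

```python
import numpy as np

def box_iou(a, b):
    """Axis-aligned IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def temporal_consistency_score(tracks):
    """Average frame-to-frame IoU over all tracked text instances.
    `tracks` is a list of N tracks, each a list of per-frame boxes (x1, y1, x2, y2)."""
    overlaps = [box_iou(trk[t], trk[t + 1])
                for trk in tracks for t in range(len(trk) - 1)]
    return float(np.mean(overlaps)) if overlaps else 0.0
```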
Figure 3: Text detection result obtained from the video frames of outside the stores
Figure 4: Text detection result obtained from the video frames of inside the stores.
Table 2 presents a comprehensive comparison of YOLONASS-DeepNet with state-of-the-art methods on the COCO-Text V2.0 and ICDAR 2015 datasets, while Table 3 shows the comparison for video text detection on the ICDAR 2015 Video Text dataset [13]. As evident from Tables 2 and 3, YOLONASS-DeepNet outperforms existing state-of-the-art methods across the datasets and metrics considered. The performance gain is particularly noticeable in video text detection, where our model's temporal consistency and motion-aware features contribute to superior results, as shown in Figures 4 and 5.
Table 2: Performance Comparison on Image Datasets
| Method | COCO-Text V2.0 (mAP@0.5) | COCO-Text V2.0 (mAP@0.75) | ICDAR 2015 (F-measure) |
|---|---|---|---|
| EAST [3] | 0.751 | 0.602 | 0.832 |
| TextBoxes++ [4] | 0.767 | 0.621 | 0.851 |
| CRAFT [7] | 0.784 | 0.645 | 0.869 |
| DB-ResNet-50 [25] | 0.795 | 0.659 | 0.886 |
| YOLONASS-DeepNet (Proposed) | 0.823 | 0.684 | 0.912 |
Figure 5: Performance comparison of text detection methods
Figure 6: Performance Trends across different text detection methods
Table 3: Performance Comparison on Video Text Dataset
| Method | TAP | TAR | F-measure | TCS |
|---|---|---|---|---|
| Deep-Text-Spotter [26] | 0.841 | 0.832 | 0.836 | 0.912 |
| TextSnake [27] | 0.863 | 0.845 | 0.854 | 0.923 |
| EAST+LSTM [28] | 0.872 | 0.859 | 0.865 | 0.931 |
| YOLONASS-DeepNet (Proposed) | 0.891 | 0.876 | 0.883 | 0.945 |
To quantify the impact of individual components in YOLONASS-DeepNet, we conducted comprehensive ablation studies. Removing the Neural Architecture Search component resulted in a 3.2% drop in mAP on COCO-Text V2.0, highlighting its importance in optimizing the model architecture. The Super Gradients optimization technique contributed a 2.7% improvement in overall performance and accelerated convergence by 20% compared to standard Adam optimization. For video text detection, disabling the temporal consistency module led to a 4.1% decrease in the temporal consistency score (TCS) and a 2.8% drop in F-measure. These results underscore the effectiveness of our video-specific adaptations in maintaining stable and accurate detections across frames.
4. Conclusion
In this study, we presented YOLONASS-DeepNet, a novel framework that significantly advances the field of text detection in natural scenes and videos. Our approach uniquely combines YOLO-based detection with Neural Architecture Search and Super Gradients optimization, introducing three key innovations: adaptive text anchors that dynamically adjust based on the statistical distribution of text aspect ratios, hybrid convolution blocks that synergistically combine standard and dilated convolutions for multi-scale feature extraction, and a text-density aware gradient scaling mechanism for optimized training. Together with the text-aware attention mechanism and adaptive optimization strategy, these innovations enable robust text detection across challenging scenarios, achieving state-of-the-art performance with a mean Average Precision of 0.823 at IoU 0.5 on COCO-Text V2.0 and an F-measure of 0.912 on ICDAR 2015. Operating at 43 frames per second with a compact model size of 18.7 MB, YOLONASS-DeepNet successfully bridges the gap between accuracy and efficiency, making it particularly suitable for real-world applications, and opens new avenues for future research on text detection systems capable of handling extreme lighting conditions, complex backgrounds, and multi-script scenarios.
Claims:
1. A novel integration of YOLO architecture with Neural Architecture Search, specifically optimized for text detection tasks
2. Introduction of Super Gradients optimization technique that enhances training efficiency and model performance
3. Development of an adaptive text-aware NAS strategy that automatically discovers efficient architectures for text detection
4. Implementation of a temporal consistency module for stable video text detection
5. Comprehensive evaluation demonstrating superior performance on benchmark datasets with reduced computational requirements.
6. A compact model (18.7 MB) achieving real-time performance (43 FPS) while maintaining state-of-the-art accuracy.
| # | Name | Date |
|---|---|---|
| 1 | 202541011854-STATEMENT OF UNDERTAKING (FORM 3) [12-02-2025(online)].pdf | 2025-02-12 |
| 2 | 202541011854-REQUEST FOR EARLY PUBLICATION(FORM-9) [12-02-2025(online)].pdf | 2025-02-12 |
| 3 | 202541011854-POWER OF AUTHORITY [12-02-2025(online)].pdf | 2025-02-12 |
| 4 | 202541011854-FORM-9 [12-02-2025(online)].pdf | 2025-02-12 |
| 5 | 202541011854-FORM 1 [12-02-2025(online)].pdf | 2025-02-12 |
| 6 | 202541011854-DECLARATION OF INVENTORSHIP (FORM 5) [12-02-2025(online)].pdf | 2025-02-12 |
| 7 | 202541011854-COMPLETE SPECIFICATION [12-02-2025(online)].pdf | 2025-02-12 |