Abstract: With the ascent of deep learning, computer vision problems have taken center stage in the research community. As an important research area in computer vision, image text detection and recognition has inevitably been influenced by this wave of revolution. For content-based indexing and retrieval applications, text characters embedded in images are a rich source of information. Owing to their differing shapes, grayscale values, and dynamic backgrounds, text characters in scene images are difficult to detect and classify. The complexity increases when the text involved is in a vernacular language like Kannada. Despite advances in deep learning neural networks (DLNN), there is a dearth of fast and effective models to classify scene text images, and of a large-scale Kannada scene character dataset to train them. Text recognition from natural images is an extremely complex task that entails not just detecting the text but also locating it inside the image by producing the coordinates of a bounding box that holds the text. Here, five claim-defining inventions are proposed: (i) the Kannada Scene Individual Character (KSIC) dataset, curated from the ground up; (ii) AksharaNet, a graphical processing unit (GPU) accelerated modified convolutional neural network architecture consisting of linearly inverted depth-wise separable convolutions; (iii) leveraging transfer learning in deep learning for detecting and recognizing text in scene images, motivated by the object detection algorithm You Only Look Once (YOLO); (iv) early stopping decisions at 25% and 50% of epochs, with good and bad accuracies for complex and light models; and (v) useful findings concerning the learning rate drop factor and its ideal application period. The YOLO algorithm creates a bounding box around a detected object and includes the box's location information as well.
Taking a cue from this, AksharaNet, a convolutional neural network (CNN) based classification model implementing the YOLO algorithm, was trained on the KSIC dataset. A single reference scene image is split into a 19x19 grid, and the pre-trained network acts as a feature extractor and localizer. It detects text by predicting a class, builds a bounding box, and specifies the text location and its class name, which are then indexed in a dictionary. The generated KSIC dataset consists of 46,800 images classified into 468 classes, each with 100 samples, for training the CNN. It is observed that AksharaNet outperforms four other well-established models by 1.5% on CPU and 1.9% on GPU. These results can be directly attributed to the quality of the developed KSIC dataset. Testing the efficacy of this network using the YOLO algorithm on the same dataset with transfer learning resulted in an accuracy of 90.17% and a test area under the curve (AUC) of 0.932. Comparing the proposed model with existing models, pre-trained AksharaNet with the YOLO implementation returns the best precision, recall, and F1-scores of 88%, 90%, and 93.83% respectively, which are greater by 8%, 4%, and 5% on average compared to other models. Based on the early stopping decisions and learning rate drop factor findings, suggestions are tabulated for various scenarios. The results therefore prove that (i) the KSIC dataset is large and robust enough to train the designed CNN, (ii) AksharaNet outperforms some of the leading CNNs used in deep learning, (iii) the early stopping decisions and learning rate drop factor identified are key parameters for improved CNN performance, and (iv) YOLO can be extended to areas beyond object detection, with transfer learning an efficient and effective deep learning technique for this class of computer vision problems.
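The grid-based localization described above can be sketched as follows. This is a minimal illustration only, assuming a square input image and cell-relative predictions in the usual YOLO style; the function name, input size, and sample values are hypothetical, not taken from the specification:

```python
def decode_box(row, col, tx, ty, tw, th, grid=19, img_w=608, img_h=608):
    """Map a cell-relative YOLO prediction to absolute pixel coordinates.

    (tx, ty) are the box-center offsets within cell (row, col), each in
    [0, 1); (tw, th) are the box width/height as fractions of the image.
    Returns (x_min, y_min, x_max, y_max) in pixels.
    """
    cx = (col + tx) / grid * img_w   # absolute box center, x
    cy = (row + ty) / grid * img_h   # absolute box center, y
    w, h = tw * img_w, th * img_h    # absolute box width and height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# A detection in the central cell of the 19x19 grid:
box = decode_box(9, 9, 0.5, 0.5, 0.1, 0.05)
# box is roughly (273.6, 288.8, 334.4, 319.2)
```

Each grid cell is thus responsible only for text whose center falls inside it, which is what lets a single forward pass produce both the class and the location.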
Claims: 1. Developed a unique, large, ground-up, self-curated Kannada Scene Individual Character (KSIC) dataset using different techniques and made it available to researchers as the KSIC Dataset.
2. Designed and developed a robust deep neural model named AksharaNet using inverted depth-wise separable convolution (DSC) modules for classification, which outperformed other state-of-the-art DNN models on the KSIC dataset.
3. Designed and developed a Kannada text detection and recognition model for scene images using AksharaNet and the YOLO algorithm, achieving good performance.
4. Enumerated the optimal validation patience (VP), learning rate drop factor (lrdf), and drop period (lrdp) for AksharaNet, based on the epoch reached and accuracy achieved at termination on different architectures.
Description: The identification of text in natural scene images is a popular research subject in the areas of image processing and pattern recognition. Signboard images with embedded text carry helpful semantic details that can be used to comprehend information important for a person's needs and safety. These include institute names, business names, building names, and warning signs, among other items. As a consequence, Scene Character Recognition (SCR), an important step in the text recognition pipeline, has become a popular research subject, with applications ranging from content-based indexing, image retrieval, and robotics to reading tools that help the blind interact with their environment, tour guide systems, and intelligent transportation systems. However, character recognition in natural scenes has been found to be more complicated and nuanced than recognizing text in scanned documents. While characters within the same paragraph or title of such documents are almost the same size, natural scene settings pose a number of problems, including irregular fonts, changing lighting conditions, noise, distortion, color variation, a dynamic context, and a variety of writing styles.
Most recently published methods use convolutional neural networks (CNNs) for this task. However, deep learning has long struggled to meet the need for effective classification models with fewer parameters, lightweight design, and reliable performance. Deep models built from regular convolutions have a large number of parameters, which necessitates a lot of computation and infrastructure. Besides, conditions such as underfitting or overfitting call for deeper layers or more data, all of which increase the computational complexity beyond the scope of a normal central processing unit (CPU). Alternative hardware architectures need to be adopted to bring down the computational complexity. This is a significant disadvantage for real-time applications like the one targeted in this research. Additionally, very little work has been carried out on classifying Kannada characters from scene images. Thus, the unavailability of a large dataset, the lack of an effective deep model to train, and the limited efficiency and ease of use of the process-support systems are all major hindrances in this mission.
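The parameter saving that motivates depth-wise separable convolutions over regular ones can be seen with simple arithmetic. The sketch below counts weights for a standard k x k convolution versus a depth-wise plus point-wise pair; it is illustrative only (bias terms omitted, and the inverted variant used in AksharaNet additionally applies an expansion ratio, which is not modeled here):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def dsc_params(k, c_in, c_out):
    """Depth-wise separable convolution: a k x k depth-wise filter per
    input channel, followed by a 1 x 1 point-wise convolution."""
    return k * k * c_in + c_in * c_out

# Example layer: 3x3 kernel, 64 input channels, 128 output channels.
std = conv_params(3, 64, 128)  # 73728 weights
dsc = dsc_params(3, 64, 128)   # 8768 weights, roughly an 8x reduction
```

This roughly order-of-magnitude reduction per layer is what makes such models practical on modest hardware.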
Typically, work on the text extraction problem has focused on extracting text from English and other non-oriental languages with uniformly sized characters and fixed inter-character spacing, which makes text object classification simple. Text extraction is also simple in North Indian languages like Hindi, due to the presence of a head-line that aids text object classification. However, non-uniform geometry and inter-character spacing are seen in South Indian languages such as Kannada. The quality of the captured image has a major effect on the system's performance. Camera-captured images may contain text against a shaded or textured backdrop, low-contrast or complex scenes, or variations in font size, composition, color, orientation, and positioning. Because of these differences, automated text extraction is exceedingly difficult, and there is always a need to leverage newer technologies to improve the accuracy and robustness of such systems. Transfer learning, an evolving field in the deep learning arena, is one such technological advancement. Transfer learning is a machine learning technique that focuses on storing information learned while solving one problem and applying it to a different but similar problem. It employs artificial neural networks such as CNNs, RNNs, and LSTMs, among others, and has proven its worth in a variety of fields, including computer vision. Also, the YOLO algorithm, which has primarily been used for object detection, can be used to pre-train a network and transfer its learning to detect and recognize Kannada text in new images fed to it from billboards.
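The transfer-learning idea of reusing pre-trained feature-extractor layers while retraining only a task-specific head can be sketched abstractly as below. This is a toy illustration with a hypothetical layer list, not the patent's implementation; in a real framework the same effect is achieved by disabling gradients on the frozen layers:

```python
def split_for_transfer(layers, n_frozen):
    """Mark the first n_frozen (pre-trained) layers as frozen and the
    remaining task-specific head layers as trainable.

    Returns a list of (layer_name, trainable) pairs.
    """
    return [(name, i >= n_frozen) for i, name in enumerate(layers)]

# Hypothetical backbone + head for a pre-trained classifier:
plan = split_for_transfer(["conv1", "conv2", "conv3", "fc_head"], 3)
# plan == [('conv1', False), ('conv2', False), ('conv3', False),
#          ('fc_head', True)]
```

Only the trainable head then needs to learn the new Kannada character classes, which is why transfer learning works with far less data than training from scratch.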
Since the architecture involves a CNN, the early stopping criterion and the learning rate drop factor are important parameters to be studied. The early stopping criterion, also known as validation patience (VP), and its impact on different architectures are analysed based on the epoch reached and accuracy achieved at termination on AksharaNet. The behaviour and effects of the learning rate drop factor (lrdf) and its period of application (lrdp) on the network are also studied.
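How validation patience, lrdf, and lrdp interact during training can be sketched as a small simulation. This is a minimal sketch under assumed defaults (the function name and parameter values are illustrative, not the values determined in this work):

```python
def train_schedule(val_losses, patience=5, lr0=0.01, lrdf=0.1, lrdp=10):
    """Simulate early stopping via validation patience together with a
    piecewise learning-rate schedule: the learning rate is multiplied
    by lrdf every lrdp epochs. Returns (stop_epoch, lr_at_stop).
    """
    best, wait = float("inf"), 0
    lr = lr0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch % lrdp == 0:
            lr *= lrdf          # periodic learning-rate drop
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1           # no improvement this epoch
            if wait >= patience:
                return epoch, lr  # patience exhausted: stop early
    return len(val_losses), lr

# Validation loss improves for 3 epochs, then stalls; with patience=3
# training stops at epoch 6, before any learning-rate drop occurs.
print(train_schedule([1.0, 0.9, 0.8, 0.85, 0.86, 0.87, 0.88], patience=3))
```

A small VP terminates stalled runs sooner but risks stopping before a learning-rate drop can revive progress, which is why lrdp and VP need to be tuned together.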
| # | Name | Date |
|---|---|---|
| 1 | 202141023568-FORM 1 [27-05-2021(online)].pdf | 2021-05-27 |
| 1 | 202141023568-FORM-9 [31-05-2021(online)].pdf | 2021-05-31 |
| 2 | 202141023568-COMPLETE SPECIFICATION [27-05-2021(online)].pdf | 2021-05-27 |
| 2 | 202141023568-DRAWINGS [27-05-2021(online)].pdf | 2021-05-27 |