Abstract: Systems and methods for orientation and direction aware text detection in media. Embodiments herein disclose a deep learning based media processing technique to detect region(s) corresponding to text regions along with its orientation, direction, and starting position in a media. Embodiments herein can detect text in a media along with its orientation and direction, wherein a skew correction can be enabled right at the text detection stage before proceeding with the rest of the OCR pipelines, hence guaranteeing optimum results. Embodiments herein disclose a technique that can work in real-world media that contain background artefacts and different document templates. Embodiments herein are robust against these cases and also in the cases in which the printed text is misaligned against the axes of the paper medium. Embodiments herein perform text detection with respect to the orientation of the text content rather than the medium. FIG. 3
Description: The following specification particularly describes and ascertains the nature of this invention and the manner in which it is to be performed:-
TECHNICAL FIELD
Embodiments disclosed herein relate to media processing, and more particularly to deep learning based media processing techniques for detecting region(s) corresponding to text regions along with its orientation, direction, and starting position in a media.
BACKGROUND
Real-world media that have to be processed for machine learning tasks (such as Optical Character Recognition (OCR)) can come with a rotation skew; that is, the captured text may be inverted or at right angles with respect to the reference frame at 0 degrees. The rotation skew can significantly degrade the text detection and text recognition results of OCR pipelines. Hence, it is critical to develop an advanced text detection technique that can also detect the text orientation and text direction.
Some existing solutions assume that there will not be any background artefacts, that is, that the media space contains only text content. But in the real world, especially when the text is captured by cameras and uploaded by users, there will be background artefacts and a variety of templates.
There are some inherent challenges involved in the detection of text in multiple orientations and directions, wherein the text direction means the writing/reading direction of the language and the orientation means the inclination of the text with respect to the coordinate axes. In the case of general object detection, there are rich visual cues for the learning network to identify the orientation. For example, when detecting whether a car in an image is upright or inverted, the location of the tyres with respect to the body top acts as a visual cue for a learning network to decide its orientation. However, when it comes to text detection, the visual cues to resolve between the orientations are minimal. This is because the text content in an image can come in different letter shapes, font styles, and font sizes, along with variations in image illumination, brightness, etc. These factors make the detection of text along with its direction a difficult and complex learning problem.
OBJECTS
The principal object of embodiments herein is to disclose a deep learning based media processing technique to detect region(s) corresponding to text regions along with its orientation, direction, and starting position in a media.
These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating at least one embodiment and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
BRIEF DESCRIPTION OF FIGURES
Embodiments herein are illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
FIG. 1 is a flowchart depicting the process of detecting region(s) corresponding to text regions along with its orientation, direction, and starting position in a media, according to embodiments as disclosed herein;
FIG. 2 depicts a device configured for detecting region(s) corresponding to text regions along with its orientation, direction, and starting position in a media, according to embodiments as disclosed herein;
FIG. 3 depicts an architecture for detecting region(s) corresponding to text regions along with its orientation, direction, and starting position in a media, according to embodiments as disclosed herein;
FIG. 4 depicts the architecture of the first stage learning module, according to embodiments as disclosed herein;
FIG. 5 depicts an example text and corresponding parameters for a bounding box enclosing the text, according to embodiments as disclosed herein;
FIG. 6A depicts an example visualization of half maps for an inverted image, according to embodiments as disclosed herein;
FIG. 6B depicts an example visualization of half maps for a normal image, according to embodiments as disclosed herein; and
FIG. 7 depicts an architecture for the second stage learning module, according to embodiments as disclosed herein.
DETAILED DESCRIPTION
The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
The embodiments herein achieve a deep learning based media processing technique to detect region(s) corresponding to text regions along with its orientation, direction, and starting position in a media. Referring now to the drawings, and more particularly to FIGS. 1 through 7, where similar reference characters denote corresponding features consistently throughout the figures, there are shown embodiments.
The following abbreviations have been referred to herein:
CNN: Convolutional Neural Network
OCR: Optical Character Recognition
e-KYC: Electronic-Know Your Customer
FPN: Feature Pyramid Network
THC: Text Head Classification
THS: Text Head Score
ROI: Region of interest
RROI: Rotated region of interest
Media as referred to herein can be any form of media, such as, but not limited to, images, video, animations, presentations, and so on.
Embodiments herein can detect text in a media along with its starting position, orientation and direction, wherein a skew correction can be enabled right at the text detection stage before proceeding with the rest of the OCR pipelines, hence guaranteeing optimum results.
Embodiments herein disclose a technique that can work in real-world media that contain background artefacts and different document templates. Embodiments herein are robust against these cases and also in the cases in which the printed text is misaligned against the axes of the paper medium. Embodiments herein perform text detection with respect to the orientation of the text content rather than the medium.
FIG. 1 is a flowchart depicting the process of detecting region(s) corresponding to text regions along with its orientation, direction, and starting position in a media. The starting position is the position of the text word's first character, also referred to as text head herein.
Consider that a media is taken as an input. The media can be taken in real time from a media capturing means (such as a camera) or from a storage means (such as an onboard memory, a data storage, the Cloud, and so on). In step 101, one or more feature(s) of the media are extracted by passing the media to a pretrained Convolutional Neural Network (CNN). In step 102, a set of learning tasks are executed on the extracted features to generate bounding boxes for the text regions and to get the coordinates, textness score and text head prediction score of the bounding boxes. The text head is that side of the bounding box which corresponds to the first character of the word present inside the bounding box. The learning tasks depend on labelled data, wherein the labelled data include images and corresponding ground truths; in this case, the ground truths are the coordinates of the bounding boxes and the text head location for each of the words (text regions) present in the image. In step 103, a second stage learning is performed where the boxes are refined further to fit precisely over the text contents. For this, some boxes may have to be merged and/or split. Also, the dimensions of the boxes have to be adjusted for optimal results. At this stage, the fine-tuned text head prediction results can be obtained, which enables the localization of text contents while preserving their orientation and direction. Learning the text direction can also enable identification of the starting position of the text. The detected texts, bounding boxes and text head prediction scores are provided as output, either in terms of the regions of the media where the text is present and/or the coordinates of the location of the text in the media. The various actions in method 100 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 1 may be omitted.
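For illustration only, the three steps above can be summarized as a high-level processing flow. The following is a minimal sketch of that flow; the function names and the way the stages are passed as callables are assumptions made for illustration and do not form part of the disclosed embodiments.

```python
def detect_text(image, backbone, first_stage, box_generator, second_stage):
    """Illustrative two-stage flow for orientation and direction aware text detection.

    backbone, first_stage, box_generator and second_stage are placeholder callables
    standing in for the modules described in steps 101-103; they are assumptions
    for illustration only.
    """
    features = backbone(image)                            # step 101: CNN features
    score_map, box_map, ths_map = first_stage(features)   # step 102: coarse boxes, textness,
                                                          #           and text head score maps
    proposals = box_generator(score_map, box_map)         # keep high-textness proposals
    boxes, textness, text_head = second_stage(features, ths_map, proposals)  # step 103: refine
    return boxes, textness, text_head
```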
FIG. 2 depicts a device configured for detecting region(s) corresponding to text regions along with its orientation, direction, and starting position in a media. The electronic device 200 can be any device, such as a scanner, a copier, a PDA, a cell phone, a digital camera, a smart phone, a tablet, a wearable device, an Internet of Things (IoT) device, or any other device which can capture media and/or process media. The electronic device 200 comprises a processor 201 coupled to a memory 202. The memory 202 provides storage for instructions and modules executable by the processor 201, and for other data.
The memory 202 stores at least one of, one or more media, one or more data/intermediate data generated during the technique (as disclosed herein), detected text, and so on. Examples of the memory 202 may be, but are not limited to, NAND, embedded Multimedia Card (eMMC), Secure Digital (SD) cards, Universal Serial Bus (USB), Serial Advanced Technology Attachment (SATA), solid-state drive (SSD), and so on. Further, the memory 202 may include one or more computer-readable storage media. The memory 202 may include one or more volatile or non-volatile storage elements. Examples of such storage elements may include Random Access Memory (RAM), Read Only Memory (ROM), magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable (EPROM) or electrically erasable and programmable (EEPROM) memories. In addition, the memory 202 may, in some examples, be considered a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the memory is non-movable. In certain examples, a non-transitory storage medium may store data that may, over time, change (e.g., in Random Access Memory (RAM) or cache).
The term ‘processor 201’, as used in the present disclosure, may refer to, for example, hardware including logic circuits; a hardware/software combination such as a processor executing software; or a combination thereof. For example, the processing circuitry may include, but is not limited to, a central processing unit (CPU), an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a System-on-Chip (SoC), a programmable logic unit, a microprocessor, an application-specific integrated circuit (ASIC), etc. For example, the processor 201 may include at least one of, a single processor, a plurality of processors, multiple homogeneous or heterogeneous cores, multiple Central Processing Units (CPUs) of different kinds, microcontrollers, special media, and other accelerators.
Consider that a media is taken as an input. The media can be taken in real time from a media capturing means (such as a camera) or from a storage means (such as the memory 202, a data storage, the Cloud, and so on). The feature extraction module 202a can extract one or more feature(s) of the media by passing the media to a trainable multi-stage Convolutional Neural Network (CNN). In an embodiment herein, the feature extractor of the CNN can be learned using labelled data. The first stage learning module 202b can execute a set of learning tasks on the extracted features to learn a set of bounding boxes for the text regions, along with the textness score, coordinates and text head score for each bounding box. The textness is the probability that there is text present in a text region or inside a bounding box. The textness score can be calculated using the first stage learning module, where fully connected layers on top of the RRoI align layer features predict the textness score. The text head is that side of the bounding box which corresponds to the starting position of the text. The first stage learning module 202b can generate coarse bounding boxes and the textness score, which are fed to the box generation module 202c to generate more fine grained bounding boxes. The box generation module 202c generates fine grained boxes by discarding boxes with low textness scores. The box generation module 202c can provide the generated boxes to the second stage learning module 202d. The second stage learning module 202d can perform a second stage learning where the boxes are refined further to fit precisely over the text contents. In an embodiment herein, the second stage learning module 202d may merge and/or split some boxes. In an embodiment herein, the second stage learning module 202d can adjust the dimensions of the boxes for optimal results. The second stage learning module 202d can provide fine-tuned text head prediction results, which enables the localization of text contents in the media, while preserving the orientation and direction of the detected text. The process of determining the text direction is enabled by predicting the starting position of the text. The second stage learning module 202d can provide the detected texts as output. The detected texts are provided as output, either in terms of the regions of the media where the text is present and/or the coordinates of the location of the text in the media.
FIG. 3 depicts an architecture for detecting region(s) corresponding to text regions along with its orientation, direction, and starting position in a media. The feature extraction module 202a can extract one or more meaningful features of the visual contents in the media. The feature extraction module 202a can use a trainable multi-stage CNN called Feature Pyramid Network (FPN) for extracting one or more meaningful features of the visual contents in the media, wherein the FPN provides feature maps at multiple scales or resolutions. The CNN can have any number of convolutional layers. Having feature maps at multiple scales or resolutions can help to efficiently process the text contents of varying scales. Depending on the size of the text in the media, the feature extraction module 202a can select one among the multiple feature maps of the FPN that is most suited for the detection task. The feature extraction module 202a can provide the selected feature maps of the FPN to the first stage learning module 202b.
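As an illustration of how a feature map could be selected based on the size of the text, the following sketch uses the canonical-size heuristic from the FPN literature; the constants (canonical size 224, level range 2 to 5) are assumptions for illustration and are not values from the disclosed embodiments.

```python
import math

def fpn_level_for_box(box_w, box_h, k_min=2, k_max=5, k0=4, canonical=224):
    """Pick the FPN pyramid level best suited to a text region of the given size.

    Uses the canonical-size heuristic from the FPN literature; the constants are
    illustrative assumptions, not values from the disclosed system.
    """
    k = k0 + math.log2(max(math.sqrt(box_w * box_h), 1.0) / canonical)
    return int(min(max(math.floor(k), k_min), k_max))

# Example: a small 40x12 px word maps to a finer level than a 600x150 px heading.
print(fpn_level_for_box(40, 12))    # -> 2
print(fpn_level_for_box(600, 150))  # -> 4
```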
The first stage learning module 202b can comprise a text score learning module 301a, a box regression learning module 301b, and a text head classification module 301c. The first stage learning module 202b can generate an initial set of bounding boxes or proposals. The bounding boxes generated by the first stage learning module 202b are very dense. In an embodiment herein, the bounding boxes can overlap with each other. Each generated bounding box will have a textness score associated with it.
The box generation module 202c can comprise a box generator 302a and a cropping module 302b. The box generation module 202c can generate one or more fine grained bounding boxes based on inputs received from the first stage learning module 202b.
The second stage learning module 202d can comprise a plurality of RRoI alignment modules 303a, 303b, a text score prediction module 303c, a box coordinates prediction module 303d, and a text head prediction module 303e. The second stage learning module 202d can further refine the bounding boxes and the text head prediction scores for each bounding box. The modules 202a, 202b, 202c, 202d are end-to-end trainable using labelled datasets, which enables the system to be used for any type of language or input media.
FIG. 4 depicts the architecture of the first stage learning module 202b. The pyramid of feature maps from the FPN is given to a set of convolution layers. The feature maps are first given to a 3x3/64 channel convolution. The output of this convolution is then given to three branches: a text score learning module 301a, a box regression learning module 301b, and a text head classification module 301c. The text score learning module 301a comprises a 1x1/1 channel convolution. The text score learning module 301a can be used to learn a text score map, that is, which pixels correspond to text content and which do not. The text score learning module 301a can perform the learning at a pixel level using a dice loss function.
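For reference, a minimal sketch of a pixel-level dice loss of the kind referred to above is shown below; the smoothing constant is an assumption for illustration.

```python
import torch

def dice_loss(pred, target, eps=1.0):
    """Pixel-level dice loss between a predicted score map and a binary ground
    truth map (both tensors of the same shape, with values in [0, 1])."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Example: a perfect prediction gives a loss of 0.
gt = torch.tensor([[0.0, 1.0], [1.0, 0.0]])
print(dice_loss(gt.clone(), gt))  # tensor(0.)
```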
The box regression learning module 301b comprises 1x1/1 channel convolution layers, the output of which comprises five such channels. The box regression learning module 301b learns the dimensions of the bounding box in which the text contents in the image will be contained. The box regression learning module 301b can characterize the box dimensions using a plurality of parameters (a first parameter (l), a second parameter (t), a third parameter (r), a fourth parameter (b), and a fifth parameter (θ)) with respect to each pixel, as shown in FIG. 5. With reference to a pixel p that is assumed to be inside a bounding box that contains a text content, l is the shortest distance between p and the left side of the bounding box, t is the shortest distance between p and the top side of the bounding box, r is the shortest distance between p and the right side of the bounding box, b is the shortest distance between p and the bottom side of the bounding box, and θ is the angle the longest side of the box makes with respect to the x-axis. The five-channel output learns one of these parameters per channel for each pixel, and the loss function used in the optimization process to learn the parameters (l, t, r, b) is an intersection over union (IoU) loss; to learn the parameter θ, the box regression learning module 301b can use the L1 loss function.
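To make the (l, t, r, b, θ) parameterization concrete, the sketch below recovers the four corners of a rotated box from one pixel's predictions; the corner ordering and the angle convention (θ measured counter-clockwise from the x-axis) are assumptions made for this illustration.

```python
import numpy as np

def decode_rotated_box(px, py, l, t, r, b, theta):
    """Recover the four corners of a rotated bounding box from the per-pixel
    (l, t, r, b, theta) parameters, for a pixel p located at (px, py)."""
    # Corners relative to the pixel p in the box's own (unrotated) frame.
    corners = np.array([[-l, -t],   # top-left
                        [ r, -t],   # top-right
                        [ r,  b],   # bottom-right
                        [-l,  b]])  # bottom-left
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    rot = np.array([[cos_t, -sin_t],
                    [sin_t,  cos_t]])
    return corners @ rot.T + np.array([px, py])

# Example: an axis-aligned case (theta = 0) around pixel (100, 50).
print(decode_rotated_box(100, 50, l=30, t=10, r=70, b=10, theta=0.0))
# [[ 70.  40.] [170.  40.] [170.  60.] [ 70.  60.]]
```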
The text head classification module 301c can enable identification of the orientation of the text, the direction of the text and the starting position of the text. The text head classification module 301c can learn a Text Head Score (THS) map, wherein the THS map helps to identify text head regions or the starting position of the text in the media. The text head classification module 301c can learn the THS map through two convolution operations: a 3x3/1 channel convolution and a 1x1/1 channel convolution.
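Putting the three branches described above together, the following is a minimal sketch of the first stage head; the kernel sizes and channel counts follow the description, while the FPN input channel count of 256 and the activation functions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FirstStageHead(nn.Module):
    """Sketch of the first-stage branches: a shared 3x3/64 convolution, a 1x1/1
    text score branch, a 1x1 box regression branch with five output channels
    (l, t, r, b, theta), and a 3x3/1 + 1x1/1 text head (THS) branch."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)
        self.text_score = nn.Conv2d(64, 1, kernel_size=1)      # textness map
        self.box_regression = nn.Conv2d(64, 5, kernel_size=1)  # (l, t, r, b, theta)
        self.text_head = nn.Sequential(                        # THS map
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
            nn.Conv2d(1, 1, kernel_size=1),
        )

    def forward(self, feature_map):
        x = torch.relu(self.shared(feature_map))
        return (torch.sigmoid(self.text_score(x)),
                self.box_regression(x),
                torch.sigmoid(self.text_head(x)))

# Example: one 256-channel FPN feature map of spatial size 64x64.
head = FirstStageHead()
score, boxes, ths = head(torch.randn(1, 256, 64, 64))
print(score.shape, boxes.shape, ths.shape)
# torch.Size([1, 1, 64, 64]) torch.Size([1, 5, 64, 64]) torch.Size([1, 1, 64, 64])
```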
The text head classification module 301c can enable the THS map to approximate the half maps from which the text starting position can be found out, and the map is learned at a pixel level. Half maps are binary images, wherein the half of the text region close to the starting position is marked by the text head classification module 301c as white pixels and the remaining region as black pixels. FIG. 6A depicts an example visualization of half maps for an inverted image, and FIG. 6B depicts an example visualization of half maps for a normal image. The text head classification module 301c can learn the THS maps as a pyramid of features to learn from text contents of varying scales. The half maps will later be consumed by the Rotated region of interest (RRoI) align modules 303a, 303b in the second stage learning module 202d, where the final prediction is made whether the text is inverted or not. The text head classification module 301c can learn these maps at a pixel level using a dice loss function.
For identifying the starting position of the text contents that is oriented at any angle, the text head classification module 301c can align the corresponding regions in the THS maps horizontally with respect to the x axis. The text head classification module 301c can resolve the aligned corresponding regions to find if the contents are inverted or not. This can be learned through the overall training method. The labelled data can be used for training and the model can automatically learn it.
In this way, it is possible to find the starting position of the text contents oriented at an arbitrary angle.
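A minimal sketch of how a half map could be constructed for a single word is shown below; the axis-aligned box and the left/right head flag are simplifying assumptions made for illustration (a rotated box would first be aligned with the x-axis, as described above).

```python
import numpy as np

def half_map_for_box(img_h, img_w, x1, y1, x2, y2, head_on_left=True):
    """Build the binary half map for one axis-aligned word box, marking the half
    of the box nearest the starting position (text head) as white (1)."""
    half_map = np.zeros((img_h, img_w), dtype=np.uint8)
    mid_x = (x1 + x2) // 2
    if head_on_left:
        half_map[y1:y2, x1:mid_x] = 1   # start of the word is on the left half
    else:
        half_map[y1:y2, mid_x:x2] = 1   # inverted text: start is on the right half
    return half_map

# Example: a 60x20 px word whose first character is on the left.
hm = half_map_for_box(64, 128, x1=10, y1=20, x2=70, y2=40, head_on_left=True)
print(hm.sum())  # 600 white pixels: the left half (30 px wide, 20 px tall)
```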
From the output of the first stage learning module 202b, the box generation module 202c generates bounding boxes on the media, retaining only the bounding boxes whose probability score of containing text exceeds a minimum threshold; for example, the retained boxes have a probability of at least 0.3 of containing text contents. The second stage learning module 202d can further refine the boxes to enclose the correct locations of the text in the media. The second stage learning module 202d can refine the boxes by predicting the textness and box coordinates for each box.
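A minimal sketch of this filtering step is shown below; the example threshold of 0.3 follows the description, while the tensor layout (one row of five box parameters per proposal) and the function name are assumptions made for illustration.

```python
import torch

def filter_proposals(boxes, textness_scores, min_score=0.3):
    """Keep only the proposal boxes whose textness score exceeds the threshold.

    boxes: (N, 5) tensor of (l, t, r, b, theta) proposals.
    textness_scores: (N,) tensor of textness probabilities.
    """
    keep = textness_scores > min_score
    return boxes[keep], textness_scores[keep]

# Example: two of three proposals survive a 0.3 threshold.
boxes = torch.rand(3, 5)
scores = torch.tensor([0.9, 0.1, 0.5])
kept_boxes, kept_scores = filter_proposals(boxes, scores)
print(kept_scores)  # tensor([0.9000, 0.5000])
```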
FIG. 7 depicts an architecture for the second stage learning module. The generated boxes (from the box generation module 202c), the FPN pyramid of features (from the first stage learning module 202b), and the THS maps (from the first stage learning module 202b) are the inputs to the second stage learning module 202d. The second stage learning module 202d performs fine-tuning on top of the learning process of the first stage learning module 202b. The second stage learning module 202d can predict the textness, box coordinates, and text head.
For learning the textness and box coordinates, the boxes and the FPN pyramid of features are provided to an RRoI alignment module 303a. The RRoI alignment module 303a can crop out the regions from the FPN features corresponding to the box(es). Since the bounding boxes are rotated, the RRoI alignment module 303a can crop the corresponding rotated local regions in the FPN features. These cropped regions are given as input to a common set of fully connected layers and then to separate linear layers in the second stage learning module 202d (i.e., a text score prediction module 303c and a box coordinates prediction module 303d) for the prediction of textness and box coordinates. The text score prediction module 303c can learn the textness using a cross-entropy loss function. The box coordinates prediction module 303d can learn the box coordinates using a smooth L1 loss function. Embodiments herein can train the entire system end-to-end using labelled data. From these predictions, it is possible to localize the text regions in the media, or to obtain the coordinates of the boxes in which the individual words are contained. At this stage, the orientation of the text is resolved. The direction of the text is resolved with the help of the text head prediction module 303e. The text head prediction module 303e can learn the text head using a cross-entropy loss function.
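A minimal sketch of such a second stage refinement head is given below; the RRoI output size of 7x7 and the hidden width of 1024 are assumptions made for illustration, while the choice of cross-entropy and smooth L1 losses follows the description.

```python
import torch
import torch.nn as nn

class SecondStageHead(nn.Module):
    """Sketch of the second-stage refinement: shared fully connected layers over
    RRoI-aligned features, then separate linear layers for textness and box
    coordinates."""
    def __init__(self, in_channels=256, roi_size=7, hidden=1024):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.textness = nn.Linear(hidden, 2)    # text / not-text logits
        self.box_coords = nn.Linear(hidden, 5)  # refined (l, t, r, b, theta)

    def forward(self, roi_features):            # roi_features: (N, C, 7, 7)
        x = self.shared(roi_features)
        return self.textness(x), self.box_coords(x)

# Losses named in the description: cross-entropy for textness, smooth L1 for boxes.
textness_loss = nn.CrossEntropyLoss()
box_loss = nn.SmoothL1Loss()

head = SecondStageHead()
logits, coords = head(torch.randn(4, 256, 7, 7))
print(logits.shape, coords.shape)  # torch.Size([4, 2]) torch.Size([4, 5])
```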
For text head prediction, the text head prediction module 303e can crop the generated boxes separately into two halves. If the boxes are inclined at an angle, the text head prediction module 303e horizontally aligns the boxes so that the longest side is parallel to the x-axis. From these halves, the text head prediction module 303e can obtain the coordinates of the corresponding regions in the THS maps. Since the coordinates of the original box are known, the coordinates can be inferred for each half in the THS map. The RRoI alignment module 303b can crop these regions in the THS maps. The cropped region in the THS map, after horizontal alignment, is then consumed by a set of fully connected layers followed by two separate linear layers. One linear layer gives a first probability score indicating whether the corresponding proposal or bounding box contains the text in an inverted position, and the other gives a second probability score indicating that it does not. The final decision is made corresponding to the maximum of the two probability scores, wherein the text is determined to be inverted/not inverted based on the higher of the first probability score and the second probability score. For example, if the first probability score is higher than the second probability score, it is determined that the text is in an inverted position; if the second probability score is higher than the first probability score, it is determined that the text is not in an inverted position. Hence the text direction along with the starting position is resolved here. The combination of textness, box coordinates, and text head predictions enables the orientation and direction aware text detection.
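A minimal sketch of the text head decision is given below; stacking the two cropped THS halves as a 2-channel 7x7 input and the hidden width of 256 are assumptions made for illustration, while the two separate linear outputs and the maximum-score decision follow the description.

```python
import torch
import torch.nn as nn

class TextHeadClassifier(nn.Module):
    """Sketch of the text head decision: THS crops for a horizontally aligned
    proposal pass through fully connected layers and two separate linear
    outputs, one scoring "inverted" and one scoring "not inverted"."""
    def __init__(self, in_features=2 * 7 * 7, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Flatten(),
                                    nn.Linear(in_features, hidden), nn.ReLU())
        self.inverted = nn.Linear(hidden, 1)      # first probability score
        self.not_inverted = nn.Linear(hidden, 1)  # second probability score

    def forward(self, ths_crop):                  # ths_crop: (N, 2, 7, 7)
        x = self.shared(ths_crop)
        return self.inverted(x), self.not_inverted(x)

def resolve_direction(inv_score, not_inv_score):
    """Final decision: the larger of the two scores wins."""
    return "inverted" if inv_score > not_inv_score else "not inverted"

clf = TextHeadClassifier()
inv, not_inv = clf(torch.randn(1, 2, 7, 7))
print(resolve_direction(inv.item(), not_inv.item()))
```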
Embodiments, as disclosed herein, are trainable using labelled data. Labelled data are a set of images containing text, together with labels comprising the coordinates, orientation and text head position for each of the text regions or words present in the image.
Embodiments herein can handle even the worst case of detecting whether text content at an arbitrary angle in a document is upright or completely inverted. Embodiments herein can be leveraged to correct the skew of the document media with very high accuracy. Embodiments herein do not assume any prior information related to the document media and work for real-world media that do not follow any standard operating procedures during the capture of the media. Embodiments herein can work for a variety of document templates such as scanned textual documents, application forms, identity cards, documents with tables/illustrations, and text with any font, font size and orientation. Embodiments herein can be made robust to work for any language, given sufficient data for fine-tuning the solution. Embodiments herein can be used to process large document media databases in an automated way. Embodiments herein are robust against noisy documents with background artefacts. Embodiments herein can be used in conjunction with any document information extraction processes or systems such as OCR.
The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the network elements. The network elements include blocks which can be at least one of a hardware device, or a combination of hardware device and software module.
The embodiment disclosed herein describes a deep learning based media processing technique to detect region(s) corresponding to text regions along with its orientation, direction, and starting position in a media. Therefore, it is understood that the scope of the protection is extended to such a program and in addition to a computer readable means having a message therein, such computer readable storage means contain program code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The method is implemented in at least one embodiment through or together with a software program written in, e.g., Very high speed integrated circuit Hardware Description Language (VHDL) or another programming language, or implemented by one or more VHDL or software modules being executed on at least one hardware device. The hardware device can be any kind of portable device that can be programmed. The device may also include means which could be, e.g., hardware means like an ASIC, or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. The method embodiments described herein could be implemented partly in hardware and partly in software. Alternatively, the invention may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of embodiments and examples, those skilled in the art will recognize that the embodiments and examples disclosed herein can be practiced with modification within the spirit and scope of the embodiments as described herein.