
“Systems And Methods For Extracting Named Attribute(s) From Document Media”

Abstract: Embodiments herein disclose systems and methods for extracting named attribute(s) from document media, which use an Optical Character Recognition (OCR) based solution that is end-to-end enabled by state-of-the-art deep learning based methods, wherein the output is structured in the form of key-value pairs. Embodiments herein disclose systems and methods for extracting named attribute(s) from document media, wherein the OCR solution includes skew correction techniques. Embodiments herein disclose systems and methods for extracting named attribute(s) from document media, wherein the extracted attributes are at a word level and can be language agnostic. FIG. 2


Patent Information

Application #
Filing Date
20 July 2023
Publication Number
35/2023
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

Subex Assurance LLP
Subex Assurance LLP, 4th Floor Pritech Park, Bellandur, Varthur Hobli, Bangalore 560103, India

Inventors

1. Mrinal Haloi
Subex Assurance LLP, 4th Floor Pritech Park, Bellandur, Varthur Hobli, Bangalore 560103, India
2. Asif Salim
Subex Assurance LLP, 4th Floor Pritech Park, Bellandur, Varthur Hobli, Bangalore 560103, India
3. Shashank Shekhar
Subex Assurance LLP, 4th Floor Pritech Park, Bellandur, Varthur Hobli, Bangalore 560103, India

Specification

Description:
TECHNICAL FIELD
[001] Embodiments disclosed herein relate to media processing, and more particularly to deep learning based media processing techniques for extracting named attributes from a media of a document.

BACKGROUND
[002] Although the existing Optical Character Recognition (OCR) solutions can recognize text in a media, the output of the OCR is not in a structured format. The output of these systems is dumped together without any categorization of the contents.
[003] Further, real-world media to be processed for OCR can have a rotation skew. The existing solutions can correct the skew if it falls in the range from -90 degrees to +90 degrees. Such solutions are unable to perform skew correction if the text in the media takes a form close to “upside-down” (when the skew angle is from -180 degrees to -90 degrees or from +90 degrees to +180 degrees).
[004] Existing solutions can be built for a particular document template, and some methods assume that there will not be any background artefacts; i.e., that the media space contains only document content. But in the real world, especially when the document is captured by cameras and uploaded by users, there will be background artefacts. Existing solutions are not able to handle this in an efficient manner.
[005] Hence, there is a need in the art for solutions which will overcome the above mentioned drawbacks, among others.

OBJECTS
[006] The principal object of embodiments herein is to disclose systems and methods for extracting named attribute(s) from document media, which use an Optical Character Recognition (OCR) based solution that is end-to-end enabled by state-of-the-art deep learning based methods, wherein the output is structured in the form of key-value pairs.
[007] Another object of embodiments herein is to disclose systems and methods for extracting named attribute(s) from document media, wherein the OCR solution includes skew correction techniques.
[008] Another object of embodiments herein is to disclose systems and methods for extracting named attribute(s) from document media, wherein the extracted attributes are at a word level and can be language agnostic.
[009] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating at least one embodiment and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.

BRIEF DESCRIPTION OF FIGURES
[0010] Embodiments herein are illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
[0011] FIG. 1 depicts a device configured for extracting named attributes from a media of a document, according to embodiments as disclosed herein;
[0012] FIG. 2 is a flowchart depicting the process of extracting named attribute(s) from document media, according to embodiments as disclosed herein;
[0013] FIG. 3 is a flowchart depicting the process of pre-processing the document media, according to embodiments as disclosed herein;
[0014] FIGs. 4A, 4B and 4C depict the process of correcting the orientation of the document, wherein the document can be rotated and/or skewed, according to embodiments as disclosed herein;
[0015] FIG. 5 depicts the process of detecting the text contents present in the document, according to embodiments as disclosed herein;
[0016] FIG. 6 depicts an example output of orientation and direction aware text detection, according to embodiments as disclosed herein;
[0017] FIG. 7 depicts a process of extracting text from the detected bounding boxes, according to embodiments as disclosed herein;
[0018] FIGs. 8A, 8B, 8C, and 8D depict a model of the overall process of layout detection and attribute classification, according to embodiments as disclosed herein; and
[0019] FIGs. 9A, 9B, 9C, and 9D depict example input media and an example output, respectively, according to embodiments as disclosed herein.


DETAILED DESCRIPTION
[0020] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0021] The embodiments herein achieve systems and methods for extracting named attribute(s) from document media. Referring now to the drawings, and more particularly to FIGS. 1 through 9D, where similar reference characters denote corresponding features consistently throughout the figures, there are shown embodiments.
[0022] Media as referred to herein can be any form of media, such as, but not limited to, images, video, animations, presentations, and so on. Embodiments herein are explained using an image as an example of a media; however, it may be obvious to a person of ordinary skill in the art that the embodiments herein may be extended to any type of media.
[0023] For automating information extraction in document images, embodiments herein disclose an advanced Optical Character Recognition (OCR) based solution for document media that can perform Named Attribute Extraction (NAE). Embodiments herein are end-to-end enabled by state-of-the-art deep learning based methods. Embodiments herein use NAE to provide output in a structured way in the form of key-value pairs, wherein the key-value relationship can be explicitly learned by advanced deep learning models, resulting in more intelligent OCR output compared to conventional methods. Embodiments herein also include a set of fool-proof skew correction techniques that many existing methods lack. Embodiments herein comprise text detection, which can be enabled by a deep neural network. Embodiments herein can output orientation and direction aware text detection, that is, they can precisely give the coordinates of the bounding box around the text contents even if the contents are at an angle with respect to the coordinate axes, and can also identify the starting position of the text in the boxes. On detecting the text, embodiments herein can use a deep neural network for performing language agnostic text recognition, wherein text can be extracted at a word level. Finally, for delivering the NAE output, embodiments herein can perform document structure learning using a state-of-the-art BERT model that considers the text, its position in the document, the text media, and related token embeddings in the same learning setting. Embodiments herein are generalizable across document templates, can work on media with varying resolution, scale, and illumination, and are capable of processing text contents of varying scales.
[0024] Embodiments herein can extract attributes from document media automatically and accurately. The OCR process as disclosed herein, along with the attribute extraction system, can generate one or more labels associated with the recognized phrases/text in the OCR stage; that is, the tag(s) on top of the words in the document are also learned. If the tag(s)/label(s) on top of the words are missing, embodiments herein can take the tag(s)/label(s) from the user prior to modeling and incorporate them into the process for extracting attributes. In an example herein, embodiments herein can understand whether a word in an invoice is in fact an item description or an invoice number.
[0025] Along with the conventional textual information learning, embodiments herein can perform implicit document semantic learning about the location information and image information associated with the text contents, and can perform text recognition even if the text is at multiple orientations.
[0026] Embodiments herein work for arbitrary document types captured in images at multiple resolutions, scales, and illuminations, or documents with background artefacts, etc.
[0027] Embodiments herein can automatically correct skew (if any).
[0028] Embodiments herein can recognize text. Embodiments herein disclose a trainable system, which can extract text from any word level input image. Embodiments herein can recognize words/characters from any language.
[0029] Embodiments herein can understand the underlying document structure of a document in a media and can explicitly learn the positions of required attribute values. Embodiments herein can provide an output, which can be at an intelligence level parallel to named attribute extraction; that is, tag(s) or label(s) associated with the extracted information and their respective position(s) in the document are provided along with a conventional OCR output.
[0030] FIG. 1 depicts a device configured for extracting named attributes from a media of a document. The electronic device 100 can be any device, such as a scanner, a copier, a PDA, a cell phone, a digital camera, a smart phone, a tablet, a wearable device, an Internet of Things (IoT) device, or any other device which can capture media and/or process media. The electronic device 100 comprises a processor 101 coupled to a memory 102. The memory 102 provides storage for instructions, modules, and other data that are executable by the processor 101. The memory 102 comprises a pre-processing module 102A, a rotation correction module 102B, a text detection module 102C, a text extraction module 102D, and a layout detection module 102E.
[0031] The memory 102 stores at least one of, one or more media, one or more data/intermediate data generated during the technique (as disclosed herein), bounding boxes, attributes, detected text, and so on. Examples of the memory 102 may be, but are not limited to, NAND, embedded Multimedia Card (eMMC), Secure Digital (SD) cards, Universal Serial Bus (USB), Serial Advanced Technology Attachment (SATA), solid-state drive (SSD), and so on. Further, the memory 102 may include one or more computer-readable storage media. The memory 102 may include one or more non-volatile storage elements. Examples of such non-volatile storage elements may include Random Access Memory (RAM), Read Only Memory (ROM), magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories. In addition, the memory 102 may, in some examples, be considered a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the memory is non-movable. In certain examples, a non-transitory storage medium may store data that may, over time, change (e.g., in Random Access Memory (RAM) or cache).
[0032] The term ‘processor 101’ as used in the present disclosure, may refer to, for example, hardware including logic circuits; a hardware/software combination such as a processor executing software; or a combination thereof. For example, the processing circuitry more specifically may include, but is not limited to, a central processing unit (CPU), an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a System-on-Chip (SoC), a programmable logic unit, a microprocessor, an application-specific integrated circuit (ASIC), etc. For example, the processor 101 may include at least one of, a single processor, a plurality of processors, multiple homogeneous or heterogeneous cores, multiple Central Processing Units (CPUs) of different kinds, microcontrollers, special media, and other accelerators.
[0033] Consider that a media is taken as an input. The media can be taken in real time from a media capturing means (such as a camera) or from a storage means (such as the memory 102, a data storage, the Cloud, a user device, a scanner, a device with a camera, and so on).
[0034] The pre-processing module 102A pre-processes the input media. The pre-processing module 102A can pre-process the media by improving the overall quality of the document and detecting one or more regions of interest (ROIs) in the media. The rotation correction module 102B can check the pre-processed media to determine if the document in the media is rotated and/or skewed. If the document in the media is rotated and/or skewed, the rotation correction module 102B can orient the document into a readable angle (i.e., an upright position) by correcting the rotation and/or skew. The text detection module 102C can detect the text contents present in the document. The text detection module 102C can use a deep learning-based detection model for detecting the text. The text detection module 102C can create a bounding box around each of the detected texts. The text extraction module 102D can extract the texts from the detected bounding boxes. The text extraction module 102D can extract the texts using an extraction model, wherein the model is composed of a trainable convolutional feature extractor and a trainable Bidirectional Long Short Term Memory (Bi-LSTM) based sequence classifier combined with beam search decoding. The layout detection module 102E can detect the layout of the document, which comprises detecting an underlying structure and the positions of required values of attributes present in the document. The layout detection module 102E can use a Bidirectional Encoder Representations from Transformers (BERT) based model for detecting layout(s) and classifying attributes.
[0035] FIG. 2 is a flowchart depicting the process of extracting named attribute(s) from document media. In step 201, a media of a document is pre-processed. Pre-processing the media comprises improving the overall quality of the document and detecting one or more regions of interest (ROIs) in the document. In step 202, the orientation of the input document is corrected, if required, by rotation and/or skew correction. In step 203, the text contents present in the document are detected using a deep learning-based detection model, wherein a bounding box is created around the detected text. In step 204, the texts are extracted from the detected bounding boxes using a deep learning-based method. The texts can be extracted using an extraction model, wherein the model is composed of a trainable convolutional feature extractor and a trainable Bidirectional Long Short Term Memory (Bi-LSTM) based sequence classifier combined with beam search decoding. In step 205, the layout of the document is detected and classified, which comprises detecting an underlying structure and the positions of required values of attributes present in the document, wherein the attribute values can be the texts that have been detected in the document. Embodiments herein can use a Bidirectional Encoder Representations from Transformers (BERT) based model for classifying attributes. The various actions in method 200 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 2 may be omitted.
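The following is a minimal, illustrative sketch (in Python) of how the five steps of method 200 could be chained; the class name, module interfaces, and run methods are assumptions introduced only for illustration and are not the disclosed implementation.

from dataclasses import dataclass
from typing import Any


@dataclass
class NamedAttributeExtractor:
    pre_processing: Any       # step 201: quality improvement and ROI detection
    rotation_correction: Any  # step 202: rotation and/or skew correction
    text_detection: Any       # step 203: oriented bounding boxes around text
    text_extraction: Any      # step 204: CNN + Bi-LSTM + beam search decoding
    layout_detection: Any     # step 205: BERT based attribute classification

    def extract(self, media) -> dict:
        # Chain the five steps and return named attributes as key-value pairs.
        roi = self.pre_processing.run(media)                      # step 201
        upright = self.rotation_correction.run(roi)               # step 202
        boxes = self.text_detection.run(upright)                  # step 203
        words = self.text_extraction.run(upright, boxes)          # step 204
        return self.layout_detection.run(upright, boxes, words)   # step 205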
[0036] FIG. 3 is a flowchart depicting the process of pre-processing the document media. Embodiments herein pre-process the input document media, wherein the media can be at least one of a video (wherein one or more frames of the video are considered), an image, an animation (wherein one or more frames of the animation are considered), a screenshot, and so on. Embodiments herein can process the input document using image processing and deep learning methods to improve the overall quality of the document and to detect the region of interest (ROI). In step 301, the pre-processing module 102A can convert documents in a non-image format into an image format. In step 302, the pre-processing module 102A can enhance the image quality using classical image processing. In an embodiment herein, the pre-processing module 102A can enhance the image quality using a deep learning-based image quality improvement method. In step 303, the pre-processing module 102A can detect the ROI in the image using a deep learning-based object detection method. The ROI for the OCR process comprises those areas in the image that contain pixels of the document, with the area corresponding to the external background removed. Examples of the methods can be, but are not limited to, Efficient detection, Mask R-CNN, and so on. The pre-processing module 102A can specify the ROI using coordinates of the document with respect to the document coordinate system. The pre-processing module 102A can pass the extracted ROI (which can be in the form of bounding boxes) as input to the rotation correction module 102B. The various actions in method 300 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 3 may be omitted.
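As an illustration of steps 302 and 303, the following sketch uses classical image processing (OpenCV) for quality enhancement and a placeholder callable for the deep learning-based ROI detector; the function names and parameter values are assumptions, not the disclosed method.

import cv2
import numpy as np


def enhance(image: np.ndarray) -> np.ndarray:
    # Step 302 (one classical variant): contrast-limited histogram equalization
    # followed by a mild denoising pass.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.fastNlMeansDenoising(clahe.apply(gray))


def crop_roi(image: np.ndarray, detect_document) -> np.ndarray:
    # Step 303: detect_document stands in for the deep learning object
    # detection model and is assumed to return an (x, y, w, h) box covering
    # only the document pixels, so the external background is removed.
    x, y, w, h = detect_document(image)
    return image[y:y + h, x:x + w]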
[0037] FIGs. 4A, 4B and 4C depict the process of correcting the orientation of the document, wherein the document can be rotated and/or skewed. The methods 400A, 400B and 400C depict the process of performing face image-based rotation correction (400A), text based rotation correction (400B), and text head detection-based rotation correction (400C) respectively. Embodiments herein use a combination of one or more of the methods 400A, 400B, and 400C for correcting the orientation of the document.
[0038] FIG. 4A depicts the process of performing face image-based rotation correction (400A). On detecting a person’s face in the document, in step 401, the rotation correction module 102B can use at least one face landmark detection method to obtain coordinates of one or more features of the person’s face (in terms of X & Y co-ordinates). In an example herein, the rotation correction module 102B can use a library (such as, but not limited to, mediapipe) to get coordinates of one or more features of the person’s face. Examples of the features can be, but are not limited to, the eye, the iris of the eye, the nose, the mouth, the eyebrows, and so on. In step 402, the rotation correction module 102B rotates the media by 180 degrees, on detecting that the media is inverted. In an embodiment herein, the rotation correction module 102B can determine if the media is inverted by comparing the Y co-ordinates of the eye of the person with the Y co-ordinates of the nose and/or mouth of the person, wherein the rotation correction module 102B can consider the media to be inverted if the Y co-ordinates of the eye of the person are higher than the Y co-ordinates of the nose and/or mouth of the person. In step 403, the rotation correction module 102B can rotate the media by a first angle (i.e., the skew angle; the angle by which the document in the media is skewed). In an embodiment herein, for determining the first angle, the rotation correction module 102B can construct a right angled triangle using the two eye points and a third vertex. In an example herein, the third vertex can be below or above the line connecting both the eyes; the choice is not definite, and since the angle at that vertex is 90 degrees, it does not cause any technical issue. The first angle is an elevation angle with respect to the right eye. The purpose of this step is to calculate the angle by which the face is aligned. Step 403 can be repeated a pre-defined number of times to reduce errors (if any). The various actions in method 400A may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 4A may be omitted.
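A minimal sketch of the geometric part of method 400A follows, assuming the landmark coordinates (for example, eye and nose points from a face landmark library) have already been obtained; the inversion test reflects one reading of the comparison in step 402.

import math


def is_inverted(eye_y: float, nose_y: float) -> bool:
    # Step 402: in image coordinates Y grows downward, so an eye lying below
    # the nose (numerically larger Y) suggests an upside-down document.
    return eye_y > nose_y


def skew_angle(right_eye: tuple[float, float], left_eye: tuple[float, float]) -> float:
    # Step 403: elevation angle of the eye line with respect to the right eye.
    # The third vertex of the right angled triangle lies directly above or
    # below one eye, which is why atan2 over the eye-to-eye offsets suffices.
    dx = left_eye[0] - right_eye[0]
    dy = left_eye[1] - right_eye[1]
    return math.degrees(math.atan2(dy, dx))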
[0039] FIG. 4B depicts the process of performing text based rotation correction (400B). If the document does not contain at least one image of a person, the rotation correction module 102B can use approaches such as, but not limited to, the discrete radon transform, text extraction approaches, and so on. In step 404, the rotation correction module 102B performs an unsharp masking and de-meaning operation on the pre-processed media. The rotation correction module 102B obtains a sinogram of the processed image using an approach such as, but not limited to, the discrete radon transformation, wherein the angle corresponding to the highest response in the sinogram stands for the rotation angle by which the media needs to be rotated to correct the skew. For computing the sinogram, the rotation correction module 102B can choose angles in pre-defined degree steps (for example, 5 degrees) starting from 0 to 180 degrees. In step 405, the rotation correction module 102B can rotate the image by the rotation angle. After this rotation, the image skew may be corrected, or an image which is a 180-degree rotated version of the skew corrected image is obtained. In step 406, the rotation correction module 102B detects if the image is a 180-degree rotated version by applying OCR on the image to get all detected text and their corresponding bounding boxes. If more than 34% of the extracted texts are not real words, then the image is the 180-degree rotated version. The various actions in method 400B may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 4B may be omitted.
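The following sketch illustrates steps 404 to 406 using the discrete Radon transform from scikit-image and a simple dictionary test for the 34% threshold; the unsharp masking step is omitted for brevity, and the variance-based scoring of the sinogram is an assumption about how the "highest response" could be measured.

import numpy as np
from skimage.transform import radon


def estimate_rotation(gray: np.ndarray, step_deg: float = 5.0) -> float:
    # Steps 404-405: de-mean the image, compute a sinogram over angles taken
    # in pre-defined steps from 0 to 180 degrees, and pick the angle with the
    # strongest response (scored here by projection variance).
    demeaned = gray.astype(float) - gray.mean()
    angles = np.arange(0.0, 180.0, step_deg)
    sinogram = radon(demeaned, theta=angles, circle=False)
    return float(angles[np.argmax(sinogram.var(axis=0))])


def looks_upside_down(words: list[str], vocabulary: set[str]) -> bool:
    # Step 406: after OCR, treat the image as the 180-degree rotated version
    # if more than 34% of the extracted words are not real words.
    if not words:
        return False
    unknown = sum(w.lower() not in vocabulary for w in words)
    return unknown / len(words) > 0.34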
[0040] FIG. 4C depicts the process of performing text head detection-based rotation correction (400C). Embodiments herein can be applied to any document, irrespective of whether it contains at least one person’s face. In step 407, the rotation correction module 102B calculates an initial alignment of the rotated bounding boxes over the text contents in the image given by a text detection deep learning model. The rotation correction module 102B can calculate the initial alignment using the x and y coordinates of the detected bounding boxes. Each bounding box can have four coordinates, and the rotation correction module 102B uses these four coordinates to calculate the angle. In an example, consider that 40 bounding boxes have been detected; the rotation correction module 102B can calculate an alignment angle for each bounding box and take the average of all the alignment angles to determine the initial alignment angle. In step 408, the rotation correction module 102B performs a first rotation correction based on this alignment in either a clockwise or an anti-clockwise direction, which results in the skew corrected image or its 180-degree inverted version. In step 409, the rotation correction module 102B performs a resolution between the skew corrected image and its 180-degree inverted version using an advanced text detection model. In an embodiment herein, the text detection model can also predict the head position of detected text regions in the image. The head position is that side of the bounding box which corresponds to the starting position of the text. For example, for a text content in the English language in a skew corrected image, the head position corresponds to the left side of the bounding box, and for its 180-degree inverted version, the head position corresponds to the right side of the bounding box. The various actions in method 400C may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 4C may be omitted.
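Step 407 can be illustrated with the small sketch below, which averages a per-box alignment angle computed from the corner coordinates of each rotated bounding box; the corner ordering (leading edge first) is an assumption.

import math


def box_angle(box: list[tuple[float, float]]) -> float:
    # Angle of the box's leading edge (first two corner points) with respect
    # to the x-axis, in degrees.
    (x1, y1), (x2, y2) = box[0], box[1]
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def initial_alignment(boxes: list[list[tuple[float, float]]]) -> float:
    # Step 407: e.g. 40 detected boxes yield 40 alignment angles, whose mean
    # is taken as the initial alignment angle of the document.
    return sum(box_angle(b) for b in boxes) / len(boxes)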
[0041] FIG. 5 depicts the process of detecting the text contents present in the document. The text detection module 102C can detect the text contents present in the document. For detecting the text contents, the text detection module 102C can use a deep learning-based text detection model. Embodiments herein use rotated bounding boxes to detect text(s) that are present at arbitrary angles. For detecting text oriented in any direction, the text detection module 102C can handle rotated bounding boxes. With this approach, embodiments herein can detect text contents that are inclined at an angle in a media, and can accurately extract text from an image with any orientation, without restricting the input to a horizontally aligned image document. Along with this, embodiments herein can learn to detect text head(s). The learning components corresponding to text head detection can predict the side of the box which is the text head. Hence, along with the orientation of the text, embodiments herein can learn the direction of the text in the form of its starting position.
[0042] The text detection module 102C can use a convolutional feature extractor module 501 to get feature maps of the input image in the first stage. The convolutional feature extractor module 501 can be of any depth in terms of the convolution or pooling layers. To handle text contents of varying size and scale, the convolutional feature extractor module 501 can generate a plurality of feature maps of varying sizes. Feature maps of multiple scales can help in identifying text areas that vary over a large range of scales. In an embodiment herein, the convolutional feature extractor module 501 can generate four feature maps. The convolutional feature extractor module 501 can use one or more generated feature maps as input to a Feature Pyramid Network (FPN) module 502 to construct a pyramid of features of an image. The output of the FPN module 502 is used as input to a second stage of a detector for determining the bounding boxes.
[0043] The text detection module 102C can comprise a region proposal module 503, which can predict the bounding box coordinates with rotation angle and probability of text presence for each pixel in the feature maps. The region proposal module 503 can also predict a probability of text head or text starting position for each pixel.
[0044] The region proposal module 503 can pass the generated proposals along with the corresponding feature maps to an ROI Align module 504, to get proposal-level features. The proposal-level features are used as input to a prediction module 505. The prediction module 505 can predict the probability of the presence of text in that proposal-level feature. The prediction module 505 can predict fine-grained bounding box coordinates along with rotation angle. The prediction module 505 can predict the text head side of the bounding box that corresponds to the starting position of the text.
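The per-pixel outputs described for the region proposal and prediction stages (text probability, box coordinates with a rotation angle, and a text head probability) can be pictured with the schematic PyTorch sketch below; the channel layout and layer sizes are illustrative assumptions, not the disclosed network.

import torch
import torch.nn as nn


class OrientedTextHead(nn.Module):
    # Predicts, for each pixel of a feature map: the probability of text, a box
    # parameterised as (x, y, w, h, angle), and the probability of a text head.
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.text_prob = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.box_reg = nn.Conv2d(in_channels, 5, kernel_size=1)
        self.head_prob = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feature_map: torch.Tensor):
        return (torch.sigmoid(self.text_prob(feature_map)),
                self.box_reg(feature_map),
                torch.sigmoid(self.head_prob(feature_map)))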
[0045] During the training phase, losses from the proposals and the prediction module 505 can be minimized by an optimizer module 506, using a Stochastic Gradient Descent (SGD) optimizer with varying learning rates. Embodiments herein can use an adaptive learning rate decay policy for training the network end to end.
[0046] During a training phase, the input to the optimizer module 506 is labelled data, wherein the labels comprise the coordinates of each word in the input media. The coordinates of each word comprise four distinct points in the Cartesian coordinate system. For text head detection, the labels also contain which two coordinates among the total of four correspond to the head region of the text.
[0047] Note that the steps disclosed in FIG. 5 can also be done along with text head detection-based rotation correction (as depicted in FIG. 4C). That is, the rotation aware bounding box can also be used to do the skew correction with text head detection capability. FIG. 6 depicts an example output of orientation and direction aware text detection.
[0048] FIG. 7 depicts a process of extracting text from the detected bounding boxes. The text extraction module 102D can extract texts from the detected bounding boxes for the recognition of the corresponding text contents. For text extraction, the text extraction module 102D can use a deep learning-based method. The extraction model comprises a trainable convolutional feature extraction module 701 (which uses a trainable convolutional feature extraction model) and a trainable Bidirectional Long Short Term Memory (Bi-LSTM) based sequence classifier 702 combined with beam search decoding, which further comprises a classification component. The classification component comprises N units, where N = 1 + vocabulary length.
[0049] The convolutional feature extraction module 701 takes media as input and outputs a feature sequence. The feature sequence is fed to the Bi-LSTM based sequence classifier 702 for classification. The output of the Bi-LSTM based sequence classifier 702 (i.e., the classification) is provided to a Connectionist Temporal Classification (CTC) beam search decoder module 703, which can extract the final predicted texts from the input media during inference.
[0050] During the training phase, an optimization module 704 can optimize the model parameters using a CTC loss and an SGD-based optimizer. During the prediction phase, a CTC beam search decoder is used to get the predicted texts. For training the model, the optimization module 704 can compute the CTC loss using ground truth values and the outputs of the prediction module 505. The optimization module 704 can minimize the CTC loss using an SGD-based optimizer with adaptive learning rates. The optimization module 704 can be trained on a custom dataset. During the training phase, the input to the optimization module 704 can comprise labelled data, wherein the labelled data can comprise word images and the corresponding word text.
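A compact sketch of the recognition model of FIG. 7 is given below: a trainable convolutional feature extractor feeding a Bi-LSTM sequence classifier whose classification layer has N = 1 + vocabulary-length units, trained with CTC loss. The layer dimensions are assumptions, and the beam search decoder is only indicated.

import torch
import torch.nn as nn


class WordRecognizer(nn.Module):
    def __init__(self, vocab_len: int, hidden: int = 256):
        super().__init__()
        # Trainable convolutional feature extractor (module 701).
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        # Bi-LSTM based sequence classifier (module 702).
        self.seq = nn.LSTM(128, hidden, bidirectional=True, batch_first=True)
        # Classification component with N = 1 + vocabulary length units
        # (the extra unit is the CTC blank symbol).
        self.classifier = nn.Linear(2 * hidden, 1 + vocab_len)

    def forward(self, word_image: torch.Tensor) -> torch.Tensor:
        f = self.features(word_image)                 # (B, C, H, W)
        f = f.mean(dim=2).permute(0, 2, 1)            # collapse height -> (B, W, C)
        out, _ = self.seq(f)
        return self.classifier(out).log_softmax(-1)   # per-timestep log-probabilities


# Training uses a CTC loss with an SGD-based optimizer; at inference, a CTC
# beam search decoder (module 703) turns the log-probabilities into text.
ctc_loss = nn.CTCLoss(blank=0)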
[0051] FIGs. 8A, 8B, 8C, and 8D depict a model of the overall process of layout detection and attribute classification. Detecting the document layout is important to understand the underlying structure and the positions of the required values of attributes. For detecting the document layout, the layout detection module 102E can use a Bidirectional Encoder Representations from Transformers (BERT) based model for attribute classification. For attribute classification, the layout detection module 102E can consider the locations of text in the image and the text background. To help the model understand the underlying layout, the layout detection module 102E can feed both the extracted text and the text bounding boxes to the BERT model to get BERT embeddings for the classification model. The layout detection module 102E can use trainable embedding tables to extract embeddings for text, text position, text token type, text bounding boxes, and text images (in terms of X, Y co-ordinates, width embeddings, height embeddings, and so on). The layout detection module 102E can sum the embeddings related to text and text positions elementwise to get the summed embeddings for the next step.
[0052] The layout detection module 102E can use the summed embeddings as an input to the BERT model, which is an attention-based sequence encoding model. The final outputs given by the BERT model encoding can be summed elementwise with text image embeddings by the layout detection module 102E, to get a final representation of the inputs. In an embodiment herein, the layout detection module 102E can use a CNN feature extractor to obtain the text image encodings that are summed elementwise with the final outputs given by the BERT model encoding. The layout detection module 102E can pass this representation to a dropout and dense layer to get the final classification results from which named attribute extraction is enabled.
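The embedding-summation structure of FIGs. 8A-8D can be sketched as follows; bert_encoder and image_cnn stand in for the pretrained BERT model and CNN feature extractor, and the embedding table sizes, dropout rate, and quantization of box coordinates are assumptions.

import torch
import torch.nn as nn


class LayoutAttributeClassifier(nn.Module):
    def __init__(self, bert_encoder, image_cnn, vocab_size: int,
                 num_labels: int, dim: int = 768):
        super().__init__()
        # Trainable embedding tables for text tokens and box geometry.
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.x_emb = nn.Embedding(1024, dim)
        self.y_emb = nn.Embedding(1024, dim)
        self.w_emb = nn.Embedding(1024, dim)
        self.h_emb = nn.Embedding(1024, dim)
        self.bert = bert_encoder    # attention-based sequence encoder
        self.image_cnn = image_cnn  # CNN feature extractor for text images
        self.head = nn.Sequential(nn.Dropout(0.1), nn.Linear(dim, num_labels))

    def forward(self, tokens, x, y, w, h, word_images):
        # Elementwise sum of text and position/box embeddings (input to BERT).
        summed = (self.token_emb(tokens) + self.x_emb(x) + self.y_emb(y)
                  + self.w_emb(w) + self.h_emb(h))
        encoded = self.bert(summed)
        # Elementwise sum with text image embeddings, then dropout + dense layer.
        fused = encoded + self.image_cnn(word_images)
        return self.head(fused)  # per-token attribute classification logits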
[0053] During the training phase, the input for training can be labelled data, where each word is assigned a particular category based on the output requirements and the location of the word with respect to the image coordinates. The embeddings obtained from the BERT model and the CNN model are summed elementwise and fed to the classification layer.
[0054] The layout detection module 102E can be trained in three stages:
1) The layout detection module 102E can pretrain the BERT model using embeddings of text and text bounding box locations as inputs. During training, the layout detection module 102E can follow a token masking-based training process (a sketch of this masking is given after this list). Note that only the text tokens of the text sequence input are masked by the layout detection module 102E, while the embeddings related to the text sequence bounding boxes are kept unmasked. The layout detection module 102E can predict the masked token using a classification layer and a cross entropy loss.
2) The layout detection module 102E can pretrain the CNN feature extractor model using a large dataset of images sourced from the web. The layout detection module 102E can train the model as a classifier using a cross entropy loss. The layout detection module 102E can use an SGD based optimizer for training, along with extensive data augmentation. For CNN feature extraction, the layout detection module 102E can use a convolutional deep learning based model.
3) The layout detection module 102E can keep the pretrained BERT and CNN feature extractor model parameters fixed and train the final task specific classification layer. The layout detection module 102E can train the final task specific classification layer using a cross entropy loss, an SGD based optimizer, and extensive data augmentations. The data augmentation is used by the layout detection module 102E only to change the input media; the text present in the media remains unchanged.
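As an illustration of the first pretraining stage, the sketch below masks only the text tokens (leaving the bounding-box inputs untouched) and produces labels suitable for a cross entropy loss that ignores unmasked positions; the masking rate, mask token id, and ignore index are assumptions.

import torch


def mask_text_tokens(tokens: torch.Tensor, mask_id: int, rate: float = 0.15):
    # Select a random subset of text token positions to mask; the bounding-box
    # embeddings are not touched here, matching the description above.
    mask = torch.rand_like(tokens, dtype=torch.float) < rate
    # Labels: original token at masked positions, -100 (ignored by cross
    # entropy) everywhere else.
    labels = torch.where(mask, tokens, torch.full_like(tokens, -100))
    masked = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    return masked, labels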
[0055] FIGs. 9A, 9B, 9C, and 9D depict example input media and an example output, respectively, wherein the output is in the form of key-value pairs.
[0056] Embodiments herein can provide output at the level of named attribute extraction; i.e., instead of a mere OCR output, embodiments herein provide output in the form of key-value pairs where the keys are well defined tags over the texts. If these tags are not explicitly defined in the images, they are taken as user input and integrated into the solution.
[0057] The state-of-the-art deep learning components as disclosed herein can learn the entire document semantics instead of only the text contents. For example, in the case of an ID card that has a structure in which the photo of the person is at a particular location, the name and other details are to its right, the address or office location is in the bottom portion, etc., these visual cues are also learned implicitly as part of the learning of text extraction.
[0058] Embodiments herein can be generalized to documents of any template, resolution, scale, or illumination, or to documents with background artefacts, etc. Embodiments herein disclose an accurate and fool proof skew correction mechanism. The text recognition component, as disclosed herein, can work for any language and is capable of giving outputs at the word level. Embodiments herein avoid the process of deep learning based semantic segmentation. Embodiments herein can work with any document template and are not dependent on a specific document structure.
[0059] The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the network elements. The network elements shown in FIG. 1 include blocks which can be at least one of a hardware device, or a combination of a hardware device and a software module.
[0060] The embodiments disclosed herein describe systems and methods for extracting named attribute(s) from document media. Therefore, it is understood that the scope of the protection is extended to such a program, and in addition to a computer readable means having a message therein, such computer readable storage means contain program code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The method is implemented in at least one embodiment through or together with a software program written in, e.g., Very high speed integrated circuit Hardware Description Language (VHDL) or another programming language, or implemented by one or more VHDL or software modules being executed on at least one hardware device. The hardware device can be any kind of portable device that can be programmed. The device may also include means which could be, e.g., hardware means like an ASIC, or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. The method embodiments described herein could be implemented partly in hardware and partly in software. Alternatively, the invention may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[0061] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of embodiments and examples, those skilled in the art will recognize that the embodiments and examples disclosed herein can be practiced with modification within the spirit and scope of the embodiments as described herein.
Claims:
STATEMENT OF CLAIMS
We claim:
1. A method (200) for detecting text in a media, the method comprising:
pre-processing (201), by an electronic device (100), the media, wherein pre-processing comprises improving overall quality of a document in the media and detecting one or more regions of interest (ROIs) in the document;
correcting (202), by the electronic device (100), an orientation of the document by performing at least one of rotation and skew correction;
detecting (203), by the electronic device (100), one or more text contents present in the document using a deep learning-based detection model, wherein a bounding box is created around the one or more detected texts;
extracting (204), by the electronic device (100), the one or more text contents from the bounding boxes using a deep learning-based method, wherein the one or more text contents are extracted using a trainable convolutional feature extractor, a trainable Bidirectional Long Short Term Memory (Bi-LSTM) based sequence classifier, and beam search decoding; and
detecting and classifying (205), by the electronic device (100), a layout of the document comprising an underlying structure of the document and positions and classification of values of attributes present in the document, wherein attribute values are the one or more text contents that have been detected in the document, and a Bidirectional Encoder Representations from Transformers (BERT) based model is used for classifying the attributes.
2. The method, as claimed in claim 1, wherein pre-processing the media comprises:
converting, by the electronic device (100), the document into an image format;
enhancing, by the electronic device (100), quality of the media using a deep learning-based image quality improvement method; and
detecting, by the electronic device (100), at least one ROI in the image using a deep learning-based object detection method, wherein
the at least one ROI for the OCR processes comprises of at least one area in the image that contains pixels of the document in which the area corresponding to external background is removed;
the at least one ROI is specified using coordinates of the document with respect to a coordinate system of the document; and
the at least one ROI is in form of at least one bounding box.
3. The method, as claimed in claim 1, wherein correcting the orientation of the document further comprises performing at least one of a face image-based rotation correction, a text based rotation correction, and a text head detection-based rotation correction.
4. The method, as claimed in claim 3, wherein performing the face image-based rotation correction comprises:
obtaining, by the electronic device (100), coordinates of one or more features of a person’s face in terms of X & Y co-ordinates using at least one face landmark detection method, on detecting a person’s face in the document;
rotating, by the electronic device (100), the media by 180 degrees, on detecting that the media is inverted, wherein determining if the media is inverted by comparing Y co-ordinates of an eye of the person with Y co-ordinates of at least one of a nose and mouth of the person;
rotating, by the electronic device (100), by a first angle, wherein the first angle is an elevation angle with respect to a right eye of the person and is determined by constructing a right angled triangle using two eye points.
5. The method, as claimed in claim 3, wherein performing the text based rotation correction comprises:
performing, by the electronic device (100), an unsharp masking and de-meaning operation on the pre-processed media, which comprises obtaining a sinogram of the processed image by choosing angles in pre-defined degree steps starting from 0 to 180 degrees using a discrete radon transformation, wherein an angle corresponding to the highest response in the sinogram stands for a rotation angle by which the media needs to get rotated to correct the skew, if the document does not comprise face of at least one person;
rotating, by the electronic device (100), the image by the rotation angle; and
detecting, by the electronic device (100), if the image is a 180-degree rotated version by applying Optical Character Recognition (OCR) on the image to get all detected text and their corresponding bounding boxes.
6. The method, as claimed in claim 3, wherein performing the text head detection-based rotation correction comprises:
calculating, by the electronic device (100), an initial alignment of at least one rotated bounding boxes over the at least one text contents in the image using a text detection deep learning model;
performing, by the electronic device (100), a first rotation correction based on the initial alignment in either a clockwise or an anti-clockwise direction, wherein the first rotation correction results in a skew corrected image or a 180-degree inverted version; and
performing, by the electronic device (100), a resolution between the skew corrected image and the 180-degree inverted version using a text detection model, wherein the text detection model can predict a head position of detected text regions in the media.
7. The method, as claimed in claim 1, wherein detecting the one or more text contents present in the document comprises:
obtaining, by the electronic device (100), at least one feature map using a convolutional feature extractor model, wherein the convolutional feature extractor model is of any depth in terms of convolution or pooling layers;
constructing, by the electronic device (100), a pyramid of features of the document using the one or more generated feature maps;
predicting, by the electronic device (100), coordinates of the bounding box, a rotation angle, probability of text presence for each pixel in the feature maps and a probability of text head or text starting position for each pixel;
getting, by the electronic device (100), at least one proposal-level feature; and
predicting, by the electronic device (100), a probability of the presence of text in each proposal-level feature, coordinates of a fine-grained bounding box coordinates with rotation angle, and a text head side of the bounding box that corresponds to a starting position of the at least one text content;
wherein losses from the at least one proposal-level feature and predictions are minimized using a Stochastic Gradient Descent (SGD) optimizer with varying learning rates, during a training phase, wherein an input during the training phase comprises labelled data, wherein the labels comprise of coordinates of each word in the input media and the coordinate of each word comprises of four distinct points in a Cartesian coordinate system; and two coordinates among the coordinates corresponding to the head of the text.
8. The method, as claimed in claim 1, wherein extracting the one or more text contents comprises:
generating, by the electronic device (100), a feature sequence using a trainable convolutional feature extraction model;
performing classification, by the electronic device (100), of the feature sequence using a trainable Bidirectional Long Short Term Memory (Bi-LSTM) based sequence classifier, wherein a classification component comprises of N units, where N = 1 + vocabulary length; and
extracting, by the electronic device (100), predicted texts from the media using the classification and a Connectionist Temporal Classification (CTC) beam search decoder;
wherein a CTC loss and an SGD-based optimizer is used for optimizing model parameters.
9. The method, as claimed in claim 1, wherein detecting the layout of the document comprises using the Bidirectional Encoder Representations from Transformers (BERT) based model by considering locations of text in the media and text background.
10. An electronic device (100) comprising:
a processor (101); and
a memory (102) coupled to the processor (101);
wherein the processor (101) is configured to:
pre-process the media, wherein pre-processing comprises improving overall quality of a document in the media and detecting one or more regions of interest (ROIs) in the document;
correcting an orientation of the document by performing at least one of rotation and skew correction;
detecting one or more text contents present in the document using a deep learning-based detection model, wherein a bounding box is created around the one or more detected texts;
extracting the one or more text contents from the bounding boxes using a deep learning-based method, wherein the one or more text contents are extracted using a trainable convolutional feature extractor, a trainable Bidirectional Long Short Term Memory (Bi-LSTM) based sequence classifier, and beam search decoding; and
detecting and classifying a layout of the document comprising an underlying structure of the document and positions and classification of values of attributes present in the document, wherein attribute values are the one or more text contents that have been detected in the document, and a Bidirectional Encoder Representations from Transformers (BERT) based model is used for classifying the attributes.
11. The electronic device, as claimed in claim 10, wherein the processor (101) is configured to pre-process the media by:
converting the document into an image format;
enhancing quality of the media using a deep learning-based image quality improvement method; and
detecting at least one ROI in the image using a deep learning-based object detection method, wherein
the at least one ROI for the OCR processes comprises of at least one area in the image that contains pixels of the document in which the area corresponding to external background is removed;
the at least one ROI is specified using coordinates of the document with respect to a coordinate system of the document; and
the at least one ROI is in form of at least one bounding box.
12. The electronic device, as claimed in claim 10, wherein the processor (101) is configured to correct the orientation of the document by performing at least one of a face image-based rotation correction, a text based rotation correction, and a text head detection-based rotation correction.
13. The electronic device, as claimed in claim 12, wherein the processor (101) is configured to perform the face image-based rotation correction by:
obtaining coordinates of one or more features of a person’s face in terms of X & Y co-ordinates using at least one face landmark detection method, on detecting a person’s face in the document;
rotating the media by 180 degrees, on detecting that the media is inverted, wherein determining if the media is inverted by comparing Y co-ordinates of an eye of the person with Y co-ordinates of at least one of a nose and mouth of the person;
rotating by a first angle, wherein the first angle is an elevation angle with respect to a right eye of the person and is determined by constructing a right angled triangle using two eye points.
14. The electronic device, as claimed in claim 12, wherein the processor (101) is configured to perform the text based rotation correction by:
performing an unsharp masking and de-meaning operation on the pre-processed media, which comprises obtaining a sinogram of the processed image by choosing angles in pre-defined degree steps starting from 0 to 180 degrees using a discrete radon transformation, wherein an angle corresponding to the highest response in the sinogram stands for a rotation angle by which the media needs to get rotated to correct the skew, if the document does not comprise face of at least one person;
rotating the image by the rotation angle; and
detecting if the image is a 180-degree rotated version by applying Optical Character Recognition (OCR) on the image to get all detected text and their corresponding bounding boxes.
15. The electronic device, as claimed in claim 12, wherein the processor (101) is configured to perform the text head detection-based rotation correction by:
calculating an initial alignment of at least one rotated bounding boxes over the at least one text contents in the image using a text detection deep learning model;
performing a first rotation correction based on the initial alignment in either a clockwise or an anti-clockwise direction, wherein the first rotation correction results in a skew corrected image or a 180-degree inverted version; and
performing a resolution between the skew corrected image and the 180-degree inverted version using a text detection model, wherein the text detection model can predict a head position of detected text regions in the media.
16. The electronic device, as claimed in claim 10, wherein the processor (101) is configured to detect the one or more text contents present in the document by:
obtaining at least one feature map using a convolutional feature extractor model, wherein the convolutional feature extractor model is of any depth in terms of convolution or pooling layers;
constructing a pyramid of features of the document using the one or more generated feature maps;
predicting coordinates of the bounding box, a rotation angle, probability of text presence for each pixel in the feature maps and a probability of text head or text starting position for each pixel;
getting at least one proposal-level feature; and
predicting a probability of the presence of text in each proposal-level feature, coordinates of a fine-grained bounding box coordinates with rotation angle, and a text head side of the bounding box that corresponds to a starting position of the at least one text content;
wherein losses from the at least one proposal-level feature and predictions are minimized using a Stochastic Gradient Descent (SGD) optimizer with varying learning rates, during a training phase, wherein an input during the training phase comprises labelled data, wherein the labels comprise of coordinates of each word in the input media and the coordinate of each word comprises of four distinct points in a Cartesian coordinate system; and two coordinates among the coordinates corresponding to the head of the text.
17. The electronic device, as claimed in claim 10, wherein the processor (101) is configured to extract the one or more text contents by:
generating a feature sequence using a trainable convolutional feature extraction model;
performing classification of the feature sequence using a trainable Bidirectional Long Short Term Memory (Bi-LSTM) based sequence classifier, wherein a classification component comprises of N units, where N = 1 + vocabulary length; and
extracting predicted texts from the media using the classification and a Connectionist Temporal Classification (CTC) beam search decoder;
wherein a CTC loss and an SGD-based optimizer is used for optimizing model parameters.
18. The electronic device, as claimed in claim 10, wherein the processor (101) is configured to detect the layout of the document by using the Bidirectional Encoder Representations from Transformers (BERT) based model by considering locations of text in the media and text background.

Documents

Application Documents

# Name Date
1 202341049079-PROOF OF RIGHT [20-07-2023(online)].pdf 2023-07-20
2 202341049079-POWER OF AUTHORITY [20-07-2023(online)].pdf 2023-07-20
3 202341049079-FORM 1 [20-07-2023(online)].pdf 2023-07-20
4 202341049079-DRAWINGS [20-07-2023(online)].pdf 2023-07-20
5 202341049079-COMPLETE SPECIFICATION [20-07-2023(online)].pdf 2023-07-20
6 202341049079-FORM-9 [21-07-2023(online)].pdf 2023-07-21
7 202341049079-FORM 3 [21-07-2023(online)].pdf 2023-07-21
8 202341049079-FORM 18 [21-07-2023(online)].pdf 2023-07-21
9 202341049079-ENDORSEMENT BY INVENTORS [26-07-2023(online)].pdf 2023-07-26
10 202341049079-FER.pdf 2025-03-12

Search Strategy

1 202341049079searchE_06-05-2024.pdf