
A System And Method For Validating Optical Character Recognition Output

Abstract: A method and a device (102) for creating and training machine learning models is disclosed. In an embodiment, a method for training a machine learning model for identifying entities from data includes creating (302) a first plurality of clusters from a first plurality of data samples in a first dataset (204) and a second plurality of clusters from a second plurality of data samples in a second dataset (206). The method further includes determining (304) a rank for each of the first plurality of clusters and a rank for each of the second plurality of clusters (306). The method includes retraining (308) the machine learning model using at least one of the first plurality of clusters weighted based on the rank determined for each of the first plurality of clusters and at least one of the second plurality of clusters weighted based on the rank determined for each of the second plurality of clusters.


Patent Information

Application #
Filing Date
31 December 2018
Publication Number
27/2020
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

L&T TECHNOLOGY SERVICES LIMITED
DLF IT SEZ PARK, 2nd FLOOR - BLOCK 3, MOUNT POONAMALLEE ROAD, RAMAPURAM, CHENNAI - 600 089. TAMILNADU, INDIA.

Inventors

1. MRIDUL BALARAMAN
B 206,SVS PALMS 2,CHINNAPANHALLI MAIN ROAD, DODANEKUNDI BANGALORE, KARNATAKA, INDIA-560037
2. MADHUSUDAN SINGH
B-603, AJMERA STONE PARK, 1st CROSS, ELECTRONIC CITY-1 BANGALORE, KARNATAKA, INDIA-560100
3. AMIT KUMAR
FLAT NO.206, SANSITA PRIDE, VIDYANAGAR LAYOUT, THANISANDRA MAIN ROAD, BANGALORE, KARNATAKA, INDIA-560077
4. MRINAL GUPTA
78-A VASANT VIHAR, TALAB TILLO, JAMMU, JAMMU and KASHMIR, INDIA-180002

Specification

FIELD OF INVENTION
The invention relates to optical character recognition (OCR) and, more particularly, to validating OCR results.
BACKGROUND
There are many existing techniques for OCR. However, achieving a high level of accuracy in OCR is challenging: errors arise in digital image data from various factors such as noise, image resolution and font size. In existing techniques, training a machine learning algorithm for OCR involves adding many variants of each character to the image database. The OCR models need to be continuously updated to handle new character variants, but pushing every character image to the database would overload it. A method is therefore needed to selectively add variant images to the database; selecting and adding variants manually is a time-consuming process that also requires an expert.
The present invention is directed to overcoming one or more of the problems as set forth above.
SUMMARY OF THE INVENTION
Exemplary embodiments of the invention disclose a system and method for validating optical
character recognition (OCR) output. According to an embodiment, the disclosed system and
method receives an image obtained by applying an OCR process to a text of a document. Each
character in the received image is recognized using a plurality of character models in a database

and the top N matching results for the character are retrieved from the database. Each character model is trained on a first set and a second set for a character to provide a probability of occurrence of the character. The top N matching results are validated for each of the recognized characters based on a user input. For each of the validated characters, on determining that the validated character and the user input are the same and the probability of occurrence corresponding to the character model of the validated character is low, the character image is added to the first set of the character model in the database; and on determining that the validated character and the user input are different and the probability of occurrence corresponding to the character model of the validated character is high, the character image is added to the second set of the character model in the database.
BRIEF DESCRIPTION OF DRAWINGS
Other objects, features, and advantages of the invention will be apparent from the following
description when read with reference to the accompanying drawings. In the drawings, wherein
like reference numerals denote corresponding parts throughout the several views:
Figure 1 illustrates a method for validating OCR output, according to an embodiment of the
invention; and
Figure 2 illustrates a system for validating OCR output, according to an embodiment of the
invention.
DETAILED DESCRIPTION OF DRAWINGS
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention. It includes

various specific details to assist in that understanding but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
According to embodiments of the invention, a system and method for validating OCR output is disclosed.
Figure 1 illustrates a method for validating OCR output, according to an embodiment of the invention.
At step 102, an image is received. The image is obtained by applying an OCR process to a text of a document. According to an embodiment, the image may correspond to one or more images. The OCR process may be performed by an OCR-capable software application.
At step 104, the OCR process identifies individual characters from the image received in step 102. According to an embodiment, the identification of individual characters may include identifying individual character positions from the image. Each individual character in the received image is recognized using a plurality of character models stored in a database, and the top N matching results for the character are retrieved from the corresponding character model. In other words, the top N matching results correspond to the N most probable matching results for the character. By way of an example, if N corresponds to 3, then the top 3 most probable results for a character are selected from the corresponding character model. The top N results may be

presented on a user interface. The user may select the correct option from the top N results or enter the correct option if the correct option is not present in the top N results.
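The top-N retrieval described above can be sketched in Python. This is a minimal illustration, not the patented implementation: the per-character scoring functions and the choice N = 3 are assumptions, since the specification does not give code.

```python
# Sketch of top-N retrieval over per-character models (illustrative only;
# the scoring functions and N = 3 are assumptions, not from the patent).

def top_n_matches(char_image, character_models, n=3):
    """Score the image against every character model and keep the N best."""
    scores = {label: model(char_image) for label, model in character_models.items()}
    # Sort labels by descending probability and keep the first n.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy stand-in models: each "model" maps an image to a probability of its character.
models = {
    "a": lambda img: 0.91,
    "o": lambda img: 0.72,
    "e": lambda img: 0.40,
    "c": lambda img: 0.15,
}
print(top_n_matches("dummy-image", models))  # [('a', 0.91), ('o', 0.72), ('e', 0.40)]
```

In a real system each model would be a trained classifier; here lambdas stand in so the ranking logic is visible in isolation.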
According to an embodiment, at least one character model for each character is stored in a database. According to an embodiment, a character may be a digit, a symbol or a letter. According to an embodiment, the database may be an image database. The database may contain images of a plurality of characters of different font, style, thickness and size. The images in the database may be labelled. That is, each image may indicate the character it refers to.
According to an exemplary embodiment, each character model may be based on a machine learning algorithm created and trained for each character. According to another embodiment, the character model may be a mathematical representation of the character image. According to an exemplary embodiment, the machine learning algorithm may be based on multi-layer feed-forward neural networks (MLFFNNs).
Each character model is trained on a first set and a second set for a character to provide a probability of occurrence of the character in the received image.
The character model may output a probability close to 1 when the character is identified and close to 0 when it is not. According to an embodiment, the first set contains a set of images of the character associated with a respective character model. According to an embodiment, the first set may correspond to a true set of the character model. In other words, the true set represents a set of images belonging to the character. That is, for the character model of 'a', the true set is all images of different style, font, thickness, etc. that are labelled as 'a'. The character model is trained to provide a high probability for images in the true set.

According to an embodiment, the second set contains a set of images of characters other than the character associated with the respective character model. The second set may correspond to a false set of the character model. The false set may represent a set of images that are not labelled as the character. According to an exemplary embodiment, for a character model of, say, 'a', the false set is all images in the data set that are not labelled as 'a'. Hence, images labelled as 'b', 'c', etc. fall under the false set of 'a'.
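The true-set/false-set split described above follows directly from the labels. A minimal sketch, assuming the labelled database is represented as a list of (image, label) pairs (a representation chosen for illustration, not specified by the patent):

```python
# Sketch of building the true set and false set for one character model
# from a labelled image collection (the data layout is an assumption).

def split_true_false(labelled_images, character):
    """True set: images labelled as `character`; false set: every other image."""
    true_set = [img for img, label in labelled_images if label == character]
    false_set = [img for img, label in labelled_images if label != character]
    return true_set, false_set

dataset = [("a1.png", "a"), ("a2.png", "a"), ("b1.png", "b"), ("c1.png", "c")]
true_set, false_set = split_true_false(dataset, "a")
print(true_set)   # ['a1.png', 'a2.png']
print(false_set)  # ['b1.png', 'c1.png']
```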
At step 106, the top N matching results for each of the recognized individual characters are validated based on a user input. The top N matching results may be presented on a user interface. The user may select the correct option from the top N results or enter the correct option if the correct option is not present in the top N results, by providing an input on a user interface.
According to an exemplary embodiment, the OCR process may be based on multi-layer feed-forward neural networks (MLFFNNs) as the machine learning algorithms. Thus, for each recognized character, the OCR technique determines the top N (e.g., top 4) matching results, or MLFFNN output values, and provides an option to the user to select the correct one from the top 4 results or enter the correct option if the correct option is not present in the top 4 results.
While validating the recognized characters, a determination is made if the recognized character and user input are same or different. Further, based on the determination, the machine learning algorithm is trained for more accurate character recognition.

If the recognized character and the user input are the same and the probability of occurrence corresponding to the character model of the recognized character is low, then the character image is added to the first set of the character model in the database. In other words, the OCR process is able to recognize the character correctly. However, since the probability of occurrence of the character from the corresponding character model is low, the character image is added to the true set of the character model, indicating that the OCR process needs to be trained to provide a higher rank to the character image corresponding to the recognized character.
According to an embodiment, adding the character image to the first set of the character model may generate a new variant of the validated character in the image database. According to another embodiment, the new variant of the validated character may include a variation of the character in font, style, thickness or size.
If the recognized character and the user input are different and the probability of occurrence corresponding to the character model of the recognized character is high, then the character image is added to the second set of the character model in the database. In other words, the OCR process is unable to recognize the character correctly even though the probability of occurrence of the character from the corresponding character model is high. Hence, the character image is added to the false set of the character model, indicating that the OCR process needs to be trained to recognize the variation of the character.
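The two update rules above can be sketched as a single routing function. The numeric thresholds are assumptions: the patent says only "low" and "high" probability, so the cut-offs 0.3 and 0.7 are illustrative.

```python
# Sketch of the validation-driven routing rule (thresholds are assumed;
# the patent specifies only "low" and "high" probability).

LOW, HIGH = 0.3, 0.7  # assumed cut-offs

def route_character(recognized, user_input, probability, char_image, model_sets):
    """Decide whether the character image joins the true set or the false set."""
    if recognized == user_input and probability < LOW:
        # Correct recognition but weak model confidence: reinforce the true set.
        model_sets[recognized]["true"].append(char_image)
        return "true"
    if recognized != user_input and probability > HIGH:
        # Confident but wrong recognition: record a counter-example.
        model_sets[recognized]["false"].append(char_image)
        return "false"
    return "none"  # no update in the remaining cases

sets = {"a": {"true": [], "false": []}}
print(route_character("a", "a", 0.2, "img1", sets))  # true
print(route_character("a", "o", 0.9, "img2", sets))  # false
```

Retraining the character model on the grown true and false sets then closes the loop described in the specification.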
Figure 2 illustrates an exemplary system for validating OCR output, according to an embodiment of the invention.

The disclosed system 200 may include a data store 202 and one or more processors 204 configured to provide a user interface component 206, an image processing component 208, a training component 210 and a validation component 212. According to an embodiment, the one or more processors may correspond to a microcontroller.
The data store 202 may be configured to store at least one character model for each character. According to an exemplary embodiment, each character model may be based on a machine learning algorithm created and trained for each character. According to another embodiment, the character model may be a mathematical representation of the character image. A character may be a digit, a symbol or a letter. The data store 202 may be configured to store images. The data store 202 may contain labelled images of a plurality of characters of different font, style, thickness and size. Each labelled image may refer to a character.
The user interface component 206 may be configured to receive an image. The image may be obtained by applying an OCR process to a text of a document. According to an embodiment, the image may correspond to one or more images. The OCR process may be performed by an OCR-capable software application.
The image processing component 208 may be configured to recognize each individual character in the image that is received through the user interface component 206. The individual characters may be recognized using a plurality of character models that are stored in the data store 202, and the top N matching results for the character may be retrieved from the corresponding character model.

The training component 210 may be configured to train each character model in the data store 202. Each character model may be trained on a first set and a second set for a character to provide a probability of occurrence of the character in the received image. According to an embodiment, the first set contains a set of images of the character associated with a respective character model. In other words, the first set may correspond to a true set of the character model. According to another embodiment, the second set contains a set of images of characters other than the character associated with the respective character model. In other words, the second set may correspond to a false set of the character model.
The validation component 212 may be configured to validate the top N matching results for each of the individual characters recognized by the image processing component 208. The validation component 212 may perform validation based on a user input received through the user interface component 206. The validation component 212 validates by determining whether the character recognized by the image processing component 208 and the user input received through the user interface component 206 are the same or different.
If the recognized character and the user input are the same and the probability of occurrence of the character from the corresponding character model is low, then the validation component 212 may add the character image to the first set of the character model in the data store 202. If the recognized character and the user input are different and the probability of occurrence of the character from the corresponding character model is high, then the validation component 212 may add the character image to the second set of the character model in the data store 202.
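How the data store and validation component of Figure 2 might interact can be sketched as follows. Class and method names are illustrative assumptions; the patent describes components, not an API.

```python
# Sketch of the Figure 2 wiring: a data store holding per-character sets
# and a validation component that updates them (names are assumptions).

class DataStore:
    """Holds character models and their true/false image sets."""
    def __init__(self):
        self.models = {}  # character -> trained model (omitted here)
        self.sets = {}    # character -> {"true": [...], "false": [...]}

class ValidationComponent:
    def __init__(self, store, low=0.3, high=0.7):  # thresholds assumed
        self.store = store
        self.low, self.high = low, high

    def validate(self, recognized, user_input, probability, char_image):
        sets = self.store.sets.setdefault(recognized, {"true": [], "false": []})
        if recognized == user_input and probability < self.low:
            sets["true"].append(char_image)    # correct but low-confidence
        elif recognized != user_input and probability > self.high:
            sets["false"].append(char_image)   # confident but wrong

store = DataStore()
ValidationComponent(store).validate("a", "a", 0.1, "img.png")
print(store.sets["a"]["true"])  # ['img.png']
```

A training component would then periodically retrain each model on its updated sets, which is the role the specification assigns to component 210.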
The present invention is not limited to OCR. The present invention may be used in conjunction
with a wide variety of applications for detecting misrecognized words. The present invention

is capable of detecting misrecognized words on the basis of a spell checking algorithm, a grammar checking algorithm, a natural language algorithm, or any combination thereof. Further, the present invention is compatible with numerous other applications such as, but not limited to, predictive analysis, image classification, sentiment analysis or any other machine learning classification models.
In the drawings and specification there has been set forth preferred embodiments of the invention, and although specific terms are employed, these are used in a generic and descriptive sense only and not for purposes of limitation. Changes in the form and the proportion of parts, as well as in the substitution of equivalents, are contemplated as circumstances may suggest or render expedient without departing from the spirit or scope of the invention.
Throughout the various contexts described in this disclosure, the embodiments of the invention further encompass computer apparatus, computing systems and machine-readable media configured to carry out the foregoing systems and methods. In addition to an embodiment consisting of specifically designed integrated circuits or other electronics, the present invention may be conveniently implemented using a conventional general purpose or a specialized digital computer or microprocessor programmed according to the teachings of the present disclosure, as will be apparent to those skilled in the computer art.
The embodiments of the present invention may be provided as a computer program product that may include a machine-readable medium, having stored thereon instructions which may be used to program a computer (or other electronic devices) to perform a process according to the present invention. The machine-readable medium may include, but is not limited to, floppy
diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical

disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other types of media/machine-readable media suitable for storing electronic instructions.
Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art. The invention may also be implemented by the preparation of application specific integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.

1. A method for validating optical character recognition (OCR) output, the method comprising:
receiving an image obtained by applying an OCR process to a text of a document;
recognizing each character in the received image using a plurality of character models in a database and retrieving top N matching results for the character from a corresponding character model, each character model being trained on a first set and a second set for a character to provide a probability of occurrence of the character;
validating the top N matching results for each of the recognized characters based on a user input; and
for each of the validated characters,
on determining a validated character and user input being same and a probability of occurrence of the validated character from the corresponding character model is low, adding the character image to the first set of the character model in the database; and
on determining a validated character and user input being different and the probability of occurrence of the validated character from the corresponding character model is high, adding the character image to the second set of the character model in the database.
2. The method as claimed in claim 1, wherein recognizing each individual character in the received image includes identifying each character position from the received image.

3. The method as claimed in claim 1, wherein adding the character image to the first set of the character model generates a new variant of the validated character in the image database.
4. The method as claimed in claim 3, wherein the new variant of the validated character includes a variation of the character in font, style, thickness or size.

Documents

Application Documents

# Name Date
1 Form2 Title Page_Provisional_31-12-2018.pdf 2018-12-31
2 Form 5_As Filed_31-12-2018.pdf 2018-12-31
3 Form 3_As Filed_31-12-2018.pdf 2018-12-31
4 Form 1_As Filed_31-12-2018.pdf 2018-12-31
5 Drawing_As Filed_31-12-2018.pdf 2018-12-31
6 Description Provisional_As Filed_31-12-2018.pdf 2018-12-31
7 Correspondence by Applicant_As Filed_31-12-2018.pdf 2018-12-31
8 Claims_As Filed_31-12-2018.pdf 2018-12-31
9 Form-1_After Filing_12-03-2019.pdf 2019-03-12
10 Correspondence by Applicant_Form-1_12-03-2019.pdf 2019-03-12
11 Form 3_After Provisional_09-10-2019.pdf 2019-10-09
12 Form 2(Title Page)_Complete_09-10-2019.pdf 2019-10-09
13 Form 1_After Provisional_09-10-2019.pdf 2019-10-09
14 Drawing_After Provisional_09-10-2019.pdf 2019-10-09
15 Description(Complete)_After Provisional_09-10-2019.pdf 2019-10-09
16 Correspondence by Applicant_Request for Certified Copy_09-10-2019.pdf 2019-10-09
17 Correspondence by Applicant_Complete_09-10-2019.pdf 2019-10-09
18 Claims_After Provisional_09-10-2019.pdf 2019-10-09
19 Abstract_After Provisional_09-10-2019.pdf 2019-10-09
20 abstract_201841050033.jpg 2019-10-18
21 Form18_Normal Request_06-11-2019.pdf 2019-11-06
22 Correspondence by Applicant_Form 18_06-11-2019.pdf 2019-11-06
23 201841050033-FER.pdf 2021-10-17
24 201841050033-Correspondence-14-12-2021.pdf 2021-12-14
25 201841050033-SEQUENCE LISTING [05-04-2022(online)].txt 2022-04-05
26 201841050033-OTHERS [05-04-2022(online)].pdf 2022-04-05
27 201841050033-FER_SER_REPLY [05-04-2022(online)].pdf 2022-04-05
28 201841050033-CORRESPONDENCE [05-04-2022(online)].pdf 2022-04-05
29 201841050033-COMPLETE SPECIFICATION [05-04-2022(online)].pdf 2022-04-05
30 201841050033-CLAIMS [05-04-2022(online)].pdf 2022-04-05
31 201841050033-Correspondence_Requesting to Update Email ID_30-06-2022.pdf 2022-06-30
32 201841050033-RELEVANT DOCUMENTS [11-02-2025(online)].pdf 2025-02-11
33 201841050033-MARKED COPIES OF AMENDEMENTS [11-02-2025(online)].pdf 2025-02-11
34 201841050033-FORM 13 [11-02-2025(online)].pdf 2025-02-11
35 201841050033-AMENDED DOCUMENTS [11-02-2025(online)].pdf 2025-02-11
36 201841050033-US(14)-HearingNotice-(HearingDate-04-12-2025).pdf 2025-11-12
37 201841050033-FORM-26 [14-11-2025(online)].pdf 2025-11-14
38 201841050033-Correspondence to notify the Controller [14-11-2025(online)].pdf 2025-11-14

Search Strategy

1 search201841050033E_26-05-2021.pdf