
System And Method For Optimized Training Of A Neural Network Model For Data Extraction

Abstract: A system (100) and method for optimized training of a neural network model for data extraction is provided. The present invention provides for generating a pre-determined format type of an input document by extracting words from the input document along with coordinates corresponding to each word. Further, N-grams are generated by analyzing neighboring words associated with entity text present in the pre-determined format type of the document based on a threshold measurement criterion and combining the extracted neighboring words in a pre-defined order. Further, the generated N-grams are compared with the coordinates corresponding to the words for labelling the N-grams with a field name. Further, each word in the N-gram identified by the field name is tokenized in accordance with the location of each of the words relative to a named entity (NE) for assigning a token marker. Lastly, the neural network model is trained based on the tokenized words in the N-gram identified by the token marker. The trained neural network model is implemented for extracting data from documents.
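The first stage described in the abstract can be pictured with a minimal sketch. All names and the record layout below are illustrative assumptions; the claims only require that each extracted word be stored together with its page coordinates (the actual pre-determined format is an XML file, per claim 2).

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left coordinate of the word on the page
    y: float  # top coordinate of the word on the page

def words_to_records(words):
    """Stage 1 (sketch): emit each extracted word together with its
    coordinates, standing in for the XML-based pre-determined format."""
    return [{"text": w.text, "coords": (w.x, w.y)} for w in words]

words = [Word("Invoice", 10.0, 10.0), Word("No:", 60.0, 10.0), Word("INV-42", 95.0, 10.0)]
records = words_to_records(words)
```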


Patent Information

Filing Date: 16 February 2023
Publication Number: 34/2024
Publication Type: INA
Invention Field: COMPUTER SCIENCE

Applicants

Cognizant Technology Solutions India Pvt. Ltd.
Techno Complex, No. 5/535, Old Mahabalipuram Road, Okkiyam Thoraipakkam, Chennai - 600 097, Tamil Nadu, India

Inventors

1. Saravanan Radhakrishnan
155 / 1, Udaiyar Street, Arasampattu Village, Chetpet Taluk, Vandavasi, Tiruvannamalai – 606 807, Tamil Nadu, India
2. Rahul Agarwal
103, Shiv Sadan Apartment, 112/209 B-1 Swaroop Nagar, Kanpur – 208 002, Uttar Pradesh, India

Specification

We claim:
1. A system (100) for optimized training of a neural network model for data extraction, the system (100) comprising:
a memory (108) storing program instructions;
a processor (106) executing instructions stored in the memory (108); and
a data extraction engine (104) executed by the processor (106) and configured to:
generate a pre-determined format type of an input document by extracting words from the input document along with coordinates corresponding to each word, wherein the extracted words include entity text and neighboring words associated with the entity text;
generate N-grams by analyzing the neighboring words associated with the entity text present in the predetermined format type of the document based on a threshold measurement criterion and combining the extracted neighboring words in a pre-defined order;
compare the generated N-grams with the coordinates corresponding to the words for labelling the N-grams with a field name;
tokenize each word in the N-gram identified by the field name in accordance with a location of each of the words relative to a named entity (NE) for assigning a token marker; and
train a neural network model based on the tokenized words in the N-gram identified by the token marker, wherein the trained neural network model is implemented for extracting data from documents.
2. The system (100) as claimed in claim 1, wherein the input document is a structured or a semi-structured document and is in a pre-defined format including a Portable Document Format (PDF), or an image format, and wherein the predetermined format type of the input document is an XML file.
3. The system (100) as claimed in claim 1, wherein the data extraction engine (104) comprises an annotation unit (114) executed by the processor (106) and configured to render a Graphical User Interface (GUI) via an input unit (110) for carrying out an annotation operation on the predetermined format type of the document, and wherein the annotation unit (114) generates annotation data by copying text from a relevant field in the predetermined format type or the pre-defined format of the document and selecting a text field corresponding to the relevant field for pasting the copied data by using a rubber band technique, and wherein the rubber band technique is used to determine coordinates corresponding to the text field with the copied data, which is stored by the annotation unit (114) in a database (126), and wherein the annotation data is used for generation of N-grams.
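Claim 3's rubber-band selection ultimately yields the coordinates of the selected text field. As a hedged sketch (the claim does not specify the geometry), a drag between two corner points can be normalised into a bounding box as shown; `rubber_band_box` is a hypothetical helper, not part of the claimed system:

```python
def rubber_band_box(start, end):
    """Normalise a drag from `start` to `end` (x, y) corner points into a
    (left, top, right, bottom) bounding box, regardless of drag direction."""
    (x0, y0), (x1, y1) = start, end
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))

# A drag up and to the left still yields a well-ordered box.
box = rubber_band_box((120, 80), (40, 30))
```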
4. The system (100) as claimed in claim 1, wherein the data extraction engine (104) comprises an N-gram generation and labelling unit (116) executed by the processor (106) and configured to determine entity text by analyzing neighboring words corresponding to the entity text from the left, top, right and bottom of the entity text present in the predetermined format type for generating N-grams, and wherein the neighboring words of the entity text are analyzed by applying a threshold distance measurement criterion from the entity text.
5. The system (100) as claimed in claim 4, wherein the N-gram generation and labelling unit (116) changes the threshold distance to a value of -1, in the event it is determined that the neighboring words associated with the entity text are not available at the threshold distance from the entity text, to avoid blank spaces between the neighboring words and the entity text.
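One way to read claim 5's rule: when no neighboring word lies within the threshold distance, the threshold collapses to the -1 sentinel so empty space is not swept into the N-gram. A hypothetical illustration (the helper name and input format are assumptions):

```python
def effective_threshold(neighbor_distances, threshold):
    """Return the threshold to use: the configured value if at least one
    neighboring word lies within it, else the -1 sentinel from claim 5."""
    if any(d <= threshold for d in neighbor_distances):
        return threshold
    return -1
```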
6. The system (100) as claimed in claim 4, wherein the N-gram generation and labelling unit (116) extracts five neighboring words from the left of the entity text, three neighboring words from the top of the entity text, two neighboring words from right of the entity text, and three neighboring words from bottom of the entity text.
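The per-side word counts in claim 6 (five left, three top, two right, three bottom) can be sketched as a simple truncation step. Only the counts come from the claim; the data layout and helper name are assumptions:

```python
# Per-side word counts taken verbatim from claim 6.
NEIGHBOR_COUNTS = {"left": 5, "top": 3, "right": 2, "bottom": 3}

def take_neighbors(neighbors_by_side):
    """Keep at most the claimed number of words per side, assuming each
    side's list is already sorted nearest-first."""
    return {side: words[:NEIGHBOR_COUNTS[side]]
            for side, words in neighbors_by_side.items()}

gram_parts = take_neighbors({
    "left": ["total", "amount", "due", "on", "invoice", "extra"],
    "top": ["summary"],
    "right": ["USD", "only", "ignored"],
    "bottom": [],
})
```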
7. The system (100) as claimed in claim 4, wherein the N-gram generation and labelling unit (116) is configured to determine one or more entity text features from the pre-determined format type of the documents for generation of N-grams, and wherein the text features include position of the entity text in the predetermined format type of the documents, and format of the entity text.
8. The system (100) as claimed in claim 7, wherein the position-based features of the entity text include the entity text present at top-left of the document, the entity text present at top-right of the document, the entity text present at bottom-left of the document or the entity text present at bottom-right of the document.
9. The system (100) as claimed in claim 1, wherein an N-gram generation and labelling unit (116) is configured to label the generated N-grams by carrying out a matching operation, and wherein the matching operation is carried out by identifying the N-grams using a field value associated with a particular field in the predetermined format type document and the one or more coordinates along with annotation data, which are stored in the database (126).
10. The system (100) as claimed in claim 9, wherein based on determination of a match, the N-gram generation and labelling unit (116) labels the N-grams with a field name, and wherein the N-gram generation and labelling unit (116) is configured to label all the unmatched N-grams as ‘others’.
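Claims 9 and 10 together reduce to a lookup: an N-gram that matches stored annotation data receives that field's name, and everything else is labelled 'others'. A minimal sketch, collapsing the value-plus-coordinate match into a single dictionary lookup (an assumption):

```python
def label_ngram(field_value, annotations):
    """`annotations` maps an annotated field value to its field name,
    standing in for the value-and-coordinate match against the database."""
    return annotations.get(field_value, "others")

annotations = {"INV-42": "invoice_number", "2023-02-16": "filing_date"}
```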
11. The system (100) as claimed in claim 1, wherein the data extraction engine (104) comprises a post processing unit (118) executed by the processor (106) and configured to process the predetermined format type of the documents for converting all the numeric values present in the document to a machine-readable format.
12. The system (100) as claimed in claim 1, wherein the data extraction engine (104) comprises a tokenization unit (120) executed by the processor (106) and configured to process the generated and labelled N-grams for carrying out a tokenization operation for tokenizing each N-gram and classifying each token with the token marker.
13. The system (100) as claimed in claim 1, wherein the data extraction engine (104) comprises a data extraction model training unit (122) executed by the processor (106) and configured to receive the tokenized words from a tokenization unit (120) to train the neural network model.
14. The system (100) as claimed in claim 13, wherein the data extraction model training unit (122) is configured to convert the tokenized words in the N-gram into sequences and each tokenized word is assigned an integer, and wherein the sequence is padded such that each tokenized word is of a same length, and wherein the padded sequence of words is used as an input for training the neural network model for data extraction.
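Claim 14's sequence step (assign each tokenized word an integer, then pad to a fixed length) is comparable in spirit to Keras' `pad_sequences` utility. A self-contained sketch, with the helper name and vocabulary scheme as assumptions:

```python
def to_padded_sequence(tokens, vocab, max_len, pad_id=0):
    """Assign each token an integer id (growing `vocab` in place) and pad
    the resulting sequence to `max_len`, as claim 14 describes."""
    ids = [vocab.setdefault(t, len(vocab) + 1) for t in tokens]
    return (ids + [pad_id] * max_len)[:max_len]

vocab = {}
seq = to_padded_sequence(["invoice", "no", "inv-42"], vocab, 5)
```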
15. The system (100) as claimed in claim 1, wherein the data extraction engine (104) comprises a model accuracy improvement unit (124) executed by the processor (106) and configured to communicate with a data extraction model training unit (122) for improving accuracy of the neural network model in order to effectively extract data from documents, and wherein the model accuracy improvement unit (124) is configured to receive inputs relating to the extracted data from a data extraction unit (130) for improving accuracy of the neural network model.
16. The system (100) as claimed in claim 15, wherein the model accuracy improvement unit (124) is configured to generate negative N-grams by carrying out a comparison operation with the annotation data, and wherein the model accuracy improvement unit (124) is configured to extract data fields present in the predetermined format type of the document using the trained neural network model and compare the extracted data fields with the annotated data stored in the database (126).
17. The system (100) as claimed in claim 16, wherein in the event the model accuracy improvement unit (124) determines that field values associated with the data fields do not match with the annotated data, then the N-grams that are generated are determined as negative N-grams and are labelled as ‘others’, and wherein one or more criteria are employed for determining the match including determining if distance between extracted fields and annotated data is minimal or within a pre-defined threshold, then the N-grams are not labelled as ‘others’, and if one or more keywords associated with the fields in the document are present in negative N-grams, then such N-grams are not considered as ‘others’, thereby avoiding any positive N-grams being labelled as ‘others’, and wherein the model accuracy improvement unit (124) is configured to up-scale the generated N-grams for each field except for N-grams labelled as ‘others’.
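Claim 17's negative-N-gram logic can be condensed into a small decision function. This is only our reading of the claim: the two rescue criteria (small distance to the annotated data, presence of a field keyword) stop an unmatched N-gram from being labelled 'others'; that a rescued N-gram keeps the field label is our assumption.

```python
def classify_ngram(matches_annotation, distance, threshold, has_field_keyword, field_label):
    """Return 'others' only for genuine negative N-grams, per claim 17."""
    if matches_annotation:
        return field_label
    # Rescue criteria: close to the annotated data, or contains a field keyword.
    if distance <= threshold or has_field_keyword:
        return field_label  # assumption: rescued N-grams keep the field label
    return "others"
```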
18. The system (100) as claimed in claim 17, wherein the model accuracy improvement unit (124) is configured to determine a confidence score for the field values present in the document based on predictions made by the neural network model, and wherein in the event the neural network model predicts two or more values for a particular field in the document, then the model accuracy improvement unit (124) is configured to filter the values based on the confidence score, and wherein the model accuracy improvement unit (124) considers the values with maximum confidence score.
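Claim 18's filter is a straightforward arg-max over the confidence score. A sketch with an assumed (value, confidence) prediction format:

```python
def pick_field_value(predictions):
    """Given (value, confidence) pairs predicted for one field, keep the
    value with the maximum confidence score, per claim 18."""
    return max(predictions, key=lambda p: p[1])[0]

best = pick_field_value([("INV-41", 0.62), ("INV-42", 0.91)])
```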
19. A method for optimized training of a neural network model for data extraction, the method being implemented by a processor (106) executing instructions stored in a memory (108), the method comprising:
generating a pre-determined format type of an input document by extracting words from the input document along with coordinates corresponding to each word, wherein the extracted words include entity text and neighboring words associated with the entity text;
generating N-grams by analyzing the neighboring words associated with the entity text present in the predetermined format type of the document based on a threshold measurement criterion and combining the extracted neighboring words in a pre-defined order;
comparing the generated N-grams with the coordinates corresponding to the words for labelling the N-grams with a field name;
tokenizing each word in the N-gram identified by the field name in accordance with a location of each of the words relative to a named entity (NE) for assigning a token marker; and
training a neural network model based on the tokenized words in the N-gram identified by the token marker, wherein the trained neural network model is implemented for extracting data from documents.
20. The method as claimed in claim 19, wherein a Graphical User Interface (GUI) is rendered for carrying out an annotation operation on the predetermined format type of the document, and wherein annotation data is generated by copying text from a relevant field in the predetermined format type or the pre-defined format of the document and selecting a text field corresponding to the relevant field for pasting the copied data by using a rubber band technique, and wherein the rubber band technique is used to determine coordinates corresponding to the text field with the copied data, and wherein annotation data is used for generation of N-grams.
21. The method as claimed in claim 20, wherein entity text is determined by analyzing neighboring words corresponding to the entity text from the left, top, right and bottom of the entity text present in the predetermined format type for generating N-grams, and wherein the neighboring words of the entity text are analyzed by applying a threshold distance measurement criterion from the entity text.
22. The method as claimed in claim 21, wherein in the event it is determined that the neighboring words associated with the entity text are not available at the threshold distance from the entity text, then the threshold distance is changed to a value of -1 to avoid blank spaces between the neighboring words and the entity text.

23. The method as claimed in claim 22, wherein one or more entity text features are determined from the pre-determined format type of the documents for generation of N-grams, and wherein the text features include position of the entity text in the predetermined format type of the documents, and format of the entity text.
24. The method as claimed in claim 19, wherein the generated N-grams are labelled by carrying out a matching operation, and wherein the matching operation is carried out by identifying the N-grams using a field value associated with a particular field in the predetermined format type document.
25. The method as claimed in claim 24, wherein based on determination of a match, the N-grams are labelled with a field name, and wherein all the unmatched N-grams are labelled as ‘others’.
26. The method as claimed in claim 19, wherein the tokenized words in the N-gram are converted into sequences and each tokenized word is assigned an integer, and wherein the sequence is padded such that each tokenized word is of a same length, and wherein the padded sequence of words is used as an input for training the neural network model for data extraction.
27. The method as claimed in claim 24, wherein negative N-grams are generated by carrying out a comparison operation, and wherein data fields present in the predetermined format type of the document are extracted using the trained neural network model and compared with the annotated data.
28. The method as claimed in claim 27, wherein in the event it is determined that field values associated with the data fields do not match with the annotated data, then the N-grams that are generated are determined as negative N-grams and are labelled as ‘others’, and wherein one or more criteria are employed for determining the match including determining if distance between the extracted fields and annotated data is minimal or within a pre-defined threshold, then the N-grams are not labelled as ‘others’, and if one or more keywords associated with the fields in the document are present in negative N-grams, then such N-grams are not considered as ‘others’, thereby avoiding any positive N-grams being labelled as ‘others’, and wherein the generated N-grams are up-scaled for each field except for N-grams labelled as ‘others’.

Documents

Application Documents

# Name Date
1 202341010420-STATEMENT OF UNDERTAKING (FORM 3) [16-02-2023(online)].pdf 2023-02-16
2 202341010420-REQUEST FOR EXAMINATION (FORM-18) [16-02-2023(online)].pdf 2023-02-16
3 202341010420-PROOF OF RIGHT [16-02-2023(online)].pdf 2023-02-16
4 202341010420-POWER OF AUTHORITY [16-02-2023(online)].pdf 2023-02-16
5 202341010420-FORM 18 [16-02-2023(online)].pdf 2023-02-16
6 202341010420-FORM 1 [16-02-2023(online)].pdf 2023-02-16
7 202341010420-FIGURE OF ABSTRACT [16-02-2023(online)].pdf 2023-02-16
8 202341010420-DRAWINGS [16-02-2023(online)].pdf 2023-02-16
9 202341010420-COMPLETE SPECIFICATION [16-02-2023(online)].pdf 2023-02-16
10 202341010420-Request Letter-Correspondence [17-02-2023(online)].pdf 2023-02-17
11 202341010420-Covering Letter [17-02-2023(online)].pdf 2023-02-17
12 202341010420-FORM 3 [31-07-2023(online)].pdf 2023-07-31
13 202341010420-FER.pdf 2025-06-04
14 202341010420-FORM 3 [23-06-2025(online)].pdf 2025-06-23
15 202341010420-RELEVANT DOCUMENTS [03-07-2025(online)].pdf 2025-07-03
16 202341010420-FORM 13 [03-07-2025(online)].pdf 2025-07-03

Search Strategy

1 202341010420_SearchStrategyNew_E_SearchReportE_29-05-2025.pdf