Abstract: This disclosure relates generally to document processing, and more particularly to a method and system for determining structural blocks of a document. In one embodiment, the method may include extracting text from the document, the text including text lines. The method may further include generating a feature vector for each of the text lines, the feature vector for a text line including a set of feature values for a set of corresponding features in that text line. The method may further include creating an input matrix for each of the text lines, the input matrix for a text line including the feature vectors of a set of neighboring text lines together with the feature vector of the text line itself. The method may further include determining a structural block tag for each of the text lines based on the corresponding input matrix using a machine learning model. Figure 3
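The feature-vector and input-matrix steps summarized above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `TextLine` fields, the five features chosen, the window sizes, and the zero-padding at document boundaries are all assumptions made for illustration, and the machine learning model itself is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """Hypothetical container for one extracted text line."""
    text: str
    x0: float          # left coordinate of the line
    y0: float          # vertical coordinate of the line
    font_size: float
    is_bold: bool


def feature_vector(line: TextLine) -> List[float]:
    """Assumed feature set: position, font size, bold flag, line length."""
    return [line.x0, line.y0, line.font_size,
            float(line.is_bold), float(len(line.text))]


def input_matrix(vectors: List[List[float]], index: int,
                 before: int = 2, after: int = 2) -> List[List[float]]:
    """Stack the vectors of `before` preceding lines, the line at `index`,
    and `after` following lines. Out-of-range neighbors are zero-padded
    (an assumption; the source does not say how boundaries are handled)."""
    width = len(vectors[index])
    rows = []
    for i in range(index - before, index + after + 1):
        if 0 <= i < len(vectors):
            rows.append(list(vectors[i]))
        else:
            rows.append([0.0] * width)
    return rows


lines = [
    TextLine("1. Introduction", 72.0, 700.0, 14.0, True),
    TextLine("This disclosure relates to document processing.", 72.0, 680.0, 10.0, False),
    TextLine("It covers structural blocks.", 72.0, 668.0, 10.0, False),
]
vectors = [feature_vector(line) for line in lines]
# One matrix per text line, each with (before + 1 + after) rows.
matrices = [input_matrix(vectors, i, before=1, after=1) for i in range(len(vectors))]
```

Each matrix in `matrices` would then be passed to a trained model that returns one structural block tag per text line.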
Claims: WE CLAIM
1. A method of determining structural blocks of a document, the method comprising:
extracting, by a document analysis device, text from the document, wherein the text comprises a plurality of text lines;
generating, by the document analysis device, a feature vector for each of the plurality of text lines, wherein the feature vector for the text line comprises a set of feature values for a set of corresponding features in the text line;
creating, by the document analysis device, an input matrix for each of the plurality of text lines, wherein the input matrix for the text line comprises a set of feature vectors corresponding to a set of neighboring text lines along with the text line; and
determining, by the document analysis device, a structural block tag for each of the plurality of text lines based on the corresponding input matrix using a machine learning model.
2. The method of claim 1, further comprising:
receiving an image document; and
performing an optical text recognition on the image document to generate the document.
3. The method of claim 1, wherein extracting the text comprises applying a text extraction tool with a pre-defined or a dynamic threshold on the document to extract the plurality of text lines from the document.
4. The method of claim 1, wherein extracting the text comprises filtering at least one of a header, a footer, or a table line.
5. The method of claim 1, wherein the set of corresponding features comprises at least one of a positional feature, a font feature, a count feature, a spacing feature, or a semantic feature.
6. The method of claim 1, wherein the feature value for the corresponding feature comprises at least one of positional coordinates of the text line, a font size in the text line, a font weight in the text line, a font relative weight in the text line, one or more flags for one or more font styles, or a length of the text line.
7. The method of claim 1, wherein the set of neighboring text lines comprises a pre-defined set of preceding text lines and a pre-defined set of successive text lines.
8. The method of claim 1, wherein the machine learning model is a sequence to sequence machine learning model.
9. The method of claim 1, wherein the structural block tag comprises a paragraph tag or a non-paragraph tag, wherein the paragraph tag comprises one of a footnote tag, a paragraph start tag, a single line tag, or a paragraph tag, and wherein the non-paragraph tag comprises one of a section header tag, a list tag, a table of content tag, a title tag, or an un-classified tag.
10. The method of claim 1, further comprising training the machine learning model with training data, wherein the training data comprises a set of texts extracted from a set of documents, and, for each of the set of texts, a set of text lines with a corresponding set of feature vectors and a corresponding set of pre-defined structural block tags.
11. A system for determining structural blocks of a document, the system comprising:
a document analysis device comprising at least one processor and a computer-readable medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
extract text from the document, wherein the text comprises a plurality of text lines;
generate a feature vector for each of the plurality of text lines, wherein the feature vector for the text line comprises a set of feature values for a set of corresponding features in the text line;
create an input matrix for each of the plurality of text lines, wherein the input matrix for the text line comprises a set of feature vectors corresponding to a set of neighboring text lines along with the text line; and
determine a structural block tag for each of the plurality of text lines based on the corresponding input matrix using a machine learning model.
12. The system of claim 11, wherein extracting the text comprises applying a text extraction tool with a pre-defined or a dynamic threshold on the document to extract the plurality of text lines from the document.
13. The system of claim 11, wherein extracting the text comprises filtering at least one of a header, a footer, or a table line.
14. The system of claim 11, wherein the set of corresponding features comprises at least one of a positional feature, a font feature, a count feature, a spacing feature, or a semantic feature.
15. The system of claim 11, wherein the feature value for the corresponding feature comprises at least one of positional coordinates of the text line, a font size in the text line, a font weight in the text line, a font relative weight in the text line, one or more flags for one or more font styles, or a length of the text line.
16. The system of claim 11, wherein the set of neighboring text lines comprises a pre-defined set of preceding text lines and a pre-defined set of successive text lines.
17. The system of claim 11, wherein the machine learning model is a sequence to sequence machine learning model.
18. The system of claim 11, wherein the structural block tag comprises a paragraph tag or a non-paragraph tag, wherein the paragraph tag comprises one of a footnote tag, a paragraph start tag, a single line tag, or a paragraph tag, and wherein the non-paragraph tag comprises one of a section header tag, a list tag, a table of content tag, a title tag, or an un-classified tag.
19. The system of claim 11, wherein the operations further comprise training the machine learning model with training data, wherein the training data comprises a set of texts extracted from a set of documents, and, for each of the set of texts, a set of text lines with a corresponding set of feature vectors and a corresponding set of pre-defined structural block tags.
Dated this 16th day of February 2018
R Ramya Rao
IN/PA-1607
Of K&S Partners
Agent for the Applicant
Description: TECHNICAL FIELD
This disclosure relates generally to document processing, and more particularly to a method and system for determining structural blocks of a document.
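
As a rough illustration of what "structural blocks" could mean in practice, the sketch below encodes the tag taxonomy recited in claims 9 and 18 and groups tagged text lines into blocks. The enum names and the grouping rule (a line tagged as paragraph continues a block opened by a paragraph start line; every other tag opens a new block) are assumptions for illustration and are not stated in the source.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Tag(Enum):
    # Paragraph tags
    FOOTNOTE = "footnote"
    PARAGRAPH_START = "paragraph_start"
    SINGLE_LINE = "single_line"
    PARAGRAPH = "paragraph"
    # Non-paragraph tags
    SECTION_HEADER = "section_header"
    LIST = "list"
    TABLE_OF_CONTENT = "table_of_content"
    TITLE = "title"
    UNCLASSIFIED = "unclassified"


PARAGRAPH_TAGS = {Tag.FOOTNOTE, Tag.PARAGRAPH_START, Tag.SINGLE_LINE, Tag.PARAGRAPH}


def is_paragraph_tag(tag: Tag) -> bool:
    """True for tags in the paragraph group, False for the non-paragraph group."""
    return tag in PARAGRAPH_TAGS


def group_blocks(tagged: List[Tuple[str, Tag]]) -> List[List[str]]:
    """Group (text, tag) pairs into blocks under the assumed rule:
    a PARAGRAPH line extends the block of a preceding PARAGRAPH_START
    or PARAGRAPH line; any other tag starts a new block."""
    blocks: List[List[str]] = []
    prev: Optional[Tag] = None
    for text, tag in tagged:
        if tag is Tag.PARAGRAPH and prev in (Tag.PARAGRAPH_START, Tag.PARAGRAPH):
            blocks[-1].append(text)
        else:
            blocks.append([text])
        prev = tag
    return blocks


tagged_lines = [
    ("Document Title", Tag.TITLE),
    ("1. Introduction", Tag.SECTION_HEADER),
    ("This disclosure relates", Tag.PARAGRAPH_START),
    ("to document processing.", Tag.PARAGRAPH),
    ("A short note.", Tag.SINGLE_LINE),
]
blocks = group_blocks(tagged_lines)
```
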
| # | Name | Date |
|---|---|---|
| 1 | 201841006073-STATEMENT OF UNDERTAKING (FORM 3) [16-02-2018(online)].pdf | 2018-02-16 |
| 2 | 201841006073-REQUEST FOR EXAMINATION (FORM-18) [16-02-2018(online)].pdf | 2018-02-16 |
| 3 | 201841006073-POWER OF AUTHORITY [16-02-2018(online)].pdf | 2018-02-16 |
| 4 | 201841006073-FORM 18 [16-02-2018(online)].pdf | 2018-02-16 |
| 5 | 201841006073-FORM 1 [16-02-2018(online)].pdf | 2018-02-16 |
| 6 | 201841006073-DRAWINGS [16-02-2018(online)].pdf | 2018-02-16 |
| 7 | 201841006073-DECLARATION OF INVENTORSHIP (FORM 5) [16-02-2018(online)].pdf | 2018-02-16 |
| 8 | 201841006073-COMPLETE SPECIFICATION [16-02-2018(online)].pdf | 2018-02-16 |
| 9 | 201841006073-REQUEST FOR CERTIFIED COPY [05-03-2018(online)].pdf | 2018-03-05 |
| 10 | 201841006073-Proof of Right (MANDATORY) [25-04-2018(online)].pdf | 2018-04-25 |
| 11 | Correspondence by Agent_Form 1_01-05-2018.pdf | 2018-05-01 |
| 12 | 201841006073-RELEVANT DOCUMENTS [27-07-2021(online)].pdf | 2021-07-27 |
| 13 | 201841006073-PETITION UNDER RULE 137 [27-07-2021(online)].pdf | 2021-07-27 |
| 14 | 201841006073-OTHERS [27-07-2021(online)].pdf | 2021-07-27 |
| 15 | 201841006073-Information under section 8(2) [27-07-2021(online)].pdf | 2021-07-27 |
| 16 | 201841006073-FORM 3 [27-07-2021(online)].pdf | 2021-07-27 |
| 17 | 201841006073-FER_SER_REPLY [27-07-2021(online)].pdf | 2021-07-27 |
| 18 | 201841006073-DRAWING [27-07-2021(online)].pdf | 2021-07-27 |
| 19 | 201841006073-CORRESPONDENCE [27-07-2021(online)].pdf | 2021-07-27 |
| 20 | 201841006073-COMPLETE SPECIFICATION [27-07-2021(online)].pdf | 2021-07-27 |
| 21 | 201841006073-CLAIMS [27-07-2021(online)].pdf | 2021-07-27 |
| 22 | 201841006073-FER.pdf | 2021-10-17 |
| 23 | 201841006073-PatentCertificate27-12-2023.pdf | 2023-12-27 |
| 24 | 201841006073-IntimationOfGrant27-12-2023.pdf | 2023-12-27 |
| 25 | 201841006073-PROOF OF ALTERATION [18-03-2024(online)].pdf | 2024-03-18 |
| 1 | searchE_27-01-2021.pdf | |