Method And System For Determining Structural Blocks Of A Document

Abstract: This disclosure relates generally to document processing, and more particularly to a method and system for determining structural blocks of a document. In one embodiment, the method may include extracting text from the document, the text including text lines. The method may further include generating a feature vector for each of the text lines, the feature vector for a text line including a set of feature values for a set of corresponding features in that text line. The method may further include creating an input matrix for each of the text lines, the input matrix for a text line including the feature vector of that text line together with the feature vectors of a set of neighboring text lines. The method may further include determining a structural block tag for each of the text lines based on the corresponding input matrix using a machine learning model. Figure 3

Patent Information

Application #:
Filing Date: 16 February 2018
Publication Number: 34/2019
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: bangalore@knspartners.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2023-12-27
Renewal Date:

Applicants

WIPRO LIMITED
Doddakannelli, Sarjapur Road, Bangalore 560035, Karnataka, India.

Inventors

1. RAGHAVENDRA HOSABETTU
#3080/3081, ‘Venkatadri Nilaya’, 2nd Main, 3rd Cross, VBHCS Layout, Banashankari 3rd Stage, Near Kattriguppa Water Tank, Bangalore 560050, Karnataka, India.
2. SNEHA SUBHASCHANDRA BANAKAR
#1452, 2nd Phase, J P Nagar, Bengaluru – 560078, Karnataka, India.

Specification

Claims: WE CLAIM
1. A method of determining structural blocks of a document, the method comprising:
extracting, by a document analysis device, text from the document, wherein the text comprises a plurality of text lines;
generating, by the document analysis device, a feature vector for each of the plurality of text lines, wherein the feature vector for the text line comprises a set of feature values for a set of corresponding features in the text line;
creating, by the document analysis device, an input matrix for each of the plurality of text lines, wherein the input matrix for the text line comprises a set of feature vectors corresponding to a set of neighboring text lines along with the text line; and
determining, by the document analysis device, a structural block tag for each of the plurality of text lines based on the corresponding input matrix using a machine learning model.
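The four steps of claim 1 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the three toy features, the zero padding at document boundaries, the helper names (`featurize`, `input_matrix`, `tag_lines`) and the rule-based `toy_model` that stands in for a trained machine learning model are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]


def featurize(line: str) -> Vector:
    """Toy feature vector: [character count, word count, starts-with-digit flag]."""
    stripped = line.strip()
    return [
        float(len(stripped)),
        float(len(stripped.split())),
        1.0 if stripped[:1].isdigit() else 0.0,
    ]


def input_matrix(vectors: Sequence[Vector], index: int,
                 before: int = 1, after: int = 1) -> Matrix:
    """Stack the vectors of `before` preceding lines, the line itself and
    `after` following lines. Positions outside the document are zero rows
    (an assumption; the claims do not say how boundaries are handled)."""
    width = len(vectors[0])
    rows: Matrix = []
    for j in range(index - before, index + after + 1):
        rows.append(list(vectors[j]) if 0 <= j < len(vectors) else [0.0] * width)
    return rows


def tag_lines(text: str, model: Callable[[Matrix], str],
              before: int = 1, after: int = 1) -> List[Tuple[str, str]]:
    """Extract non-empty lines, featurize them, build one matrix per line
    and ask the model for a structural block tag."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    vectors = [featurize(ln) for ln in lines]
    return [(ln, model(input_matrix(vectors, i, before, after)))
            for i, ln in enumerate(lines)]


def toy_model(matrix: Matrix) -> str:
    """Placeholder for a trained model. Assumes a symmetric window, so the
    middle row belongs to the line being tagged."""
    centre = matrix[len(matrix) // 2]
    short_and_numbered = centre[2] == 1.0 and centre[1] <= 4
    return "section_header" if short_and_numbered else "paragraph"


if __name__ == "__main__":
    sample = "1 Introduction\nThis disclosure relates to document processing."
    for line, tag in tag_lines(sample, toy_model):
        print(f"{tag:15s} {line}")
```

Passing the model as a callable keeps the pipeline independent of how the tag is predicted, so a trained classifier could replace `toy_model` without changing the other steps.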

2. The method of claim 1, further comprising:
receiving an image document; and
performing an optical text recognition on the image document to generate the document.

3. The method of claim 1, wherein extracting the text comprises applying a text extraction tool with a pre-defined or a dynamic threshold on the document to extract the plurality of text lines from the document.

4. The method of claim 1, wherein extracting the text comprises filtering at least one of a header, a footer, or a table line.
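The claim does not say how headers, footers or table lines are recognised. One possible heuristic, sketched below purely as an assumption, treats the first or last line of a page as a header or footer when the same text (ignoring digits such as page numbers) recurs at the page edge on most pages, and treats a line with several widely spaced or pipe-separated cells as a table line. The function names and the `min_share` and `min_columns` parameters are hypothetical.

```python
import re
from collections import Counter
from typing import List


def _normalise(line: str) -> str:
    """Collapse digit runs so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def looks_like_table_row(line: str, min_columns: int = 3) -> bool:
    """Heuristic: cells separated by two or more spaces or by '|'."""
    cells = [c for c in re.split(r"\s{2,}|\|", line.strip()) if c.strip()]
    return len(cells) >= min_columns


def filter_lines(pages: List[List[str]], min_share: float = 0.6) -> List[str]:
    """Return body lines with repeated page-edge lines and table rows removed."""
    edge_counts: Counter = Counter()
    for page in pages:
        if page:
            # A set avoids double counting on single-line pages.
            edge_counts.update({_normalise(page[0]), _normalise(page[-1])})
    # Require at least two occurrences so a one-page document keeps its edges.
    threshold = max(2, min_share * len(pages))

    kept: List[str] = []
    for page in pages:
        for pos, line in enumerate(page):
            at_edge = pos in (0, len(page) - 1)
            if at_edge and edge_counts[_normalise(line)] >= threshold:
                continue  # repeated header or footer
            if looks_like_table_row(line):
                continue  # table line
            kept.append(line)
    return kept
```

For example, with three pages that each begin with "Acme Report" and end with "Page N", only the body lines between them survive, and a row such as "Name   Qty   Price" is dropped as a table line.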

5. The method of claim 1, wherein the set of corresponding features comprises at least one of a positional feature, a font feature, a count feature, a spacing feature, or a semantic feature.

6. The method of claim 1, wherein the feature value for the corresponding feature comprises at least one of positional coordinates of the text line, a font size in the text line, a font weight in the text line, a font relative weight in the text line, one or more flags for one or more font styles, or a length of the text line.
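The feature values listed in claim 6 can be packed into a fixed-length vector per text line, as in the sketch below. The `TextLine` record, the field order, the CSS-style weight scale (400 regular, 700 bold) and the definition of relative weight as the ratio to the document's median weight are assumptions made for illustration; the claim names the quantities but not their encoding.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class TextLine:
    """Hypothetical record for one extracted text line."""
    text: str
    x0: float           # left edge of the bounding box
    y0: float           # bottom edge
    x1: float           # right edge
    y1: float           # top edge
    font_size: float
    font_weight: int    # assumed CSS-style scale: 400 regular, 700 bold
    italic: bool = False
    underline: bool = False


def feature_vectors(lines: List[TextLine]) -> List[List[float]]:
    """Build one 11-element feature vector per line."""
    if not lines:
        return []
    # Relative weight is measured against the document's median weight
    # (an assumption); fall back to 1 to avoid dividing by zero.
    base_weight = median(ln.font_weight for ln in lines) or 1
    vectors = []
    for ln in lines:
        vectors.append([
            ln.x0, ln.y0, ln.x1, ln.y1,               # positional coordinates
            ln.font_size,                             # font size
            float(ln.font_weight),                    # font weight
            ln.font_weight / base_weight,             # font relative weight
            1.0 if ln.font_weight >= 700 else 0.0,    # bold flag
            1.0 if ln.italic else 0.0,                # italic flag
            1.0 if ln.underline else 0.0,             # underline flag
            float(len(ln.text)),                      # length of the text line
        ])
    return vectors
```

Keeping the vector length fixed across lines is what allows the vectors of neighboring lines to be stacked into a rectangular input matrix later.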

7. The method of claim 1, wherein the set of neighboring text lines comprises a pre-defined set of preceding text lines and a pre-defined set of successive text lines.

8. The method of claim 1, wherein the machine learning model is a sequence to sequence machine learning model.

9. The method of claim 1, wherein the structural block tag comprises a paragraph tag or a non-paragraph tag, wherein the paragraph tag comprises one of a footnote tag, a paragraph start tag, a single line tag, or a paragraph tag, and wherein the non-paragraph tag comprises one of a section header tag, a list tag, a table of content tag, a title tag, or an un-classified tag.
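The tag taxonomy of claim 9 maps naturally onto an enumeration with two groups. The sketch below also shows one way the per-line tags might be folded back into blocks; the merging rule (a "paragraph" line continues the block opened by the preceding "paragraph start" or "paragraph" line) is an assumption about how the tags are meant to be consumed, and the identifier names are hypothetical.

```python
from enum import Enum
from typing import List, Tuple


class BlockTag(Enum):
    # Paragraph group
    FOOTNOTE = "footnote"
    PARAGRAPH_START = "paragraph_start"
    SINGLE_LINE = "single_line"
    PARAGRAPH = "paragraph"
    # Non-paragraph group
    SECTION_HEADER = "section_header"
    LIST = "list"
    TABLE_OF_CONTENT = "table_of_content"
    TITLE = "title"
    UNCLASSIFIED = "unclassified"


PARAGRAPH_GROUP = frozenset({
    BlockTag.FOOTNOTE,
    BlockTag.PARAGRAPH_START,
    BlockTag.SINGLE_LINE,
    BlockTag.PARAGRAPH,
})


def is_paragraph_tag(tag: BlockTag) -> bool:
    """True for tags in the paragraph group of claim 9."""
    return tag in PARAGRAPH_GROUP


def merge_blocks(tagged: List[Tuple[str, BlockTag]]) -> List[Tuple[BlockTag, str]]:
    """Join each PARAGRAPH line onto the paragraph block that precedes it.
    The block keeps the tag of its first line."""
    blocks: List[Tuple[BlockTag, str]] = []
    for text, tag in tagged:
        continues = (
            tag is BlockTag.PARAGRAPH
            and blocks
            and blocks[-1][0] in (BlockTag.PARAGRAPH_START, BlockTag.PARAGRAPH)
        )
        if continues:
            prev_tag, prev_text = blocks[-1]
            blocks[-1] = (prev_tag, prev_text + " " + text)
        else:
            blocks.append((tag, text))
    return blocks
```

Under this rule, a section header followed by a "paragraph start" line and a "paragraph" line yields two blocks: the header and one merged paragraph.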

10. The method of claim 1, further comprising training the machine learning model with training data, wherein the training data comprises a set of texts extracted from a set of documents, and, for each of the set of texts, a set of text lines with a corresponding set of feature vectors and a corresponding set of pre-defined structural block tags.
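The training data described in claim 10 pairs, for each document, a sequence of feature vectors with a sequence of pre-defined tags. A minimal way to arrange it for a sequence model is sketched below; the function name, the tuple layout and the integer encoding of tags are assumptions, and the model-fitting step itself is omitted because the claims do not specify an architecture.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
LabelledDoc = List[Tuple[Vector, str]]  # one (feature vector, tag) pair per text line


def build_training_set(
    documents: List[LabelledDoc],
) -> Tuple[List[List[List[float]]], List[List[int]], Dict[str, int]]:
    """Turn labelled documents into parallel input and target sequences.

    Returns (inputs, targets, tag_to_id), where inputs[d][i] is the feature
    vector of line i in document d and targets[d][i] is its tag id."""
    tag_to_id: Dict[str, int] = {}
    inputs: List[List[List[float]]] = []
    targets: List[List[int]] = []
    for doc in documents:
        xs: List[List[float]] = []
        ys: List[int] = []
        for vector, tag in doc:
            xs.append(list(vector))
            # Assign ids in order of first appearance.
            ys.append(tag_to_id.setdefault(tag, len(tag_to_id)))
        inputs.append(xs)
        targets.append(ys)
    return inputs, targets, tag_to_id
```

Keeping one sequence per document, rather than one flat list of lines, preserves line order, which a sequence model relies on.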

11. A system for determining structural blocks of a document, the system comprising:
a document analysis device comprising at least one processor and a computer-readable medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
extract text from the document, wherein the text comprises a plurality of text lines;
generate a feature vector for each of the plurality of text lines, wherein the feature vector for the text line comprises a set of feature values for a set of corresponding features in the text line;
create an input matrix for each of the plurality of text lines, wherein the input matrix for the text line comprises a set of feature vectors corresponding to a set of neighboring text lines along with the text line; and
determine a structural block tag for each of the plurality of text lines based on the corresponding input matrix using a machine learning model.

12. The system of claim 11, wherein extracting the text comprises applying a text extraction tool with a pre-defined or a dynamic threshold on the document to extract the plurality of text lines from the document.

13. The system of claim 11, wherein extracting the text comprises filtering at least one of a header, a footer, or a table line.

14. The system of claim 11, wherein the set of corresponding features comprises at least one of a positional feature, a font feature, a count feature, a spacing feature, or a semantic feature.

15. The system of claim 11, wherein the feature value for the corresponding feature comprises at least one of positional coordinates of the text line, a font size in the text line, a font weight in the text line, a font relative weight in the text line, one or more flags for one or more font styles, or a length of the text line.

16. The system of claim 11, wherein the set of neighboring text lines comprises a pre-defined set of preceding text lines and a pre-defined set of successive text lines.

17. The system of claim 11, wherein the machine learning model is a sequence to sequence machine learning model.

18. The system of claim 11, wherein the structural block tag comprises a paragraph tag or a non-paragraph tag, wherein the paragraph tag comprises one of a footnote tag, a paragraph start tag, a single line tag, or a paragraph tag, and wherein the non-paragraph tag comprises one of a section header tag, a list tag, a table of content tag, a title tag, or an un-classified tag.

19. The system of claim 11, wherein the operations further comprise training the machine learning model with training data, wherein the training data comprises a set of texts extracted from a set of documents, and, for each of the set of texts, a set of text lines with a corresponding set of feature vectors and a corresponding set of pre-defined structural block tags.

Dated this 16th day of February 2018

R Ramya Rao
IN/PA-1607
Of K&S Partners
Agent for the Applicant

Description: TECHNICAL FIELD
This disclosure relates generally to document processing, and more particularly to a method and system for determining structural blocks of a document.

Documents

Application Documents

# Name Date
1 201841006073-STATEMENT OF UNDERTAKING (FORM 3) [16-02-2018(online)].pdf 2018-02-16
2 201841006073-REQUEST FOR EXAMINATION (FORM-18) [16-02-2018(online)].pdf 2018-02-16
3 201841006073-POWER OF AUTHORITY [16-02-2018(online)].pdf 2018-02-16
4 201841006073-FORM 18 [16-02-2018(online)].pdf 2018-02-16
5 201841006073-FORM 1 [16-02-2018(online)].pdf 2018-02-16
6 201841006073-DRAWINGS [16-02-2018(online)].pdf 2018-02-16
7 201841006073-DECLARATION OF INVENTORSHIP (FORM 5) [16-02-2018(online)].pdf 2018-02-16
8 201841006073-COMPLETE SPECIFICATION [16-02-2018(online)].pdf 2018-02-16
9 201841006073-REQUEST FOR CERTIFIED COPY [05-03-2018(online)].pdf 2018-03-05
10 201841006073-Proof of Right (MANDATORY) [25-04-2018(online)].pdf 2018-04-25
11 Correspondence by Agent_Form 1_01-05-2018.pdf 2018-05-01
12 201841006073-RELEVANT DOCUMENTS [27-07-2021(online)].pdf 2021-07-27
13 201841006073-PETITION UNDER RULE 137 [27-07-2021(online)].pdf 2021-07-27
14 201841006073-OTHERS [27-07-2021(online)].pdf 2021-07-27
15 201841006073-Information under section 8(2) [27-07-2021(online)].pdf 2021-07-27
16 201841006073-FORM 3 [27-07-2021(online)].pdf 2021-07-27
17 201841006073-FER_SER_REPLY [27-07-2021(online)].pdf 2021-07-27
18 201841006073-DRAWING [27-07-2021(online)].pdf 2021-07-27
19 201841006073-CORRESPONDENCE [27-07-2021(online)].pdf 2021-07-27
20 201841006073-COMPLETE SPECIFICATION [27-07-2021(online)].pdf 2021-07-27
21 201841006073-CLAIMS [27-07-2021(online)].pdf 2021-07-27
22 201841006073-FER.pdf 2021-10-17
23 201841006073-PatentCertificate27-12-2023.pdf 2023-12-27
24 201841006073-IntimationOfGrant27-12-2023.pdf 2023-12-27
25 201841006073-PROOF OF ALTERATION [18-03-2024(online)].pdf 2024-03-18

Search Strategy

1 searchE_27-01-2021.pdf

ERegister / Renewals

3rd: 18 Mar 2024 (From 16/02/2020 - To 16/02/2021)
4th: 18 Mar 2024 (From 16/02/2021 - To 16/02/2022)
5th: 18 Mar 2024 (From 16/02/2022 - To 16/02/2023)
6th: 18 Mar 2024 (From 16/02/2023 - To 16/02/2024)
7th: 18 Mar 2024 (From 16/02/2024 - To 16/02/2025)
8th: 15 Feb 2025 (From 16/02/2025 - To 16/02/2026)