
Method And System For Determining Structural Blocks Of A Document

Abstract: This disclosure relates to a method and system for determining structural blocks of a document. The method may include extracting text lines from the document, generating a feature vector for each text line by determining feature values for a set of features in each text line, and determining at least one dominant feature from among the set of features and at least one corresponding dominance factor, for each structural class, based on the feature vector for each text line. The method may further include deriving a set of rules for classification of the text lines into respective structural classes and determining a structural block tag for each text line based on the set of rules. Each of the set of rules corresponds to one of the structural classes and is based on the at least one dominant feature and the at least one corresponding dominance factor for that class. Figure 3
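The feature-vector step described in the abstract can be illustrated with a minimal Python sketch. The `TextLine` record, the `feature_vector` helper, the chosen feature names, and the sample values are all hypothetical; the filing names only the feature categories (positional, font, count, spacing), not a concrete data layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TextLine:
    """Hypothetical record for one extracted text line."""
    text: str
    x0: float          # left edge of the line (positional feature)
    y0: float          # top edge; assumed to increase down the page
    font_size: float   # font feature
    bold: bool = False
    italic: bool = False


def feature_vector(line: TextLine, prev: Optional[TextLine] = None) -> Dict[str, float]:
    """Build a flat feature vector for one line.

    Spacing is measured against the preceding line and is 0.0 for the first line.
    """
    spacing = 0.0 if prev is None else line.y0 - prev.y0
    return {
        "x0": line.x0,
        "y0": line.y0,
        "font_size": line.font_size,
        "bold": 1.0 if line.bold else 0.0,      # font-style flag
        "italic": 1.0 if line.italic else 0.0,  # font-style flag
        "length": float(len(line.text)),        # count feature
        "spacing": spacing,                     # spacing feature
    }


lines = [
    TextLine("1. Introduction", 72.0, 100.0, 14.0, bold=True),
    TextLine("Documents contain structural blocks.", 72.0, 124.0, 10.0),
]
vectors = [
    feature_vector(line, lines[i - 1] if i > 0 else None)
    for i, line in enumerate(lines)
]
```

Here each line yields one dictionary of numeric feature values, so later steps can compare lines feature by feature.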


Patent Information

Application #:
Filing Date: 30 July 2018
Publication Number: 06/2020
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: bangalore@knspartners.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2024-07-08
Renewal Date:

Applicants

WIPRO LIMITED
Doddakannelli, Sarjapur Road, Bangalore 560035, Karnataka, India.

Inventors

1. RAGHAVENDRA HOSABETTU
#3080/3081, ‘Venkatadri Nilaya’, 2nd Main, 3rd Cross, VBHCS Layout, Banashankari 3rd Stage, Near Katriguppe Water Tank, Bengaluru –560 050, Karnataka, India.
2. SNEHA SUBHASCHANDRA BANAKAR
#1452, 2nd Phase, J P Nagar, Bengaluru – 560078, Karnataka, India.

Specification

Claims: WE CLAIM
1. A method of determining structural blocks of a document, the method comprising:
extracting, by a document analysis device, a plurality of text lines from the document;
generating, by the document analysis device, a feature vector for each of the plurality of text lines by determining a set of feature values for a set of corresponding features in each of the plurality of text lines;
determining, by the document analysis device, at least one dominant feature from among the set of corresponding features and at least one corresponding dominance factor, for each of a plurality of structural classes, based on the feature vector for each of the plurality of text lines;
deriving, by the document analysis device, a set of rules for classification of the plurality of text lines into the plurality of structural classes, wherein each of the set of rules corresponds to one of the plurality of structural classes and is based on the at least one dominant feature and the at least one corresponding dominance factor for that class; and
determining, by the document analysis device, a structural block tag for each of the plurality of text lines based on the set of rules.

2. The method of claim 1, further comprising:
receiving an image document; and
performing an optical text recognition on the image document to generate the document.

3. The method of claim 1, wherein extracting the plurality of text lines comprises applying a text extraction tool with a pre-defined or a dynamic threshold on the document.

4. The method of claim 1, wherein the set of corresponding features comprises at least one of a positional feature, a font feature, a count feature, or a spacing feature.

5. The method of claim 1, wherein the set of feature values for the set of corresponding features comprises at least one of positional coordinates of a text line, a font size in the text line, one or more flags for one or more font styles, a length of the text line, or a spacing between at least two of the plurality of text lines.

6. The method of claim 1, wherein determining the at least one dominant feature comprises comparing the feature vector for a text line with a set of feature vectors of a set of neighboring text lines, and wherein the set of neighboring text lines comprises at least one of a pre-defined set of preceding text lines, or a pre-defined set of successive text lines.

7. The method of claim 1, wherein each of the set of rules comprises a sum of the at least one dominant feature modified by the at least one corresponding dominance factor.

8. The method of claim 1, further comprising determining at least one corresponding threshold value for the at least one dominant feature, and wherein each of the set of rules comprises a sum of the at least one dominant feature modified by the at least one corresponding dominance factor and the at least one corresponding threshold value.

9. The method of claim 8, wherein determining the at least one corresponding threshold value for the at least one dominant feature comprises:
initializing the at least one corresponding threshold value based on an initial difference between feature values for the at least one dominant feature; and
dynamically adjusting the at least one corresponding threshold value in each of a plurality of subsequent passes.

10. The method of claim 1, wherein the structural block tag comprises one of a paragraph tag, a paragraph start tag, a paragraph end tag, a single line tag, a title tag, a section header tag, a footnote tag, a list tag, or a table of content tag.

11. The method of claim 1, wherein determining the structural block tag comprises:
determining a set of scores for a text line based on the set of rules; and
determining the structural block tag corresponding to one of the plurality of structural classes based on an indicative score among the set of scores.

12. A system for determining structural blocks of a document, the system comprising:
a document analysis device comprising at least one processor and a computer-readable medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
extracting a plurality of text lines from the document;
generating a feature vector for each of the plurality of text lines by determining a set of feature values for a set of corresponding features in each of the plurality of text lines;
determining at least one dominant feature from among the set of corresponding features and at least one corresponding dominance factor, for each of a plurality of structural classes, based on the feature vector for each of the plurality of text lines;
deriving a set of rules for classification of the plurality of text lines into the plurality of structural classes, wherein each of the set of rules corresponds to one of the plurality of structural classes and is based on the at least one dominant feature and the at least one corresponding dominance factor for that class; and
determining a structural block tag for each of the plurality of text lines based on the set of rules.

13. The system of claim 12, wherein extracting the plurality of text lines comprises applying a text extraction tool with a pre-defined or a dynamic threshold on the document.

14. The system of claim 12, wherein the set of corresponding features comprises at least one of a positional feature, a font feature, a count feature, or a spacing feature, and wherein the set of feature values for the set of corresponding features comprises at least one of positional coordinates of a text line, a font size in the text line, one or more flags for one or more font styles, a length of the text line, or a spacing between at least two of the plurality of text lines.

15. The system of claim 12, wherein determining the at least one dominant feature comprises comparing the feature vector for a text line with a set of feature vectors of a set of neighboring text lines, and wherein the set of neighboring text lines comprises at least one of a pre-defined set of preceding text lines, or a pre-defined set of successive text lines.

16. The system of claim 12, wherein each of the set of rules comprises a sum of the at least one dominant feature modified by the at least one corresponding dominance factor.

17. The system of claim 12, wherein the operations further comprise determining at least one corresponding threshold value for the at least one dominant feature, and wherein each of the set of rules comprises a sum of the at least one dominant feature modified by the at least one corresponding dominance factor and the at least one corresponding threshold value.

18. The system of claim 17, wherein determining the at least one corresponding threshold value for the at least one dominant feature comprises:
initializing the at least one corresponding threshold value based on an initial difference between feature values for the at least one dominant feature; and
dynamically adjusting the at least one corresponding threshold value in each of a plurality of subsequent passes.

19. The system of claim 12, wherein determining the structural block tag comprises:
determining a set of scores for a text line based on the set of rules; and
determining the structural block tag corresponding to one of the plurality of structural classes based on an indicative score among the set of scores.
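The rule and scoring language in the claims above (a sum of dominant features modified by dominance factors and threshold values, followed by selecting a tag from an indicative score) can be sketched roughly as follows. This is only one possible reading: the formula `factor * (value - threshold)`, the choice of the highest score as the "indicative" one, and every rule name and number below are assumptions for illustration, not taken from the filing.

```python
from typing import Dict, Tuple

# A rule maps each dominant feature to (dominance factor, threshold).
Rule = Dict[str, Tuple[float, float]]


def rule_score(features: Dict[str, float], rule: Rule) -> float:
    """Sum the dominant features, each offset by its threshold and
    weighted by its dominance factor (assumed form of the rule)."""
    return sum(
        factor * (features[name] - threshold)
        for name, (factor, threshold) in rule.items()
    )


def tag_line(features: Dict[str, float], rules: Dict[str, Rule]):
    """Score a line against every class rule and return the class with
    the highest score, plus all scores (assumes highest = indicative)."""
    scores = {cls: rule_score(features, rule) for cls, rule in rules.items()}
    return max(scores, key=scores.get), scores


# Hypothetical rules: one per structural class.
rules: Dict[str, Rule] = {
    "section_header": {"font_size": (1.0, 12.0), "bold": (2.0, 0.5)},
    "paragraph": {"length": (0.05, 40.0), "font_size": (-1.0, 12.0)},
}

header_line = {"font_size": 14.0, "bold": 1.0, "length": 15.0}
body_line = {"font_size": 10.0, "bold": 0.0, "length": 80.0}

header_tag, header_scores = tag_line(header_line, rules)
body_tag, body_scores = tag_line(body_line, rules)
```

With these made-up numbers, the large bold line scores highest under the `section_header` rule and the long, small-font line scores highest under the `paragraph` rule.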

Dated this 30th day of July, 2018

Swetha SN
Of K&S Partners
Agent for the Applicant
IN/PA-2123
Description: TECHNICAL FIELD
This disclosure relates generally to document processing, and more particularly to a method and system for determining structural blocks of a document.
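One step recited in the claims is comparing a text line's feature vector with those of a pre-defined set of preceding and successive lines in order to find a dominant feature. A minimal sketch of such a comparison is given below; the `neighbor_differences` helper, the window sizes, the use of a mean absolute difference, and the sample vectors are all assumptions, since the text here does not specify how the comparison is computed.

```python
from typing import Dict, List


def neighbor_differences(
    vectors: List[Dict[str, float]],
    index: int,
    before: int = 1,
    after: int = 1,
) -> Dict[str, float]:
    """Mean absolute difference, per feature, between the line at `index`
    and its neighbors within the given window of preceding/successive lines."""
    start = max(0, index - before)
    stop = min(len(vectors), index + after + 1)
    neighbors = [vectors[i] for i in range(start, stop) if i != index]
    current = vectors[index]
    if not neighbors:
        # A single-line document has nothing to compare against.
        return {name: 0.0 for name in current}
    return {
        name: sum(abs(current[name] - n[name]) for n in neighbors) / len(neighbors)
        for name in current
    }


# Hypothetical feature vectors for three consecutive lines.
vectors = [
    {"font_size": 10.0, "bold": 0.0},
    {"font_size": 14.0, "bold": 1.0},
    {"font_size": 10.0, "bold": 0.0},
]

diffs = neighbor_differences(vectors, 1)
# Treat the feature that differs most from the neighbors as dominant (assumption).
dominant = max(diffs, key=diffs.get)
```

The raw differences are not scale-normalized, so a real comparison across features with different units would likely need some normalization first.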

Documents

Application Documents

# Name Date
1 201841028613-STATEMENT OF UNDERTAKING (FORM 3) [30-07-2018(online)].pdf 2018-07-30
2 201841028613-Request Letter-Correspondence [30-07-2018(online)].pdf 2018-07-30
3 201841028613-REQUEST FOR EXAMINATION (FORM-18) [30-07-2018(online)].pdf 2018-07-30
4 201841028613-POWER OF AUTHORITY [30-07-2018(online)].pdf 2018-07-30
5 201841028613-Power of Attorney [30-07-2018(online)].pdf 2018-07-30
6 201841028613-FORM 18 [30-07-2018(online)].pdf 2018-07-30
7 201841028613-FORM 1 [30-07-2018(online)].pdf 2018-07-30
8 201841028613-Form 1 (Submitted on date of filing) [30-07-2018(online)].pdf 2018-07-30
9 201841028613-DRAWINGS [30-07-2018(online)].pdf 2018-07-30
10 201841028613-DECLARATION OF INVENTORSHIP (FORM 5) [30-07-2018(online)].pdf 2018-07-30
11 201841028613-COMPLETE SPECIFICATION [30-07-2018(online)].pdf 2018-07-30
12 abstract 201841028613.jpg 2018-08-29
13 201841028613-Proof of Right (MANDATORY) [21-09-2018(online)].pdf 2018-09-21
14 Correspondence by Agent_Form30,Form1_26-09-2018.pdf 2018-09-26
15 201841028613-PETITION UNDER RULE 137 [06-07-2021(online)].pdf 2021-07-06
16 201841028613-FORM 3 [06-07-2021(online)].pdf 2021-07-06
17 201841028613-OTHERS [07-07-2021(online)].pdf 2021-07-07
18 201841028613-FER_SER_REPLY [07-07-2021(online)].pdf 2021-07-07
19 201841028613-DRAWING [07-07-2021(online)].pdf 2021-07-07
20 201841028613-CORRESPONDENCE [07-07-2021(online)].pdf 2021-07-07
21 201841028613-COMPLETE SPECIFICATION [07-07-2021(online)].pdf 2021-07-07
22 201841028613-CLAIMS [07-07-2021(online)].pdf 2021-07-07
23 201841028613-ABSTRACT [07-07-2021(online)].pdf 2021-07-07
24 201841028613-FER.pdf 2021-10-17
25 201841028613-US(14)-HearingNotice-(HearingDate-07-03-2024).pdf 2024-02-09
26 201841028613-POA [28-02-2024(online)].pdf 2024-02-28
27 201841028613-FORM 13 [28-02-2024(online)].pdf 2024-02-28
28 201841028613-Correspondence to notify the Controller [28-02-2024(online)].pdf 2024-02-28
29 201841028613-AMENDED DOCUMENTS [28-02-2024(online)].pdf 2024-02-28
30 201841028613-Written submissions and relevant documents [21-03-2024(online)].pdf 2024-03-21
31 201841028613-FORM-26 [21-03-2024(online)].pdf 2024-03-21
32 201841028613-PatentCertificate08-07-2024.pdf 2024-07-08
33 201841028613-IntimationOfGrant08-07-2024.pdf 2024-07-08

Search Strategy

1 FER-2020-12-24-17-04-07E_24-12-2020.pdf
2 amended_searchAE_30-11-2021.pdf

ERegister / Renewals

3rd: 08 Oct 2024 (from 30/07/2020 to 30/07/2021)
4th: 08 Oct 2024 (from 30/07/2021 to 30/07/2022)
5th: 08 Oct 2024 (from 30/07/2022 to 30/07/2023)
6th: 08 Oct 2024 (from 30/07/2023 to 30/07/2024)
7th: 08 Oct 2024 (from 30/07/2024 to 30/07/2025)
8th: 25 Jul 2025 (from 30/07/2025 to 30/07/2026)