Methods And Devices For Extracting Text From Documents

Abstract: A method and device for extracting text from documents is disclosed. The method includes performing layout analysis on the document to identify a plurality of regions within a plurality of pages in the document. The method further includes identifying a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages. The method includes identifying at least two rows and at least two columns within the table region. The method further includes identifying a plurality of cells within the table region based on the at least two rows and the at least two columns. The method includes extracting text from each of the plurality of cells. FIG. 3
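The "homogeneity" named in the abstract is spelled out in claims 5 and 15 as per-line parameters (pixel length, number of words, pixel space between words, number of characters), a per-parameter variance from the page-column average, an average variance per line, and a covariance between contiguous lines. The following is a minimal sketch of one possible reading of those steps, not the applicant's implementation. The names `LineStats`, `average_variances` and `covariances` are hypothetical, "variance" is assumed to mean a squared relative deviation from the column average, and "covariance" is assumed to mean the absolute difference in average variance between a line and the next one.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass
class LineStats:
    """Preselected textual parameters for one textual line (hypothetical type)."""
    pixel_length: float     # rendered width of the line in pixels
    num_words: int          # number of words in the line
    word_gap_pixels: float  # total pixel space between adjacent words
    num_chars: int          # number of characters in the line


def average_variances(lines):
    """For each line, average the squared relative deviation of every
    parameter from that parameter's mean over the whole page column.
    Assumes a non-empty list of lines."""
    names = [f.name for f in fields(LineStats)]
    averages = {n: mean(getattr(line, n) for line in lines) for n in names}
    result = []
    for line in lines:
        deviations = [
            ((getattr(line, n) - averages[n]) / averages[n]) ** 2
            if averages[n] else 0.0
            for n in names
        ]
        result.append(mean(deviations))
    return result


def covariances(avg_var):
    """Assumed reading of the claimed 'covariance': the absolute difference
    in average variance between each line and the line that follows it.
    The last line has no successor, so it is compared with its predecessor."""
    n = len(avg_var)
    if n < 2:
        return [0.0] * n
    out = [abs(avg_var[i] - avg_var[i + 1]) for i in range(n - 1)]
    out.append(abs(avg_var[-1] - avg_var[-2]))
    return out


# Two prose-like lines, four identical table-like lines, two prose-like lines.
column = (
    [LineStats(500, 12, 60, 70), LineStats(420, 9, 40, 55)]
    + [LineStats(300, 4, 150, 20)] * 4
    + [LineStats(480, 11, 50, 66), LineStats(200, 5, 20, 28)]
)
for index, value in enumerate(covariances(average_variances(column))):
    print(index, round(value, 4))
```

In this toy column the lines inside the table-like run are identical, so their covariance values are exactly zero, while the values at the prose boundaries are clearly larger. Claims 6 and 16 then group contiguous lines whose covariance values differ by less than a predefined threshold; the threshold itself is not given in the text.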

Patent Information

Application #:
Filing Date: 18 May 2017
Publication Number: 47/2018
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: bangalore@knspartners.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2024-02-02
Renewal Date:

Applicants

WIPRO LIMITED
Doddakannelli, Sarjapur Road, Bangalore 560035, Karnataka, India.

Inventors

1. RAGHAVENDRA HOSABETTU
#3080, Venkatadri Nilaya, 2nd Main, 3rd Cross, VHBCS Layout, Banashankari 3rd Stage, Bangalore - 560085, Karnataka, India.
2. SENDIL KUMAR JAYA KUMAR
S-13, Flat No-201, 1st Floor, Spring Seas BLOSSOM Apartment, NEAR SPICE GARDEN, Silver Spring LAYOUT, Marathalli, Bangalore - 560037, Karnataka, India.
3. RAGHOTTAM MANNOPANTAR
Pristine Paradise, #105, Near Shantiniketan School, Bilekahalli, Bangalore - 560076, Karnataka, India.

Specification

Claims: WE CLAIM
1. A method for extracting text from a document, the method comprising:

performing, by a text extraction device, layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identifying, by the text extraction device, a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identifying, by the text extraction device, at least two rows and at least two columns within the table region;
identifying, by the text extraction device, a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extracting, by the text extraction device, text from each of the plurality of cells.
2. The method of claim 1, wherein the plurality of regions comprises at least one of at least one header, at least one footer, at least one page column, at least one table, or at least one image.
3. The method of claim 2, wherein the at least one page column is identified based on a threshold number of characters and a threshold number of words associated with a page column width.
4. The method of claim 1, wherein identifying the table region comprises:
computing a homogeneity index for each textual line in a page column within the page, wherein a homogeneity index for a textual line is computed based on a number of characters in the textual line and the plurality of preselected textual parameters; and
identifying a set of contiguous textual lines having same homogeneity index, wherein the set of contiguous textual lines form the table region.
5. The method of claim 1, wherein identifying the table region comprises:
determining, for each textual line in a page column within the page, values for the plurality of preselected textual parameters comprising at least one of pixel length, number of words, total pixel space between adjacent words, or number of characters;
computing, for each textual line in the page column, a variance of value of at least one of the plurality of preselected textual parameter from an associated average parameter value determined for all textual lines within the page column;
determining, for each textual line in the page column, average variance based on the variance computed for each of the at least one of the plurality of preselected textual parameters; and
computing, for each textual line in the page column, a covariance based on difference between average variance of each textual line and an associated contiguous textual line within the page column.
6. The method of claim 5 further comprising identifying a set of contiguous lines based on the covariance computed for each textual line in the page column, wherein difference between covariance of contiguous textual lines in the set of contiguous lines is below a predefined threshold, and wherein the set of contiguous lines forms the table region.
7. The method of claim 1, wherein identifying the at least two rows and the at least two columns within the table region comprises:
identifying a plurality of sets of contiguous pixels comprising a predefined color within the table region;
comparing each of the plurality of sets of contiguous pixels along the horizontal direction of the document with a minimum row pixel threshold, to identify the at least two rows; and
comparing each of the plurality of sets of contiguous pixels along the vertical direction of the document with a minimum column pixel threshold, to identify the at least two columns.
8. The method of claim 1 further comprising identifying a header row for the table region based on homogeneity between a first textual line and a second textual line within the table region.
9. The method of claim 1 further comprising storing coordinates of each of the plurality of regions.
10. The method of claim 1 further comprising discarding at least one region from the plurality of regions for further analysis in response to identifying the plurality of regions, wherein each of the at least one region is not a table.
11. The method of claim 1 further comprising storing the text extracted from each of the plurality of cells in a predefined format.
12. A text extraction device for extracting text from a document, the text extraction device comprises:
a processor; and
a memory communicatively coupled to the processor, wherein the memory stores processor instructions, which, on execution, causes the processor to:
perform layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identify a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identify at least two rows and at least two columns within the table region;
identify a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extract text from each of the plurality of cells.
13. The text extraction device of claim 12, wherein at least one page column is identified based on a threshold number of characters and a threshold number of words associated with a page column width.
14. The text extraction device of claim 12, wherein to identify the table region, the processor instructions further cause the processor to:
compute a homogeneity index for each textual line in a page column within the page, wherein a homogeneity index for a textual line is computed based on a number of characters in the textual line and the plurality of preselected textual parameters; and
identify a set of contiguous textual lines having same homogeneity index, wherein the set of contiguous textual lines form the table region.
15. The text extraction device of claim 12, wherein to identify the table region, the processor instructions further cause the processor to:
determine, for each textual line in a page column within the page, values for the plurality of preselected textual parameters comprising at least one of pixel length, number of words, total pixel space between adjacent words, or number of characters;
compute, for each textual line in the page column, a variance of value of at least one of the plurality of preselected textual parameter from an associated average parameter value determined for all textual lines within the page column;
determine, for each textual line in the page column, average variance based on the variance computed for each of the at least one of the plurality of preselected textual parameters; and
compute, for each textual line in the page column, a covariance based on difference between average variance of each textual line and an associated contiguous textual line within the page column.
16. The text extraction device of claim 15, wherein the processor instructions further cause the processor to identify a set of contiguous lines based on the covariance computed for each textual line in the page column, wherein difference between covariance of contiguous textual lines in the set of contiguous lines is below a predefined threshold, and wherein the set of contiguous lines forms the table region.
17. The text extraction device of claim 12, wherein to identify the at least two rows and the at least two columns within the table region, the processor instructions further cause the processor to:
identify a plurality of sets of contiguous pixels comprising a predefined color within the table region;
compare each of the plurality of sets of contiguous pixels along the horizontal direction of the document with a minimum row pixel threshold, to identify the at least two rows; and
compare each of the plurality of sets of contiguous pixels along the vertical direction of the document with a minimum column pixel threshold, to identify the at least two columns.
18. The text extraction device of claim 12, wherein the processor instructions further cause the processor to identify a header row for the table region based on homogeneity between a first textual line and a second textual line within the table region.
19. The text extraction device of claim 12, wherein the processor instructions further cause the processor to store the text extracted from each of the plurality of cells in a predefined format.
20. A non-transitory computer-readable storage medium having stored thereon, a set of computer-executable instructions causing a computer comprising one or more processors to perform steps comprising:
performing, by a text extraction device, layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identifying, by the text extraction device, a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identifying, by the text extraction device, at least two rows and at least two columns within the table region;
identifying, by the text extraction device, a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extracting, by the text extraction device, text from each of the plurality of cells.
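Claims 7 and 17 identify rows and columns by comparing sets of contiguous pixels of a predefined color against a minimum row pixel threshold (horizontal direction) and a minimum column pixel threshold (vertical direction). A minimal sketch of that idea follows, assuming the table region is already available as a binary grid in which 1 marks the predefined color, and assuming that the long runs are ruling lines and that rows and columns are the bands between them. The function names and the grid representation are hypothetical and do not come from the specification.

```python
def longest_run(values, color=1):
    """Length of the longest run of contiguous pixels equal to `color`."""
    best = current = 0
    for value in values:
        current = current + 1 if value == color else 0
        best = max(best, current)
    return best


def find_rulings(image, min_row_pixels, min_col_pixels, color=1):
    """Return (y indices, x indices) of pixel lines whose longest run of
    `color` meets the row threshold or the column threshold."""
    horizontal = [y for y, row in enumerate(image)
                  if longest_run(row, color) >= min_row_pixels]
    vertical = [x for x, col in enumerate(zip(*image))
                if longest_run(col, color) >= min_col_pixels]
    return horizontal, vertical


def merge_adjacent(indices):
    """Collapse consecutive indices (a thick ruling) into (start, end) spans."""
    spans = []
    for i in indices:
        if spans and i == spans[-1][1] + 1:
            spans[-1][1] = i
        else:
            spans.append([i, i])
    return [tuple(span) for span in spans]


def bands(rulings):
    """Pixel extents (inclusive) of the gaps between consecutive rulings."""
    spans = merge_adjacent(rulings)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(spans, spans[1:])]


# A 9 x 7 pixel grid with rulings at y = 0, 3, 6 and x = 0, 4, 8.
image = [[1 if y in (0, 3, 6) or x in (0, 4, 8) else 0 for x in range(9)]
         for y in range(7)]
horizontal, vertical = find_rulings(image, min_row_pixels=8, min_col_pixels=6)
print(bands(horizontal))  # [(1, 2), (4, 5)]
print(bands(vertical))    # [(1, 3), (5, 7)]
```

The toy grid yields two row bands and two column bands, which is the minimum the independent claims require. Tables without ruled lines would need a different cue, such as whitespace gaps, which this sketch does not cover.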

Dated this 18th day of May, 2017

Swetha SN
Of K&S Partners
Agent for the Applicant
IN/PA-2123
Description: TECHNICAL FIELD
This disclosure relates generally to text extraction and more particularly to methods and devices for extracting text from documents.
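As a rough illustration of the last two claimed steps (identifying cells from the rows and columns, then extracting text from each cell), the sketch below treats every cell as the intersection of one row band and one column band and assigns words to cells by position. It assumes word positions are already known, for example from an OCR or PDF text layer. The `Word` type, the band tuples and the function names are hypothetical, not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word with the coordinates of its center (hypothetical type)."""
    text: str
    x: float
    y: float


def locate(value, bands):
    """Index of the (low, high) band that contains `value`, or None."""
    for index, (low, high) in enumerate(bands):
        if low <= value <= high:
            return index
    return None


def extract_cells(words, row_bands, col_bands):
    """Build a row-by-column grid of cell text. Words are visited top to
    bottom and left to right, which assumes words on one line share a y value."""
    grid = [["" for _ in col_bands] for _ in row_bands]
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        r = locate(word.y, row_bands)
        c = locate(word.x, col_bands)
        if r is None or c is None:
            continue  # the word lies outside the table region
        grid[r][c] = (grid[r][c] + " " + word.text).strip()
    return grid


row_bands = [(0, 10), (11, 20)]
col_bands = [(0, 50), (51, 100)]
words = [
    Word("Qty", 60, 5), Word("Name", 5, 5),
    Word("pen", 30, 15), Word("Blue", 5, 15), Word("3", 60, 15),
]
print(extract_cells(words, row_bands, col_bands))
# [['Name', 'Qty'], ['Blue pen', '3']]
```

The resulting grid could then be stored in whatever predefined format an implementation chooses, as claims 11 and 19 mention; the text does not name one.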

Documents

Application Documents

# Name Date
1 Power of Attorney [18-05-2017(online)].pdf 2017-05-18
2 Form 5 [18-05-2017(online)].pdf 2017-05-18
3 Form 3 [18-05-2017(online)].pdf 2017-05-18
4 Form 18 [18-05-2017(online)].pdf_148.pdf 2017-05-18
5 Form 18 [18-05-2017(online)].pdf 2017-05-18
6 Form 1 [18-05-2017(online)].pdf 2017-05-18
7 Drawing [18-05-2017(online)].pdf 2017-05-18
8 Description(Complete) [18-05-2017(online)].pdf_147.pdf 2017-05-18
9 Description(Complete) [18-05-2017(online)].pdf 2017-05-18
10 REQUEST FOR CERTIFIED COPY [19-05-2017(online)].pdf 2017-05-19
11 PROOF OF RIGHT [11-07-2017(online)].pdf 2017-07-11
12 Correspondence by Agent_Form 1_13-07-2017.pdf 2017-07-13
13 abstract201741017499.jpg 2017-07-17
14 201741017499-PETITION UNDER RULE 137 [16-04-2021(online)].pdf 2021-04-16
15 201741017499-Information under section 8(2) [16-04-2021(online)].pdf 2021-04-16
16 201741017499-FORM 3 [16-04-2021(online)].pdf 2021-04-16
17 201741017499-FER_SER_REPLY [16-04-2021(online)].pdf 2021-04-16
18 201741017499-FER.pdf 2021-10-17
19 201741017499-US(14)-HearingNotice-(HearingDate-11-01-2024).pdf 2023-12-13
20 201741017499-POA [22-12-2023(online)].pdf 2023-12-22
21 201741017499-FORM 13 [22-12-2023(online)].pdf 2023-12-22
22 201741017499-Correspondence to notify the Controller [22-12-2023(online)].pdf 2023-12-22
23 201741017499-AMENDED DOCUMENTS [22-12-2023(online)].pdf 2023-12-22
24 201741017499-Written submissions and relevant documents [26-01-2024(online)].pdf 2024-01-26
25 201741017499-FORM-26 [26-01-2024(online)].pdf 2024-01-26
26 201741017499-FORM 3 [26-01-2024(online)].pdf 2024-01-26
27 201741017499-PatentCertificate02-02-2024.pdf 2024-02-02
28 201741017499-IntimationOfGrant02-02-2024.pdf 2024-02-02

Search Strategy

1 searchE_15-10-2020.pdf

ERegister / Renewals

3rd: 01 May 2024

From 18/05/2019 - To 18/05/2020

4th: 01 May 2024

From 18/05/2020 - To 18/05/2021

5th: 01 May 2024

From 18/05/2021 - To 18/05/2022

6th: 01 May 2024

From 18/05/2022 - To 18/05/2023

7th: 01 May 2024

From 18/05/2023 - To 18/05/2024

8th: 01 May 2024

From 18/05/2024 - To 18/05/2025

9th: 06 May 2025

From 18/05/2025 - To 18/05/2026