Methods And Devices For Extracting Text From Documents

Abstract: A method and a device for extracting text from documents are disclosed. The method includes performing layout analysis on the document to identify a plurality of regions within a plurality of pages in the document. The method further includes identifying a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages. The method includes identifying at least two rows and at least two columns within the table region. The method further includes identifying a plurality of cells within the table region based on the at least two rows and the at least two columns. The method includes extracting text from each of the plurality of cells. (FIG. 3)

Patent Information

Application #

Filing Date

18 May 2017

Publication Number

47/2018

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Email

bangalore@knspartners.com

Parent Application

Patent Number

Legal Status

Grant Date

2024-02-02

Renewal Date

Applicants

WIPRO LIMITED

Doddakannelli, Sarjapur Road, Bangalore 560035, Karnataka, India.

Inventors

1. RAGHAVENDRA HOSABETTU

#3080, Venkatadri Nilaya, 2nd Main, 3rd Cross, VHBCS Layout, Banashankari 3rd Stage, Bangalore - 560085, Karnataka, India.

2. SENDIL KUMAR JAYA KUMAR

S-13, Flat No-201, 1st Floor, Spring Seas BLOSSOM Apartment, NEAR SPICE GARDEN, Silver Spring LAYOUT, Marathalli, Bangalore - 560037, Karnataka, India.

3. RAGHOTTAM MANNOPANTAR

Pristine Paradise, #105, Near Shantiniketan School, Bilekahalli, Bangalore - 560076, Karnataka, India.

Specification

Claims: WE CLAIM
1. A method for extracting text from a document, the method comprising:

performing, by a text extraction device, layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identifying, by the text extraction device, a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identifying, by the text extraction device, at least two rows and at least two columns within the table region;
identifying, by the text extraction device, a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extracting, by the text extraction device, text from each of the plurality of cells.
2. The method of claim 1, wherein the plurality of regions comprises at least one of at least one header, at least one footer, at least one page column, at least one table, or at least one image.
3. The method of claim 2, wherein the at least one page column is identified based on a threshold number of characters and a threshold number of words associated with a page column width.
4. The method of claim 1, wherein identifying the table region comprises:
computing a homogeneity index for each textual line in a page column within the page, wherein a homogeneity index for a textual line is computed based on a number of characters in the textual line and the plurality of preselected textual parameters; and
identifying a set of contiguous textual lines having same homogeneity index, wherein the set of contiguous textual lines form the table region.
5. The method of claim 1, wherein identifying the table region comprises:
determining, for each textual line in a page column within the page, values for the plurality of preselected textual parameters comprising at least one of pixel length, number of words, total pixel space between adjacent words, or number of characters;
computing, for each textual line in the page column, a variance of value of at least one of the plurality of preselected textual parameters from an associated average parameter value determined for all textual lines within the page column;
determining, for each textual line in the page column, average variance based on the variance computed for each of the at least one of the plurality of preselected textual parameters; and
computing, for each textual line in the page column, a covariance based on difference between average variance of each textual line and an associated contiguous textual line within the page column.
6. The method of claim 5 further comprising identifying a set of contiguous lines based on the covariance computed for each textual line in the page column, wherein difference between covariance of contiguous textual lines in the set of contiguous lines is below a predefined threshold, and wherein the set of contiguous lines forms the table region.
7. The method of claim 1, wherein identifying the at least two rows and the at least two columns within the table region comprises:
identifying a plurality of sets of contiguous pixels comprising a predefined color within the table region;
comparing each of the plurality of sets of contiguous pixels along the horizontal direction of the document with a minimum row pixel threshold, to identify the at least two rows; and
comparing each of the plurality of sets of contiguous pixels along the vertical direction of the document with a minimum column pixel threshold, to identify the at least two columns.
8. The method of claim 1 further comprising identifying a header row for the table region based on homogeneity between a first textual line and a second textual line within the table region.
9. The method of claim 1 further comprising storing coordinates of each of the plurality of regions.
10. The method of claim 1 further comprising discarding at least one region from the plurality of regions for further analysis in response to identifying the plurality of regions, wherein each of the at least one region is not a table.
11. The method of claim 1 further comprising storing the text extracted from each of the plurality of cells in a predefined format.
12. A text extraction device for extracting text from a document, the text extraction device comprises:
a processor; and
a memory communicatively coupled to the processor, wherein the memory stores processor instructions, which, on execution, cause the processor to:
perform layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identify a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identify at least two rows and at least two columns within the table region;
identify a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extract text from each of the plurality of cells.
13. The text extraction device of claim 12, wherein at least one page column is identified based on a threshold number of characters and a threshold number of words associated with a page column width.
14. The text extraction device of claim 12, wherein to identify the table region, the processor instructions further cause the processor to:
compute a homogeneity index for each textual line in a page column within the page, wherein a homogeneity index for a textual line is computed based on a number of characters in the textual line and the plurality of preselected textual parameters; and
identify a set of contiguous textual lines having same homogeneity index, wherein the set of contiguous textual lines form the table region.
15. The text extraction device of claim 12, wherein to identify the table region, the processor instructions further cause the processor to:
determine, for each textual line in a page column within the page, values for the plurality of preselected textual parameters comprising at least one of pixel length, number of words, total pixel space between adjacent words, or number of characters;
compute, for each textual line in the page column, a variance of value of at least one of the plurality of preselected textual parameters from an associated average parameter value determined for all textual lines within the page column;
determine, for each textual line in the page column, average variance based on the variance computed for each of the at least one of the plurality of preselected textual parameters; and
compute, for each textual line in the page column, a covariance based on difference between average variance of each textual line and an associated contiguous textual line within the page column.
16. The text extraction device of claim 15, wherein the processor instructions further cause the processor to identify a set of contiguous lines based on the covariance computed for each textual line in the page column, wherein difference between covariance of contiguous textual lines in the set of contiguous lines is below a predefined threshold, and wherein the set of contiguous lines forms the table region.
17. The text extraction device of claim 12, wherein to identify the at least two rows and the at least two columns within the table region, the processor instructions further cause the processor to:
identify a plurality of sets of contiguous pixels comprising a predefined color within the table region;
compare each of the plurality of sets of contiguous pixels along the horizontal direction of the document with a minimum row pixel threshold, to identify the at least two rows; and
compare each of the plurality of sets of contiguous pixels along the vertical direction of the document with a minimum column pixel threshold, to identify the at least two columns.
18. The text extraction device of claim 12, wherein the processor instructions further cause the processor to identify a header row for the table region based on homogeneity between a first textual line and a second textual line within the table region.
19. The text extraction device of claim 12, wherein the processor instructions further cause the processor to store the text extracted from each of the plurality of cells in a predefined format.
20. A non-transitory computer-readable storage medium having stored thereon, a set of computer-executable instructions causing a computer comprising one or more processors to perform steps comprising:
performing, by a text extraction device, layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identifying, by the text extraction device, a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identifying, by the text extraction device, at least two rows and at least two columns within the table region;
identifying, by the text extraction device, a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extracting, by the text extraction device, text from each of the plurality of cells.

Dated this 18th day of May, 2017

Swetha SN
Of K&S Partners
Agent for the Applicant
IN/PA-2123
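
Illustrative sketch (not part of the filed claims): the table-region steps recited in claims 5 and 6 can be read as a short pipeline. The claims do not give formulas, so the following Python sketch rests on assumptions that are labeled as such: the "variance" of a parameter is taken as the squared relative deviation from the page-column average, the "covariance" of a line is taken as the absolute difference between its average variance and that of the next line, and the threshold value is arbitrary. The names TextLine, average_variances, covariances, contiguous_runs and find_table_regions are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TextLine:
    """Preselected textual parameters of one line (field names are illustrative)."""
    pixel_length: float
    num_words: int
    word_gap_pixels: float
    num_chars: int

    def params(self):
        return (self.pixel_length, self.num_words,
                self.word_gap_pixels, self.num_chars)


def average_variances(lines):
    """Per line: mean squared relative deviation from the page-column averages.

    Assumption: the claims do not define 'variance'; this is one plausible reading.
    """
    if not lines:
        return []
    means = [mean(col) for col in zip(*(ln.params() for ln in lines))]
    return [
        mean(((v - m) / m) ** 2 if m else 0.0
             for v, m in zip(ln.params(), means))
        for ln in lines
    ]


def covariances(avg_vars):
    """Assumed 'covariance': |average variance of a line - that of the next line|.

    The last line has no next line, so it reuses the value of the previous pair.
    """
    n = len(avg_vars)
    if n < 2:
        return [0.0] * n
    covs = [abs(avg_vars[i] - avg_vars[i + 1]) for i in range(n - 1)]
    covs.append(covs[-1])
    return covs


def contiguous_runs(values, threshold, min_len=2):
    """Return (first, last) index pairs of runs whose neighbouring values
    differ by less than the threshold."""
    runs, start = [], 0
    for i in range(1, len(values) + 1):
        if i == len(values) or abs(values[i] - values[i - 1]) >= threshold:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = i
    return runs


def find_table_regions(lines, threshold=0.05, min_lines=2):
    """Candidate table regions in one page column, as line index ranges."""
    return contiguous_runs(covariances(average_variances(lines)),
                           threshold, min_lines)
```

Under these assumptions, a run of lines with near-identical parameter values yields near-equal covariance values and is reported as one candidate region. The threshold and the minimum run length would need tuning on real documents.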
Description: TECHNICAL FIELD
This disclosure relates generally to text extraction and more particularly to methods and devices for extracting text from documents.
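
Illustrative sketch (not part of the filed specification): claim 7 recites finding sets of contiguous pixels of a predefined color and comparing them with minimum row and column pixel thresholds, and claim 1 then derives cells from the rows and columns. The Python sketch below is one possible reading under stated assumptions: the predefined color is assumed to be the color of the table's ruling lines, the table region is assumed to be available as a 2D grid of pixel values, and rows and columns are taken as the bands between successive ruling lines. The names longest_run, ruling_lines, bands and cells are hypothetical.

```python
def longest_run(seq, color):
    """Length of the longest run of contiguous pixels equal to `color`."""
    best = cur = 0
    for px in seq:
        cur = cur + 1 if px == color else 0
        best = max(best, cur)
    return best


def ruling_lines(grid, color, min_row_pixels, min_col_pixels):
    """Positions of horizontal and vertical ruling lines.

    A pixel row counts as a horizontal line if it contains a run of `color`
    at least `min_row_pixels` long; pixel columns are treated likewise
    with `min_col_pixels`.
    """
    ys = [y for y, row in enumerate(grid)
          if longest_run(row, color) >= min_row_pixels]
    xs = [x for x, col in enumerate(zip(*grid))
          if longest_run(col, color) >= min_col_pixels]
    return ys, xs


def bands(positions):
    """Gaps between successive ruling lines as (first, last) pixel indices.

    Adjacent positions (a thick line) leave no gap and produce no band.
    """
    return [(a + 1, b - 1)
            for a, b in zip(positions, positions[1:]) if b - a > 1]


def cells(grid, color=0, min_row_pixels=5, min_col_pixels=5):
    """Cell boxes (top, left, bottom, right), grouped by table row."""
    ys, xs = ruling_lines(grid, color, min_row_pixels, min_col_pixels)
    return [[(top, left, bottom, right) for left, right in bands(xs)]
            for top, bottom in bands(ys)]
```

Each returned box could then be passed to whatever text extraction step is in use. Tables without visible ruling lines would need a different cue, such as runs of background-colored pixels between text blocks.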

Documents

Application Documents

#	Name	Date
1	Power of Attorney [18-05-2017(online)].pdf	2017-05-18
2	Form 5 [18-05-2017(online)].pdf	2017-05-18
3	Form 3 [18-05-2017(online)].pdf	2017-05-18
4	Form 18 [18-05-2017(online)].pdf_148.pdf	2017-05-18
5	Form 18 [18-05-2017(online)].pdf	2017-05-18
6	Form 1 [18-05-2017(online)].pdf	2017-05-18
7	Drawing [18-05-2017(online)].pdf	2017-05-18
8	Description(Complete) [18-05-2017(online)].pdf_147.pdf	2017-05-18
9	Description(Complete) [18-05-2017(online)].pdf	2017-05-18
10	REQUEST FOR CERTIFIED COPY [19-05-2017(online)].pdf	2017-05-19
11	PROOF OF RIGHT [11-07-2017(online)].pdf	2017-07-11
12	Correspondence by Agent_Form 1_13-07-2017.pdf	2017-07-13
13	abstract201741017499.jpg	2017-07-17
14	201741017499-PETITION UNDER RULE 137 [16-04-2021(online)].pdf	2021-04-16
15	201741017499-Information under section 8(2) [16-04-2021(online)].pdf	2021-04-16
16	201741017499-FORM 3 [16-04-2021(online)].pdf	2021-04-16
17	201741017499-FER_SER_REPLY [16-04-2021(online)].pdf	2021-04-16
18	201741017499-FER.pdf	2021-10-17
19	201741017499-US(14)-HearingNotice-(HearingDate-11-01-2024).pdf	2023-12-13
20	201741017499-POA [22-12-2023(online)].pdf	2023-12-22
21	201741017499-FORM 13 [22-12-2023(online)].pdf	2023-12-22
22	201741017499-Correspondence to notify the Controller [22-12-2023(online)].pdf	2023-12-22
23	201741017499-AMENDED DOCUMENTS [22-12-2023(online)].pdf	2023-12-22
24	201741017499-Written submissions and relevant documents [26-01-2024(online)].pdf	2024-01-26
25	201741017499-FORM-26 [26-01-2024(online)].pdf	2024-01-26
26	201741017499-FORM 3 [26-01-2024(online)].pdf	2024-01-26
27	201741017499-PatentCertificate02-02-2024.pdf	2024-02-02
28	201741017499-IntimationOfGrant02-02-2024.pdf	2024-02-02

Search Strategy

1	searchE_15-10-2020.pdf