Abstract: A method and device for extracting text from documents are disclosed. The method includes performing layout analysis on the document to identify a plurality of regions within a plurality of pages in the document. The method further includes identifying a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages. The method includes identifying at least two rows and at least two columns within the table region. The method further includes identifying a plurality of cells within the table region based on the at least two rows and the at least two columns. The method includes extracting text from each of the plurality of cells. FIG. 3
Claims: WE CLAIM
1. A method for extracting text from a document, the method comprising:
performing, by a text extraction device, layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identifying, by the text extraction device, a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identifying, by the text extraction device, at least two rows and at least two columns within the table region;
identifying, by the text extraction device, a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extracting, by the text extraction device, text from each of the plurality of cells.
2. The method of claim 1, wherein the plurality of regions comprises at least one of at least one header, at least one footer, at least one page column, at least one table, or at least one image.
3. The method of claim 2, wherein the at least one page column is identified based on a threshold number of characters and a threshold number of words associated with a page column width.
4. The method of claim 1, wherein identifying the table region comprises:
computing a homogeneity index for each textual line in a page column within the page, wherein a homogeneity index for a textual line is computed based on a number of characters in the textual line and the plurality of preselected textual parameters; and
identifying a set of contiguous textual lines having the same homogeneity index, wherein the set of contiguous textual lines forms the table region.
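The grouping step of claim 4 can be illustrated with a minimal sketch. The claim does not give a formula for the homogeneity index, so `homogeneity_index` below is purely hypothetical (it buckets the character count and pairs it with the word count), and the names `table_candidates`, `bucket`, and `min_lines` are assumptions made for illustration only.

```python
from itertools import groupby


def homogeneity_index(num_chars, num_words, bucket=10):
    # Hypothetical index: the claim names the inputs (character count and
    # other textual parameters) but not the formula, so this is a stand-in.
    return (num_chars // bucket, num_words)


def table_candidates(lines, min_lines=2):
    """Return (start, end) inclusive index ranges of contiguous lines
    that share the same homogeneity index.

    `lines` is a list of (num_chars, num_words) tuples, one per textual
    line in a page column.
    """
    regions = []
    pos = 0
    for _, group in groupby(lines, key=lambda line: homogeneity_index(*line)):
        count = len(list(group))
        if count >= min_lines:
            regions.append((pos, pos + count - 1))
        pos += count
    return regions


# Two prose lines, three short table-like lines, one prose line.
lines = [(72, 13), (65, 11), (31, 4), (33, 4), (38, 4), (70, 12)]
print(table_candidates(lines))  # [(2, 4)]
```

Runs of uniform body text would also share an index under this stand-in, so a practical implementation would presumably need further checks before treating a run as a table.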
5. The method of claim 1, wherein identifying the table region comprises:
determining, for each textual line in a page column within the page, values for the plurality of preselected textual parameters comprising at least one of pixel length, number of words, total pixel space between adjacent words, or number of characters;
computing, for each textual line in the page column, a variance of the value of at least one of the plurality of preselected textual parameters from an associated average parameter value determined for all textual lines within the page column;
determining, for each textual line in the page column, average variance based on the variance computed for each of the at least one of the plurality of preselected textual parameters; and
computing, for each textual line in the page column, a covariance based on difference between average variance of each textual line and an associated contiguous textual line within the page column.
6. The method of claim 5 further comprising identifying a set of contiguous lines based on the covariance computed for each textual line in the page column, wherein difference between covariance of contiguous textual lines in the set of contiguous lines is below a predefined threshold, and wherein the set of contiguous lines forms the table region.
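The computation recited in claims 5 and 6 can be sketched as follows. This is one possible reading, not the applicant's implementation: "variance" is taken here as the squared relative deviation of a line's parameter value from the page-column average, "covariance" as the absolute difference between the average variances of a line and the next line, and the names `TextLine`, `average_variances`, `covariances`, and `find_runs` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TextLine:
    pixel_length: float
    num_words: int
    word_gap_pixels: float  # total pixel space between adjacent words
    num_chars: int


PARAMS = ("pixel_length", "num_words", "word_gap_pixels", "num_chars")


def average_variances(lines):
    """Per line: mean over parameters of the squared relative deviation
    from the page-column average (an assumed reading of 'variance')."""
    if not lines:
        return []
    averages = {p: mean(getattr(line, p) for line in lines) for p in PARAMS}
    result = []
    for line in lines:
        deviations = []
        for p in PARAMS:
            scale = averages[p] or 1.0  # avoid division by zero
            deviations.append(((getattr(line, p) - averages[p]) / scale) ** 2)
        result.append(mean(deviations))
    return result


def covariances(avg_var):
    """Per line: absolute difference between its average variance and that
    of the next line; the last line reuses the previous difference."""
    if len(avg_var) < 2:
        return [0.0] * len(avg_var)
    diffs = [abs(a - b) for a, b in zip(avg_var, avg_var[1:])]
    return diffs + diffs[-1:]


def find_runs(cov, threshold, min_lines=2):
    """Return (start, end) inclusive ranges of contiguous lines whose
    covariance differs from the previous line's by less than `threshold`."""
    runs = []
    start = 0
    for i in range(1, len(cov) + 1):
        if i == len(cov) or abs(cov[i] - cov[i - 1]) >= threshold:
            if i - start >= min_lines:
                runs.append((start, i - 1))
            start = i
    return runs
```

Normalizing each deviation by the column average is a design choice made here so that pixel lengths and word counts contribute on a comparable scale; the claims do not specify any normalization. The `threshold` and `min_lines` values would have to be tuned per document set.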
7. The method of claim 1, wherein identifying the at least two rows and the at least two columns within the table region comprises:
identifying a plurality of sets of contiguous pixels comprising a predefined color within the table region;
comparing each of the plurality of sets of contiguous pixels along the horizontal direction of the document with a minimum row pixel threshold, to identify the at least two rows; and
comparing each of the plurality of sets of contiguous pixels along the vertical direction of the document with a minimum column pixel threshold, to identify the at least two columns.
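The row and column detection of claim 7 can be sketched on a small pixel grid. The sketch assumes the "predefined color" is the color of the table's ruling lines and that the image is a list of rows of pixel values; the function names `longest_run`, `find_separators`, and `cell_boxes` are hypothetical, and the final step of turning separators into cell boxes is added only to show how the result could feed cell identification.

```python
def longest_run(values, color):
    """Length of the longest contiguous run of `color` in a sequence."""
    best = current = 0
    for value in values:
        current = current + 1 if value == color else 0
        best = max(best, current)
    return best


def find_separators(image, color, min_row_pixels, min_col_pixels):
    """Return (row_indices, col_indices) of ruling lines.

    A pixel row counts as a row separator when it holds a horizontal run of
    `color` at least `min_row_pixels` long; a pixel column counts as a column
    separator when it holds a vertical run at least `min_col_pixels` long.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    rows = [y for y in range(height)
            if longest_run(image[y], color) >= min_row_pixels]
    cols = [x for x in range(width)
            if longest_run([image[y][x] for y in range(height)], color)
            >= min_col_pixels]
    return rows, cols


def cell_boxes(rows, cols):
    """Return (top, left, bottom, right) inclusive boxes for the areas
    enclosed between consecutive row and column separators."""
    boxes = []
    for top, bottom in zip(rows, rows[1:]):
        if bottom - top < 2:
            continue  # adjacent indices belong to the same thick line
        for left, right in zip(cols, cols[1:]):
            if right - left < 2:
                continue
            boxes.append((top + 1, left + 1, bottom - 1, right - 1))
    return boxes


# A 7x7 grid with ruling lines (value 1) on rows and columns 0, 3 and 6.
image = [[1 if y in (0, 3, 6) or x in (0, 3, 6) else 0 for x in range(7)]
         for y in range(7)]
rows, cols = find_separators(image, color=1, min_row_pixels=5, min_col_pixels=5)
print(rows, cols)                   # [0, 3, 6] [0, 3, 6]
print(len(cell_boxes(rows, cols)))  # 4
```

The same run-length test could be applied to background-colored runs to find the whitespace gutters of tables drawn without ruling lines; the claim leaves the choice of color open.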
8. The method of claim 1 further comprising identifying a header row for the table region based on homogeneity between a first textual line and a second textual line within the table region.
9. The method of claim 1 further comprising storing coordinates of each of the plurality of regions.
10. The method of claim 1 further comprising discarding at least one region from the plurality of regions for further analysis in response to identifying the plurality of regions, wherein each of the at least one region is not a table.
11. The method of claim 1 further comprising storing the text extracted from each of the plurality of cells in a predefined format.
12. A text extraction device for extracting text from a document, the text extraction device comprising:
a processor; and
a memory communicatively coupled to the processor, wherein the memory stores processor instructions, which, on execution, cause the processor to:
perform layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identify a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identify at least two rows and at least two columns within the table region;
identify a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extract text from each of the plurality of cells.
13. The text extraction device of claim 12, wherein at least one page column is identified based on a threshold number of characters and a threshold number of words associated with a page column width.
14. The text extraction device of claim 12, wherein to identify the table region, the processor instructions further cause the processor to:
compute a homogeneity index for each textual line in a page column within the page, wherein a homogeneity index for a textual line is computed based on a number of characters in the textual line and the plurality of preselected textual parameters; and
identify a set of contiguous textual lines having the same homogeneity index, wherein the set of contiguous textual lines forms the table region.
15. The text extraction device of claim 12, wherein to identify the table region, the processor instructions further cause the processor to:
determine, for each textual line in a page column within the page, values for the plurality of preselected textual parameters comprising at least one of pixel length, number of words, total pixel space between adjacent words, or number of characters;
compute, for each textual line in the page column, a variance of the value of at least one of the plurality of preselected textual parameters from an associated average parameter value determined for all textual lines within the page column;
determine, for each textual line in the page column, average variance based on the variance computed for each of the at least one of the plurality of preselected textual parameters; and
compute, for each textual line in the page column, a covariance based on difference between average variance of each textual line and an associated contiguous textual line within the page column.
16. The text extraction device of claim 15, wherein the processor instructions further cause the processor to identify a set of contiguous lines based on the covariance computed for each textual line in the page column, wherein difference between covariance of contiguous textual lines in the set of contiguous lines is below a predefined threshold, and wherein the set of contiguous lines forms the table region.
17. The text extraction device of claim 12, wherein to identify the at least two rows and the at least two columns within the table region, the processor instructions further cause the processor to:
identify a plurality of sets of contiguous pixels comprising a predefined color within the table region;
compare each of the plurality of sets of contiguous pixels along the horizontal direction of the document with a minimum row pixel threshold, to identify the at least two rows; and
compare each of the plurality of sets of contiguous pixels along the vertical direction of the document with a minimum column pixel threshold, to identify the at least two columns.
18. The text extraction device of claim 12, wherein the processor instructions further cause the processor to identify a header row for the table region based on homogeneity between a first textual line and a second textual line within the table region.
19. The text extraction device of claim 12, wherein the processor instructions further cause the processor to store the text extracted from each of the plurality of cells in a predefined format.
20. A non-transitory computer-readable storage medium having stored thereon, a set of computer-executable instructions causing a computer comprising one or more processors to perform steps comprising:
performing, by a text extraction device, layout analysis on the document to identify a plurality of regions within a plurality of pages in the document;
identifying, by the text extraction device, a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages, wherein the homogeneity is computed based on a plurality of preselected textual parameters associated with the plurality of textual lines;
identifying, by the text extraction device, at least two rows and at least two columns within the table region;
identifying, by the text extraction device, a plurality of cells within the table region based on the at least two rows and the at least two columns; and
extracting, by the text extraction device, text from each of the plurality of cells.
Dated this 18th day of May, 2017
Swetha SN
Of K&S Partners
Agent for the Applicant
IN/PA-2123
Description: TECHNICAL FIELD
This disclosure relates generally to text extraction and more particularly to methods and devices for extracting text from documents.
| # | Name | Date |
|---|---|---|
| 1 | Power of Attorney [18-05-2017(online)].pdf | 2017-05-18 |
| 2 | Form 5 [18-05-2017(online)].pdf | 2017-05-18 |
| 3 | Form 3 [18-05-2017(online)].pdf | 2017-05-18 |
| 4 | Form 18 [18-05-2017(online)].pdf_148.pdf | 2017-05-18 |
| 5 | Form 18 [18-05-2017(online)].pdf | 2017-05-18 |
| 6 | Form 1 [18-05-2017(online)].pdf | 2017-05-18 |
| 7 | Drawing [18-05-2017(online)].pdf | 2017-05-18 |
| 8 | Description(Complete) [18-05-2017(online)].pdf_147.pdf | 2017-05-18 |
| 9 | Description(Complete) [18-05-2017(online)].pdf | 2017-05-18 |
| 10 | REQUEST FOR CERTIFIED COPY [19-05-2017(online)].pdf | 2017-05-19 |
| 11 | PROOF OF RIGHT [11-07-2017(online)].pdf | 2017-07-11 |
| 12 | Correspondence by Agent_Form 1_13-07-2017.pdf | 2017-07-13 |
| 13 | abstract201741017499.jpg | 2017-07-17 |
| 14 | 201741017499-PETITION UNDER RULE 137 [16-04-2021(online)].pdf | 2021-04-16 |
| 15 | 201741017499-Information under section 8(2) [16-04-2021(online)].pdf | 2021-04-16 |
| 16 | 201741017499-FORM 3 [16-04-2021(online)].pdf | 2021-04-16 |
| 17 | 201741017499-FER_SER_REPLY [16-04-2021(online)].pdf | 2021-04-16 |
| 18 | 201741017499-FER.pdf | 2021-10-17 |
| 19 | 201741017499-US(14)-HearingNotice-(HearingDate-11-01-2024).pdf | 2023-12-13 |
| 20 | 201741017499-POA [22-12-2023(online)].pdf | 2023-12-22 |
| 21 | 201741017499-FORM 13 [22-12-2023(online)].pdf | 2023-12-22 |
| 22 | 201741017499-Correspondence to notify the Controller [22-12-2023(online)].pdf | 2023-12-22 |
| 23 | 201741017499-AMENDED DOCUMENTS [22-12-2023(online)].pdf | 2023-12-22 |
| 24 | 201741017499-Written submissions and relevant documents [26-01-2024(online)].pdf | 2024-01-26 |
| 25 | 201741017499-FORM-26 [26-01-2024(online)].pdf | 2024-01-26 |
| 26 | 201741017499-FORM 3 [26-01-2024(online)].pdf | 2024-01-26 |
| 27 | 201741017499-PatentCertificate02-02-2024.pdf | 2024-02-02 |
| 28 | 201741017499-IntimationOfGrant02-02-2024.pdf | 2024-02-02 |
| 1 | searchE_15-10-2020.pdf | |