
Systems And Methods For Tabular Data Extraction From An Image Of A Document

Abstract: Documents such as invoices typically contain relevant information in a tabular region. Conventional approaches to extract tabular data from an image necessitate that the tabular region have a grid structure with horizontal and vertical lines. An invoice may extend beyond a single page or may have more than one tabular region. The present disclosure addresses the technical problem of identifying a tabular region in a received image of a document. A white space approach detects the tabular region if there is no grid structure. Text edges of text elements in the tabular region are used to determine coordinates of the tabular region. The text elements are then grouped based on associated Y-coordinates and the text edges to form candidate rows and candidate columns, respectively. The text elements are then localized within a plurality of cells defined by the intersecting candidate columns and rows to generate a final table. [To be published with FIG.6E]


Patent Information

Application #:
Filing Date: 09 February 2021
Publication Number: 32/2022
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: kcopatents@khaitanco.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 07 November 2024
Renewal Date:

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai Maharashtra India 400021

Inventors

1. MANDAL, Indrajit
Tata Consultancy Services Limited Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal India 700160
2. NARAYANABHATLA, Hari Teja
Tata Consultancy Services Limited Unit-VIII & IX, Think Campus, KIADB Industrial Estate, Electronic City, Phase-II, Bangalore Karnataka India 560100

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
SYSTEMS AND METHODS FOR TABULAR DATA EXTRACTION FROM AN IMAGE OF A DOCUMENT
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The disclosure herein generally relates to the field of image processing, and, more particularly, to systems and methods for extracting tabular data from an image of a document.
BACKGROUND
[002] There is an increasing demand for automation in the Business Process Solutions (BPS) industry to reduce human error caused by manual work and to increase the efficiency of services provided. An important area of work is invoice processing, which faces problems with poor data entry. An organization receives or generates a large volume of invoices per day. It is challenging to keep track of all these invoices, and hence a large workforce is required to convert and store data in spreadsheets for efficient and easy storage. One of the most important parts of an invoice is the tabular region, which typically contains itemized details of transactions between a seller and a buyer. This data is stored manually by the workforce in the BPS industry, which incurs significant capital and a huge amount of time in processing the large volume of invoices.
[003] Large deployment of human resources incurs huge operational costs in terms of setting up an Offshore Development Center (ODC) involving physical space and setup cost. Besides being time consuming and stressful, activities like invoice processing are prone to human error. State of the art methods attempt to extract data from tabular regions in invoices only if the tabular region is defined by bounding boxes around it. Again, invoices may not necessarily be confined to a single page. Furthermore, invoices may exist in both structured and non-structured forms. These are some of the challenges that automation needs to address.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.

[005] In an aspect, there is provided a processor implemented method comprising the steps of: receiving, via one or more hardware processors, an image of a document corresponding to an entity; auto-detecting, via the one or more hardware processors, at least one tabular region in the received image using one or more Computer Vision based methods; generating, via the one or more hardware processors, a final table corresponding to each of the at least one tabular region using one of: a line-based approach if the auto detected at least one tabular region includes horizontal and vertical lines; or a white space approach if no horizontal and vertical lines are detected in the at least one tabular region, wherein the white space approach comprises: identifying text edges of text elements extracted using an Optical Character Recognition (OCR) engine in the at least one tabular region based on one or more predefined criteria; determining coordinates of the at least one tabular region based on (i) the coordinates associated with a top left corner and a bottom right corner of the text elements comprised therein, (ii) associated text edges and (iii) a predefined buffer space to account for the text elements that extend beyond the at least one tabular region; grouping the text elements within the determined coordinates of the at least one tabular region into candidate rows based on associated Y-coordinates including a predefined first threshold, thereby identifying a straight line representing an edge for each row; grouping the text elements within the determined coordinates of the at least one tabular region into candidate columns based on the identified text edges and a predefined second threshold, thereby identifying a straight line representing an edge for each column; merging the one or more candidate rows in the event that (i) the one or more candidate rows do not contain the text elements in a stub column and contain the text elements in a preceding row where the stub 
column is a non-empty cell and (ii) an intra-distance between the text elements in the one or more candidate rows is lesser than the inter-distance therebetween; generating the final table corresponding to each of the at least one tabular region using the candidate rows and the candidate columns; and localizing the text elements within a plurality of cells defined by the intersecting candidate columns and the candidate rows based on coordinates of the text elements and the coordinates of the plurality of cells and performing error
correction in the generated final table based on a percentage of the text elements located beyond a cell in the plurality of cells.
[006] In another aspect, there is provided a system comprising: one or more data storage devices operatively coupled to one or more hardware processors and configured to store instructions configured for execution via the one or more hardware processors to: receive an image of a document corresponding to an entity; auto-detect at least one tabular region in the received image using one or more Computer Vision based methods; generate a final table corresponding to each of the at least one tabular region using one of: a line-based approach if the auto detected at least one tabular region includes horizontal and vertical lines; or a white space approach if no horizontal and vertical lines are detected in the at least one tabular region, wherein the white space approach comprises: identifying text edges of text elements extracted using an Optical Character Recognition (OCR) engine in the at least one tabular region using one or more predefined criteria based on i) a plurality of rows having an edge type including a left edge, a right edge or a no edge, the no edge text being representative of a row with centered text having center points of text aligned along a vertical line and ii) a predefined minimum number of rows of text elements being aligned along a vertical line; determining coordinates of the at least one tabular region based on (i) the coordinates associated with a top left corner and a bottom right corner of the text elements comprised therein, (ii) associated text edges and (iii) a predefined buffer space to account for the text elements that extend beyond the at least one tabular region; grouping the text elements within the determined coordinates of the at least one tabular region into candidate rows based on associated Y-coordinates including a predefined first threshold, thereby identifying a straight line representing an edge for each row; grouping the text elements within the determined coordinates of the at least one tabular region into candidate
columns based on the identified text edges and a predefined second threshold, thereby identifying a straight line representing an edge for each column; merging the one or more candidate rows in the event that (i) the one or more candidate rows do not contain the text elements in a stub column and contain the text elements in a preceding row where the stub column is a non-empty
cell and (ii) an intra-distance between the text elements in the one or more candidate rows is lesser than the inter-distance therebetween; generating the final table corresponding to each of the at least one tabular region using the candidate rows and the candidate columns; and localizing the text elements within a plurality of cells defined by the intersecting candidate columns and the candidate rows based on coordinates of the text elements and the coordinates of the plurality of cells and performing error correction in the generated final table based on a percentage of the text elements located beyond a cell in the plurality of cells.
[007] In yet another aspect, there is provided a computer program product comprising a non-transitory computer readable medium having a computer readable program embodied therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: receive an image of a document corresponding to an entity; auto-detect at least one tabular region in the received image using one or more Computer Vision based methods; generate a final table corresponding to each of the at least one tabular region using one of: a line-based approach if the auto detected at least one tabular region includes horizontal and vertical lines; or a white space approach if no horizontal and vertical lines are detected in the at least one tabular region, wherein the white space approach comprises: identifying text edges of text elements extracted using an Optical Character Recognition (OCR) engine in the at least one tabular region using one or more predefined criteria based on i) a plurality of rows having an edge type including a left edge, a right edge or a no edge, the no edge text being representative of a row with centered text having center points of text aligned along a vertical line and ii) a predefined minimum number of rows of text elements being aligned along a vertical line; determining coordinates of the at least one tabular region based on (i) the coordinates associated with a top left corner and a bottom right corner of the text elements comprised therein, (ii) associated text edges and (iii) a predefined buffer space to account for the text elements that extend beyond the at least one tabular region; grouping the text elements within the determined coordinates of the at least one tabular region into candidate rows based on associated Y-coordinates including a predefined first threshold, thereby identifying a straight line
representing an edge for each row; grouping the text elements within the determined coordinates of the at least one tabular region into candidate columns based on the identified text edges and a predefined second threshold, thereby identifying a straight line representing an edge for each column; merging the one or more candidate rows in the event that (i) the one or more candidate rows do not contain the text elements in a stub column and contain the text elements in a preceding row where the stub column is a non-empty cell and (ii) an intra-distance between the text elements in the one or more candidate rows is lesser than the inter-distance therebetween; generating the final table corresponding to each of the at least one tabular region using the candidate rows and the candidate columns; and localizing the text elements within a plurality of cells defined by the intersecting candidate columns and the candidate rows based on coordinates of the text elements and the coordinates of the plurality of cells and performing error correction in the generated final table based on a percentage of the text elements located beyond a cell in the plurality of cells.
[008] In accordance with an embodiment of the present disclosure, the at least one tabular region in the received image is characterized by (i) a grid structure with horizontal and vertical lines, (ii) absence of the horizontal and vertical lines, (iii) the at least one tabular region extending beyond a page in the received image, (iv) the at least one tabular region limited to a page length, or a combination thereof.
[009] In accordance with an embodiment of the present disclosure, the one or more hardware processors are further configured to eliminate the text elements in the received image that are devoid of predefined table characteristics and represent noise prior to the step of identifying text edges of text elements, wherein the predefined table characteristics are related to alignment of the text elements and spaces therebetween.
[010] In accordance with an embodiment of the present disclosure, the (i) the predefined first threshold includes variation in the Y-coordinates on account of the text elements corresponding to subscripts, superscripts and offset text elements in the event that the received image is an electronic form of a hand written
document; and (ii) the predefined second threshold accounts for inter-distance between X-coordinates of the text elements.
[011] In accordance with an embodiment of the present disclosure, the one or more hardware processors are further configured to extract an identifier associated with the entity prior to generating the final table using the line-based approach or the white space approach.
[012] In accordance with an embodiment of the present disclosure, the one or more hardware processors are further configured to extract the identifier, wherein the identifier is a logo by: identifying a bounding box across a maximum pixel density location representative of the logo within the received image using Gaussian blur, wherein the maximum pixel density location is obtained using Principal Component Analysis (PCA); obtaining approximate coordinates of the logo using a heuristic approach based on one or more statistical methods including mean, mode, variance and standard deviation on a pixel matrix of the received image; and optimizing the obtained approximate coordinates of the logo to extract the logo having a maximum coordinate and a minimum coordinate associated with a left, right, top and bottom edge, the optimizing comprises reducing the maximum coordinate associated with each of the left, right, top and bottom edge iteratively by a unit value till an estimated heuristic value is reached to obtain optimum coordinates associated thereof, such that the optimum coordinates are greater than the minimum coordinate.
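For illustration only, the iterative edge optimization described above may be sketched as follows. This is a simplified, non-limiting example: the Gaussian blur and PCA localization steps are omitted, the function names and the density heuristic are illustrative assumptions, and `pixels` is assumed to be a binarized matrix (1 = ink, 0 = background).

```python
# Illustrative sketch of the edge-tightening idea: starting from loose
# (maximum) logo coordinates, each edge is moved inward by one unit at
# a time until the enclosed ink density reaches a heuristic target,
# without crossing the minimum coordinates.

def ink_density(pixels, left, top, right, bottom):
    """Fraction of ink pixels inside the box (inclusive bounds)."""
    total = ink = 0
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            total += 1
            ink += pixels[y][x]
    return ink / total if total else 0.0

def tighten_box(pixels, box_max, box_min, target_density=0.2):
    """Shrink each edge of box_max toward box_min until the heuristic
    density target is reached (illustrative threshold)."""
    left, top, right, bottom = box_max
    min_left, min_top, min_right, min_bottom = box_min
    while ink_density(pixels, left, top, right, bottom) < target_density:
        moved = False
        if left < min_left:
            left += 1; moved = True
        if top < min_top:
            top += 1; moved = True
        if right > min_right:
            right -= 1; moved = True
        if bottom > min_bottom:
            bottom -= 1; moved = True
        if not moved:  # minimum coordinates reached; stop shrinking
            break
    return left, top, right, bottom
```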
[013] In accordance with an embodiment of the present disclosure, the one or more hardware processors are further configured to perform one of: generating the final table corresponding to each of the at least one tabular region, in the event of a match between the extracted identifier and one of a plurality of identifiers in a database comprised in the one or more data storage devices; or generating the final table corresponding to each of the at least one tabular region, in the event of no match between the extracted identifier and one of the plurality of identifiers using one of the line-based approach or the white space approach; wherein the database comprises (i) the plurality of identifiers mapped to associated entities, (ii) the document associated with each of the entities, and (iii) one or more rules
representative of the coordinates of the at least one tabular region for the document associated with each of the entities.
[014] In accordance with an embodiment of the present disclosure, the one or more hardware processors are further configured to generate the final table corresponding to each of the at least one tabular region, in the event of a match between the extracted identifier and one of a plurality of identifiers in the database by defining the final table corresponding to each of the at least one tabular region by selecting a rule from the one or more rules associated with a corresponding entity in the database, wherein the selected rule provides coordinates of the at least one tabular region that maps to the auto-detected at least one tabular region.
[015] In accordance with an embodiment of the present disclosure, the one or more hardware processors are further configured to perform the line-based approach by: identifying the horizontal and the vertical lines in the at least one tabular region based on orientation derived using pixel values thereof; defining boundaries of the at least one tabular region using the identified horizontal and vertical lines; determining coordinates for a plurality of cells comprised within the defined boundaries based on intersection of the identified horizontal lines and the vertical lines; generating the final table corresponding to each of the at least one tabular region comprising the determined plurality of cells; identifying text elements to be placed within each of the determined plurality of cells using the OCR engine and a heuristically determined third threshold; and performing error correction in the generated table based on a percentage of the text elements located beyond a cell in the plurality of cells by comparing the determined coordinates for each cell in the plurality of cells and coordinates of the identified text elements placed within each of the determined plurality of cells to obtain the final table.
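For illustration only, the cell-construction step of the line-based approach may be sketched as follows. The sketch assumes the horizontal and vertical line positions have already been detected upstream (e.g. by a Computer Vision package) and that the OCR engine supplies text element coordinates; all names are illustrative.

```python
# Cell coordinates follow from the intersections of detected lines:
# consecutive pairs of horizontal y-positions and vertical x-positions
# bound each cell of the final table.

def cells_from_lines(h_lines, v_lines):
    """Return a grid of (left, top, right, bottom) cell boxes defined
    by consecutive pairs of detected lines."""
    h = sorted(h_lines)
    v = sorted(v_lines)
    return [
        [(v[c], h[r], v[c + 1], h[r + 1]) for c in range(len(v) - 1)]
        for r in range(len(h) - 1)
    ]

def locate_cell(cells, x, y):
    """Row/column indices of the cell containing point (x, y),
    e.g. the centre of an OCR'd text element."""
    for r, row in enumerate(cells):
        for c, (l, t, rgt, b) in enumerate(row):
            if l <= x < rgt and t <= y < b:
                return r, c
    return None  # text element falls outside every cell
```

A text element whose centre returns `None` here is the kind of out-of-cell element counted in the error-correction step.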
[016] In accordance with an embodiment of the present disclosure, the one or more hardware processors are further configured to perform one or more of the following steps prior to the step of merging the one or more candidate rows: eliminating one or more rows having justified text elements identified based on number of identified text edges, associated edge type and number of the text
elements contained in the one or more rows; and eliminating one or more rows identified as i) super-header row, ii) sub-header row, and iii) caption row.
[017] In accordance with an embodiment of the present disclosure, the one or more hardware processors are further configured to perform one or more of: receiving user inputs pertaining to modifications of the coordinates of the at least one tabular region; updating the one or more rules in the database based on the received user inputs; mapping a rule that is representative of the coordinates of the generated final table obtained in the event of no match, to an associated entity in the database; and extracting tabular data from the generated final table corresponding to each of the at least one tabular region.
[018] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[019] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[020] FIG.1 illustrates an exemplary block diagram of a system for tabular data extraction from an image of a document, in accordance with some embodiments of the present disclosure.
[021] FIG.2A through FIG.2D illustrate an exemplary flow diagram of a computer implemented method for tabular data extraction from an image of a document, in accordance with some embodiments of the present disclosure.
[022] FIG.3 illustrates an image of an exemplary invoice with a grid structure having horizontal and vertical lines.
[023] FIG.4 illustrates an image of an exemplary invoice having no horizontal and vertical lines.
[024] FIG.5A through FIG.5B illustrate an image of an exemplary invoice without a grid structure that extends beyond a page.

[025] FIG.6A and FIG.6B illustrate determining coordinates of the tabular region without and with a predefined buffer space respectively, in accordance with some embodiments of the present disclosure.
[026] FIG.6C illustrates identified text edges and straight lines representing edges for each column in the image of the exemplary invoice of FIG.4, in accordance with some embodiments of the present disclosure.
[027] FIG.6D illustrates a tabular region determined, in accordance with some embodiments of the present disclosure.
[028] FIG.6E illustrates candidate rows and the candidate columns defining a plurality of cells with localized text elements comprised in a final table, in accordance with some embodiments of the present disclosure.
[029] FIG.6F illustrates extracted tabular data from the final table of FIG.6E, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[030] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[031] Invoice processing is an important activity in the Business Process Solutions (BPS) industry that is challenged by problems such as human error in data entry and inefficient storage of incoming invoices. The large workforce needed incurs huge operational cost without an assurance of accuracy. Also, manual invoice processing is a time consuming and tiring job.
[032] Applicant provides a Machine First Delivery Model (MFDM™) approach, wherein the manual process of invoice processing is automated in an intelligent manner. In the context of the present disclosure, invoice processing represents extracting tabular data from documents such as invoices. Invoices
typically contain key information in a tabular format. Also, invoices are received as an image in PDF format. The region of interest in the image is the tabular region. State of the art approaches work with invoices wherein the tabular region is defined with a grid structure. Conventional approaches cannot be generalized for all types of invoices. For instance, an invoice may have a grid structure but without horizontal and vertical lines. Sometimes, an invoice may extend beyond a single page. Furthermore, invoices may sometimes have more than one tabular region. The present disclosure addresses the technical problem of identifying at least one tabular region in a received image of a document such as an invoice that may belong to any of the types mentioned herein. Although the description hereinafter focuses on data extraction from tabular regions in an invoice, it may be understood by those skilled in the art that the method and system described in the present disclosure may be applied to any document having one or more tabular regions.
[033] Referring now to the drawings, and more particularly to FIG. 1 through FIG.6F, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[034] FIG.1 illustrates an exemplary block diagram of a system 100 for tabular data extraction from an image of a document, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, graphics controllers, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) are configured to fetch and execute computer-readable instructions stored in the memory. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used
interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[035] I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface(s) can include one or more ports for connecting a number of devices to one another or to another server.
[036] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, one or more modules (not shown) of the system 100 can be stored in the memory 102.
[037] FIG.2A through FIG.2D illustrate an exemplary flow diagram of a computer implemented method 200 for tabular data extraction from an image of a document, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions configured for execution of steps of the method 200 by the one or more hardware processors 104. The steps of the method 200 will now be explained in detail with reference to the components of the system 100 of FIG.1. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.

[038] Accordingly, in an embodiment of the present disclosure, the one or more hardware processors 104, are configured to receive, at step 202, an image of a document corresponding to an entity, typically associated with an identifier. In an embodiment, the document may be an invoice. Typically, documents such as invoices are received in a Portable Document Format (PDF). If the document being processed by the system of the present disclosure is a PDF, it is converted to an image for ease of extraction of data. In an embodiment, the entity may be a company or an organization associated with the received document.
[039] In accordance with the present disclosure, the at least one tabular region within the received image is characterized by (i) a grid structure with defined horizontal and vertical lines, (ii) absence of the horizontal and vertical lines, (iii) the at least one tabular region extending beyond a page (iv) the at least one tabular region limited to a page length, or a combination thereof. FIG.3 illustrates an image of an exemplary invoice with a grid structure having horizontal and vertical lines. FIG.4 illustrates an image of an exemplary invoice having no horizontal and vertical lines. FIG.5A through FIG.5B illustrate an image of an exemplary invoice without a grid structure that extends beyond a page. The exemplary invoices illustrated in both FIG.3 and FIG.4 are limited to a page length.
[040] In an embodiment of the present disclosure, the one or more hardware processors 104, are configured to auto-detect, at step 204, at least one tabular region in the received image using one or more Computer Vision based methods such as a Python based package, a Java based package or a DotNet based package.
[041] In an embodiment of the present disclosure, the one or more hardware processors 104, are configured to generate, at step 206, a final table corresponding to each of the at least one tabular region auto detected in step 204. The final table may be generated using one of two approaches depending on the presence or absence of horizontal and vertical lines forming the grid in the autodetected at least one tabular region. At step 206a, the final table is generated using a line-based approach in the event that horizontal and vertical lines are autodetected. Alternatively, at step 206b, the final table is generated using a white space approach in the event that no horizontal and vertical lines are detected.
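For illustration only, the decision between steps 206a and 206b may be sketched as follows. Here the presence of ruling lines is approximated by long runs of ink in a binarized pixel matrix; an actual implementation would rely on a Computer Vision package, and the threshold is an illustrative assumption.

```python
# A region is routed to the line-based path only if both horizontal
# and vertical ruling lines appear present; otherwise the white space
# approach is used.

def has_grid_lines(pixels, min_frac=0.8):
    """True if some row AND some column of the binarized matrix is
    almost entirely ink, suggesting horizontal and vertical lines."""
    height, width = len(pixels), len(pixels[0])
    has_h = any(sum(row) >= min_frac * width for row in pixels)
    has_v = any(
        sum(pixels[y][x] for y in range(height)) >= min_frac * height
        for x in range(width)
    )
    return has_h and has_v
```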
[042] The white space approach of step 206b, in accordance with an embodiment of the present disclosure, will now be explained with reference to the image of an exemplary invoice illustrated in FIG.4, the method steps 206b1 through 206b7 being illustrated in FIG.6A through FIG.6F.
[043] In accordance with the present disclosure, the text elements are analyzed in a left-to-right reading manner to identify text edges. Accordingly, in an embodiment, the text edges of the text elements that may be comprised in the at least one tabular region are identified, at step 206b1, based on the following predefined criteria:
i) A pattern associated with a table is that text in a table is typically left aligned, right aligned or center aligned. Accordingly, the one or more rows may be characterized by an edge type such as a left edge, a right edge or a no edge. In the context of the present disclosure, a no edge row represents a row wherein center points of the text are aligned along a vertical line, a scenario that arises when there is a single numerical value or a single word (refer column with header ‘QTY’ in FIG.4) in the row. Text edges may be defined to exist in locations where a plurality of rows have either left edges, right edges or no edges.
ii) Text edges of the text elements may be considered when a predefined minimum number of rows of text elements are aligned along a vertical line, for instance, when there are two or more rows aligned along a vertical line.
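The criteria above may be illustrated by a minimal Python sketch (illustrative only, not part of the claimed method; the representation of a text element as a dict with pixel coordinates x0, y0, x1, y1, the alignment tolerance and the minimum row count are assumptions):

```python
from collections import defaultdict

def identify_text_edges(elements, tol=3, min_rows=2):
    """Find X-positions where the left edges, right edges or centers
    ("no edge" rows, i.e. center-aligned text) of text elements from at
    least `min_rows` distinct rows align within `tol` pixels."""
    edges = []
    for kind, key in (("left", lambda e: e["x0"]),
                      ("right", lambda e: e["x1"]),
                      ("center", lambda e: (e["x0"] + e["x1"]) / 2)):
        buckets = defaultdict(set)            # quantized X -> set of row Ys
        for e in elements:
            buckets[round(key(e) / tol)].add(e["y0"])
        for bx, rows in buckets.items():
            if len(rows) >= min_rows:         # predefined minimum number of rows
                edges.append((kind, bx * tol))
    return edges
```

A column whose entries are left aligned across two or more rows thus yields a single "left" text edge at that X-position.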
[044] In an embodiment, text elements in the received image are extracted using an Optical Character Recognition (OCR) engine. In an embodiment, the one or more hardware processors 104, are configured to eliminate the text elements in the received image that are devoid of predefined table characteristics and represent noise, wherein the predefined table characteristics are related to alignment of the text elements and spaces therebetween. This step may be executed prior to the step 206b1 of identifying the text edges. For instance, if there are text elements proximate the autodetected at least one tabular region that do not exhibit typical table characteristics, such as text elements being aligned along a vertical axis or text elements resembling column formation based on spaces between the text elements, and the like, then such text elements are eliminated from consideration. This helps reduce the search space within the received image.
[045] In an embodiment of the present disclosure, the one or more hardware processors 104, are configured to determine, at step 206b2, coordinates of the at least one tabular region based on the coordinates associated with a top left corner and a bottom right corner of the text elements comprised therein and associated text edges. FIG.6A illustrates determining coordinates of the tabular region, in accordance with some embodiments of the present disclosure. Sometimes, text may extend beyond a tabular region and may have been eliminated when identifying text edges. In an embodiment, a predefined buffer space may be added to the determined coordinates to account for variants such as the text extending beyond the determined coordinates of the at least one tabular region. FIG.6B illustrates determining coordinates of the tabular region with the predefined buffer space, in accordance with some embodiments of the present disclosure.
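The determination of the region coordinates with a buffer space may be sketched as follows (illustrative only; the buffer value and the dict-based element representation are assumptions, not specified by the disclosure):

```python
def table_region_bbox(elements, buffer=10):
    """Bounding box of a tabular region from its text elements' top-left
    and bottom-right corners, expanded by a predefined buffer space to
    account for text extending slightly beyond the region."""
    x0 = min(e["x0"] for e in elements) - buffer
    y0 = min(e["y0"] for e in elements) - buffer
    x1 = max(e["x1"] for e in elements) + buffer
    y1 = max(e["y1"] for e in elements) + buffer
    return x0, y0, x1, y1
```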
[046] Further, the one or more hardware processors 104 are configured to generate candidate rows and candidate columns within the determined coordinates of the at least one tabular region. At step 206b3, the text elements are grouped into candidate rows based on associated Y-coordinates including a predefined first threshold, thereby identifying a straight line representing an edge for each row.
[047] In an embodiment, the predefined first threshold includes variation in the Y-coordinates on account of the text elements corresponding to subscripts, superscripts and offset text elements in the event that the received image is an electronic form of a hand written document or invoice where the text elements in a row may not be perfectly aligned (offset). For instance, text elements that form a row at coordinates (120, 120) and text elements that form a row at coordinates (120, 121) are grouped into a single row. This enables identifying a straight line representing each row.
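The row grouping of step 206b3 may be sketched as follows (a minimal illustration; the threshold value and element representation are assumptions):

```python
def group_rows(elements, first_threshold=2):
    """Group text elements into candidate rows: elements whose top
    Y-coordinates differ by at most `first_threshold` pixels (absorbing
    subscripts, superscripts and slightly offset text) share a row."""
    rows = []
    for e in sorted(elements, key=lambda e: e["y0"]):
        if rows and abs(e["y0"] - rows[-1][-1]["y0"]) <= first_threshold:
            rows[-1].append(e)       # within threshold: same candidate row
        else:
            rows.append([e])         # outside threshold: new candidate row
    return rows
```

For instance, elements at Y-coordinates 120 and 121 fall into one candidate row, while an element at 150 starts a new one.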
[048] In an embodiment, at step 206b4, the text elements are grouped into candidate columns based on the identified text edges and a predefined second threshold, thereby identifying a straight line representing an edge for each column. In an embodiment, possible vertical lines representing edges for each column may be adjusted based on the ending right indentation of each row. In an embodiment, the predefined second threshold accounts for the inter-distance between X-coordinates of the text elements. For instance, columns with column headers like “Description” or “Line item” or “Item Description” typically include long text. Further, words in column names like ‘Item Description’ have very close coordinates that may be located within the predefined second threshold. In an embodiment, the grouping of text elements into candidate columns comprises forming sentences using the text elements based on average space length therebetween. The average space lengths between words (text elements) determine which words can be merged into a single sentence. In an embodiment, inter-distance between rows may be used to decide length of a column. Maximum length of the text elements within a column defines the column width. FIG.6C illustrates identified text edges and straight lines representing edges for each column in the image of the exemplary invoice of FIG.4, in accordance with some embodiments of the present disclosure.
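The column grouping of step 206b4 may be sketched as follows (illustrative only; this sketch clusters elements on their left edges with a second threshold, and omits the sentence-forming and right-indentation adjustments described above):

```python
def group_columns(rows, second_threshold=15):
    """Group text elements into candidate columns: an element joins an
    existing column if its left edge lies within `second_threshold`
    pixels of that column's edge, otherwise it starts a new column."""
    columns = []                         # list of (edge_x, [elements])
    for row in rows:
        for e in row:
            for col in columns:
                if abs(e["x0"] - col[0]) <= second_threshold:
                    col[1].append(e)     # close to an existing column edge
                    break
            else:
                columns.append((e["x0"], [e]))
    return sorted(columns, key=lambda c: c[0])
```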
[049] In an embodiment of the present disclosure, the one or more hardware processors 104, are configured to merge, at step 206b5, the one or more candidate rows in the event that (i) the one or more candidate rows do not contain the text elements in a stub column and contain the text elements in a preceding row where the stub column is a non-empty cell (refer Description for Op.11 in FIG.5A, which has 2 rows with no text element in the stub column for the second row) and (ii) an intra-distance between the text elements in the one or more candidate rows is lesser than the inter-distance therebetween. When the intra-distance between the text elements is less, the one or more rows are considered to be wrapped text belonging to a cell. This ensures that text intended to be within a row is not split into separate rows.
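Condition (i) of the merge step may be sketched as follows (a minimal illustration on rows already rendered as lists of cell strings; the intra-/inter-distance check of condition (ii) is omitted for brevity):

```python
def merge_wrapped_rows(rows, stub_col=0):
    """Merge a candidate row into the preceding one when the row's stub
    (first) column cell is empty while the preceding row's is not,
    treating it as wrapped text belonging to the same logical row."""
    merged = []
    for row in rows:
        if merged and row[stub_col] == "" and merged[-1][stub_col] != "":
            # wrapped continuation: append cell-wise to the previous row
            merged[-1] = [a + (" " + b if b else "")
                          for a, b in zip(merged[-1], row)]
        else:
            merged.append(list(row))
    return merged
```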

[050] In an embodiment, the step 206b5 of merging the one or more candidate rows may be preceded by one or more of eliminating one or more rows having justified text elements and eliminating one or more rows identified as i) super-header row, ii) sub-header row, and iii) caption row.
[051] In an embodiment, the justified text elements are identified based on the number of identified text edges, associated edge type and the number of text elements contained in the one or more rows. For instance, a page that has three text columns side by side contains six edges on each horizontal line drawn through the page width. Such rows of text elements may be mistaken to be part of a tabular region.
[052] In an embodiment, the super-header row may be identified as a first row of text elements in the at least one tabular region such as “Date and Time”. The sub-header row may be identified as a row wherein the intra-distance between the super-header row and the row being considered is less than the inter-distance between rows. For example, the row having sub-columns “Date” and “Time” under the column “Date and Time” is a sub-header row. Sometimes, the font size of the text elements in the sub-header row is less than that of the text elements in the super-header row. Accordingly, in an embodiment, this aspect may be considered for identifying the sub-header row. The caption row may be identified as one or more rows that are placed outside of the at least one tabular region at the bottom and typically prefixed with a text element that can be extracted using NER techniques. For instance, in the invoice illustrated in FIG.4, the “NOTES” at the bottom of the table may be identified as the caption row. Again, the font size of the text elements in the caption rows is less than that of the text elements within the tabular region. In an embodiment, this aspect may be considered for identifying the caption rows.
[053] In an embodiment of the present disclosure, the one or more hardware processors 104, are configured to generate, at step 206b6, a final table corresponding to each of the at least one tabular region using the candidate rows and the candidate columns. FIG.6D illustrates a generated tabular region, in accordance with some embodiments of the present disclosure. The text elements within a plurality of cells defined by the intersecting candidate columns and the candidate rows are then localized at step 206b7 based on coordinates of the text elements and the coordinates of the plurality of cells. In an embodiment, error correction may be performed in the generated final table based on a percentage of the text elements located beyond a cell in the plurality of cells. FIG.6E illustrates candidate rows and candidate columns defining a plurality of cells with localized text elements comprised in a final table, in accordance with some embodiments of the present disclosure.
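The localization of step 206b7 may be sketched as follows (illustrative only; placing each element into the cell whose row and column intervals contain the element's center point is one plausible reading of localization by coordinates, and the interval representation is an assumption):

```python
def localize(elements, row_bounds, col_bounds):
    """Place each text element into the cell whose row interval (Y) and
    column interval (X) contain the element's center point."""
    table = [["" for _ in col_bounds] for _ in row_bounds]
    for e in elements:
        cx = (e["x0"] + e["x1"]) / 2
        cy = (e["y0"] + e["y1"]) / 2
        for i, (ry0, ry1) in enumerate(row_bounds):
            for j, (cx0, cx1) in enumerate(col_bounds):
                if ry0 <= cy < ry1 and cx0 <= cx < cx1:
                    table[i][j] = (table[i][j] + " " + e["text"]).strip()
    return table
```

An element whose center falls outside every cell is left unplaced; counting such elements gives the percentage used for error correction.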
[054] In accordance with an embodiment of the present disclosure, by eliminating the text elements that represent noise, the processing of the document is optimized because of the reduced search space. Again, in an embodiment, the complexity in searching for one or more tabular regions within the image and the processing time taken are reduced by extracting the identifier from the received image. Accordingly, in an embodiment, the step of generating the final table using the line-based approach or the white space approach is preceded by extracting, via the one or more hardware processors, an identifier associated with the entity. The identifier may be a name of the entity, a billing address or any other information associated with the entity that may be typically extracted using Named-entity recognition (NER) techniques. The identifier may also be a logo of the entity. In an embodiment, the method for extracting the logo of the entity may be based on the Applicant’s Patent Application No.202021021574 titled ‘Systems and methods for offline signature identification and extraction’, wherein a Principal Component Analysis (PCA) based approach provided for identification of a signature from an image may also be applied for identification of the logo. The Application may be referred to for details. However, for completeness of the working of the present disclosure, the method of identification of a signature, in an embodiment based on the Applicant’s Patent Application No.202021021574, is provided herein. In an embodiment, the step comprises identifying a bounding box across a maximum pixel density location representative of the logo within the received image using Gaussian blur, wherein the maximum pixel density location is obtained using Principal Component Analysis (PCA). Approximate coordinates of the logo are obtained using a heuristic approach based on one or more statistical methods including mean, mode, variance and standard deviation on a pixel matrix of the received image. The obtained approximate coordinates of the logo are then optimized to extract the logo having a maximum coordinate and a minimum coordinate associated with a left, right, top and bottom edge, wherein the optimizing comprises reducing the maximum coordinate associated with each of the left, right, top and bottom edges iteratively by a unit value till an estimated heuristic value is reached to obtain optimum coordinates associated therewith, such that the optimum coordinates are greater than the minimum coordinate.
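The iterative reduction of one edge coordinate may be sketched as follows (illustrative only; the heuristic predicate `heuristic_ok` is a hypothetical stand-in for the estimated heuristic value of the referenced Application, and the unit step is an assumption):

```python
def shrink_edge(coord_max, coord_min, heuristic_ok, step=1):
    """Reduce one edge coordinate of the candidate logo box by a unit
    value until the estimated heuristic is satisfied, never crossing
    the minimum coordinate for that edge."""
    while coord_max - step > coord_min and not heuristic_ok(coord_max):
        coord_max -= step
    return coord_max
```

The same shrink would be applied to each of the left, right, top and bottom edges in turn to obtain the optimum coordinates.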
[055] In accordance with the present disclosure, a database comprised in the one or more data storage devices 102 includes: (i) a plurality of identifiers mapped to associated entities, (ii) the document associated with each of the entities, and (iii) one or more rules representative of the coordinates of the at least one tabular region for the document associated with each of the entities. In an embodiment, the database may comprise a plurality of master tables, each table serving a purpose. For instance, a Files table may store document details including an associated image, text objects within the document, tabular regions in the document and the like. A Rule table may store the one or more rules in formats such as JSON, .XLSX, .CSV, and the like. A Jobs table may store changes that are made to the coordinates of the tabular region or any other user input. The Jobs table may also include a log associated with the image of the document along with start time and end time for processing the document.
[056] In accordance with the present disclosure, after the identifier is extracted, depending on a match between the extracted identifier and one of the plurality of identifiers in the database, the final table corresponding to each of the at least one tabular region is generated. In an embodiment, if there is a match, the final table corresponding to each of the at least one tabular region is defined by selecting a rule from the one or more rules associated with a corresponding entity in the database, wherein the selected rule provides coordinates of the at least one tabular region that maps to the auto-detected at least one tabular region.
[057] In the event that there is no match between the extracted identifier and one of the plurality of identifiers, the final table is generated using one of the white space approach described above or the line-based approach. In an embodiment of the present disclosure, the line-based approach comprises identifying, at step 206a1, the horizontal and the vertical lines in the at least one tabular region based on orientation derived using associated pixel values. For instance, if in a given Region Of Interest (ROI), generally the same pixel values occur consecutively, then they define straight lines in the ROI. If the X-coordinate is fixed and the Y-coordinate changes, then vertical lines are identified, and vice-versa. In an embodiment, at step 206a2, boundaries of the at least one tabular region are defined using the identified horizontal and vertical lines. At step 206a3, coordinates for a plurality of cells comprised within the defined boundaries are determined based on intersection of the identified horizontal lines and the vertical lines. The final table is then generated, at step 206a4, corresponding to each of the at least one tabular region comprising the determined plurality of cells. At step 206a5, text elements to be placed within each of the determined plurality of cells are identified using the OCR engine and a heuristically determined third threshold. If a text length extends beyond the third threshold, then the text elements are considered to be part of a next row or a next column depending on associated coordinates. In an embodiment, error correction may be performed, at step 206a6, in the generated table based on a percentage of the text elements located beyond a cell in the plurality of cells by comparing the determined coordinates for each cell in the plurality of cells and coordinates of the identified text elements placed within each of the determined plurality of cells to obtain the final table.
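The line identification of step 206a1 may be sketched as follows (a minimal illustration on a binarized pixel matrix; the binary representation and minimum run length are assumptions, and horizontal lines follow from the same scan with X and Y roles swapped):

```python
def find_vertical_lines(binary, min_len=20):
    """Detect vertical lines in a binarized image (0 = background,
    1 = ink): a fixed X-coordinate with a sufficiently long consecutive
    run of ink pixels along Y defines a vertical line segment."""
    h, w = len(binary), len(binary[0])
    lines = []
    for x in range(w):
        run = start = 0
        for y in range(h):
            if binary[y][x]:
                if run == 0:
                    start = y            # run of same pixel values begins
                run += 1
            else:
                if run >= min_len:
                    lines.append((x, start, y - 1))
                run = 0
        if run >= min_len:               # run reaching the bottom edge
            lines.append((x, start, h - 1))
    return lines
```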
[058] Once the final table is generated using the rule in the database or the line-based approach or the white space approach, in accordance with an embodiment of the present disclosure, the generated final table may be presented to a user for receiving user inputs pertaining to modifications of the coordinates of the at least one tabular region. The one or more rules in the database are then updated based on the received user inputs. This is a one-time activity, if at all, for a new document that needs to be processed. Further, a rule that is representative of the coordinates of the generated final table, obtained in the event of no match of the extracted identifier with any of the plurality of identifiers in the database, is mapped to an associated entity in the database. Finally, tabular data is extracted from the generated final table corresponding to each of the at least one tabular region. FIG.6F illustrates extracted tabular data from the final table of FIG.6E, in accordance with some embodiments of the present disclosure.
[059] The method and system of the present disclosure thus facilitate extracting tabular data by primarily addressing the most challenging problem in Business Process Services (BPS) of identifying at least one tabular region within an image of a document such as an invoice, which may be of many different types as described above and illustrated at least partly in FIG.3, FIG.4, FIG.5A and FIG.5B. The method optimizes the search space firstly by eliminating the text elements that represent noise and secondly by extracting the identifier, which helps in using available information from the database effectively. The method and system of the present disclosure minimize data entry errors, enable utilizing memory effectively and provide a single pipeline for processing different types of incoming documents. Furthermore, along with efficient processing, false positives are also reduced due to exceptions that are effectively handled by the method and system of the present disclosure. False positives are likely when the received image has no table borders, when some columns are placed very near to each other, or when images of hand-written documents are received as inputs. The various thresholds described above serve to effectively address the exceptions arising out of such scenarios.
[060] In an experimental setup, the programming language used was Python™, and the technology used included the Tensorflow® machine learning platform, the Keras® Application Programming Interface (API) serving as a Python™ interface for Artificial Neural Networks (ANN), Computer Vision algorithms and Artificial Intelligence algorithms. The scripting languages included JQuery, JavaScript, Ajax, Hypertext Markup Language (HTML) and Cascading Style Sheets (CSS).
[061] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[062] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[063] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[064] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[065] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more hardware processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[066] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

We Claim:
1. A processor implemented method (200) comprising the steps of:
receiving, via one or more hardware processors, an image of a document corresponding to an entity (202);
auto-detecting, via the one or more hardware processors, at least one tabular region in the received image using one or more Computer Vision based methods (204); and
generating, via the one or more hardware processors, a final table corresponding to each of the at least one tabular region (206) using one of: a line-based approach if the auto detected at least one tabular region includes horizontal and vertical lines (206a); or
a white space approach if no horizontal and vertical lines are detected in the at least one tabular region (206b), wherein the white space approach comprises:
identifying text edges of text elements extracted using an Optical Character Recognition (OCR) engine in the at least one tabular region based on one or more predefined criteria (206b1);
determining coordinates of the at least one tabular region based on (i) the coordinates associated with a top left corner and a bottom right corner of the text elements comprised therein, (ii) associated text edges and (iii) a predefined buffer space to account for the text elements that extend beyond the at least one tabular region (206b2);
grouping the text elements within the determined coordinates of the at least one tabular region into candidate rows based on associated Y-coordinates including a predefined first threshold, thereby identifying a straight line representing an edge for each row (206b3);
grouping the text elements within the determined coordinates of the at least one tabular region into candidate columns based on the identified text edges and a predefined second threshold, thereby identifying a straight line representing an edge for each column (206b4);
merging the one or more candidate rows in the event that (i) the one or more candidate rows do not contain the text elements in a stub column and contain the text elements in a preceding row where the stub column is a non-empty cell and (ii) an intra-distance between the text elements in the one or more candidate rows is lesser than the inter-distance therebetween (206b5);
generating the final table corresponding to each of the at least one tabular region using the candidate rows and the candidate columns (206b6); and
localizing the text elements within a plurality of cells defined by the intersecting candidate columns and the candidate rows based on coordinates of the text elements and the coordinates of the plurality of cells and performing error correction in the generated final table based on a percentage of the text elements located beyond a cell in the plurality of cells (206b7).
2. The processor implemented method of claim 1, wherein the at least one tabular region in the received image is characterized by (i) a grid structure with horizontal and vertical lines, (ii) absence of the horizontal and vertical lines, (iii) the at least one tabular region extending beyond a page in the received image, (iv) the at least one tabular region limited to a page length, or a combination thereof.
3. The processor implemented method of claim 1, wherein the one or more predefined criteria is based on i) a plurality of rows having an edge type including a left edge, a right edge or a no edge, the no edge text being representative of a row with centered text having center points of text aligned along a vertical line and ii) a predefined minimum number of rows of text elements being aligned along a vertical line.
4. The processor implemented method of claim 1, wherein the step of identifying text edges of text elements is preceded by eliminating the text elements in the received image that are devoid of predefined table characteristics and represent noise, wherein the predefined table characteristics are related to alignment of the text elements and spaces therebetween.
5. The processor implemented method of claim 1, wherein (i) the predefined first threshold includes variation in the Y-coordinates on account of the text elements corresponding to subscripts, superscripts and offset text elements in the event that the received image is an electronic form of a hand written document; and (ii) the predefined second threshold accounts for inter-distance between X-coordinates of the text elements.
6. The processor implemented method of claim 1, wherein the step of generating the final table using the line-based approach or the white space approach is preceded by extracting, via the one or more hardware processors, an identifier associated with the entity.
7. The processor implemented method of claim 6, wherein the step of extracting the identifier, and wherein the identifier is a logo, comprises:
identifying a bounding box across a maximum pixel density location representative of the logo within the received image using Gaussian blur, wherein the maximum pixel density location is obtained using Principal Component Analysis (PCA);
obtaining approximate coordinates of the logo using a heuristic approach based on one or more statistical methods including mean, mode, variance and standard deviation on a pixel matrix of the received image; and

optimizing the obtained approximate coordinates of the logo to extract the logo having a maximum coordinate and a minimum coordinate associated with a left, right, top and bottom edge, the optimizing comprises reducing the maximum coordinate associated with each of the left, right, top and bottom edge iteratively by a unit value till an estimated heuristic value is reached to obtain optimum coordinates associated thereof, such that the optimum coordinates are greater than the minimum coordinate.
8. The processor implemented method of claim 6, wherein the step of extracting the identifier is followed by performing, via the one or more hardware processors, one of:
generating the final table corresponding to each of the at least one tabular region, in the event of a match between the extracted identifier and one of a plurality of identifiers in a database; or
generating the final table corresponding to each of the at least one tabular region, in the event of no match between the extracted identifier and one of the plurality of identifiers using one of the line-based approach or the white space approach; and
wherein the database comprises (i) the plurality of identifiers mapped to associated entities, (ii) the document associated with each of the entities, and (iii) one or more rules representative of the coordinates of the at least one tabular region for the document associated with each of the entities.
9. The processor implemented method of claim 8, wherein the step of generating a final table corresponding to each of the at least one tabular region, in the event of a match between the extracted identifier and one of a plurality of identifiers in the database comprises defining the final table corresponding to each of the at least one tabular region by selecting a rule from the one or more rules associated with a corresponding entity in the database, and wherein the selected rule provides coordinates of the at least one tabular region that maps to the auto-detected at least one tabular region.

10. The processor implemented method of claim 1, wherein the line-based approach comprises:
identifying the horizontal and the vertical lines in the at least one tabular region based on orientation derived using pixel values thereof (206a1);
defining boundaries of the at least one tabular region using the identified horizontal and vertical lines (206a2);
determining coordinates for a plurality of cells comprised within the defined boundaries based on intersection of the identified horizontal lines and the vertical lines (206a3);
generating the final table corresponding to each of the at least one tabular region comprising the determined plurality of cells (206a4);
identifying text elements to be placed within each of the determined plurality of cells using the OCR engine and a heuristically determined third threshold (206a5); and
performing error correction in the generated table based on a percentage of the text elements located beyond a cell in the plurality of cells by comparing the determined coordinates for each cell in the plurality of cells and coordinates of the identified text elements placed within each of the determined plurality of cells to obtain the final table (206a6).
11. The processor implemented method of claim 3, wherein the step of merging the one or more candidate rows is preceded by one or more of:
eliminating one or more rows having justified text elements identified based on number of identified text edges, associated edge type and number of the text elements contained in the one or more rows; and
eliminating one or more rows identified as i) super-header row, ii) sub-header row, and iii) caption row.
12. The processor implemented method of claim 8 further comprising one or more of:
receiving, via the one or more hardware processors, user inputs pertaining to modifications of the coordinates of the at least one tabular region;
updating, via the one or more hardware processors, the one or more rules in the database based on the received user inputs;
mapping, via the one or more hardware processors, a rule that is representative of the coordinates of the generated final table obtained in the event of no match of the extracted identifier with any of the plurality of identifiers in the database, to an associated entity in the database; and
extracting tabular data, via the one or more hardware processors, from the generated final table corresponding to each of the at least one tabular region.
13. A system (100) comprising:
one or more data storage devices (102) operatively coupled to one or more hardware processors (104) and configured to store instructions configured for execution via the one or more hardware processors to:
receive an image of a document corresponding to an entity;
auto-detect at least one tabular region in the received image using one or more Computer Vision based methods; and
generate a final table corresponding to each of the at least one tabular region using one of:
a line-based approach if the auto-detected at least one tabular region includes horizontal and vertical lines; or
a white space approach if no horizontal and vertical lines are detected in the at least one tabular region, wherein the white space approach comprises:
identifying text edges of text elements extracted using an Optical Character Recognition (OCR) engine in the at least one tabular region using one or more predefined criteria based on i) a plurality of rows having an edge type including a left edge, a right edge or a no edge, the no edge text being representative of a row with centered text having center points of text aligned along a vertical line and ii) a predefined minimum number of rows of text elements being aligned along a vertical line;
determining coordinates of the at least one tabular region based on (i) the coordinates associated with a top left corner and a bottom right corner of the text elements comprised therein, (ii) associated text edges and (iii) a predefined buffer space to account for the text elements that extend beyond the at least one tabular region;
grouping the text elements within the determined coordinates of the at least one tabular region into candidate rows based on associated Y-coordinates including a predefined first threshold, thereby identifying a straight line representing an edge for each row;
grouping the text elements within the determined coordinates of the at least one tabular region into candidate columns based on the identified text edges and a predefined second threshold, thereby identifying a straight line representing an edge for each column;
merging the one or more candidate rows in the event that (i) the one or more candidate rows do not contain the text elements in a stub column and contain the text elements in a preceding row where the stub column is a non-empty cell and (ii) an intra-distance between the text elements in the one or more candidate rows is less than the inter-distance therebetween;
generating the final table corresponding to each of the at least one tabular region using the candidate rows and the candidate columns; and
localizing the text elements within a plurality of cells defined by the intersecting candidate columns and the candidate rows based on coordinates of the text elements and the coordinates of the plurality of cells and performing error correction in the generated final table based on a percentage of the text elements located beyond a cell in the plurality of cells.
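The row and column grouping steps of the white space approach in claim 13 can be sketched as two greedy passes over the OCR boxes. A simplified sketch assuming word-level boxes stored as (x1, y1, x2, y2) under a 'bbox' key, with pixel thresholds standing in for the claimed first and second predefined thresholds; the claim additionally distinguishes left, right and centered ("no edge") alignment, which this sketch reduces to left edges only:

```python
def group_rows(elements, y_threshold=5):
    """Group OCR text elements into candidate rows by Y-coordinate.
    Elements whose vertical centers lie within `y_threshold` pixels of a
    row's anchor share that row (the threshold absorbs subscripts,
    superscripts and slightly offset handwritten text). Returns rows
    top-to-bottom, each row sorted left-to-right."""
    rows = []
    for el in sorted(elements, key=lambda e: (e['bbox'][1] + e['bbox'][3]) / 2):
        cy = (el['bbox'][1] + el['bbox'][3]) / 2
        if rows and abs(cy - rows[-1]['cy']) <= y_threshold:
            rows[-1]['els'].append(el)
        else:
            rows.append({'cy': cy, 'els': [el]})
    return [sorted(r['els'], key=lambda e: e['bbox'][0]) for r in rows]

def group_columns(elements, x_threshold=8):
    """Group elements into candidate columns by their left text edge,
    clustering X-coordinates within `x_threshold` pixels."""
    cols = []
    for el in sorted(elements, key=lambda e: e['bbox'][0]):
        x = el['bbox'][0]
        if cols and abs(x - cols[-1]['x']) <= x_threshold:
            cols[-1]['els'].append(el)
        else:
            cols.append({'x': x, 'els': [el]})
    return [c['els'] for c in cols]
```

Intersecting the resulting row and column bands yields the plurality of cells into which the text elements are localized.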
14. The system of claim 13, wherein the at least one tabular region in the received image is characterized by (i) a grid structure with horizontal and vertical lines, (ii) absence of the horizontal and vertical lines, (iii) the at least one tabular region extending beyond a page in the received image, (iv) the at least one tabular region limited to a page length, or a combination thereof.
15. The system of claim 13, wherein the one or more processors are further configured to eliminate the text elements in the received image that are devoid of predefined table characteristics and represent noise prior to the step of identifying text edges of text elements, wherein the predefined table characteristics are related to alignment of the text elements and spaces therebetween.
16. The system of claim 13, wherein (i) the predefined first threshold includes variation in the Y-coordinates on account of the text elements corresponding to subscripts, superscripts and offset text elements in the event that the received image is an electronic form of a handwritten document; and (ii) the predefined second threshold accounts for inter-distance between X-coordinates of the text elements.
17. The system of claim 13, wherein the one or more processors are further configured to extract an identifier associated with the entity prior to generating the final table using the line-based approach or the white space approach.
18. The system of claim 17, wherein the one or more processors are further
configured to extract the identifier, wherein the identifier is a logo by:
identifying a bounding box across a maximum pixel density location representative of the logo within the received image using Gaussian blur, wherein the maximum pixel density location is obtained using Principal Component Analysis (PCA);
obtaining approximate coordinates of the logo using a heuristic approach based on one or more statistical methods including mean, mode, variance and standard deviation on a pixel matrix of the received image; and
optimizing the obtained approximate coordinates of the logo to extract the logo having a maximum coordinate and a minimum coordinate associated with a left, right, top and bottom edge, wherein the optimizing comprises reducing the maximum coordinate associated with each of the left, right, top and bottom edge iteratively by a unit value until an estimated heuristic value is reached to obtain optimum coordinates associated therewith, such that the optimum coordinates are greater than the minimum coordinate.
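The iterative coordinate optimization in the last step can be shown in miniature: shrink each edge by a unit value while a heuristic stays above target. A toy sketch on a binarized ink map; the claim's Gaussian blur and PCA stages are omitted here, and the `keep` fraction of retained ink is an assumed proxy for the "estimated heuristic value":

```python
import numpy as np

def tighten_box(ink, box, keep=0.98):
    """Shrink a bounding box edge by edge, one pixel at a time, while the
    box still holds at least `keep` of the ink it started with.

    ink: 2-D 0/1 array (1 = dark pixel); box: (top, left, bottom, right)
    with exclusive bottom/right. Returns the tightened box."""
    t, l, b, r = box
    total = ink[t:b, l:r].sum()
    if total == 0:
        return box

    def frac(t, l, b, r):
        return ink[t:b, l:r].sum() / total

    changed = True
    while changed:
        changed = False
        # try shrinking top, left, bottom, right by one unit each
        for dt, dl, db, dr in ((1, 0, 0, 0), (0, 1, 0, 0),
                               (0, 0, -1, 0), (0, 0, 0, -1)):
            nt, nl, nb, nr = t + dt, l + dl, b + db, r + dr
            if nb - nt > 1 and nr - nl > 1 and frac(nt, nl, nb, nr) >= keep:
                t, l, b, r = nt, nl, nb, nr
                changed = True
    return (t, l, b, r)
```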
19. The system of claim 17, wherein the one or more processors are further
configured to perform one of:
generating the final table corresponding to each of the at least one tabular region, in the event of a match between the extracted identifier and one of a plurality of identifiers in a database comprised in the one or more data storage devices (102); or
generating the final table corresponding to each of the at least one tabular region, in the event of no match between the extracted identifier and any of the plurality of identifiers, using one of the line-based approach or the white space approach; and
wherein the database comprises (i) the plurality of identifiers mapped to associated entities, (ii) the document associated with each of the entities, and (iii) one or more rules representative of the coordinates of the at least one tabular region for the document associated with each of the entities.
20. The system of claim 19, wherein the one or more processors are further configured to generate the final table corresponding to each of the at least one tabular region, in the event of a match between the extracted identifier and one of a plurality of identifiers in the database by defining the final table corresponding to each of the at least one tabular region by selecting a rule from the one or more rules associated with a corresponding entity in the database, wherein the selected rule provides coordinates of the at least one tabular region that maps to the auto-detected at least one tabular region.
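The rule selection of claims 19 and 20 amounts to a dictionary lookup keyed on the extracted identifier, falling back to the auto-detected regions on no match. A hedged sketch; the database shape, the overlap score and all names are illustrative, not from the specification:

```python
def table_coords(identifier, rules_db, auto_boxes):
    """For a matched identifier, return the stored rule box that best
    overlaps each auto-detected tabular region; on no match, return the
    auto-detected boxes (the line-based / white space result) unchanged.

    rules_db: dict mapping identifier -> list of (x1, y1, x2, y2) rule boxes.
    auto_boxes: list of auto-detected (x1, y1, x2, y2) boxes."""
    rules = rules_db.get(identifier)
    if not rules:
        return list(auto_boxes)  # no match: keep the detected regions

    def overlap(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        w = min(ax2, bx2) - max(ax1, bx1)
        h = min(ay2, by2) - max(ay1, by1)
        return max(w, 0) * max(h, 0)

    return [max(rules, key=lambda rule: overlap(rule, box))
            for box in auto_boxes]
```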
21. The system of claim 13, wherein the one or more processors are further configured to perform the line-based approach by:
identifying the horizontal and the vertical lines in the at least one tabular region based on orientation derived using pixel values thereof;
defining boundaries of the at least one tabular region using the identified horizontal and vertical lines;
determining coordinates for a plurality of cells comprised within the defined boundaries based on intersection of the identified horizontal lines and the vertical lines;
generating the final table corresponding to each of the at least one tabular region comprising the determined plurality of cells;
identifying text elements to be placed within each of the determined plurality of cells using the OCR engine and a heuristically determined third threshold; and
performing error correction in the generated table based on a percentage of the text elements located beyond a cell in the plurality of cells by comparing the determined coordinates for each cell in the plurality of cells and coordinates of the identified text elements placed within each of the determined plurality of cells to obtain the final table.
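The line identification step of the line-based approach can be sketched with projection profiles: a ruled line appears as a row or column whose ink coverage spans most of the region. A simplified NumPy-only sketch; production implementations more commonly use morphological opening with long thin kernels (as in OpenCV), and the 0.8 coverage ratio is an assumed stand-in for a tuned threshold:

```python
import numpy as np

def find_lines(binary, min_run=0.8):
    """Identify horizontal and vertical ruled lines in a binarized image.

    binary: 2-D 0/1 array, 1 = ink. A row (column) is reported as a
    horizontal (vertical) line when its ink coverage reaches `min_run`
    of the image width (height). Returns (horizontal_rows, vertical_cols)."""
    h, w = binary.shape
    horiz = [y for y in range(h) if binary[y].sum() >= min_run * w]
    vert = [x for x in range(w) if binary[:, x].sum() >= min_run * h]
    return horiz, vert
```

The Cartesian product of the detected horizontal and vertical lines then gives the cell boundaries whose intersections define the plurality of cells.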
22. The system of claim 13, wherein the one or more processors are further
configured to perform one or more of the following steps prior to the step
of merging the one or more candidate rows:
eliminating one or more rows having justified text elements identified based on the number of identified text edges, the associated edge type and the number of the text elements contained in the one or more rows; and
eliminating one or more rows identified as i) super-header row, ii) sub-header row, and iii) caption row.
23. The system of claim 19, wherein the one or more processors are further
configured to perform one or more of:
receiving user inputs pertaining to modifications of the coordinates of the at least one tabular region;
updating the one or more rules in the database based on the received user inputs;
mapping a rule that is representative of the coordinates of the generated final table obtained in the event of no match, to an associated entity in the database; and
extracting tabular data from the generated final table corresponding to each of the at least one tabular region.
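The feedback loop of claims 12 and 23, in which user corrections flow back into the rule database, reduces to a small update routine. A sketch with an in-memory dict standing in for the database; all names are illustrative:

```python
def apply_user_correction(rules_db, entity, corrected_box):
    """Store a user-corrected tabular-region box as a rule for `entity`,
    so later documents from the same entity reuse the corrected
    coordinates instead of re-running detection from scratch.

    rules_db: dict mapping entity -> list of (x1, y1, x2, y2) rule boxes."""
    rules_db.setdefault(entity, [])
    if corrected_box not in rules_db[entity]:  # avoid duplicate rules
        rules_db[entity].append(corrected_box)
    return rules_db
```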

Documents

Application Documents

# Name Date
1 202121005590-STATEMENT OF UNDERTAKING (FORM 3) [09-02-2021(online)].pdf 2021-02-09
2 202121005590-REQUEST FOR EXAMINATION (FORM-18) [09-02-2021(online)].pdf 2021-02-09
3 202121005590-FORM 18 [09-02-2021(online)].pdf 2021-02-09
4 202121005590-FORM 1 [09-02-2021(online)].pdf 2021-02-09
5 202121005590-FIGURE OF ABSTRACT [09-02-2021(online)].jpg 2021-02-09
6 202121005590-DRAWINGS [09-02-2021(online)].pdf 2021-02-09
7 202121005590-DECLARATION OF INVENTORSHIP (FORM 5) [09-02-2021(online)].pdf 2021-02-09
8 202121005590-COMPLETE SPECIFICATION [09-02-2021(online)].pdf 2021-02-09
9 202121005590-Proof of Right [15-02-2021(online)].pdf 2021-02-15
10 202121005590-Proof of Right [02-08-2021(online)].pdf 2021-08-02
11 Abstract1.jpg 2021-10-19
12 202121005590-FORM-26 [22-10-2021(online)].pdf 2021-10-22
13 202121005590-FER.pdf 2022-09-12
14 202121005590-OTHERS [22-11-2022(online)].pdf 2022-11-22
15 202121005590-FER_SER_REPLY [22-11-2022(online)].pdf 2022-11-22
16 202121005590-CLAIMS [22-11-2022(online)].pdf 2022-11-22
17 202121005590-PatentCertificate07-11-2024.pdf 2024-11-07
18 202121005590-IntimationOfGrant07-11-2024.pdf 2024-11-07

Search Strategy

1 search202121005590E_09-09-2022.pdf

ERegister / Renewals

3rd: 12 Nov 2024 (09/02/2023 to 09/02/2024)
4th: 12 Nov 2024 (09/02/2024 to 09/02/2025)
5th: 06 Jan 2025 (09/02/2025 to 09/02/2026)