Abstract: The disclosed system (210) and method (500) facilitate mapping of a predetermined header label to at least one column data of a table present in a document. The method (500) detects (502) a table area of the table. The table area corresponds to at least one of the column data of the table. The method extracts (504) content from the detected table area while ensuring accurate isolation of content of the detected table area from other text of the document. Further, the method maps (506) the extracted at least one column data to the predetermined header label. The mapping is performed using a trained Natural Language Processing (NLP) model. FIG. 4
Description: SYSTEM AND METHOD FOR MAPPING HEADER LABEL WITH COLUMN DATA
FIELD OF INVENTION
[0001] The embodiments of the present disclosure generally relate to a field of recognizing data present in a table area in an unstructured document, and specifically to a system and a method for mapping a predetermined header label to a column data of a table.
BACKGROUND OF THE INVENTION
[0002] The following description of the related art is intended to provide background information pertaining to the field of the disclosure. This section may include certain aspects of the art that may be related to various features of the present disclosure. However, it should be appreciated that this section is used only to enhance the understanding of the reader with respect to the present disclosure, and not as admissions of the prior art.
[0003] Typically, header fields of an invoice are extracted using a generic rule-based approach. However, the main challenge occurs in a line item section, where tabular data varies significantly in structure and format. Invoices often contain a varying number of columns, arranged in different orders to meet specific business requirements. This variation in layouts and tabular structures makes standardization of the invoice difficult.
[0004] Moreover, key columns to be extracted are not consistently present across the invoices, requiring users to manually map relevant columns to expected line item fields. While this manual mapping ensures that correct values are captured, it necessitates human intervention for accurate alignment of the columns.
[0005] Further, column labels in the invoice differ across multiple vendors. Usually, some of the vendors have distinct column labels for certain data items, while other vendors may combine these types of data items into a single column.
[0006] In addition, various generic solutions are available in the market such as template-based approaches which are tailored for multiple vendors along with multiple Machine Learning (ML) based generic solutions that work across multiple vendors. However, these solutions are restricted to only table detection and structure recognition.
[0007] There is, therefore, a need in the art to provide an improved system and a method to enable mapping of a predetermined header label with a column data of the table.
OBJECTIVE OF THE INVENTION
[0008] Some of the objects of the present disclosure, which at least one embodiment herein satisfies are listed herein below.
[0009] It is an object of the present disclosure to provide a system and a method to enable mapping of a predetermined header label with a column data of the table.
[0010] It is an object of the present disclosure to provide a system and a method for automating an extraction process thereby significantly accelerating invoice processing as compared to manual entry.
[0011] It is an object of the present disclosure to provide a system and a method to apply consistent rules and logic for uniform data extraction thereby reducing variability.
[0012] It is an object of the present disclosure to provide a system and a method that adapts to different invoice formats and layouts and enables using the invoices for business of a wide range of suppliers.
[0013] It is an object of the present disclosure to provide a system and a method that improves accuracy of both line-item data and header fields of the invoice which are captured using rule-based methods.
SUMMARY OF THE INVENTION
[0014] This section is provided to introduce certain objects and aspects of the present disclosure in a simplified form that are further described below in the detailed description. This summary is not intended to identify the key features or the scope of the claimed subject matter.
[0015] In an aspect, the present disclosure relates to a method for mapping a predetermined header label to a column data of a table present in a document. The method may include detecting a table area of the table while ensuring accurate isolation of content of the detected table area from other text of the document. The method may extract content from the detected table area. The table area may correspond to at least one of the column data of the table. Further, the method may map the extracted at least one column data to the predetermined header label. The mapping is performed using a trained NLP model.
[0016] In an embodiment, the content from the detected table area may be extracted by processing an image of the detected table area or an image layer of the image.
[0017] In an embodiment, the image of the detected table area or the image layer of the image may be extracted using an optical character recognition (OCR) engine.
[0018] In an embodiment, the processing of the image may be performed using an image preprocessing mechanism.
[0019] In an embodiment, the image preprocessing mechanism may use at least one of a skew correction mechanism and a noise removal mechanism.
[0020] In an embodiment, the content extracted from the detected table area may be at least in a text form or an alphanumeric form.
[0021] In an embodiment, the NLP model may predict OCR data for at least one of the column data of the table.
[0022] In an embodiment, the table area of the table may be detected using at least one of a grid detection mechanism and by analyzing each of a predefined header label and a footer label of the table.
[0023] In an embodiment, the predetermined header label may be determined from a pre-trained data set.
[0024] In an embodiment, a Support Vector Machine (SVM) model may be trained using the pre-trained data set to evaluate one or more data patterns of the at least one of the column data of the table.
[0025] In an embodiment, a confidence score may be determined between the extracted column data of the table and the predetermined header label. Upon the confidence score meeting a confidence threshold, the extracted content is assigned under a category of the predetermined header label.
[0026] In an aspect, the present disclosure relates to a system for mapping a predetermined header label with a respective column label of a table. The system may include one or more processors associated with a computing device, and a memory operatively coupled to the one or more processors, wherein the memory comprises processor-executable instructions which, on execution, cause the one or more processors to perform the mapping. The one or more processors may be configured to detect a table area of the table while ensuring accurate isolation of content of the detected table area from other text of the document. The one or more processors may extract content from the detected table area. The table area may correspond to at least one of the column data of the table. Further, the one or more processors may map the extracted at least one column data to the predetermined header label. The mapping is performed using a trained NLP model.
[0027] In an aspect, the present disclosure relates to a non-transitory computer-readable medium comprising processor-executable instructions that may cause a processor to map a predetermined header label with a respective column label of a table. The processor may be configured to detect a table area of the table while ensuring accurate isolation of content of the detected table area from other text of the document. Further, the processor may extract content from the detected table area. The table area may correspond to at least one of the column data of the table. Furthermore, the processor may map the extracted at least one column data to the predetermined header label. The mapping is performed using a trained NLP model.
BRIEF DESCRIPTION OF DRAWINGS
[0028] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes the disclosure of electrical components, electronic components, or circuitry commonly used to implement such components.
[0029] FIG. 1A illustrates an exemplary representation 100A, where a column label “Item Number” represents a “Description” column in a first table and “Item number” column in a second table.
[0030] FIG. 1B illustrates an exemplary representation 100B, where a column label “Part Number” represents a “Description” column in a first table and “Item number” column in a second table.
[0031] FIG. 1C illustrates an exemplary representation 100C, where a column label “Item number” represents an “Item no.” column in a first table and a “Description” column in a second table.
[0032] FIG. 2 illustrates an exemplary block diagram representation of a network architecture implementing a proposed system for mapping a predetermined header label with a respective column data of a table, in accordance with an embodiment of the present disclosure.
[0033] FIG. 3 illustrates exemplary functional units of the proposed system, in accordance with an embodiment of the present disclosure.
[0034] FIG. 4 illustrates an exemplary representation of classification of test data with classified data having a particular classification type and confidence, in accordance with an embodiment of the present disclosure.
[0035] FIG. 5 is a flow diagram depicting a proposed method for mapping a predetermined header label with a respective column data of a table, in accordance with an embodiment of the present disclosure.
[0036] FIG. 6 illustrates an exemplary computer system in which or with which embodiments of the present disclosure may be utilized, in accordance with embodiments of the present disclosure.
[0037] The foregoing shall be more apparent from the following more detailed description of the disclosure.
DETAILED DESCRIPTION OF INVENTION
[0038] In the following description, for the purposes of explanation, various specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, that embodiments of the present disclosure may be practiced without these specific details. Several features described hereafter can each be used independently of one another or with any combination of other features. An individual feature may not address all of the problems discussed above or might address only some of the problems discussed above. Some of the problems discussed above might not be fully addressed by any of the features described herein.
[0039] The ensuing description provides exemplary embodiments only and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the disclosure as set forth.
[0040] Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail to avoid obscuring the embodiments.
[0041] Also, it is noted that individual embodiments may be described as a process that is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.
[0042] The word “exemplary” and/or “demonstrative” is used herein to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited by such examples. In addition, any aspect or design described herein as “exemplary” and/or “demonstrative” is not necessarily to be construed as preferred or advantageous over other aspects or designs, nor is it meant to preclude equivalent exemplary structures and techniques known to those of ordinary skill in the art. Furthermore, to the extent that the terms “includes,” “has,” “contains,” and other similar words are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising” as an open transition word without precluding any additional or other elements.
[0043] Reference throughout this specification to “one embodiment” or “an embodiment” or “an instance” or “one instance” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
[0044] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0045] Tabular data extraction has been a long-standing challenge. On top of this, invoices are not originally designed to be processed automatically, often resulting in poor automation performance, making commercialization difficult. A large number of outliers (e.g., vendor-specific cases) and diverse, unpredictable representations of the tabular data have prevented any single technology from addressing all scenarios effectively. The primary objective of this disclosure is to address the problem of inaccurate determination and mapping of relevant column header fields (column labels) with respect to predetermined header labels. This problem occurs due to diversity in layouts and non-standardization of tabular structures. The disclosure simplifies the process of mapping the predetermined header label with a respective column data of the table, thereby enhancing the functionality and usage of the information that is available inside individual tables of the invoice.
[0046] Typically, column labels in the invoice differ across multiple vendors. Usually, some of the vendors have distinct column labels for data items such as Item No. and Description, while other vendors may combine both of these types of data items into a single column.
[0047] Some vendors may use clear and appropriate labels for the data items Item No. and Description, such as:
• Item No. column labels: “Item Code”, “Item Number”, “Part Number”
• Description column labels: “Item Description”, “Description”
In contrast, other vendors may use column labels that are not distinctive, thus making it difficult to accurately identify the appropriate column. Such inappropriate column labels include, for example:
• Item No. column labels: “Item”, “Item No.”, “Item#”, “Part Number”
• Description column labels: “Part Number”, “Item No/Description”, “Part #/Description”
[0048] Duplication of the column labels across multiple columns by the different vendors makes it difficult to correctly map a header label with appropriate column label data.
[0049] FIG. 1A illustrates an exemplary representation 100A, where a column label “Item Number” represents a “Description” column in a first table and “Item number” column in a second table. Automating mapping of the column label “Item Number” using the header label is a challenge in this representation.
[0050] FIG. 1B illustrates an exemplary representation 100B, where a column label “Part Number” represents a “Description” column in a first table and “Item number” column in a second table. Due to different representations of the same column label in different tables, automating mapping of the column label using the header label is a problem in this scenario.
[0051] FIG. 1C illustrates an exemplary representation 100C, where a column label “Item number” represents an “Item no.” column in a first table and the same is used to represent a “Description” column in a second table. In this representation, the second table has a wrong column name for the column label “Item number”.
[0052] Various embodiments of the present disclosure will be explained in detail with reference to FIGs. 2-6.
[0053] FIG. 2 illustrates an exemplary block diagram representation of a network architecture 200 implementing a proposed system 210 for mapping a predetermined header label with a respective column label of a table, according to embodiments of the present disclosure. The network architecture 200 may include the system 210, a computing device 208, a centralized server 218, and a decentralized database 220. The system 210 may be communicatively connected to the centralized server 218, and the decentralized database (or node(s)) 220, via a communication network 206. The centralized server 218 may include, but is not limited to, a stand-alone server, a remote server, a cloud computing server, a dedicated server, a rack server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, one or more processors executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, some combination thereof, and the like. The communication network 206 may be a wired communication network or a wireless communication network. The wireless communication network may be any wireless communication network capable of transferring data between entities of that network such as, but not limited to, a carrier network including a circuit-switched network, a public switched network, a Content Delivery Network (CDN) network, a Long-Term Evolution (LTE) network, a New Radio (NR) network, a Global System for Mobile Communications (GSM) network, and a Universal Mobile Telecommunications System (UMTS) network, the Internet, intranets, Local Area Networks (LANs), Wide Area Networks (WANs), mobile communication networks, combinations thereof, and the like.
[0054] The system 210 may be implemented by way of a single device or a combination of multiple devices that may be operatively connected or networked together. For example, the system 210 may be implemented by way of a standalone device such as the centralized server 218 (and/or a decentralized server or node(s)), and the like, and may be communicatively coupled to the computing device 208. In another example, the system 210 may be implemented in/associated with the computing device 208. In yet another example, the system 210 may be implemented in/associated with respective electronic devices 204-1, 204-2, ..., 204-N (individually referred to as electronic device 204, and collectively referred to as electronic devices 204), associated with one or more users 202-1, 202-2, ..., 202-N (individually referred to as the user 202, and collectively referred to as the users 202). In such a scenario, the system 210 may be replicated in each of the electronic devices 204. The user 202 may be, but is not limited to, a purchaser, a lender, a realtor, a wholesaler, and the like.
[0055] In some instances, the user 202 may include an entity or an administrator, a user making transactions on a healthcare platform or a super-mart platform through the electronic device 204 to generate and/or receive the invoice, and the like. The computing device 208 may be at least one of an electrical, an electronic, and an electromechanical device. The computing device 208 may include, but is not limited to, a mobile device, a smartphone, a Personal Digital Assistant (PDA), a tablet computer, a phablet computer, a wearable device, a Virtual Reality/Augmented Reality (VR/AR) device, a laptop, a desktop, a server, and the like. The system 210 may be implemented in hardware or a suitable combination of hardware and software. The system 210 or the centralized server 218 or the decentralized database 220 may be associated with entities (not shown). The entities may include, but are not limited to, an electronic commerce (e-commerce) platform, a healthcare platform, a lending platform, an insurance platform, a hyperlocal platform, a super-mart platform, a media platform, a service providing platform, a social networking platform, a messaging platform, a bot processing platform, an Artificial Intelligence (AI) based platform, and the like.
[0056] Further, the system 210 may include a processor 212, an Input/Output (I/O) interface 214, and a memory 216. The Input/Output (I/O) interface 214 of the system 210 may be used to receive user inputs from one or more electronic devices 204 associated with the one or more users 202. The processor 212 may be associated with the computing device 208. The processor 212 may be coupled with the memory 216. The memory 216 may store one or more instructions that are executable by the processor 212 to map the predetermined header label with the respective column label of the table.
[0057] In an embodiment, the system 210 may detect a table area of the table. When the table area is detected, the other contents of a document not present within the table area are not considered. The document may include the invoice, but may equally represent any other document of a similar type and form. The table may represent a structured set of data that is present on the invoice. Further, relevant content from the detected table area may be extracted. It is to be noted that the invoice may be available in various types of structured and unstructured documents such as images and PDFs. Further, contents of the invoice may possess structured layouts where data (e.g., line items) may be represented as tables that present data in a structurally defined manner, such as in a multi-dimensional format, and represent data in a more condensed form.
[0058] The extraction may be done while ensuring accurate isolation of content of the detected table area from other text of the invoice. As may be appreciated, the content from the detected table area may be extracted by processing an image of the detected table area or an image layer of the image. The image may be from image documents such as the invoices. Invoices are a kind of document that typically contains multiple tables, different layouts, cell types, different table elements, and the like.
[0059] In an embodiment, the image of the detected table area or the image layer of the image may be extracted using an optical character recognition (OCR) engine. The OCR engine may begin by cleaning the image of the detected table area and correcting errors to optimize the image for reading. Some of the cleaning techniques include, for example: deskewing, which involves slightly adjusting the tilt of a scanned document to correct alignment issues; despeckling, where digital image spots are removed and the edges of text images are smoothed; removal of boxes and lines from the image; and script recognition, which enables multi-language OCR processing. As may be appreciated, multiple types of OCR engines may be used, for example, a simple optical character recognition engine, an intelligent character recognition engine, and an intelligent word recognition engine.
[0060] The processing of the image may be performed using an image preprocessing mechanism that uses at least one of a skew correction mechanism and a noise removal mechanism.
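By way of a non-limiting illustration, the skew correction, noise removal, and OCR flow described above may be sketched as follows. This is a minimal sketch assuming OpenCV (cv2), NumPy, and pytesseract are available; the function name preprocess_and_ocr, the deskewing recipe, and the denoising strength are illustrative assumptions and not mandated by the present disclosure.

```python
# Illustrative sketch only: deskew, denoise, and OCR a cropped table-area image.
# Assumes OpenCV (cv2), NumPy, and pytesseract; real engines/parameters may differ.
import cv2
import numpy as np
import pytesseract

def preprocess_and_ocr(table_area_image_path: str) -> str:
    image = cv2.imread(table_area_image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Skew correction: estimate the dominant text angle and rotate the image back.
    # Note: the angle convention of minAreaRect differs across OpenCV versions,
    # so this classic recipe is only an approximation.
    binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = gray.shape
    rotation = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    deskewed = cv2.warpAffine(gray, rotation, (w, h),
                              flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    # Noise removal: smooth speckles before handing the image to the OCR engine.
    denoised = cv2.fastNlMeansDenoising(deskewed, h=30)

    # OCR the cleaned table-area image into text/alphanumeric content.
    return pytesseract.image_to_string(denoised)
```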
[0061] The content extracted from the detected table area may be at least in a text form or an alphanumeric form. Further, the table area of the table may be detected using at least one of a grid detection mechanism and by analyzing each of a predefined header label and a footer label of the table.
[0062] It may be noted that the predetermined header label may be determined from a pre-trained data set. Further, a Support Vector Machine (SVM) model may be trained using the pre-trained data set to evaluate one or more data patterns of the column label of the table.
[0063] In an embodiment, the extracted content may be mapped to the predetermined header label. The mapping may be performed using a trained Natural Language Processing (NLP) model that may predict OCR data for the respective column data of the table. In some embodiments, the trained NLP model (e.g., BERT or Word2Vec) or hashed vectors may be used to predict the OCR data. For example, such models may encode each word in a row and the header label into separate feature vectors. A distance between these two feature vectors may reflect semantic similarity. The higher the semantic similarity, the higher the accuracy of predicting the correct OCR data.
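A minimal sketch of this vector-distance idea is given below, using the hashed vectors mentioned above (scikit-learn's HashingVectorizer) together with cosine similarity; a trained semantic encoder such as BERT or Word2Vec could be substituted for the vectorizer. The helper name header_similarity and the sample cell values are hypothetical.

```python
# Illustrative sketch: encode extracted column cells and a candidate header label as
# hashed feature vectors and use cosine similarity as a crude stand-in for the
# semantic distance a trained NLP encoder would provide. Assumes scikit-learn.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4), n_features=2 ** 12)

def header_similarity(column_cells, header_label):
    # Join the cells of one column into a single pseudo-document and vectorize it.
    column_vector = vectorizer.transform([" ".join(column_cells)])
    header_vector = vectorizer.transform([header_label])
    return float(cosine_similarity(column_vector, header_vector)[0, 0])

# Hypothetical usage: pick the predetermined header label closest to the column data.
cells = ["Item Description", "Stainless steel hex bolt", "Copper wire coil"]
labels = ["Description", "Item Number"]
best_label = max(labels, key=lambda label: header_similarity(cells, label))
```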
[0064] In another embodiment, a confidence score may be determined between the extracted content and the predetermined header label. Upon the determined confidence score meeting a confidence threshold, the extracted content may be assigned under a category of the predetermined header label. By way of an example, determining the confidence score may include determining which words have the same or similar meaning (e.g., are within a threshold distance when the words are represented as vectors), even if they are syntactically different.
[0065] In some implementations, the system 210 may include data and modules. As an example, the data may be stored in the memory 216 configured in the system 210. In an embodiment, the data may be stored in the memory in the form of various data structures. Additionally, the data may be organized using data models, such as relational or hierarchical data models.
[0066] In an embodiment, the data stored in the memory 216 may be processed by the modules of the system 210. The modules may be stored within the memory. In an example, the modules, communicatively coupled to the processor configured in the system, may also be present outside the memory and implemented as hardware. As used herein, the term module refers to an Application-Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and the memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
[0067] Further, the system 210 may also include other units such as a display unit, an input unit, an output unit, and the like, however the same are not shown in FIG. 2, for the purpose of clarity. Also, in FIG. 2 only a few units are shown, however, the system 210 or the network architecture 200 may include multiple such units or the system 210/network architecture 200 may include any such numbers of the units, obvious to a person skilled in the art or as required to implement the features of the present disclosure. The system 210 may be a hardware device including the processor 212 executing machine-readable program instructions to map the predetermined header label with the respective column data of the table.
[0068] Execution of the machine-readable program instructions by the processor 212 may enable the system 210 to map the predetermined header label with the respective column data of the table. The “hardware” may comprise a combination of discrete components, an integrated circuit, an application-specific integrated circuit, a field-programmable gate array, a digital signal processor, or other suitable hardware. The “software” may comprise one or more objects, agents, threads, lines of code, subroutines, separate software applications, two or more lines of code, or other suitable software structures operating in one or more software applications or on one or more processors. The processor 212 may include, for example, but is not limited to, microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and any devices that manipulate data or signals based on operational instructions, and the like. Among other capabilities, the processor 212 may fetch and execute computer-readable instructions in the memory 216 operationally coupled with the system 210 for performing tasks such as data processing, input/output processing, and/or any other functions. Any reference to a task in the present disclosure may refer to an operation being or that may be performed on data.
[0069] FIG. 3 illustrates, at 300, exemplary functional units of the proposed system 210, in accordance with an exemplary embodiment of the present disclosure. The system 210 may include the one or more processor(s) 302 (represented as processor 212 in FIG. 2). The one or more processor(s) 302 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that manipulate data based on operational instructions. Among other capabilities, the one or more processor(s) 302 are configured to fetch and execute computer-readable instructions stored in a memory 304. The memory 304 may store one or more computer-readable instructions or routines, which may be fetched and executed to create or share the data units over a network service. The memory 304 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0070] In an embodiment, the system 210 may also include an interface(s) 306. The interface(s) 306 may include a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 306 may facilitate communication with various other devices coupled to the one or more processor(s) 302. The interface(s) 306 may also provide a communication pathway for one or more components of the one or more processor(s) 302. Examples of such components include, but are not limited to, processing engine(s) 308 and database 310.
[0071] In an embodiment, the processing engine(s) 308 may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) 308. In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) 308 may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) 308 may include a processing resource (for example, one or more processors), to execute such instructions.
[0072] In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) 308. In such examples, the processor(s) 302 may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system 210 and the processing resource. In other examples, the processing engine(s) 308 may be implemented by electronic circuitry. The database 310 may include data that is either stored or generated as a result of functionalities implemented by any of the components of the processing engine(s) 308. In an embodiment, the processing engine(s) 308 may include a detection unit 312, an extraction unit 314, a mapping unit 316, and other unit(s) 318. The other unit(s) 318 may implement functionalities that supplement applications/functions performed by the system 210. In another embodiment, the system 210, through the other unit(s) 318, may facilitate mapping a predetermined header label to a column data of a table present in the document.
[0073] The detection unit 312 may identify a table area of the table. This detection may be performed using a grid detection mechanism, or by analyzing the table’s header and footer labels, or by a combination of both methods. During identification of the table area of the table, other text of the document is ignored.
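A minimal sketch of one possible grid detection mechanism is given below, assuming OpenCV and NumPy; it locates long horizontal and vertical ruling strokes with morphological opening and treats their bounding box as the table area. The function name detect_table_area and the kernel sizes are illustrative assumptions, and the header/footer label analysis mentioned above is not shown.

```python
# Illustrative grid detection sketch: find ruled table lines and crop their bounding
# box so that other text of the document is ignored. Assumes OpenCV (cv2) and NumPy.
import cv2
import numpy as np

def detect_table_area(page_image: np.ndarray):
    gray = cv2.cvtColor(page_image, cv2.COLOR_BGR2GRAY)
    binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Extract long horizontal and vertical strokes (the table grid) with morphology.
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                                  cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                                cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))
    grid = cv2.add(horizontal, vertical)

    # The bounding box of the detected grid isolates the table from surrounding text.
    ys, xs = np.where(grid > 0)
    if len(xs) == 0:
        return None  # no ruled table found; fall back to header/footer label analysis
    x1, x2, y1, y2 = xs.min(), xs.max(), ys.min(), ys.max()
    return page_image[y1:y2 + 1, x1:x2 + 1]
```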
[0074] The extraction unit 314 may be responsible for extracting content from the identified table area while ensuring the extracted content is accurately separated from other text within the document. The extracted content corresponds to the column data of the table. This content may be extracted by processing either an image of the detected table area or a specific image layer. Further, the extraction process may utilize the OCR engine along with image preprocessing techniques that include, for example, a skew correction technique and a noise removal technique. The extracted content may be provided in either the text or the alphanumeric form.
[0075] The mapping unit 316 may be responsible for aligning the extracted content with the predefined header labels. This mapping may be performed using the trained NLP model, which predicts the OCR data corresponding to each column label in the table.
[0076] The trained NLP model is responsible for identifying semantic similarities between different column headers, even when their syntax differs. Syntax refers to the structure of character sequences (such as sentence structure), while semantics refers to the meaning. For example, “index” and “sensex” are syntactically similar because they share a similar structure, but they differ semantically. In contrast, “index” and “index_no.” are semantically similar because they share the same meaning, even though they are syntactically different.
[0077] In some cases, the trained NLP model processes the semantic and syntactic content of semi-structured or unstructured data, such as images, rather than structured data like databases. The NLP model is designed to parse content and determine both semantic meaning (e.g., understanding words by analyzing them in context with others) and syntactic structure (e.g., the rules governing sentence formation in a given language). The NLP model may recognize keywords, contextual details, and metadata tags associated with different parts of a dataset.
[0078] The NLP model may also analyze summary information, keywords, and text descriptions, using syntactic and semantic elements to identify which words in a table correspond to specific categorical identifiers. These elements may include factors like word frequency, meanings, font styles, italics, hyperlinks, proper names, noun phrases, and parts of speech (e.g., nouns, adverbs, adjectives), as well as context surrounding each word.
[0079] In some implementations, the NLP model may incorporate Named Entity Recognition (NER), an information extraction technique that identifies and categorizes entities in natural language text. These predefined categories may include, for example, names, organizations, locations, dates, quantities, prices, and percentages. For example, in the context of invoice processing, these tags or labels may indicate whether the extracted content from the table corresponds to fields like “Item No.” or a specific description of an item on the invoice.
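The snippet below sketches how an off-the-shelf NER pipeline could tag entities in invoice-like text; it assumes spaCy with its small English model (en_core_web_sm) is installed, and the sample sentence and printed categories are illustrative rather than part of the disclosed system.

```python
# Illustrative NER sketch, assuming spaCy and its small English model are installed
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Invoice from Acme Corp dated 12 March 2024 for $1,250.00, shipped to Berlin.")

# Each detected entity carries a predefined category such as ORG, DATE, MONEY, or GPE,
# which downstream logic could map onto invoice fields.
for entity in doc.ents:
    print(entity.text, entity.label_)
```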
[0080] In an embodiment, for training of the SVM model, a large set of diverse data samples may be used to form a dataset that accurately represents all expected scenarios of representation of the data corresponding to the column label. Further, the collected dataset may be used to train the SVM model, which is a machine learning model designed to learn patterns available in the column labels, for example, “item number” and “description”. The SVM model is a supervised learning model that is often applied to classification and regression tasks and functions by identifying an optimal hyperplane that separates different test data classes.
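The following sketch, assuming scikit-learn, illustrates this kind of SVM training on hypothetical samples of “Item” and “Description” column data; the sample strings, feature choice, and pipeline are assumptions made purely for illustration.

```python
# Illustrative SVM training sketch (assumes scikit-learn). The tiny hand-written
# dataset below is a hypothetical stand-in for the large collected dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_texts = [
    "A-1023", "PRT-88-441", "99-XZ-100", "ITM-0077", "K-5520",      # "Item"-style values
    "Stainless steel hex bolt, M8 x 40 mm",                          # "Description"-style values
    "Copper wire, 2.5 sq mm, 100 m coil",
    "Industrial adhesive, 500 ml tube",
    "Rubber gasket, 60 mm outer diameter",
    "Packing carton, double wall, 18 x 12 x 10 in",
]
train_labels = ["Item"] * 5 + ["Description"] * 5

# Character n-grams capture the pattern difference between short codes and free text;
# probability=True lets the model report a confidence score at prediction time.
svm_model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    SVC(kernel="linear", probability=True),
)
svm_model.fit(train_texts, train_labels)

# Predict an entity and a 0-100 confidence score for an unseen column value.
probabilities = svm_model.predict_proba(["Galvanized steel washer, 10 mm"])[0]
print(dict(zip(svm_model.classes_, (probabilities * 100).round(1))))
```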
[0081] FIG. 4 illustrates an exemplary representation 400 of classification of the test data with classified data having a particular classification type and confidence, in accordance with an embodiment of the present disclosure. The dataset that is prepared by collecting the large volume of data samples represents all types of expected labels, for example column labels. With respect to FIG. 4, at step 402, the test data may be fed for classification to the system 210. The test data may be classified using the trained SVM model, at step 404. As may be appreciated, the SVM model may be trained using the pre-trained dataset to evaluate one or more data patterns of the column data of the table.
[0082] Further, at step 406, predicted results from the trained SVM model may contain a classification type of the dataset along with a confidence score. The classification type may be used to classify the dataset under a particular category. In addition, the confidence score may be determined between the extracted content of the table and the predetermined header label. When the confidence score meets a confidence threshold, the extracted content from the table may be assigned to a category available under one of the multiple predetermined header labels.
[0083] By way of an example, contents (column data) of the column labels (also referred to as an entity) “Item” and “Description” are collected from a large sample dataset and are used to train the SVM model. By analyzing various patterns of the contents that are present within these two columns, the SVM model may learn to distinguish between them and be effectively trained on the dataset.
Table 1 : The SVM model training using column data represented as sample datasets of column labels “description” and “item”.
[0084] With respect to Table 1, a test dataset that needs to be classified is provided to the trained SVM model for prediction. The SVM model may return two items, as below:
{
“Entity” : “Item” or “Description”,
“Confidence Score” : 0 to 100
}
[0085] Based on a value of the entity (also referred to as the column data) and the confidence score, the test data may be classified and categorized as a particular entity, i.e., “Item” or “Description”. As may be appreciated, the higher the score, the higher the possibility of categorizing the test data under a particular category of the entity, i.e., assigning the column data under the predetermined header label.
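Purely for illustration, the thresholding described above could look like the fragment below; the threshold value of 80 and the predicted result are hypothetical placeholders, not values fixed by the present disclosure.

```python
# Illustrative thresholding of a predicted {Entity, Confidence Score} result.
# The values below are hypothetical; a real result would come from the trained SVM model.
CONFIDENCE_THRESHOLD = 80  # assumed threshold, not specified by the disclosure

predicted = {"Entity": "Description", "Confidence Score": 91}

if predicted["Confidence Score"] >= CONFIDENCE_THRESHOLD:
    assigned_header_label = predicted["Entity"]   # column data categorized under this label
else:
    assigned_header_label = None                  # left unassigned / flagged for manual review

print(assigned_header_label)
```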
[0086] FIG. 5 is a flow diagram depicting a proposed method for mapping a predetermined header label with a respective column label of a table, in accordance with an embodiment of the present disclosure.
[0087] At step 502, the method includes, detecting a table area of the table while ensuring accurate isolation of content of the detected table area from other text of the document.
[0088] At step 504, content from the detected table area is extracted. The table area corresponds to at least one of the column data of the table. Further, the content from the detected table area is extracted by processing an image of the detected table area or an image layer of the image. The image of the detected table area or the image layer of the image is extracted using the OCR engine.
[0089] At step 506, the extracted at least one column data is mapped to the predetermined header label. The mapping is performed using a trained NLP model.
[0090] Those skilled in the art would appreciate that embodiments of the present disclosure enable mapping the predetermined header label with the column data of the table. For effective mapping, the table area is detected, where an identified area corresponds to the respective column data present in the table. Further, content from the detected table area is extracted while ensuring that it is accurately isolated from other text present within the invoice. Furthermore, the extracted content may be mapped to the predefined header labels using a trained Natural Language Processing (NLP) model. The present disclosure outlines a system and method for automating the extraction process thereby significantly speeding up invoice processing compared to manual entry, thus allowing businesses to efficiently handle higher invoice volumes. Additionally, the present disclosure ensures consistent data extraction by applying similar rules and logic throughout an automated invoice processing mechanism, thus enabling faster, more accurate invoice processing.
[0091] FIG. 6 illustrates an exemplary computer system 600 in which or with which embodiments of the present disclosure may be implemented. As shown in FIG. 6, the computer system 600 may include an external storage device 610, a bus 620, a main memory 630, a read-only memory 640, a mass storage device 650, communication port(s) 660, and a processor 670. A person skilled in the art will appreciate that the computer system 600 may include more than one processor and communication ports. The processor 670 may include various modules associated with embodiments of the present disclosure. The communication port(s) 660 may be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. The communication port(s) 660 may be chosen depending on a network, such as a Local Area Network (LAN), a Wide Area Network (WAN), or any network to which the computer system 600 connects. The main memory 630 may be random access memory (RAM), or any other dynamic storage device commonly known in the art. The read-only memory 640 may be any static storage device(s) such as, but not limited to, a Programmable Read Only Memory (PROM) chip for storing static information, e.g., start-up or BIOS instructions for the processor 670. The mass storage device 650 may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage devices 650 include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), one or more optical discs, and Redundant Array of Independent Disks (RAID) storage, e.g., an array of disks.
[0092] The bus 620 communicatively couples the processor 670 with the other memory, storage, and communication blocks. The bus 620 may be, e.g., a Peripheral Component Interconnect (PCI)/PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB, or the like, for connecting expansion cards, drives, and other subsystems as well as other buses, such as a front side bus (FSB), which connects the processor 670 to the computer system 600.
[0093] Optionally, operator and administrative interfaces, e.g. a display, keyboard, joystick, and a cursor control device, may also be coupled to the bus 620 to support direct operator interaction with the computer system 600. Other operator and administrative interfaces can be provided through network connections connected through the communication port(s) 660. Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system 600 limit the scope of the present disclosure.
[0094] While the foregoing describes various embodiments of the invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. The scope of the invention is determined by the claims that follow. The invention is not limited to the described embodiments, versions or examples, which are included to enable a person having ordinary skill in the art to make and use the invention when combined with information and knowledge available to the person having ordinary skill in the art.
[0095] Since many modifications, variations, and changes in detail can be made to the described preferred embodiments of the invention, it is intended that all matters in the foregoing description and shown in the accompanying drawings be interpreted as illustrative and not in a limiting sense. Thus, the scope of the invention should be determined by the appended claims and their legal equivalents.
ADVANTAGES OF THE PRESENT DISCLOSURE
[0096] The present disclosure provides a system and method for mapping a predetermined header label with a respective column data of a table.
[0097] The present disclosure provides a system and method for automating an extraction process that significantly accelerates invoice processing compared to manual entry.
[0098] The present disclosure provides a system and method for minimizing risk of errors occurring due to human intervention.
[0099] The present disclosure provides a system and method that uses the automated extraction process by applying the same rules and logic consistently thus ensuring uniform data extraction.
[00100] The present disclosure provides a system and method that facilitates faster and more accurate invoice processing.
[00101] The present disclosure provides a system and method that enhances accuracy of both line item data and header fields captured using rule-based methods, which together boost straight-through processing rate of documents.
Claims:
We Claim:
1. A method (500) for mapping a predetermined header label to at least one of a column data of a table present in a document, the method comprising:
detecting (502), by one or more processors (212) of a computing device (208), a table area of the table while ensuring accurate isolation of content of the detected table area from other text of the document;
extracting (504), by the one or more processors (212), content from the detected table area, where the table area corresponds to the at least one of column data of the table; and
mapping (506), by the one or more processors (212), the extracted at least one of column data to the predetermined header label, where the mapping is performed using a trained NLP model.
2. The method (500) as claimed in claim 1, wherein the content from the detected table area is extracted by processing an image of the detected table area or an image layer of the image.
3. The method (500) as claimed in claim 2, wherein the image of the detected table area or the image layer of the image is extracted using an optical character recognition (OCR) engine.
4. The method (500) as claimed in claim 2, wherein the processing of the image is performed using an image preprocessing mechanism.
5. The method (500) as claimed in claim 4, wherein the image preprocessing mechanism uses at least one of a skew correction mechanism and a noise removal mechanism.
6. The method (500) as claimed in claim 1, wherein the content extracted from the detected table area is at least in a text form or an alphanumeric form.
7. The method (500) as claimed in claim 1, wherein the NLP model predicts OCR data for at least one of the column data of the table.
8. The method (500) as claimed in claim 1, wherein the table area of the table is detected using at least one of a grid detection mechanism and by analyzing each of a predefined header label and a footer label of the table.
9. The method (500) as claimed in claim 1, wherein the predetermined header label is determined from a pre-trained data set.
10. The method (500) as claimed in claim 9, wherein a Support Vector Machine (SVM) model is trained using the pre-trained data set to evaluate one or more data patterns of at least one of the column data of the table.
11. The method (500) as claimed in claim 1, wherein a confidence score is determined between the extracted column data of the table and the predetermined header label, and upon the confidence score meeting a confidence threshold, the extracted content is assigned under a category of the predetermined header label.
12. A system (210) for mapping a predetermined header label to at least one of a column data of a table present in a document, the system (210) comprising:
one or more processors (212) associated with a computing device (208); and
a memory (216) operatively coupled to the one or more processors (212), wherein the memory (216) comprises processor-executable instructions, which on execution, cause the one or more processors (212) to:
detect a table area of the table while ensuring accurate isolation of content of the detected table area from other text of the document;
extract content from the detected table area, where the table area corresponds to the at least one of column data of the table; and
map the extracted at least one of column data to the predetermined header label, where the mapping is performed using a trained NLP model.
13. The system (210) as claimed in claim 12, wherein the content from the detected table area is extracted by processing an image of the detected table area or an image layer of the image.
14. The system (210) as claimed in claim 13, wherein the image of the detected table area or the image layer of the image is extracted using an optical character recognition (OCR) engine.
15. The system (210) as claimed in claim 13, wherein the processing of the image is performed using an image preprocessing mechanism.
16. The system (210) as claimed in claim 15, wherein the image preprocessing mechanism uses at least one of a skew correction mechanism and a noise removal mechanism.
17. The system (210) as claimed in claim 12, wherein a confidence score is determined between the extracted column data of the table and the predetermined header label, and upon the confidence score meeting a confidence threshold, the extracted content is assigned under a category of the predetermined header label.
18. The system (210) as claimed in claim 12, wherein the NLP model predicts OCR data for at least one of the column data of the table.
19. The system (210) as claimed in claim 12, wherein the table area of the table is detected using at least one of a grid detection mechanism and by analyzing each of a predefined header label and a footer label of the table.
20. A non-transitory computer-readable medium comprising processor-executable instructions that cause a processor (212) to:
detect a table area of the table while ensuring accurate isolation of content of the detected table area from other text of the document;
extract content from the detected table area, where the table area corresponds to the at least one of column data of the table; and
map the extracted at least one of column data to the predetermined header label, where the mapping is performed using a trained NLP model.
Dated this 28th November, 2024
Anu Gupta
Agent of the Applicant (INPA-3548)