Computer Parsable Data Processing Platform And Method To Form A Semantically Tagged Output

Abstract: The present invention provides a computer parsable data processing platform and a method to form a semantically tagged output from a document associated with computer parsable data. The method starts by providing a document as input to the computer parsable data processing platform. The document is then analyzed and compared with one or more predefined templates stored in a library template to derive a category of the document. The method continues by mining computer parsable data from the document and tagging the computer parsable data with one of the one or more predefined templates that are identical to the document. Subsequently, the data tagged with one of the one or more predefined templates is utilized to form a semantically tagged output.

Patent Information

Application #

Filing Date

30 March 2015

Publication Number

42/2016

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

abhay.porwal@ipproinc.com

Parent Application

Applicants

Khemeia Technologies Private Ltd

26 Panchali Amman Koil Street, Arumbakkam, Chennai, Tamil Nadu

Inventors

1. Pierre Fraisse

French national, 27 avenue Foch, 69006, Lyon, France

Specification

CLAIMS:
1. A method for forming a semantically tagged output from a computer parsable data of a document, the method comprising:

deriving a category of the document, wherein a category of the document is one of a legal document, a technology specification document, a bulletin document, a magazine and a comic document;

mining the computer parsable data from the document, wherein the computer parsable data is at least one of a text, an image, a table, an equation and a vector image;

tagging the computer parsable data based on the category of the document, wherein a tag is a label utilized to represent at least one predefined template identical to the category of the document associated with the computer parsable data; and

forming the semantically tagged output in response to tagging the computer parsable data, wherein the semantically tagged output is formed by executing a plurality of instructions coded in a high-level programming output language.

2. The method of claim 1, wherein the document is compared with at least one predefined template for deriving a category associated with the document.

3. The method of claim 2, wherein the category of the document is derived based on a layout of the document.

4. The method of claim 1, wherein the step of mining is performed using at least one of a visual analysis technique, a keyword analysis technique, a geometric positioning technique, a regular expression technique, an integration of dictionaries/indexes technique and a structure/hierarchy technique.

5. The method of claim 4, wherein the visual analysis technique analyzes headline, title, classification and main text of the computer parsable data of the document based on distinguished color and font of the computer parsable data.

6. The method of claim 4, wherein the keyword analysis technique analyzes at least one of terms and keywords in the computer parsable data of the document to determine category of the document.

7. The method of claim 4, wherein the geometric positioning technique pinpoints position of the computer parsable data of the document based on distinguished positioning of the computer parsable data in header, footer, title, sections and main text.

8. The method of claim 4, wherein the regular expression technique analyzes the computer parsable data in the document to match the computer parsable data with predefined patterns comprising proper names, strings, designations, titles and addresses.

9. The method of claim 4, wherein the integration of dictionaries/indexes technique analyzes the computer parsable data in the document to match the computer parsable data with customer specific taxonomies.

10. The method of claim 4, wherein the structure/hierarchy technique analyzes the computer parsable data in the document to detect the computer parsable data matching with the semantically tagged output with a single instruction.

11. The method of claim 1, wherein the step of detecting repetitive elements in the computer parsable data of the documents utilizes at least one of a Document Type Definition (DTD) and an XML schema.

12. The method of claim 1, wherein the step of tagging the computer parsable data is performed by applying at least one of a predefined rule set configuration to at least one of a block, an inline element and a wrapper to provide a semantically tagged output.

13. The method of claim 12, wherein the at least one of a predefined rule set configuration is applied to the block for vertical isolation of a portion of text.

14. The method of claim 12, wherein the at least one of a predefined rule set configuration is applied to the inline element comprising at least one of a line of text and a list of lines without any break in paragraph for section titles.

15. The method of claim 12, wherein the at least one of a predefined rule set configuration is applied to the wrapper for isolating a portion of line to any one of typographic format or horizontal format.

16. The method of claim 1, wherein the plurality of instructions comprises at least one step of eliminating headers and footers of the computer parsable data of the document, formatting a list of tagged data of similar type, resolving links for data, formatting tables by utilizing a Continuous Acquisition and Life-cycle Support initiative (CALS) XML schema, formatting subparagraphs similar to the tagged list, computing a label column into codes using a dictionary and interpreting the numerical value of columns according to the code value in the same column to form a semantically tagged output.

17. A computer parsable data processing platform configured to form a semantically tagged output from a computer parsable data of a document, the computer parsable data processing platform comprising:

an application programming interface (API) configured to receive the computer parsable data input associated with the document, wherein the computer parsable data is processed to form the semantically tagged output and displayed through the application programming interface;

a computer parsable data processor configured to execute subsequent operations for forming the semantically tagged output, wherein the subsequent operations comprise:

deriving a category of the document, wherein a category of the document is one of a legal document, a technology specification document, a bulletin document, a magazine and a comic document;

a library template communicating with the computer parsable data processor configured to store at least one predefined template utilized for tagging the computer parsable data; and the semantically tagged output.

FIELD OF THE INVENTION

[0001] The present invention generally relates to a method and system for processing unstructured data. More specifically, the present invention relates to a computer parsable data processing platform to form a semantically tagged output by parsing unstructured data.

BACKGROUND OF THE INVENTION

[0002] In general, most of the data that is authored, downloaded, accessed or stored in an electronic form is unstructured data, which represents pure or unannotated text. The unstructured data can be, but need not be limited to, data in Portable Document Format (PDF), Word, American Standard Code for Information Interchange (ASCII) and Hyper Text Markup Language (HTML) documents and the like.

[0003] Conventionally, there exist many methods and systems to convert unstructured/original data into structured data. The outcome of converting the unstructured data depends on one or more features of the input document and on the method and system used for conversion. The one or more features can be, but need not be limited to, structural consistency of the data, availability of the data, consistency of formatting of the data and the like. The issues that arise while using the method and system can be, but need not be limited to, sophistication of the method and system, configuration of the system, processing of a specific document type and the like. Thus, depending on the one or more features, the conventional methods and systems are not capable of providing the required structured document. Even when a structured document is obtained by utilizing the conventional methods and systems, the structured document needs to be reviewed manually by an operator or a content specialist to fix interference errors or missing desired structure.

[0004] The conventional methods and systems at times provide poor results in the process of conversion because of one or more unresolved errors, which can be, but need not be limited to, inconsistency of input data, unexpected formatting of data and ordering of data in unstructured documents. In order to improve the results of conventional methods and systems, the unstructured documents have to be modified to match predefined conversion rules and patterns, or the rules have to account for the variability of the document. This is only possible through a conversion-review-correction process. Human intervention therefore cannot be avoided with these conventional methods and systems, and they cannot perform a fully automated conversion process for forming semantically and structurally valid documents.

[0005] In view of the above, there is a need to enhance the conventional data conversion techniques for automatic conversion of an unstructured document to a structured document without human intervention.

BRIEF DESCRIPTION OF DRAWINGS

[0006] The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention.

[0007] FIG. 1 illustrates a flow diagram for a method of forming a semantically tagged output from a document associated with a computer parsable data in accordance with an embodiment.

[0008] FIG. 2 illustrates a schematic diagram of a plurality of techniques utilized for mining computer parsable data from a document in accordance with an embodiment.

[0009] FIG. 3 illustrates a computer parsable data processing platform to form a semantically tagged output from a document associated with computer parsable data in accordance with an embodiment.

[0010] FIG. 4 illustrates an exemplary scenario of forming a semantically tagged output from a document associated with computer parsable data in accordance with an embodiment.

[0011] Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0012] Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in a computer parsable data processing platform that can be used to automatically form a semantically tagged output from a document associated with computer parsable data.

[0013] In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article or composition that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article or composition. An element preceded by “comprises …a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article or composition that comprises the element.

[0014] Various embodiments of the present invention provide a computer parsable data processing platform and method for forming a semantically tagged output from a document associated with computer parsable data. The computer parsable data processing platform includes an Application Programming Interface (API). The API is utilized to receive a document associated with computer parsable data for processing the computer parsable data. The computer parsable data is processed by a computer parsable data processor to form a semantically tagged output by executing one or more operations. The execution of the one or more operations starts with deriving a category of the document associated with the computer parsable data. The method continues by mining the computer parsable data from the document and tagging the computer parsable data based on the category of the document. The computer parsable data processing platform includes a library template in communication with the computer parsable data processor for storing one or more predefined templates. The one or more predefined templates are used to determine the category of the document associated with the computer parsable data. Then, the method forms the semantically tagged output in response to tagging the computer parsable data.

[0015] FIG. 1 illustrates a flow diagram 100 of a method to form a semantically tagged output from a document associated with computer parsable data in accordance with an embodiment of the present invention. The computer parsable data in the present invention can be defined as unstructured data and the semantically tagged output can be defined as structured data.

[0016] The computer parsable data/unstructured data is the data authored, downloaded, accessed or stored in an electronic form, which may include, but need not be limited to, data in Portable Document Format (PDF), Word, American Standard Code for Information Interchange (ASCII) and Hyper Text Markup Language (HTML) documents. Similarly, the semantically tagged output/structured data is the data organized in a specific structural format by tagging the complete data in the document or a part of the data in the document. The semantically tagged output may include, but need not be limited to, data in the form of Extensible Markup Language (XML), eXtensible Business Reporting Language (XBRL), which is a type of XML, and the like.

[0017] As illustrated in FIG. 1, the method of forming a semantically tagged output starts at step 102 by providing / uploading an unstructured document to a computer parsable data processing platform. The computer parsable data processing platform (further explained in conjunction with FIG. 3) is configured to analyze, interpret, extract and convert multiple content elements of the unstructured data / computer parsable data from the document to form a semantically tagged output / structured output. The document can be one of a PDF document, a Word document, an ASCII document and an HTML document, where the data is not organized in a specific format. The document can also be referred to as an original document, which is pure and unannotated.

[0018] The method continues with step 104 by deriving a category of the document. The category of the document is derived by analyzing the document and comparing the document with one or more predefined templates stored in a library template. The analysis of the document is performed by utilizing the computer parsable data processing platform. The comparison of the document is performed by matching a layout of the document with a layout of the one or more predefined templates. The layout of the document is an outline of the document which represents the alignment of pages in series or parallel and the like. Thus, once the layout of the document is matched with the layout of one of the predefined templates, a category of the document is derived. The category of the document can be any one of a legal document, a technology specification document, a bulletin document, a comic document, a magazine and the like.
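The layout comparison of step 104 can be pictured with a short sketch. This is only an illustration under stated assumptions: the specification does not say how a layout is represented or scored, so the `Layout` fields, the contents of `TEMPLATE_LIBRARY`, the `similarity` score and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass(frozen=True)
class Layout:
    """Hypothetical layout descriptor; the fields are illustrative only."""
    columns: int
    has_header: bool
    has_footer: bool
    has_numbered_sections: bool
    has_tables: bool


# Hypothetical template library: category name -> expected layout.
TEMPLATE_LIBRARY: Dict[str, Layout] = {
    "legal document": Layout(1, True, True, True, False),
    "technology specification document": Layout(1, True, True, True, True),
    "magazine": Layout(3, False, True, False, False),
}


def similarity(a: Layout, b: Layout) -> float:
    """Share of layout fields on which the two layouts agree."""
    names = [f.name for f in fields(Layout)]
    return sum(getattr(a, n) == getattr(b, n) for n in names) / len(names)


def derive_category(doc: Layout,
                    library: Dict[str, Layout] = TEMPLATE_LIBRARY,
                    threshold: float = 0.8) -> Optional[str]:
    """Return the best-matching category, or None if nothing is close enough."""
    best = max(library, key=lambda name: similarity(doc, library[name]))
    return best if similarity(doc, library[best]) >= threshold else None
```

Returning `None` when no template is close enough is a design choice of this sketch; the specification does not describe what happens to a document that matches no template.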

[0019] Next at step 106, the method continues by mining the computer parsable data from the document. The computer parsable data may include, but need not be limited to, a text, an image, a table, an equation, a vector image and the like. In an embodiment, text mined from the document can have different features, which may include, but need not be limited to, language specific characters such as accents, cedillas and other diacritics, superscripts and subscripts that enable the linkages between main content and footnotes, small caps, bullet points, paragraph segmentation, the sequence and logical order of multiple column layouts of information and the like.

[0020] Similarly, an image mined from the document can be converted into a homogeneous format which may include, but need not be limited to, Portable Network Graphics (PNG), Joint Photographic Experts Group (JPEG/JPG) and the like. In a similar manner, equations can also be mined from the document; they can be identified by their mathematical and chemical operators and clipped as images. Similarly, vector images mined from the document can be identified and clipped as images, and the tables mined from the document are analyzed to indicate the separation of columns.

[0021] The method of mining the computer parsable data from the document can be performed by utilizing one or more of a visual analysis technique, a keyword analysis technique, a geometric positioning technique, a regular expression technique, an integration of dictionaries/indexes technique and a structure/hierarchy technique. These techniques are explained in detail in conjunction with FIG. 2.

[0022] Moving on, at step 108, the method continues by tagging the data mined from the computer parsable data with one of the one or more predefined templates based on the category of the document. Here a tag is a label utilized to represent the one or more predefined templates identical to the category of the document. In order to tag the computer parsable data, one or more predefined rule set configurations are applied to an element of a text. The element of text may include, but need not be limited to, a block, an inline element and a wrapper. In an embodiment, a rule applied to a block can be for vertical isolation of a portion of text. The portion of text can be, but need not be limited to, a table and a figure. In another embodiment, a rule applied to an inline element may cover any one of a line of text and a list of lines without any break in paragraphs. Inline rules are applied for section titles, for example titles that are in bold and match a regular expression that starts with a dotted-number expression. In yet another embodiment, a rule applied to a wrapper is used for isolating a portion of a line according to any one of a typographic format or a horizontal format, where the wrapper rules can be applied to numbers, identified by, but not limited to, regular expressions and the like.
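The inline and wrapper rules described above can be sketched as follows. This is a minimal illustration, not the rule set configuration of the platform: the `Line` type, the tag names `title`, `para` and `num`, and both regular expressions are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Line:
    """Hypothetical mined line of text with one typographic attribute."""
    text: str
    bold: bool = False


# Dotted-number prefix such as "2", "2.1" or "2.1.3." followed by text.
SECTION_NUMBER = re.compile(r"^\d+(\.\d+)*\.?\s+\S")
# Integer or decimal number inside a line.
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def tag_inline(line: Line) -> str:
    """Inline rule: a bold line starting with a dotted number is a section title."""
    text = escape(line.text)
    if line.bold and SECTION_NUMBER.match(line.text):
        return f"<title>{text}</title>"
    return f"<para>{text}</para>"


def wrap_numbers(text: str) -> str:
    """Wrapper rule: isolate the numeric portions of a line."""
    return NUMBER.sub(lambda m: f"<num>{m.group(0)}</num>", escape(text))
```

For instance, `tag_inline(Line("2.1 Scope", bold=True))` yields a `title` element, while the same text without bold is treated as an ordinary paragraph.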

[0023] Later at step 110, the method forms a semantically tagged output in response to tagging the computer parsable data based on one of the one or more predefined templates. The semantically tagged output is formed by executing a plurality of instructions coded in a high-level programming output language. The plurality of instructions includes eliminating headers and footers of the document associated with computer parsable data, formatting a list of tagged data of similar type, resolving links for data, formatting tables by utilizing the Continuous Acquisition and Life-cycle Support initiative (CALS) XML schema, formatting subparagraphs similar to the tagged list, computing a label column into codes using a dictionary and interpreting the numerical value of columns according to the code value in the same column. The high-level programming output language used to code the plurality of instructions may include, but need not be limited to, the Khemeia output language and the like.
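The table-formatting instruction can be illustrated with the CALS table model, whose elements are `table`, `tgroup`, `colspec`, `thead`, `tbody`, `row` and `entry`. The sketch below assumes that a mined table is already available as a header list plus a list of rows; the function name `to_cals_table` and the column names `c1`, `c2`, ... are illustrative and do not come from the specification.

```python
import xml.etree.ElementTree as ET
from typing import List


def _add_row(parent: ET.Element, cells: List[str]) -> None:
    """Append one CALS <row> with an <entry> per cell."""
    row = ET.SubElement(parent, "row")
    for cell in cells:
        ET.SubElement(row, "entry").text = cell


def to_cals_table(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a simple table using CALS table model element names."""
    table = ET.Element("table")
    tgroup = ET.SubElement(table, "tgroup", cols=str(len(header)))
    for index in range(len(header)):
        ET.SubElement(tgroup, "colspec", colname=f"c{index + 1}")
    _add_row(ET.SubElement(tgroup, "thead"), header)
    tbody = ET.SubElement(tgroup, "tbody")
    for cells in rows:
        _add_row(tbody, cells)
    return ET.tostring(table, encoding="unicode")
```

The sketch does not handle spanned cells or rows with a different number of cells than the header.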

[0024] In an exemplary embodiment, the plurality of instructions executed to form the semantically tagged output can include eliminating headers of the document. The headers of the document may include, but need not be limited to, a logo, a title, a code of the document, a date and the like. Similarly, the footer of the document can be eliminated, which may include, but need not be limited to, page numbers, reference links of the data in a particular page, meanings of words and the like. The method of forming a semantically tagged output can be performed by formatting a list of data based on a hierarchy for consecutively tagging the data of similar type. The similar type of the data is computed from the first characters of the list. The method can also be performed by resolving links for data to refer to a data module reference. Further, the method can be performed by formatting tables in the document by utilizing the CALS XML schema. Subparagraphs can also be formatted based on a hierarchy, similar to the method used in formatting the list of data, for forming a semantically tagged output. Here, the similar type of subparagraphs can be computed by using the number of bullets used in numbering. Additionally, the method of forming the semantically tagged output can be performed by computing a label column into codes using a dictionary and interpreting the numerical value of columns according to the code value in the same column.
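Header and footer elimination can be sketched as below. The specification does not state how headers and footers are recognized, so this example assumes a common heuristic: a first or last line that recurs on most pages, once digits such as page numbers are masked, is treated as a header or footer. The function names and the `min_share` parameter are hypothetical.

```python
import re
from collections import Counter
from typing import List


def _normalize(line: str) -> str:
    """Mask digits so that 'Page 1' and 'Page 2' compare as equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_headers_footers(pages: List[List[str]],
                          min_share: float = 0.8) -> List[List[str]]:
    """Drop first/last lines that repeat on at least min_share of the pages."""
    if len(pages) < 2:
        # Repetition cannot be judged from a single page.
        return [list(page) for page in pages]
    first = Counter(_normalize(p[0]) for p in pages if p)
    last = Counter(_normalize(p[-1]) for p in pages if p)
    needed = min_share * len(pages)
    cleaned = []
    for page in pages:
        body = list(page)
        if body and first[_normalize(body[0])] >= needed:
            body = body[1:]
        if body and last[_normalize(body[-1])] >= needed:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

Multi-line headers or footers would need the same test applied to more than one line per page edge.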

[0025] Moving on, FIG. 2 illustrates a schematic diagram 200 representing a plurality of techniques utilized for mining computer parsable data from a document in accordance with an embodiment of the present invention.

[0026] As illustrated in FIG. 2, the plurality of techniques used for mining computer parsable data from the document can be one or more of a visual analysis technique 202, a keyword analysis technique 204, a geometric positioning technique 206, a regular expression technique 208, an integration of dictionaries/indexes technique 210 and a structure/hierarchy technique 212.

[0027] Visual analysis technique 202 is used to visually analyze the document associated with computer parsable data by distinguishing font, word case, text color and alignment of text in the document. In an embodiment, visual analysis technique 202 is used for a document having a distinguished color and font for each section, where the headline of the document can be of any color, which may include, but need not be limited to, black, pink, green and the like, with a normal style of words and a paragraph spacing greater than 15pt. Similarly, the title of the document can also be of any color, which may include, but need not be limited to, black, pink, green and the like, and is bold with a paragraph spacing greater than 15pt. Further, the document may include a word classification segment, which can be in regular font in short lines. The other segment of the document can be the main text, which can be represented in the regular font with a justified line alignment. Thus, each segment of the document can vary in color, font and alignment, which can be analyzed by using the visual analysis technique to determine the category of the document.
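The embodiment above distinguishes headline, title, classification and main text by typographic attributes. A minimal sketch of such a classifier is shown below; the `Segment` type, the `short_line_chars` cut-off for a "short line" and the label names are assumptions, while the 15pt spacing threshold, the bold title and the justified main text follow the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical mined text segment with its visual attributes."""
    text: str
    bold: bool
    spacing_pt: float   # paragraph spacing in points
    alignment: str      # e.g. "left" or "justified"


def classify(segment: Segment, short_line_chars: int = 40) -> str:
    """Assign a segment role from font weight, spacing, alignment and length."""
    if segment.spacing_pt > 15:
        # Both titles and headlines use wide spacing; bold separates them.
        return "title" if segment.bold else "headline"
    if segment.alignment == "justified":
        return "main_text"
    if len(segment.text) <= short_line_chars:
        return "classification"
    return "unknown"
```

Color is left out of the sketch because the embodiment allows any color for both the headline and the title.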

[0028] Keyword analysis technique 204 is used to analyze one or more terms and keywords in the document associated with computer parsable data. In one embodiment, the keyword used in the keyword analysis technique can be a signature in the document. The keyword analysis technique provides a plurality of patterns for signatures to detect the often complex signature zone. The plurality of patterns for signatures can be identified based on a plurality of semantic elements, which may include, but need not be limited to, proper names, dates and the like, to determine the category of the document.

[0029] Geometric positioning technique 206 is used to pinpoint the position of the computer parsable data in the document based on a distinguished positioning of the computer parsable data. The positioning of data in the document may include, but need not be limited to, header, footer, title, sections, main text and the like. In an embodiment, the positioning of data is distinguished by the alignment of a footnote to the left or to the right of the document relative to the main text. Similarly, the header positions, title and main text can also be pointed out based on their alignment by using geometric positioning technique 206.
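Geometric positioning can be illustrated with a small sketch that assigns a zone from page coordinates. The coordinate convention (y measured from the top of the page), the 10% header and footer bands and the function names are assumptions for illustration; the specification gives no numeric values.

```python
def zone(y_top: float, page_height: float,
         header_frac: float = 0.1, footer_frac: float = 0.1) -> str:
    """Classify a text box as header, footer or main text by vertical position."""
    if page_height <= 0:
        raise ValueError("page_height must be positive")
    relative = y_top / page_height
    if relative < header_frac:
        return "header"
    if relative > 1 - footer_frac:
        return "footer"
    return "main_text"


def footnote_side(x_left: float, body_left: float, body_right: float) -> str:
    """Tell whether a footnote starts left or right of the main text's centre."""
    centre = (body_left + body_right) / 2
    return "left" if x_left < centre else "right"
```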

[0030] Regular expression technique 208 is used to analyze the computer parsable data in the document to match the computer parsable data with predefined patterns which may include, but need not be limited to, proper names, strings, designations, titles, addresses and the like. In an embodiment, regular expression technique 208 can use proper names to match the computer parsable data in documents of legal cases. These legal case documents may require anonymization by using a rule set of regular expressions for names. The names can be detected by using different name patterns which may include, but need not be limited to, surname-name, title-name-surname, surname-initials-name, company_type-company_name and the like.
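Two of the name patterns listed above, title-name-surname and company_type-company_name, can be sketched with regular expressions for the anonymization use case. The honorifics, the company types and the replacement labels below are hypothetical examples, and the patterns only cover unaccented Latin names.

```python
import re

# title-name-surname, e.g. "Mr. Jean Dupont" (honorific list is illustrative).
TITLE_NAME_SURNAME = re.compile(
    r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+\s+[A-Z][a-z]+\b"
)
# company_type-company_name, e.g. "SARL Martin" (company types are illustrative).
COMPANY = re.compile(r"\b(?:SARL|SAS|SA)\s+[A-Z][A-Za-z]+\b")


def anonymize(text: str) -> str:
    """Replace matched persons and companies with neutral placeholders."""
    text = TITLE_NAME_SURNAME.sub("[PERSON]", text)
    return COMPANY.sub("[COMPANY]", text)
```

A production rule set would need many more patterns, such as surname-initials-name, and support for accented characters.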

[0031] Integration of dictionaries/indexes technique 210 is used to analyze the computer parsable data in the document to match the computer parsable data with customer specific taxonomies. In an embodiment, dictionaries/indexes related to company accounts, which may include, but need not be limited to, Generally Accepted Accounting Principles (GAAP) taxonomies, International Financial Reporting Standards (IFRS) taxonomies and the like, can be integrated when interpreting company accounts to form a semantically tagged output.
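Dictionary integration, together with the instruction of computing a label column into codes and interpreting the numerical column according to the code, can be sketched as follows. The `TAXONOMY` entries, the codes `REV` and `COS` and the sign convention are invented for the example and are not actual GAAP or IFRS taxonomy elements.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical customer dictionary: normalized label -> (code, sign of the value).
TAXONOMY: Dict[str, Tuple[str, int]] = {
    "revenue": ("REV", 1),
    "cost of sales": ("COS", -1),
}

Row = Tuple[Optional[str], str, Optional[float]]


def encode_rows(rows: List[Tuple[str, str]],
                taxonomy: Dict[str, Tuple[str, int]] = TAXONOMY) -> List[Row]:
    """Map each label to a code and interpret its number according to that code."""
    encoded: List[Row] = []
    for label, raw_value in rows:
        entry = taxonomy.get(label.strip().lower())
        if entry is None:
            # Unknown label: keep it, but leave code and value unresolved.
            encoded.append((None, label, None))
            continue
        code, sign = entry
        value = sign * float(raw_value.replace(",", ""))
        encoded.append((code, label, value))
    return encoded
```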

[0032] Structure/hierarchy technique 212 analyzes an ordered list of the computer parsable data in the document by using the regular expression technique and forms a specific structure of the document with a title, sub-titles, paragraphs, footnotes and the like by using a single instruction.
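One way to picture the structure/hierarchy technique is to derive nesting depth from dotted section numbers with a regular expression. The sketch below is an assumption about how such an instruction could behave; the dictionary-based node layout and the function name `build_outline` are illustrative.

```python
import re
from typing import Dict, List

# Dotted section number followed by a heading, e.g. "1.2 Purpose".
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)$")


def build_outline(lines: List[str]) -> Dict:
    """Nest numbered headings by depth; other lines attach to the open heading."""
    root = {"title": None, "paragraphs": [], "children": []}
    stack = [(0, root)]  # (depth, node); the root has depth 0
    for line in lines:
        match = NUMBERED.match(line)
        if not match:
            stack[-1][1]["paragraphs"].append(line)
            continue
        depth = match.group(1).count(".") + 1
        node = {"title": match.group(2), "paragraphs": [], "children": []}
        # Close headings at the same or a deeper level before attaching.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root
```

A body line that happens to start with a number, such as "12 bolts are required", would be mistaken for a heading here; combining this rule with a typographic check such as bold text would reduce that risk.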

[0033] Referring to FIG. 3, a computer parsable data processing platform 300 is illustrated that forms a semantically tagged output from a document associated with computer parsable data in accordance with an embodiment of the present invention.

[0034] As illustrated in FIG. 3, computer parsable data processing platform 300 includes an Application Programming Interface (API) 302, which defines a set of routines, protocols and tools for building an application that provides an interaction between software components. API 302 is used to receive user input requests and provide their corresponding outputs. The input received from a user through API 302 can be any type of document associated with computer parsable data. Thereafter, the document is transmitted to a computer parsable data processor 304 for processing.

[0035] Computer parsable data processor 304 is configured to execute one or more operations for processing the documents associated with computer parsable data to form a semantically tagged output. The one or more operations performed by computer parsable data processor 304 start by deriving a category of the document and mining the computer parsable data from the document. Then, the computer parsable data mined from the document can be tagged with one of one or more predefined templates stored in a library template 306. The predefined template tagged to the computer parsable data can be used to form the semantically tagged output based on the category of the document (the method of performing one or more operations in the computer parsable data processor is explained in conjunction with FIG. 1). Further, the semantically tagged output formed by computer parsable data processor 304 is transmitted to the user through API 302. In addition, computer parsable data processor 304 utilizes one or more Document Type Definitions (DTD) or XML schemas to detect repetitive elements in the documents.

[0036] As illustrated, library template 306, in communication with computer parsable data processor 304, is used for storing one or more predefined templates of documents. The one or more predefined templates can be used for tagging the corresponding document associated with the computer parsable data. In addition, library template 306 stores the semantically tagged output that results from tagging one of the one or more predefined templates to the corresponding document associated with the computer parsable data. Library template 306, which stores the one or more predefined templates and the semantically tagged output, can be either a remote library or a centralized library. The remote library can be used for accessing the one or more predefined templates remotely from a specific data communication device. The remote library can also be used for remotely storing the semantically tagged output. Similarly, the centralized library can be used for accessing one or more predefined templates centrally and for storing the semantically tagged output.

[0037] Referring to FIG. 4, an exemplary scenario for forming a semantically tagged output from a document associated with computer parsable data is illustrated in accordance with an embodiment of the present invention.

[0038] As illustrated in FIG. 4, a PDF document is provided as an input to computer parsable data processing platform 300 at step 402. The PDF document received by computer parsable data processor 304 is processed to form a semantically tagged output. The processing starts at step 404 by analyzing the PDF document and comparing the PDF document with the one or more predefined templates to derive a category of the PDF document. The PDF document can be compared by matching a layout of the PDF document with a layout of one of the one or more predefined templates.

[0039] Next at step 406, the method continues by mining the computer parsable data from the PDF document. The computer parsable data mined from the PDF document may include, but need not be limited to, a text, an image, a table, an equation, a vector image and the like. (The mining of data and the methods used for mining are explained in conjunction with FIG. 1 and FIG. 2).

[0040] Further, the layout of the PDF document and the computer parsable data mined from the PDF document are used to determine the category of the PDF document as a technology specification document. Then, the PDF document is tagged with the technology specification document at step 408. The tagging of the PDF document associated with the computer parsable data to the technology specification document is performed by applying one or more predefined rule set configurations to an element of a text. The element of the text may include, but need not be limited to, a block, an inline element, a wrapper and the like.

[0041] Thus, each element of the text tagged with the technology specification document is transformed / converted to form a structured technology specification document at step 410 by executing a plurality of instructions. The plurality of instructions includes eliminating headers and footers of the PDF document, formatting a list of tagged data of similar type, resolving links for data, formatting tables by utilizing the Continuous Acquisition and Life-cycle Support initiative (CALS) XML schema, formatting subparagraphs similar to the tagged list, computing CODE into codes using a dictionary and interpreting NUMBER according to the code value in the same column to form the structured technology specification document. Finally, the structured technology specification document is formed from the PDF document with a hierarchy which includes one or more segments that can be, but need not be limited to, a header, title, contents, heading, subtitles, paragraphs, subparagraphs and footnotes. In the structured technology specification document, the data in the header, main text, tables and footer is aligned at predefined places.

[0042] Those skilled in the art will realize that the above recognized advantages and other advantages described herein are merely exemplary and are not meant to be a complete rendering of all of the advantages of the various embodiments of the present invention.

[0043] In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification is to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The present invention is defined solely by the appended claims including any amendments made during the pendency of this invention and all equivalents of those claims as issued.

Documents

Application Documents

#	Name	Date
1	KheP002 - Form 5.pdf	2015-04-13
2	KheP002 - Form 2.pdf	2015-04-13
3	KheP002 - Drawings.pdf	2015-04-13
4	1658-CHE-2015-FER.pdf	2019-12-31

Search Strategy

1	search34_13-12-2019.pdf