Abstract: An optimized entity extraction platform system and method for identifying key entities from unstructured/semi-structured text using a visual segregation application. The entity extraction system initially converts the different types of documents into a common document format in order to preserve the look and feel of the collected data/documents. The data/documents are then divided into one or more blocks using a sectionization process in order to sequentially identify the sections within the document. The sections are further identified and classified based on the representation type and style of the document using the visual segregation application and stored in a repository. The visual segregation application analyses the visual aspects of the data representation within each section and assigns a higher weight to the appropriate set of words, while the accuracy of the extracted entities can be further improved by using the patterns of the text in the document. A rule-based engine is employed to extract the entities from the text based on a small domain-specific dictionary and text patterns/regular expressions in order to identify the key entities from the unstructured/semi-structured text.
OPTIMIZED ENTITY EXTRACTION PLATFORM SYSTEM AND METHOD
TECHNICAL FIELD
[0001] Embodiments are generally related to entity extraction systems and methods. Embodiments are also related to visual segregation techniques. Embodiments are additionally related to a system and method for identifying key entities from unstructured/semi-structured text using an entity extraction platform.
BACKGROUND OF THE INVENTION
[0002] Human languages are rich and complicated, with large vocabularies, complex grammar, and context-dependent meanings. By way of example, a particular statement, question, thought, meaning, etc. can be expressed in a multitude of different manners. Thus, machine interpretation of human language is an extremely complex task. For at least this reason, the result/action produced from a human input often does not accurately map/correspond to the user intent.
[0003] Most web-based applications and other applications provide gadgets to users that generate content based on entities extracted from search queries or documents of the users. For example, some applications present gadgets that present content based on entities extracted from search queries. These entities are typically extracted based on either a keyword in the query or a pattern that must match the entire query, rather than a more complex pattern. Some applications present gadgets that present content based on entities extracted from documents. These entities are typically extracted based on keywords in the document. While some applications may recognize more complex patterns of text, they do so only when a document is displayed and not when a document is modified.
[0004] Most of the prior art entity extraction systems employ a NLP (Natural Language Processing) based machine learning technique in order to extract the entities. Natural language input can be useful for a wide variety of applications, including virtually every software application with which humans interact. Typically, during natural language processing the natural language input is separated into tokens and mapped to one or more actions provided by the software application. Each software application can have a unique set of actions, which are somewhat limited in nature. As a result, it can be both time-consuming and repetitive for developers to draft code to interpret natural language input and map the input to the appropriate action for each application.
[0005] Based on the foregoing, it is believed that a need exists for an improved optimized entity extraction platform system and method for identifying key entities from an unstructured/semi structured text using a visual entity extraction technique, as described in greater detail herein.
BRIEF SUMMARY
[0006] The following summary is provided to facilitate an understanding of some of the innovative features unique to the disclosed embodiment and is not intended to be a full description. A full appreciation of the various aspects of the embodiments disclosed herein can be gained by taking the entire specification, claims, drawings, and abstract as a whole.
[0007] It is, therefore, one aspect of the disclosed embodiments to provide for an improved optimized entity extraction platform system and method.
[0008] It is another aspect of the disclosed embodiments to provide for an improved visual segregation application.
[0009] It is a further aspect of the disclosed embodiments to provide for an improved system and method for identifying key entities from unstructured/semi-structured text using a visual segregation application.
[0010] The aforementioned aspects and other objectives and advantages can now be achieved as described herein. An optimized entity extraction platform system and method for identifying key entities from unstructured/semi-structured text using a visual segregation application is disclosed herein. The entity extraction system initially converts the different types of documents (such as .doc, .rtf, .pdf, .html, .txt, etc.) into a common document format (e.g., HTML) in order to preserve the look and feel of the collected data/documents. The data/documents are then divided into one or more blocks using a sectionization process in order to sequentially identify the sections within the document. The sections are further identified and classified based on the representation type and style of the document using the visual segregation application and stored in a repository (e.g., a database). The visual segregation application analyses the visual aspects of the data representation within each section and assigns a higher weight to the appropriate set of words, while the accuracy of the extracted entities can be further improved by using the patterns of the text in the document. A rule-based engine is employed to extract the entities from the text based on a small domain-specific dictionary and text patterns/regular expressions in order to identify the key entities from the unstructured/semi-structured text.
[0011] The rule engine of the entity extraction platform can be employed to identify relevant data/documents with respect to a segment or domain, since the data/documents of a domain usually follow a pattern, which helps in building the rule engine. The rule engine can be based on the content of the document, properties of the document, the data source, the document presentation, or key elements of the document. The set of domain-specific documents can be recognized by the set of keywords that prominently drive the domain. A subset of such prominent keywords can further segregate the appropriate data/documents from the irrelevant data/documents of the selected domain/segment. Also, the presence of certain keywords can be forbidden, which can enhance the extraction process.
[0012] Note that the application described herein does not depend only on the keywords that drive the domain/segment. It also considers other elements in the documents for segregation of the entities. File types and file attributes (file format and file size) provided by the operating system can be one element that provides reasonable segregation of the relevant data set from the non-relevant data sets. Apart from document segmentation, the entity extraction platform can also be employed to identify the components of a data/document of any language through their presentation.
[0013] The common document format of the data/documents encapsulates the information, including the visual aspects of the data/documents, in a well-defined format. Such an aspect facilitates effective information aggregation without any actual understanding of the content of the document. The visual segregation based entity extraction system described herein does not depend solely on language processing techniques and relies more heavily on the visual representation of the content. Words that are represented in a format different from that of the surrounding words are given increased importance, and such words play a vital role in the entity extraction. Also, the visual segregation entity extraction does not depend on the documents sharing similar template formats. It differentiates itself from the machine learning approach in that there is no need for the system to learn from a huge data corpus.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The accompanying figures, in which like reference numerals refer to identical or functionally-similar elements throughout the separate views and which are incorporated in and form a part of the specification, further illustrate the present invention and, together with the detailed description of the invention, serve to explain the principles of the present invention.
[0015] FIG. 1 illustrates a graphical representation of an optimized entity extraction system having a visual segregation application, in accordance with the disclosed embodiments;
[0016] FIG. 2 illustrates a block diagram of the optimized entity extraction system, in accordance with the disclosed embodiments; and
[0017] FIG. 3 illustrates a high level flow chart of operation illustrating logical operation steps of a method for identifying key entities from unstructured/semi-structured text using a visual segregation application, in accordance with the disclosed embodiments.
DETAILED DESCRIPTION
[0018] The particular values and configurations discussed in these non-limiting examples can be varied and are cited merely to illustrate at least one embodiment and are not intended to limit the scope thereof.
[0019] The embodiments now will be described more fully hereinafter with reference to the accompanying drawings, in which illustrative embodiments of the invention are shown. The embodiments disclosed herein can be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
[0020] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0021] Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
[0022] FIG. 1 illustrates a graphical representation of the entity extraction system 100 having a visual segregation application 140, in accordance with the disclosed embodiments. The system includes an entity extraction server 130 that is operatively configured with a visual segregation application 140 in order to segregate one or more documents from a content provider environment 160 having one or more content providers 145, 150 and 155 and semantically provide the extracted data to a user environment 105 with users 110, 115, 120 and 125. The system 100 effectively identifies key entities from an unstructured/semi structured text and transmits the data to the user environment 105 based on the ontology/semantic net. The system 100 also includes a database 135 for storing the collected entity information in the network (e.g., internet).
[0023] The entity extraction server 130 may be specially constructed for performing various processes and operations according to the disclosed embodiments or may include a general-purpose computer selectively activated or reconfigured by the visual segregation application 140 in order to provide the necessary functionality. The processes disclosed herein are not inherently related to any particular computer, network, architecture, environment, or other apparatus, and may be implemented by a suitable combination of hardware, software, and/or firmware.
[0024] While the entity extraction system 100 has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail can be made therein without departing from the spirit and scope of the invention. Furthermore, as used in the specification and the appended claims, the term "system" includes any data processing system or apparatus including, but not limited to, personal computers, servers, workstations, network computers, main frame computers, routers, switches, Personal Digital Assistants (PDAs), telephones, and any other system capable of processing, transmitting, receiving, capturing and/or storing data. Thus, the user environment 105 depicted in FIG. 1, for example, may equally be implemented as a PDA, cellular telephone, Smartphone, laptop computer, iPhone, or Blackberry type device, as well as other types of personal or desktop computers.
[0025] FIG. 2 illustrates a block diagram of the entity extraction system 200, in accordance with the disclosed embodiments. Note that in FIGS. 1-3, identical parts or elements are generally indicated by identical reference numerals. The visual segregation application 140 includes a document conversion unit 220 which converts the documents 260 of different document types 265 such as, for example, .doc, .rtf, .pdf, .html, .txt, etc. into a common document format (e.g., HTML) 225 in order to preserve the look and feel of the data collected over the network (e.g., internet). The documents 260 collected over the network are then divided into one or more blocks using a sectionization unit 230 in order to sequentially identify the sections within the document.
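By way of non-limiting illustration, the sectionization of a document already converted to the common format (HTML) can be sketched as follows. The sketch assumes that sections begin at HTML heading tags; the class name `Sectionizer` and the function name `sectionize` are hypothetical, and the disclosure does not specify how the sectionization unit 230 detects section boundaries.

```python
from html.parser import HTMLParser

# Assumption: a heading tag marks the start of a new section.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class Sectionizer(HTMLParser):
    """Collects sequential sections from HTML, splitting at heading tags."""

    def __init__(self):
        super().__init__()
        # A leading section holds any text that appears before the first heading.
        self.sections = [{"heading": [], "text": []}]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.sections.append({"heading": [], "text": []})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        key = "heading" if self._in_heading else "text"
        self.sections[-1][key].append(data)


def sectionize(html):
    """Return the non-empty sections of the document in document order."""
    parser = Sectionizer()
    parser.feed(html)
    parser.close()
    return [
        {"heading": " ".join(s["heading"]), "text": " ".join(s["text"])}
        for s in parser.sections
        if s["heading"] or s["text"]
    ]
```

Each returned section keeps its heading separately from its body text, so that a later step could classify the section by its heading.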
[0026] The sections are further identified and classified based on the representation type and style of the document 260 and stored in the database 135. The visual segregation application 140 analyses the visual aspects of the data representation within each section and assigns a higher weight to the appropriate set of words, while the accuracy of the extracted entities can be further improved by using the patterns of the text in the document. A rule based engine 240 is employed to extract the entities based on a small domain-specific dictionary 245 and regular expressions/text patterns 250 in order to identify the key entities from the unstructured/semi-structured text of the documents 260 belonging to a specific segment/domain 255. The sections in the document 260 can be classified based on the heading of the section using a set of content based rules.
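By way of non-limiting illustration, the combination of visual weighting with dictionary and regular-expression rules can be sketched as follows. All specifics are assumptions made for the example: the weight values in `EMPHASIS_WEIGHTS`, the dictionary entries, the two patterns, and the names `WeightedText` and `extract_entities` are hypothetical and are not taken from the disclosure.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights for visually emphasized text; plain text weighs 1.0.
EMPHASIS_WEIGHTS = {"h1": 3.0, "h2": 2.5, "h3": 2.0, "b": 1.5, "strong": 1.5, "u": 1.2}

# Hypothetical small domain-specific dictionary and text patterns.
DICTIONARY = {"python": "SKILL", "java": "SKILL"}
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
}


class WeightedText(HTMLParser):
    """Records each text run together with the weight of its visual emphasis."""

    def __init__(self):
        super().__init__()
        self._open = []  # emphasis tags that are currently open
        self.runs = []   # (text, weight) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_WEIGHTS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened tag of this name, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i] == tag:
                del self._open[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            weight = max((EMPHASIS_WEIGHTS[t] for t in self._open), default=1.0)
            self.runs.append((text, weight))


def extract_entities(html):
    """Return (label, value, weight) tuples, highest weight first."""
    parser = WeightedText()
    parser.feed(html)
    parser.close()
    entities = []
    for text, weight in parser.runs:
        for label, pattern in PATTERNS.items():
            entities += [(label, m.group(0), weight) for m in pattern.finditer(text)]
        for word in re.findall(r"[A-Za-z+#]+", text):
            label = DICTIONARY.get(word.lower())
            if label:
                entities.append((label, word, weight))
    return sorted(entities, key=lambda e: -e[2])
```

In this sketch an entity found in bold or heading text is ranked ahead of the same kind of entity found in plain text, which is one simple way of giving higher weight to visually distinguished words.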
[0027] The rule based engine 240 of the entity extraction platform 200 can be employed to identify relevant data/documents 260 with respect to a segment or domain, since the data/documents 260 of a domain usually follow a pattern, which helps in building the rule engine 240. The rule engine 240 can be based on the content of the document, properties of the document, the data source, the document presentation, or key elements of the document 260. The set of domain-specific documents can be recognized by the set of keywords that prominently drive the domain 255. A subset of such prominent keywords can further segregate the appropriate data/documents 260 from the irrelevant data/documents of the selected domain/segment 255. Also, the presence of certain keywords can be forbidden, which can enhance the extraction process. Note that the application described herein does not depend only on the keywords that drive the domain/segment 255. It also considers other elements in the documents 260 for segregation of the entities. File types and file attributes (file format and file size) provided by the operating system can be one element that provides reasonable segregation of the relevant data set from the non-relevant data sets.
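By way of non-limiting illustration, a relevance rule that combines prominent keywords, forbidden keywords, and file attributes can be sketched as follows. The sketch reads "forbidden" as meaning that a document containing such a keyword is excluded, which is one interpretation of the text above; the names `DomainRule` and `is_relevant`, the `min_hits` threshold, and the example values are hypothetical.

```python
import re
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class DomainRule:
    """Hypothetical rule describing the documents of one domain/segment."""
    prominent: frozenset   # keywords that prominently drive the domain
    forbidden: frozenset   # keywords whose presence excludes a document
    extensions: frozenset  # accepted file types, e.g. {".pdf", ".doc"}
    max_bytes: int         # upper bound on file size
    min_hits: int = 2      # assumed minimum number of prominent keywords


def is_relevant(filename, size_bytes, text, rule):
    """Decide relevance from file attributes first, then from keywords."""
    if PurePath(filename).suffix.lower() not in rule.extensions:
        return False
    if size_bytes > rule.max_bytes:
        return False
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & rule.forbidden:
        return False
    return len(words & rule.prominent) >= rule.min_hits
```

For a resume domain, for example, a rule might list "education", "experience" and "skills" as prominent keywords and "invoice" as a forbidden keyword. Checking the file attributes before the text lets clearly unsuitable files be rejected without reading their content.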
[0028] The common document format 225 of the data/documents encapsulates the information, including the visual aspects of the data/documents 260, in a well-defined format. Such an aspect facilitates effective information aggregation without any actual understanding of the content of the document 260. The visual segregation based entity extraction system 200 described herein does not depend solely on language processing techniques and relies more heavily on the visual representation of the content. Words that are represented in a format different from that of the surrounding words are given increased importance, and such words play a vital role in the entity extraction. Also, the visual segregation entity extraction 140 does not depend on the documents sharing similar template formats. It differentiates itself from the machine learning approach in that there is no need for the system to learn from a huge data corpus.
[0029] As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular compositions. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for the claims and/or as a representative basis for teaching one skilled in the art to variously employ the present invention.
[0030] The entity extraction system 100 disclosed herein can be adapted to a wide range of applications such as, for example, resume management and other database management applications. Apart from document segmentation, the entity extraction platform 200 can also be employed to identify the components of the data/document 260. For example, a table in the data/document can be a separate component, a list present in the document can be considered another component, and the text within the document can be a group of one or more components. The entity extraction platform 200 can effectively identify the above-mentioned components associated with the data/document. The entity extraction platform 200 described herein can be employed to identify the components of any document 260 of any language through their presentation pattern.
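By way of non-limiting illustration, the identification of table, list, and text components from presentation alone can be sketched as follows. The sketch assumes the common format is HTML and that `table`, `ul`, and `ol` tags delimit the components; the names `ComponentFinder` and `find_components` are hypothetical. Because only markup is inspected, the sketch does not depend on the language of the text.

```python
from html.parser import HTMLParser

# Assumption: these presentation tags delimit table and list components.
COMPONENT_TAGS = {"table": "table", "ul": "list", "ol": "list"}


class ComponentFinder(HTMLParser):
    """Groups document text into table, list and plain-text components."""

    def __init__(self):
        super().__init__()
        self.components = []  # (kind, text) pairs in document order
        self._stack = []      # open table/list tags; the outermost one decides the kind
        self._buffer = []

    def _flush(self, kind):
        text = " ".join(self._buffer)
        self._buffer = []
        if text:
            self.components.append((kind, text))

    def handle_starttag(self, tag, attrs):
        if tag in COMPONENT_TAGS:
            if not self._stack:
                self._flush("text")  # end the running text component
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and tag == self._stack[-1]:
            self._stack.pop()
            if not self._stack:
                self._flush(COMPONENT_TAGS[tag])

    def handle_data(self, data):
        data = data.strip()
        if data:
            self._buffer.append(data)

    def close(self):
        super().close()
        # Emit whatever remains; an unclosed table/list keeps its own kind.
        kind = COMPONENT_TAGS[self._stack[0]] if self._stack else "text"
        self._flush(kind)


def find_components(html):
    finder = ComponentFinder()
    finder.feed(html)
    finder.close()
    return finder.components
```

A list nested inside a table is reported as part of the enclosing table in this sketch; a different grouping policy would be equally consistent with the description above.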
[0031] FIG. 3 illustrates a high level flow chart of operation illustrating logical operation steps of a method 300 for identifying key entities from unstructured/semi-structured text using the visual segregation application 140, in accordance with the disclosed embodiments. The documents 260 from content providers can be converted into the common document format (e.g., HTML) 225 in order to preserve the look and feel of the data collected over the network, as illustrated at block 310. The data/documents 260 can then be divided into one or more blocks using the sectionization unit 230 in order to sequentially identify the sections within the document, as depicted at block 320. The sections of the document 260 are further identified and classified based on the representation type and style of the document and stored in the database 135, as illustrated at block 330. The rule based engine 240 can be employed to extract the entities from the database 135 based on the small domain-specific dictionary 245 and regular expressions/text patterns 250 in order to identify the key entities from the unstructured/semi-structured text, as depicted at block 340.
[0032] It will be appreciated that variations of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
CLAIMS
1. A method for identifying at least one key entity from an unstructured/semi structured text, said method comprising:
converting at least one document type of a document into a common document type utilizing an entity extraction platform in order to thereby preserve look and feel of said document;
creating at least one block with respect to said document collected over said network utilizing a sectionization process in order to sequentially identify and classify at least one section within said document wherein said at least one section is stored into a database; and
extracting at least one entity from said database utilizing a rule based engine in order to thereby identify at least one key entity from said unstructured/semi-structured text wherein said at least one entity is extracted from said database based on a small sized domain specific dictionary and a plurality of regular patterns.
2. The method of claim 1 further comprising identifying and classifying said at least one section within said document based on a representation type and a style of said document via a visual segregation application.
3. The method of claim 1 further comprising analyzing at least one visual aspect to said data representation type within said section using said visual segregation application in order to provide a higher weight to an appropriate set of words whereas, the accuracy of said at least one entity extracted can be improved by using a plurality of patterns of text in said document.
4. The method of claim 1 further comprising designing said rule based engine based on said plurality of patterns of text in which said documents are extracted.
5. The method of claim 4 further comprising designing said rule based engine based on at least one of the following aspects of said document:
content of said document extracted; properties of said document; data source; and document presentation/key elements.
6. The method of claim 1 further comprising recognizing at least one set of domain specific documents utilizing at least one set of prominent keywords which prominently drive a domain wherein a subset of said prominent keywords can be utilized to further segregate said document from a plurality of irrelevant documents.
7. The method of claim 1 further comprising segregating said document utilizing at least one of the following information with respect to said document:
said at least one prominent keyword;
plurality of elements present in said document;
file type and file attributes of an operating system; and
said at least one entity and attributes of said common document format.
8. The method of claim 1 wherein said common document type can be an html format.
9. The method of claim 1 further comprising identifying a plurality of components with respect to said document utilizing a presentation of said document.
10. The method of claim 1 further comprising encapsulating information about visual aspects of said document utilizing said common document type in order to thereby provide effective information aggregation without any actual understanding of said content of said document.
| # | Name | Date |
|---|---|---|
| 1 | 828-CHE-2011 CORRESPONDENCE OTHERS 17-03-2011.pdf | 2011-03-17 |
| 2 | 828-CHE-2011 POWER OF ATTORNEY 17-03-2011.pdf | 2011-03-17 |
| 3 | 828-CHE-2011 FORM-5 17-03-2011.pdf | 2011-03-17 |
| 4 | 828-CHE-2011 FORM-2 17-03-2011.pdf | 2011-03-17 |
| 5 | 828-CHE-2011 FORM-1 17-03-2011.pdf | 2011-03-17 |
| 6 | 828-CHE-2011 DRAWINGS 17-03-2011.pdf | 2011-03-17 |
| 7 | 828-CHE-2011 DESCRIPTION(PROVISIONAL) 17-03-2011.pdf | 2011-03-17 |
| 8 | 828-CHE-2011 FORM-1 16-03-2012.pdf | 2012-03-16 |
| 9 | 828-CHE-2011 DRAWINGS 16-03-2012.pdf | 2012-03-16 |
| 10 | 828-CHE-2011 DESCRIPTION(COMPLETE) 16-03-2012.pdf | 2012-03-16 |
| 11 | 828-CHE-2011 CORRESPONDENCE OTHERS 16-03-2012.pdf | 2012-03-16 |
| 12 | 828-CHE-2011 CLAIMS 16-03-2012.pdf | 2012-03-16 |
| 13 | 828-CHE-2011 ABSTRACT 16-03-2012.pdf | 2012-03-16 |
| 14 | 828-CHE-2011 FORM-2 16-03-2012..pdf | 2012-03-16 |
| 15 | 828-CHE-2011-FER.pdf | 2019-11-20 |
| 16 | 828-CHE-2011-FORM 3 [20-05-2020(online)].pdf | 2020-05-20 |
| 17 | 828-CHE-2011-FER_SER_REPLY [20-05-2020(online)].pdf | 2020-05-20 |
| 18 | 828-CHE-2011-CORRESPONDENCE [20-05-2020(online)].pdf | 2020-05-20 |
| 19 | 828-CHE-2011-US(14)-HearingNotice-(HearingDate-10-01-2023).pdf | 2022-12-09 |
| 20 | 828-CHE-2011-Correspondence to notify the Controller [10-01-2023(online)].pdf | 2023-01-10 |
| 1 | SearchStrategyMatrix_11-11-2019.pdf | |