Context Aware Ontology Based Information Extraction
Abstract:
Systems and methods for context aware ontology based information extraction are described. In one embodiment, the method comprises pre-processing an unstructured text document to obtain an induced tree, wherein the induced tree represents words and grammatical relations between the words in the unstructured text document as induced tree nodes. Further, the method comprises creating an object graph based on the induced tree, wherein the object graph comprises a plurality of object graph nodes including entity nodes, property nodes, and relation nodes. Furthermore, the method comprises identifying an ontological type of each of the plurality of the object graph nodes based at least on the entropy scores of the entity nodes, and generating structured information from the unstructured text document upon identifying the ontological type.
Nirmal Building 9th Floor Nariman Point Mumbai Maharashtra 400021
Inventors
1. SHAH Sapankumar Hiteshchandra
Tata Research Development and Design Centre Tata Consultancy Services 54 B Hadapsar Industrial Estate Pune 411 013
2. REDDY Sreedhar Sanareddy
Tata Research Development and Design Centre Tata Consultancy Services 54 B Hadapsar Industrial Estate Pune 411 013
Specification
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION (See section 10, rule 13)
1. Title of the invention: CONTEXT AWARE ONTOLOGY BASED INFORMATION
EXTRACTION
2. Applicant(s)
NAME: TATA CONSULTANCY SERVICES LIMITED
NATIONALITY: Indian
ADDRESS: Nirmal Building, 9th Floor, Nariman Point, Mumbai, Maharashtra 400021, India
3. Preamble to the description
COMPLETE SPECIFICATION
The following specification particularly describes the invention and the manner in which it
is to be performed.
TECHNICAL FIELD
[0001] The present subject matter relates, in general, to information extraction and, in
particular, to a system and a method for context aware ontology based information extraction.
BACKGROUND
[0002] In today's world, an enormous amount of information pertaining to different domains
of interest is available on the World Wide Web in a scattered and unstructured manner. Extraction and management of information has always been an active field of research. Information extraction (IE) is the task of extracting useful pieces of information from unstructured or semi-structured documents, such as research papers, blogs, published articles, e-books, and the like.
[0003] Various systems for information extraction have been developed in the past.
Recently, ontology based information extraction (OBIE) has emerged as a sub-field of IE where ontologies are used in the IE process. An ontology is defined as a formal and explicit specification of a shared conceptualization. Ontologies play a central role in OBIE by providing a formal means for specifying IE targets and a structure for storing extracted information. An ontology represents a domain in a hierarchical manner and models the domain terminology in terms of concepts, properties, and relations, which can be used to specify IE targets. An OBIE system takes a domain ontology as input and uses various IE techniques to discover instances of domain specific concepts and their property values.
SUMMARY
[0004] This summary is provided to introduce concepts related to context aware ontology
based information extraction. These concepts are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[0005] In one embodiment, the method for context aware ontology based information
extraction comprises pre-processing an unstructured text document to obtain an induced tree, wherein the induced tree represents words and grammatical relations between the words in the unstructured text document as induced tree nodes. Further, the method comprises creating an object graph based on the induced tree, wherein the object graph comprises a plurality of object graph nodes including entity nodes, property nodes, and relation nodes. Furthermore, the method comprises identifying an ontological type of each of the plurality of the object graph nodes based at least on the entropy scores of the entity nodes, and generating structured information from the unstructured text document upon identifying the ontological type.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The detailed description is described with reference to the accompanying
figure(s). In the figure(s), the left-most digit(s) of a reference number identifies the figure in
which the reference number first appears. The same numbers are used throughout the figure(s) to
reference like features and components. Some embodiments of systems and/or methods in
accordance with embodiments of the present subject matter are now described, by way of
example only, and with reference to the accompanying figure(s), in which:
[0007] Figure 1 illustrates a network environment implementing an information
extraction system, according to an embodiment of the present subject matter.
[0008] Figure 2 illustrates components of the information extraction system, according to
an embodiment of the present subject matter.
[0009] Figure 3 illustrates a method for context aware ontology based information
extraction according to an embodiment of the present subject matter.
DETAILED DESCRIPTION
[0010] Ontology is a hierarchical arrangement of a domain that represents a domain
terminology in terms of concepts (classes), properties (data type properties), and relations (object type properties). For example, an ontology of geopolitical entities domain may include concepts, such as country, organization, etc. Each country may have various properties or attributes, such as population, currency, area, and the like. Further, the concepts in the ontology may have relations, for example, India borders with Pakistan. Here, borders with is the relation between the concepts India and Pakistan.
[0011] Ontology based information extraction (OBIE) involves extracting information
pertaining to a particular domain from unstructured documents, identifying entities and their properties in the documents, and relating such entities to concepts in the ontology. The unstructured documents referred to herein may be research papers, blogs, published articles, e-books, and the like. Typically, OBIE systems are broadly classified as ontology learning systems and ontology population systems. The task of an ontology learning OBIE system is to construct domain specific concepts and properties from unstructured texts, whereas an ontology population OBIE system extracts instances of domain specific concepts and their property values for a given domain ontology.
[0012] Conventionally, various approaches to OBIE are available. Such approaches can
broadly be classified into machine learning based approaches and rule based approaches. Conventional machine learning based approaches involve assessment of manually tagged data to learn different models for information extraction. For example, Pandora, a popular internet radio service, employs musicologists to annotate songs with a fixed vocabulary of about five hundred tags. Pandora then creates a personalized music playlist by finding songs that share a large number of tags with a user specified seed song. After about 10 years of effort by about 50 full time musicologists, less than one million songs have been manually annotated, representing less than 5% of the current catalog of iTunes, a musical record store. Manual tagging of data is thus a tedious and time consuming job. Further, manual tagging is largely dependent on experts for tagging data pertaining to a domain. Moreover, manual tagging lacks domain compatibility, as the manual tagging for one domain may not be applicable to other domains. For each specific domain, the tagging may vary.
[0013] Conventional rule based approaches to OBIE employ manually coded rules to
extract entities and relations of interest, pertaining to a given domain, from a given unstructured document. However, most rule based OBIE systems use various domain-specific rules for extracting information from the unstructured documents. Such rules are designed and built to operate on specific domains, and are typically of no value if used for other domains. Moreover, significant time is consumed in coming up with the rules.
[0014] In accordance with the present subject matter, a method and a system for context
aware ontology based information extraction (CAOBIE) are described. In an embodiment, a
CAOBIE system is configured to extract entities, their properties and relations from a given unstructured text document, and relate the same to concepts in the domain ontology. For this purpose, the CAOBIE system includes a pattern matcher. The pattern matcher utilizes a plurality of patterns that are written using a pattern language rich in linguistic features for extracting entities, properties and relations from the unstructured text document. In one implementation, the patterns referred to herein are domain-independent and can therefore be utilized for extracting entities, properties and relations corresponding to any domain.
[0015] In one implementation, the pattern matcher is configured to initially extract the
properties and relations from the unstructured text document. Further, the pattern matcher is configured to take cues from the extracted properties and relations upon matching the properties and relations with the patterns, to extract the entities from the unstructured text document.
[0016] Once the entities, properties, and relations are identified, an object graph is
created. The object graph represents the identified entities, properties, and relations in the form of entity nodes, property nodes, and relation nodes respectively, collectively called object graph nodes. Subsequent to the creation of the object graph, the CAOBIE system utilizes a global context aware type identification algorithm for identifying an ontological type of the entity nodes. In one implementation, the global context aware type identification algorithm uses global level information, such as information included in related entity nodes in addition to local contextual information provided by property and relation nodes for identifying the ontological type of these nodes.
[0017] Use of the global level information helps in making a precise decision for
identifying the ontological type of the entity nodes. Further, the concept of entropy is used to determine the uncertainty associated with the ontological type of the entity nodes. The information related to the ontological type is then propagated through the object graph from the entity nodes having low entropy based score to the entity nodes having high entropy based score in an iterative manner. In one implementation, the CAOBIE system determines the ontological type of the property nodes and relation nodes based on computing similarity scores for the property nodes and the relation nodes. The similarity scores are computed, for example, by comparing the property nodes and the relation nodes with corresponding properties and relations in a domain ontology.
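The ordering of entity nodes by uncertainty can be sketched as follows. This is a minimal illustration that assumes Shannon entropy over a per-node probability distribution of ontological types; the specification refers only to "the concept of entropy", and it does not give the propagation rule itself, so only the low-to-high entropy ordering is shown. The node names and probabilities are hypothetical.

```python
import math

def entropy(type_probs):
    """Shannon entropy (in bits) of a distribution over ontological types."""
    return -sum(p * math.log2(p) for p in type_probs.values() if p > 0)

# Hypothetical type distributions for two entity nodes.
nodes = {
    "India": {"Country": 0.9, "Organization": 0.1},
    "Nano": {"Country": 0.5, "Organization": 0.5},
}

# Visit the most certain (lowest-entropy) nodes first, so that their type
# information could later be propagated towards less certain nodes.
order = sorted(nodes, key=lambda name: entropy(nodes[name]))
```

A node whose distribution is concentrated on one type has entropy close to zero and is therefore visited early in each iteration.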
[0018] Once the ontological type of the object graph nodes including the entity nodes,
relation nodes and property nodes is determined, the object graph nodes along with their ontological type information are serialized to RDF notation. In one implementation, the CAOBIE system is configured to process the unstructured text document, to create the object graph, and subsequently to generate the structured information by converting the object graph to RDF notation. In one implementation, the structured information thus extracted can be stored in a data store that can be queried for retrieving domain specific information stored therein.
[0019] The systems and the methods in accordance with the present subject matter
provide efficient ontology based information extraction. The CAOBIE system implements generically written patterns which are applicable across different domains, thereby eliminating the frequent need for a domain expert to write/re-write patterns for different domains. Further, the context aware approach using global level information, presented in the global context aware type identification algorithm, helps in precisely extracting the information from a given unstructured text document.
[0020] The following disclosure describes the system and the method for context aware
ontology based information extraction. While aspects of the described system and method can be implemented in any number of different computing systems, environments, and/or configurations, embodiments for the context aware ontology based information extraction are described in the context of the following exemplary system(s) and method(s).
[0021] Figure 1 illustrates a network environment 100 implementing an information
extraction system 102, in accordance with an embodiment of the present subject matter.
[0022] In one implementation, the network environment 100 can be a public network
environment, including thousands of personal computers, laptops, various servers, such as blade servers, and other computing devices. In another implementation, the network environment 100 can be a private network environment with a limited number of computing devices, such as personal computers, servers, laptops, and/or communication devices, such as mobile phones and smart phones.
[0023] The information extraction system 102 is communicatively connected to a
plurality of user devices 106-1, 106-2, 106-3, ..., and 106-N, collectively referred to as user devices 106 and individually referred to as a user device 106, through a network 108. In one implementation, a plurality of users may use the user devices 106 to communicate with the information extraction system 102. In said implementation, the information extraction system 102 is further connected to a data store 104 through the network 108. Though the data store 104 is shown external to the information extraction system 102, it is well appreciated that the data store 104, in another implementation, can be integrated within the information extraction system 102.
[0024] The information extraction system 102 and the user devices 106 may be
implemented in a variety of computing devices, including servers, a desktop personal computer, a notebook or portable computer, a workstation, a mainframe computer, a laptop, and/or communication devices, such as mobile phones and smart phones. Further, in one implementation, the information extraction system 102 may be a distributed or centralized network system in which different computing devices may host one or more of the hardware or software components of the information extraction system 102.
[0025] The information extraction system 102 may be connected to the user devices 106
over the network 108 through one or more communication links. The communication links between the information extraction system 102 and the user devices 106 are enabled through a desired form of communication, for example, via dial-up modem connections, cable links, digital subscriber lines (DSL), wireless, or satellite links, or any other suitable form of communication.
[0026] The network 108 may be a wireless network, a wired network, or a combination
thereof. The network 108 can also be an individual network or a collection of many such individual networks, interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet. The network 108 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 108 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other. Further, the network 108 may include
network devices, such as network switches, hubs, routers, for providing a link between the information extraction system 102 and the user devices 106. The network devices within the network 108 may interact with the information extraction system 102, and the user devices 106 through the communication links.
[0027] According to an implementation of the present subject matter, the information
extraction system 102 may be configured for extracting information pertaining to a particular domain of interest. For this purpose, the information extraction system 102 may include a domain ontology corresponding to the domain of interest pre-stored in a repository associated with the information extraction system 102. The domain ontology includes concepts related to the domain, their properties, and relations between the concepts. Such a domain ontology acts as a knowledge base for a particular domain.
[0028] Although the information extraction according to the present description is
explained with reference to a single domain of interest, it will be apparent to a person skilled in the art that the information extraction can be performed for multiple domains of interest, and multiple domain ontologies can be provided for this purpose.
[0029] In one implementation, the information extraction system 102 may be configured
to populate and enrich the domain ontology using annotations. For example, the concepts of the domain ontology can be enriched with description annotations describing the meaning of the concepts. Enrichment of the concepts enables determining the initial probability of an entity having a particular concept type. For each concept in the domain ontology, similarity values/scores between the words in the context of a given entity and the words in the concept description are evaluated. The similarity values are then normalized to get initial probability values.
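The normalization step can be illustrated with a short sketch. It assumes that normalization means dividing each similarity score by the sum of all scores, which the specification does not state explicitly; the function name, the uniform fallback for all-zero scores, and the example scores are hypothetical.

```python
def initial_probabilities(similarity):
    """Normalize per-concept similarity scores so that they sum to 1."""
    total = sum(similarity.values())
    if total == 0:
        # No evidence for any concept: fall back to a uniform distribution.
        return {concept: 1.0 / len(similarity) for concept in similarity}
    return {concept: score / total for concept, score in similarity.items()}

# Hypothetical similarity scores between the context words of an entity
# and the description annotation of each concept.
scores = {"Country": 3.0, "Organization": 1.0}
probs = initial_probabilities(scores)
```

With these scores, the entity starts with a higher initial probability of being a Country than an Organization.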
[0030] Further, for each ontological concept in the domain ontology, relative
identification weights are assigned for its properties and relations. The identification weights indicate the relative importance of a property or relation in identifying the concept. For example, consider an Organization domain with two concepts, say, Employee and Department, and three properties, say, Employee.name, Department.name, and Employee.reports_to. Here, the occurrence of reports_to in text can provide cues that the type of the associated entity is Employee, unlike name. Therefore, reports_to is given more identification weight than name.
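One way to picture how identification weights could provide cues about an entity's type is sketched below. The weight values, the additive scoring, and the function name are assumptions made for illustration only; the specification does not say how the weights are combined.

```python
# Hypothetical identification weights: concept -> {property or relation: weight}.
WEIGHTS = {
    "Employee": {"name": 0.2, "reports_to": 0.8},
    "Department": {"name": 1.0},
}

def concept_scores(observed, weights=WEIGHTS):
    """Sum the identification weights of the observed properties per concept."""
    return {
        concept: sum(w for prop, w in props.items() if prop in observed)
        for concept, props in weights.items()
    }

# An entity that occurs in the text together with reports_to.
scores = concept_scores({"reports_to"})
best = max(scores, key=scores.get)
```

Here the occurrence of reports_to alone is enough to favour Employee, since Department has no such property.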
[0031] In one implementation, the information extraction system 102 may be configured
to provide annotations by adding synonyms for the concepts, properties and relations present in the domain ontology. Further, the information extraction system 102 may be configured for specifying stricter constraints on the values of the properties related to the concepts. For this purpose, the domain ontology is enriched with value pattern annotations that specify a regular expression pattern that the values of the property should match. For example, in a camera review domain with a property Camera.megapixel, the regex for the value pattern may be specified as: \d+(\.\d+)?(mp|megapixel).
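A value pattern check of this kind might look as follows. The pattern used is one reading of the expression above, `\d+(\.\d+)?(mp|megapixel)`; matching the whole value with `fullmatch` is an assumption, and the function name is hypothetical.

```python
import re

# Assumed value pattern for the property Camera.megapixel.
MEGAPIXEL = re.compile(r"\d+(\.\d+)?(mp|megapixel)")

def matches_value_pattern(value):
    """Return True if the whole candidate value satisfies the value pattern."""
    return MEGAPIXEL.fullmatch(value) is not None
```

A candidate value such as "12.1megapixel" is accepted, while a bare number such as "12" is rejected because the unit is missing.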
[0032] In one implementation, the information extraction system 102 is configured to
extract information from an unstructured text document based on the domain ontology. The unstructured text document may be a research paper, a blog post, a news article, and the like, pertaining to the domain. To extract the information, the information extraction system 102 is configured to perform a series of pre-processing steps over the unstructured text document. During the pre-processing of the unstructured text document, linguistic features of the unstructured text document are extracted and a dependency tree is generated. In one implementation, the information extraction system 102 is configured to extract the linguistic features from the unstructured text document based on a conventionally known natural language processing technique. A dependency tree provides a representation of the grammatical relations between the words in a sentence. For example, in a dependency tree, the nodes represent the words and the edges represent the grammatical relations.
[0033] In one implementation, the information extraction system 102 is configured to
generate an induced tree based on the dependency tree. In said implementation, the induced tree represents the words, as well as the grammatical relations between the words, in the form of nodes, hereinafter referred to as induced tree nodes. Further, a set of tree transformations is applied to the induced tree using a conventionally available tree transformation language to handle conjunctions in the induced tree.
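The step from a dependency tree to an induced tree can be sketched as below. This is a simplified illustration: the `Node` class, the `induce` function, and the dictionary encoding of the dependency edges are hypothetical, the edges are written by hand rather than produced by a parser, and words are assumed to be unique within the sentence.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str   # a word or the name of a grammatical relation
    kind: str    # "word" or "relation"
    children: list = field(default_factory=list)

def induce(word, edges):
    """Build the induced tree rooted at `word`.

    `edges` maps a head word to a list of (relation, dependent) pairs,
    i.e. a dependency tree with labelled edges. Each labelled edge becomes
    a relation node placed between the head word and the dependent word.
    """
    node = Node(word, "word")
    for relation, dependent in edges.get(word, []):
        rel_node = Node(relation, "relation")
        rel_node.children.append(induce(dependent, edges))
        node.children.append(rel_node)
    return node

# Hand-written, hypothetical dependency edges for "India borders Pakistan".
edges = {"borders": [("nsubj", "India"), ("dobj", "Pakistan")]}
root = induce("borders", edges)
```

Because relations are ordinary nodes in the result, a path such as word, relation, word can be matched with the same machinery that matches words.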
[0034] In one implementation, the pattern matching module 110 is configured to process
the induced tree and extract information pertaining to the domain from the induced tree, based on the domain ontology. The pattern matching module 110 applies a plurality of predefined patterns on the induced tree to generate a graph structure, hereinafter referred to as object graph. In one
implementation, the object graph includes a plurality of object graph nodes representing different entities, properties and relations. For example, the entities are represented as entity nodes, properties are represented as property nodes, and relationships between entity nodes are represented as relation nodes in the object graph. In an example, the entity node represents an instance of a domain entity found in the unstructured text document, the property node links an entity node with its property value, and the relation node links two entity nodes that represent the domain and range of some ontological object property. In one implementation, the entity nodes are determined based on identification weights assigned to the properties and relations in the domain ontology.
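The three kinds of object graph nodes can be pictured with a small data structure sketch. All class and field names below are hypothetical and chosen for illustration; the specification does not prescribe a representation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EntityNode:
    text: str
    ontological_type: Optional[str] = None  # filled in by type identification

@dataclass
class PropertyNode:
    name: str
    source: EntityNode  # the entity that has the property
    value: str          # the property value found in the text

@dataclass
class RelationNode:
    name: str
    source: EntityNode  # domain side of the relation
    target: EntityNode  # range side of the relation

@dataclass
class ObjectGraph:
    entities: list = field(default_factory=list)
    properties: list = field(default_factory=list)
    relations: list = field(default_factory=list)

india = EntityNode("India")
pakistan = EntityNode("Pakistan")
graph = ObjectGraph(
    entities=[india, pakistan],
    properties=[PropertyNode("coastline", india, "7515 km")],
    relations=[RelationNode("borders with", india, pakistan)],
)
```

Entity nodes start without an ontological type, which reflects that type identification happens after the graph has been created.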
[0035] In one implementation, the type identification module 112 is configured to
identify an ontological type of the object graph nodes. For example, the ontological types for the three types of object graph nodes are: concepts for entity nodes; data properties for property nodes; and object properties for relation nodes. In one implementation, the type identification module 112 may use the standard Hearst patterns, a kind of concept identification pattern, for identifying the ontological type of the entity nodes in the object graph. In another implementation, the type identification module 112 may be configured to determine the ontological type of the entity nodes based on pre-assigned identification weights assigned to properties and relations in the domain ontology. Once the ontological type of the entity nodes is determined, the type identification module 112 can be configured to determine the uncertainty associated with the identified ontological type of the entity nodes, in one implementation. The type identification module 112, for example, utilizes the conventionally known concept of entropy to determine the uncertainty associated with each of the entity nodes in the object graph, and assigns a correct or certain ontological type to such entity nodes. Subsequent to the identification of the certain ontological type, the type identification module 112 may be configured to propagate the information related to the certain ontological type of the entity node through the remaining entity nodes. Once the certain ontological type is identified for the entity nodes, the entity nodes are related or associated with the corresponding concepts in the domain ontology.
[0036] In one implementation, the information extraction system 102 may serialize the
object graph to RDF notation and maintain the RDF notation in the data store 104, which can be queried to retrieve the information pertaining to the domain stored therein.
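Serialization to RDF can be illustrated with a minimal sketch that writes N-Triples, one of the standard RDF syntaxes. The specification does not say which RDF syntax is used; the `to_ntriples` function, the `http://example.org/` base IRI, and the local names are hypothetical, and local names are assumed to contain no spaces.

```python
def to_ntriples(triples, base="http://example.org/"):
    """Serialize (subject, predicate, object, is_iri) tuples as N-Triples.

    Subjects and predicates become IRIs under `base`. The object becomes an
    IRI when `is_iri` is true and a plain string literal otherwise.
    """
    lines = []
    for subj, pred, obj, is_iri in triples:
        s = f"<{base}{subj}>"
        p = f"<{base}{pred}>"
        if is_iri:
            o = f"<{base}{obj}>"
        else:
            # Escape backslashes and double quotes inside the literal.
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            o = f'"{escaped}"'
        lines.append(f"{s} {p} {o} .")
    return "\n".join(lines)

triples = [
    ("India", "bordersWith", "Pakistan", True),  # from a relation node
    ("India", "coastline", "7515 km", False),    # from a property node
]
output = to_ntriples(triples)
```

Each relation node yields a triple whose object is another entity, while each property node yields a triple whose object is a literal value.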
[0037] Figure 2 illustrates components of an information extraction system 102
according to an embodiment of the present subject matter.
[0038] In one implementation, the information extraction system 102 includes one or
more processor(s) 202, I/O interfaces 204, and a memory 206 coupled to the processor 202. The processor 202 can be a single processing unit or a number of units, all of which could include multiple computing units. The processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 202 is configured to fetch and execute computer-readable instructions and data stored in the memory 206.
[0039] The I/O interfaces 204 may include a variety of software and hardware interfaces,
for example, interfaces for peripheral device(s), such as a keyboard, a mouse, a display unit, an external memory, and a printer. Further, the I/O interfaces 204 may enable the information extraction system 102 to communicate with other devices, such as web servers and external databases. The I/O interfaces 204 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For this purpose, the I/O interfaces 204 may include one or more ports for connecting a number of computing systems with one another or to a network.
[0040] The memory 206 may include any non-transitory computer-readable medium
known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In one implementation, the information extraction system 102 also includes module(s) 208 and data 210.
[0041] The module(s) 208, amongst other things, include routines, programs, objects,
components, data structures, etc., which perform particular tasks or implement data types. The module(s) 208 may also be implemented as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions.
[0042] Further, the module(s) 208 can be implemented in hardware, instructions executed
by a processing unit, or by a combination thereof. The processing unit can comprise a computer, a processor, such as the processor 202, a state machine, a logic array or any other suitable devices capable of processing instructions. The processing unit can be a general-purpose processor which executes instructions to cause the general-purpose processor to perform the required tasks or, the processing unit can be dedicated to perform the required functions.
[0043] In another aspect of the present subject matter, the module(s) 208 may be
machine-readable instructions (software) which, when executed by a processor/processing unit, perform any of the described functionalities. The machine-readable instructions may be stored on an electronic memory device, hard disk, optical disk or other machine-readable storage medium or non-transitory medium. In one implementation, the machine-readable instructions can also be downloaded to the storage medium via a network connection.
[0044] In one implementation, the module(s) 208 further include a pre-processing
module 212, a pattern matching module 110, a type identification module 112, a structuring module 214, and other module(s) 216. The other modules 216 may include programs or coded instructions that supplement applications and functions of the information extraction system 102.
[0045] The data 210 serves, amongst other things, as a repository for storing data
processed, received, and generated by one or more of the modules 208. The data 210 includes pre-processing data 218, pattern matching data 220, type identification data 222, structured data 224, and other data 226. The other data 226 includes data generated as a result of the execution of one or more modules in the modules 208.
[0046] In one implementation, the information extraction system 102 may be configured
to extract information, pertaining to a particular domain of interest, from an unstructured text document. For extracting the information, the pre-processing module 212 may be configured to process the enrichments such as annotations and descriptions captured in the domain ontology as described previously. Enrichment of the domain ontology helps in identification and classification of entities and their property values present in the unstructured text document. The pre-processing module 212 performs a series of pre-processing steps, such as identifying linguistic features from the unstructured text document, generating an induced tree, and tree transformations. In one implementation, the pre-processing module 212 is configured to identify
linguistic features from the unstructured text documents using a natural language processing technique. Further, for generating the induced tree, the pre-processing module 212 utilizes conventionally known dependencies, for example, Stanford dependencies (SD). The SD includes a total of 53 grammatical relations. The grammatical relations are used to locate the entities once the property and relation occurrences are found. The SD provides a representation of the grammatical relations between the words in a sentence of the unstructured text document. The words of the sentence along with the grammatical relations form a dependency tree, where nodes of the dependency tree represent the words and edges represent the grammatical relations.
[0047] Subsequently, the pre-processing module 212 is configured to generate the
induced tree from the dependency tree. The induced tree thus generated represents words and the grammatical relations between the words in the sentence as induced tree nodes. In one implementation, the pre-processing module 212 applies a set of tree transformations to the induced tree to refine the induced tree. In an example, the pre-processing module 212 may store the induced tree in the pre-processing data 218. In an example, the pre-processing module 212 applies a conventionally known tree transformation language, such as Stanford TSurgeon, for executing these tree transformation patterns. An exemplary illustration of the tree transformation patterns is provided in Table 1 below.
Table 1
Tree Transformation: ConjunctionAnd
TRegex Pattern (condition): /.*/=head < (cc=vCC < and=vAnd) < (conj=vConj < /.*/=brother)
TSurgeon Operations: move brother $- head; delete vConj
Remarks: All the conjuncts in an "and" conjunction become siblings; children of parent of head conjunct. (India borders with Pakistan and China)

Tree Transformation: CompoundNoun
TRegex Pattern (condition): /.*/=head < (nn=vNN < /.*/=compound)
TSurgeon Operations: accumulate compound head compound; excise vNN compound
Remarks: Words in a compound noun are considered as a single unit, e.g. India borders with Sri Lanka; Sri Lanka is stored as a single induced tree node.

Tree Transformation: ModifierList
TRegex Pattern (condition): /.*/=head < (/.*mod.*/=vMod < /.*/=modifier)
TSurgeon Operations: accumulate modifier head modifier; excise vMod modifier
Remarks: All modifiers are stored along with the induced tree node of the word that they modify.

Tree Transformation: CompoundNumber
TRegex Pattern (condition): /.*/=head < (number=vNumber < /.*/=compound)
TSurgeon Operations: prune vNumber
Remarks: All the words in a compound number are treated as a single node, e.g. I lost $ 3.2 billion. Here, $ 3.2 billion is treated as a single node of number type.
[0048] Once the induced tree is generated, the pattern matching module 110 is
configured to extract information, such as entities, properties and relations, from the induced tree based on matching the words and relations in the induced tree with a plurality of predefined patterns. In one implementation, the pattern matching module 110 is configured to initially extract properties and relations from the induced tree. Based on the properties and relations, the pattern matching module 110 extracts entities from the induced tree.
[0049] The patterns referred herein are written in a pattern language. A pattern consists
of a premise and a sequence of actions. The premise is a set of conditions that should hold true for the actions to be executed. In one implementation, the premise consists of tree paths, ontological constraints, and Boolean expressions. The action component in the pattern specifies a sequence of actions to be performed over variable bindings from the premise. Further, the basic constituent used in the action is assignment. The pattern language makes use of language constructs to explicitly refer to various ontological elements such as concepts, property, and relations. An exemplary subset of the grammar of the pattern language is provided below.
"patterns:- pattern*
pattern:- patternID "{" premise "}" "->" "{" actions "}"
patternID:- (DIGIT)+
premise:- (treePath ";")+ (ontologyConstraint ";")+ ("{" boolean_expression "}" ";")?
treePath:- element | element "--" treePath
ontologyConstraint:- ontologyElement = variable
actions:- ("{" action+ "}")+
action:- LHS = RHS ";"
LHS:- ontologyActionElement | variable
RHS:- variable | identifier
|action_function"
[0050] The patterns written using the pattern language help in the identification of
properties and relations, and thereby help in identifying the entities in the unstructured text document. In one implementation, the tree paths and the ontology constraints are defined in the pattern language in such a manner that the patterns are generic and are applicable across different domains. An example of such generic, domain independent patterns is provided in Table 2 below.
Table 2
India has a coastline of 7515 km. property extraction
1 { -- dobj -- -- prep -- of -- pobj -- ; property = ;
-- nsubj -- ; {isRoot() && isTypeMatching(, Number)};
} -> {source=; target=; property= }
Ratan Tata launched Tata Nano in 2010. Relation extraction
2 { -- nsubj -- ; -- dobj --