Abstract: The present subject matter discloses a system and a method for information extraction. The method may include analyzing an input document to identify text patterns in the input document. Each of the text patterns may be associated with at least one label to provide an analyzed document. The method further includes determining linguistic patterns in the analyzed document. Each of the linguistic patterns may be associated with at least one annotation tag to provide an annotated document. The at least one annotation tag may be stored as a string within the annotated document. The method may include generating at least one extraction pattern based on selection of the at least one annotation tag. The at least one annotation tag may be selected by a user. The method may also include executing the at least one extraction pattern on the annotated document for extracting targeted information from the annotated document.
FORM 2
THE PATENTS ACT, 1970 (39 of 1970) & THE PATENTS RULES, 2003
COMPLETE SPECIFICATION (See section 10, rule 13)
1. Title of the invention: SYSTEM AND METHOD FOR INFORMATION EXTRACTION
2. Applicant(s)
NAME: TATA CONSULTANCY SERVICES LIMITED
NATIONALITY: Indian
ADDRESS: Nirmal Building, 9th Floor, Nariman Point, Mumbai, Maharashtra 400021, India
3. Preamble to the description
COMPLETE SPECIFICATION
The following specification particularly describes the invention and the manner in which it is to be
performed.
TECHNICAL FIELD
[0001] The present subject matter relates, in general, to information extraction, and in
particular, to extraction of targeted information based on an extraction pattern.
BACKGROUND
[0002] Information extraction (IE) relates to automatic extraction of information in a
structured manner from unstructured and/or semi-structured input documents, such as articles, newspapers, and online transcripts of radio and television broadcasts. Usually, the extraction of information is based upon identification of instances of user-defined search strings, say belonging to a class of events or having a predefined relationship. Conventionally, various IE tools are employed for extracting information pertaining to different domains from different sources, such as web pages and databases.
[0003] Typically, natural language processing (NLP) techniques are implemented in
conventional IE tools for extracting information from the input documents. The NLP techniques employ a plurality of analyses for extracting information. Once extracted, relevant information may be stored in a database or repository in the form of a report or a spreadsheet. At a later point, the user can retrieve information of interest from the database.
SUMMARY
[0004] This summary is provided to introduce concepts related to information
extraction, which is further described below in the detailed description. This summary is neither intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[0005] In an embodiment, the present subject matter discloses a system and a method
for extracting targeted information from a plurality of input documents. The method may include analyzing the plurality of input documents to identify a plurality of text patterns in the plurality of input documents. Each of the plurality of text patterns may be associated with at least one label to provide a set of analyzed documents. The method may further include determining one or more linguistic patterns in the set of analyzed documents. Each of the one or more linguistic patterns may be associated with at least one annotation tag to provide a set of annotated documents. Further, the at least one annotation tag may be stored as a string within the set of annotated documents. In addition, the method may include generating at least one extraction pattern based on selection of the at least one annotation tag. The at least one annotation tag may be selected by a user. The method may also include executing the at least one extraction pattern on the set of annotated documents for extracting targeted information from the set of annotated documents.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The detailed description is described with reference to the accompanying
figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components.
[0007] Fig. 1 illustrates a network environment implementing an information
extraction system, in accordance with an embodiment of the present subject matter.
[0008] Fig. 2a illustrates a graphical user interface for generating an extraction pattern,
in accordance with an embodiment of the present subject matter.
[0009] Fig. 2b illustrates the graphical user interface for forming an extraction pattern,
in accordance with another embodiment of the present subject matter.
[0010] Fig. 3 illustrates a method of extracting targeted information, in accordance
with an embodiment of the present subject matter.
DETAILED DESCRIPTION
[0011] Information Extraction (IE) relates to analysis of unstructured information in
order to extract information, such as about pre-specified types of events, entities, and relationships. The unstructured information can be extracted from unstructured and/or semi-structured machine readable documents. Conventional IE tools analyze documents (that may be of different languages) and extract information about the pre-specified types of events, entities, and relationships. For example, the conventional IE tools may provide insights from unstructured information by extracting information from call-center notes, free-text survey-fields, and research papers. However, the users need to write down the extraction patterns using techniques, such as regular expression, for extracting information using the conventional IE tools. The regular expression can be interpreted as a technique in which text
is examined and parts of the text matching the extraction pattern are identified. The identified parts of the text are associated with annotations reflecting a pattern related with the text. The annotations may be in the form of a note, a comment, and a tag. Further, the information is extracted based on the extraction pattern provided by the users.
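The regular expression technique described above may be illustrated by the following minimal Python sketch, in which a text is examined and the parts matching a pattern are identified together with their positions. The sample sentence, the pattern, and the function name are illustrative assumptions and are not taken from any conventional IE tool.

```python
import re

# Hypothetical pattern: a dollar sign, an amount, and an optional scale word.
AMOUNT_RE = re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")

def find_amounts(text):
    """Return (start, end, matched_text) for every part of the text that matches."""
    return [(m.start(), m.end(), m.group()) for m in AMOUNT_RE.finditer(text)]

sample = "The deal was valued at $4.5 million, up from $3 billion."
for start, end, value in find_amounts(sample):
    print(start, end, value)
```

Each returned triple could serve as the basis for an annotation attached to the matched part of the text.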
[0012] In an example, the conventional IE tools may accept extraction patterns written
in regular expressions when simple information is to be extracted from the input documents. However, to extract information related to complex patterns, the conventional IE tools may require the extraction patterns to be specified in complicated languages, say codes. For example, Java Annotation Patterns Engine (JAPE) is a higher level language used by General Architecture for Text Engineering (GATE). The GATE serves as a framework that supports the conventional IE tools. Further, the JAPE language uses a grammar that includes a set of phases, each of which further includes a set of rules. The set of rules includes a description of an annotation pattern that needs to be identified in the input documents and a description of annotation manipulation statements that are applied upon identification of the annotation pattern.
[0013] To write an extraction pattern using the JAPE language, the user has to be
well-versed with the language, associated syntax, and commands. Accordingly, the user may need to learn the complicated languages before being able to write an extraction pattern and use the conventional IE tools for information extraction. Hence, information extraction using such complex platforms and languages may be difficult for novice users. Therefore, as mentioned in the above example, the grammar based language of the conventional IE tools can hinder the utilization of such tools by novice users, as the users would need specialized skills to operate the conventional IE tools.
[0014] In certain other conventional IE tools, for extraction of information based on
complex patterns, codes and rules may be hardcoded into the tools. In such a case, different codes have to be defined for every different pattern type. Consequently, configuration of the conventional IE tools may prove to be almost impractical, as the configuration would involve coding many individual analysis steps to accommodate the various types of patterns.
[0015] In some cases, the annotations may be in the form of extensible markup
language (XML) tags. In an example, the XML tags may store the annotations in a tree structure, which may be difficult to process while extracting the information associated with the annotations, and may also render the whole process time consuming. Also, the information extracted from the conventional IE tool may be stored in a database. To extract the information from the database, the user may need to learn complex scripts that may be executed in the database. In an example, the complex scripts used for extraction of the information may be proprietary to a system that may be hosting the conventional IE tool. Therefore, storing the information in the database may become an overhead for the system. This may result in poor performance of the conventional IE tool for extraction of information.
[0016] The present subject matter discloses a system and a method for extracting
targeted information from input documents. The system and method may facilitate extraction of information enriched with various linguistic patterns from the input documents. In an embodiment, the system and the method of the present subject matter may facilitate analysis of an input document to identify a plurality of text patterns. In an example, the plurality of text patterns identified in the input document upon analysis may include Parts-Of-Speech (POS), Named Entity Recognition (NER), and a thematic relation, say subject/object identification. Once the text patterns are identified, at least one label may be associated with each of the identified text patterns, such as the POS, the NER, and the thematic relation, in the input document.
[0017] The input document, having the text patterns identified and labeled, is referred
to as an analyzed document. In an example, a label for the POS, such as noun, verb, pronoun, preposition, adverb, and adjective, may be assigned to each word in a sentence of the input document. Further, based on the NER, named entities, such as names of people, locations, and organizations, may be identified in the input document and labeled with the appropriate NER labels.
[0018] In an implementation, the analyzed document may be further processed to
determine one or more linguistic patterns that may be associated with the text of the analyzed document. In an example, the one or more linguistic patterns may include a lexical pattern, a syntactic pattern, and a semantic pattern. Accordingly, the analyzed document may further go through a linguistic analysis to identify the linguistic patterns. To identify a lexical pattern, a
lexical analysis may be performed on the analyzed document. The lexical analysis may include converting sentences from the analyzed document into a sequence of tokens indicating the lexical pattern. Further, to determine the syntactic pattern the sequence of tokens may be analyzed for determining a grammatical structure of the tokens with respect to a formal grammar. Furthermore, the semantic pattern may be identified by relating syntactic structures, such as phrases and sentences, to their language independent meaning.
[0019] Based on the determination of the one or more linguistic patterns, the text of
the analyzed document may be associated with at least one annotation tag. The at least one annotation tag may indicate the lexical, the syntactic, the semantic pattern, or a combination thereof, in the analyzed document. Further, the at least one annotation tag may be stored along with the analyzed document in a string or linear format. The analyzed document having the annotation tags associated therewith may be referred to as an annotated document. According to an embodiment, the annotated document may be stored within a file system of the system. The string or linear format of the annotation tags may facilitate extracting the information from the file system in a convenient manner without requiring complex extraction patterns.
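The storage of annotation tags in a string or linear format may be sketched as follows. The layout used here, in which each token is written as word/POS/NER and tokens are separated by single spaces, is an assumption made only for illustration, as are the tag names and the function names; the present description does not prescribe a particular layout.

```python
# Assumed linear layout: word/POS/NER per token, tokens separated by one space.
# Words are assumed not to contain the "/" separator or whitespace.

def to_linear(tokens):
    """Serialize (word, pos, ner) triples into one flat annotated string."""
    return " ".join("/".join(triple) for triple in tokens)

def from_linear(annotated):
    """Recover the (word, pos, ner) triples from a flat annotated string."""
    return [tuple(item.split("/")) for item in annotated.split(" ")]

tokens = [("Acme", "NNP", "ORG"), ("acquired", "VBD", "O"), ("Beta", "NNP", "ORG")]
annotated = to_linear(tokens)
print(annotated)  # Acme/NNP/ORG acquired/VBD/O Beta/NNP/ORG
```

Because the tags sit next to the words in a single string, a flat layout of this kind can be searched with ordinary string tools and does not require a tree to be traversed.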
[0020] Further, the at least one annotation tag may facilitate a user to specify an
extraction pattern. In an implementation, the system may facilitate the user to define an extraction pattern in regular expressions, by using the annotation tags, for extracting information of the user’s interest, as exhibited by patterns such as lexical, syntactic, and semantic patterns, from the text of the annotated document. The user may submit the extraction pattern for extracting targeted information from the annotated document. As a result, extraction of information from the annotated document is less time consuming and may also use less computational resources. Accordingly, the annotated document may be enriched with information, such as lexical, syntactic, and semantic information, that may further be used for extraction of targeted information.
[0021] Further, the user may choose to extract information that may be associated with
the input document, such as the lexical, syntactic and/or semantic information in the input document. Accordingly, the system and method of the present subject matter may enable the user to combine lexical, syntactic, and semantic features obtained by the annotation tags to form a regular expression based extraction pattern for extracting targeted information.
[0022] In an example, the system may provide a graphical user interface (GUI) for
facilitating the user to express the extraction pattern through regular expressions. In an implementation, the extraction pattern may be defined by using a drag-drop technique or a gazetteer technique. In case of the drag-drop technique, the system may provide the user with a drop down list of various annotation tags present in the annotated document. The user may select and drag an annotation tag from the drop down list to form the extraction pattern. In an example, based on the selected annotation tags, the system may script the extraction pattern using regular expression. It will be understood that the extraction pattern is not written by the user; instead the user selects and fills the annotation tags, say in already provided blank spaces in the GUI, to complete the extraction pattern. This technique may facilitate the novice users to define the extraction pattern of their choice without learning to code in a complex language.
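The scripting of an extraction pattern from the user's tag selections, as described above for the drag-drop technique, may be sketched as follows. The sketch assumes a hypothetical word/POS/NER layout for the annotated text and represents each blank space of the GUI as a dictionary of selected tags; the function names, the tag names, and the layout are illustrative assumptions and do not describe the actual GUI.

```python
import re

ANY = r"[^\s/]+"  # any run of characters without whitespace or the "/" separator

def slot_to_regex(word=None, pos=None, ner=None):
    """Turn one filled-in slot into a regex fragment; unfilled fields match anything."""
    w = re.escape(word) if word else ANY
    p = re.escape(pos) if pos else ANY
    n = re.escape(ner) if ner else ANY
    return f"({w})/{p}/{n}"

def build_pattern(slots):
    """Join the slot fragments into one regular expression over whole tokens."""
    body = " ".join(slot_to_regex(**slot) for slot in slots)
    return re.compile(r"(?<!\S)" + body + r"(?!\S)")

# The user selects ORG, then a past-tense verb, then ORG again.
pattern = build_pattern([{"ner": "ORG"}, {"pos": "VBD"}, {"ner": "ORG"}])
match = pattern.search("Acme/NNP/ORG acquired/VBD/O Beta/NNP/ORG")
print(match.groups())  # ('Acme', 'acquired', 'Beta')
```

In this sketch the user only supplies the tag selections, and the regular expression itself is produced by `build_pattern`.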
[0023] Further, according to the gazetteer technique, the user may create a list of
keywords that may be of interest to the user. The keywords may be names of companies, and events, such as mergers and acquisitions. The user may further select the annotation tags in such a manner that the extraction pattern generated, based on regular expression, in accordance with the user’s selection may facilitate in extracting the information related to the keywords provided in the list. For example, the user may create a list of companies and may further select some of the annotation tags to form the extraction pattern directed to extract merger related information. The present subject matter may facilitate extraction of merger related information associated with the companies provided in the list of the user.
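The gazetteer technique may be sketched in a similar manner. In the following minimal example, a user-supplied list of company names and a list of event verbs are combined with selected annotation tags to form a regular expression. The word/POS/NER layout, the tag names, the single-token company names, and the function name are assumptions made only for illustration.

```python
import re

ANY = r"[^\s/]+"  # any run of characters without whitespace or the "/" separator

def gazetteer_pattern(companies, verbs):
    """Build a regex matching '<listed company> <event verb> <any organization>'."""
    company_alt = "|".join(re.escape(c) for c in companies)
    verb_alt = "|".join(re.escape(v) for v in verbs)
    return re.compile(
        rf"(?<!\S)({company_alt})/{ANY}/ORG ({verb_alt})/VBD/{ANY} ({ANY})/{ANY}/ORG(?!\S)"
    )

pattern = gazetteer_pattern(["Acme", "Globex"], ["acquired", "merged"])
text = ("Acme/NNP/ORG acquired/VBD/O Beta/NNP/ORG ./PUNCT/O "
        "Initech/NNP/ORG acquired/VBD/O Gamma/NNP/ORG")
print(pattern.findall(text))  # [('Acme', 'acquired', 'Beta')]
```

Only the acquisition involving a company from the list is returned; the acquisition by Initech is ignored because Initech is not in the list.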
[0024] The extraction pattern generated in accordance with the present subject matter
may enable a novice user to extract information of interest that may be exhibited by different types of patterns, such as linguistic, lexical, syntactical, semantic, or a combination thereof, without learning new languages for writing complex extraction codes. The present subject matter facilitates use of regular expressions for not only extracting information exhibited by lexical patterns but also by other complex patterns, such as syntactic and semantic. As described above, the enriched format of the annotated document may facilitate removal of hierarchical structure at the sentence level and expressing various annotation tags in a linear format. As a result, the present subject matter provides for a convenient and easy, yet effective information extraction.
[0025] These and other advantages of the present subject matter would be described in
greater detail in conjunction with the following figures. While aspects of the described systems and methods for information extraction can be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary system(s).
[0026] Fig. 1 illustrates a network environment 100 implementing an information
extraction (IE) system 102, in accordance with an embodiment of the present subject matter. In one implementation, the network environment 100 can be a company network, including thousands of office personal computers, laptops, various servers, such as blade servers, and other computing devices connected over a network. In another implementation, the network environment 100 can be a home network with a limited number of personal computers and laptops connected over the network.
[0027] The IE system 102 may be implemented in a variety of computing systems,
such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. Further, the IE system 102 may be connected to a plurality of user devices, such as 104-1, 104-2, 104-3,...,104-N, collectively referred to as the user devices 104 and individually referred to as a user device 104. Examples of the user devices 104 include, but are not limited to, a desktop computer, a portable computer, a mobile phone, a handheld device, and a workstation. The user devices 104 may be used by database analysts, programmers, developers, data architects, software architects, module leaders, project leaders, database administrators (DBAs), stakeholders, and the like to communicate with the IE system 102.
[0028] As shown in the figure, such user devices 104 are communicatively coupled to
the IE system 102 over a network 106 through one or more communication links for facilitating one or more end users to access and operate the IE system 102. In one implementation, the network 106 may be a wireless network, a wired network, or a combination thereof. The network 106 may also be an individual network or a collection of many such individual networks, interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet. The network 106 may be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area
network (WAN), the internet, and the like. The network 106 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other. Further, the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
[0029] In one embodiment, the IE system 102 includes a processor 108, an interface
110, and a memory 112 coupled to the processor 108. The processor 108 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 108 may be configured to fetch and execute computer-readable instructions stored in the memory 112.
[0030] The interface 110 may include a variety of software and hardware interfaces,
for example, a web interface, a graphical user interface, etc., allowing the IE system 102 to interact with the user devices 104. Further, the interface 110 may enable the IE system 102 to communicate with other computing devices, such as web servers and external data servers (not shown in figure). The interface 110 may facilitate multiple communications within a wide variety of networks, and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The interface 110 may include one or more ports for connecting a number of devices to each other or to another server.
[0031] The memory 112 can include any computer-readable medium known in the art
including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM). In one embodiment, the memory 112 includes module(s) 114 and data 116. The module(s) 114, amongst other things, include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types.
[0032] In one implementation, the module(s) 114 may include an analysis module
118, an annotation module 120, a pattern generation module 122, and other module(s) 124. The other module(s) 124 may include programs or coded instructions that supplement
applications and functions of the IE system 102. It will be appreciated that such modules may be represented as a single module or a combination of different modules. Additionally, the data 116 serves, amongst other things, as a repository for storing data fetched, processed, received and generated by one or more of the module(s) 114. In one implementation, the data 116 may include, for example, document data 126, analysis data 128, annotation data 130, construct(s) data 132, and other data 134. In another implementation, the data 116 may include a file system of the IE system 102. In one embodiment, the data 116 may be stored in the memory 112 in the form of data structures. Additionally, the aforementioned data can be organized using data models, such as relational or hierarchical data models.
[0033] As mentioned previously, the present subject matter discloses aspects related to
information extraction. The IE system 102 may be a rule-based IE system. The rules may describe textual patterns of interest, which may be used for extracting information of the user’s interest from the input document. In an implementation, the input document may be one of a structured document, a semi-structured document, and an unstructured document. The IE system 102 may enable a user to extract targeted information from the input document. The input document may be analyzed to identify a plurality of text patterns. Thereafter, the input document may be labeled upon identification of each of the plurality of text patterns. In an implementation, labels may be provided to text of the input document to obtain an analyzed document. Further, the analyzed document is parsed by the IE system 102 to determine one or more linguistic patterns. Once identified, each of the linguistic patterns in the analyzed document may be associated with an annotation tag. The annotation tags may be associated with the patterns of the text to obtain an annotated document. Moreover, the annotation tags may be saved within the annotated document to provide an enriched document. Thereafter, a user may specify an extraction pattern to extract targeted information from the annotated document. In an example, the extraction pattern specified by the user may be configured using a regular expression technique.
[0034] In an implementation, the analysis module 118 may be configured to analyze
the input document for identifying various text patterns. In an example, the input document may include, but is not limited to, user input text (search terms or prose), webpage Uniform Resource Locator (URL), textual documents, closed captioning text, and other structured or unstructured data sources. Further, in said example, the input document may be in a variety of
formats, such as Extensible Markup Language (XML), Rich Text Format (RTF), e-mails, Hypertext Markup Language (HTML), Standard Generalized Markup Language (SGML), Portable Document Format (PDF), Microsoft Word, Microsoft Excel, and plain text. The input document may be stored as the document data 126 in the IE system 102. In the present implementation, the input document may be provided by the user through the user device 104.
[0035] Further, the analysis module 118 may be configured to perform the
identification and analysis of the text patterns in the input document. In an implementation, the text pattern analysis may include Part-Of-Speech (POS) tagging, Named Entity Recognition (NER), thematic analysis, or a combination thereof. In said implementation, the analysis module 118 may be configured to perform POS tagging by identifying POS, such as a noun, a verb, an adjective, and a noun-phrase, into which a word or a group of words in the text of the input document may be classified. In said implementation, during POS tagging, the analysis module 118 may provide a POS tag to each word that may be provided in the input document. In an implementation, the POS tagging may be a rule-based POS tagging. The rule-based POS tagging may rely on a dictionary to provide POS tags for a word. The POS tags associated with each word may reduce the number of parses performed by the analysis module 118. Further, the POS tags associated with the input document after the POS tagging may be stored as analysis data 128 in the IE system 102.
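The rule-based POS tagging described above, which relies on a dictionary to provide a POS tag for each word, may be sketched as follows. The dictionary contents, the tag names, the fallback tag for unknown words, and the function name are hypothetical and serve only to illustrate the lookup.

```python
# Hypothetical dictionary mapping lower-cased words to POS tags.
POS_DICTIONARY = {
    "the": "DET", "company": "NOUN", "acquired": "VERB",
    "a": "DET", "small": "ADJ", "startup": "NOUN",
}

def pos_tag(words, default="NOUN"):
    """Attach a POS tag to each word; unknown words fall back to a default tag."""
    return [(word, POS_DICTIONARY.get(word.lower(), default)) for word in words]

tagged = pos_tag("The company acquired a small startup".split())
print(tagged)
```

A practical tagger would also need rules for words with more than one possible tag; the sketch covers only the dictionary lookup.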
[0036] Further, in an example, the analysis module 118 may be configured to perform
the NER by identifying proper names in the text contained in the input document. The proper names, once identified, may be classified in pre-defined categories of interest, such as person names, organization names (companies, government organizations, and committees), location names (cities and countries), and miscellaneous names (date, time, number, percentage, monetary expressions, and measurement expressions). The NER may be performed based on the POS tagging performed above. The analysis module 118 may also be configured to associate labels with each of the words, based on the NER performed for each word or group of words.
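One way in which the NER may build on the POS tagging is sketched below: consecutive words tagged as proper nouns are grouped into a candidate name, and the name is then classified into a pre-defined category. The category lists, the 'PROPN' tag, the 'UNKNOWN' fallback label, and the function name are assumptions made for illustration; an actual implementation, or a third party tool, may use different resources.

```python
# Hypothetical category lists; a real system would use far richer resources.
CATEGORIES = {
    "ORG": {"Acme Corp"},
    "LOC": {"Mumbai"},
}

def recognize_entities(tagged):
    """Group consecutive proper nouns (tag 'PROPN') and label each group."""
    entities, current = [], []
    for word, pos in tagged + [("", "END")]:  # the sentinel flushes the last group
        if pos == "PROPN":
            current.append(word)
            continue
        if current:
            name = " ".join(current)
            label = next((c for c, names in CATEGORIES.items() if name in names), "UNKNOWN")
            entities.append((name, label))
            current = []
    return entities

tagged_words = [("Acme", "PROPN"), ("Corp", "PROPN"), ("opened", "VERB"),
                ("in", "ADP"), ("Mumbai", "PROPN")]
print(recognize_entities(tagged_words))  # [('Acme Corp', 'ORG'), ('Mumbai', 'LOC')]
```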
[0037] Additionally, the analysis module 118 may be configured to perform the
thematic analysis on the input document. The thematic analysis may refer to identification of meaningful categories or themes in the text contained in the input document. For example, the
analysis module 118 may be configured to identify a subject-object relation in the text and provide a label of the relation to the text under consideration. Accordingly, based on the above described analyses, the analysis module 118 may provide the analyzed document by performing the identification and analysis of text patterns on the input document. Thereafter, the analyzed document may be annotated by the annotation module 120 of the IE system 102. In an embodiment, a third party tool may be employed by the IE system 102 for performing the text pattern analyses.
[0038] Further, according to an aspect of the present subject matter, the annotation
module 120 may be configured to annotate the analyzed document based upon identification of one or more linguistic patterns. In an implementation, the one or more linguistic patterns may include a lexical pattern, a syntactic pattern, a semantic pattern, or a combination of the three. The annotation module 120 may associate at least one annotation tag with each linguistic pattern that may be identified in the analyzed document. It will be understood that an annotation tag may be a comment, a note, an explanation, or other type of external remark that may be attached to the analyzed document, to each word or group of words of the analyzed document, or to a selected part of the analyzed document. The annotation module 120 may also be configured to identify the one or more linguistic patterns based on predefined rules. The pre-defined rules may facilitate the annotation module 120 to identify which word of the analyzed document needs to be associated with which annotation tag. In an example, the pre-defined rules may include a rule to identify meaning or significance of written words. In another example, the pre-defined rules may include a grammatical or a structural rule that may define how symbols in a language are to be combined to form words, phrases, expressions, and other cognizable constructs. It will be understood that the predefined rules may be defined by the user using the user device 104 or an administrator operating the IE system 102.
[0039] Accordingly, the annotation module 120 may be configured to identify the one
or more linguistic patterns in the analyzed document. In an example, the pre-defined rules for lexical analysis of the analyzed document may facilitate the annotation module 120 to identify characters in the text and group the characters as ‘tokens’. The ‘tokens’ may include integers, characters, operators, and the like. For example, the pre-defined rules may include a predefined rule specifying that when characters ‘a’, ‘n’, ‘d’ appear in a sequence and are adjacent
to one another, the characters are tokenized into “and”. In the example, the pre-defined rule may specify position of ‘a’ as a start location of the token and position of ‘d’ as an end location. The annotation module 120 may also include rules for parsing the tokens and providing meaning to the tokens and/or groups of tokens.
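The tokenization rule of the above example, in which adjacent characters are grouped into a token and the positions of the first and last characters are recorded, may be sketched as follows. The token classes (runs of letters, runs of digits, and single other symbols) and the function name are assumptions chosen for illustration.

```python
import re

# Assumed token classes: a run of letters, a run of digits, or one other symbol.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def tokenize_with_offsets(text):
    """Return (token, start, end) triples, where end is the index of the last character."""
    return [(m.group(), m.start(), m.end() - 1) for m in TOKEN_RE.finditer(text)]

print(tokenize_with_offsets("salt and pepper"))
# [('salt', 0, 3), ('and', 5, 7), ('pepper', 9, 14)]
```

For the token "and", the start location 5 is the position of 'a' and the end location 7 is the position of 'd'.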
[0040] The annotation module 120 may further perform annotation of the analyzed
document based on the syntactic patterns. In an implementation, the annotation module 120 may perform a syntactic analysis on the analyzed document. For example, the syntactic analysis may include determination of structure of text, i.e., to determine the way in which words are put together to form phrases, clauses, and sentences. Based on the syntactic analysis, the analyzed document is annotated.
[0041] Further, for conducting an analysis of the semantic patterns in the analyzed
document, in one example, the annotation module 120 may check for subject-verb agreement, proper use of genders, and expression of a concept. Thereafter, the annotation module 120 can identify information of interest from the analyzed document. The annotation module 120 may associate one or more annotation tags with the identified information. Semantic annotation may be performed at different levels of granularity, such as annotation of a complete document, a paragraph, a sentence, a concept, or a word. Based on each level, the annotation module 120 may automatically identify semantic patterns and may associate semantic tags with text of the analyzed document. The semantic tags may establish mappings between concepts and information within the analyzed document.
[0042] In an implementation, the annotation module 120 may be configured to
simultaneously annotate the analyzed document with respect to the linguistic patterns. In another implementation, the annotation module 120 may be configured to annotate the analyzed document in a sequential manner, i.e., each linguistic pattern at one time, say first analysis and annotation with reference to lexical patterns, then based on syntactic patterns, and finally based on semantic patterns. The annotation module 120 may also be configured to store the annotation tags associated with the analyzed document as annotation data 130. The annotation tags may be stored along with the analyzed document as a text string to provide the annotated document. Accordingly, the annotation tags may enrich the analyzed document with
different types of information and may thus facilitate the extraction of targeted information from the annotated document based on the user’s requirement.
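The sequential manner of annotation described above may be sketched as a chain of stages, each of which adds one layer of tags before the result is stored as a text string. The rules inside each stage are deliberately trivial placeholders, and the stage names, tag names, and word/syntactic-tag/semantic-tag layout are assumptions made only for illustration.

```python
def lexical_stage(doc):
    """Placeholder lexical analysis: split the text into tokens."""
    doc["tokens"] = doc["text"].split()
    return doc

def syntactic_stage(doc):
    """Placeholder syntactic rule: capitalized tokens are tagged as proper nouns."""
    doc["syntax"] = ["NNP" if token[0].isupper() else "X" for token in doc["tokens"]]
    return doc

def semantic_stage(doc):
    """Placeholder semantic rule: proper nouns are tagged as entities."""
    doc["semantics"] = ["ENTITY" if tag == "NNP" else "O" for tag in doc["syntax"]]
    return doc

def annotate(text, stages=(lexical_stage, syntactic_stage, semantic_stage)):
    """Run the stages one after another and emit the tags as one text string."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    triples = zip(doc["tokens"], doc["syntax"], doc["semantics"])
    return " ".join(f"{token}/{syn}/{sem}" for token, syn, sem in triples)

print(annotate("Acme acquired Beta"))  # Acme/NNP/ENTITY acquired/X/O Beta/NNP/ENTITY
```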
[0043] Further, in an implementation, the enriched format of the annotated document
may be in a text format. As mentioned previously, the annotation module 120 may store information pertaining to the annotation tags in the file system of the IE system 102. In an example, the annotation module 120 may store the annotated document in the file system of the IE system 102. The annotated document may be stored as flat files in the file system of the IE system 102. In said example, the flat files may be accessed by using operating system (OS) commands for displaying the flat files. As mentioned above, the information in the flat files, such as the annotation tags, may be stored as strings. Also, storage of the annotated document in the form of the flat files may take less disk space as the annotated document is in the text format.
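Purely as a sketch of flat-file storage, the Python fragment below writes an annotated string to a plain text file and reads it back. The file name, the temporary directory, and the `word/POS/TAG` layout are assumptions for illustration; the specification does not fix any of them.

```python
import tempfile
from pathlib import Path

# Hypothetical annotated string; the word/POS/TAG layout is an assumption.
annotated = "Acme/NNP/ORG acquired/VBD/EVENT Globex/NNP/ORG"

with tempfile.TemporaryDirectory() as tmp:
    # Store the annotated document as a flat text file.
    path = Path(tmp) / "doc1.annotated.txt"
    path.write_text(annotated + "\n", encoding="utf-8")

    # Any tool that can display a text file can read it back unchanged.
    restored = path.read_text(encoding="utf-8").rstrip("\n")

print(restored)
```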
[0044] In an implementation, the IE system 102 may facilitate the user to extract
targeted information from the annotated document by using the extraction pattern. In an implementation, the pattern generation module 122 may be configured to facilitate the user to generate the extraction pattern using regular expression technique for extracting targeted information from the annotated document. As used herein, the regular expression technique may be understood as a technique in which the extraction pattern can be depicted in the form of a compact representation that may describe a set of strings without listing all the elements of the set. As will be understood, the extraction pattern may be defined for extracting the information of user’s interest (as reflected in any of the patterns which can be lexical, syntactic or semantic and combination thereof).
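To illustrate how a single regular expression may describe a set of strings over an annotated document, the following Python sketch matches an organization, an event verb, and a second organization. The `word/POS/TAG` layout, the tag names, and the group names `buyer`, `verb`, and `target` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical annotated document: each token is stored as word/POS/TAG.
annotated = "Acme/NNP/ORG acquired/VBD/EVENT Globex/NNP/ORG in/IN/O 2011/CD/DATE"

# One ORG token, then one EVENT verb, then another ORG token.
# [^/\s]+ matches the surface word without consuming the tag separators.
pattern = re.compile(
    r"(?P<buyer>[^/\s]+)/NNP/ORG\s+"
    r"(?P<verb>[^/\s]+)/VBD/EVENT\s+"
    r"(?P<target>[^/\s]+)/NNP/ORG"
)

match = pattern.search(annotated)
result = match.groupdict() if match else {}
print(result)
```

The same expression matches any pair of organizations joined by an event verb, without the organizations being listed in advance.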
[0045] According to said implementation, to generate the extraction pattern, the
pattern generation module 122 may provide the user with the annotation tags stored in the IE system 102. The pattern generation module 122 may facilitate the user to generate a pattern in regular expression for extracting information of interest from the annotated document. In an example, the extraction pattern may be generated on the basis of selection of the text labels and annotation tags by the user. The extraction pattern generated in accordance with the present subject matter may facilitate a novice user, say a user who is not familiar with programming languages and complex codes, to easily and conveniently extract
information of user’s interest (as reflected in any of the patterns which can be lexical, syntactic or semantic and combination thereof). As will be understood from the foregoing description, the IE system 102 facilitates the user in extracting information from the documents as exhibited by, for example, text patterns, lexical patterns, syntactic patterns, semantic patterns, and a combination of such patterns.
[0046] Further, the pattern generation module 122 may provide constructs that may
facilitate generation of the extraction pattern. The constructs may refer to parts of a language that may be arranged in a systematic order to form a sentence or a phrase. In said implementation, the constructs may be similar to English so that the user may select the annotation tags based on the constructs. The pattern generation module 122 may also be configured to store the constructs as constructs data 132 in the IE system 102. The extraction pattern may be generated by two techniques, namely, a drag-drop technique and a gazetteer technique. Both the techniques will be explained later in detail in conjunction with Figs. 2a and 2b.
[0047] Further, according to an implementation, the IE system 102 of the present
subject matter may facilitate the user to combine one or more extraction patterns. The pattern generation module 122 may facilitate the user to use operators, such as Boolean operators, to combine the one or more extraction patterns to form a combination extraction pattern. For example, if a first extraction pattern specified by the user covers one kind of pattern, such as text patterns, of the input document, the user may combine a second extraction pattern, such as one covering the linguistic patterns, with the first extraction pattern to form the combination extraction pattern. As mentioned above, the user may combine the two extraction patterns using AND/OR operators. The pattern generation module 122 may execute the combination extraction pattern on the annotated document instead of executing two extraction patterns separately. Additionally, the pattern generation module 122 may facilitate the user to apply conditions on the extraction patterns. For example, the user may define the extraction pattern in such a way that if the targeted information is extracted by the execution of the first extraction pattern, then the second extraction pattern may not be executed. However, if the targeted information is not extracted by the execution of the first extraction pattern, the second extraction pattern may be executed by the pattern generation module 122 to extract the targeted information.
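The combination and conditional execution described above may be sketched in Python as follows. The helper names `combine_or`, `combine_and`, and `first_then_fallback`, the `word/POS/TAG` layout, and the particular reading of AND (both patterns must match) are assumptions made for illustration, not the behaviour of the pattern generation module 122 as such.

```python
import re
from typing import List

# Hypothetical annotated document and patterns; each pattern has exactly
# one capture group so that re.findall returns the captured words.
annotated = "Acme/NNP/ORG acquired/VBD/EVENT Globex/NNP/ORG"
ORG = r"([^/\s]+)/NNP/ORG"
EVENT = r"([^/\s]+)/VBD/EVENT"
LOC = r"([^/\s]+)/NNP/LOC"  # matches nothing in this document


def combine_or(first: str, second: str, text: str) -> List[str]:
    """OR: return whatever either pattern extracts."""
    return re.findall(first, text) + re.findall(second, text)


def combine_and(first: str, second: str, text: str) -> List[str]:
    """AND: return results only if both patterns extract something."""
    a, b = re.findall(first, text), re.findall(second, text)
    return a + b if a and b else []


def first_then_fallback(first: str, second: str, text: str) -> List[str]:
    """Run the second pattern only if the first extracts nothing."""
    hits = re.findall(first, text)
    return hits if hits else re.findall(second, text)


print(combine_or(ORG, EVENT, annotated))
print(combine_and(ORG, LOC, annotated))
print(first_then_fallback(LOC, EVENT, annotated))
```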
[0048] It will be understood that although the present subject matter has been
described with reference to the analysis of a single document at one time, the subject matter can be implemented in the same manner for a plurality of documents. As mentioned earlier, the pattern generation module 122 may facilitate formation of the extraction patterns using regular expression. Such an extraction pattern may be used for extracting all types of information of user’s interest as exhibited by patterns, such as lexical, syntactic, and semantic patterns, from the annotated document.
[0049] Figs. 2a and 2b illustrate a graphical user interface (GUI) 200 of the IE
system 102 for forming an extraction pattern, in accordance with two different embodiments of the present subject matter. Referring to Fig. 2a, the GUI 200 is provided for generation of the extraction pattern, using regular expression, by implementing the drag-drop technique, in accordance with an embodiment of the present subject matter. The present figure illustrates a basic tab of the GUI 200. According to said implementation, the GUI 200 may include an input section 202 where the user may provide input, a preview section 204 for displaying the regular expression-based extraction pattern, and a result section 206 that may indicate the targeted information based on the extraction pattern.
[0050] In an implementation, the input section 202 may include a semi-formed
extraction pattern 208, for example, having blanks to be filled in by the user to complete the extraction pattern. In the present implementation, the user may rephrase the extraction pattern in accordance with the requirements, i.e., the user may re-arrange the constructs provided in the input section 202 of the GUI 200.
[0051] Further, the GUI 200 may include a drop-down list 210 that may include all the
annotation tags associated with the text of the annotated document. The user may interact with the GUI 200 through the interface(s) 110 to select the annotation tags from the drop-down list 210 and complete the semi-formed extraction pattern 208. In an example, the user may drag an annotation tag from the drop-down list 210 and drop the annotation tag into one of the blank spaces provided in the semi-formed extraction pattern 208 to complete the extraction pattern. In another example, the user may double-click on the annotation tags of choice and the pattern generation module 122 may automatically put the annotation tags in the blank spaces of the semi-formed extraction pattern 208.
[0052] As mentioned earlier, the pattern generation module 122 may facilitate
formation of the extraction patterns in regular expression. Therefore, once the user defines the extraction pattern, say in English constructs, by filling in the blanks in the semi-formed extraction pattern 208, the pattern generation module 122 may facilitate automatic generation of the extraction pattern in regular expressions. An example of such an extraction pattern 212 in regular expression is indicated by the preview section 204 of the GUI 200. Further, the IE system 102 may be configured to understand the extraction patterns created in the regular expressions. Accordingly, when the pattern generation module 122 executes the extraction pattern 212 generated in regular expressions, the annotation module 120 can extract the targeted information 214 based on the extraction pattern 212, and provide the targeted information to the user in the result section 206 of the GUI 200.
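One way in which filled-in blanks could be turned into a regular expression is sketched below in Python. The functions `tag_fragment` and `fill_template`, the `word/TAG` layout, and the tag names are hypothetical; the sketch shows only the general idea of generating the expression from the selected annotation tags, not the actual translation performed by the pattern generation module 122.

```python
import re
from typing import List


def tag_fragment(tag: str) -> str:
    """Regex fragment for one token carrying the given tag (word/TAG layout assumed)."""
    return rf"[^/\s]+/{re.escape(tag)}"


def fill_template(slots: List[str]) -> str:
    """Build one regex from the tags dropped into the blanks, matched in order."""
    return r"\s+".join(f"({tag_fragment(tag)})" for tag in slots)


# Hypothetical annotated document and user selection.
annotated = "Acme/ORG acquired/EVENT Globex/ORG"
regex = fill_template(["ORG", "EVENT", "ORG"])

match = re.search(regex, annotated)
groups = match.groups() if match else ()
print(regex)
print(groups)
```

In this sketch the user only chooses tag names; the regular expression shown in a preview would be derived mechanically from that choice.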
[0053] In an implementation, the targeted information 214 is extracted in regular
expressions by the annotation module 120; however, the GUI 200 may translate the regular expressions into English-like constructs and provide them to the user through the result section 206 of the GUI 200. Accordingly, the GUI 200 of the IE system 102 may facilitate the user to define extraction patterns even if the user is not acquainted with a programming language. In another implementation, the pattern generation module 122 may facilitate the user to define an extraction pattern in regular expressions for extracting information of user’s interest as exhibited by patterns, such as lexical patterns, from the text of the annotated document, without using the annotation tags.
[0054] Referring to Fig. 2b, the GUI 200 is illustrated for generation of the regular
expression-based extraction pattern using the gazetteer technique, in accordance with another embodiment of the present subject matter. In the present figure, an advanced tab of the GUI 200 is illustrated. In a similar manner as explained with reference to Fig. 2a, the GUI 200, according to said implementation, may include the input section 202, the preview section 204, and the result section 206. The input section 202 of the advanced tab of the GUI 200 may include a list of the annotation tags. Further, the GUI 200 may facilitate the user to provide a list of keywords of interest. Accordingly, the user may create the list of the keywords using the user device 104 and interface(s) 110. In an example, the keywords may include an entity, a relation, an event, or a combination thereof and the user may define the extraction pattern based on the list of keywords.
[0055] According to the above example, the user may list down some companies of
interest for which merger and acquisition related information needs to be extracted from the input documents. The GUI 200 may facilitate the user to define the extraction pattern in such a way that information related to the companies provided in the list will be extracted. In an implementation, if none of the companies listed by the user are referenced in the input document, or if the extraction pattern specified by the user does not match the contents of the input document, no extracted result will be obtained. In another example, the analysis module 118 may be configured to extract information related to the NER. The analysis module 118 may form the extraction pattern in conjunction with gazetteers, such as lists of companies and lists of cities, to extract information pertaining to the companies and the cities provided in the gazetteers, by means of regular expression patterns. The analysis module 118 may also be configured to store the gazetteers as the analysis data 128.
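A gazetteer-driven pattern of the kind described above may be sketched in Python as an alternation built from the listed keywords. The company names, the function `gazetteer_pattern`, and the choice to match against plain text rather than the annotated string are assumptions made to keep the example short.

```python
import re
from typing import Iterable


def gazetteer_pattern(names: Iterable[str]) -> "re.Pattern":
    """Compile one regex that matches any listed name as a whole word."""
    # Longest names first so that "Acme Holdings" is preferred over "Acme".
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"\b(?:{alternatives})\b")


# Hypothetical gazetteer of companies of interest.
companies = ["Acme", "Globex", "Acme Holdings"]
pattern = gazetteer_pattern(companies)

text = "Acme Holdings agreed to merge with Globex. Initech was not involved."
hits = pattern.findall(text)

# A document that references none of the listed companies yields no result.
no_hits = pattern.findall("Initech posted quarterly results.")
print(hits, no_hits)
```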
[0056] As mentioned earlier, the IE system 102 may facilitate generation of the
extraction pattern in regular expressions. This may enable the user to extract complex patterns from the annotated document without learning a new language. As will be understood from the foregoing description, the IE system 102 may facilitate the user to define extraction patterns in constructs which are similar to English and therefore may be understandable even by the novice user. The IE system 102 may accordingly facilitate the user to generate the extraction pattern for extraction of complex information as exhibited by patterns, such as semantic and syntactic patterns without learning a new programming language for extraction of each of the complex patterns.
[0057] Fig. 3 illustrates an exemplary method 300 of extracting targeted information,
in accordance with an embodiment of the present subject matter. The method 300 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, and functions that perform particular functions or implement particular abstract data types. The method 300 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
[0058] The order in which the method 300 is described is not intended to be construed
as a limitation, and any number of the described method blocks can be combined in any order to implement the method 300 or alternative methods. Additionally, individual blocks may be deleted from the method 300 without departing from the spirit and scope of the subject matter described herein. Furthermore, the method 300 can be implemented in any suitable hardware, software, firmware, or combination thereof.
[0059] At block 302, an input document may be analyzed by the analysis module 118.
The input document may include, but is not restricted to, articles, newspapers, online transcripts of radio and television broadcasts, call-center notes, free-text survey fields, and reports. Further, the input document may be in a variety of formats, such as Extensible Markup Language (XML), Rich Text Format (RTF), e-mails, Hypertext Markup Language (HTML), Standard Generalized Markup Language (SGML), and plain text. Various details about the input document may be stored as document data 126.
[0060] Further, the analysis module 118 may identify text patterns in the input
document. In an implementation, the text pattern analysis may include Part-Of-Speech (POS) tagging, Named Entity Recognition (NER), thematic analysis, or a combination thereof. In an implementation, the analysis module 118 may identify the text patterns based on some predefined rules in the IE system 102. For example, the predefined rules may facilitate the analysis module 118 to identify a noun, a verb, an adjective, and the like, in text of the input document. Similarly, the predefined rules may facilitate the analysis module 118 to identify persons, locations, and organizations from the input document.
[0061] At block 304, the analysis module 118 may associate at least one label with
each of the identified text patterns. The analysis module 118 may associate POS tags, and other notes that may provide information about the NER and thematic analysis with the input document to provide an analyzed document. The analysis module 118 may be configured to store information related to the analyzed document as the analysis data 128.
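Blocks 302 and 304 may be pictured with the toy Python sketch below, in which a few predefined rules assign a label to each token. The rules, the label names, and the function `label_tokens` are hypothetical and far simpler than a practical POS tagger or named entity recognizer; they serve only to show labels being associated with identified text patterns.

```python
import re
from typing import List, Tuple

# Hypothetical predefined rules: (regex over a single token, label).
# The first rule that matches a token determines its label.
RULES = [
    (re.compile(r"^\d{4}$"), "DATE"),
    (re.compile(r"^[A-Z][a-z]+$"), "PROPER_NOUN"),
    (re.compile(r"^[a-z]+ed$"), "VERB_PAST"),
]


def label_tokens(text: str) -> List[Tuple[str, str]]:
    """Associate one label with each whitespace-separated token."""
    labeled = []
    for token in text.split():
        label = next((lab for rx, lab in RULES if rx.match(token)), "OTHER")
        labeled.append((token, label))
    return labeled


analyzed = label_tokens("Acme acquired Globex in 2011")
print(analyzed)
```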
[0062] At block 306, the analyzed document may be parsed through the annotation
module 120. The annotation module 120 may identify one or more linguistic patterns associated with the text of the analyzed document. The one or more linguistic patterns may include a lexical, a syntactic, and a semantic pattern. The annotation module 120 may identify
the one or more linguistic patterns on the basis of some predefined rules. The predefined rules may facilitate the annotation module 120 to identify the one or more linguistic patterns.
[0063] At block 308, the annotation module 120 may associate at least one annotation
tag with the one or more identified linguistic patterns. In an implementation, the annotation module 120 may annotate the analyzed document on the basis of lexical, syntactic, and semantic information contained in the text of the analyzed document. In said implementation, the annotation module 120 may simultaneously provide the lexical, syntactic, and semantic annotations to the analyzed document. The annotation module 120 may also be configured to store information about the annotation tags as the annotation data 130 in the IE system 102.
[0064] In an implementation, the annotation tags may be associated with the analyzed
document to provide an annotated document. Accordingly, the annotated document may provide information related to the text patterns as well as the linguistic patterns associated with the text. The annotated document may be in a text format that may include the text as well as the annotation tags as a string. Further, the IE system 102 may facilitate storing the annotated document in the file system. The user may accordingly retrieve targeted information from the annotated document by generating the extraction pattern.
[0065] At block 310, at least one extraction pattern may be generated based on an
input provided by the user. The extraction pattern may extract targeted information from the annotated document. In an implementation, the extraction pattern may be defined by the user through the user device 104 and the interface(s) 110. The pattern generation module 122 may facilitate the user to define the extraction pattern based on the requirements of the user. In an implementation, the pattern generation module 122 may facilitate the user to define an extraction pattern in regular expressions for extracting information of user’s interest as exhibited by patterns, such as lexical patterns, from the text of the annotated document, without using the annotation tags. The pattern generation module 122 may also facilitate the user to use operators, such as Boolean operators, to combine the one or more extraction patterns to form a combination extraction pattern.
[0066] The user may generate the extraction pattern through the drag-drop technique
or through the gazetteer technique. The drag-drop technique may facilitate the user to drag an annotation tag from a list of annotation tags and drop the annotation tag in a semi-formed
extraction pattern. Accordingly, the user may complete the extraction pattern for execution on the annotated document. The gazetteer technique may enable the user to create a list of keywords of interest. The keywords may represent an entity, an event, or a relation. Accordingly, the user may restrict the information extraction to the keywords provided in the list. The extraction pattern may provide information related to the keywords. Both the techniques may facilitate a novice user to define the extraction pattern without learning any programming language.
[0067] Further, at block 312, the extraction pattern generated by the user may be
executed by the pattern generation module 122 on the annotated document. Once executed, the extraction pattern may extract targeted information that may be provided to the user. The user may view the targeted information through the user device 104. Accordingly, the user may make use of the same extraction pattern for extracting the complex information of user’s interest as exhibited by the syntactic and semantic patterns, from the annotated document. The present subject matter may provide a convenient and time saving technique for extraction of targeted information from the annotated document.
[0068] Accordingly, the present subject matter facilitates use of regular expressions
for extracting information exhibited not only by lexical patterns but also by other complex patterns, such as syntactic and semantic patterns. As described above, the enriched format of the annotated document may facilitate removal of hierarchical structure at the sentence level and expressing various annotation tags in a linear format, say as a string. As a result, the present subject matter provides for a convenient and easy, yet effective, information extraction. The extraction pattern generated in accordance with the present subject matter may enable a novice user to extract information of interest that may be exhibited by different types of linguistic patterns, such as lexical, syntactic, semantic, or a combination thereof, without learning new languages for writing complex extraction codes.
[0069] Although embodiments for an information extraction system have been
described in language specific to structural features and/or methods, it is to be understood that the present subject matter is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as exemplary implementations for the information extraction system.
I/We claim:
1. A method for extracting targeted information from at least one input document, the method
comprising:
analyzing the at least one input document to provide at least one analyzed document, wherein the analyzing comprises identifying a plurality of text patterns in the at least one input document, wherein each of the plurality of text patterns is associated with at least one label;
determining one or more linguistic patterns in the at least one analyzed document, wherein each of the one or more linguistic patterns is associated with at least one annotation tag to provide at least one annotated document, and wherein the at least one annotation tag is stored as a string;
generating at least one extraction pattern based on a selection of one or more of the at least one label and the at least one annotation tag, wherein the at least one annotation tag is selected by a user; and
executing the at least one extraction pattern on the at least one annotated document for extracting targeted information from the at least one annotated document.
2. The method as claimed in claim 1, wherein the identifying the plurality of text patterns comprises performing at least one of a part-of-speech tagging, a named entity recognition, and a thematic analysis.
3. The method as claimed in claim 1, wherein the one or more linguistic patterns comprise a lexical pattern, a syntactic pattern, and a semantic pattern.
4. The method as claimed in claim 1, wherein the generating comprises representing the at least one extraction pattern using regular expressions by one of a drag-drop technique and a gazetteer technique.
5. The method as claimed in claim 4, wherein the drag-drop technique comprises selecting at least one annotation tag from a list of annotation tags to generate the at least one extraction pattern.
6. The method as claimed in claim 4, wherein the gazetteer technique comprises listing
keywords of interest based on which the at least one extraction pattern extracts targeted
information.
7. The method as claimed in claim 6, wherein the keywords are at least one of events, entities, and relationships.
8. An information extraction system (102) comprising:
a processor (108); and
a memory (112) coupled to the processor (108), the memory (112) comprising:
an analysis module (118) configured to,
analyze at least one input document to identify a plurality of text patterns in the at least one input document; and
label each of the plurality of text patterns to generate at least one analyzed document;
an annotation module (120) configured to,
determine one or more linguistic patterns in the at least one analyzed document to associate at least one annotation tag with each of the one or more linguistic patterns to generate at least one annotated document; and
a pattern generation module (122) configured to,
generate at least one extraction pattern based on selection of the at least one annotation tag, wherein the at least one annotation tag is selected by a user.
9. The information extraction system (102) as claimed in claim 8, wherein the pattern
generation module (122) is further configured to execute the at least one extraction pattern to
extract targeted information from the at least one annotated document.
10. The information extraction system (102) as claimed in claim 8, wherein the plurality of text patterns comprise at least one of a part-of-speech, a named entity, and a thematic pattern.
11. The information extraction system (102) as claimed in claim 8, wherein the one or more linguistic patterns comprise a lexical pattern, a syntactic pattern, and a semantic pattern.
12. The information extraction system (102) as claimed in claim 8, wherein the at least one
extraction pattern is a combination extraction pattern.
13. A non-transitory computer readable medium having embodied thereon a computer
program for executing a method comprising:
analyzing at least one input document to provide at least one analyzed document, wherein the analyzing comprises identifying a plurality of text patterns in the at least one input document, wherein each of the plurality of text patterns is associated with at least one label;
determining one or more linguistic patterns in the at least one analyzed document, wherein each of the one or more linguistic patterns is associated with at least one annotation tag to provide at least one annotated document, and wherein the at least one annotation tag is stored as a string;
generating at least one extraction pattern based on a selection of one or more of the at least one label and the at least one annotation tag, wherein the at least one annotation tag is selected by a user; and
executing the at least one extraction pattern on the at least one annotated document for extracting targeted information from the at least one annotated document.
| # | Name | Date |
|---|---|---|
| 1 | ABSTRACT1.jpg | 2018-08-11 |
| 2 | 788-MUM-2012-POWER OF ATTORNEY(14-6-2012).pdf | 2018-08-11 |
| 3 | 788-MUM-2012-FORM 3.pdf | 2018-08-11 |
| 4 | 788-MUM-2012-FORM 2.pdf | 2018-08-11 |
| 5 | 788-MUM-2012-FORM 18(27-3-2012).pdf | 2018-08-11 |
| 6 | 788-MUM-2012-FER.pdf | 2018-08-11 |
| 7 | 788-MUM-2012-CORRESPONDENCE(27-3-2012).pdf | 2018-08-11 |
| 8 | 788-MUM-2012-CORRESPONDENCE(14-6-2012).pdf | 2018-08-11 |
| 9 | 788-MUM-2012-OTHERS [13-12-2018(online)].pdf | 2018-12-13 |
| 10 | 788-MUM-2012-FER_SER_REPLY [13-12-2018(online)].pdf | 2018-12-13 |
| 11 | 788-MUM-2012-DRAWING [13-12-2018(online)].pdf | 2018-12-13 |
| 12 | 788-MUM-2012-COMPLETE SPECIFICATION [13-12-2018(online)].pdf | 2018-12-13 |
| 13 | 788-MUM-2012-CLAIMS [13-12-2018(online)].pdf | 2018-12-13 |
| 14 | 788-MUM-2012-HearingNoticeLetter-(DateOfHearing-06-03-2020).pdf | 2020-02-20 |
| 15 | 788-MUM-2012-Correspondence to notify the Controller [02-03-2020(online)].pdf | 2020-03-02 |
| 16 | 788-MUM-2012-Written submissions and relevant documents [18-03-2020(online)].pdf | 2020-03-18 |
| 17 | 788-MUM-2012-Written submissions and relevant documents [20-03-2020(online)].pdf | 2020-03-20 |
| 1 | searchstrategy_14-06-2018.pdf | |