Abstract: The Shakti framework provides a method by which Natural Language Processing (NLP) modules (and any other additional modules) can be combined in a modular manner. This framework offers a very flexible, module-independent and efficient platform to build NLP applications such as machine translation, search engines, information extraction systems and cross-language information retrieval systems. This approach is particularly suitable in Artificial Intelligence (AI) in general, where modules may or may not be able to perform their task with 100% accuracy, and yet must work together without a significant drop in overall accuracy.
Description of the invention
The framework described here allows Natural Language Processing (NLP) modules (and other associated modules) to be combined in a modular manner to build a system. It offers a very flexible, module-independent and efficient way to build NLP tools and NLP applications. The term NLP tools includes language analyzers (taggers, shallow parsers, sentence parsers, coreference taggers, etc.) and language generators, and the term NLP applications includes machine translation, search engines, information extraction systems and cross-language information retrieval systems, etc. This approach is suitable in Artificial Intelligence (AI) in general, where modules may or may not be able to perform their task with 100% accuracy, and yet must work together without a significant drop in overall functioning.
Background
Technical Description of the Problem:
In order to make computers understand and process natural languages like English and Hindi, we need to design algorithms and data structures. By its very nature this problem is very hard, and systems operate not perfectly but with some degree of accuracy. Research and development in this area produces new improvements every year. In such a situation, it is extremely important to have a framework which allows different modules to be put together and updated when new methods or techniques emerge.
The framework should allow various Natural Language Processing (NLP) modules, such as word analyzers, phrase recognizers and analyzers, and sentence analyzers and generators, to work in a cooperative manner. The framework should also provide a module-independent representation of the information that is generated and used by the various NLP modules, so that if a module is replaced to incorporate a better algorithm or machine learning technique, the rest of the system is not disturbed. At the same time, a system built on the framework should be efficient in speed and should also support parallelism among modules.
Technical Description of the Solution:
The system described here provides a general framework for setting up modules that cooperatively analyze natural language using shared data. The shared data is represented in a common format (called Shakti Standard Format or SSF) used by all the modules. SSF is a kind of blackboard on which all modules operate. It permits partial analysis to be represented and operated upon by different modules.
The Shakti Dashboard is a configuration tool for setting up a system based on the framework. It sets up the control flow of a blackboard based system besides providing many other facilities described below. It assumes that the system which it has to configure consists of modules which operate on an in-memory data structure such as SSF. The data structure is so designed that the modules do processing based on the information present in it, and after that leave their output in the same data structure.
The data structure has a text notation for representing it unambiguously. This is called the in-stream or in-text version. The notation helps in readability and, as we will see, in making the in-memory structure usable across different programming languages and in sending it across different machines and processes. Together with the notation, one needs two converters, called the "reader" program and the "printer" program, which convert from the text notation to the in-memory data structure and vice versa, respectively. SSF, for example, provides both an in-stream notation and an in-memory representation. The format is general enough for any Natural Language Processing (NLP) application, including machine translation.
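The following is a minimal sketch, in Python, of what such a "reader"/"printer" pair might look like. The class and function names here are hypothetical (this is not the actual Shakti API), and the sketch assumes the three-part, tab-separated in-text layout described later in this document, with '((' and '))' marking group boundaries.

```python
class Node:
    """One SSF node: a word/token or a group, with feature attributes."""
    def __init__(self, token, category):
        self.token = token        # TKN property
        self.category = category  # CAT property
        self.attrs = {}           # user-defined features
        self.children = []        # child nodes (for groups)

def read_ssf(text):
    """'Reader': convert in-text SSF into a list of in-memory trees."""
    top, stack = [], []
    for line in text.strip().split("\n"):
        fields = line.split("\t")
        if "))" in fields:                      # end of a group
            stack.pop()
            continue
        _addr, token, cat = fields[0], fields[1], fields[2]
        node = Node(token, cat)                 # ADDR is not stored
        (stack[-1].children if stack else top).append(node)
        if token == "((":                       # start of a group
            stack.append(node)
    return top

def print_ssf(nodes, prefix=""):
    """'Printer': convert in-memory trees back to in-text SSF lines."""
    lines = []
    for i, node in enumerate(nodes, 1):
        addr = f"{prefix}{i}"                   # ADDR recomputed here
        if node.children:
            lines.append(f"{addr}\t((\t{node.category}")
            lines.extend(print_ssf(node.children, addr + "."))
            lines.append("\t))")
        else:
            lines.append(f"{addr}\t{node.token}\t{node.category}")
    return lines
```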
An NLP system is usually designed to consist of a large number of modules, each of which typically performs a small logical task. This allows the overall task to be broken up into a large number of small sub-tasks, each of which can be accomplished separately. All modules operate on data whose format is fixed. For example, all modules read data in Shakti Standard Format (SSF) and generate output in the same format. If a module is successful in its task, it adds a new attribute or new tree nodes (in the same SSF). Thus, even though the format is fixed, it is extensible in terms of attributes or tree analyses. This approach also allows ready-made packages to be used easily: in order to interface such a package, all that is required is to convert its output to SSF, and the rest of the modules continue to operate seamlessly. This approach follows the dictum: "Simplify globally, and only if necessary, complicate locally." However, since the number of modules is large and each local module does a small job, the local complexity of individual modules remains under tight control for most of the modules.
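As an illustration of this style, here is a toy pipeline sketch in Python, reusing the Node class from the previous sketch. The module names and the tiny lexicon are purely illustrative assumptions: each module leaves its output in the shared structure as new attributes, and simply skips items it cannot handle.

```python
def run_pipeline(nodes, modules):
    """Apply each module, in a pre-defined order, to the shared SSF."""
    for module in modules:
        module(nodes)            # each module mutates the shared data
    return nodes

def toy_pos_tagger(nodes):
    """Toy module: adds a 'pos' attribute where it succeeds."""
    lexicon = {"children": "NN", "played": "VB"}
    for n in nodes:
        if n.token.lower() in lexicon:      # success: record the output
            n.attrs["pos"] = lexicon[n.token.lower()]

def toy_morph_analyzer(nodes):
    """Toy module: uses the tagger's output when it is available."""
    for n in nodes:
        if n.attrs.get("pos") == "VB":      # degrade gracefully if absent
            n.attrs["root"] = n.token[:-2]  # crude stemming: 'played' -> 'play'

nodes = [Node("children", "NN"), Node("played", "VB")]
run_pipeline(nodes, [toy_pos_tagger, toy_morph_analyzer])
```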
An NLP system such as a language analyzer has limited coverage. It is not always able to produce an output for every input; it fails either because of the limits of the best known algorithms or because of incompleteness of data or rules. For example, a sentential parser might fail to parse either because it does not know how to deal with a language construction or because a dictionary entry is missing. Similarly, a chunker or a part-of-speech tagger might, at times, fail to produce an analysis. The system is designed to deal with failure at every level. This is facilitated by a common representation in SSF for the outputs of all such modules (POS tagger, chunker, parser, etc.). The downstream modules continue to operate, albeit less effectively, when a more detailed analysis is not available.
The SSF is designed to represent partial information, routinely. Appropriate modules are expected to know what to do when their desired information is available, and when it is not available. In fact, for many modules, there are not just two but several levels at which they operate, depending on the availability of information corresponding to that level. Each level represents a graceful degradation of output quality. All levels of information are represented in SSF.
The above flexibility is achieved by using two kinds of structures: the tree structure and the feature structure. The former is used to represent and store phrase-level analysis or dependency analysis (or any other tree analysis), and the latter is used to record many kinds of properties on tree nodes.
In the interest of speed of processing, the modules are (usually) designed to operate on the in-memory representation. However, if a module is located on another computer or written in another programming language, the in-memory representation is converted to the in-text representation (by the "printer" program), sent to the other computer, and reconverted to the in-memory representation (by the "reader" for the concerned programming language). This is also automatically facilitated by the Shakti Dashboard tool mentioned earlier.
The in-text notation is also useful for debugging and human inspection of output by a module. It produces a high degree of transparency.
Experience has shown that this methodology has made debugging as well as development of the system convenient for programmers and linguists alike. In case an output is not as expected, one can quickly find out which module went wrong (that is, which module did not function as expected). In fact, linguists use this quite effectively to debug their linguistic data with ease.
Finally, it should be noted that although the framework is described here with SSF as the shared representation, it can also be used with another common representation, as long as that representation has the desirable properties of extensibility and of having both in-memory and in-text representations as part of the framework.
Prior Art
The term blackboard has been used earlier in building data-driven systems in Speech Processing and AI. However, our system does not have to be data driven; in fact, it has a well-defined control structure. The term blackboard as used here refers to the shared memory (SSF) representation for the modules in a system.
Parse trees are extensively used in NLP systems, and typically represent the analysis of a sentence. Feature structures are also used along with the nodes of parse trees. What is new here is that the representation described here allows multiple types of trees to be represented.
The framework described here may be used with phrase structure trees or any other representation. An example system has been built using Shakti Framework (called Shakti Machine Translation System) wherein dependency trees with features are used.
Overall layout of the Process / Design
First, the architecture of an example language analyzer using SSF is outlined; the details of SSF are given next. As mentioned earlier, though the SSF format is fixed, it is extensible to handle new features. It also has a text representation, which makes the output easy to read. The following English sentence illustrates the SSF:
Children are watching some programmes on television in the house. (1)

Sentence (1) contains the following chunks (enclosed in double brackets):
((Children)) [[are watching]] ((some programmes)) ((on television)) ((in the house))
All the chunks are noun phrases, except for one ('are watching') which is a verb group and is shown enclosed in square brackets. If we mark the part-of-speech tag for each word, we have the following:
As shown in Fig. 2, each line represents a word/token or a group, marked by the symbol '(('. (Lines with '))' only indicate the end of a group.) Each word or group has 3 parts. The first part stores the tree address of each word or group, and is for human readability only. The word or group itself is in the second part, and the part-of-speech tag or group/phrase category is in the third part.
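Since Fig. 2 itself is not reproduced here, the following is an illustrative reconstruction of the in-text notation for sentence (1), following the three-part layout just described (the part-of-speech and chunk tags are indicative only; columns are separated by tabs):

```
1	((	NP
1.1	Children	NNS
	))
2	((	VG
2.1	are	VBP
2.2	watching	VBG
	))
3	((	NP
3.1	some	DT
3.2	programmes	NNS
	))
4	((	PP
4.1	on	IN
4.2	television	NN
	))
5	((	PP
5.1	in	IN
5.2	the	DT
5.3	house	NN
	))
```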
Specifications
The SSF representation for a sentence consists of a single tree or sequence of trees. Each tree is made up of one or more related nodes.
A node has properties, which are given by a prop-name and a prop-val. For example, a node may have the word 'she' associated with it along with gender 'f'. These may be stored or accessed using the prop-name TKN and the attribute gend, respectively.
Every node has four "system" properties:
• Address - referred to by the property name ADDR
• Token - accessed by the attribute name TKN
• Category - accessed by the attribute name CAT
• Others (optional) - user-defined features which are accessed through their attribute names.
A property has a prop-name and prop-val. Here are a few examples for the system properties:
Example: Given below are two nodes (or trees with a single node each), marked by address labels 1 and 2, having their respective tokens as 'children' and 'played', and their categories as NN and VB:
Elements in each row are separated by a single tab character.
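Reconstructed from the description above, the two rows would look as follows (with a tab between the address, token and category):

```
1	children	NN
2	played	VB
```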
Corresponding to the above SSF text stream, an in-memory data structure may be created using the APIs. (Note, however, that the value of the property ADDR is not stored explicitly in the in-memory data structure. It is for human reference and readability only, and is computed for the in-text notation when needed.)
Interlinking of nodes
Nodes may be interlinked with each other through directed labeled edges. Usually, these edges have nothing to do with the phrase structure tree; they are concerned with the dependency structure, thematic structure, etc. They are specified using the attribute-value syntax; however, they do not specify a property of a node but rather a relation between two nodes.
For example, if a node is the karta karaka of another node named 'play1' in the dependency structure (in other words, if there is a directed edge from the latter to the former), it can be represented as follows:
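A plausible reconstruction of the intended fragment is shown below, assuming, as in published SSF descriptions, that attribute-value pairs appear in a fourth, feature-structure column:

```
1	children	NN	<fs drel=k1:play1>
2	played	VB	<fs name=play1>
```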
The above says that there is an edge labelled 'k1' from 'played' to 'children' in the 'drel' tree (dependency relation tree). The node with token 'played' is named 'play1' using a special attribute called 'name'. All names of nodes within a sentence must be unique.
So the syntax for interlinking of nodes is as follows: if you associate an arc with a node N using an attribute of the form

<treename>=<edgelabel>:<nodename>

it means that there is an edge from <nodename> to N, and the edge is labelled with <edgelabel>. The name of a node may be declared with the attribute 'name':

name=<nodename>

(All the words in angle brackets may be substituted with appropriate user-defined names.)
Note that more than one kind of tree can be represented over the same nodes. For example, Fig. 5 shows the grammatical role tree structure for the same sentence as in Fig. 4, in which the 'children' node is a child of 'play' and the edge is labeled 'subj':
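Under the same assumed fourth-column syntax as above, and with the illustrative tree name 'gr' for the grammatical role tree, this could be written as:

```
1	children	NN	<fs gr=subj:play1>
2	played	VB	<fs name=play1>
```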
Utility Value of the Invention
This invention allows computers to process natural languages more easily. There are several uses of this invention. Using it, one can develop various applications such as machine translation (automatically translating from one natural language to another, say from English to Hindi) or cross-language information retrieval systems (that is, if you give a query in one language, say Hindi, the application automatically retrieves all relevant documents irrespective of the language they are available in).
This invention can be used to further enhance human computer interaction by providing more natural interfaces, as well as human-human interaction through machine translation, etc.
The scheme described here can be used while building any of the following applications (not an exhaustive list):
1. Automatic summarization means the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text. Automatic summarization technology can be extremely useful for presenting the output of search engines.
2. A foreign language reading aid is a computer program that assists a non-native language user to read aloud properly in their target language. Proper reading means that the pronunciation should be correct and the stress on different parts of the words should be appropriate.
3. A foreign language writing aid is a computer program that assists a non-native language user in writing decently in their target language.
4. Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually networked databases such as the World Wide Web.
5. Information extraction (IE) means to take a given text, automatically extract information from it, and present it in a structured form. The structured information usually relates to semantically well-defined elements of the domain of the text.
6. Machine translation, sometimes referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its most basic level, MT performs simple substitution of words in one natural language with words in another, but it is, in general, much more sophisticated.
7. Natural Language Generation (NLG) is the task of generating natural language from a machine representation system such as a knowledge base or a logical form.
Inventive Steps in the art / technology which can be claimed for Novelty
The following are the elements of the framework described here which are novel.

Modularity through Common Representation:
A common representation allows flexibility and modularity. Each module does its task and puts its analysis in a common extensible representation (SSF). The appropriate module that needs the analysis takes it from the common representation. This allows for flexibility in introducing new modules in the "pipeline" of modules.
The common representation is like a blackboard architecture with a shared memory representation. However, it is not the same as a conventional blackboard, where the scheduling of modules is data driven and modules get activated based on the contents of the blackboard. In our scheme, the scheduling of modules follows a pre-defined control flow.
Efficiency with flexibility:
The common representation (SSF) has both an in-memory data structure as well as in-text notation. The in-memory representation allows for efficient processing (in speed) by the modules. Normally, all the modules are recommended to be coded to operate on in-memory data structure (rather than on in-text notation).
In-text Notation for Common Representation:
Even though the common in-memory representation is available for speed of processing, in-text notation/representation is also provided for transfer of analysis across computers, or across modules written in different programming languages, etc.
When a module running on another computer is to be invoked, the in-memory SSF representation on the current computer is converted to the in-text SSF notation and sent to the other computer. There it is converted to the in-memory representation, and the module is invoked. The output is received back from the other computer as in-text notation and re-converted to the in-memory representation. Similarly, when a module written in another programming language is to be called, the current in-memory representation is converted to the in-text notation, and then converted to the in-memory representation for the new programming language. The conversion is accomplished using converters between the in-memory representation for the concerned programming language and the in-text notation.
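As a rough sketch of this round trip, reusing the read_ssf/print_ssf converters sketched earlier (the transport via a local subprocess and the external command are illustrative assumptions, standing in for an actual cross-machine transfer):

```python
import subprocess

def call_external_module(nodes, command):
    """Invoke a module in another process/language via in-text SSF."""
    text_in = "\n".join(print_ssf(nodes))        # in-memory -> in-text
    result = subprocess.run(command, input=text_in,
                            capture_output=True, text=True, check=True)
    return read_ssf(result.stdout)               # in-text -> in-memory
```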
The above conversion can be carried out automatically, if a suitable tool is used for supporting the scheme (such as Dashboard).
Common Representation:
All the modules operate on the same representation, called SSF. Thus, the input and output formats are the same.
Shakti Standard Format (SSF):
SSF is used for representing the analysis of a sentence. It is especially designed to represent the different kinds of linguistic analysis, as well as analysis at different levels of detail.
SSF allows tree structure representation together with feature structures on every node of the tree. This by itself is not new. However, SSF allows multiple trees to be represented at the same time (on the same set of nodes). For example, grammatical role trees, dependency trees, thematic trees, semantic trees can all be represented for the same sentence at the same time for different levels of analysis.
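For example, under the same illustrative attribute syntax assumed above, a single node could simultaneously carry its place in the dependency tree and in the grammatical role tree:

```
1	children	NN	<fs drel=k1:play1 gr=subj:play1>
2	played	VB	<fs name=play1>
```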
Values can be represented at a refined level of detail using the double underscore notation. For example, a value 'varg' (argument of verb) may be further refined to 'varg__k1', which indicates that it is an argument of the verb of subtype k1. Values can also store disjuncts.
Representing Partial Analysis:
SSF can routinely represent partial analysis. Feature structures allow some attributes to have values, whereas others need not have a value. Thus, when an analysis that supplies values to some of the attributes is not successful, those attributes do not get a value. Other parts of the analysis are free to supply value to other attributes.
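A small illustration in Python of how a feature structure accommodates partial analysis, reusing the Node sketch from earlier (the attribute names are illustrative):

```python
# A morphological analyzer succeeded for number but not for gender:
node = Node("programmes", "NN")
node.attrs["num"] = "pl"            # this attribute received a value
# no "gend" attribute: the gender analysis did not succeed here
if node.attrs.get("gend") is None:  # downstream module degrades gracefully
    print("gender unknown; proceeding without it")
```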
Representing Multiple Kinds of Trees:
One kind of tree analysis may be produced and represented in SSF at one time, later another tree may be produced. For example, initially, the analysis might be in terms of grammatical role tree, later it might also have dependency tree or theta-role tree.
All these trees may be represented in SSF at the same time and may be used by modules which handle these.
Modules do processing if the desired tree analysis is available. They may also be designed to deal with partial or incomplete analysis.
Constituent Structure with Dependency Structure:
NLP systems usually use an analysis based on either constituent (phrase) structure or relational (dependency) structure. SSF permits either analysis to be used. In fact, it also allows the two structures to be combined, if needed. For example, one can have a dependency tree in which the leaf nodes are simple phrases.
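An illustrative fragment of such a combination, under the same assumed attribute syntax as above: the chunk (constituent) nodes carry the dependency relations, while the words sit inside the chunks (tags are indicative only):

```
1	((	NP	<fs drel=k1:VG1>
1.1	Children	NNS
	))
2	((	VG	<fs name=VG1>
2.1	are	VBP
2.2	watching	VBG
	))
```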
More than one Common Representation Possible:
The framework given here is not specific to SSF alone; it works for any common representation. If one chooses another representation instead of SSF, the benefits described above (modularity through a common representation, efficiency with flexibility, and the in-text notation) are still applicable.
Finally, it is also possible to use more than one representation, each of which might be specific to a group of modules. For example, modules 1 to 20 might use one representation (say, SSF1), and modules 21 to 30 might use another representation (say, SSF2).
We claim:
1) A framework for combining modules of a natural language processing system, wherein:
a. each combined module performs a pre-determined task;
b. high processing speed can be incorporated through an in-memory data structure;
c. the flexibility of in-text notation is incorporated to locate modules on multiple computers and in many programming languages;
d. a format enables extensibility of the analysis of a sentence in natural language.
2) A framework for combining modules of a natural language processing system as claimed in claim 1(d), wherein the format comprises a combined representation for all modules to operate on.
3) A framework for combining modules of a natural language processing system as claimed in claim 2, wherein the format is capable of representing partial analysis as well as full analysis.
4) A framework for combining modules of a natural language processing system as claimed in claim 2, wherein multiple levels of detail in structure can be represented at the same time.
5) A framework for combining modules of a natural language processing system as claimed in claim 2, wherein constituent structure can be represented together with dependency structure.
6) A framework for combining modules of a natural language processing system as claimed in claim 1, wherein the framework can be incorporated to build systems in areas other than natural language processing, in artificial intelligence in general.