
Method For Named Entity Recognition Of Specific Domain Texts

Abstract: A method for named entity recognition of domain-specific texts is disclosed. The method 100 includes receiving 102 data from a plurality of resource domains for creating multiple datasets for each entity subdomain, and receiving 110 processed data for extracting triplets and labeling entities into domain-specific classes. The labeling of entities includes identifying 104 domain-specific word embeddings for the independent data set of each entity subdomain, identifying 106 topics and word distribution for each entity subdomain data set with word-weights, and performing 112 local vectorization of arguments as the weighted average of words in an argument. The labeling further includes performing 108 global vectorization of arguments by computing the average of all subdomain topic vectors, comparing 114 local vectors with global vectors for each subdomain, and generating 116 a final named entity label based on its similarity score for each subdomain. A system for named entity recognition of domain-specific texts is also disclosed. FIG. 1


Patent Information

Application #
202341037313
Filing Date
30 May 2023
Publication Number
24/2023
Publication Type
INA
Invention Field
COMPUTER SCIENCE

Applicants

AMRITA VISHWA VIDYAPEETHAM
Bengaluru Campus, Kasavanahalli, Carmelaram P.O., Bangalore – 560035, India

Inventors

1. GUPTA, Deepa
AMRITA VISHWA VIDYAPEETHAM, Bengaluru Campus, Kasavanahalli, Carmelaram P.O., Bangalore – 560035, India
2. GANGADHARAN, Veena
AMRITA VISHWA VIDYAPEETHAM, Bengaluru Campus, Kasavanahalli, Carmelaram P.O., Bangalore – 560035, India
3. KANJIRANGAT, Vani
AMRITA VISHWA VIDYAPEETHAM, Bengaluru Campus, Kasavanahalli, Carmelaram P.O., Bangalore – 560035, India

Specification

Description: FORM 2

THE PATENTS ACT, 1970
(39 of 1970)

COMPLETE SPECIFICATION
(See section 10 and rule 13)

TITLE
METHOD FOR NAMED ENTITY RECOGNITION OF SPECIFIC DOMAIN TEXTS
INVENTORS:
GUPTA, Deepa
GANGADHARAN, Veena
KANJIRANGAT, Vani
Indian Citizens
AMRITA VISHWA VIDYAPEETHAM
Bengaluru Campus, Kasavanahalli, Carmelaram P.O.
Bangalore – 560035, India

APPLICANT
AMRITA VISHWA VIDYAPEETHAM
Bengaluru Campus
Kasavanahalli, Carmelaram P.O.
Bangalore – 560035, India

THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED:
METHOD FOR NAMED ENTITY RECOGNITION OF SPECIFIC DOMAIN TEXTS
CROSS-REFERENCES TO RELATED APPLICATION
None.
FIELD OF INVENTION
The present disclosure relates to machine learning, and in particular to named entity recognition for domain-specific texts from low resource domains.
DESCRIPTION OF THE RELATED ART
Named Entity Recognition (NER) is an essential sub-problem in Information Extraction (IE) that identifies the significant entities in raw text. The recognition of such entities is vital in semantic web applications since it improves the quality of search results. NER systems fall into two subcategories: open-domain NER and domain-specific NER. A generic or open-domain NER system identifies entities present in raw text such as names of persons, places, organizations, times, dates, etc. In domain-specific applications, the entities can be names of diseases, enzymes, medicines, proteins, products, manufacturers, etc. Much research in open-domain NER focuses on rule-based and machine learning-based models; however, the main disadvantage of rule-based NER systems is the manual construction of the rules, which is a time-consuming task. Further, they are heavily domain-dependent and usually less transferable to other domains.
Statistical learning systems require extensive labeled data with feature selection to train the NER model. The model learns parameters from the labeled dataset during training, and in the testing phase it recognizes similar patterns from unseen data. Currently, statistical models are mostly being replaced by deep learning systems. Deep learning models with minimal feature engineering are employed in NER systems to achieve state-of-the-art performance. These models have the inherent capability of extracting intricate patterns from raw text and have been successfully applied in various domains. However, the scalability of these trained models to domain-specific applications remains a major challenge. There are limited tools available for unsupervised NER, and the present approaches are based on complex handcrafted rules or knowledge bases for entity information.
Various publications have tried to resolve the problems presented by domain-specific NER systems. Chinese publication 112507127A discloses intelligent extraction based on a prior knowledge graph. Chinese publication 112131884A discloses entity classification. Jing et al., in “Text Classification based on LDA and semantic analysis”, discuss selection of topic features by an LDA model from a text document and calculation of the semantic similarity between these features. In “Low-Resource Adaptation of Neural NLP Models”, Farhad Nooralahzadeh discusses a reinforcement learning algorithm with partial annotation learning to clean noisy, distantly supervised data for low-resource named entity recognition in different domains and languages.
Presently, domain-specific NER requires annotation of datasets, which is a time-consuming activity. Further, the semantic complexity of domain-specific entities makes them difficult to recognize accurately with existing open NER models.
SUMMARY OF THE INVENTION
The present subject matter relates to named entity recognition for domain specific texts.
In one embodiment of the present subject matter, a method for named entity recognition of domain-specific texts is disclosed. The method includes receiving data from a plurality of resource domains at a preprocessing module for creating multiple datasets, wherein independent data sets are generated for each entity subdomain, receiving processed data from the preprocessing module for extracting triplets at an entity identification module, wherein the triplets include arguments and relations, and labeling entities for the data received from the entity identification module into domain-specific classes at the entity recognition module. The labeling of entities includes identifying domain-specific word embeddings for the independent data set of each entity subdomain in a word dictionary module, identifying topics and word distribution for each entity subdomain data set with word-weights in a weight dictionary module, and performing local vectorization of arguments as the weighted average of words in an argument in a local vector creation module. The labeling further includes performing global vectorization of arguments by computing the average of all subdomain topic vectors in a global vector creation module, comparing local vectors with global vectors for each subdomain by computing cosine similarity between local and global vectors, and generating a final named entity label based on its similarity score for each subdomain.
In various embodiments, generating independent datasets for each entity subdomain includes receiving data from a plurality of recognized domain-specific sources, performing data extraction for the received data using seed names/key phrases pertaining to entities, wherein the data includes extracted sentences, removing stop words and special characters from the extracted sentences, and generating independent datasets for each entity subdomain.
In various embodiments, performing identification of triplets to obtain entities includes performing syntactic simplification to convert compound sentences to simple sentences by splitting conjoint clauses, appositives, and relative clauses, performing extraction of triplets with arguments and relations between them, wherein arguments include noun phrases representing an entity, and determining the validity of extracted arguments with heuristic rules, wherein entities starting with a preposition or link verbs are filtered out.
In various embodiments, performing vectorization for the independent data set of each entity subdomain using domain-specific word embeddings includes generating word representations from the contextual information of a document for vectors, obtaining semantically similar word vectors for a document and storing word vectors in separate dictionaries for each domain.
In various embodiments, identifying topics and word distribution for each entity subdomain data set with word-weights includes determining topics for each subdomain based on coherence score, wherein the coherence score is obtained from the degree of semantic similarity between high scoring topics, obtaining weights corresponding to the topics for storing them in dictionaries, and arranging word-weights in descending order till the weight converges to zero.
In various embodiments, for a similarity score of an argument above a threshold θ, the global vector label is assigned as the final named entity label.
In various embodiments, validity of arguments obtained from the extracted triplets are determined using heuristic rules.
In an embodiment, a system for named entity recognition of domain-specific texts is disclosed. The system includes a preprocessing module for receiving data from a plurality of resource domains for creating multiple datasets, wherein the datasets include independent training and test datasets for each subdomain of the domain, an entity identification module for receiving data from the preprocessing module and identifying triplets from arguments, and an entity recognition module for labeling entities of data received from the entity identification module into domain-specific classes. The entity recognition module includes a word dictionary module for identifying domain-specific word embeddings for the independent data set of each entity subdomain, a weight dictionary module for identifying topics and word distribution for each entity subdomain data set with word-weights, a local vector creation module for performing local vectorization of arguments as a weighted average of words in an argument for a subdomain, a global vector creation module for performing global vectorization of arguments by computing an average of all subdomain topic vectors, and a recognition module for comparing local vectors with global vectors for each subdomain and generating a final named entity label based on its similarity score for each subdomain.
This and other aspects are described herein.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention has other advantages and features, which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates the method for named entity recognition of domain-specific texts, according to an embodiment of the present subject matter.
FIG. 2 illustrates generation of independent datasets for each entity subdomain, according to an embodiment of the present subject matter.
FIG. 3 illustrates identification of triplets for obtaining entities, according to an embodiment of the present subject matter.
FIG. 4 illustrates vectorization of the independent data set of each entity subdomain, according to an embodiment of the present subject matter.
FIG. 5 illustrates identification of topics and word distribution for each entity subdomain, according to an embodiment of the present subject matter.
FIG. 6 illustrates a system for named entity recognition of domain-specific texts, according to an embodiment of the present subject matter.
FIG. 7 illustrates a flow chart for named entity recognition for the agricultural domain texts, according to an embodiment of the present subject matter.
Referring to the figures, like numbers indicate like parts throughout the various views.

DETAILED DESCRIPTION OF THE EMBODIMENTS
While the invention has been disclosed with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the invention. In addition, many modifications may be made to adapt to a particular situation or material to the teachings of the invention without departing from its scope.
Throughout the specification and claims, the following terms take the meanings explicitly associated herein unless the context clearly dictates otherwise. The meaning of “a”, “an”, and “the” include plural references. The meaning of “in” includes “in” and “on.” Referring to the drawings, like numbers indicate like parts throughout the views. Additionally, a reference to the singular includes a reference to the plural unless otherwise stated or inconsistent with the disclosure herein.
The present subject matter describes systems and methods for unsupervised named entity recognition of domain-specific texts. The named entity recognition uses a weighted distributional semantic model. The model is domain-independent and scalable. The method is useful, particularly for low resource domains which have a scarcity of labeled/annotated documents.
In various embodiments of the subject matter, the embodiments of the system and method are illustrated further with reference to the figures. In one embodiment, a method 100 for named entity recognition (NER) is disclosed, wherein the NER is unsupervised, as illustrated in FIG. 1. The method is configured to be implemented in a system as illustrated in FIG. 6. In the first step 102, data 602 is received from a plurality of resource domains at the preprocessing module 604 to create multiple independent datasets, wherein the datasets are created for each entity subdomain. In the preprocessing module 604, a domain-specific corpus is created by extracting sentences from recognized domain-specific websites and other sources. The extracted corpus of sentences is provided to a word dictionary module 610 for identifying, in step 104, domain-specific word embeddings using an extended tokenizer for the independent data set of each entity subdomain. The extended tokenizer prevents the splitting of domain-specific words into subwords, which could otherwise affect their semantic representations, thus retaining better representations. In step 106, topics and word distribution are identified for each entity subdomain data set with word-weights in a weight dictionary module 612. For obtaining the importance of words in a subdomain, weighted distribution is done by determining the score of a single topic by measuring the degree of semantic similarity between high scoring words in the topic in the subdomain. In the next step 108, global vectorization is performed in the global vector creation module 614 by computing the average of all subdomain topic vectors obtained from the word dictionary module 610 and weight dictionary module 612.
The data received from the data sources 602 is also provided to an entity identification module 606 for extracting triplets from the received data as provided in step 110. In step 112, local vectorization is performed in a local vector creation module 616 as a weighted average of words in an argument, using inputs obtained from the word dictionary module 610 and weight dictionary module 612. The local and global vectors for each subdomain are compared by computing cosine similarity between local and global vectors as provided in step 114. In step 116, a final entity label is generated in a naming module 618 based on the similarity score for each subdomain.
The generation of independent datasets for each entity subdomain in step 102 is illustrated in FIG. 2. Data is received from a plurality of recognized domain-specific sources as provided in step 202; the received data includes a plurality of documents, which are split into sentences. In step 204, data extraction is performed using seed names/key phrases pertaining to entities, wherein the data includes extracted sentences. Stopwords and special characters such as apostrophes, double quotes, single quotes, and question marks are removed as provided in step 206. Independent datasets are generated for each entity subdomain as provided in step 208.
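By way of illustration only, the preprocessing of steps 202-208 may be sketched as follows. This is a minimal sketch assuming Python with the NLTK library; the documents and seed phrases are placeholders, not the actual data sources used.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def build_subdomain_dataset(documents, seed_phrases):
    """Steps 202-208: collect sentences mentioning a seed name/key phrase,
    then strip stopwords and special characters."""
    cleaned = []
    for doc in documents:
        for sent in nltk.sent_tokenize(doc):                      # split into sentences
            if any(p.lower() in sent.lower() for p in seed_phrases):
                sent = re.sub(r"[\"'?]", "", sent)                # quotes, question marks
                tokens = [w for w in sent.split()
                          if w.lower() not in STOP_WORDS]
                cleaned.append(" ".join(tokens))
    return cleaned

# e.g. disease_data = build_subdomain_dataset(raw_docs, ["Black rot", "Bunchy top"])
```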
The identification of triplets from the received data in step 110 to obtain entities is illustrated in FIG. 3. Sentence simplification is performed to improve the quality of triplets: syntactic simplification is done by converting compound sentences to simple sentences by splitting conjoint clauses, appositives, and relative clauses as provided in step 302. In step 304, triplets are extracted which include the arguments and the relations between them. The arguments are the significant noun phrases that represent an entity present in the input sentence. After extracting the triplets, heuristic rules are used for determining the validity of the extracted arguments. The rules filter out those entities which are too long, start with a preposition, or start with a link verb.
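The validity check may, for instance, take the following form: a minimal sketch where the preposition and link-verb lists are illustrative assumptions, and triplets are assumed to arrive as (argument1, relation, argument2) tuples from any triplet extractor.

```python
PREPOSITIONS = {"in", "on", "at", "of", "for", "with", "by", "to", "from"}
LINK_VERBS = {"is", "are", "was", "were", "be", "been", "being"}

def is_valid_argument(argument, max_len=7):
    """Filter arguments that are too long or start with a preposition/link verb."""
    words = argument.lower().split()
    if not words or len(words) > max_len:
        return False
    return words[0] not in PREPOSITIONS and words[0] not in LINK_VERBS

def valid_arguments(triplets):
    """Arguments failing the rules would be labeled OTHER and dropped here."""
    args = {a for (a1, _rel, a2) in triplets for a in (a1, a2)}
    return {a for a in args if is_valid_argument(a)}
```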
FIG. 4 illustrates obtaining word vectors for the independent data sets of each entity subdomain using domain-specific word embeddings, corresponding to step 104. In the first step 402, word representations are generated from the contextual information of a document for word vectors. Semantically similar word vectors are obtained from the document as provided in step 404. The word vectors obtained are stored in separate dictionaries for each entity subdomain as provided in step 406.
FIG. 5 illustrates identification of topics and word distribution for each entity subdomain data set with word-weights, corresponding to step 106 in FIG. 1. In step 502, topics for each subdomain are determined based on coherence score, wherein the coherence score is obtained from the degree of semantic similarity between high scoring topics. Weights corresponding to the determined topics are stored in dictionaries as provided in step 504. In step 506, the word-weights are arranged in descending order till the weight converges to zero.
A system 600 for implementing the method 100 for an unsupervised named entity recognition (NER) model is illustrated in FIG. 6. The NER model is categorized into multiple entities based on the domain-specific text; the categories indicate the subdomains for the specific domain in consideration. The system includes a preprocessing module and an entity identification module for receiving data 602 from a plurality of sources, wherein the data is from low resource domains which have a scarcity of labeled/annotated documents. In the preprocessing module 604, a domain-specific corpus is created by extracting sentences from recognized domain-specific websites and other sources. For data extraction, key phrases/seed names related to the subdomains are collected from authorized websites. 80% of the seed names/key phrases are used to extract the documents (sets of sentences) for the creation of the training set, while the remaining 20% of key phrases are used for test set creation for the model. An independent data set pertaining to each of the entity classes/subdomains is created. The entity identification module 606 performs identification of triplets on the data received from the data sources 602. The extracted triplets include the arguments and the relations between them. The arguments are the significant noun phrases that represent an entity present in the input sentence. After extracting the triplets, heuristic rules are utilized to check the validity of the extracted arguments.
The entity recognition module 608 includes a word dictionary module 610, a weight dictionary module 612, a global vector module 614, a local vector module 616 and a naming module 618. Domain-specific word embeddings are identified using an extended tokenizer for the independent data set of each entity subdomain. The extended tokenizer prevents splitting of domain-specific words into subwords, which would affect their semantic representation. The domain-specific word embeddings are used for generating word vectors with a word embedding model. The word embedding model generates semantically similar word vectors for a document. The word vectors generated are stored in separate dictionaries WE_Entity_1, WE_Entity_2, WE_Entity_3, …, WE_Entity_n.
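A minimal sketch of such tokenizer extension and word-vector extraction, assuming the HuggingFace transformers library and BERT Base; the domain terms shown are placeholders, and in practice the extended model would be further adapted on the domain corpus.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Register whole domain words so the tokenizer no longer splits them into subwords.
tokenizer.add_tokens(["erwinia", "guignardia", "pyraclostrobin"])  # placeholder terms
model.resize_token_embeddings(len(tokenizer))  # new tokens start with fresh embeddings

def word_vector(word):
    """Mean last-hidden-state vector for a word, excluding [CLS]/[SEP]."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[1:-1].mean(dim=0)

# WE_Entity_1 = {w: word_vector(w) for w in subdomain_1_vocabulary}
```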
Topics and word distribution are identified for each entity subdomain data set with word-weights in the weight dictionary module 612. There may be many cross-domain words which occur commonly in multiple subdomains; however, the importance of such words may vary across the subdomains. For obtaining the importance of words in a subdomain, weighted distribution is done. For determining the importance of words, a main hyperparameter, the number of topics T, is selected based on coherence scores. Topic coherence measures the score of a single topic by measuring the degree of semantic similarity between high scoring words in the topic. The topic words with corresponding weights are stored in independent word-weight dictionaries corresponding to each entity: WW_Entity_1, WW_Entity_2, WW_Entity_3, …, WW_Entity_n. The weight of a word depends on its importance in the domain and varies from one domain to another. To create the dictionaries, the number of words required in each topic is another parameter that needs to be identified. For this, all vocabulary words and their weights from the generated topics are collected, and the words and weights are sorted in descending order. After a particular word-index, the weights converge to zero. Based on this value, the words and corresponding weights are selected for word-weight dictionary creation.
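As a sketch, assuming the gensim library (the examples herein use LDA for topic modeling), the coherence-based topic-count selection and word-weight dictionary creation could look like this:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def fit_subdomain_lda(token_lists, topic_counts=range(2, 10)):
    """Pick the LDA model whose coherence score is highest."""
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(toks) for toks in token_lists]
    best, best_score = None, float("-inf")
    for t in topic_counts:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=t, iterations=10)
        score = CoherenceModel(model=lda, texts=token_lists, dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if score > best_score:
            best, best_score = lda, score
    return best

def word_weight_dict(lda, top_n=30):
    """Collect topic words with weights, sorted in descending order of weight."""
    ww = {}
    for topic_id in range(lda.num_topics):
        for word, weight in lda.show_topic(topic_id, topn=top_n):
            ww[word] = max(ww.get(word, 0.0), float(weight))
    return dict(sorted(ww.items(), key=lambda kv: kv[1], reverse=True))
```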
The local vector module receives data from the word dictionary module and weight dictionary module for generating local vectors for arguments. The local vector for an argument is the weighted average of words in the argument. For an input argument, corresponding to n entity subdomains, n Local Vectors LV_1, LV_2, LV_3, …, LV_n are created. When an argument gets vectorized to word vectors, it may still produce subword splits based on the subdomain. Hence a weighted score to scale the importance of the words with respect to each subdomain is used. If Arg is an input argument with words {w_1, w_2, w_3, …, w_m}, the Local Vector (LV) is computed as the weighted average of all the m words in Arg using the following equation.
$$LV(Arg) = \frac{\sum_{i=1}^{m} W_{w_i} \cdot Vec(w_i)}{m}$$
In the equation, $w_i$ is the i-th word, and $W_{w_i}$ is the weight associated with the word $w_i$. The vector for each of the m words is extracted from the domain-specific word vectors WE_Entity_1, WE_Entity_2, WE_Entity_3, …, WE_Entity_n created in the word dictionary. The weight is obtained from the weight dictionaries WW_Entity_1, WW_Entity_2, WW_Entity_3, …, WW_Entity_n.
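A direct implementation sketch of the LV equation, where word_vectors and word_weights stand in for the WE_Entity_* and WW_Entity_* dictionaries:

```python
import numpy as np

def local_vector(argument, word_vectors, word_weights, dim=768):
    """LV(Arg) = (1/m) * sum_i W_wi * Vec(w_i); OOV words contribute zero."""
    words = argument.lower().split()
    acc = np.zeros(dim)
    for w in words:
        if w in word_vectors:
            acc += word_weights.get(w, 0.0) * np.asarray(word_vectors[w])
    return acc / max(len(words), 1)

# One local vector per subdomain:
# LV_i = local_vector(arg, WE_Entity_i, WW_Entity_i) for i in 1..n
```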
Global vectorization of arguments is obtained by computing the average of all subdomain topic vectors in the global vector module. The global vector represents the global significance of a subdomain in a document, which is computed as presented in the following equation. In the equation, T is the total number of topics, K is the number of words in a topic, $w_{ij}$ is the i-th word of the j-th topic, and $W_{w_{ij}}$ is the weight of that word defined by the topics in a subdomain. $Vec(w_{ij})$ extracts the embedding vector associated with the word $w_{ij}$ from the subdomain dictionaries. The Global Vector (GV) is then computed as the average of all the subdomain topic vectors. This is repeated for each subdomain, hence producing n Global Vectors GV_1, GV_2, GV_3, …, GV_n for n entity subdomains.
$$GV(Entity) = \frac{\sum_{j=1}^{T} \left( \sum_{i=1}^{K} W_{w_{ij}} \cdot Vec(w_{ij}) \right)}{T}$$
It is essential to identify the number of words K required in a topic to create the global vector. To obtain the value of K, the number of words is varied: while moving from 20 words to 100 words in a topic, the similarity score stays more or less the same or slightly decreases. Since increasing the number of words in a topic has negligible influence on the similarity score, the top 30 words in a topic may be used for global vector generation.
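A corresponding sketch of the GV equation, assuming the per-subdomain LDA model and embedding dictionary from the earlier steps, with K = 30 as noted above:

```python
import numpy as np

def global_vector(lda, word_vectors, K=30, dim=768):
    """GV = (1/T) * sum over T topics of the weighted sum of top-K word vectors."""
    acc = np.zeros(dim)
    for topic_id in range(lda.num_topics):
        for word, weight in lda.show_topic(topic_id, topn=K):
            if word in word_vectors:
                acc += float(weight) * np.asarray(word_vectors[word])
    return acc / lda.num_topics

# GV_i = global_vector(lda_i, WE_Entity_i)  # repeated for each of the n subdomains
```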
The naming module 618 receives the local vectors and global vectors generated by the local vector module 616 and global vector module 614, respectively. The local vectors are compared with the n global vectors {GV_1, GV_2, GV_3, …, GV_n} corresponding to each subdomain. Cos(LV, GV) computes the cosine similarity between local and global vectors. The global vector with which the local vector shows maximum similarity (argmax) is identified. The label and maximum similarity score are computed using the following equations.
$$Label = \arg\max\{\cos(LV_i, GV_i) : i \in \{1, 2, 3, \ldots, n\}\}$$

$$Score = \max\{\cos(LV_i, GV_i) : i \in \{1, 2, 3, \ldots, n\}\}$$
If the maximum similarity score is above a threshold θ, the tag (label) associated with the global vector is assigned as the final named entity label. Each subdomain is given a different threshold value. The threshold θ is defined as the average similarity score of seed phrases with the global vector.
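The naming step thus reduces to a cosine-similarity argmax with a per-subdomain threshold test, as in the following sketch (dictionaries keyed by subdomain tag are an assumed convention):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def label_argument(local_vectors, global_vectors, thresholds):
    """Return (label, score); falls back to OTHER below the subdomain threshold."""
    scores = {tag: cosine(local_vectors[tag], gv)
              for tag, gv in global_vectors.items()}
    best = max(scores, key=scores.get)          # argmax over the n subdomains
    if scores[best] >= thresholds[best]:        # per-subdomain threshold θ
        return best, scores[best]
    return "OTHER", scores[best]
```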
EXAMPLES
EXAMPLE 1: Implementation of NER for agricultural domain
Data collection and preprocessing:
A dataset was created for the agricultural domain by extracting sentences from recognized agricultural websites and Wikipedia. For data extraction, key phrases/seed names related to the agricultural subdomains, viz., Disease, Soil, Pathogen, and Pesticide, were collected from authorized websites. For crops, the existing AGROVOC dictionary was used, and geocoding functions were used for places. The American Phytopathological Society website was used for collecting the key phrases associated with plant diseases and pathogens. Information about soil and pesticides was collected from various recognized agriculture websites. Examples of key phrases used include {‘Black rot’, ‘Bunchy top’, ‘Enterobacter cloacae’, ‘Erwinia herbicola’, ‘Red soil’, ‘Hill soil’, …}. An independent data set pertaining to each of the agriculture subdomains (Disease, Soil, Pathogen, and Pesticide) was created. After data acquisition, preprocessing is done by splitting the data into sentences and removing stopwords and special characters.
Triplet extraction from received data:
Triplets may be identified by an Open Information Extraction (OIE) system. The number of triplets extracted from a simple sentence and a complex sentence may be one and five, respectively. After extracting the triplets, heuristic rules are utilized to check the validity of the extracted arguments, which are listed below:
Rule 1: Label the argument as OTHER if the argument length is greater than seven.
Rule 2: Label the argument as OTHER if it starts with a preposition.
Rule 3: Label the argument as OTHER if the argument starts with a linking verb.
In the case of a simple sentence, the heuristic rules are checked for the arguments, which are then passed on to the next module for further processing. However, in the case of a complex sentence, many of these arguments are invalid, such as:
1. ’Disease of grape vines affecting the above ground part of vine’
2. ’Disease of grape vines caused by a fungus Guignardia bidwellii’
3. ’Favored by warm weather also called grape rot’
4. ’Disease of grape vines caused by a fungus Guignardia bidwellii’.
The aforementioned arguments are removed as per Rule 1 and labeled as OTHER. Hence, out of ten arguments from the sentence, only four arguments, "fungus Guignardia bidwellii", "favored by warm weather", "grape rot", and "Vine", are selected for entity labeling based on the defined heuristic rules. The remaining entities are marked as OTHER and are not considered for further entity recognition.
Entity Recognition:
The entity recognition for agricultural domain-specific texts is shown in FIG. 7. Once the entities are obtained, each of these entities is labeled into one of the six categories or into the ’OTHER’ category. Initially, it is checked whether crop entities are present in the given input argument; AGROVOC dictionary lookup is used to recognize crops. If the given argument is not a crop entity, the exBERT_LDA+ model is used to detect the presence of the four major entities, viz., Disease, Soil, Pesticide, and Pathogen. If it fails to detect these entities, the presence of Place entities is checked using geocoding. The extended BERT creates domain-specific word vectors, and LDA outputs domain-specific topic and word distributions.
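The cascade of FIG. 7 may be sketched as control flow; agrovoc_lookup, exbert_lda_label, and is_place are assumed helper functions standing in for the AGROVOC dictionary lookup, the exBERT_LDA+ model, and geocoding, respectively.

```python
def recognize_agro_entity(argument, agrovoc_lookup, exbert_lda_label, is_place):
    """Crop lookup first, then the exBERT_LDA+ model, then place recognition."""
    if agrovoc_lookup(argument):           # 1) AGROVOC dictionary lookup for crops
        return "CROP"
    label = exbert_lda_label(argument)     # 2) DISEASE / SOIL / PATHOGEN / PESTICIDE
    if label != "OTHER":
        return label
    if is_place(argument):                 # 3) geocoding-based place recognition
        return "PLACE"
    return "OTHER"
```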
Subdomain word vectorization using word embedding with extended BERT (exBERT):
An independent training data set pertaining to each of the agriculture subdomains is vectorized using domain-specific word embeddings, as shown in FIG. 8. In order to generate the vector representations of inputs, BERT Base with an extended tokenizer is used. Once the training dataset is vectorized, the subdomain word vectors are stored in separate dictionaries named AgroWE_Disease, AgroWE_Pesticide, AgroWE_Pathogen, and AgroWE_Soil. These domain-specific word vectors are used for Global Vector and Local Vector creation.
Subdomain topic modeling using LDA:
Topic modeling is used to identify the important topics and their word distributions. There may be many cross-domain words that can occur commonly in these subdomains, and the importance of such words can vary across the subdomains. In order to get the importance of such words within a subdomain, a weighted distribution is needed, which is obtained using LDA. For example, if each of the four subdomains is considered as a topic, the word ’Sulfur’ may appear widely in the ’Pesticide’ topic rather than the ’Disease’ topic. The following hyperparameters were used for topic modeling: α, which controls the document-topic density; η, which controls the topic-word density; and the number of iterations. Other parameters, such as T, the number of topics required, K, the number of words required in a topic, and the number of word-weight distributions required to create the dictionaries for global/local vector creation, were also used.
The default value is used for the topic-word density hyperparameter η. The optimal value of α is computed based on the number of topics and the shape of the data. A low value of α assumes that each document will have only a few dominant topics, while most other topics will have very low probabilities of appearing in the document. A low value of η assumes that each topic will have only a few dominant words, while most other words will have very low probabilities of appearing in the topic. For all experiments, the number of iterations was set to 10. In LDA, the number of topics T is fixed based on the coherence scores, and the model with the highest coherence score is selected. The coherence score is defined as the average of the pairwise word-similarity scores of the words in a topic. Hence, in the subdomain models, the number of topics may vary depending upon the coherence score. For each subdomain, the LDA models were experimented with by varying the number of topics.
For example, in the case of the Soil and Pathogen models, the coherence score is high when the number of topics is six. But in the Pesticide and Disease models, the highest coherence score is achieved when the number of topics is three and seven, respectively. The topic words with corresponding weights are stored in four independent word-weight dictionaries corresponding to each subdomain, viz., WW_PA (Pathogen), WW_D (Disease), WW_PE (Pesticide), and WW_S (Soil). To create such dictionaries, the number of words required is another parameter to be identified. For this, all vocabulary words and their weights from the generated topics are collected. Then the words and weights are sorted in descending order.
TABLE 1: Distribution of words and weights across different subdomains

Local Vector generation:
The received input arguments are vectorized in the local vector module. For the input argument, four local vectors LV_S, LV_D, LV_PA, and LV_PE are created corresponding to the four subdomains. A weighted score to scale the importance of the words with respect to each subdomain is used. If Arg is an input argument with words {w_1, w_2, w_3, …, w_m}, the Local Vector (LV) is computed as the weighted average of all the m words in Arg using the following equation.
$$LV(Arg) = \frac{\sum_{i=1}^{m} W_{w_i} \cdot Vec(w_i)}{m}$$
In the equation, $w_i$ is the i-th word, and $W_{w_i}$ is the weight associated with the word $w_i$. The vector for each of the m words is extracted from the domain-specific word vectors AgroWE_Disease, AgroWE_Pesticide, AgroWE_Pathogen, and AgroWE_Soil. The weight is obtained from the weight dictionaries WW_PA (Pathogen), WW_D (Disease), WW_PE (Pesticide), and WW_S (Soil).
Global Vector generation:
The Global Vector represents the global significance of a subdomain in a document.
$$GV(Entity) = \frac{\sum_{j=1}^{T} \left( \sum_{i=1}^{K} W_{w_{ij}} \cdot Vec(w_{ij}) \right)}{T}$$

The Global Vector (GV) is computed as the average of all the subdomain topic vectors TWW_S, TWW_D, TWW_PE, and TWW_PA. This is repeated for each subdomain, hence producing four Global Vectors GV_S, GV_D, GV_PE, and GV_PA.
Named entity labeling for agricultural domain:
Once the local vectors are generated, they are compared with the four global vectors GV_S, GV_D, GV_PE, and GV_PA corresponding to each subdomain. The global vector with which the local vector shows maximum similarity (argmax) is identified.
$$Label = \arg\max\{\cos(LV_i, GV_i) : i \in \{1, 2, 3, \ldots, n\}\}$$
The maximum similarity score is computed using:
$$Score = \max\{\cos(LV_i, GV_i) : i \in \{1, 2, 3, \ldots, n\}\}$$

If the similarity score is above a threshold θ, the tag (label) associated with the Global Vector is assigned as the final named entity label. Otherwise, the input argument is passed on to place recognition to check for the presence of place entities. Each subdomain is given a different threshold value. The threshold θ is defined as the average similarity score of seed phrases with the Global Vector. For example, in the disease domain, the threshold value used is θ_disease = 0.8. Similarly, threshold values are calculated for the other domains, viz., θ_soil = 0.85, θ_pathogen = 0.75, and θ_pesticide = 0.70.
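Calibrating θ for one subdomain can be sketched as the mean cosine similarity of its seed phrases' local vectors to its global vector; to_local_vector is an assumed helper applying the LV equation above.

```python
import numpy as np

def subdomain_threshold(seed_phrases, to_local_vector, global_vec):
    """θ = average similarity score of seed phrases with the Global Vector."""
    sims = []
    for phrase in seed_phrases:
        lv = to_local_vector(phrase)
        sims.append(float(np.dot(lv, global_vec) /
                          (np.linalg.norm(lv) * np.linalg.norm(global_vec) + 1e-12)))
    return float(np.mean(sims))

# e.g. theta_disease = subdomain_threshold(disease_seeds, lv_fn, GV_D)  # ≈ 0.8 here
```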
As the entities Place and Crop are closely related, finding place entities is essential. Heuristic rules check for the presence of phrases such as ‘located’, ‘found’, ‘distributed’, ‘coastal’, ‘tract’, etc. in the relation phrase; along with geocoding, this helps in reducing false positives. If the input argument is not a place entity, it is labeled as ’OTHER’ by the NER model.
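One possible sketch of this place check, assuming the geopy library's Nominatim geocoder (the specification does not name a particular geocoding tool), combined with the relation-phrase keyword heuristic:

```python
from geopy.geocoders import Nominatim

PLACE_HINTS = {"located", "found", "distributed", "coastal", "tract"}
geocoder = Nominatim(user_agent="agro-ner-example")

def is_place(argument, relation_phrase=""):
    """Accept as place only if the relation hints at location and geocoding succeeds."""
    if not any(h in relation_phrase.lower() for h in PLACE_HINTS):
        return False
    return geocoder.geocode(argument) is not None
```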
Named entity Labeling Evaluation:
The NER model, which labels the six major entities, viz., Crop, Place, Disease, Pesticide, Pathogen, and Soil, is evaluated using the standard measures Precision, Recall, Accuracy, and F-Measure. In their standard form (with TP, FP, TN, FN denoting true/false positives and negatives):

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F\text{-}Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
The overall performance of the NER model is shown in FIG. 9A; it may be observed that the NER model presents a macro average F-measure of 80.43% in recognizing the six major entities. An entity level analysis is presented in FIG. 9B. High recall is observed in recognizing soil entities and the least for pathogen entities. This is mainly because the diseases and their causes (pathogens) are related entities. Hence some of the pathogen entities are classified as disease entities, which resulted in the low recall of pathogen entities. The second highest F-score is found for crop and disease entities, other than the "OTHER" entities. It is also noticed that the precision of recognizing pesticides is lower compared to other entities due to the presence of scientific terms. Hence the F-measure of the model in recognizing pathogen and pesticide entities drops.
The following six input sentences were considered for evaluation of the NER model:
S1. Bacterial blight is caused by Lasiodiplodia theobromae.
S2. Clayey soil and loamy soil are suitable for growing cereals like wheat and gram.
S3. Black soil is distributed across the districts of Coimbatore and Madurai.
S4. Abamectin is a widely used insecticide and anthelmintic.
S5. Atrazine is an herbicide widely used for control of broad leaf and grassy weeds.
S6. The pesticide Pyraclostrobin is used to protect Fragaria, Rubus idaeus, Vaccinium corymbosum.
The output for the aforementioned sentences by the NER model is provided in Table 2. It can be observed from Table 2 that there are 18 triplets with 21 unique arguments generated from the 6 sentences. Out of the 21 arguments, 14 arguments are correctly classified. The place and soil entities are correctly identified, while dictionary lookup failed for the two crop entities ’Gram’ and ’Fragaria’. One of the five pesticide entities, ‘Widely used for control of grassy weeds’, was labeled incorrectly, and the model failed to identify the two pesticide entities ’Abamectin’ and ’Anthelmintic’.

The cosine similarity scores obtained by all the arguments except crop and location entities are presented in Table 3.
TABLE 3: Cosine similarity scores obtained by all the arguments

The local vectors of the input phrases are compared against the four global vectors using the cosine similarity measure. The first column shows the example phrases to be labeled. The remaining four columns show the cosine similarity score of the input to the global vectors from each subdomain. The last column shows the output from the proposed model. For example, the first row shows the similarity score of the phrase ‘Bacterial Blight’ to the four different global vectors. Among these, the most similar is the global vector that belongs to the Disease subdomain, and hence the input phrase is labeled as a DISEASE entity. Similarly, the phrase ‘Lasiodiplodia theobromae’ is labeled PATHOGEN as it shows the highest similarity with the global vector that represents the Pathogen subdomain. The phrase ’grassy weeds’ is labeled as ’OTHER’, which means it does not fall into any of the four entities.
Analysis of Out of Vocabulary (OOV) Agro Entities Identified by NER model:
The percentage of new phrases recognized by the NER model was analyzed. For this, 1690 entity names were extracted from 700 test sentences. Table 4 indicates the percentage of phrases that are not in the training data but are identified as valid entities.
TABLE 4: Percentage of Out of Vocabulary (OOV) entities identified by the NER model

The model identified approximately 63% of out of vocabulary entities (disease, soil, pesticide, and pathogen entities) correctly. It can be observed from Table 4 that the maximum OOV entities are identified in the disease domain and the least in the soil domain.
NER model performance analysis based on length of entity:
To determine the influence of varying lengths of entities on the performance of the NER model, the input entities were divided into different groups. Four clusters for each entity with varying entity length were created, viz., one, two, three, and four plus. For example, an entity with length three was placed in cluster 3. Similarly, other clusters were created with corresponding lengths. For each cluster, the F-measure of the entities present in it is computed. The following table shows the individual entities and their F-measure based on entity length.
TABLE 5: F-Measure Based on Entity Length

It can be observed that at entity length one, the highest performance is obtained for place entities and the least for pathogen entities. The F-measure achieved for single-word entities in the disease and soil domains is relatively high. This is likely due to the prevalence of generic words such as soil, disease, symptoms, cause, sand, etc. in these domains as compared to other domains, which contain more scientific terminology. As a result, the vector representations produced by word embeddings may have been more effective in these domains. While moving from single words to double words, the F-measure increases for all entities except Crop and Place, where a small drop is observed. In multiword entities, if more words are present, the semantic similarity is usually observed to be better, specifically for Pathogen and Pesticide entities. A slight dip is observed in the F-measure for three-word entities within the soil and disease fields. For example, within the disease domain, words such as ’plant’, ’leaf’, and ’seeds’ are prevalent, and using combinations of these words in entity names increases the risk of generating false positives. Examples of three-word entity names within this category are ’plant including weeds’, ’rough sandpapery feeling’, etc. In the case of soil entities, there are very few entities whose length is greater than three, hence the F-measure increases. Hence, if the entity length is two or three, the performance of the NER model is better.
The named entity recognition model for domain-specific texts has several advantages over the prior art, as set forth herein. The hidden features present in raw text are automatically discovered using the topic modeling approach. Further, it is possible to generate semantically rich domain-specific word vectors. As no annotated dataset is used, the approach avoids the tedious task of manually labeling the dataset. The model provides domain-specific word vectors, which can be used for downstream applications. Additionally, the proposed approach is domain-independent and scalable, and may be applied in various domains with little modification. The NER model may be integrated with any Open Information Extraction system for domain-specific labeling of arguments; this would also facilitate better relation extraction and further automatic knowledge base creation.
Although the detailed description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. It should be appreciated that the scope of the invention includes other embodiments not discussed herein. Various other modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the system and method of the present invention disclosed herein without departing from the spirit and scope of the invention as described here, and as delineated in the claims appended hereto.
Claims: WE CLAIM:
1. A method (100) of named entity recognition for domain-specific texts, the method (100) comprising:
receiving (102) data from a plurality of resource domains at a preprocessing module for creating multiple datasets, wherein independent data sets are generated for each entity subdomain;
receiving (110) processed data from the preprocessing module for extracting triplets at an entity identification module, wherein the triplets include arguments and relations; and
labeling entities for the data received from the entity identification module into domain-specific classes at the entity recognition module, wherein labeling the entities comprises:
identifying (104) domain-specific word embeddings for the independent data set of each entity subdomain in a word dictionary module;
identifying (106) topics and word distribution for each entity subdomain data set with word-weights in a weight dictionary module;
performing (112) local vectorization of arguments as a weighted average of words in an argument in a local vector creation module;
performing (108) global vectorization of arguments by computing an average of all subdomain topic vectors in a global vector creation module;
comparing (114) local vectors with global vectors for each subdomain for computing cosine similarity between local and global vectors in a naming module; and
generating (116) a final named entity label based on its similarity score for each subdomain.

2. The method (100) as claimed in claim 1, wherein generating independent datasets for each entity subdomain comprises:
receiving (202) data from a plurality of recognized domain-specific sources;
performing (204) data extraction for the received data using seed names/key phrases pertaining to entities, wherein the data includes extracted sentences;
removing (206) stop words and special characters from the extracted sentences; and
generating (208) independent datasets for each entity subdomain.

3. The method (100) as claimed in claim 1, wherein performing identification of triplets to obtain entities comprises:
performing (302) syntactic simplification to convert compound sentences to simple sentences by splitting conjoint clauses, appositives, and relative clauses;
performing (304) extraction of triplets with arguments and relations between them, wherein arguments include noun phrases representing an entity; and
determining (306) the validity of extracted arguments with heuristic rules, wherein entities starting with a preposition or link verbs are filtered out.

4. The method (100) as claimed in claim 1, wherein performing vectorization for the independent data set of each entity subdomain using domain-specific word embeddings comprises:
generating (402) word representations from the contextual information of a document for vectors;
obtaining (404) semantically similar word vectors for a document; and
storing (406) word vectors in separate dictionaries for each domain.

5. The method (100) as claimed in claim 1, wherein identifying topics and word distribution for each entity subdomain data set with word-weights comprises:
determining (502) topics for each subdomain based on coherence score, wherein the coherence score is obtained from the degree of semantic similarity between high scoring topics;
obtaining (504) weights corresponding to the topics for storing them in dictionaries; and
arranging (506) word-weights in descending order till the weight converges to zero.

6. The method (100) as claimed in claim 1, wherein, for a similarity score of an argument above a threshold θ, the global vector label is assigned as the final named entity label.

7. The method (100) as claimed in claim 1, wherein validity of arguments obtained from the extracted triplets are determined using heuristic rules.

8. A system (600) for named entity recognition of domain-specific texts, the system (600) comprising:
a preprocessing module (604) for receiving data from a plurality of resource domains (602) for creating multiple datasets, wherein the datasets include independent training and test datasets for each subdomain of the domain;
an entity identification module (606) for receiving data from the preprocessing module and identifying triplets from arguments;
an entity recognition module (608) for labeling entities of data received from the entity identification module into domain-specific classes, the entity recognition module comprising:
a word dictionary module (610) for identifying domain-specific word embeddings for the independent data set of each entity subdomain;
a weight dictionary module (612) for identifying topics and word distribution for each entity subdomain data set with word-weights;
a local vector creation module (616) for performing local vectorization of arguments as weighted average of words in an argument for a subdomain;
a global vector creation module (614) for performing global vectorization of arguments by computing average of all subdomain topic vectors; and
a naming module (618) for comparing local vectors with global vectors for each subdomain and generating a final named entity label based on its similarity score for each subdomain.


Sd.- Dr V. SHANKAR IN/PA-1733
For and on behalf of the Applicants

Documents

Application Documents

# Name Date
1 202341037313-STATEMENT OF UNDERTAKING (FORM 3) [30-05-2023(online)].pdf 2023-05-30
2 202341037313-REQUEST FOR EXAMINATION (FORM-18) [30-05-2023(online)].pdf 2023-05-30
3 202341037313-REQUEST FOR EARLY PUBLICATION(FORM-9) [30-05-2023(online)].pdf 2023-05-30
4 202341037313-OTHERS [30-05-2023(online)].pdf 2023-05-30
5 202341037313-FORM-9 [30-05-2023(online)].pdf 2023-05-30
6 202341037313-FORM FOR SMALL ENTITY(FORM-28) [30-05-2023(online)].pdf 2023-05-30
7 202341037313-FORM 18 [30-05-2023(online)].pdf 2023-05-30
8 202341037313-FORM 1 [30-05-2023(online)].pdf 2023-05-30
9 202341037313-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [30-05-2023(online)].pdf 2023-05-30
10 202341037313-EDUCATIONAL INSTITUTION(S) [30-05-2023(online)].pdf 2023-05-30
11 202341037313-DRAWINGS [30-05-2023(online)].pdf 2023-05-30
12 202341037313-DECLARATION OF INVENTORSHIP (FORM 5) [30-05-2023(online)].pdf 2023-05-30
13 202341037313-COMPLETE SPECIFICATION [30-05-2023(online)].pdf 2023-05-30
14 202341037313-FER.pdf 2024-05-16
15 202341037313-RELEVANT DOCUMENTS [11-11-2024(online)].pdf 2024-11-11
16 202341037313-RELEVANT DOCUMENTS [11-11-2024(online)]-1.pdf 2024-11-11
17 202341037313-Proof of Right [11-11-2024(online)].pdf 2024-11-11
18 202341037313-PETITION UNDER RULE 137 [11-11-2024(online)].pdf 2024-11-11
19 202341037313-PETITION UNDER RULE 137 [11-11-2024(online)]-1.pdf 2024-11-11
20 202341037313-FORM-26 [11-11-2024(online)].pdf 2024-11-11
21 202341037313-FER_SER_REPLY [11-11-2024(online)].pdf 2024-11-11
22 202341037313-CORRESPONDENCE [11-11-2024(online)].pdf 2024-11-11
23 202341037313-FORM-8 [30-11-2024(online)].pdf 2024-11-30
24 202341037313-RELEVANT DOCUMENTS [04-04-2025(online)].pdf 2025-04-04
25 202341037313-POA [04-04-2025(online)].pdf 2025-04-04
26 202341037313-FORM 13 [04-04-2025(online)].pdf 2025-04-04

Search Strategy

1 SearchStrategyMatrix202341037313E_26-03-2024.pdf
2 D1_NPLE_26-03-2024.pdf