Abstract: Named Entity Recognition (NER) is a popular method for recognizing entities present in a text document. It is a natural language processing method that can automatically read whole articles, pull out the most important parts, and put them into predefined categories. In this article, an Attention-BiLSTM-DenseNet model for English NER is presented. The model works in three phases: data pre-processing, feature extraction, and the NER phase. During pre-processing, URLs, special symbols, usernames, and stop words are removed, and tokenization and normalization are performed on the datasets. In the next phase, the necessary features, namely domain weights, event weights, and textual similarity, are extracted. During the training phase of the Attention-BiLSTM-DenseNet, word embeddings are concatenated to obtain context, and optimal weight parameter coefficients are used to train the model parameters. Statistical measurements demonstrate that the proposed method achieves better precision, recall, accuracy, and F-measure than established methods such as LSTM-CRF and BiLSTM-CRF.
Description: Field of Invention
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans through natural language. It enables computers to understand, interpret, and generate human language in a valuable way. NLP encompasses various tasks such as text classification, sentiment analysis, machine translation, and named entity recognition (NER), among others. Named Entity Recognition (NER) is a specific task within NLP that involves identifying and classifying named entities in text into predefined categories such as names of people, organizations, locations, dates, and numerical expressions. NER systems play a crucial role in information extraction from unstructured text, enabling applications such as information retrieval, question answering, and entity-based search. By automatically identifying and categorizing named entities, NER systems help in extracting structured information from text, which is essential for various downstream applications in fields like finance, healthcare, and social media analysis.
Background of the Invention
Finding and classifying key nouns and proper nouns in a text is known as Named Entity Recognition (NER). Names of people, organisations, and locations are commonly mentioned in news stories. Applications such as Information Extraction, Question Answering, and Machine Translation all rely heavily on the recognition of named entities. Most NER systems use classes such as person (PER), location (LOC), and organisation (ORG), along with a more broadly defined miscellaneous (MISC) class. These classes are strongly connected to news-related corpora; NER systems trained and tested in other domains are likely to use class labels from those domains as well. For sequence labelling, BIO labelling is the most widely used representation: each token is labelled as the Beginning of a named entity, Inside one, or Outside any entity. Alternatively, the BILOU format adds U and L labels for Unit-length NEs and the Last token of a multi-token entity, respectively, as illustrated below.
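By way of illustration, the following minimal Python sketch shows the same hypothetical sentence tagged under both schemes; the sentence and tags are illustrative examples, not drawn from any particular corpus:

```python
# Illustrative BIO vs. BILOU tagging of one sentence (hypothetical example).
tokens = ["John", "Smith", "works", "at", "Acme", "Corp", "in", "Paris"]

# BIO: B- marks the Beginning of an entity, I- a token Inside it, O is Outside.
bio   = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]

# BILOU adds L- for the Last token of a multi-token entity
# and U- for a Unit-length (single-token) entity.
bilou = ["B-PER", "L-PER", "O", "O", "B-ORG", "L-ORG", "O", "U-LOC"]

for tok, b, bl in zip(tokens, bio, bilou):
    print(f"{tok:6s}  BIO={b:6s}  BILOU={bl}")
```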
The extraction of relevant information from the web also enables other research activities, such as semantic and sentiment analysis between entities. Existing search engines and digital research repositories can benefit from NER by providing researchers and users with more advanced parameterized search options based on author names, domains, research questions, and so on. Textual context and word sequence are critical to NER's success. For instance, "Washington" can refer to a location or the name of a person, and its correct interpretation requires an examination of the surrounding context. Hence, sequence labelling is a popular method for resolving NER. The main aim of NER is the automatic identification and classification of names from text into specified categories.
Progress in the field of Natural Language Understanding (NLU) has been made over the past decade, and the proposed systems use various NE methodologies and techniques that fall into three broad categories: rule-based techniques, Machine Learning (ML), and hybrid approaches. The advantage of ML techniques is that the system can be trained and extended easily to different language domains (US11475209B2).
Rule-based techniques involve manually creating rules for a given field, which is time-consuming and does not allow for generalization to other fields. Statistically-based algorithms instead use artificially labelled corpora for training, turning the NER job into a sequence labelling task. Statistics-based methods are flexible and do not necessitate as many hand-designed rules as rule-based methods do; they thus gradually became the norm until deep learning took hold. All 16 NER systems competing at the CoNLL-2003 conference used statistical approaches.
The second type of NER system comprises learning-based approaches. These models replace the human-curated rules of the first category. Unsupervised, semi-supervised, and supervised approaches can all be found in this category. In supervised and semi-supervised training, a machine-learning model is trained on examples of inputs and their intended outputs.
To get the best of both worlds, the hybrid approach combines several named entity recognition approaches. Combining the results of more than one machine learning approach is different from a hybrid approach that combines rule-based and dictionary-based methods; the latter generally yields a better outcome. The drawbacks of rule-based classification are still evident in the hybrid system. A domain-specific named entity system could potentially change the normal methods used to recognise different forms of NE when the training domain is changed. Such processing systems can begin with either rule-based or machine-learning-based methods. Consequently, rule-based approaches can be used to deal with the issue of overlapping entities, while machine learning algorithms estimate the strength of co-occurrences of multiple-word entities (RU2722571C1).
Ling et al. proposed that document-level Chemical-NER can be performed using a BiLSTM-CRF model. The method makes use of global document-level information gleaned from attention mechanisms to ensure that the same token is tagged consistently across different instances in the document. It outperforms other state-of-the-art methods on both the BioCreative IV CHEMDNER task corpus and the Chemical-Disease Relation (CDR) task corpus with less feature engineering (F-scores of 91.14 and 92.57 percent, respectively). Guohua et al. developed a model for the Chinese CNER task (Att-BiLSTM-CRF) that helps to alleviate some of these issues: direct connections between characters help a self-attention mechanism learn long-term dependencies.
Qimin et al. used a deep learning approach to classify and analyse implicit emotions. LSTM and BiLSTM models are investigated, both of which use different dropout rates and the ensemble method to improve classification performance, and both are equipped with attention mechanisms. Experiments show that the system is effective in identifying and classifying implicit emotions. Minsoo et al. created a Bi-LSTM with CRF model, using a CNN and a Bi-LSTM to retrieve character-level representations. The proposed model is compared to existing ones on two widely used datasets, JNLPBA and NCBI-Disease. On NCBI-Disease, the proposed model achieved an F1-score of 86.93 percent, while its F1-score on JNLPBA was a competitive 75.31 percent.
Hesheng et al. worked on recognizing named entities with the help of deep learning models. The Bi-LSTM-CRF model is compared and analysed using a variety of annotation methods; a higher F1 value was found when training the model on the named entity's word sequence labelling corpus than on the character sequence labelling corpus. Both place names and organisation names are better represented by the Bi-LSTM-CRF model using word segmentation than by the model using character segmentation, with F1 values that are 67.60 percent and 89.5 percent higher, respectively.
Rongen et al. presented a pre-trained XLNet, a bi-directional long short-term memory (Bi-LSTM) network, and a conditional random field (CRF) to improve the NER's efficiency. Using the pre-trained XLNet model, sentence features are first extracted and then combined with the classic NER neural network model. The results suggest that XLNet is superior to its competitors in NER tasks; they were obtained by comparing the XLNet-BiLSTM-CRF against baselines on datasets from the CoNLL-2003 English release as well as WNUT 2017.
Deng et al. studied how named entities in the abstract texts of TCM patents can be automatically recognised using a method that combines a Bi-directional Long Short-Term Memory neural network with a Conditional Random Field (BiLSTM-CRF). The semantic data in the context is learned using deep learning methods without the need for feature engineering. Qin et al. used Bi-LSTM networks and a CRF layer to identify three variants of clinical named entities. The word representations fed into the neural networks combine CBOW training on domain and non-domain corpora with character-based word embeddings. Tested on the i2b2/VA open datasets, the model achieves the best NER F1 value of 0.8537 and has been compared against six related works.
Summary of the Invention
Named Entity Recognition (NER) is a popular method for recognizing entities present in a text document. It is a natural language processing method that can automatically read whole articles, pull out the most important parts, and put them into predefined categories. In this study, an Attention-BiLSTM-DenseNet model for English NER is presented. The model works in three phases: data pre-processing, feature extraction, and the NER phase. During pre-processing, URLs, special symbols, usernames, and stop words are removed, and tokenization and normalization are performed on the datasets. In the next phase, the necessary features, namely domain weights, event weights, and textual similarity, are extracted. During the training phase of the Attention-BiLSTM-DenseNet, word embeddings are concatenated to obtain context, and optimal weight parameter coefficients are used to train the model parameters. Statistical measurements demonstrate that the proposed method achieves better precision, recall, accuracy, and F-measure than established methods such as LSTM-CRF and BiLSTM-CRF.
Brief Description of Drawings
Figure 1: Architecture of the proposed model
Detailed Description of the Invention
Extracting named entities (NEs) from unstructured text content such as news, articles, and social remarks is a challenging task. There has been a lot of research on the NER framework. Recent years have seen a shift in NER's focus towards the development of deep neural networks and the improvement of pre-trained word embeddings. In this study, an Attention-BiLSTM-DenseNet model for English NER has been presented. The model works in three phases: data pre-processing, feature extraction, and the NER phase. During pre-processing, URLs, special symbols, usernames, and stop words are removed, and tokenization and normalization are performed on the datasets. In the next phase, the necessary features, namely domain weights, event weights, and textual similarity, are extracted. During the training phase of the Attention-BiLSTM-DenseNet, word embeddings are concatenated to obtain context, and optimal weight parameter coefficients are used to train the model parameters. Statistical measurements demonstrate that the proposed method achieves better precision, recall, accuracy, and F-measure than established methods such as LSTM-CRF and BiLSTM-CRF.
Many models, such as HMM and RF, exist to aid in the examination of the findings obtained from employing NER. Extracting the relevant details from sparse representations is a significant task for the machine, and estimating these parameters manually is a coarse, laborious process that necessitates a big sample size. A deliberate analysis of the limited representation of data further supports this. Labelled text requires token-level annotation, so details of the linguistic expert's labelling scheme are necessary. Unfortunately, annotated resources for many varieties of speech are proprietary or restricted, but this can be mitigated by using the designated corpus integrated into this work. Other researchers can take advantage of one of the largest annotated corpora yet developed for the English language. To facilitate continuous work on the annotated corpus, which is necessary for supervised learning, a framework has been constructed. This module gives researchers the opportunity to generate their own labelled corpus with minimal manual intervention.
In the pre-processing stage, the acquired input data is cleaned of any extraneous text and converted into a numerical formulation so that accurate event predictions can be made. First, URLs are discarded: URLs in the gathered database are not needed for named entity recognition and are removed. Special-symbol removal is completed next; in this stage, certain symbols (such as punctuation marks) that are not necessary for prediction are eliminated. The next step is username removal, a procedure used to strip the username associated with the @ symbol. Tokenization is the next step.
Normalisation is the next step. The normalisation process is activated when there is white space in the gathered sentence. Input data can be transformed into a more precise dataset using normalisation: words are transformed into a standard layout using data-manipulation procedures. By taking into account variables such as abbreviations, writing style, and synonyms of specific terms, normalisation improves text matching. The next step is stop word removal, the process of omitting commonly used words like "an," "the," "a," "this," and "that" from the input text. Finally, stemming is performed: after determining the word's suffix and prefix, the word is reduced to its original root. The accurate inflection types of some words can be converted to a common source by removing the prefixes and suffixes from each word; a normal root, or root segment, is one that satisfies these criteria and also contains the other components. In this way, the pre-processing stage cleans up the input text, gathers the necessary texts, and resolves word synonyms. A minimal sketch of the full pipeline follows.
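A minimal Python sketch of this pre-processing pipeline is given below. The regular expressions, the small stop-word list, and the crude suffix-stripping stemmer are illustrative stand-ins, not the invention's actual resources:

```python
import re

# Illustrative stop-word list and suffixes; a real system would use fuller resources.
STOP_WORDS = {"a", "an", "the", "this", "that", "is", "at", "of", "in"}
SUFFIXES = ("ing", "ed", "es", "s")

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove @usernames
    text = re.sub(r"[^\w\s]", " ", text)        # remove special symbols/punctuation
    text = text.lower()                         # normalisation to a standard layout
    tokens = text.split()                       # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    stems = []
    for t in tokens:                            # crude stemming: strip a known suffix
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

print(preprocess("Check https://example.com @user Protesters are MARCHING in the streets!"))
```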
Feature extraction is the next step. This is a crucial step in determining the likelihood of a demonstration leading to civil disturbance based on the provided text data. Event weight, domain weight, geographical similarity, textual similarity, temporal similarity, and RDTFD features are among the features extracted in this research. The domain weight measures how effectively a word designates the specified domain. It is determined by multiplying the normalised term frequency of the word in the targeted domain set by the inverted text frequency of the word in the open domain set, when both sets of data are available as Twitter tweets. The event weight is used to quantitatively differentiate an event from the other events in the same domain. It is computed by multiplying two components: the frequency of the term in the event's text and the inverse text frequency of the word. Textual similarity compares names and domains linguistically; terms such as the name weight sum and the domain weight sum contribute to it. Spatial similarity between relevant words is determined from the distance between the site where an event took place and the location of the relevant word. When calculating temporal similarity, the initial flurry of words is taken into account based on the specific occurrence; a Poisson model is used to further de-emphasise name-related words, indicating the likelihood of a specific word being detected within a Poisson process. A sketch of the domain-weight computation is given below.
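The following hedged sketch illustrates the domain-weight computation described above. The exact normalisation used in the invention is not specified here, so a standard term-frequency times inverse-document-frequency form is assumed:

```python
import math
from collections import Counter

def domain_weight(word, domain_docs, open_docs):
    """Normalised term frequency in the targeted domain set, multiplied by the
    inverted document frequency of the word in the open-domain set (assumed form)."""
    domain_tokens = [t for doc in domain_docs for t in doc]
    tf = Counter(domain_tokens)[word] / max(len(domain_tokens), 1)
    containing = sum(1 for doc in open_docs if word in doc)
    idf = math.log((1 + len(open_docs)) / (1 + containing))
    return tf * idf

# Illustrative tokenised tweet sets (hypothetical data).
domain_set = [["protest", "march", "city"], ["protest", "rally"]]
open_set = [["weather", "sunny"], ["protest", "news"], ["sports", "score"]]
print(domain_weight("protest", domain_set, open_set))
```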
There are a number of ways to represent words and sentences numerically, including word embeddings. For example, embeddings can reduce the dimensions of a large sparse vector while preserving semantic relationships. Using word embeddings, the individual words of a domain or language can be represented as real-valued vectors in a more compact space. Mapping high-dimensional data into a lower-dimensional space solves the sparse-matrix problem of a bag of words (BOW), and placing semantically similar items close to each other solves BOW's lack of meaningful relationships.
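As one possible way of obtaining such embeddings (the invention does not fix a specific embedding method), the sketch below trains a small Word2Vec model with gensim; the toy sentences are illustrative:

```python
# One possible way to obtain dense word vectors; gensim's Word2Vec is used
# here purely as an example, not as the invention's prescribed method.
from gensim.models import Word2Vec

sentences = [
    ["protesters", "marched", "in", "washington"],
    ["washington", "signed", "the", "order"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)

vec = model.wv["washington"]          # 50-dimensional dense vector
print(vec.shape)                      # (50,)
print(model.wv.most_similar("washington", topn=2))
```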
When a neural network has a bi-directional long short-term memory (Bi-LSTM), information can flow in both directions: forward (past to future) and backward (future to past). The main difference between a bi-directional LSTM and a regular LSTM is that in a Bi-LSTM the input is fed in both directions, whereas there is only one possible direction of input flow in a standard LSTM. The bi-directional model thus preserves both current and historical context.
The Bi-LSTM model comprises two LSTMs: one takes the input data in the forward direction, and the other takes the data in the backward direction. BiLSTMs are a useful addition for improving the context in which an algorithm can make decisions, as the sketch below illustrates.
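A minimal sketch of such a Bi-LSTM encoder follows, using PyTorch purely as an example framework; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Minimal Bi-LSTM encoder sketch (embedding and hidden sizes are assumptions).
EMB_DIM, HIDDEN = 50, 64

bilstm = nn.LSTM(input_size=EMB_DIM, hidden_size=HIDDEN,
                 bidirectional=True, batch_first=True)

x = torch.randn(8, 20, EMB_DIM)   # batch of 8 sentences, 20 token embeddings each
out, _ = bilstm(x)                # forward and backward passes run internally
print(out.shape)                  # torch.Size([8, 20, 128]): 2 * HIDDEN per token
```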
The study of human vision paves the way for our current understanding of how attention works. Cognitively, people select some visual information to focus on and read while discarding other visually apparent information. Each output data point in the neural network model has a different significance. Using the attention mechanism, the weight of important data is increased while the weight of noise data is decreased. In the proposed model, the spliced output of the BiLSTM is combined with an attention mechanism, which identifies local features while simultaneously extracting global ones. The combined features are then obtained by applying an appropriate weight to each feature vector.
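The sketch below shows one common way to weight and pool BiLSTM outputs with a learned attention layer; the invention's exact scoring function is not specified, so a simple linear projection followed by a softmax is assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """One common attention formulation over BiLSTM outputs (assumed scoring:
    a learned linear projection; not the invention's prescribed function)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                            # h: (batch, seq_len, dim)
        weights = F.softmax(self.score(h), dim=1)    # importance of each token
        context = (weights * h).sum(dim=1)           # weighted global feature
        return context, weights

attn = AdditiveAttention(128)
context, w = attn(torch.randn(8, 20, 128))
print(context.shape, w.shape)     # torch.Size([8, 128]) torch.Size([8, 20, 1])
```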
Dense layers are neural network layers in which every neuron is connected to every neuron in the previous layer; this is the most common layer in artificial neural networks. The neurons of a dense layer perform a matrix-vector multiplication on the outputs of the previous layer: each row of the dense layer's weight matrix is multiplied by the previous layer's output column vector, which requires the number of columns of the row vector to match the number of rows of the column vector. An illustrative classification head built from dense layers is sketched below.
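The sketch below shows an illustrative dense classification head over per-token features; the layer sizes and the tag set are assumptions, not the invention's actual configuration:

```python
import torch
import torch.nn as nn

# Illustrative dense classification head; sizes and tag count are assumptions.
NUM_TAGS = 9                               # e.g. BIO tags over PER/LOC/ORG/MISC + O

head = nn.Sequential(
    nn.Linear(128, 64),                    # every output connects to every input
    nn.ReLU(),                             # activation between dense layers
    nn.Linear(64, NUM_TAGS),               # per-token scores over the tag set
)

token_features = torch.randn(8, 20, 128)   # BiLSTM-attention features per token
logits = head(token_features)
tags = logits.argmax(dim=-1)               # predicted tag index for each token
print(logits.shape, tags.shape)            # torch.Size([8, 20, 9]) torch.Size([8, 20])
```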
Claims:
The scope of the invention is defined by the following claims:
1. A system/method for identifying Named Entities in English text using an Attention-Based Bi-LSTM DenseNet Model, comprising the steps of:
a) a method for pre-processing and feature extraction to remove noisy data and select the best features;
b) a method for assigning weights to achieve the best named entity classification;
c) an architecture designed to classify Named Entities from the pre-processed English text.
2. The system/method for identifying Named Entities in English text using an Attention-Based Bi-LSTM DenseNet Model as claimed in claim 1, wherein word embedding is used to identify the feature vectors from the given input text.
3. The system/method for identifying Named Entities in English text using an Attention-Based Bi-LSTM DenseNet Model as claimed in claim 1, wherein the attention-based Bi-LSTM model is used to assign the weights for the selected feature vectors.
4. The system/method for identifying Named Entities in English text using an Attention-Based Bi-LSTM DenseNet Model as claimed in claim 1, wherein the DenseNet with an activation function is used to classify the named entities.