System/Method For Enhanced Concept Map Generation From Domain Text Using Deep Learning Techniques

Abstract: A concept map is a loose semantic knowledge representation graph that effectively organizes, represents, and visualizes knowledge present in a text. The concept map has to provide an overview of the document that is concise and effortless to understand. Representing and organizing the concepts along with their importance and with maximum information content is a vital aspect of the concept map. Concept maps have been constructed and used for many text mining applications, including summarization. In this work, machine learning algorithms and deep learning techniques are formulated to enhance the concept map and produce a concise and precise concept map from domain text. A new unsupervised graph-based algorithm capable of directly extracting domain phrases precisely, without the use of domain resources, is designed. A biased random walk-based overlapping small community detection algorithm is used on the graph-of-words of the domain corpus to identify the proper set of concepts that forms the domain vocabulary. The proposition extraction process is improved by the neural open information extraction method with a tree-based convolution neural network; the use of advanced deep learning techniques alleviates the need for feature engineering in proposition extraction. In addition, precise argument extraction is achieved with dependency tree processing and the domain vocabulary generated using unsupervised graph-based key phrase extraction. Finally, the concept importance and topic identification needed for the enhanced concept map are obtained by sub-topic detection based on partitional clustering using K-means and HDP.

Patent Information

Application #: 202541060941
Filing Date: 26 June 2025
Publication Number: 28/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE

Applicants

MLR Institute of Technology
Hyderabad

Inventors

1. Mrs. S.Viharika
Department of Information Technology, MLR Institute of Technology, Hyderabad
2. Mrs. J. Adilakshmi
Department of Information Technology, MLR Institute of Technology, Hyderabad
3. Dr. Venkata Nagaraju Thatha
Department of Information Technology, MLR Institute of Technology, Hyderabad
4. Mr. B. VeeraSekharReddy
Department of Information Technology, MLR Institute of Technology, Hyderabad

Specification

Description: Field of Invention
Concept map mining refers to the automatic or semi-automatic creation of concept maps from text. In general, given a document as input, concept map mining creates a concept map where every node represents a concept designated by a unique label. Concepts and relations are chosen such that the concept map represents the maximum information content of the document. Concept map mining from text documents is carried out by manual, semi-automatic, and automatic approaches. This invention concerns enhanced concept map mining from domain text using machine learning and deep learning techniques. The enhancements through appropriate domain-oriented concept map mining need to include increasing the information content of concepts and relations, ensuring extraction of only precise concepts and relations by removing redundant and ambiguous ones, organizing concepts by topic, and determining concept and relation importance with respect to the document as well as the domain.
Background of the Invention
Concept map mining involves concept extraction, relation extraction, removal of duplicates, and scoring the concepts to organize them into a useful knowledge graph. Concepts refer to significant phrases or terms in the document, and relations are the terms connecting concepts to form a concept graph. The graph must be concise, precise, and unambiguous, so redundant concepts need to be removed. Finally, the selected concepts are scored and organized to represent the document graphically with maximum information content. The information content is a measure of the entropy of the graph and indicates how well the graph represents the document. To mine concept maps from lecture notes, NLP techniques are applied to extract concepts, while statistical measures such as term frequency, co-occurrence, and proximity are used to boost concept extraction. Another method to extract concepts from teaching materials employs hypothesis testing to validate domain pertinence and extract the terms as concepts. The above methods utilize statistical measures to validate the concepts and do not use semantic measures to find similar concepts and remove duplicates (US11080336B2).
Key phrases extracted from documents are used as concepts in various NLP tasks such as information retrieval, summarization, question answering, and concept map mining. Usually, key phrases are identified based on statistical, syntactic, cue, or semantic features. Statistical features are those calculated from the data to weigh the significance of the terms. Syntactic features are based on the grammar, i.e., the sentence structure of the language, to assess the consistency of the terms. Cue features are certain pragmatic words or characters whose presence enhances the probability that a term is a key phrase. Semantic features are focused on the sense or context of the term, thus providing correctness and relatedness between terms. The features used for extracting key phrases can be divided into internal (corpus-based) and external (resource-based) features. Internal features are extracted from the corpus itself, and external features are computed from additional resources.
Early supervised key phrase extraction methods treat key phrase extraction as a binary classification problem that classifies given phrases as key phrases or non-key phrases. The process trains the selected classifier on manually annotated documents containing both positive and negative samples to determine whether a candidate phrase is a key phrase. Algorithms used to train the classifier include naïve Bayes, decision trees, bagging, boosting, maximum entropy, multi-layer perceptrons, support vector machines, and Conditional Random Fields (CRF). A typical early supervised method treats the problem as a classification task, uses natural language processing to pre-process the candidate phrases, and weighs each phrase by its TF-IDF score and the position of its first occurrence.
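By way of illustration, the following minimal Python sketch shows supervised key phrase extraction as a binary classification task of this kind, using a naïve Bayes classifier over the two classic features named above (TF-IDF score and relative first occurrence). The feature values and candidate phrase are illustrative assumptions, not part of the claimed method.

```python
# A minimal sketch of early supervised key phrase classification.
# Features and training data are illustrative, not the patented method.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def candidate_features(doc_tokens, phrase, tfidf_score):
    """Two classic features: TF-IDF weight and relative first occurrence."""
    first = doc_tokens.index(phrase.split()[0]) / len(doc_tokens)
    return [tfidf_score, first]

# X: feature vectors of annotated candidates; y: 1 = key phrase, 0 = not
X = np.array([[0.42, 0.05], [0.10, 0.80], [0.35, 0.12]])
y = np.array([1, 0, 1])
clf = GaussianNB().fit(X, y)

tokens = "concept map mining organizes domain knowledge".split()
x_new = candidate_features(tokens, "concept map", tfidf_score=0.40)
print(clf.predict([x_new]))  # classify an unseen candidate phrase
```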
Due to the limitations of supervised approaches, unsupervised key phrase extraction has been preferred, since it can be applied to larger datasets without any training and complex models can be learned. In unsupervised approaches, key phrase extraction is carried out using statistical measures, clustering, NLP, language modeling, simultaneous learning, and graph-based methods. Clustering-based methods group candidate key phrases into topics based on external resources, Latent Dirichlet Allocation (LDA), or graph partition (community detection) techniques. Statistical-measure-based methods compute the TF-IDF value of a phrase across multiple corpora and use it to identify key phrases, based on the fact that the distribution of a phrase varies with the domain. As it relies only on frequency, this approach suffers in the presence of co-referent and synonymous terms. Clustering-based key phrase extraction utilizes the k-means algorithm to cluster key phrases based on an association score between the general corpus and Wikipedia articles, where Wikipedia titles are used as examples (CN115034201A).
Concept map mining employs relation extraction and proposition extraction to generate relational tuples. The relation extraction process takes pairs of concepts as input and tries to extract the relation between them employing statistical, NLP, machine learning, and deep learning techniques. Statistical methods extract words frequently co-occurring with the concepts as relations, while NLP-based methods extract linking verbs and verb phrases connecting pairs of concepts as relations. POS tags, constituent parsing, or dependency parse labels are used to extract verbs and verb phrases as relations.
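The following minimal Python sketch illustrates the NLP-based variant: linking verbs between concept pairs are read off the dependency parse labels. It assumes spaCy with its small English model, and the subject/object label set is deliberately simplified; it is not the claimed extraction method.

```python
# A minimal sketch of dependency-label based relation extraction.
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_relations(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                # subject and object children of the linking verb
                subj = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                obj = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                if subj and obj:
                    triples.append((subj[0].text, token.lemma_, obj[0].text))
    return triples

print(verb_relations("A concept map organizes knowledge in a document."))
# e.g. [('map', 'organize', 'knowledge')]
```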
Direct proposition extraction methods simultaneously extract concepts along with the relation using rule-based, clause-based, and learning-based methods. The rule-based approach uses various parsers to identify the structure of the sentence and then applies relation-based rules to extract proposition triples. The clause-based approach restructures sentences into independent clauses based on dependency relations. The extracted clauses are classified based on the verb and structure of the clause, and one or more propositions are finally extracted from the sentences. Inter-proposition relations are identified by adding attribute context based on dependency relations or rules. Machine learning-based methods extract features from different parsers and employ NLP tools to extract candidate triples. Classification algorithms such as naïve Bayes, support vector machines (SVM), and logistic regression are utilized to identify the triples.
The construction of a concept map from the extracted proposition triples involves removing duplicate concepts, scoring the concepts, and organizing them effectively. Duplicate concept identification helps remove redundant concepts and is useful in creating a concise and unambiguous concept map. Similar concepts, i.e., synonymous and co-referent concepts, are identified by finding morphological variations, by string matching that ignores tokens based on POS, by using generic or domain-specific resources, and by machine learning approaches such as clustering based on concept vectors. The concepts are scored and ranked based on their significance in the document. Term frequency (TF), TF-IDF, Concept Frequency-Inverse Document Frequency (CF-IDF), and LSA are used to find the statistical significance of the concepts. Graph-based methods such as the hub-authority root distance model and node removal based on significance are used to identify the significance of a concept with respect to other concepts. Redundant concept removal also helps boost the scores of the remaining concepts. Finally, the concepts are organized either hierarchically or in a spider-web fashion, based on the objective of the application for which the concept map is generated.
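The vector-based route to duplicate removal can be sketched as follows: concepts whose vector representations exceed a cosine-similarity threshold are merged, keeping the higher-scored label. The vectors, scores, and threshold below are illustrative assumptions.

```python
# A minimal sketch of redundancy elimination via concept-vector similarity.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_duplicates(concepts, vectors, scores, threshold=0.85):
    keep = []
    for i, _ in enumerate(concepts):
        dup = next((j for j in keep
                    if cosine(vectors[i], vectors[j]) >= threshold), None)
        if dup is None:
            keep.append(i)                      # genuinely new concept
        elif scores[i] > scores[dup]:
            keep[keep.index(dup)] = i           # keep the higher-scored label
    return [concepts[i] for i in keep]

vecs = np.array([[1.0, 0.0], [0.98, 0.1], [0.0, 1.0]])
print(merge_duplicates(["CNN", "convolutional net", "pooling"], vecs, [2, 3, 1]))
# ['convolutional net', 'pooling']
```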
Open IE systems extract relational tuples employing shallow syntactic features such as parts-of-speech tag patterns and phrase chunking patterns, or shallow semantic features such as the dependency parse of the sentence. Open IE systems are classified into learning-based, rule-based, clause-based, and proposition-based systems. Learning-based systems employ a classifier with distant or no supervision to identify relations; from the feature set, the model learns the probability that an extracted candidate is a relational tuple. Rule-based systems manually generate rules based on POS tags or the output of various parsers, and the rules are then applied to the text to extract relations. Clause-based systems restructure complex sentences into clauses and process them to extract precise relations; propositions are extracted from the clauses by applying patterns based on the verb (relation) and its arguments. The final approach adds context to the relation, which makes the relational tuples true and accurate. Inter-proposition relations are extracted by adding attribution context based on dependency relations or rules.
The tree kernel-based Open IE system extracts relational tuples employing the dependency tree structure of the sentence. The method validates the existence of a relation between a pair of entities and extracts them. The entities and relations are identified using POS and are validated using the SVM classifier. False positives are generated due to the use of POS in selecting the entities and relations. Another clause-based Open IE system extracts relational tuples from the sentence by processing the dependency parse tree. Independent clauses are extracted from the dependency tree by recursively navigating and searching the nodes of the tree. The precise and informative propositions are extracted from these compact clauses using dependency patterns depending on the task. The method suffers from over-generation of triples since a large number of clauses are identified based on the search.
Summary of the Invention
In this work, an integrated framework for enhanced concept map generation from domain text, without the use of knowledge-based resources, is developed. The employment of a suitable, richer context-based vector representation of text (linear, dependency, and topical context-based joint embedding) for sub-tasks of concept map mining, such as proposition extraction, finding similar terms, and concept organization, improved performance. The unsupervised graph-based overlapping small community detection algorithm extracted domain key phrases efficiently. The domain dictionary generated from the domain corpus was thus used as an additional resource for precise proposition extraction. Employing the domain text itself as a resource, without an external repository, is advantageous and adaptable to newer domains. The advanced TBCNN deep learning technique produced precise and concise propositions and alleviated the need for feature engineering in proposition extraction. The enhancement based on redundancy elimination, along with the association of concepts, increased the information content of the concept map and helped generate a concise and unambiguous domain concept map. An informative summary was produced using cue features from the enhanced concept maps, while learning-to-rank based re-ranking of concepts generated a domain-focused one. This validated the efficacy of the enhanced concept map and its resourcefulness for various downstream applications.
Brief Description of Drawings
Figure 1: Enhanced Concept Map Generation Process
Figure 2: Proposition extractions using Tree-based convolution neural network
Figure 3: Enhanced concept map Generation using K-means and HDP Clustering

Detailed Description of the Invention
Concept map generation involves concept extraction followed by relation extraction or direct proposition extraction, finding similar concepts for redundancy elimination, concept and relation scoring, and concept map construction. The enhancements carried out in concept map generation from domain text are shown in Figure 1. The concept and relation extraction process is enhanced by the neural open information extraction method with the Tree-based Convolution Neural Network. In addition, proper argument extraction is boosted with the domain vocabulary generated using unsupervised graph-based key phrase extraction. A random walk-based overlapping small community detection algorithm is used on the graph-of-words to identify the proper set of concepts from the domain corpus. Concept and relation scoring is improved via duplicate removal and by associating concepts based on other latent relations. Duplicate removal is improved via the use of distributed word representation models and corresponding similarity measures. The number of links between concepts determines the information content of the map; hence the added topical and semantic links between concepts increase it proportionately. Enhanced concept map generation is based on organizing the concepts according to those relations and representing them with the informative features mentioned above. A summary of the enhanced concept map is generated to exhibit its usefulness, and the LambdaRank learning-to-rank algorithm is used to extract a domain-adapted summary using a domain repository.
The goal is to formulate an unsupervised direct method for domain key phrase extraction from a domain corpus without using any external resources for domain knowledge. In this work, a random walk-based algorithm detects domain key phrases in the graph-of-words. The graph-of-words is built from the domain corpus and weighted using statistical, syntactic, and semantic measures. Key phrases are treated as overlapping small communities, which are detected using the entropy of the random walk. If the entropy of the walk is less than a set threshold, the walk is considered to lie within overlapping small communities. All possible small communities, i.e., key phrases, are extracted from it based on the start and stop probabilities of the nodes in the graph. The method learns the boundary of the key phrases based on POS probabilities and cohesion, and domain pertinence using topic cohesion. The extracted small communities are scored and ranked to find the domain-pertinent key phrases. The overall flow of the algorithm is given in Figure 2. The major contributions are unsupervised graph-based key phrase extraction using a small community detection algorithm and incorporating domain specificity for the nodes of the graph using the hierarchical Latent Dirichlet Allocation topic modeling method. This enables the identification of domain key phrases without using any domain-specific knowledge bases and detects infrequent key phrases while reducing false positives.
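The entropy test at the heart of this step can be sketched as follows: a biased random walk over the weighted graph-of-words whose visit-distribution entropy falls below a threshold is taken to be trapped in a small community, i.e., a candidate key phrase. The toy graph, edge-weight bias, and threshold below are illustrative assumptions, not the full claimed algorithm (which further uses start/stop probabilities and POS-based boundary learning).

```python
# A minimal sketch of the entropy-of-random-walk test on a graph-of-words.
import math
import random
from collections import Counter

def walk_entropy(graph, start, steps=200, seed=0):
    rng = random.Random(seed)
    node, visits = start, Counter()
    for _ in range(steps):
        visits[node] += 1
        nbrs = list(graph[node])
        weights = [graph[node][n] for n in nbrs]      # edge-weight bias
        node = rng.choices(nbrs, weights=weights)[0]
    probs = [c / steps for c in visits.values()]
    return -sum(p * math.log2(p) for p in probs)      # visit-distribution entropy

# toy weighted graph-of-words: {word: {neighbour: weight}}
g = {
    "concept": {"map": 5, "graph": 1},
    "map":     {"concept": 5, "mining": 4},
    "mining":  {"map": 4, "graph": 1},
    "graph":   {"concept": 1, "mining": 1},
}
THRESHOLD = 1.9  # illustrative
for w in g:
    h = walk_entropy(g, w)
    if h < THRESHOLD:
        print(f"walk from {w!r} stays in a small community (H={h:.2f})")
```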
In this work, the domain corpus is represented as a graph-of-words, as it can represent text more competently than the bag-of-words model and, having a strong mathematical foundation, provides various operations to extract the necessary information. If the nodes are appropriately chosen, the proper associations between the nodes are made, and weights are assigned appropriately to nodes and edges, the graph representation can be used to find high-quality information in text. The graph-of-words model proposed by Rousseau et al. (2015) is used as the basis for graph construction, where the vertices represent words and the edges represent co-occurrence statistics. The graph-based keyword extraction method of Rousseau et al. identifies words that are maximally connected in the graph based on its structure and post-processes them to obtain multi-word keywords. However, as our intention is to find key phrases, i.e., sequences of words that continually occur together, the graph-of-words is modified to suit our key phrase extraction process.
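A minimal construction sketch in the sense of Rousseau et al. (2015) follows: vertices are words and edges carry co-occurrence counts within a sliding window. The window size and tokenisation are illustrative; the invention's weighting additionally uses syntactic and semantic measures not shown here.

```python
# A minimal sketch of graph-of-words construction from a token stream.
import networkx as nx

def graph_of_words(tokens, window=3):
    g = nx.Graph()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:          # sliding co-occurrence window
            if w != v:
                weight = g[w][v]["weight"] + 1 if g.has_edge(w, v) else 1
                g.add_edge(w, v, weight=weight)
    return g

toks = "concept map mining extracts concept map from domain text".split()
g = graph_of_words(toks)
print(g["concept"]["map"]["weight"])  # co-occurrence count within the window: 2
```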
The major contribution of the work is employing a deep learning method for proposition extraction, thereby reducing the need for feature engineering and learning the features automatically from the data. The other contributions are a reduction in false positives and the extraction of meaningful, valid propositions through proper boundary identification utilizing the phrase-based dependency tree and embedding. In addition, all the propositions in a given sentence are extracted without the use of external resources for domain pertinence. The input document is pre-processed to extract sentences, which are parsed to obtain constituency and dependency parse trees. The dependency parse tree is converted to a phrase parse tree and represented using the joint word embedding. This reduced representation is given to the tree-based convolution neural network for feature extraction. The model extracts features from it employing fixed-depth non-linear filters and k-max pooling. The features are fed to a fully connected network layer to extract propositions from the sentence.
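The pre-processing step (dependency parse converted to a phrase-level tree) might be sketched as below, using spaCy and merging noun chunks so that tree nodes become phrases rather than single words. The output format and the chunk-merging heuristic are illustrative assumptions.

```python
# A minimal sketch of converting a dependency parse into a phrase parse tree.
import spacy

nlp = spacy.load("en_core_web_sm")

def phrase_tree(sentence):
    doc = nlp(sentence)
    with doc.retokenize() as retok:               # collapse noun chunks to one node
        for chunk in doc.noun_chunks:
            retok.merge(chunk)
    def subtree(tok):
        return {"phrase": tok.text, "dep": tok.dep_,
                "children": [subtree(c) for c in tok.children]}
    root = next(t for t in doc if t.head is t)    # root token of the parse
    return subtree(root)

print(phrase_tree("The concept map represents the document graphically."))
```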
Proposition extraction needs to identify concepts along with the relation in a given sentence. The concepts and the connecting relation occur at various positions, and the given sentence may contain more than one proposition. Hence, the method needs to effectively identify the presence of the different concepts and relations and their boundaries, which requires contextual features capable of identifying them.
The joint word embedding consists of three components that capture the richer context of the domain corpus for identifying concepts and relations in a given sentence. Word2vec word embedding, using the skip-gram model with negative sampling, is generated from the corpus using co-occurrence window-based linear context. It captures distributional semantic similarity, which is useful in identifying related terms effectively. Dependency2vec is generated using the dependency syntactic relation context between the words in a sentence. This context is arbitrary but functional and captures long-distance relations between words. Topic2vec uses a word-topic context that considers the word along with its topic as a pseudo word to generate the embedding. It generates a different embedding of the same word for each topic, capturing topical semantic similarity, where the topic of the word is identified using the Hierarchical Dirichlet Process (HDP). The topic-based vector representation uses a topic context based on a generative probabilistic model and helps capture domain relations between terms, thus enabling domain proposition extraction without domain knowledge.
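A minimal sketch of assembling the joint embedding follows: the linear-context (Word2vec), dependency-context, and topic-context vectors of a word are concatenated into one richer representation. Only the linear component is shown being trained (with gensim's skip-gram and negative sampling, as named above); the dependency and topic components are stubbed with illustrative pre-computed vectors.

```python
# A minimal sketch of the three-component joint word embedding.
import numpy as np
from gensim.models import Word2Vec

sentences = [["concept", "map", "mining"], ["domain", "concept", "extraction"]]
# sg=1 -> skip-gram; negative=5 -> negative sampling, as in the description
w2v = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)

def joint_vector(word, dep_vecs, topic_vecs, topic):
    return np.concatenate([
        w2v.wv[word],               # linear (co-occurrence window) context
        dep_vecs[word],             # dependency syntactic context
        topic_vecs[(word, topic)],  # word-topic pseudo-word context
    ])

# illustrative stand-ins for separately trained component vectors
dep_vecs = {"concept": np.zeros(50)}
topic_vecs = {("concept", 3): np.ones(50)}
print(joint_vector("concept", dep_vecs, topic_vecs, topic=3).shape)  # (150,)
```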
The convolution neural network has two operations to extract features: convolution and pooling. Convolution applies a set of filters to the input to extract features. The pooling function operates on the convolved features and extracts representative values for the area to which it is applied. The two operations together extract the necessary features, and a fully connected layer is used to learn the model. Convolution kernels are applied to different sentence representations to extract relations. In this work, the linear convolution filter is modified into a non-linear filter based on the dependency tree. The need is to extract local information from the dependency structure to identify propositions; hence the structure is analyzed for the patterns to mine from it. The basic idea of the tree-based convolution neural network (TBCNN) is to design a set of novel sub-tree feature detectors sliding over the dependency tree to identify the sub-trees containing the propositions. The dependency tree is preferred because of its functional linking of long-distance terms, which helps in extracting propositions effectively.
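The following PyTorch sketch illustrates the tree-based convolution idea: a fixed-depth filter slides over each (parent, children) window of the dependency tree, and k-max pooling keeps the k strongest activations before a fully connected layer would be applied. The dimensions, child-averaging scheme, and toy tree are illustrative assumptions, not the exact claimed architecture.

```python
# A minimal sketch of a tree-based convolution layer with k-max pooling.
import torch
import torch.nn as nn

class TreeConv(nn.Module):
    def __init__(self, dim, n_filters, k=2):
        super().__init__()
        self.parent = nn.Linear(dim, n_filters, bias=False)
        self.child = nn.Linear(dim, n_filters, bias=True)
        self.k = k

    def forward(self, node_vecs, children):
        # node_vecs: (n_nodes, dim); children[i]: indices of node i's children
        feats = []
        for i, kids in enumerate(children):
            h = self.parent(node_vecs[i])
            if kids:                              # average the children window
                h = h + self.child(node_vecs[kids].mean(dim=0))
            feats.append(torch.tanh(h))           # non-linear sub-tree filter
        feats = torch.stack(feats)                # (n_nodes, n_filters)
        k = min(self.k, feats.size(0))
        return feats.topk(k, dim=0).values.flatten()  # k-max pooling

vecs = torch.randn(4, 150)                        # joint embeddings of 4 nodes
kids = [[1, 2], [3], [], []]                      # toy tree: 0 -> {1, 2}, 1 -> {3}
pooled = TreeConv(dim=150, n_filters=8)(vecs, kids)
print(pooled.shape)                               # torch.Size([16]) -> FC layer input
```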
The joint word embedding, consisting of the co-occurrence-based Word2vec, dependency relation-based, and topic-based components, provides distributional semantic similarity. This joint word embedding is used for synonym and co-reference identification to find similar concepts. It is also used to find the topic distribution necessary for sub-topic identification. Two cases are considered: domain corpus not available and domain corpus available. When a domain corpus is not available, the topic distribution within the document is identified using the joint word embedding as the feature for K-means clustering. When a domain corpus is available, HDP is used to identify the number of topics in the domain corpus and the topic distribution within the document, with the joint word embedding as the feature for HDP. Moreover, the joint word embedding is used as the feature for agglomerative hierarchical clustering of concepts. As shown in Figure 3, the joint word embedding based on word, dependency, and topic context is used in all the modules of the concept map construction. Therefore, the concept importance and topic identification needed for the enhanced concept map are obtained by sub-topic detection based on partitional clustering using K-means and HDP. The hierarchical links between topical concepts are discovered via hierarchical agglomerative clustering, and the concepts are scored based on both frequency and topic information.
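The two sub-topic detection paths can be sketched as below: K-means over the joint embeddings when no domain corpus exists, and gensim's HDP model to infer the topic count when one does. The embeddings, corpus, and cluster count are illustrative assumptions.

```python
# A minimal sketch of sub-topic detection via K-means and HDP.
import numpy as np
from sklearn.cluster import KMeans
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Case 1: no domain corpus -- cluster concept embeddings directly.
emb = np.random.rand(20, 150)                     # joint embeddings of 20 concepts
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)

# Case 2: domain corpus available -- HDP infers the number of topics itself.
docs = [["concept", "map", "mining"], ["neural", "network", "convolution"]]
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
hdp = HdpModel(bow, dictionary)

print(labels[:5], len(hdp.get_topics()))          # cluster ids, inferred topic count
```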

Claims: The scope of the invention is defined by the following claims:
1. A System/Method for Enhanced Concept Map Generation from Domain Text Using Deep Learning Techniques, comprising the steps of:
a) an unsupervised graph-based method devised to extract domain key phrases that form the domain vocabulary from the domain corpus, aiding domain-specific concept map mining;
b) a deep learning method for proposition extraction, thereby reducing the need for feature engineering and learning the features automatically from the data; and
c) generation of a precise, concise, and unambiguous concept map with enhanced information content, by removing duplicates, organizing concepts effectively, and adding latent relations employing the joint word embedding representation of concepts.
2. As claimed in claim 1, a random walk-based overlapping small community detection method is devised to extract domain key phrases in the graph-of-words built from the domain corpus and weighted using statistical, syntactic, and semantic measures.
3. As claimed in claim 1, a Tree-based Convolutional Neural Network (TBCNN) is developed with a set of novel sub-tree feature detectors sliding over the dependency tree to identify the sub-trees containing the propositions.
4. As claimed in claim 1, the concept importance and topic identification needed for the enhanced concept map are obtained by sub-topic detection based on partitional clustering using K-means and HDP.

Documents

Application Documents

# Name Date
1 202541060941-REQUEST FOR EARLY PUBLICATION(FORM-9) [26-06-2025(online)].pdf 2025-06-26
2 202541060941-FORM-9 [26-06-2025(online)].pdf 2025-06-26
3 202541060941-FORM FOR STARTUP [26-06-2025(online)].pdf 2025-06-26
4 202541060941-FORM FOR SMALL ENTITY(FORM-28) [26-06-2025(online)].pdf 2025-06-26
5 202541060941-FORM 1 [26-06-2025(online)].pdf 2025-06-26
6 202541060941-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [26-06-2025(online)].pdf 2025-06-26
7 202541060941-EVIDENCE FOR REGISTRATION UNDER SSI [26-06-2025(online)].pdf 2025-06-26
8 202541060941-EDUCATIONAL INSTITUTION(S) [26-06-2025(online)].pdf 2025-06-26
9 202541060941-DRAWINGS [26-06-2025(online)].pdf 2025-06-26
10 202541060941-COMPLETE SPECIFICATION [26-06-2025(online)].pdf 2025-06-26