Abstract: A CONTROL UNIT TO REPRESENT AT LEAST ONE DOCUMENT IN AN INTERACTION GRAPH REPRESENTATION AND A METHOD THEREOF
ABSTRACT
The control unit 10 combines all sentences of at least one document 12 into a format 14 for extracting multiple key words and connects the extracted key words by an edge when a pair of the key words co-occurs in at least one sentence. It detects a concept by using a community detection technique for creating a pseudo graph 16, where the concept refers to a correlated pair of keywords. It determines a dominant direction of the edges drawn between the detected concepts and generates an undirected concept interaction graph (CIG) 18 from the detected concepts. It assigns each detected concept to a node for constructing an undirected weighted edge between the nodes. The control unit assigns the sentences of the document to respective nodes. The control unit generates a mixed-weighted concept interaction graph (MCIG) from the CIG and the directed weighted edges drawn between the detected concepts/nodes. (FIGURES 1 & 2)
Description:Complete Specification:
The following specification describes and ascertains the nature of this invention and the manner in which it is to be performed:
[0001] Field of the invention:
The invention is related to a control unit to represent at least one document in an interaction graph representation and a method thereof.
[0002] Background of the invention:
Instructional documents are procedural documents in which the sentences follow a defined sequence of information. For example, recipes, do-it-yourself (DIY) guides, repair manuals, etc. have sequential instructions. Semantic representation and analysis of instructional documents is vital for NLP applications such as document retrieval, question-answering systems, etc. Traditional approaches used for interoperability among heterogeneous data organizations (such as TF-IDF, LDA, etc.) mainly focus on short text snippets, whereas deep neural networks (such as RNNs, CNNs, etc.) miss the structural information present in long, domain-specific instructional documents. The present invention provides a solution for this.
[0003] A technical paper “Matching Long Text Documents via Graph Convolutional Networks” discloses a graph approach to text matching, especially targeting long document matching, such as identifying whether two news articles report the same event in the real world, possibly with different narratives. It proposes the Concept Interaction Graph to yield a graph representation for a document, with vertices representing different concepts, each being one or a group of coherent keywords in the document, and with edges representing the interactions between different concepts, connected by sentences in the document. Based on the graph representation of document pairs, it further proposes a Siamese Encoded Graph Convolutional Network that learns vertex representations through a Siamese neural network and aggregates the vertex features through Graph Convolutional Networks to generate the matching result. Extensive evaluation of the proposed approach based on two labeled news article datasets created at Tencent for its intelligent news products shows that the proposed graph approach to long document matching significantly outperforms a wide range of state-of-the-art methods.
[0004] Brief description of the accompanying drawings:
An embodiment of the disclosure is described with reference to the following accompanying drawings:
[0005] Figure 1 illustrates a control unit to represent at least one document in an interaction graph representation according to one embodiment of the invention; and
[0005] Figure 2 illustrates a flow chart of a method of representing at least one document in an interaction graph representation according to the present invention.
Detailed description of the embodiments:
[0006] Figure 1 illustrates a control unit to represent at least one document in an interaction graph representation according to one embodiment of the invention. The control unit 10 combines all sentences of at least one document 12 into a format 14 for extracting multiple key words and connects the extracted key words by an edge when a pair of the key words co-occurs in at least one sentence. The control unit 10 detects a concept by using a community detection technique for creating a pseudo graph 16, where the concept refers to a correlated pair of keywords. The control unit 10 determines a dominant direction of the edges drawn between the detected concepts and generates an undirected concept interaction graph (CIG) 18 from the detected concepts. The control unit 10 assigns each detected concept to a node for constructing an undirected weighted edge between the nodes. The control unit 10 assigns the sentences of the document to respective nodes based on a similarity between a vector representation of the detected concepts and the sentences. The control unit 10 generates a mixed-weighted concept interaction graph (MCIG) 20 from the CIG 18 and the directed weighted edges drawn between the detected concepts/nodes.
[0007] Further, the representation of at least one document in the form of an interaction graph representation is explained in detail. The at least one document 12 is chosen from a group of documents comprising manuals, contracts, legal documents, repair manuals, do-it-yourself kits and the like. However, it is to be understood that the type of document is not restricted to the above and can be any other document known to a person skilled in the art. The control unit 10 is chosen from a group of control units comprising a microprocessor, a microcontroller, a digital circuit, an integrated chip and the like.
[0008] Figure 2 illustrates a method of representing at least one document in an interaction graph representation according to the present invention. The method involves the following steps. In step S1, all sentences of at least one document 12 are combined into a format 14 by a control unit 10 for extracting multiple key words, and the extracted key words are connected by an edge when a pair of the key words co-occurs in at least one sentence. In step S2, a concept is detected by the control unit 10 by using a community detection technique for creating a pseudo graph 16, where the concept refers to a correlated pair of keywords. In step S3, a dominant direction of the edges between said detected concepts is determined and an undirected concept interaction graph (CIG) 18 is generated from said detected concepts. In step S4, each detected concept is assigned to a node for constructing an undirected weighted edge between the nodes. In step S5, the sentences of the document are assigned to respective nodes based on a similarity between a vector representation of the nodes/detected concepts and the sentences. In step S6, a mixed-weighted concept interaction graph (MCIG) 20 is generated from the CIG 18 and the directed weighted edges between the detected concepts/nodes.
[0009] The above disclosed method is explained in detail. The control unit 10 performs a pre-processing method for cleaning the at least one document 12 which is to be represented as an interaction graph. In the document pre-processing stage, the at least one document 12 is cleaned to remove special characters, and all uppercase letters are replaced by lowercase letters. In addition, all the cleaned text instructions or sentences are combined to form the format 14, that is, the sentences present in the at least one document 12 are combined to form one long-form text. After the at least one document 12 has been cleaned and combined to form the format 14, the control unit 10 creates a key-word co-occurrence graph 15.
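By way of a non-limiting illustration, the pre-processing stage described above may be sketched as follows. The function name `preprocess` and the choice to treat every character other than a letter, a digit or whitespace as a special character are assumptions of this sketch and are not prescribed by the specification.

```python
import re

def preprocess(sentences):
    """Lowercase each sentence, strip special characters, and join the result.

    Returns the cleaned sentences and the combined long-form text
    (corresponding to the format 14 in the description).
    """
    cleaned = []
    for sentence in sentences:
        text = sentence.lower()
        # Assumption: anything that is not a letter, digit or whitespace
        # counts as a "special character" and is replaced by a space.
        text = re.sub(r"[^a-z0-9\s]", " ", text)
        # Collapse runs of whitespace left behind by the removal.
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned.append(text)
    return cleaned, " ".join(cleaned)

cleaned_sentences, long_form_text = preprocess(
    ["Remove the Cover!", "Unscrew (4) bolts."]
)
```

The cleaned sentences are kept alongside the combined text because the later stages operate on individual sentences.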
[0010] In the process of creating the key-word co-occurrence graph 15, the control unit 10 extracts the keywords from the format 14 using any one of the following extraction techniques: an entity recognition technique or a text-rank technique. However, it is to be noted that the control unit 10 can use any other technique known in the state of the art for extraction of the keywords. In order to construct the graph 15, the control unit 10 connects the extracted keywords by undirected and unweighted edges. The extracted keywords are referred to as vertices in some parts of the document 12. These vertices are connected by an edge if the key word pair co-occurs in at least one sentence of the document 12.
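By way of a non-limiting illustration, the construction of the key-word co-occurrence graph 15 may be sketched as follows. The keywords are assumed to be single tokens that have already been extracted by some other technique; the function name `build_cooccurrence_graph` and the adjacency-set representation are hypothetical choices of this sketch.

```python
from itertools import combinations

def build_cooccurrence_graph(sentences, keywords):
    """Return an undirected, unweighted graph as {keyword: set of neighbours}."""
    graph = {kw: set() for kw in keywords}
    for sentence in sentences:
        tokens = set(sentence.split())
        # Keywords that appear in this sentence (single-token assumption).
        present = sorted(kw for kw in keywords if kw in tokens)
        # Connect every pair of keywords that co-occur in the sentence.
        for a, b in combinations(present, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

sentences = [
    "remove the cover with the screwdriver",
    "place the cover on the table",
]
keywords = {"cover", "screwdriver", "table"}
graph = build_cooccurrence_graph(sentences, keywords)
```

Because the neighbours are stored in sets, a pair that co-occurs in several sentences still yields a single unweighted edge.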
[0011] After completion of the construction of the co-occurrence graph 15, the control unit 10 performs a concept detection. The control unit 10 detects a dominant direction of edges from the above co-occurrence graph 15. In this concept detection phase, the control unit 10 applies a community detection technique in order to identify the communities. The control unit 10 denotes the sets of highly correlated keywords as concepts. It is to be noted that each concept comprises a bag of keywords that belong to the concept. The concepts are detected by applying a community detection technique, such as the Girvan-Newman algorithm, to the keyword co-occurrence graph 15. It is to be noted that the community detection technique is not restricted to the above-mentioned one and can be any other technique known in the state of the art.
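By way of a non-limiting illustration, one level of a Girvan-Newman style split may be sketched in plain Python as follows. The function names are hypothetical, the graph is assumed to be given as adjacency sets without self-loops, and the number of splits to perform is not stated in the specification, so the sketch stops as soon as one additional community appears.

```python
from collections import deque

def connected_components(graph):
    """Return the connected components of {node: set(neighbours)} as sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        components.append(component)
    return components

def edge_betweenness(graph):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = {}
    for source in graph:
        order, dist = [], {source: 0}
        sigma = {v: 0 for v in graph}
        sigma[source] = 1
        preds = {v: [] for v in graph}
        queue = deque([source])
        while queue:  # BFS counting shortest paths from the source
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):  # accumulate path credit back to the source
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + credit
                delta[v] += credit
    return score

def girvan_newman_split(graph):
    """Remove highest-betweenness edges until one more component appears."""
    work = {node: set(nbs) for node, nbs in graph.items()}
    initial = len(connected_components(work))
    while len(connected_components(work)) == initial:
        scores = edge_betweenness(work)
        if not scores:  # no edges left to remove
            break
        a, b = tuple(max(scores, key=scores.get))
        work[a].discard(b)
        work[b].discard(a)
    return connected_components(work)

graph = {
    "cover": {"screwdriver", "bolt"},
    "screwdriver": {"cover", "bolt"},
    "bolt": {"cover", "screwdriver", "filter"},
    "filter": {"bolt", "housing", "seal"},
    "housing": {"filter", "seal"},
    "seal": {"filter", "housing"},
}
concepts = girvan_newman_split(graph)
```

In this example the single edge between "bolt" and "filter" carries every shortest path between the two keyword groups, so it is removed first and each remaining group forms one concept.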
[0012] After the concept detection stage, the control unit 10 creates the pseudo graph 16 for capturing the flow of information. In this stage, the detected concepts are represented as nodes and directed edges are drawn between the detected concepts/nodes based on the occurrence of concepts in the sequential order of the sentences.
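By way of a non-limiting illustration, the pseudo graph 16 may be sketched as a count of directed edges between concepts. The specification does not define exactly how the sequential order of the sentences yields an edge; this sketch assumes that an edge is drawn from every concept mentioned in one sentence to every different concept mentioned in the next sentence. The names `concepts_in` and `build_pseudo_graph` are hypothetical.

```python
from collections import Counter

def concepts_in(sentence, concepts):
    """Names of the concepts having at least one keyword in the sentence."""
    tokens = set(sentence.split())
    return [name for name, kws in concepts.items() if tokens & kws]

def build_pseudo_graph(sentences, concepts):
    """Count directed edges (source, target) between consecutive sentences."""
    edge_counts = Counter()
    for current, following in zip(sentences, sentences[1:]):
        for src in concepts_in(current, concepts):
            for dst in concepts_in(following, concepts):
                if src != dst:  # assumption: no self-loops are recorded
                    edge_counts[(src, dst)] += 1
    return edge_counts

concepts = {"C1": {"cover", "screwdriver"}, "C2": {"filter"}}
sentences = [
    "remove the cover with the screwdriver",
    "pull out the filter",
    "refit the cover",
    "clean the filter",
]
pseudo_edges = build_pseudo_graph(sentences, concepts)
```

Here the flow from C1 to C2 is observed twice and the reverse flow once, which is the kind of information the subsequent dominant direction step relies on.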
[0013] The control unit 10 determines the dominant direction in the pseudo graph 16 using at least one supergenome graph sorting technique. For each pair of the detected concepts, any one of the following directional edges is drawn: a dominant direction edge, a unidirectional edge or a bidirectional edge.
For instance, if a node pair has a greater number of edges in a certain direction, then that direction is considered the dominant direction and a unidirectional edge is drawn between the nodes. In another instance, if the pair of nodes has the same number of edges in both directions, a bidirectional edge is drawn between the nodes.
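By way of a non-limiting illustration, the counting rule of the two instances above may be sketched as follows. The sketch implements only that majority rule and is not a supergenome graph sorting technique; the function name `dominant_directions` and the edge-count input format are assumptions.

```python
def dominant_directions(edge_counts):
    """Map each unordered concept pair to (source, target) or 'bidirectional'.

    edge_counts maps (source, target) to the number of directed edges
    observed between the two concepts in the pseudo graph.
    """
    result = {}
    for a, b in edge_counts:
        pair = frozenset((a, b))
        if pair in result:
            continue  # the pair was already decided from the reverse key
        forward = edge_counts.get((a, b), 0)
        backward = edge_counts.get((b, a), 0)
        if forward > backward:
            result[pair] = (a, b)
        elif backward > forward:
            result[pair] = (b, a)
        else:
            result[pair] = "bidirectional"
    return result

edge_counts = {
    ("C1", "C2"): 2, ("C2", "C1"): 1,
    ("C2", "C3"): 1, ("C3", "C2"): 1,
}
directions = dominant_directions(edge_counts)
```

In this example the pair C1, C2 receives a unidirectional edge from C1 to C2, while the pair C2, C3 has equal counts and receives a bidirectional edge.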
[0014] After the determination of dominant direction of the edges, the control unit 10 creates an Undirected Concept Interaction Graph (CIG) 18. While creating the CIG 18, the control unit 10 represents the detected concepts as nodes. As mentioned earlier, every such node contains a bag of keywords that belong to that concept. The sentences from the document 12 are assigned to the nodes based on the similarity between vector representations of concepts (in nodes) and the sentences.
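By way of a non-limiting illustration, the assignment of sentences to nodes may be sketched as follows. The specification does not fix the vector representation; this sketch assumes simple bag-of-words count vectors compared by cosine similarity and assigns each sentence to its most similar concept, leaving sentences with zero similarity to every concept unassigned. The function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def assign_sentences(sentences, concepts):
    """Assign each sentence to the concept node it is most similar to."""
    concept_vecs = {name: Counter(kws) for name, kws in concepts.items()}
    assignment = {name: [] for name in concepts}
    for sentence in sentences:
        vec = Counter(sentence.split())
        best = max(concept_vecs, key=lambda n: cosine(vec, concept_vecs[n]))
        # Assumption: a sentence unrelated to every concept stays unassigned.
        if cosine(vec, concept_vecs[best]) > 0:
            assignment[best].append(sentence)
    return assignment

concepts = {"C1": {"cover", "screwdriver"}, "C2": {"filter"}}
sentences = ["remove the cover", "clean the filter", "wash your hands"]
assignment = assign_sentences(sentences, concepts)
```

Any other vector representation, such as learned sentence embeddings, could be substituted for the count vectors without changing the structure of the assignment step.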
[0015] The undirected weighted edges are constructed between the nodes if the similarity between the vector representations of the set of sentences associated with the node pairs is greater than a predefined threshold. The weight of the edge corresponds to the similarity score. Here in this graph, the control unit 10 constructs the graph based on the direction and the weight of the edges between two nodes/detected concepts.
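By way of a non-limiting illustration, the construction of the undirected weighted edges may be sketched as follows. The bag-of-words cosine similarity, the function name `build_cig_edges` and the threshold value of 0.3 are assumptions of this sketch; the specification only requires that the similarity exceed a predefined threshold.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def build_cig_edges(assignment, threshold):
    """Return {frozenset({a, b}): weight} for node pairs above the threshold."""
    # One count vector per node, built from all sentences assigned to it.
    vecs = {name: Counter(" ".join(sents).split())
            for name, sents in assignment.items()}
    edges = {}
    for a, b in combinations(sorted(vecs), 2):
        score = cosine(vecs[a], vecs[b])
        if score > threshold:
            edges[frozenset((a, b))] = score  # edge weight = similarity score
    return edges

assignment = {
    "C1": ["remove the cover", "refit the cover"],
    "C2": ["clean the filter"],
    "C3": ["wash hands"],
}
cig_edges = build_cig_edges(assignment, threshold=0.3)
```

In this example only the pair C1, C2 shares vocabulary, so it is the only pair that receives an undirected weighted edge.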
[0016] The control unit 10 constructs a mixed-weighted concept interaction graph (MCIG) 20 for the detected concepts present in the document. The dominant direction determination and the undirected concept interaction graph (CIG) 18 are combined to obtain the final MCIG 20. The concepts detected in the concept detection step (which are the same for both the CIG 18 and the dominant direction graph) are represented as nodes. Every such node contains a bag of keywords that belong to that concept.
[0017] The sentences from the document that were assigned to the nodes in the CIG 18 are reassigned to the same nodes in the MCIG 20 by the control unit 10. The edges between the nodes are assigned by the control unit 10 based on the dominant direction and the weightage. The control unit 10 provides a directed edge if there is a directed edge in the sense of the dominant direction. If there is no directed edge in the sense of the dominant direction, then the control unit 10 assigns an undirected edge. The weight of each edge of the CIG 18 is reassigned to the edge of the MCIG 20 for the same pair of nodes.
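By way of a non-limiting illustration, the combination of the CIG 18 and the dominant directions into the MCIG 20 may be sketched as follows. The function name `build_mcig` and the edge record layout are hypothetical. The sketch assumes that only node pairs having a CIG edge carry a weight into the MCIG, and that a bidirectional result is represented as an undirected edge; neither point is fixed by the description.

```python
def build_mcig(cig_edges, directions):
    """Merge CIG weights with dominant directions into mixed edges.

    cig_edges:  {frozenset({a, b}): weight}
    directions: {frozenset({a, b}): (source, target) or 'bidirectional'}
    """
    mcig = []
    for pair, weight in cig_edges.items():
        direction = directions.get(pair)
        if isinstance(direction, tuple):
            # A dominant direction exists: keep it and reuse the CIG weight.
            source, target = direction
            directed = True
        else:
            # No dominant direction: fall back to an undirected edge.
            source, target = sorted(pair)
            directed = False
        mcig.append({"source": source, "target": target,
                     "directed": directed, "weight": weight})
    return mcig

cig_edges = {frozenset(("C1", "C2")): 0.4, frozenset(("C2", "C3")): 0.7}
directions = {
    frozenset(("C1", "C2")): ("C1", "C2"),
    frozenset(("C2", "C3")): "bidirectional",
}
mcig = build_mcig(cig_edges, directions)
```

The resulting list contains one directed edge from C1 to C2 with weight 0.4 and one undirected edge between C2 and C3 with weight 0.7.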
[0018] It should be understood that embodiments explained in the description above are only illustrative and do not limit the scope of this invention. Many such embodiments and other modifications and changes in the embodiment explained in the description are envisaged. The scope of the invention is only limited by the scope of the claims.
Claims: We claim:
1. A control unit (10) to represent at least one document (12) in an interaction graph representation, said control unit (10) adapted to:
- combine all sentences of said at least one document (12) into a format (14) for extracting multiple key words and connect said extracted key words by an edge when a pair of said key words co-occurs in at least one sentence;
- detect a concept by using a community detection technique for creating a pseudo graph (16), where said concept refers to a correlated pair of said keywords;
- determine a dominant direction of said edges between said detected concepts and generate an undirected concept interaction graph (CIG) (18) from said detected concepts;
- assign each detected concept to a node for constructing an undirected weighted edge between said nodes;
- assign said sentences of said document to respective said nodes based on a similarity between a vector representation of said detected concepts and said sentences;
- generate a mixed-weighted concept interaction graph (MCIG) (20) from said CIG (18) and said directed weighted edges between said detected concepts/nodes.
2. The control unit (10) as claimed in claim 1, wherein said control unit (10) is adapted to remove special characters and to replace upper-case letters with respective lower-case letters using a pre-processing technique.
3. The control unit (10) as claimed in claim 1, wherein said keywords are extracted using any one of the following techniques: an entity recognition technique, a text-rank technique.
4. The control unit (10) as claimed in claim 1, wherein said concepts are detected using at least one community detection technique, one such community detection technique being a Girvan-Newman technique.
5. The control unit (10) as claimed in claim 1, wherein a flow of information is captured by said pseudo graph (16) and directed edges are drawn between said detected concepts based on an occurrence of said concepts in sequential order of said sentences.
6. The control unit as claimed in claim 1, wherein for each pair of said detected concepts any one of the following directional edges is drawn: a dominant direction edge, a unidirectional edge, a bidirectional edge.
7. The control unit as claimed in claim 1, wherein each said detected concept is referred to as a node in said CIG (18) and each said node comprises a group of said keywords that belong to said detected concept.
8. The control unit as claimed in claim 1, wherein said at least one document (12) is chosen from a group of documents comprising manuals, legal documents, contracts, repair manuals, do-it-yourself kit books.
9. The control unit as claimed in claim 1, wherein at least one undirected weighted edge is constructed between said detected concepts/nodes if a similarity between said vector representations of a set of said sentences associated with said pair of said detected concepts/nodes is greater than a predefined threshold.
10. A method of representing at least one document (12) in an interaction graph representation, said method comprising:
- combining all sentences of said at least one document (12) into a format (14) by a control unit (10) for extracting multiple key words and connecting said extracted key words by an edge when a pair of said key words co-occurs in at least one sentence;
- detecting a concept by using a community detection technique by said control unit (10) for creating a pseudo graph (16), where said concept refers to a correlated pair of said keywords;
- determining a dominant direction of said edges between said detected concepts and generating an undirected concept interaction graph (CIG) (18) from said detected concepts;
- assigning each detected concept to a node for constructing an undirected weighted edge between said nodes;
- assigning said sentences of said document to respective said nodes based on a similarity between a vector representation of said detected concepts and said sentences;
- generating a mixed-weighted concept interaction graph (MCIG) (20) from said CIG (18) and said directed weighted edges between said detected concepts/nodes.
| # | Name | Date |
|---|---|---|
| 1 | 202241077029-POWER OF AUTHORITY [30-12-2022(online)].pdf | 2022-12-30 |
| 2 | 202241077029-FORM 1 [30-12-2022(online)].pdf | 2022-12-30 |
| 3 | 202241077029-DRAWINGS [30-12-2022(online)].pdf | 2022-12-30 |
| 4 | 202241077029-DECLARATION OF INVENTORSHIP (FORM 5) [30-12-2022(online)].pdf | 2022-12-30 |
| 5 | 202241077029-COMPLETE SPECIFICATION [30-12-2022(online)].pdf | 2022-12-30 |
| 6 | 202241077029-FORM 18 [20-03-2025(online)].pdf | 2025-03-20 |