A Method Of Analyzing A Quality Of Corpus

Abstract: A control unit for analyzing a quality of a corpus, and a method thereof. The control unit identifies multiple words in the corpus and refers to each word in the corpus as a node. The control unit then assigns an embedded vector to each node and plots a graph connecting multiple nodes based on at least one parameter calculated between at least two of the nodes. The control unit then divides the plotted graph into multiple communities using a hierarchical clustering technique to extract a plurality of topics. The control unit distinguishes the nodes into the plurality of topics based on an identified hierarchy of the communities formed using a community detection technique. The control unit classifies the distinguished nodes into multiple main classes and corresponding multiple sub classes for analyzing the quality of the corpus. Figure 1 & 2

Patent Information

Application #

Filing Date

28 April 2023

Publication Number

44/2024

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

Bosch Global Software Technologies Private Limited

123, Industrial Layout, Hosur Road, Koramangala, Bangalore – 560095, Karnataka, India

Robert Bosch GmbH

Postfach 30 02 20, 0-70442, Stuttgart, Germany

Inventors

1. Manojit Chakraborty

Badra Baroaritola Road Bye Lane, PO - Italgacha, Kolkata 700079, West Bengal, India

2. Rajesh Nagaraja Rao

1397 South End A Cross, 9 Block Jayanagar, Bengaluru 560069, Karnataka, India

Specification

Description: Complete Specification:

The following specification describes and ascertains the nature of this invention and the manner in which it is to be performed.
Field of the invention
[0001] This invention is related to a control unit for analyzing a quality of corpus and a method thereof.

Background of the invention
[0002] Corpus linguistics is a methodology that involves computer-based empirical analyses (both quantitative and qualitative) of language use by employing large, electronically available collections of naturally occurring spoken and written texts. Conventional computerized tools for semantic analysis process large bodies of documents to identify topics discussed therein. This processing often includes parsing text stored in the documents and creating a data store that associates documents with their constituent terms and the frequency with which the constituent terms occur within the documents. From this data store, conventional semantic analysis tools rank terms by frequency of occurrence and record terms that occur less frequently than others as being more important. Conventional semantic analysis tools focus on these important terms and their location within documents relative to other terms to discern the topics to which the documents are directed.

[0003] US patent application 20180315141 discloses a system and method for contract business intelligence that includes managing a data-driven contract corpus comprising at least a set of distinct and related contracts. Managing the data-driven contract corpus comprises: acquiring contract related data from the contract corpus, which includes making the contract related data accessible to programmable clauses; storing the contract related data, which includes processing and organizing the contract related data; analyzing the contract related data; and visualizing the contract related data and the analysis of the contract related data.
Brief description of the accompanying drawings
[0004] Figure 1 illustrates a control unit for analyzing a quality of a corpus, in accordance with an embodiment of the invention; and
[0005] Figure 2 illustrates a flowchart of a method for analyzing a quality of the corpus, in accordance with the present invention.

Detailed description of the embodiments
[0006] Figure 1 illustrates a control unit for analyzing a quality of a corpus, in accordance with an embodiment of the invention. The control unit 10 identifies multiple words 11 in the corpus 12 and refers to each word in the corpus as a node 11. The control unit 10 then assigns an embedded vector to each node 11 and plots a graph connecting multiple nodes 11 based on at least one parameter calculated between at least two of the nodes 11. The control unit 10 then divides the plotted graph into multiple communities using a hierarchical clustering technique to extract a plurality of topics. The control unit 10 distinguishes the nodes 11 into the plurality of topics based on an identified hierarchy of the communities formed using a community detection technique. The control unit 10 classifies the distinguished nodes 11 into multiple main classes and corresponding multiple sub classes for analyzing the quality of the corpus 12.

[0007] The construction of the control unit 10 and its method of working are explained in detail below. The control unit 10 is chosen from a group of control units such as a microprocessor, a microcontroller, a digital circuit, an integrated chip and the like. The method of analyzing the quality of the corpus 12 involves different modules of the control unit 10, such as a comparator 14, a classification module 16 and the like. The corpus 12 is a collection of multiple documents. A document can be of any type, including but not limited to an agreement, a contract or a legal document. For example, the corpus 12 disclosed in the present invention comprises 10,000 documents.

[0008] Figure 2 illustrates a flowchart of a method for analyzing a quality of the corpus, in accordance with the present invention. In step S1, multiple words 11 are identified in the corpus 12 by the control unit 10, and each word is referred to as a node 11. In step S2, an embedded vector is assigned to each node 11. In step S3, a graph is plotted connecting multiple nodes 11 based on at least one parameter calculated between at least two of the nodes 11. In step S4, the plotted graph is divided into multiple communities using a hierarchical clustering technique to extract a plurality of topics. In step S5, the nodes 11 are distinguished into the plurality of topics based on an identified hierarchy of the communities formed using a community detection technique. In step S6, the distinguished nodes are classified into multiple main classes and corresponding multiple sub classes for analyzing the quality of the corpus 12 and gaining insights to perform downstream tasks on the corpus 12.

[0009] A community comprises multiple nodes 11 connected by edges. Each edge connects two nodes 11, and one node 11 can have multiple edges depending on the relevancy of the words. Two nodes 11 are connected by an edge if the similarity between them is more than a predefined threshold value. The at least one parameter used in connecting the multiple nodes 11 comprises an edge weight value and a cosine similarity value. Vector representations of the words/nodes 11 from the corpus 12 are created by the control unit 10 using any one of the embedding models, such as Word2Vec, Doc2Vec, FastText and the like. The similarity between two words/nodes 11 is derived from the cosine similarity value, which is calculated from the embedding vectors of the two nodes 11.

[0010] The cosine similarity value is calculated using the vector values of at least two nodes 11, and the edge weight is calculated using the calculated cosine similarity value. Two words/nodes 11 are connected by an edge only if the similarity between them is above a threshold value µ, which is predefined by the user during a calibration process; the comparison is performed by the comparator 14 of the control unit 10. The edge weight between two nodes 11 is calculated using a custom function as given below:

$$\text{Edge\_weight}(i, j) = \frac{1}{1 - \text{cosine\_similarity}(i, j)}$$

Wherein i and j are nodes.

This function maps the cosine similarity value as –

$$(\mu, 1) \to \left( \frac{1}{1 - \mu}, \infty \right)$$

Where µ is the predefined value set by the user.

The Cosine similarity value is calculated as shown below:

$$\text{Cosine\_similarity}(i, j) = \frac{I \cdot J}{\lVert I \rVert \, \lVert J \rVert}$$

wherein I and J are vector representations for the nodes i and j respectively.
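The two formulas above can be sketched in a few lines of Python. This is only an illustrative sketch, not the applicant's implementation: the function names, the toy 2-dimensional vectors and the handling of the edge cases (zero vectors, similarity equal to 1) are assumptions made for the example.

```python
import math

def cosine_similarity(vec_i, vec_j):
    """Cosine of the angle between two embedding vectors I and J."""
    dot = sum(a * b for a, b in zip(vec_i, vec_j))
    norm_i = math.sqrt(sum(a * a for a in vec_i))
    norm_j = math.sqrt(sum(b * b for b in vec_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_i * norm_j)

def edge_weight(similarity):
    """Edge weight 1 / (1 - similarity); it grows without bound as similarity nears 1."""
    if similarity >= 1.0:
        return math.inf  # assumption: identical directions get an infinite weight
    return 1.0 / (1.0 - similarity)

# Toy embeddings (hypothetical values, not produced by a trained model).
brake = [1.0, 0.0]
pedal = [1.0, 1.0]
sim = cosine_similarity(brake, pedal)   # about 0.707
weight = edge_weight(sim)               # about 3.414
```

Because the function is increasing in the similarity, a pair just above the threshold µ receives a weight just above 1/(1 - µ), and more similar pairs receive progressively larger weights.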

[0011] The graph is constructed such that each word 11 from the corpus 12 is represented as a node. The graph comprises multiple nodes connected to each other: each edge connects two nodes 11, and some nodes 11 in the graph are connected to multiple nodes via multiple edges. The graph thus created is divided into multiple portions using a community detection method for extracting multiple topics. The community detection is performed on the weighted graph.
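As a rough illustration of the graph construction described above, the following sketch builds a weighted graph as a plain adjacency dictionary. The name `build_graph`, the dictionary representation, the toy embeddings and the threshold of 0.8 are all assumptions chosen for the example; the specification does not prescribe a data structure.

```python
import math
from itertools import combinations

def cosine_similarity(vec_i, vec_j):
    """Cosine similarity of two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(vec_i, vec_j))
    norm_i = math.sqrt(sum(a * a for a in vec_i))
    norm_j = math.sqrt(sum(b * b for b in vec_j))
    return dot / (norm_i * norm_j)

def build_graph(embeddings, threshold):
    """Connect every pair of words whose similarity exceeds the threshold."""
    graph = {word: {} for word in embeddings}  # every word is a node, even if isolated
    for word_i, word_j in combinations(sorted(embeddings), 2):
        sim = cosine_similarity(embeddings[word_i], embeddings[word_j])
        if sim > threshold:
            # Edge weight 1 / (1 - similarity), guarded against division by zero.
            weight = math.inf if sim >= 1.0 else 1.0 / (1.0 - sim)
            graph[word_i][word_j] = weight
            graph[word_j][word_i] = weight
    return graph

# Hypothetical 2-dimensional embeddings for three words.
embeddings = {
    "brake": [1.0, 0.1],
    "pedal": [0.9, 0.2],
    "the": [0.0, 1.0],
}
graph = build_graph(embeddings, threshold=0.8)
```

In this toy input only "brake" and "pedal" are similar enough to be linked, so "the" remains an isolated node. The all-pairs loop is quadratic in the vocabulary size, which is acceptable for a sketch but would likely need an approximate nearest-neighbour search on a large corpus.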

[0012] A community has more edges between the nodes 11 within it than to the nodes 11 outside it, and thus captures the semantic senses in the graph. For example, in a vehicle manufacturer’s requirements corpus, all similar signal names can form a single community or a set of communities. For any domain-specific corpus 12, nodes 11 that are not domain specific, such as common terms and language-specific nodes, can form a single community or a set of communities, which can be easily detected as noisy nodes.

[0013] The control unit 10 can use multiple methods to identify the communities in the community detection algorithm; for example, a parallel Louvain method is used to determine the communities in the graph. To perform the hierarchical clustering and topic extraction, the control unit 10 uses a specific function in which the community detection method is performed recursively to identify hierarchies of clusters.

[0014] A topic is represented by a set of words/nodes 11 that indicate a common semantic theme. Concepts are subsets of topics. A larger set of words/nodes indicates a general topic such as sports, politics or business, while a concept is a more specific set of words/nodes, such as NFL or US presidential elections. Here, the clusters/communities at the upper levels of the hierarchy form topics, and progressively smaller, concept-specific clusters are formed at the lower levels. Each topic is quantified by a clustering metric that specifies the quality of the semantic theme represented by the topic.

[0015] The control unit 10 divides the graph into multiple communities. A community function in the community detection technique takes as input a graph, a modularity index threshold, a maximum community size and an empty tree data structure to be filled as output. The community detection technique, incorporated in the classification module 16 of the control unit 10, obtains the communities and the modularity index for the given graph. If the modularity index is below the modularity index threshold, the control unit 10 identifies that the communities are not well formed and discards them.
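The specification does not define the modularity index. The sketch below assumes the standard weighted (Newman) modularity, Q = Σ_c [ L_c / m - (d_c / 2m)² ], where L_c is the total weight of edges inside community c, d_c is the summed weighted degree of its nodes and m is the total edge weight. The helper names, the toy graph and the threshold of 0.3 are hypothetical.

```python
def graph_from_edges(edges):
    """Build a symmetric adjacency dictionary from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def modularity(graph, communities):
    """Weighted modularity of a partition (assumed definition, see lead-in)."""
    two_m = sum(w for nbrs in graph.values() for w in nbrs.values())  # each edge counted twice
    if two_m == 0:
        return 0.0
    q = 0.0
    for community in communities:
        members = set(community)
        # Internal weight, also counted twice because the adjacency is symmetric.
        internal = sum(w for u in members for v, w in graph[u].items() if v in members)
        degree = sum(w for u in members for w in graph[u].values())
        q += internal / two_m - (degree / two_m) ** 2
    return q

# Two tightly knit triangles joined by one weak edge.
graph = graph_from_edges([
    ("a", "b", 5.0), ("b", "c", 5.0), ("a", "c", 5.0),
    ("d", "e", 5.0), ("e", "f", 5.0), ("d", "f", 5.0),
    ("c", "d", 1.0),
])
MODULARITY_THRESHOLD = 0.3  # hypothetical calibration value

good_q = modularity(graph, [{"a", "b", "c"}, {"d", "e", "f"}])
bad_q = modularity(graph, [{"a", "d"}, {"b", "e"}, {"c", "f"}])
keep_good = good_q >= MODULARITY_THRESHOLD  # well formed: kept
keep_bad = bad_q >= MODULARITY_THRESHOLD    # poorly formed: discarded
```

The partition that follows the two triangles scores well above the threshold, while the partition that cuts across them has no internal edges at all and scores below zero, so it would be discarded.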

[0016] However, if the modularity index is acceptable, the control unit 10 creates the communities and adds them to the tree data structure. If a newly formed community is larger than the maximum community size, a sub graph of the community members is created and the community detection technique is executed on it again. After the recursive execution of the community detection technique, a tree data structure containing the hierarchical structure of communities in the graph is obtained. The control unit 10 selects the hierarchy level at which to extract the communities. The words/nodes 11 that are not part of any community are marked as noise and stored in a separate data structure.
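The recursion described in the two paragraphs above might look roughly like the sketch below. Several things here are assumptions made only to keep the example self-contained: the detector is a simple stand-in (it drops edges at or below the median weight and takes connected components) and is not the Louvain method named in the specification; modularity uses the standard weighted definition; single-node communities are treated as noise; and all function names and threshold values are hypothetical.

```python
from statistics import median

def subgraph(graph, nodes):
    """Restrict the adjacency dictionary to the given nodes."""
    keep = set(nodes)
    return {u: {v: w for v, w in graph[u].items() if v in keep} for u in keep}

def modularity(graph, communities):
    """Standard weighted modularity (assumed definition of the modularity index)."""
    two_m = sum(w for nbrs in graph.values() for w in nbrs.values())
    if two_m == 0:
        return 0.0
    q = 0.0
    for members in communities:
        internal = sum(w for u in members for v, w in graph[u].items() if v in members)
        degree = sum(w for u in members for w in graph[u].values())
        q += internal / two_m - (degree / two_m) ** 2
    return q

def detect_communities(graph):
    """Stand-in detector, NOT Louvain: components after a median-weight cut."""
    weights = [w for u in graph for v, w in graph[u].items() if u < v]
    cut = median(weights) if weights else 0.0
    seen, parts = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in part:
                part.add(u)
                stack.extend(v for v, w in graph[u].items() if w > cut)
        seen |= part
        parts.append(part)
    return parts

def build_hierarchy(graph, min_modularity, max_size, tree, noise):
    """Fill `tree` with nested communities and `noise` with leftover nodes."""
    communities = detect_communities(graph)
    if len(communities) < 2 or modularity(graph, communities) < min_modularity:
        return  # split is not well formed: discard it
    for members in communities:
        if len(members) == 1:
            noise.update(members)  # assumption: singletons count as noise
            continue
        node = {"members": sorted(members), "children": []}
        tree.append(node)
        if len(members) > max_size:  # too large: recurse on its sub graph
            build_hierarchy(subgraph(graph, members), min_modularity,
                            max_size, node["children"], noise)

edges = [("a", "b", 9.0), ("c", "d", 9.0), ("a", "c", 3.0), ("b", "d", 3.0),
         ("e", "f", 9.0), ("a", "e", 1.0), ("b", "e", 1.0), ("c", "e", 1.0),
         ("d", "e", 1.0), ("c", "f", 1.0), ("d", "f", 1.0)]
graph = {"z": {}}  # "z" has no edges at all
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

tree, noise = [], set()
build_hierarchy(graph, min_modularity=0.2, max_size=3, tree=tree, noise=noise)
```

On this toy graph the top level of the tree holds two communities (a, b, c, d and e, f), the larger one is split once more into (a, b) and (c, d), and the isolated node z ends up in the noise set. Choosing a hierarchy level then amounts to reading the tree at a fixed depth.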

[0017] The control unit 10 distinguishes the nodes into a plurality of topics based on an identified hierarchy of the communities that is formed using a hierarchical community detection technique. The distinguished communities are classified by the control unit 10 into multiple main classes and corresponding multiple sub classes for analyzing the quality of the corpus 12. The control unit 10 represents the topics in a graphical representation for analyzing the quality of the corpus 12 and gaining insights to perform downstream tasks on the corpus 12.

[0018] Each topic or concept is associated with the meta-data associated with each of its words/nodes 11. For example, if the nodes/words have labels associated with them, the control unit 10 creates distributions of topics/concepts over the different labels. The control unit 10 uses a majority labelling method to obtain labels for each topic/concept extracted by the previous module. The control unit 10 thus labels each topic, in analyzing the quality of the corpus 12, using at least one majority labelling technique.

[0019] According to one embodiment of the invention, each document, instead of each word, can be referred to as a node, and the same methodology for analyzing the quality of the corpus 12 is applied. For each word in a topic/concept cluster, the documents that contain the word are taken. The class labels of these documents are obtained by the control unit 10, and the word is assigned the label with the highest percentage of occurrence. The same steps are repeated for every word in the cluster, and thus the labels are found.

[0020] The control unit 10 labels the topic/concept cluster using the class label with the highest percentage of occurrence among all words inside the cluster. The distributions of topics/concepts vs. labels provide information on how different topics/concepts are associated with different labels, and indicate the content in the corpus with which each label is associated.
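The majority labelling steps above can be illustrated with a short sketch. The representation of a document as a (set of words, class label) pair, the function names and the sample data are assumptions for the example; ties are resolved in favour of the label encountered first, which the specification does not address.

```python
from collections import Counter

def majority_label(labels):
    """Most frequent label; on a tie, the label seen first wins (assumption)."""
    return Counter(labels).most_common(1)[0][0]

def label_cluster(cluster_words, documents):
    """Label each word by its documents, then the cluster by its words."""
    word_labels = {}
    for word in cluster_words:
        # Class labels of all documents that contain this word.
        doc_labels = [label for words, label in documents if word in words]
        if doc_labels:
            word_labels[word] = majority_label(doc_labels)
    cluster_label = majority_label(list(word_labels.values())) if word_labels else None
    return cluster_label, word_labels

# Hypothetical labelled documents: (words in the document, class label).
documents = [
    ({"brake", "pedal", "force"}, "safety"),
    ({"brake", "pedal"}, "safety"),
    ({"brake", "signal", "bus"}, "communication"),
    ({"signal", "bus"}, "communication"),
    ({"pedal", "signal"}, "safety"),
]
cluster_label, word_labels = label_cluster(["brake", "pedal", "signal"], documents)
```

Here "brake" and "pedal" are labelled "safety" and "signal" is labelled "communication", so the cluster as a whole is labelled "safety". Counting `word_labels.values()` per cluster also gives the topic vs. label distribution mentioned in the text.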

[0021] The control unit 10 plots a two-dimensional plot for the corpus. From the distributions of topics/concepts vs. labels, the user can take an informed decision on selecting the classifiers needed to perform any downstream tasks with the text corpus. For example, a hierarchical classifier system can be designed in which a fast, simple classifier classifies the labels that are highly separated in the semantic space, and a slow, complex, non-linear classifier then classifies the labels that overlap in the semantic space. Such overlapping labels can reduce the performance significantly if only the first classifier is used to perform downstream tasks on the entire dataset.

[0022] Embodiments explained in the description above are only illustrative and do not limit the scope of this invention. Many such embodiments and other modifications and changes in the embodiment explained in the description are envisaged. The scope of the invention is only limited by the scope of the claims.
Claims: We Claim:

1. A control unit (10) for analyzing a quality of corpus (12), said control unit (10) adapted to:
- identify multiple words in said corpus (12) and refer each said word (11) as a node;
- generate an embedded vector for each said node (11);
- plot a graph connecting multiple said nodes (11) based on at least one parameter calculated between at least two of said nodes (11);
- divide said plotted graph into multiple communities by using community detection technique for extraction of plurality of topics;
- distinguish said nodes (11) into said plurality of topics based on an identified hierarchy of said communities that is formed using a hierarchical community detection technique;
- classify said distinguished communities into multiple main classes and corresponding multiple sub classes for analyzing said quality of corpus (12).

2. The control unit (10) as claimed in claim 1, wherein said community comprises multiple nodes (11) connected using edges.

3. The control unit (10) as claimed in claim 1, wherein at least two said nodes (11) are connected using said edges if similarity between said two nodes (11) are more than a predefined threshold value.

4. The control unit (10) as claimed in claim 1, wherein said at least one parameter used in connecting multiple said nodes (11) comprises an edge weight value and a cosine similarity value.

5. The control unit (10) as claimed in claim 4, wherein said cosine similarity value is calculated using said vector values of at least two said nodes (11) and said edge weight is calculated using said calculated cosine similarity value.

6. The control unit (10) as claimed in claim 1, wherein said nodes (11) which are non-domain specific (common nodes) forms as at least one said community that is detected as noise nodes.

7. The control unit (10) as claimed in claim 1, wherein said topic is represented by a set of nodes (11) that indicate a common semantic theme and said each topic is quantified by a clustering metrics to specify a quality of said Semantic theme that is represented by said topic.

8. The control unit (10) as claimed in claim 1, wherein each topic is labelled in analyzing said quality of said corpus (12) using at least one majority labelling technique.

9. A method of analyzing a quality of corpus (12), said method comprising the steps of:
- identifying multiple words (11) in said corpus (12) and refer each said word as a node (11) by a control unit (10);
- generating an embedded vector for each said node (11);
- plotting a graph connecting multiple said nodes (11) based on at least one parameter calculated between at least two of said nodes (11);
- dividing said plotted graph into multiple communities by using community detection technique for extraction of plurality of topics;
- distinguishing said nodes (11) into said plurality of topics based on an identified hierarchy of said communities formed using a hierarchical community detection technique;
- classifying said distinguished communities into multiple main classes and corresponding multiple sub classes for analyzing said quality of said corpus (12).

10. The method as claimed in claim 9, wherein representing said topics in a graphical representation for analyzing said quality of said corpus (12) and gaining insights to perform downstream tasks on said corpus (12).

Documents

Application Documents

#	Name	Date
1	202341030695-POWER OF AUTHORITY [28-04-2023(online)].pdf	2023-04-28
2	202341030695-FORM 1 [28-04-2023(online)].pdf	2023-04-28
3	202341030695-DRAWINGS [28-04-2023(online)].pdf	2023-04-28
4	202341030695-DECLARATION OF INVENTORSHIP (FORM 5) [28-04-2023(online)].pdf	2023-04-28
5	202341030695-COMPLETE SPECIFICATION [28-04-2023(online)].pdf	2023-04-28
6	202341030695-FORM 18 [12-08-2025(online)].pdf	2025-08-12