
Methods And Systems For Mining Multilingual Text Communication

Abstract: Embodiments herein disclose methods and systems for multilingual short text topic mining, by building a word/phrase co-occurrence network, detecting communities using a Louvain algorithm, and discovering groups of words and phrases belonging to high-level topics within the text. Embodiments herein interpret texts using topic similarity networks. The topic similarity networks are graphs in which nodes represent latent topics in text collections, the links represent similarity among topics, and the hierarchy of nodes represents sub-topics within a topic. FIG. 3


Patent Information

Application #
Filing Date
01 July 2020
Publication Number
01/2022
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

Subex Assurance LLP
RMZ Ecoworld, Devarabisanahalli, Outer Ring Road, Bangalore, Karnataka, India - 560103

Inventors

1. Harsha Burly
Subex Assurance LLP, RMZ Ecoworld, Devarabisanahalli, Outer Ring Road, Bangalore, Karnataka, India - 560103
2. Anuranjan Prasad
Subex Assurance LLP, RMZ Ecoworld, Devarabisanahalli, Outer Ring Road, Bangalore, Karnataka, India - 560103

Specification

DESC:
The following specification particularly describes and ascertains the nature of this invention and the manner in which it is to be performed:-

CROSS REFERENCE TO RELATED APPLICATION
This application is based on and derives the benefit of Indian Provisional Application 202041028027 filed on 1st July 2020, the contents of which are incorporated herein by reference.

TECHNICAL FIELD
[001] Embodiments disclosed herein relate to managing unstructured data, and more particularly to mining text communication of one or more users to generate topics using graph theory and applying a community detection methodology to build a topic hierarchy, wherein the text communication includes short texts.

BACKGROUND
[002] Companies such as telecom operators, digital business sectors, or the like, collect data of users/customers from disparate sources such as client interactions, emails, blogs, product reviews, social networking sites, short messaging service (SMS), call center logs, and so on. The volume, velocity, and variety of the data collected by the companies are growing faster than ever before. A major part of the collected data is text data. The companies may mine/process the collected text data of the users to generate actionable insights, which may be used to understand the customer perspective on experience, pain points, complaints, and feedback. Further, the text data may be mined to reduce costs, enhance processing speed, and provide a better customer experience. Mining of the text data may also be used to enhance the Average Revenue Per User (ARPU) and revenue for the companies.
[003] In conventional approaches, the companies may use topic modelling to mine the collected text data. Topic modelling may be used to organize, understand, and summarize large collections of text data, and to generate abstract topics from the text data. Topic modelling may be used to generate the abstract topics, in terms of getting the gist of the text data, which is collected from millions of text documents from the disparate sources. In the conventional approaches, the companies may adopt topic modelling methodologies such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) for determining latent semantic structure from the text data without requiring any prior annotations or labeling of the documents. In such methodologies, each text may be viewed as a mixture of various topics and each topic is characterized by a distribution over all words present in the text data. The words with the highest estimated probabilities for the determined topic may be used as a label for the topic.
[004] The collected text data of the users may include short texts. The short texts may be in disparate formats and languages. The short texts tend to use largely casual language and are more informal than traditional grammatically correct texts. Besides, either due to a limit on the number of characters or simply by being short, the short texts describe specific topics. The increasing ubiquity of the short texts and their obvious differences from traditional texts have been attracting increasing attention. In the conventional approaches, the companies may adopt the topic modelling or available short text topic modelling methods to perform one or more tasks on the short texts, such as topic detection, classification, comment summarization, and so on. However, the topic modelling and the short text topic modelling methods may experience large performance degradation over the short texts.
[005] The topic modelling may experience large performance degradation over the short texts due to a few characteristics such as:
• The LDA model of the topic modelling is designed to perform well on collections of text data using document-word co-occurrence, in which the text data is reasonably long, whereas in the case of the short texts, document-word co-occurrence becomes extremely sparse, resulting in poor topic quality; and
• In the topic modelling, each text is viewed as a mixture of various topics and each topic is characterized by a distribution over all the words; whereas, due to the few words in the short texts, most short texts are probably generated by only one topic. Besides, the LDA model may also suffer from poor interpretability.
[006] Also, the topic modelling may experience large performance degradation over the short texts due to the lack of document-word co-occurrence information. Also, interpreting an output/topic from the short texts using the topic modelling is considerably challenging. In order to label specific topics for the short texts, one is left to inspect numerical probability distributions, which may be difficult, non-trivial, and far from straightforward.
[007] In addition, as more text data becomes available and collected every day, it becomes practically impossible to mine the short texts manually or using the topic modelling methodologies.
[008] In the conventional approaches, the companies may also adopt probabilistic models to generate the topics for the text data. However, such probabilistic models may not work when the text data includes the short texts, as the probabilistic models fail to model a multinomial distribution on the short texts. Furthermore, each topic generated using the probabilistic models may be independent of the others and hence, the probabilistic models may not detect correlations amongst the topics.

OBJECTS
[009] The principal object of embodiments herein is to disclose methods and systems for mining text communication of one or more users, wherein the text communication includes short texts, which may be multilingual.
[0010] Another object of embodiments herein is to disclose methods and systems for mining the short texts to generate the one or more topics for each short text using a weighted co-occurrence graph and a community detection methodology.
[0011] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating at least one embodiment and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.

BRIEF DESCRIPTION OF FIGURES
[0012] Embodiments herein are illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
[0013] FIG. 1 depicts a text mining system configured for mining text communication, according to embodiments as disclosed herein;
[0014] FIG. 2 depicts a text mining manager performable in the text mining system for mining the text communication to generate topics, according to embodiments as disclosed herein;
[0015] FIG. 3 is an example conceptual diagram depicting mining of the text communication using a weighted co-occurrence graph and a community detection methodology to generate one or more topics for each text, according to embodiments as disclosed herein;
[0016] FIGs. 4a and 4b are example diagrams depicting a language based pre-processing performed on the text, according to embodiments as disclosed herein;
[0017] FIG. 5a is an example flowchart depicting a process of generating frequently occurring words/phrases/word combinations in each pre-processed text, according to embodiments as disclosed herein;
[0018] FIG. 5b depicts the frequently occurring word combinations in example texts, according to embodiments as disclosed herein;
[0019] FIG. 6a depicts an example weighted co-occurrence graph, according to embodiments as disclosed herein;
[0020] FIG. 6b is an example flowchart depicting a process of generating a list of communities with strongly connected word combinations/phrases, according to embodiments as disclosed herein;
[0021] FIG. 6c depicts an example list of communities with the strongly connected word combinations, according to embodiments as disclosed herein;
[0022] FIG. 7a is an example flow chart depicting a process of assigning the topics for the communities, according to embodiments as disclosed herein;
[0023] FIG. 7b depicts examples of the topics assigned for the communities, according to embodiments as disclosed herein;
[0024] FIG. 8a is an example flowchart depicting a process of creating a hierarchy of topics/communities, according to embodiments as disclosed herein;
[0025] FIGs. 8b and 8c depict the hierarchies of topics/communities related to payment and devices related issues, respectively, according to embodiments as disclosed herein;
[0026] FIG. 9a is an example flowchart depicting a process of performing a topic mapping to assign the topics for each text, according to embodiments as disclosed herein;
[0027] FIG. 9b depicts an example use case scenario of assigning the topics for the text, according to embodiments as disclosed herein;
[0028] FIG. 10 is a flow diagram depicting a method for mining the text communication, according to embodiments as disclosed herein; and
[0029] FIG. 11 is an example diagram depicting the topics assigned for the plurality of texts, according to embodiments as disclosed herein.

DETAILED DESCRIPTION
[0030] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0031] Embodiments herein disclose methods and systems for generating one or more topics for each of a plurality of short texts collected from one or more users.
[0032] Embodiments herein disclose methods and systems for generating the one or more topics for each short text using a weighted co-occurrence graph and a community detection methodology.
[0033] Referring now to the drawings, and more particularly to FIGS. 1 through 11, where similar reference characters denote corresponding features consistently throughout the figures, there are shown embodiments.
[0034] Embodiments herein use the terms such as, “text communication”, “textual information”, “document”, “input dataset”, “input text dataset”, and so on, interchangeably to refer to content that includes one or more texts.
[0035] Embodiments herein use the terms such as “texts”, “short texts”, “text data”, “text messages”, and so on, interchangeably to refer to data that include one or more words or phrases or characters.
[0036] Embodiments herein use the terms such as “weighted co-occurrence graph”, “co-occurrence graph”, “co-occurrence matrix”, “word-combinations matrix”, “network of word combinations”, and so on, interchangeably to refer to a graph that depicts co-occurrence of word combinations/phrases across a plurality of texts.
[0037] Embodiments herein use the terms such as “modules”, “groups”, “clusters”, and “communities”, and so on, interchangeably to refer to the divisions/partitions of the nodes of the weighted co-occurrence graph.
[0038] FIG. 1 depicts a text mining system 100 configured for mining text communication, according to embodiments as disclosed herein.
[0039] The text mining system 100 referred herein may be at least one of, but is not limited to, a cloud computing device, a server, a database, a computing device, and so on. The cloud computing device may be a part of a public cloud or a private cloud. The server may be at least one of, but is not limited to, a standalone server, a server on a cloud, or the like. The computing device may be, but is not limited to, a personal computer, a notebook, a tablet, a desktop computer, a laptop, a handheld device, a mobile device, and so on. Also, the text mining system 100 may be at least one of, a microcontroller, a processor, a System on Chip (SoC), an integrated chip (IC), a microprocessor-based programmable consumer electronic device, and so on.
[0040] The text mining system 100 may be connected with one or more external entities/sources (not shown) using a communication network. Examples of the external entity may be, but are not limited to, user devices (used by one or more users/customers), application servers, web servers, mail servers, messaging servers, or any other device that collects the text communication of users. Examples of the communication network may be, but are not limited to, the Internet, a wired network (a Local Area Network (LAN), a Controller Area Network (CAN) network, a bus network, Ethernet and so on), a wireless network (a Wi-Fi network, a cellular network, a Wi-Fi Hotspot, Bluetooth, Zigbee and so on) and so on.
[0041] The text mining system 100 may be maintained by one or more organizations such as, but are not limited to, telecom operators, customer care centers, digital business entities, and so on, for mining text communication. Mining the text communication includes performing one or more tasks such as, organizing, understanding, generating topics for the text communication, and summarizing/generating description for the text communication. The text communication may be textual information or documents (a set of texts) of the one or more users. The text mining system 100 may collect the text communication of the one or more users from the one or more external entities/sources over the communication network. The text communication includes one or more texts.
[0042] The text may be data including one or more words or phrases or one or more characters. The text as referred to herein may comprise, but is not limited to, customer interactions, emails, text messages, social media posts (such as tweets, Facebook posts, Instagram posts, and so on), instant messaging (such as Whatsapp posts, Telegram posts, Facetime posts, Skype posts, and so on), blogs, product reviews, call center logs, calendar entries, memos/notes, and so on. In an example, the text may be reasonably long text data. In another example, the text may be a short text, which is in a disparate form and is more informal. It is understood that the text may be in various other forms including those described above.
[0043] In an embodiment, the texts may be multilingual, which implies that the texts may be in any language. The texts hold a wealth of actionable insights, which may be mined and extracted.
[0044] The topics generated for the texts may depict subjects/actionable insights disclosed in the texts. Thus, mining of the one or more texts to generate the topics may enable the organization to understand the user/customer perspective on experience, pain points, complaints, feedback, and so on.
[0045] The text mining system 100 includes a memory 102, a communication interface 104, a display 106, and a processor 108.
[0046] The memory 102 may store at least one of, the text communication including the one or more texts, data outputted by the processor 108, and so on. The memory 102 also includes a text mining manager 200, which may be processed by the processor 108 for mining the one or more texts. Examples of the memory 102 may be, but are not limited to, NAND, embedded Multimedia Card (eMMC), Secure Digital (SD) cards, Universal Serial Bus (USB), Serial Advanced Technology Attachment (SATA), solid-state drive (SSD), and so on. Further, the memory 102 may include one or more computer-readable storage media. The memory 102 may include one or more non-volatile storage elements. Examples of such non-volatile storage elements may include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories. In addition, the memory 102 may, in some examples, be considered a non-transitory storage medium. The term “non-transitory” may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term “non-transitory” should not be interpreted to mean that the memory is non-movable. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in Random Access Memory (RAM) or cache).
[0047] The communication interface 104 may be configured to enable the text mining system 100 to communicate with the one or more external entities using an interface supported by the communication network. Examples of the interface may be, but are not limited to, a wired interface, a wireless interface, or any structure supporting communications over a wired or wireless connection.
[0048] The display 106 may be configured to enable an authorized user/administrator of the organization to interact with the text mining system 100. The display 106 may also be configured to display an output (in an example herein: the one or more topics for each text) generated by the processor 108 to the authorized user.
[0049] The processor 108 may include at least one of, a single processor, a plurality of processors, multiple homogeneous or heterogeneous cores, multiple Central Processing Units (CPUs) of different kinds, microcontrollers, special media, and other accelerators.
[0050] The processor 108 may be configured for mining the plurality of texts to generate the one or more topics for each of the plurality of texts. In an embodiment, the processor 108 may generate the one or more topics for each text using a global co-occurrence network, which may generate word co-occurrence patterns from the entire corpus of the texts.
[0051] For generating the topics for each text, the processor 108 receives the plurality of texts of the users. In an example, the processor 108 may receive the plurality of texts of the users from the one or more external entities. In another example, the processor 108 may receive the plurality of texts of the users from the memory 102, which have been collected over time and stored in the memory 102.
[0052] On receiving the plurality of texts of the users, the processor 108 identifies a language of each text and performs a language based pre-processing on each text. In an example, the processor 108 uses standard libraries such as, but are not limited to, SpaCy, TextBlob, Pycld2, and so on, to identify the language of the text. The processor 108 performs the language based pre-processing on the text, by performing one or more steps of a Natural Language Processing (NLP) on the text. Examples of the one or more steps of the NLP may be, but are not limited to, noise removal, language based spelling correction, language based stop word removal, language based normalization, and so on. The noise removal may include removing at least one of, but is not limited to, Uniform Resource Locators (URLs), Hypertext Markup Language (HTML) tags, emojis, punctuations, and so on, from the text. The language based spell correction includes correcting spelling of one or more words in the text. The language based stop word removal includes removing one or more common words present in the text. The language based normalization includes at least one of, but is not limited to, lower casing, stemming/lemmatization, number to text conversion, and so on. In an embodiment, the processor 108 may use supporting libraries specific to the language of the text to perform the steps of the language based spell correction and the language based stop word removal on the text. The processor 108 may use the publicly available libraries for the different languages to perform the steps of the language based spell correction and the language based stop word removal on the texts of the different languages.
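The following is a minimal, illustrative sketch of the language based pre-processing described above. The library choices (pycld2 for language identification, NLTK stop-word lists) are only examples of the kind named in the specification; the function names, the regular expressions, and the omission of spelling correction and lemmatization are simplifications, not the claimed implementation.

```python
import re
import pycld2 as cld2
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

def detect_language(text):
    """Return a lower-cased language name such as 'english' using pycld2."""
    _, _, details = cld2.detect(text)
    return details[0][0].lower()          # details[0] is e.g. ('ENGLISH', 'en', 99, ...)

def remove_noise(text):
    """Noise removal: URLs, HTML tags, punctuation and emojis."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"<[^>]+>", " ", text)                  # HTML tags
    text = re.sub(r"[^\w\s]", " ", text)                  # punctuation, emojis
    return re.sub(r"\s+", " ", text).strip()

def preprocess(text):
    """Language based pre-processing: noise removal, stop word removal, lower casing.
    Spelling correction and lemmatization mentioned above are omitted for brevity."""
    lang = detect_language(text)
    text = remove_noise(text).lower()                     # normalization: lower casing
    stops = set(stopwords.words(lang)) if lang in stopwords.fileids() else set()
    return " ".join(w for w in text.split() if w not in stops)

print(preprocess("I was not abl to mke paymnt as app was not wrking !!!"))
```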
[0053] On performing the language based pre-processing on the plurality of texts, the processor 108 generates frequently occurring words in each pre-processed text and determines a correlation between the frequently occurring words. The frequently occurring words may be words/phrases or word combinations that frequently occur together in the pre-processed text. In an example, the frequently occurring words may include one word, two words, three words, or the like, of the pre-processed text. In another example, the frequently occurring words may include a combination of the one or more words present in the pre-processed text.
[0054] For generating the frequently occurring words in each of the plurality of pre-processed texts, the processor 108 selects the text (a first text) from the plurality of texts in sequence and generates n-grams for the selected text based on a distance ‘n’, from all the texts. The distance ‘n’ may be a sliding window for extracting word co-occurrences for the texts larger than usual. The distance ‘n’ may be estimated for the text as a minimum function of a number of words in the pre-processed text and an average number of words over all the texts. The distance ‘n’ may be estimated as:
n = min (number of words in the text, average number of words over all the texts)
[0055] The processor 108 determines the frequently occurring words in the text from the n-grams. The n-grams may predict the multiple words/phrases from 2 to ‘n’ from the text, wherein ‘n’ may be configurable based on the input text dataset of the one or more users.
[0056] On determining the frequently occurring words in the text, the processor 108 filters the determined frequently occurring words from each of the plurality of remaining texts. The processor 108 recursively performs the steps of generating the n-grams for the subsequent text of the plurality of texts in sequence, determining the frequently occurring words in the text, and filtering the frequently occurring words from each of the plurality of remaining texts, until less than a pre-defined minimum number of texts is left, thereby determining the frequently occurring words/phrases in the plurality of pre-processed texts. The processor 108 may define the minimum number of texts based on a size of the input text dataset.
[0057] Determining the frequently occurring words/word combinations ensures that the plurality of texts may contribute towards a global corpus of word combinations, so that the processor 108 may determine the detailed and specific topics for the given text. The processor 108 may also assign the topics to the plurality of texts, which have missing/less text data using the global corpus of word combinations.
[0058] In an example, the processor 108 determines that words like “international roaming”, “high charges”, “cancel plan”, or the like, which occur together in the text, are the frequently occurring words, and determines the correlation between such words. The correlation may be a normalized measure of the number of times the frequently occurring words occur together.
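A hedged sketch of the word-combination generation described above, using the distance n = min(number of words in the text, average number of words over all texts) as a sliding window. The corpus-level frequency thresholding, the filtering of already-claimed combinations from the remaining texts, and the min_texts_left stopping parameter are simplified here and would need tuning on the input dataset; all names are hypothetical.

```python
from itertools import combinations

def combos_in_text(tokens, n):
    """1- to 3-word combinations that co-occur within a sliding window of size n."""
    found = set()
    for start in range(max(len(tokens) - n + 1, 1)):
        window = tokens[start:start + n]
        found.update(window)                                    # single words
        for size in (2, 3):
            found.update(" ".join(sorted(c)) for c in combinations(window, size))
    return found

def per_text_combinations(texts, min_texts_left=5):
    """Word combinations per pre-processed text; stops once few texts remain."""
    tokenised = [t.split() for t in texts]
    avg_len = sum(map(len, tokenised)) / max(len(tokenised), 1)
    result = []
    for i, tokens in enumerate(tokenised):
        if len(tokenised) - i < min_texts_left:                 # pre-defined minimum reached
            break
        n = max(int(min(len(tokens), avg_len)), 2)              # the distance 'n'
        result.append(combos_in_text(tokens, n))
    return result
```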
[0059] On determining the frequently occurring words in each of the plurality of texts, the processor 108 generates a weighted co-occurrence graph based on the generated frequently occurring words corresponding to each of the plurality of texts. The weighted co-occurrence graph may depict co-occurrences of the frequently occurring words/word combinations across the plurality of texts. For generating the weighted co-occurrence graph, the processor 108 creates a set of text to word combination (for example: unigram, bigram, trigram, or the like), and a frequency matrix. The processor 108 generates a word-combination occurrence matrix based on the created set of text to word combination and the frequency matrix. The generated word-combination occurrence matrix may be the weighted co-occurrence graph. Thus, the weighted co-occurrence graph may be a network/matrix of the word combinations providing the co-occurrences of the frequently occurring word combinations/words/phrases across the plurality of texts.
[0060] The weighted co-occurrence graph may include the following properties:
G(V): Each of the word combinations/words/phrases constitute a vertex/node, V, of the weighted co-occurrence graph.
G(E): An edge E, between V1 and V2 implies a co-occurrence of a word combination V1 and a word combination V2.
G(W): A weight W on the edge E connecting vertices V1 and V2 implies that the co-occurrence of the word combinations V1 and V2 has occurred W times. In an example, high weights on the edges imply a higher number of co-occurrences of the word combinations.
[0061] The graph is undirected, and each edge implies the co-occurrence of the connected phrases without any directionality.
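As a minimal sketch of the G(V), G(E), G(W) structure just described, the undirected weighted graph can be built with networkx, with nodes as word combinations, edges as co-occurrences within the same text, and edge weights counting how many texts the pair co-occurs in. The per_text_combos input is assumed to be the per-text combination sets from the previous sketch.

```python
import networkx as nx
from itertools import combinations

def build_cooccurrence_graph(per_text_combos):
    """Build G(V, E, W): nodes are word combinations, edge weights count co-occurrences."""
    G = nx.Graph()                                   # undirected, as stated above
    for combos in per_text_combos:
        for v1, v2 in combinations(sorted(combos), 2):
            if G.has_edge(v1, v2):
                G[v1][v2]["weight"] += 1             # W: one more co-occurrence of V1 and V2
            else:
                G.add_edge(v1, v2, weight=1)         # E: first observed co-occurrence
    return G
```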
[0062] The processor 108 may also optimize modularity of the weighted co-occurrence graph. The modularity may be a measure of the structure of the weighted co-occurrence graph. The modularity may be used to measure the strength of divisions of the nodes of the weighted co-occurrence graph into modules. The modules may be referred to as groups or clusters or communities throughout the document. The weighted co-occurrence graph with high modularity may have dense connections between the nodes within the modules/communities and sparse connections between the nodes in different modules/communities.
[0063] The modularity of the weighted co-occurrence graph may be the fraction of the edges that fall within the modules/communities minus the expected fraction if the edges were distributed at random. A value of the modularity lies in the range [-1, 1]. The value of the modularity may be positive if the fraction of edges within the modules/communities exceeds the expected fraction. In an example, the processor 108 may define the expected fraction based on the size of the input text dataset. For a given division of the nodes of the weighted co-occurrence graph into modules/communities, the modularity reflects the concentration of edges within the modules/communities compared with a random distribution of links between the nodes regardless of the modules/communities.
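The specification does not spell the measure out; for reference, the standard (Newman) modularity that community detection methods such as Louvain optimize on a weighted graph is conventionally written as

Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)

where A_{ij} is the weight of the edge between nodes i and j, k_i = \sum_j A_{ij} is the weighted degree of node i, m = \tfrac{1}{2}\sum_{i,j} A_{ij} is the total edge weight, c_i is the community of node i, and \delta(c_i, c_j) is 1 when both nodes are in the same community and 0 otherwise. Q lies in the range [-1, 1], consistent with the description above.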
[0064] In an embodiment, the processor 108 may use a greedy optimization method to optimize the modularity of the partitions/divisions of the nodes of the weighted co-occurrence graph. In accordance with the greedy optimization method, the processor 108 performs the optimization of the modularity of the weighted co-occurrence graph in two steps. In a first step, the processor 108 groups the word combinations/phrases from the weighted co-occurrence graph into “small” communities (‘k’ communities), by optimizing the modularity of the weighted co-occurrence graph locally. In an example, the processor 108 uses a Louvain community detection method to group the word combinations into the ‘k’ communities. The processor 108 checks if the size of the ‘k’ communities is lesser than a pre-defined community threshold or the modularity is negative. In an example, the processor 108 defines the community threshold based on the size of the input text dataset. In an example, the processor 108 may check if the modularity is negative or positive, using the Louvain community detection method.
[0065] If the size of the ‘k’ communities is lesser than the pre-defined community threshold or the modularity is negative, the processor 108 terminates the process of optimizing the modularity of the weighted co-occurrence graph. If the size of the ‘k’ communities is not less than the pre-defined community threshold or the modularity is not negative, the processor 108 repeats the first step recursively, until a maximum of modularity is attained and a hierarchy of communities (‘m’ communities) is produced, wherein ‘m’ is equal to or lesser than the ‘k’ communities. On producing the ‘m’ communities, in a second step, the processor 108 aggregates the nodes belonging to the same produced communities and builds a new weighted co-occurrence graph, whose nodes depict a list of communities with the strongly connected word combinations/phrases.
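A hedged sketch of this step using the Louvain implementation shipped with networkx (version 2.8 or later). The community-size threshold, the recursion into over-sized communities, and the seed are illustrative choices, not the claimed procedure.

```python
from networkx.algorithms.community import louvain_communities, modularity

def detect_communities(G, max_community_size=50):
    """Recursively apply Louvain until communities fall below the size threshold."""
    if G.number_of_nodes() == 0:
        return []
    if G.number_of_edges() == 0:
        return [{n} for n in G.nodes]                 # nothing to partition
    parts = louvain_communities(G, weight="weight", seed=42)
    if len(parts) <= 1 or modularity(G, parts, weight="weight") <= 0:
        return [set(G.nodes)]                         # cannot be split further
    final = []
    for part in parts:
        if len(part) <= max_community_size:
            final.append(set(part))                   # small enough: a topic community
        else:                                         # over-sized: partition its subgraph again
            final.extend(detect_communities(G.subgraph(part).copy(), max_community_size))
    return final
```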
[0066] On optimizing the modularity by recursively partitioning the weighted co-occurrence graph into the communities based on the community threshold size, the processor 108 assigns the topic for each community. The processor 108 assigns the topic for each community based on domain knowledge and domain grammar related to the frequently occurring word combinations present in the corresponding community. The topic assigned to the community may be a representation of the word combinations present in the community. Embodiments hereinafter may use the words “topics” and “communities” interchangeably throughout the document.
[0067] On assigning the topics for the communities, the processor 108 groups the similar topics to create a hierarchy of topics/thematic tree. The processor 108 creates a parent-child relationship between the communities for which the topics have been assigned and creates a super topic for the parent of a collection of children, resulting in the hierarchy of topics. Each node in the thematic tree may represent a community/topic, and the topics may be drilled up or drilled down by traversing the thematic tree towards a root node or towards a leaf node. For example, a node of the thematic tree may represent topics related to cars. If drilled down, nodes with subtopics like Car A, Car B, and so on, may be obtained. If drilled up, nodes with super topics like four-wheeler transport may be obtained. It is understood that different granularities of the same topic (such as super topics, subtopics, or the like) may be used based on use case scenarios or the field/domain to which the collected texts belong.
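Purely as an illustration of the thematic tree just described, the hierarchy can be held as a nested mapping of topic to sub-topics; the labels below reuse the car example from this paragraph and are not taken from the claimed implementation.

```python
# Each key is a topic/community; its value maps to sub-topics (drill down),
# while the enclosing key is the super topic (drill up).
thematic_tree = {
    "four-wheeler transport": {          # super topic
        "car": {                         # topic
            "Car A": {},                 # sub-topics
            "Car B": {},
        },
    },
}

def drill_down(tree, topic):
    """Return the sub-topics of a topic found anywhere in the thematic tree."""
    for name, children in tree.items():
        if name == topic:
            return list(children)
        found = drill_down(children, topic)
        if found is not None:
            return found
    return None

print(drill_down(thematic_tree, "car"))  # ['Car A', 'Car B']
```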
[0068] After creating the hierarchy of topics/thematic tree, the processor 108 performs a topic mapping and assigns the one or more topics for each of the plurality of texts. The processor 108 performs the topic mapping by mapping the topic corresponding to the community to each text, based on the occurrence of the similar word combinations/phrases in the topic/community and the text.
[0069] For assigning the one or more topics for each text, the processor 108 calculates a final score for each community/topic with respect to each text. On calculating the final score for all the communities/topics with respect to each text, the processor 108 assigns the topics/communities to the respective text, based on the decreasing order of the final scores of the topics/communities.
[0070] For calculating the final score for each community with respect to one of the plurality of the texts, the processor 108 calculates text coverage metrics for the text with respect to each topic/community. The processor 108 may calculate the text coverage metrics for the text with respect to the community based on an intersection of the word combinations/phrases between the text and the topic/community. In an example, the processor 108 may calculate the coverage metrics for each text with respect to each topic/community as:
text coverage metrics = Intersection(text D, community C) / text D
[0071] The processor 108 may also calculate a topic community overlap metrics for each community with respect to the text. The processor 108 may calculate the topic community overlap metrics for each topic/community based on an intersection of the word combinations/phrases between the community and the text.
topic community overlap metrics = Intersection(text D, community C) / community C
[0058] The processor 108 calculates the final score for each community as:
Final score = text coverage metrics of the text * weight of the text (weight 1) + topic community overlap metrics * weight of the topic/community (weight 2)
[0059] The processor 108 may assign a weight for the text, based on a length of the word combinations/phrases present in the text. In an example, the processor 108 may assign a weight of ‘1’ to the text if the text includes a 1-word phrase. In another example, the processor 108 may assign a weight of ‘2’ to the text if the text includes a 2-word phrase. The processor 108 may assign the weight to the community/topic based on the input text dataset in an iterative manner to achieve an optimal outcome/output. Thus, the final score may be calculated as a weighted average of the text coverage metrics and the topic community overlap metrics.
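A minimal sketch of the topic-mapping score defined by the formulas above. The word-combination sets are assumed to come from the earlier sketches, and the weights and validity thresholds (min_coverage, min_overlap) are illustrative values to be tuned iteratively on the input dataset, as the specification describes.

```python
def final_score(text_combos, community_combos,
                text_weight=2.0, community_weight=1.0,
                min_coverage=0.05, min_overlap=0.20):
    """Weighted combination of text coverage and topic community overlap, or None if invalid."""
    common = text_combos & community_combos
    coverage = len(common) / max(len(text_combos), 1)       # Intersection(D, C) / D
    overlap = len(common) / max(len(community_combos), 1)   # Intersection(D, C) / C
    if coverage < min_coverage or overlap < min_overlap:
        return None                                         # topic marked invalid for this text
    return coverage * text_weight + overlap * community_weight

def assign_topics(text_combos, communities):
    """Rank valid communities for a text in decreasing order of final score."""
    scored = [(cid, final_score(text_combos, combos)) for cid, combos in communities.items()]
    scored = [(cid, s) for cid, s in scored if s is not None]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```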
[0060] In an embodiment, the processor 108 may also rank the communities based on the weighted average score of the text coverage metrics of the texts and the topic community overlap metrics.
[0061] In an embodiment, the processor 108 may also determine the topic/community or the final score of the topic/community as invalid based on threshold values of the text coverage metrics and the topic community overlap metrics. The processor 108 may define the threshold values of the text coverage metrics and the topic community overlap metrics based on the input text dataset in an iterative manner to achieve an optimal outcome/output. In an example, if the text coverage metrics and the topic community overlap metrics with respect to the topic/community are lesser than the respective threshold values, the processor 108 determines the topic/community as invalid.
[0062] On calculating the final score for all the communities with respect to the text, the processor 108 maps the topics corresponding to the communities with the respective text in the order of decreasing final scores. The final score of the topic/community calculated based on the text coverage metrics and the topic community overlap metrics may depict how well the topic describes the text. The processor 108 may repeat the above described topic mapping steps to assign the one or more topics for all the received plurality of texts.
[0063] FIG. 2 depicts the text mining manager 200 performable in the text mining system 100 for mining the text communication to generate the topics, according to embodiments as disclosed herein.
[0064] The text mining manager 200 may be executed by the processor 108 to mine the plurality of texts of the users to generate the topics. The text mining manager 200 includes a pre-processor module 202, a frequency determination module 204, a graph generator module 206 and a topic management module 208.
[0065] The pre-processor module 202 may be configured to perform the language based pre-processing on each of the plurality of texts. The pre-processor module 202 identifies the language of each text and performs the language based pre-processing on each text. The pre-processor module 202 may perform the language based pre-processing on each text by performing the steps of the NLP on the text. Examples of the steps may be, but are not limited to, the noise removal, the language based spelling correction, the language based stop word removal, the language based normalization, and so on.
[0066] The frequency determination module 204 may be configured to generate the frequently occurring words in each pre-processed text and determine the correlation between the frequently occurring words. The frequency determination module 204 generates the n-grams for the text based on the distance ‘n’, from all the texts. The distance ‘n’ may be estimated for the text as the minimum function of the number of words in the pre-processed text and an average number of words over all the texts. On determining the frequently occurring words in the text, the processor 108 filters the determined frequently occurring words from each of the remaining texts. The processor 108 recursively performs the above described steps, until less than the pre-defined minimum number of texts is left, thereby determining the frequently occurring words in all of the plurality of texts.
[0067] The graph generator module 206 may be configured to generate the weighted co-occurrence graph based on the frequently occurring words determined in each pre-processed text. The weighted co-occurrence graph depicts the co-occurrences of the frequently occurring words/word combinations across the plurality of texts. In the weighted co-occurrence graph, a node represents a word combination/word/phrase, an edge between the nodes represents the co-occurrence of the word combinations corresponding to the nodes, and the weight on the edge represents the number of times that the word combinations have co-occurred.
[0068] The topic management module 208 may be configured to generate the topics for each text based on the weighted co-occurrence graph and using the community detection methodology. The topics may include different granularities such as, but are not limited to, a main topic, subtopics, super topics, and so on.
[0069] For generating the topics for each text, the topic management module 208 groups the nodes/word combinations into the ‘k’ communities using the Louvain community detection method. The topic management module 208 checks if the size of the ‘k’ communities is lesser than the pre-defined community threshold or the modularity is negative. If the size of the ‘k’ communities is lesser than the pre-defined community threshold or the modularity is negative, the topic management module 208 terminates the process of grouping the nodes/word combinations into the communities. If the size of the ‘k’ communities is not less than the pre-defined community threshold or the modularity is not negative, the topic management module 208 recursively performs the step of grouping the nodes into the communities, until the maximum of modularity is attained and the hierarchy of the communities (‘m’ communities) is produced, wherein ‘m’ is equal to or lesser than the ‘k’ communities. On producing the ‘m’ communities, the topic management module 208 aggregates the nodes/word combinations belonging to the same communities and builds the new weighted co-occurrence graph. In the built new weighted co-occurrence graph, the nodes depict the list of communities with the strongly connected word combinations/phrases.
[0070] On determining the list of communities with the strongly connected word combinations/phrases, the topic management module 208 assigns the topic to each community. The topic management module 208 groups the similar topics and creates the hierarchy of topics/communities/thematic tree. The topic management module 208 may create the hierarchy of topics by establishing the parent-child relationship between the communities. The topic management module 208 may use the different granularities of the topics to establish the parent-child relationship between the communities. For example, the topic management module 208 may use the super topics and the subtopics for establishing the parent-child relationship for each community. Thus, for each text, the topic management module 208 may assign the topic, the super topic, and the subtopic.
[0071] On creating the hierarchy of the topics/communities, the topic management module 208 calculates the final score for each community/topic with respect to each text, based on the occurrence of the similar word combinations/phrases in the topic/community and the text. For calculating the final score for each community/topic, the topic management module 208 calculates the text coverage metrics for each text with respect to each community/topic. The topic management module 208 calculates the text coverage metrics for the text with respect to the community/topic as:
text coverage metrics for text D = Intersection(text D, community C) / text D
[0072] The topic management module 208 calculates the topic community overlap metrics for each community with respect to the text as:
topic community overlap metrics = Intersection(text D, community C) / community C
[0073] The topic management module 208 calculates the final score for each community with respect to the text based on the text coverage metrics, the weight of the text, the topic community overlap metrics, and the weight of the topic/community. The topic management module 208 may calculate the final score for the community with respect to the text as:
Final score = text coverage metrics of the text * weight 1 + topic community overlap metrics * weight 2
[0074] The topic management module 208 calculates the final score for all the communities with respect to each of the plurality of texts. The topic management module 208 assigns the topics for the respective text based on the decreasing order of the final scores.
[0075] FIGs. 1 and 2 show exemplary blocks of the text mining system 100, but it is to be understood that other embodiments are not limited thereto. In other embodiments, the text mining system 100 may include a lesser or greater number of blocks. Further, the labels or names of the blocks are used only for illustrative purposes and do not limit the scope of the embodiments herein. One or more blocks can be combined together to perform the same or substantially similar function in the text mining system 100.
[0076] FIG. 3 is an example conceptual diagram depicting mining of the text communication using the weighted co-occurrence graph and the community detection methodology to generate the one or more topics for each text, according to embodiments as disclosed herein. The text mining system 100 receives the plurality of texts of the users and generates the one or more topics for each text by performing the following steps:
• language based pre-processing;
• generating frequently occurring word combinations;
• generating the weighted co-occurrence graph of word combinations;
• creating the topics from the weighted co-occurrence graph;
• grouping the similar topics to create hierarchy of topics; and
• performing the topic mapping to assign the one or more topics for each text.
[0077] The text mining system 100 receives the plurality of texts of the one or more users. The text mining system 100 performs the language based pre-processing on each of the received plurality of texts.
[0078] FIG. 4a depicts the language based pre-processing performed on the text. In step 401, the text mining system 100 identifies the language of the received text. In steps 402-405, the text mining system 100 performs the various steps of the NLP on the text. Examples of the steps may be, but are not limited to, the noise removal, the language based spelling correction, the language based stop word removal, the language based normalization, and so on.
[0072] Consider an example scenario, as depicted in FIG. 4b, wherein the text mining system 100 receives an example text of “I was not abl to mke paymnt as app was not wrking !!! ?”. In such a scenario, the text mining system 100 identifies that the received text is in the English language. On identifying the language of the text, the text mining system 100 performs (step 402) the noise removal step on the text. The resultant text after the noise removal step may be “I was not abl to mke paymnt as app was not working”. The text mining system 100 performs (step 403) the language based spelling correction on the text to make the text grammatically correct. The resultant text after performing the language based spelling correction may be “I was not able to make payment as app was not working”. The text mining system 100 performs (step 404) the language based stop word removal step on the text to remove the stop words like “was”. The resultant text after performing the language based stop word removal step may be “not able payment not app working”. The text mining system 100 performs (step 405) the language based normalization on the text. The resultant text after performing the language based normalization may be “not able payment not able working”.
[0073] On performing the language based pre-processing on the text, the text mining system 100 generates the frequently occurring word combinations/words/phrases in each of the plurality of pre-processed texts.
[0072] FIG. 5a is an example flowchart 500 depicting a process of generating the frequently occurring words/phrases/word combinations in each pre-processed text. In step 501, the text mining system 100 generates the n-grams for the received text of the plurality of texts in sequence, based on the distance ‘n’, from the other texts. The distance ‘n’ is the sliding window for extracting the word co-occurrences for texts larger than usual. The distance ‘n’ is the minimum function of the number of words in the text and the average number of words over the other texts. In step 502, the text mining system 100 determines the frequently occurring word combinations/words from the n-grams. The frequently occurring word combinations may be one-word, two-word, and three-word combinations. In step 503, the text mining system 100 filters the determined words/word combinations in the text from the other texts. In step 504, the text mining system 100 checks if fewer than the pre-defined minimum number of texts are left. If not fewer than the pre-defined minimum number of texts are left, the text mining system 100 repeats the steps 501-504 to generate the frequently occurring words for each text. If fewer than the pre-defined minimum number of texts are left, the text mining system 100 provides the filtered texts for which the frequently occurring words have been determined. The various actions in method 500 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 5a may be omitted.
[0074] Consider an example scenario, as depicted in FIG. 5b, wherein the text mining system 100 receives example pre-processed texts such as, “not able make payment app not work”, “Account suspend payment due app fail”, and “Restore call service suspend waive charge”. In such a scenario, the text mining system 100 generates/determines the words such as “Not make Payment”, “able Payment App”, “Payment App Work”, “Make payment app”, and “Payment App” as the frequently occurring words within the distance of 5-6 words for the example pre-processed text “not able make payment app not work”. The text mining system 100 determines the words as “Account suspend payment”, “Suspend app fail”, “Payment app fail”, “Suspend Payment fail”, and “Suspend due app” as the frequently occurring words within the distance of 5-6 words for the example pre-processed text “Account suspend payment due app fail”. The text mining system 100 determines the words such as “Restore service suspend”, “Service suspend waive”, “Suspend waive charge”, “Restore call service”, and “Restore service waive”, as the frequently occurring words within the distance of 5-6 words for the example pre-processed text “Restore call service suspend waive charge”.
[0075] On generating the frequently occurring word combinations for each of the pre-processed texts, the text mining system 100 generates the weighted co-occurrence graph that depicts the co-occurrences of the frequently occurring word combinations across the plurality of texts. The text mining system 100 generates the weighted co-occurrence graph based on the text to word combinations, and the frequency matrix. The graph is undirected, and each edge implies the co-occurrence of the connected phrases without any directionality.
[0076] An example weighted co-occurrence graph is depicted in FIG. 6a. In the example weighted co-occurrence graph, the nodes may represent the frequently occurring word combinations/phrases generated for each text, an edge between the nodes represents the co-occurrence of the word combinations, and the weight on the edge represents the number of times that the word combinations have co-occurred. As depicted in FIG. 6a, in the example weighted co-occurrence graph, an edge E1 connects the nodes V1 and V2 corresponding to a text 1 and a text 3, respectively, and the edge E1 may have a weight of ‘100’. The nodes V1 and V2 may depict the word combination “Bill Charges High”. An edge E2 connects the nodes V3 and V4 corresponding to the text 1 and a text 2, respectively, and the edge E2 may have a weight of ‘80’. The nodes V3 and V4 may depict the word combination “Never Activate Line”. An edge E3 connects the nodes V5 and V6 corresponding to the text 1 and the text 3, respectively, and the edge E3 may have a weight of ‘300’. The nodes V5 and V6 may depict the word combination “Long Tenure Cost”. An edge E4 connects the nodes V7 and V8 corresponding to the text 2, and the edge E4 may have a weight of ‘420’. The nodes V7 and V8 may depict the word combination “App Not Work”. An edge E5 connects the nodes V9 and V2 corresponding to the text 2 and the text 3, respectively, and the edge E5 may have a weight of ‘450’. The nodes V9 and V2 may depict the word combination “Bill Charges High”. The higher weights on the edges imply higher co-occurrences of the word combinations. In an example herein, the word combination “Bill Charges High” may have higher co-occurrences, compared to the other word combinations.
[0077] On generating the weighted co-occurrence graph, the text mining system 100 divides the nodes of the weighted co-occurrence graph into the communities, by optimizing the modularity of the weighted co-occurrence graph.
[0078] FIG. 6b is an example flowchart 600 depicting a process of generating the list of communities with the strongly connected word combinations/phrases. In step 601, the text mining system 100 divides the nodes/word combinations (words/phrases) into the ‘k’ communities using the Louvain community detection method. In step 602, the text mining system 100 checks if the size of the ‘k’ communities is less than the threshold size (size (k communities) < threshold size) or the modularity of the weighted co-occurrence graph is negative. If the size of the ‘k’ communities is less than the threshold size or the modularity is negative, the text mining system 100 ends the process.
[0079] If the size of the ‘k’ communities is not less than the threshold size and the modularity is not negative, the text mining system 100 repeats the step 601 for the ‘m’ communities, where a size of the ‘m’ communities is less than or equal to the threshold size. On repeating the step 601 for the ‘m’ communities, the text mining system 100 aggregates the nodes belonging to the communities and builds the new weighted co-occurrence graph. The new weighted co-occurrence graph includes the nodes/communities with the strongly connected word combinations/phrases. The various actions in method 600 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 6b may be omitted.
[0080] An example list of communities with the strongly connected word combinations is depicted in FIG. 6c.
[0081] On generating the list of communities, the text mining system 100 assigns the topic for each community. FIG. 7a is an example flow chart 700 depicting a process of assigning the topics for the communities. In step 701, the text mining system 100 observes the word combinations/phrases/words present in each community. In step 702, the text mining system 100 assigns the topic based on the heuristics and the domain input to each community of strongly connected word combinations. The topic may be assigned to the community as a representation of the word combinations/phrases present in the respective community. The various actions in method 700 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 7a may be omitted.
[0082] In an example, the topics assigned for the communities are depicted in FIG. 7b.
[0083] On assigning the topic for each of the list of communities, the text mining system 100 groups the similar topics/communities to form the hierarchy of topics/communities. FIG. 8a is an example flowchart 800 depicting a process of creating the hierarchy of topics/communities. In step 801, the text mining system 100 creates the parent-child relationship between the communities/topics. In step 802, the text mining system 100 creates the super topics for the parent of the collection of the children of the topics/communities, resulting in the hierarchy of topics/communities. The various actions in method 800 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 8a may be omitted.
[0084] An example hierarchy of topics corresponding to payment is depicted in FIG. 8b. An example hierarchy of topics corresponding to device related issues is depicted in FIG. 8c.
[0085] On creating the hierarchy of topics/communities, the text mining system 100 assigns the topics for each text, by performing the topic mapping.
[0086] FIG. 9a is an example flowchart 900 depicting a process of performing the topic mapping to assign the topics for each text. In step 901, the text mining system 100 calculates the text coverage metrics for the text corresponding to each community. In step 902, the text mining system 100 calculates the text coverage metrics as
text coverage metrics for text D = Intersection(text D, community C) / text D
[0087] In step 903, the text mining system 100 calculates the topic community overlap metrics for each community with respect to the text as:
topic community overlap metrics = Intersection(text D, community C) / community C
[0088] In step 904, the topic management module 208 calculates the final score for each community with respect to the text based on the text coverage metrics, the weight of the text, and the topic community overlap metrics. The topic management module 208 may calculate the final score for the community with respect to the text as:
Final score = text coverage metrics of the text * weight 1 + topic community overlap metrics * weight 2
[0089] On calculating the final score for all the communities with respect to each text, in step 905, the topic management module 208 assigns the topics corresponding to the communities to the respective text in the decreasing order of the final scores. The various actions in method 900 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 9a may be omitted.
[0090] Consider an example scenario, as depicted in FIG. 9b, wherein the text mining system 100 determines 6 topics/communities (C1-C6) by analyzing the co-occurrences of the word combinations across the plurality of texts. In such a scenario, the text mining system 100 calculates the text coverage metrics for the received one of the plurality of texts (for example: text D) with respect to each community based on the intersection of the word combinations between the text D and the respective community. In an example herein, the text mining system 100 may calculate the text coverage metrics for the text D with respect to the communities C1, C2, C3, C4, C5, and C6 as 4%, 10%, 2%, 30%, 10%, and 33%, respectively. On calculating the text coverage metrics for the text D with respect to all the communities, the text mining system 100 calculates the topic community overlap metrics for each community with respect to the text D based on the intersection of the word combinations between the text and the respective community. In an example herein, the text mining system 100 may calculate the topic community overlap metrics for the communities C1, C2, C3, C4, C5, and C6 with respect to the text D as, 12%, 80%, 100%, 85%, 55%, and 45%, respectively.
[0091] On calculating the topic community overlap metrics for all the communities, the text mining system 100 calculates the final score for each topic/community based on the associated text coverage metrics, the weight of the text D, the associated topic community overlap metrics, and the weight of the topic/community. In an example, the weight of the text and the weight of the topic/community may be set to ‘2’ and ‘1’, respectively. In an example herein, the text mining system 100 calculates the final score for the communities C1, C2, C3, C4, C5, and C6 with respect to the text D as invalid, 33, invalid, 48, 25, and 35, respectively. The text mining system 100 calculates the final score for the community C1 as invalid since the text coverage metrics and the topic community overlap metrics associated with the C1 are less than the respective threshold values. The text mining system 100 calculates the final score for the community C3 as invalid since the text coverage metrics associated with the C3 is less than the respective threshold value. The text mining system 100 also ranks the communities C1-C6 with respect to the text D, based on the weighted average of the text coverage metrics and the topic community overlap metrics. In an example herein, the text mining system 100 may rank the C1, C2, C3, C4, C5, and C6 as invalid, 3, invalid, 1, 4, and 2, respectively.
[0092] On calculating the final scores for all the communities, the text mining system 100 assigns the topics corresponding to the communities C2, C4, C5, and C6 (as the C1 and C3 are invalid) to the text D in the decreasing order of the final scores. In an example herein, the text mining system 100 may assign the topic corresponding to the C4 as a first topic to the text D, followed by C6, C2, and C5 based on the decreasing order of the final scores. The text mining system 100 may repeat the above described steps for assigning the topics for other texts.
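Applying a scoring of this kind to the FIG. 9b values reproduces the relative ordering described above. In the sketch below, the weights ‘2’ and ‘1’ are the ones stated in the example, while the coverage and overlap thresholds and the division by the weight sum are assumptions chosen only for illustration; the division does not affect the ranking.

```python
# Ranking the FIG. 9b example with the stated weights (text weight 2,
# community weight 1). The thresholds (5 and 40) are assumed values chosen
# only so that C1 and C3 become invalid, as described in the example.
coverage = {"C1": 4, "C2": 10, "C3": 2, "C4": 30, "C5": 10, "C6": 33}
overlap  = {"C1": 12, "C2": 80, "C3": 100, "C4": 85, "C5": 55, "C6": 45}
MIN_COV, MIN_OVL, W_TEXT, W_COMM = 5, 40, 2, 1

scores = {c: (coverage[c] * W_TEXT + overlap[c] * W_COMM) / (W_TEXT + W_COMM)
          if coverage[c] >= MIN_COV and overlap[c] >= MIN_OVL else None
          for c in coverage}
ranking = sorted((c for c, s in scores.items() if s is not None),
                 key=lambda c: scores[c], reverse=True)
print(ranking)  # ['C4', 'C6', 'C2', 'C5']; C1 and C3 are invalid
```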
[0093] FIG. 10 is a flow diagram 1000 depicting a method for mining the text communication, according to embodiments as disclosed herein.
[0094] At step 1002, the method includes performing, by the text mining system 100, the language based pre-processing on each text of the plurality of texts. At step 1004, the method includes determining, by the text mining system 100, the frequently occurring word combinations in each text, on performing the language based pre-processing on the plurality of texts.
[0095] At step 1006, the method includes generating, by the text mining system 100, the weighted co-occurrence graph based on the determined frequently occurring word combinations in each pre-processed text. At step 1008, the method includes creating, by the text mining system 100, the topics based on the weighted co-occurrence graph. At step 1010, the method includes creating, by the text mining system 100, the hierarchy of topics by grouping the similar topics.
[0096] At step 1012, the method includes assigning, by the text mining system 100, the at least one topic to each text using the hierarchy of topics. The various actions in method 1000 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 10 may be omitted.
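By way of illustration only, the flow of steps 1002 to 1012 may be summarized as a pipeline skeleton. Every helper name below is a placeholder for the corresponding stage and is not an identifier defined in the specification.

```python
# Illustrative end-to-end pipeline mirroring steps 1002-1012 of FIG. 10.
# All helper names are placeholders passed in by the caller.

def mine_text_communication(texts, preprocess, extract_word_combinations,
                            build_cooccurrence_graph, detect_topics,
                            build_hierarchy, map_topics):
    cleaned = [preprocess(t) for t in texts]                  # step 1002
    combos = [extract_word_combinations(t) for t in cleaned]  # step 1004
    graph = build_cooccurrence_graph(combos)                  # step 1006
    topics = detect_topics(graph)                             # step 1008 (e.g., Louvain communities)
    hierarchy = build_hierarchy(topics)                       # step 1010
    return {t: map_topics(c, hierarchy)                       # step 1012
            for t, c in zip(texts, combos)}
```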
[0097] FIG. 11 is an example diagram depicting the topics assigned for the plurality of texts, according to embodiments as disclosed herein. Embodiments herein enable the text mining system 100 to provide the topics and description for each text based on the weighted co-occurrence graph and the community detection methodology. In an example herein, the text may include the short text, and the topics may include a main topic/theme, a subtopic/theme, or the like. In an example, the text mining system 100 may generate the description for each text based on at least one of high-frequency n-grams. In another example, the text mining system 100 may generate the description for each text by performing additional fine-tuning based on domain knowledge and domain grammar.
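By way of illustration only, a description based on high-frequency n-grams may be generated, for example, as sketched below, assuming whitespace tokenization; the choice of bigrams and the top-3 cutoff are illustrative assumptions.

```python
from collections import Counter

def describe(text: str, n: int = 2, top_k: int = 3) -> list:
    """Illustrative description generator: the top-k most frequent n-grams.
    Whitespace tokenization and the bigram/top-3 choices are assumptions."""
    tokens = text.lower().split()
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [g for g, _ in Counter(ngrams).most_common(top_k)]

print(describe("payment failed again card declined payment failed"))
```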
[0098] Many real-world applications require semantic understanding of short texts, where traditional methods, which are meant for longer texts, suffer from performance degradation. One such case may be an operator whose customer care executives have been empowered to issue a good-will credit (of varying amount) to customers when they call in with an issue, which in turn may be adjusted against their bills. While issuing the good-will credit, the customer care executives may capture a credit memo (text varying from a few words to a few lines). These good-will credits may amount to a few million dollars per month. The only way to understand the reasons for these good-will credits is to analyze the huge collection of credit memos. In such a scenario, embodiments herein provide the customer care executives with a few high-level topics and, additionally, a few tens of sub-topics. With these outcomes, the customer care executives may take corrective actions, optimizing the good-will credits while enhancing the customer experience. Using the above approach, a continuous process may be put in place to manage further challenges as well.
[0099] Embodiments herein provide a language compatible text mining solution, which extracts frequently occurring word combinations/top keywords for each text, classifies and clusters the extracted frequently occurring word combinations, and generates the topics for each text. The language compatible text mining solution may be applied across domains such as, but not limited to, social networking posts and status updates, comments, product reviews, Short Messaging Service (SMS), customer complaints, feedback, or any other source that includes casual-language text, which is more informal than traditional grammatically correct text.
[00100] The embodiments disclosed herein may be implemented through at least one software program running on at least one hardware device and performing network management functions to control the elements. The elements shown in FIGs. 1, and 2, may be at least one of a hardware device, or a combination of hardware device and software module.
[00101] The embodiments disclosed herein describe methods and systems for mining multilingual text communication. Therefore, it is understood that the scope of the protection is extended to such a program and, in addition to a computer readable means having a message therein, such computer readable storage means contain program code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The method is implemented in a preferred embodiment through or together with a software program written in, e.g., Very high speed integrated circuit Hardware Description Language (VHDL) or another programming language, or implemented by one or more VHDL or software modules being executed on at least one hardware device. The hardware device may be any kind of portable device that may be programmed. The device may also include means which could be, e.g., hardware means such as an ASIC, or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. The method embodiments described herein could be implemented partly in hardware and partly in software. Alternatively, the invention may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[00102] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others may, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of embodiments, those skilled in the art will recognize that the embodiments herein may be practiced with modification within the spirit and scope of the embodiments as described herein.
CLAIMS:
STATEMENT OF CLAIMS
I/We claim:
1. A method for mining text communication, the method comprising:
performing, by the text mining system (100), a language based pre-processing on each text of a plurality of texts;
determining, by the text mining system (100), frequently occurring word combinations in each text, on performing the language based pre-processing on the plurality of texts;
generating, by the text mining system (100), a weighted co-occurrence graph based on the determined frequently occurring word combinations in each pre-processed text;
creating, by the text mining system (100), topics based on the weighted co-occurrence graph;
creating, by the text mining system (100), a hierarchy of topics by grouping the similar topics; and
assigning, by the text mining system (100), at least one topic to each text using the hierarchy of topics.

2. The method of claim 1, wherein a text includes a short text, and the text is multilingual.

3. The method of claim 1, wherein performing, by the text mining system (100), the language based pre-processing on each text includes:
identifying a language of each text; and
performing at least one step of Natural Language Processing (NLP) on each text, wherein the at least one step includes at least one of, noise removal, language based spelling correction, language based stop word removal, and language based normalization.

4. The method of claim 1, wherein determining, by the text mining system (100), the frequently occurring word combinations in each text includes: recursively performing steps of:
generating n-grams for the text of the plurality of texts in a sequence by determining a distance from other texts of the plurality of texts, wherein the distance is a sliding window for extracting the word combinations from each text, wherein the distance is determined for the text as a minimum function of a number of words present in the corresponding text and an average number of words present over the plurality of texts;
determining the frequently occurring word combinations in the text using the generated n-grams; and
filtering the determined frequently occurring word combinations in other texts of the plurality of texts, until less than a pre-defined number of texts is remaining.

5. The method of claim 1, wherein generating, by the text mining system (100), the weighted co-occurrence graph includes:
creating a text-word mapping of the determined frequently occurring word combinations with each respective text and a frequency matrix; and
generating a word combination occurrence matrix based on the created text-word mapping and the frequency matrix, wherein the word combination occurrence matrix is the weighted co-occurrence graph providing information about co-occurrences of the frequently occurring word combinations across the plurality of texts, wherein nodes of the weighted co-occurrence graph represent the frequently occurring word combinations, edges connecting the nodes in the weighted co-occurrence graph represent the co-occurrences of the frequently occurring word combinations, and weights on each edge in the weighted co-occurrence graph represent a number of co-occurrences of the frequently occurring word combinations corresponding to the respective edge.

6. The method of claim 1, wherein creating, by the text mining system (100), the topics includes:
generating communities by optimizing a modularity of the weighted co-occurrence graph using a greedy optimization method, wherein each community includes the frequently occurring word combinations; and
assigning a topic for each community based on domain knowledge and domain grammar associated with the frequently occurring word combinations present in the corresponding community, wherein the topic assigned for the community is a representation of the frequently occurring word combinations included in the community.

7. The method of claim 6, wherein generating the communities includes:
grouping the nodes corresponding to the frequently occurring word combinations of the weighted co-occurrence graph into a first number of communities using a Louvain community detection method;
checking if a size of the first number of communities is lesser than a community threshold;
terminating a process of grouping the nodes, if the size of the first number of communities is not lesser than the community threshold;
recursively performing a step of grouping the nodes of the weighted co-occurrence graph into a subsequent number of communities, until a hierarchy of communities is generated, if the size of the first number of communities is lesser than the community threshold; and
aggregating the nodes belonging to the same communities and building a new weighted co-occurrence graph, on generating the hierarchy of communities, wherein the nodes of the new weighted co-occurrence graph depict the communities.

8. The method of claim 1, wherein creating, by the text mining system (100), the hierarchy of topics includes:
establishing a parent-child relationship between the topics corresponding to the communities; and
creating granularities of topics for a parent of a collection of children of the topics that creates the hierarchy of topics.

9. The method of claim 1, wherein assigning, by the text mining system (100), the at least one topic for each text includes:
estimating final scores for the communities with respect to each text, wherein the communities correspond to the topics present in the hierarchy of topics; and
assigning the at least one topic corresponding to at least one of the communities to each respective text, based on a decreasing order of the final scores of the communities.

10. The method of claim 9, wherein estimating a final score for each community includes:
calculating text coverage metrics for each text with respect to each community as;
text coverage metrics = (Intersection (text, community)) / (text);
calculating topic community overlap metrics for each community with respect to each text as:
topic community overlap metrics = (Intersection (text, community))/(community); and
estimating the final score for each community as:
final score = text coverage metrics of the text * weight of the text + topic community overlap metrics associated with the corresponding community * weight of the corresponding community.

11. The method of claim 10, further comprising:
estimating the final score of the community as invalid, if at least one of the associated text coverage metrics and topic community overlap metrics is less than a respective threshold value; and
ranking the communities based on a weighted average score of the text coverage metrics and the topic community overlap metrics associated with the communities.
12. A text mining system (100) comprising:
a memory (102); and
a processor (108) coupled to the memory (102) configured to:
perform a language based pre-processing on each text of a plurality of texts;
determine frequently occurring word combinations in each text, on performing the language based pre-processing on the plurality of texts;
generate a weighted co-occurrence graph based on the determined frequently occurring word combinations in each pre-processed text;
create topics based on the weighted co-occurrence graph;
create a hierarchy of topics by grouping the similar topics; and
assign at least one topic to each text using the hierarchy of topics.

13. The text mining system (100) of claim 12, wherein a text includes a short text, and the text is multilingual.

Dated this 29th March 2021

Signatures:
Name of the Signatory: Nitin Mohan Nair
Patent Agent No: 2585

Documents

Application Documents

# Name Date
1 202041028027-STATEMENT OF UNDERTAKING (FORM 3) [01-07-2020(online)].pdf 2020-07-01
2 202041028027-PROVISIONAL SPECIFICATION [01-07-2020(online)].pdf 2020-07-01
3 202041028027-PROOF OF RIGHT [01-07-2020(online)].pdf 2020-07-01
4 202041028027-POWER OF AUTHORITY [01-07-2020(online)].pdf 2020-07-01
5 202041028027-FORM 1 [01-07-2020(online)].pdf 2020-07-01
6 202041028027-DRAWINGS [01-07-2020(online)].pdf 2020-07-01
7 202041028027-DECLARATION OF INVENTORSHIP (FORM 5) [01-07-2020(online)].pdf 2020-07-01
8 202041028027-Form26_Power of Attorney_01-12-2020.pdf 2020-12-01
9 202041028027-Form1_After Filing_01-12-2020.pdf 2020-12-01
10 202041028027-FORM 18 [29-03-2021(online)].pdf 2021-03-29
11 202041028027-DRAWING [29-03-2021(online)].pdf 2021-03-29
12 202041028027-COMPLETE SPECIFICATION [29-03-2021(online)].pdf 2021-03-29
13 202041028027-Correspondence_Form-1 And POA_28-01-2022.pdf 2022-01-28
14 202041028027-FER.pdf 2022-03-23
15 202041028027-RELEVANT DOCUMENTS [09-03-2023(online)].pdf 2023-03-09
16 202041028027-REQUEST FOR CERTIFIED COPY [22-07-2024(online)].pdf 2024-07-22

Search Strategy

1 202041028027E_21-03-2022.pdf