Abstract: None of the past work specifically discovers “tasks” that are generally carried out in an industry group automatically from a text corpus. Embodiments of the present disclosure provide a method and system for discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich a commonsense knowledge base. The method addresses the above-mentioned technical limitations by augmenting considerable knowledge to commonsense knowledge bases (KBs) while eliminating redundant knowledge of general tasks across IGs using a task-IG affinity model, which generates an affinity score for each task with respect to each of the IGs based on an affinity function such that the affinity score is high if the task has a high support and a high specificity with respect to the IG. [To be published with 1B]
Description:FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
DISCOVERING COMMONSENSE KNOWLEDGE ABOUT TASKS CARRIED OUT WITHIN INDUSTRY GROUPS (IGs)
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The embodiments herein generally relate to the field of enriching commonsense knowledge bases and, more particularly, to a method and system for discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich commonsense knowledge base.
BACKGROUND
Several Natural Language Processing (NLP) applications take advantage of commonsense knowledge, such as question answering, textual entailment, sentiment analysis, and summarization. However, methods to extract a specific type of knowledge, focused on tasks performed by organizations belonging to a certain industry group (IG), are hardly explored. As specified in the literature, a task is understood as a well-defined knowledge-based action that is carried out volitionally.
Most prior work formalizes the augmentation of task-IG knowledge as a classification problem. One of the works addresses the task of categorizing companies within industry classification schemes using encyclopedia articles. Another existing approach presents a deep neural-based industry classification to construct a database of companies labelled with the corresponding industry. Another existing method develops a technique to rank a triple by estimating pointwise mutual information between two entities using a large pre-trained bidirectional language model. Another work in the literature assumes the existence of the head and tail entities in a KB and tries to find more relationships between them. None of the past work specifically discovers “tasks” that are generally carried out in an industry group automatically from a text corpus. An existing method, ConceptNet by Speer et al., 2017, consists of commonsense knowledge about the world in the form of triples such as ‘Doctor, is capable of, help a sick person.’ However, it is observed that ConceptNet has very limited commonsense knowledge about Industry Groups (IGs). For instance, the ConceptNet node for Power company contains no knowledge of the tasks being carried out in the energy industry. Often, such knowledge is limited or largely absent for most industry groups (IGs) such as Energy, Transportation, Retail, Banking and the like, wherein 24 IGs are identified.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
For example, in one embodiment, a method for discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich commonsense knowledge bases is provided. The method includes receiving a plurality of text documents from a corpus. Further, the method includes extracting a plurality of tasks by processing a plurality of sentences in each of the plurality of documents using a task extraction technique. Further, the method includes labelling each of the extracted plurality of tasks with an Industry Group (IG) from among the plurality of IGs using an unsupervised Machine Learning (ML) technique. Furthermore, the method includes converting each of the plurality of tasks into a canonical form using a plurality of linguistic rules. Further, the method includes processing each of the plurality of tasks, corresponding labeled IG and associated canonical form of each of the plurality of tasks by a task-IG affinity model, wherein the task-IG affinity model generates an affinity score for each of the plurality of tasks with respect to each of the plurality of IGs based on an affinity function, wherein the affinity score is above an affinity threshold if the task has a high support and a high specificity. The high support indicates a task is observed with similar IG in multiple sentences in the corpus, and the high specificity indicates a task is associated with a specific IG. Further, the method includes clustering one or more tasks mapped to each of the plurality of IGs and retaining one task per cluster as a representative task for the IG associated with the cluster. Furthermore, the method includes generating a plurality of triplets for each of the plurality of IGs using the representative task for the IG and augmenting common sense knowledge to a commonsense knowledge base of task specific industry information by adding the plurality of triplets.
In another aspect, a system for discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich commonsense knowledge bases is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to receive a plurality of text documents from a corpus. Further, the one or more hardware processors are configured to extract a plurality of tasks by processing a plurality of sentences in each of the plurality of documents using a task extraction technique. Further, the one or more hardware processors are configured to label each of the extracted plurality of tasks with an Industry Group (IG) from among the plurality of IGs using an unsupervised Machine Learning (ML) technique. Furthermore, the one or more hardware processors are configured to convert each of the plurality of tasks into a canonical form using a plurality of linguistic rules. Further, the one or more hardware processors are configured to process each of the plurality of tasks, corresponding labeled IG and associated canonical form of each of the plurality of tasks by a task-IG affinity model, wherein the task-IG affinity model generates an affinity score for each of the plurality of tasks with respect to each of the plurality of IGs based on an affinity function, wherein the affinity score is above an affinity threshold if the task has a high support and a high specificity. The high support indicates a task is observed with similar IG in multiple sentences in the corpus, and the high specificity indicates a task is associated with a specific IG. 
Further, the one or more hardware processors are configured to cluster one or more tasks mapped to each of the plurality of IGs and retaining one task per cluster as a representative task for the IG associated with the cluster. Furthermore, the one or more hardware processors are configured to generate a plurality of triplets for each of the plurality of IGs using the representative task for the IG and augmenting common sense knowledge to a commonsense knowledge base of task specific industry information by adding the plurality of triplets.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors causes a method for discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich commonsense knowledge bases. The method includes receiving a plurality of text documents from a corpus. Further, the method includes extracting a plurality of tasks by processing a plurality of sentences in each of the plurality of documents using a task extraction technique. Further, the method includes labelling each of the extracted plurality of tasks with an Industry Group (IG) from among the plurality of IGs using an unsupervised Machine Learning (ML) technique. Furthermore, the method includes converting each of the plurality of tasks into a canonical form using a plurality of linguistic rules. Further, the method includes processing each of the plurality of tasks, corresponding labeled IG and associated canonical form of each of the plurality of tasks by a task-IG affinity model, wherein the task-IG affinity model generates an affinity score for each of the plurality of tasks with respect to each of the plurality of IGs based on an affinity function, wherein the affinity score is above an affinity threshold if the task has a high support and a high specificity. The high support indicates a task is observed with similar IG in multiple sentences in the corpus, and the high specificity indicates a task is associated with a specific IG. Further, the method includes clustering one or more tasks mapped to each of the plurality of IGs and retaining one task per cluster as a representative task for the IG associated with the cluster. 
Furthermore, the method includes generating a plurality of triplets for each of the plurality of IGs using the representative task for the IG and augmenting common sense knowledge to a commonsense knowledge base of task specific industry information by adding the plurality of triplets.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1A is a functional block diagram of a system, for discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich a commonsense knowledge base, in accordance with some embodiments of the present disclosure.
FIG. 1B illustrates an architectural overview of the system of FIG. 1, in accordance with some embodiments of the present disclosure.
FIGS. 2A through 2B (collectively referred as FIG. 2) is a flow diagram illustrating a method for discovering the commonsense knowledge about the tasks carried out within the Industry Groups (IGs) to enrich the commonsense knowledge base, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
FIG. 3 illustrates comparative representations of tasks learned by a sentence transformer model and a task-IG affinity model of the system of FIG. 1B, in accordance with some embodiments of the present disclosure.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
As mentioned earlier, attempts have been made to predict an appropriate industry type for a specific organization. However, there have been no attempts to automatically discover commonsense knowledge about the tasks being generally carried out in an industry group or industry domain. Embodiments of the present disclosure provide a method and system for discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich a commonsense knowledge base. The method addresses the above-mentioned technical limitations by augmenting considerable knowledge to commonsense knowledge bases (KBs) of the form ‘Energy company, is capable of, operate coal fired plants.’ The solution provided can help Natural Language Processing (NLP) applications by allowing them to better understand the working of industry groups. The method augments an existing commonsense knowledge base with the knowledge of several tasks being performed by an Industry Group (IG).
For addition to a commonsense KB, it is desirable to add the most representative tasks of each IG. There may be some tasks that are too specific to an IG while other tasks may be too general. General tasks do not provide any IG-specific value addition to the commonsense KB, as they are not specific to any IG and are generally carried out across several IGs. For example, “declaring financial results” is carried out by organizations across all IGs. The method disclosed eliminates such redundant knowledge getting added to the commonsense KB with a task-IG affinity model. The task-IG affinity model generates an affinity score for each task with respect to each of the IGs based on an affinity function, such that the affinity score is above an affinity threshold if the task has a high support and a high specificity with respect to the IG.
Referring now to the drawings, and more particularly to FIGS. 1A through 3, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1A is a functional block diagram of a system 100, for discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich a commonsense knowledge base, in accordance with some embodiments of the present disclosure.
In an embodiment, the system 100 includes a processor(s) 104, communication interface device(s), alternatively referred as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware processors is configured to execute functions of one or more functional blocks of the system 100.
Referring to the components of system 100, in an embodiment, the processor(s) 104, can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, and the like.
The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface and the like, and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular and the like. In an embodiment, the I/O interface(s) 106 can include one or more ports for connecting to a number of external devices or to another server or devices. The I/O interface 106 facilitates access of text documents from a corpus for processing, and similarly uploading the statements generated by the system 100 to a commonsense knowledge base, as depicted in the architectural overview of the system 100 in FIG. 1B.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
In an embodiment, the memory 102 includes a plurality of modules 110 such as modules (not shown) for performing task extraction, IG labelling, canonical form conversion, task-IG affinity model (as shown in FIG. 1B), modules for post processing and statement generation and the like. Further, the plurality of modules 110 include programs or coded instructions that supplement applications or functions performed by the system 100 for executing different steps involved in the process of discovering commonsense knowledge about tasks carried out within Industry Groups (IGs) to enrich a commonsense knowledge base, being performed by the system 100. The plurality of modules 110, amongst other things, can include routines, programs, objects, components, and data structures, which performs particular tasks or implement particular abstract data types. The plurality of modules 110 may also be used as, signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 110 can be used by hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof.
Further, the memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure.
Further, the memory 102 includes a database 108. The database (or repository) 108 may include a plurality of abstracted pieces of code for refinement and data that is processed, received, or generated as a result of the execution of the plurality of modules in the module(s) 110. The corpus, as depicted in FIG. 1B, may be within the database or may be external to the database, and text documents from the corpus can be accessed by the system 100 into the database 108. Although the database 108 is shown internal to the system 100, it will be noted that, in alternate embodiments, the database 108 can also be implemented external to the system 100, and communicatively coupled to the system 100. The data contained within such an external database may be periodically updated. For example, new data may be added into the database (not shown in FIG. 1) and/or existing data may be modified and/or non-useful data may be deleted from the database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). Functions of the components of the system 100 are now explained with reference to the architectural overview of the system 100 as in FIG. 1B and steps in flow diagrams in FIG. 2.
FIGS. 2A through 2B (collectively referred as FIG. 2) is a flow diagram illustrating a method 200 for discovering the commonsense knowledge about the tasks carried out within the Industry Groups (IGs) to enrich the commonsense knowledge base, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIGS. 1A and 1B and the steps of flow diagram as depicted in FIG. 2. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps to be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
The problem of identifying tasks with appropriate IGs is defined as follows. Consider a set of documents D (accessed from the corpus) and a set of IGs G = {g_1, g_2, ..., g_24}, wherein the 24 IGs are as defined in a standard IG list well known to people having ordinary skill in the art. The goal is to discover a set of tasks T_i = {t_i1, t_i2, ..., t_im} for each IG g_i, such that triplets of the form <g_i, is capable of, t_ij> can be directly added to a commonsense knowledge base. The challenge is to design an end-to-end weakly supervised framework.
Referring to the steps of the method 200, at step 202 of the method 200, the one or more hardware processors 104 receive a plurality of text documents (D) from the corpus as depicted in FIG. 1B.
At step 204 of the method 200, the one or more hardware processors 104 extract a plurality of tasks by processing the plurality of sentences in each of the plurality of documents using a task extraction technique. A set of tasks T is extracted from the documents D using the task extraction technique proposed by Sachin Pawar et al., 2021 in Weakly supervised extraction of tasks from text. The technique is based on a weakly supervised Bidirectional Encoder Representations from Transformers (BERT)-based classification model which predicts, for each word, whether it is the headword of a task phrase or not. A set of dependency-tree-based rules is then used to expand a task headword to a complete task phrase. A few sample tasks (in square brackets) for corresponding IGs from the Reuters and TechCrunch datasets are provided below:
IG-Energy
Reuters -Pacific Gas began [construction of the two nuclear power units] in 1969
TechCrunch - Total Petroleum NA (TPN) [shut down several small crude oil pipelines] operating near the Texas/Oklahoma border last Friday as a precaution against damage from local flooding, according to Gary Zollinger, manager of operations.
IG-Transportation
Reuters - Crews hoped to [restore traffic to the line later today after clearing the damaged train] and repairing the tracks at Chacapalca, 225 km east of the capital, Lima.
TechCrunch- We launched new tools for airlines so they can better [predict consumer demand] and [plan their routes].
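The dependency-rule expansion step above can be sketched as follows. This is a minimal illustration only: the relation whitelist in keep_rels, the function name expand_headword, and the hand-coded toy parse are assumptions for demonstration, not the exact rules of the cited technique, and the real system uses a BERT-based headword classifier together with a full dependency parser.

```python
def expand_headword(tokens, heads, head_idx,
                    keep_rels={"dobj", "amod", "compound", "prep", "pobj", "det", "nummod"}):
    """Collect the dependency subtree of a task headword to form the task phrase."""
    children = {i: [] for i in range(len(tokens))}
    for i, (h, rel) in enumerate(heads):
        if h is not None and rel in keep_rels:
            children[h].append(i)
    # breadth-first collection of the subtree rooted at the headword
    span, frontier = {head_idx}, [head_idx]
    while frontier:
        frontier = [c for i in frontier for c in children[i] if c not in span]
        span.update(frontier)
    return " ".join(tokens[i] for i in sorted(span))

# "Pacific Gas began construction of the two nuclear power units in 1969"
tokens = ["Pacific", "Gas", "began", "construction", "of", "the", "two",
          "nuclear", "power", "units", "in", "1969"]
# (head index, dependency relation) per token; None marks the sentence root
heads = [(1, "compound"), (2, "nsubj"), (None, "root"), (2, "dobj"),
         (3, "prep"), (9, "det"), (9, "nummod"), (9, "amod"),
         (9, "compound"), (4, "pobj"), (2, "prep"), (10, "pobj")]

# headword "construction" (index 3) expands to the bracketed task phrase
print(expand_headword(tokens, heads, 3))  # -> construction of the two nuclear power units
```

Note how "in 1969" is excluded: it attaches to the verb "began" rather than to the task headword, so it lies outside the collected subtree.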
At step 206 of the method 200, the one or more hardware processors 104 label each of the extracted plurality of tasks with an Industry Group (IG) from among the plurality of IGs using an unsupervised Machine Learning (ML) technique. The unsupervised ML technique comprises labelling each of the extracted plurality of tasks with an IG among the plurality of IGs to generate a plurality of task labels, wherein the labelling comprises combining the IG generated for each of the plurality of tasks by (i) a keyword-lookup technique, (ii) a cosine-sim technique, and (iii) a zero-shot-text classification technique in accordance with a predefined criteria. The one or more tasks not mapping to either of the plurality of IGs are labeled as ‘Others’. The task labelling with IG is described below:
Task-IG Classification. Given the set of extracted tasks T, each task t is labelled with a corresponding IG g, i.e., t → g. It can be noted that some tasks are general and cannot be associated with an IG; these are labelled as ’Others’. Due to the unavailability of any labelled training data, the focus is on unsupervised methods where the only manual effort is to provide a set of five keywords kw_1, ..., kw_5 associated with each IG g along with a one-sentence description HP_g referred to as the hypothesis. HP_g is of the form ‘The previous sentence is about some aspects of g such as kw_1, kw_2, ..., kw_5.’
Industry Group and corresponding Keywords
Real Estate: rent, house, residential, apartment, homeowner
Utilities: plumbing, sewage, bill, housekeeping, laundry.
Media & Entertainment: news, film, advertising, publishing, broadcasting.
Telecommunication: telephone, mobile, network, internet, wireless communication.
Semiconductors & Semiconductor equipment: chip, ram, processor, motherboard, CPU.
Software & Services: data, app, outsourcing, programming, server.
Diversified Financials: investment, stock, portfolio, capital, asset management.
Health care equipment & services: hospital, medical, doctor, nurse, diagnostics.
Food & Staples Retailing: agriculture, farm, crop, vegetable, fruit.
Technology hardware & equipment: gadget, smart phone, tablet, graphic card, storage.
Insurance: health insurance, life insurance, medical insurance, risk insurance, insurance brokers.
Banks: loan, mortgage, accounts, payment, money.
Pharmaceuticals: medicine, drug, vaccine, syrup, biotechnology.
Household & personal products: toiletry, eyewear, cleansing, cosmetic, beauty product.
Food beverage & tobacco: alcohol, meat, brewer, distillery, cigarette.
Retailing: ecommerce, merchandise, distributor, shop, supermarket.
Consumer Services: hotel, restaurant, education, resort, casino.
Consumer durables & apparel: textile, footwear, electronic appliance, clothing, houseware.
Automobiles & components: car, truck, vehicle, motorcycle, tire
Transportation: railway, highway, airline, shipping, logistics.
Commercial & professional services: consulting, hiring, human resource, recruitment, printing.
Capital goods: machinery, equipment, aerospace, defense, satellites.
Materials: metal, mining, fertilizer, chemical, cement.
Energy: oil, electricity, coal, renewable, solar.
The three unsupervised methods used are as follows: (i) keywords-lookup provides a naive approach of looking up the keywords of each IG in a task t to assign a label (g_ks) to the task. (ii) cosine-sim computes the cosine similarity between embeddings of a task t and the hypothesis HP_g of each IG. The embeddings are obtained using the sentence-BERT model all-MiniLM-L6-v2, known in the art. The resultant IG with maximum cosine similarity is denoted as g_cs. (iii) zero-shot-tc: a zero-shot text classification technique known in the art, based on a natural language inference approach. Here, a relation (Entail vs. Contradict) is predicted for a pair consisting of a task t as a premise and an HP_g (for each IG) as a hypothesis. The IG with the highest entailment probability is returned as g_zs. An ensemble approach is used to combine the predictions of these three methods, as described in Algorithm 1, to predict a task-level IG for the task t.
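The ensemble over the three unsupervised labellers can be sketched as follows. Algorithm 1 itself is not reproduced in this excerpt, so the majority-vote combination rule below is one plausible criterion, and the names keywords_lookup, ensemble, and IG_KEYWORDS are illustrative assumptions; the cosine-sim and zero-shot predictions are passed in as precomputed labels.

```python
from collections import Counter

# abbreviated version of the IG keyword table above (illustrative subset)
IG_KEYWORDS = {
    "Energy": {"oil", "electricity", "coal", "renewable", "solar"},
    "Banks": {"loan", "mortgage", "accounts", "payment", "money"},
}

def keywords_lookup(task):
    """g_ks: naive lookup of IG keywords inside the task phrase."""
    words = set(task.lower().split())
    for ig, kws in IG_KEYWORDS.items():
        if words & kws:
            return ig
    return "Others"

def ensemble(g_ks, g_cs, g_zs):
    """Combine the three predictions; fall back to 'Others' on full disagreement."""
    ig, count = Counter([g_ks, g_cs, g_zs]).most_common(1)[0]
    return ig if count >= 2 else "Others"

task = "shut down several small crude oil pipelines"
g_ks = keywords_lookup(task)                       # "oil" matches -> Energy
print(ensemble(g_ks, "Energy", "Transportation"))  # -> Energy
```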
A sentence-level IG for each task is predicted by using the same Algorithm 1 to classify the entire sentence containing the task. The sentence-level IG for a task gets influenced by other context words in the sentence outside the task phrase itself and hence may differ from the task-level IG. Optionally, an IG label is predicted for the entire document, which is referred to as the document-level IG. The document-level IG is either available for some documents (e.g., the Wikipedia category of a Wikipedia article) or can be taken as the most frequent IG label among the sentence-level IGs of the document.
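The fallback document-level IG described above, i.e., the most frequent sentence-level IG in the document, can be sketched as follows. Ignoring 'Others' when counting is an assumption made here for illustration, as is the function name document_level_ig.

```python
from collections import Counter

def document_level_ig(sentence_igs):
    """Return the most frequent sentence-level IG of a document."""
    # Ignore 'Others' so that a dominant genuine IG is preferred when present
    counts = Counter(ig for ig in sentence_igs if ig != "Others")
    return counts.most_common(1)[0][0] if counts else "Others"

print(document_level_ig(["Energy", "Others", "Energy", "Transportation"]))  # -> Energy
```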
At step 208 of the method 200, the one or more hardware processors 104 convert each of the plurality of tasks into a canonical form using a plurality of linguistic rules. Any task t extracted directly from documents is very specific and needs to be generalized to the canonical form t̂ before adding it to the commonsense KB. For example, consider the task t: created virtual creatures and characters in movies like “The Lord of the Rings”, belonging to the IG Media & Entertainment. It is very specific to a particular movie and needs to be generalized to say, ‘create virtual creatures and characters.’ A set of linguistic rules is devised to convert a task to its canonical form, i.e., t → t̂. At first, noise such as hyphens, quotes, and text in brackets is removed from the task.
Any task can either be a verb phrase or a noun phrase. If a task is a verb phrase, any passive voice phrase is first converted to active voice, and then a syntactic pattern of the task is identified, such as V-NP (verb followed by a noun phrase), V-NP-P-NP (verb followed by a noun phrase followed by a prepositional phrase), etc. The head verbs are converted to their base forms and the constituent noun phrases (NPs) are processed to remove any determiners, possessive pronouns, and common adjectives (determined using corpus statistics). Any named entity mentions in these constituent NPs, e.g., GPE (geo-political entity) mentions, are replaced with state, country, or continent using a pre-defined list, and the rest with “places.” Entity mentions of type MONEY, DATE or TIME are removed from the task. The tasks which are noun phrases are similarly processed by first identifying a suitable pattern (such as only NP, or NP-P-NP) and then applying similar rules as above.
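A few of the canonicalization rules above can be sketched with plain string and regex operations. This is a simplified stand-in under stated assumptions: the real system uses a parser for voice conversion and pattern identification, and the small determiner and base-form tables here (DETERMINERS, BASE_FORM) are illustrative, not the full rule set.

```python
import re

DETERMINERS = {"the", "a", "an", "its", "their", "our"}   # illustrative subset
BASE_FORM = {"created": "create", "creates": "create", "creating": "create"}

def canonicalize(task):
    """Apply a few sample linguistic rules to move a task toward canonical form."""
    # remove noise: quotes, hyphens, and text in brackets
    t = re.sub(r"\(.*?\)|[\"'\u201c\u201d]|-", " ", task)
    # drop trailing specifics introduced by 'like' / 'such as' (e.g. movie names)
    t = re.split(r"\b(?:like|such as)\b", t)[0]
    words = []
    for i, w in enumerate(t.split()):
        if w.lower() in DETERMINERS:
            continue                                      # strip determiners
        if i == 0:
            w = BASE_FORM.get(w.lower(), w.lower())       # head verb -> base form
        words.append(w)
    return " ".join(words)

print(canonicalize('created virtual creatures and characters in movies like "The Lord of the Rings"'))
# -> create virtual creatures and characters in movies
```

The remaining rules (dropping the now-dangling prepositional phrase, named-entity replacement) would further reduce this to ‘create virtual creatures and characters’.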
At step 210 of the method 200, the one or more hardware processors 104 process each of the plurality of tasks, the corresponding labeled IG, and the associated canonical form of each of the plurality of tasks by the task-IG affinity model, as depicted in FIG. 2B, to map an IG to each of the plurality of tasks. The task-IG affinity model generates an affinity score for each task with respect to each of the plurality of IGs based on an affinity function, wherein the affinity score generated is high (above an affinity threshold, for example, a predefined value identified by a subject matter expert based on experimentation, such as 0.6) if the task has a high support and a high specificity. The high support indicates a task is observed with the same IG in multiple sentences in the corpus, and the high specificity indicates a task is associated with a specific IG. The task-IG affinity model comprises two linear transformation layers, for tasks and IGs respectively, which transform sentence transformer representations to obtain new representations. The task-IG affinity model is trained using an instance weighting approach, wherein each task instance is weighted differently by considering multiple aspects comprising the agreement between task-level, sentence-level, and document-level IG predictions. As explained for task-level labelling, the labelling of each of the plurality of sentences with an IG (sentence-level IG predictions) is generated by the unsupervised ML technique, which combines the IG generated for each of the plurality of sentences by (i) the keyword-lookup technique, (ii) the cosine-sim technique, and (iii) the zero-shot-text classification technique in accordance with the predefined criteria.
The task-IG Affinity model: For addition to a commonsense KB, it is desirable to add the most representative tasks of each IG. There may be some tasks that are too specific to an IG while other tasks may be too general. Hence, given a task in its canonical form t̂ labelled with IG g, the affinity function f(t̂, g) is devised, which predicts an affinity score ∈ [-1, 1]. The affinity function is expected to return a high affinity score if the following conditions hold:
High support: The task t̂ (or other tasks with similar meaning) is observed with IG g in multiple sentences in the corpus.
High specificity: The task t̂ (or other tasks with similar meaning) is specific to g, i.e., it is rarely observed for any IG other than g in the corpus. To learn the affinity function f(t̂, g), self-supervision is used, where a training instance is created for each task t̂ labelled with an IG as ⟨t̂, g, g_n1, g_n2, w_i⟩. Here, g is either the task-level or the sentence-level IG predicted for the task t̂, and w_i is the instance weight that is described later. g_n1 and g_n2 are negative IGs for t̂, which are randomly selected from the IGs other than g and G_EX, where G_EX = {g_cs, g'_cs, g''_cs} represents the most probable IGs returned by the cosine similarity based technique. It can be noted that excluding the IGs in G_EX gives more robust negative instances.
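The creation of a self-supervised training instance with two sampled negative IGs can be sketched as follows. This is an illustrative sketch, not the disclosed implementation; the IG list is abbreviated and the function name and weight value are hypothetical.

```python
import random

# Illustrative subset of the 24 IGs used in the disclosure.
IGS = ["Banks", "Energy", "Retailing", "Insurance", "Transportation",
       "Software & Services", "Pharmaceuticals", "Media & Entertainment"]

def make_instance(task, g, g_ex, weight, rng=random):
    """Build one training instance <task, g, g_n1, g_n2, w_i>.

    Negative IGs are sampled from all IGs excluding the positive label g
    and the set G_EX of most probable IGs per the cosine-sim technique,
    which yields more robust negatives.
    """
    candidates = [ig for ig in IGS if ig != g and ig not in g_ex]
    g_n1, g_n2 = rng.sample(candidates, 2)  # two distinct negative IGs
    return {"task": task, "pos": g, "neg1": g_n1, "neg2": g_n2,
            "weight": weight}

inst = make_instance("develop therapeutic medicines", "Pharmaceuticals",
                     g_ex={"Pharmaceuticals", "Insurance"}, weight=0.75)
```

Excluding G_EX means a near-miss IG such as Insurance here can never be used as a negative, so the model is not penalized for ranking plausible IGs highly.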
The task-IG affinity model: The tasks as well as the IGs (each IG using its corresponding hypothesis HP_g) are encoded using a pre-trained sentence-transformer (SM), for example, all-MiniLM-L6-v2 known in the art, to obtain task and IG embeddings in R^384.
Further, the output of the pre-trained sentence-transformer is passed through two linear feed forward layers to get more compressed representations (∈ R^100) of the task and IG embeddings. A tanh activation function is used along with dropout with probability 0.25.
x'_t = tanh(W_t · x_t + b_t);  x'_g = tanh(W_g · x_g + b_g)
x'_{g_n1} = tanh(W_g · x_{g_n1} + b_g);  x'_{g_n2} = tanh(W_g · x_{g_n2} + b_g)
Here, W_t, W_g ∈ R^(100×384) and b_t, b_g ∈ R^100 are learnable weights of the linear transformation layers. Using these representations, the cosine similarity is computed between the task and each of the IGs in the instance.
f(t̂, g) = sim_pos = CosineSim(x'_t, x'_g)
sim_neg1 = CosineSim(x'_t, x'_{g_n1})
sim_neg2 = CosineSim(x'_t, x'_{g_n2})    (1)
Then the margin ranking loss is computed to update the weights such that the representations of semantically similar t̂ and g are pushed closer to each other and vice versa. Here, the margin value δ is used as 0.5 and the loss function is defined as follows.
loss_1 = max(0, -(sim_pos - sim_neg1) + δ)
loss_2 = max(0, -(sim_pos - sim_neg2) + δ)
loss = w_i (loss_1 + loss_2)
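The forward computation and weighted margin ranking loss above can be sketched in numpy. This is a minimal sketch under stated assumptions: the projection weights here are random placeholders (in the disclosure they are learnable layers trained by backpropagation, with dropout omitted for brevity), and only the forward pass and loss value are illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder weights for the two linear transformation layers (384 -> 100).
W_t, b_t = rng.normal(size=(100, 384)) * 0.01, np.zeros(100)
W_g, b_g = rng.normal(size=(100, 384)) * 0.01, np.zeros(100)

def project(x, W, b):
    # Linear layer followed by tanh activation.
    return np.tanh(W @ x + b)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def affinity_and_loss(x_task, x_pos, x_neg1, x_neg2, w_i, margin=0.5):
    t = project(x_task, W_t, b_t)
    sim_pos = cos(t, project(x_pos, W_g, b_g))    # f(t, g), Eq. (1)
    sim_n1 = cos(t, project(x_neg1, W_g, b_g))
    sim_n2 = cos(t, project(x_neg2, W_g, b_g))
    loss1 = max(0.0, -(sim_pos - sim_n1) + margin)
    loss2 = max(0.0, -(sim_pos - sim_n2) + margin)
    return sim_pos, w_i * (loss1 + loss2)         # instance-weighted loss

# Stand-ins for the 384-d sentence-transformer embeddings of t, g, g_n1, g_n2.
x = [rng.normal(size=384) for _ in range(4)]
score, loss = affinity_and_loss(*x, w_i=0.75)
```

The margin pushes sim_pos to exceed each negative similarity by at least 0.5; once it does, the corresponding hinge term is zero.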
Instance weighting. Three different types of instance weights are considered.
Balancing weight (w_b) is assigned to each instance to handle the imbalanced distribution of instances across IGs (e.g., 22958 tasks labelled with Software & Services compared to only 8994 with Banks). w_b is calculated as N/(24 × N_g), where N is the total number of training instances and N_g is the number of instances with IG g.
Confidence weight (w_c) captures the confidence of the predicted IG g as follows: (a) 1, if both the task-level and sentence-level IGs are the same as g, (b) 0.75, if the task-level IG is g but the sentence-level IG is different, and (c) 0.25, if the sentence-level IG is g and the task-level IG is ‘Others’.
In an embodiment, document-level IG predictions are additionally used in determining the confidence of the predicted IGs. Thus, the instance weighting with task-level, sentence-level, and document-level IGs is performed in accordance with the rules below:
Instance weighting:
Here, each instance is weighted differently as per the following strategy. A higher weight indicates higher confidence that the IG g is a true label for the task t. Here, g can be the task-level IG, the sentence-level IG, or the document-level IG for task t. Let g_t be the task-level IG of t, g_s be the sentence-level IG of t, and g_d be the document-level IG of t.
If g_t != “Others”:
    If g_t = g_s and g_t = g_d, then instance_weight = 1.0 and g = g_t // all three predictions are the same, so there is the highest confidence in the predicted IG and the highest instance weight is assigned
    Else if g_t = g_s and g_t != g_d, then instance_weight = 0.75 and g = g_t // agreement between sentence-level and task-level IGs but disagreement with the document-level IG
    Else if g_t = g_d and g_t != g_s, then instance_weight = 0.5 and g = g_t // agreement between document-level and task-level IGs but disagreement with the sentence-level IG
    Else instance_weight = 0.25 and g = g_t // disagreement among all three; the task-level IG is chosen but with a lower instance weight
Else if g_t = “Others”:
    If g_t = g_s and g_t = g_d, then instance_weight = 0.5 and g = g_t
    Else if g_s = g_d and g_s != “Others”, then instance_weight = 0.5 and g = g_s // task-level IG is “Others” but there is agreement between sentence-level and document-level IGs, which is chosen as the final IG
    Else if g_s = g_t and g_d != “Others”, then instance_weight = 0.25 and g = g_d // both task-level and sentence-level IGs are “Others”, so the document-level IG is chosen as the final IG
    Else if g_s != “Others”, then instance_weight = 0.25 and g = g_s
    Else if g_d != “Others”, then instance_weight = 0.25 and g = g_d
Agent-based weight (w_a) assigns a higher weight of 1 to tasks which have some organization as their agent in the sentence and lower weight of 0.5 to other tasks.
Finally, the overall weight of an instance is w_i = w_a × w_b × w_c, which is multiplied with the loss.
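The weighting rules above can be sketched directly in code. The rule logic follows the disclosed strategy; the function names and example counts are illustrative only.

```python
def confidence_weight(g_t, g_s, g_d):
    """Confidence weight w_c and final IG g from the three IG predictions."""
    if g_t != "Others":
        if g_t == g_s and g_t == g_d:
            return 1.0, g_t          # all three predictions agree
        if g_t == g_s:
            return 0.75, g_t         # document-level disagrees
        if g_t == g_d:
            return 0.5, g_t          # sentence-level disagrees
        return 0.25, g_t             # all three disagree
    # Task-level IG is "Others":
    if g_s == g_d == "Others":
        return 0.5, g_t
    if g_s == g_d and g_s != "Others":
        return 0.5, g_s              # sentence and document levels agree
    if g_s == "Others" and g_d != "Others":
        return 0.25, g_d             # fall back to document-level IG
    if g_s != "Others":
        return 0.25, g_s
    return 0.25, g_d

def instance_weight(g_t, g_s, g_d, n_total, n_g, has_org_agent, num_igs=24):
    """Overall instance weight w_i = w_a * w_b * w_c."""
    w_c, g = confidence_weight(g_t, g_s, g_d)
    w_b = n_total / (num_igs * n_g)           # balancing weight N/(24*N_g)
    w_a = 1.0 if has_org_agent else 0.5       # agent-based weight
    return w_a * w_b * w_c, g
```

For example, a task whose three IG predictions all agree, whose IG holds exactly its balanced share of instances, and which has an organization as agent receives the full weight of 1.0.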
Training the task-IG affinity model: The model is trained with the following settings: batch size of 64, Adam optimizer with a learning rate of 0.0001, and 5 epochs. During inference, for a task-IG pair (t̂, g), the affinity function f(t̂, g) is computed through the model as sim_pos (Eq. 1).
At step 212 of the method 200, the one or more hardware processors 104 cluster one or more tasks mapped to each of the plurality of IGs and retain one task per cluster as a representative task for the IG associated with the cluster. The tasks are clustered within each IG using the community detection algorithm provided by the sentence-transformers package known in the art. The task with the highest affinity score is selected from each cluster as its representative. Table 1 illustrates examples of tasks added for some IGs.
| Industry Group | Representative Tasks |
|---|---|
| Media & Entertainment | make money from advertising; understand news content |
| Telecommunication Services | provide wi-fi access for devices; cellular telephone operations |
| Semiconductors & Semiconductor Equipment | chip designs for autonomous systems; build semiconductor foundries |
| Pharmaceuticals | fund vaccine trials; develop therapeutic medicines |
Table 1: Task-IG pairs to be added to a commonsense KB
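The cluster-and-select logic of step 212 can be sketched as follows. The disclosure uses the community detection utility of the sentence-transformers package; here a simple greedy cosine-threshold clustering over pre-computed task embeddings stands in for it, and the function name and threshold are illustrative.

```python
import numpy as np

def cluster_and_pick(embeddings, affinities, threshold=0.8):
    """Greedy threshold clustering of task embeddings within one IG.

    Returns the index of one representative task per cluster: the member
    with the highest task-IG affinity score.
    """
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors
    unassigned = list(range(len(emb)))
    reps = []
    while unassigned:
        seed = unassigned[0]
        # All still-unassigned tasks similar enough to the seed form a cluster.
        members = [i for i in unassigned
                   if float(emb[seed] @ emb[i]) >= threshold]
        unassigned = [i for i in unassigned if i not in members]
        # Keep the cluster member with the highest affinity score.
        reps.append(max(members, key=lambda i: affinities[i]))
    return reps
```

For instance, with two near-duplicate task embeddings and one distinct one, the near-duplicates collapse into a single cluster and the higher-affinity duplicate is kept.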
At step 214 of the method 200, the one or more hardware processors 104 generate a plurality of triplets for each of the plurality of IGs using the representative task for the IG. At step 216 of the method 200, the one or more hardware processors 104 augment common sense knowledge to the commonsense knowledge base of task specific industry information by adding the plurality of triplets, as depicted in FIG. 1B.
Results:
Dataset: Two publicly available news datasets are used: TechCrunch, which consists of articles from the technical domain published on TechCrunch in 2020, and Reuters (Lewis, 1997), which is a collection of documents from the well-known financial newswire service dataset. Table 2 briefly provides the details of the datasets.
| Dataset | #News Items | #Sentences | #Tasks Extracted |
|---|---|---|---|
| TechCrunch | 19081 | 458268 | 450071 |
| Reuters | 10518 | 60034 | 43466 |
Table 2: Dataset Description
The experiments were carried out on an Intel(R)™ Xeon(R) Gold 6148@2.40GHz™ machine with 48 GB RAM. Each phase of the framework is evaluated extensively.
Evaluation of Task Extraction: Files from TechCrunch and Reuters were manually annotated with 890 gold standard task phrases. A precision of 0.82 and a recall of 0.72 is observed on this dataset.
Evaluation of Task-IG Classification: 50 randomly selected tasks labelled with each IG were manually verified for precision. An average precision of 0.76 was observed across all IGs, and for 14 IGs the precision is more than 0.8.
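The precision and recall figures above follow the standard definitions over predicted versus gold task phrases; a minimal sketch (the function name and sample phrase sets are hypothetical, not the annotated data):

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of predicted phrases against gold ones."""
    tp = len(predicted & gold)          # true positives: exact phrase matches
    return tp / len(predicted), tp / len(gold)

p, r = precision_recall(
    predicted={"approve loan", "open account", "drill well", "fund trials",
               "ship goods"},
    gold={"approve loan", "open account", "drill well", "plan routes"})
```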
Evaluation of Task-IG Affinity: The task-IG affinity model is compared with the following two baselines. The input to the baselines is the extracted canonical tasks and the corresponding IG labels.
Association Rule Mining (ARM) (Agrawal et al., 1993) method: After stopword removal and lemmatization, for each task-IG pair, an itemset is formed with the words in the task and its IG as tokens. If a token appears multiple times within a single task mention, only one occurrence is retained in the corresponding itemset. Next, the apriori rule mining technique is applied to derive association rules. Only those rules are retained that have one or more IGs on the right-hand side. Each task-IG pair is assigned a score proportional to the number of matching association rules and their confidence values, and the top 100 tasks for each IG are selected.
TF-IDF method: For each IG g, a pseudo-document D_g is created by concatenating all the tasks labeled with g. For each word w, its term frequency TF_g(w) is computed as the number of times w appears in D_g. Its document frequency DF(w) is computed as |{D_g | w ∈ D_g}|. The affinity between a word w and IG g is computed as the TF-IDF score = TF_g(w) × log(24/DF(w)). Next, for each task, the affinity score with an IG is computed as the mean value of the TF-IDF scores of its constituent words. The affinity scores are sorted to get the top 100 tasks for each IG. This baseline also tries to satisfy the desired conditions: high support (through TF) and high specificity (through IDF).
No separate evaluation dataset is created for evaluating the task-IG affinity scores, because it is a corpus-level problem. To evaluate the task-IG affinity scores produced by the method disclosed herein, as well as by the above baselines, the metric P@100 is computed. It is computed by manually verifying the precision within the top 100 tasks per IG with the highest affinity score for that IG. In other words, approx. 2400 tasks are verified manually (100 tasks per IG, 24 IGs) for each of the 3 techniques shown in Table 3.
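The TF-IDF baseline can be sketched as follows. This is an illustrative sketch of the baseline just described, assuming whitespace tokenization after preprocessing; the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def tfidf_affinity(tasks_by_ig, task, ig, num_igs=24):
    """Mean TF-IDF score of a task's words w.r.t. an IG's pseudo-document.

    Each IG's pseudo-document D_g is the concatenation of its labelled
    tasks; TF_g(w) counts w in D_g and DF(w) counts the pseudo-documents
    containing w, giving TF_g(w) * log(24 / DF(w)) per word.
    """
    tf = {g: Counter(w for t in ts for w in t.split())
          for g, ts in tasks_by_ig.items()}

    def score(w):
        df = sum(1 for g in tf if tf[g][w] > 0)
        if df == 0:
            return 0.0
        return tf[ig][w] * math.log(num_igs / df)

    words = task.split()
    return sum(score(w) for w in words) / len(words)

tasks = {"Banks": ["approve loan", "open account"], "Energy": ["drill well"]}
banks_score = tfidf_affinity(tasks, "approve loan", "Banks")
energy_score = tfidf_affinity(tasks, "approve loan", "Energy")
```

A task specific to one IG scores high there (its words are frequent in that pseudo-document and rare elsewhere) and zero for an IG whose pseudo-document never mentions its words.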
| IG / Method | ARM | TF-IDF | Affinity Model |
|---|---|---|---|
| Software & Services | 0.74 | 0.71 | 0.93 |
| Consumer Services | 0.60 | 0.73 | 0.87 |
| Transportation | 0.57 | 0.80 | 0.94 |
| Retailing | 0.88 | 0.88 | 0.92 |
| Telecommunication Services | 0.36 | 0.72 | 0.89 |
| Average for 24 IGs | 0.66 | 0.75 | 0.86 |
Table 3: Comparative P@100 for all techniques
The task-IG affinity model outperforms both the baselines, and the differences are statistically significant as per two-sample t-test as depicted in Table 4.
| Dataset | Tasks | Industry Group |
|---|---|---|
| Reuters | Pacific Gas began [construction of the two nuclear power units] in 1969. | Energy |
| TechCrunch | Total Petroleum NA (TPN) [shut down several small crude oil pipelines] operating near the Texas/Oklahoma border last Friday as a precaution against damage from local flooding, according to Gary Zollinger, manager of operations. | Energy |
| Reuters | Crews hoped to [restore traffic to the line later today after clearing the damaged train] and repairing the tracks at Chacapalca, 225 km east of the capital, Lima. | Transportation |
| TechCrunch | We launched new tools for airlines so they can better [predict consumer demand] and [plan their routes]. | Transportation |
| Reuters | Wolverine said it will [concentrate its effort in the athletic footwear market in its Brooks footwear division]. | Consumer Durables and Apparel |
| TechCrunch | Intel, TSMC and Samsung Electronics are able to [make chips of 10-nanometers or lower], the fastest and most power-efficient chips currently on the market. | Consumer Durables and Apparel |
| Reuters | In other sectors, the Comet electrical chain [raised retail profits] by 46 pct to 17.4 mln stg, while the Woolworth chain reported a 120 pct improvement to 38.7 mln. | Retailing |
| TechCrunch | In those two months alone, Shopify seems to have [onboarded more merchants] than in the whole of 2018. | Retailing |
| Reuters | In November 1986, Novo purchased 75 pct of shares in A/S Ferrosan, which heads a group specializing in [research and development of CNS (central nervous system) treatments and the sale of pharmaceuticals and vitamins in Scandinavia]. | Pharma |
| TechCrunch | Drugmaker Moderna has [completed its initial efficacy analysis of its COVID-19 vaccine] from the drug’s Phase 3 clinical study. | Pharma |
| Reuters | Laroche said he may [obtain a short-term loan of up to one mln dlrs from Amoskeag Bank to help finance the purchase of shares under the offer], bearing interest of up to nine pct... | Banks |
| TechCrunch | Mobile banking startup Varo is becoming a real bank - the company announced that it has been [granted a national bank charter from the Office of the Comptroller of the Currency]. | Banks |
| Reuters | The executives would also [get cash settlements of options plans and continuation of insurance and other benefits]. | Insurance |
| TechCrunch | Square said it had also received approval from the FDIC to [conduct deposit insurance]. | Insurance |
| Reuters | Hogan said Systems [provides integrated applications software and processing services] to about 30 community banks. | Software and Services |
| TechCrunch | Unlike its rivals, Zoox is [developing the self-driving software stack, the on-demand ride-sharing app and the vehicle itself]. | Software and Services |
| Reuters | CIP, Canada’s second largest newsprint producer, recently [launched a 366 mln Canadian dlr newsprint mill at Gold River, British Columbia] which is due to begin producing 230,000 metric tonnes per year by fall of 1989. | Media and Entertainment |
| TechCrunch | The company has also worked to [create virtual creatures and characters in movies like "The Lord of the Rings"]. | Media and Entertainment |
Table 4: Examples of tasks mentioned in various datasets. The task mentions are enclosed in square brackets within the sentences. Each task belongs to an Industry Group.
Thus, the method disclosed herein enables the augmentation of task-IG information in a commonsense knowledge base. The method provides a framework to automatically extract task-IG information from natural language text that does not require any manually annotated training instances. The method output enables recommending 2339 triples to be added to ConceptNet, whereas only 138 such triples exist, indicating the addition of considerable new knowledge.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims:
We Claim:
1. A processor implemented method (200), the method comprising:
receiving (202), via one or more hardware processors, a plurality of text documents from a corpus;
extracting (204), via the one or more hardware processors, a plurality of tasks by processing a plurality of sentences in each of the plurality of documents using a task extraction technique;
labelling (206), via the one or more hardware processors, each of the extracted plurality of tasks with an Industry Group (IG) from among the plurality of IGs using an unsupervised Machine Learning (ML) technique;
converting (208), via the one or more hardware processors, each of the plurality of tasks into a canonical form using a plurality of linguistic rules;
processing (210), via the one or more hardware processors, each of the plurality of tasks, corresponding labeled IG and associated canonical form of each of the plurality of tasks by a task-IG affinity model, wherein the task-IG affinity model generates an affinity score for each of the plurality of tasks with respect to each of the plurality of IGs based on an affinity function, wherein the affinity score is above an affinity threshold if the task has a high support and a high specificity, and wherein the high support indicates a task is observed with similar IG in multiple sentences in the corpus, and the high specificity indicates a task is associated with a specific IG;
clustering (212), via the one or more hardware processors, one or more tasks mapped to each of the plurality of IGs and retaining one task per cluster as a representative task for the IG associated with the cluster;
generating (214), via the one or more hardware processors, a plurality of triplets for each of the plurality of IGs using the representative task for the IG; and
augmenting (216), via the one or more hardware processors, common sense knowledge to a commonsense knowledge base of task specific industry information by adding the plurality of triplets.
2. The method as claimed in claim 1, wherein the task affinity model comprises two linear transformation layers for tasks and IGs, which transform a sentence transformer representations to obtain new representations.
3. The method as claimed in claim 2, wherein the task-IG affinity model is trained using an instance weighting approach, and wherein each task instance is weighted differently by considering multiple aspects comprising agreement between task-level, sentence-level, and document-level IG predictions.
4. The method as claimed in claim 1, wherein the unsupervised ML technique comprises:
labelling each of the extracted plurality of tasks with an IG among the plurality of IGs to generate a plurality of task labels, wherein the labelling comprises combining, using an ensemble approach, the IG generated for each of the plurality of tasks by (i) a keyword-lookup technique, (ii) a cosine-sim technique, and (iii) a zero-shot-text classification technique in accordance with a predefined criteria, wherein one or more tasks not mapping to any of the plurality of IGs are labeled as ‘Others’.
5. The method as claimed in claim 4, wherein the unsupervised ML technique is used to:
(i) label each of the plurality of sentences with the IG by combining, using the ensemble approach, the IG generated for each of the plurality of sentences by (i) the keyword-lookup technique, (ii) the cosine-sim technique and (iii) the zero-shot-text classification technique in accordance with the predefined criteria; and
(ii) label each of the plurality of documents with the IG based on the IG each document is identified with.
6. A system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a plurality of text documents from a corpus;
extract a plurality of tasks by processing a plurality of sentences in each of the plurality of documents using a task extraction technique;
label each of the extracted plurality of tasks with an Industry Group (IG) from among the plurality of IGs using an unsupervised Machine Learning (ML) technique;
convert each of the plurality of tasks into a canonical form using a plurality of linguistic rules;
process each of the plurality of tasks, corresponding labeled IG and associated canonical form of each of the plurality of tasks by a task-IG affinity model, wherein the task-IG affinity model generates an affinity score for each of the plurality of tasks with respect to each of the plurality of IGs based on an affinity function, wherein the affinity score is above an affinity threshold if the task has a high support and a high specificity, and wherein the high support indicates a task is observed with similar IG in multiple sentences in the corpus, and the high specificity indicates a task is associated with a specific IG;
cluster one or more tasks mapped to each of the plurality of IGs and retaining one task per cluster as a representative task for the IG associated with the cluster;
generate a plurality of triplets for each of the plurality of IGs using the representative task for the IG; and
augment common sense knowledge to a commonsense knowledge base of task specific industry information by adding the plurality of triplets.
7. The system as claimed in claim 6, wherein the task affinity model comprises two linear transformation layers for tasks and IGs, which transform a sentence transformer representations to obtain new representations.
8. The system as claimed in claim 7, wherein the task-IG affinity model is trained using an instance weighting approach, and wherein each task instance is weighted differently by considering multiple aspects comprising agreement between task-level, sentence-level, and document-level IG predictions.
9. The system as claimed in claim 6, wherein the unsupervised ML technique comprises:
labelling each of the extracted plurality of tasks with an IG among the plurality of IGs to generate a plurality of task labels, wherein the labelling comprises combining, using an ensemble approach, the IG generated for each of the plurality of tasks by (i) a keyword-lookup technique, (ii) a cosine-sim technique, and (iii) a zero-shot-text classification technique in accordance with a predefined criteria, wherein one or more tasks not mapping to any of the plurality of IGs are labeled as ‘Others’.
10. The system as claimed in claim 9, wherein the unsupervised ML technique is used to:
(i) label each of the plurality of sentences with the IG by combining, using the ensemble approach, the IG generated for each of the plurality of sentences by (i) the keyword-lookup technique, (ii) the cosine-sim technique and (iii) the zero-shot-text classification technique in accordance with the predefined criteria; and
(ii) label each of the plurality of documents with the IG based on the IG each document is identified with.
Dated this 2nd Day of December 2022
Tata Consultancy Services Limited
By their Agent & Attorney
(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086
| # | Name | Date |
|---|---|---|
| 1 | 202221069727-STATEMENT OF UNDERTAKING (FORM 3) [02-12-2022(online)].pdf | 2022-12-02 |
| 2 | 202221069727-REQUEST FOR EXAMINATION (FORM-18) [02-12-2022(online)].pdf | 2022-12-02 |
| 3 | 202221069727-FORM 18 [02-12-2022(online)].pdf | 2022-12-02 |
| 4 | 202221069727-FORM 1 [02-12-2022(online)].pdf | 2022-12-02 |
| 5 | 202221069727-FIGURE OF ABSTRACT [02-12-2022(online)].pdf | 2022-12-02 |
| 6 | 202221069727-DRAWINGS [02-12-2022(online)].pdf | 2022-12-02 |
| 7 | 202221069727-DECLARATION OF INVENTORSHIP (FORM 5) [02-12-2022(online)].pdf | 2022-12-02 |
| 8 | 202221069727-COMPLETE SPECIFICATION [02-12-2022(online)].pdf | 2022-12-02 |
| 9 | Abstract1.jpg | 2023-01-24 |
| 10 | 202221069727-FORM-26 [15-02-2023(online)].pdf | 2023-02-15 |