Abstract: This disclosure relates to extraction of tasks from documents based on a weakly supervised classification technique, wherein extraction of tasks is identification of mentions of tasks in a document. There are several prior arts addressing the problem of extraction of events; however, due to crucial distinctions between events and tasks, task extraction stands as a separate problem. The disclosure explicitly defines specific characteristics of tasks and creates labelled data at a word level based on a plurality of linguistic rules to train a word-level weakly supervised model for task extraction. The labelled data is created based on the plurality of linguistic rules for a non-negation aspect, a volitionality aspect, an expertise aspect and a plurality of generic aspects. Further, the disclosure includes a phrase expansion technique to capture the complete meaning expressed by the task, instead of extracting a shorter task mention that may not capture the entire meaning of the sentence. [To be published with FIG. 2]
Claims:
We Claim:
1. A processor-implemented method (300) for training a word-level data model for extraction of tasks from documents using weakly supervision comprising:
receiving a plurality of documents from a plurality of sources, via one or more hardware processors (302);
pre-processing the plurality of documents using a plurality of pre-processing techniques, via the one or more hardware processors, to obtain a plurality of pre-processed documents comprising a plurality of sentences, a plurality of words within the plurality of sentences, a plurality of Part-of-speech (POS) tags, a plurality of dependency trees and a plurality of wordnet based features (304);
labelling the plurality of words from the plurality of pre-processed documents as one of a task headword and a no-task headword, via the one or more hardware processors, wherein the plurality of words is labelled based on the plurality of sentences using a plurality of linguistic rules (306); and
training a word-level weakly supervised classification model for extraction of tasks, via the one or more hardware processors, using the task headword and the no-task headword labelled using the plurality of linguistic rules, wherein the word-level weakly supervised classification model is the word-level data model for extraction of tasks from documents (308).
2. The method of claim 1, wherein the word-level weakly supervised model is utilized for the extraction of tasks from a plurality of user documents based on a word-level weakly supervision task extraction technique (400), wherein the weakly supervision task extraction technique comprises:
receiving the plurality of user documents for the extraction of tasks (402);
pre-processing the plurality of user documents using the plurality of pre-processing techniques, to obtain a plurality of pre-processed user documents comprising a plurality of user sentences and a plurality of user words within the plurality of user sentences (404);
labelling the plurality of user words from the plurality of pre-processed user documents as one of a task headword and a no-task headword using the word-level weakly supervised classification model (406); and
expanding the labelled user task headword to obtain a task phrase based on a phrase expansion technique and the plurality of dependency trees, wherein the task phrase represents the extracted task (408).
3. The method of claim 1, wherein the task extraction comprises identification of mentions of a task in a document, wherein a task is a well-defined knowledge-based action carried out volitionally with expertise for a specific goal within a pre-defined time by a single person, a group of persons, a device, or a system.
4. The method of claim 1, wherein the plurality of pre-processing techniques comprises (a) a sentence-splitting technique for identification of the plurality of sentences, (b) a tokenization technique for identification of the plurality of words within the plurality of sentences, (c) a Part of speech (POS) tagging technique for identification of the plurality of Part-of-speech (POS) tags for each word in the plurality of words, (d) a dependency parsing technique for identification of the plurality of dependency trees for the plurality of sentences, and (e) a WordNet based features identification technique for identification of the plurality of wordnet based features for the plurality of sentences.
5. The method of claim 1, wherein the plurality of words is labelled based on the plurality of linguistic rules for a non-negation aspect, a volitionality aspect, an expertise aspect and a plurality of generic aspects.
6. The method of claim 5, wherein the plurality of linguistic rules for labelling the non-negation aspect labels the plurality of words based on a negation modifier identified using the dependency tree, the plurality of linguistic rules for labelling the volitionality aspect labels the plurality of words based on identification of actions carried out volitionally, and the plurality of linguistic rules for labelling the expertise aspect labels the plurality of words based on a domain expertise or knowledge required for execution of the task.
7. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a plurality of documents from a plurality of sources, via one or more hardware processors;
pre-process the plurality of documents using a plurality of pre-processing techniques, via the one or more hardware processors, to obtain a plurality of pre-processed documents comprising a plurality of sentences, a plurality of words within the plurality of sentences, a plurality of Part-of-speech (POS) tags, a plurality of dependency trees and a plurality of wordnet based features;
label the plurality of words from the plurality of pre-processed documents as one of a task headword and a no-task headword, via the one or more hardware processors, wherein the plurality of words is labelled based on the plurality of sentences using a plurality of linguistic rules; and
train a word-level weakly supervised classification model for extraction of tasks, via the one or more hardware processors, using the task headword and the no-task headword labelled using the plurality of linguistic rules, wherein the word-level weakly supervised classification model is the word-level data model for extraction of tasks from documents.
8. The system of claim 7, wherein the one or more hardware processors are configured by the instructions to perform the extraction of tasks from a plurality of user documents using the word-level weakly supervised model based on a word-level weakly supervision task extraction technique, wherein the weakly supervision task extraction technique comprises:
receiving the plurality of user documents for the extraction of tasks;
pre-processing the plurality of user documents using the plurality of pre-processing techniques, to obtain a plurality of pre-processed user documents comprising a plurality of user sentences and a plurality of user words within the plurality of user sentences;
labelling the plurality of user words from the plurality of pre-processed user documents as one of a task headword and a no-task headword using the word-level weakly supervised classification model; and
expanding the labelled user task headword to obtain a task phrase based on a phrase expansion technique and the plurality of dependency trees, wherein the task phrase represents the extracted task.
9. The system of claim 7, wherein the one or more hardware processors are configured by the instructions to perform the task extraction comprising identification of mentions of a task in a document, wherein a task is a well-defined knowledge-based action carried out volitionally with expertise for a specific goal within a pre-defined time by a single person, a group of persons, a device, or a system.
10. The system of claim 7, wherein the one or more hardware processors are configured by the instructions to perform the plurality of pre-processing techniques comprising (a) a sentence-splitting technique for identification of the plurality of sentences, (b) a tokenization technique for identification of the plurality of words within the plurality of sentences, (c) a Part of speech (POS) tagging technique for identification of the plurality of Part-of-speech (POS) tags for each word in the plurality of words, (d) a dependency parsing technique for identification of the plurality of dependency trees for the plurality of sentences, and (e) a WordNet based features identification technique for identification of the plurality of wordnet based features for the plurality of sentences.
11. The system of claim 7, wherein the one or more hardware processors are configured by the instructions to perform labeling of the plurality of words based on the plurality of linguistic rules for a non-negation aspect, a volitionality aspect, an expertise aspect and a plurality of generic aspects.
12. The system of claim 7, wherein the one or more hardware processors are configured by the instructions to perform the labelling of the plurality of words based on the plurality of linguistic rules, which includes (a) the non-negation aspect labelling the plurality of words based on a negation modifier identified using the dependency tree, (b) the volitionality aspect labelling the plurality of words based on identification of actions carried out volitionally, and (c) the expertise aspect labelling the plurality of words based on a domain expertise or knowledge required for execution of the task.
Dated this 15th day of December 2021
Tata Consultancy Services Limited
By their Agent & Attorney
(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
EXTRACTION OF TASKS FROM DOCUMENTS USING WEAKLY SUPERVISION
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to extraction of tasks from documents, and, more particularly, to a method and a system for extraction of tasks from documents using weakly supervision.
BACKGROUND
With the advancement of digital technology, the available digital content has increased exponentially. Considering this increase, several techniques of Natural Language Processing (NLP) are gaining importance due to NLP's ability to detect, analyze and process massive volumes of text data across the digital world. There are several interesting analyses of digital content that can be carried out using NLP, among which extraction of tasks from large corpora is yet to be explored.
Extraction of tasks has several useful applications. For example, tasks extracted from resumes capture the fine-grained experience of a candidate and would be useful for automatically shortlisting candidates for a certain job requirement. Another interesting application of the extraction of tasks and their corresponding roles is to automatically augment common sense knowledge.
The problem of extraction of tasks is to automatically identify mentions of tasks in a document. Syntactically, a task can be mentioned as a verb phrase or as a noun phrase in a sentence. However, certain aspects of the task are also to be considered while performing the extraction of tasks, such as that tasks usually demand some skill and expertise, and that tasks are carried out volitionally.
Further, the problem of extraction of tasks is not explicitly addressed in the prior art, while there are several prior arts addressing the problem of extraction of events. Although events are similar to tasks in some respects, there are certain crucial distinctions between events and tasks, such as that a task requires expertise to be performed while an event can happen without any expertise, and hence it is important to define and address task extraction as a separate problem. Since extraction of tasks has several applications, there is a requirement for Artificial Intelligence (AI) based techniques to address it, as the extraction of tasks is still largely unexplored.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a system for training a word-level data model for extraction of tasks from documents using weakly supervision is provided. The system includes a memory storing instructions, one or more communication interfaces, and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to receive a plurality of documents from a plurality of sources, via one or more hardware processors. The system is further configured to pre-process the plurality of documents using a plurality of pre-processing techniques, via the one or more hardware processors, to obtain a plurality of pre-processed documents comprising a plurality of sentences, a plurality of words within the plurality of sentences, a plurality of Part-of-speech (POS) tags, a plurality of dependency trees and a plurality of wordnet based features. The system is further configured to label the plurality of words from the plurality of pre-processed documents as one of a task headword and a no-task headword, via the one or more hardware processors, wherein the plurality of words is labelled based on the plurality of sentences using a plurality of linguistic rules. The system is further configured to train a word-level weakly supervised classification model for extraction of tasks, via the one or more hardware processors, using the task headword and the no-task headword labelled using the plurality of linguistic rules.
In another aspect, a method for training a word-level data model for extraction of tasks from documents using weakly supervision is provided. The method includes receiving a plurality of documents from a plurality of sources. The method further includes pre-processing the plurality of documents using a plurality of pre-processing techniques to obtain a plurality of pre-processed documents comprising a plurality of sentences, a plurality of words within the plurality of sentences, a plurality of Part-of-speech (POS) tags, a plurality of dependency trees and a plurality of wordnet based features. The method further includes labelling the plurality of words from the plurality of pre-processed documents as one of a task headword and a no-task headword, via the one or more hardware processors, wherein the plurality of words is labelled based on the plurality of sentences using a plurality of linguistic rules. The method further includes training a word-level weakly supervised classification model for extraction of tasks, via the one or more hardware processors, using the task headword and the no-task headword labelled using the plurality of linguistic rules.
In yet another aspect, a non-transitory computer readable medium for training a word-level data model for extraction of tasks from documents using weakly supervision is provided. The program includes receiving a plurality of documents from a plurality of sources. The program further includes pre-processing the plurality of documents using a plurality of pre-processing techniques to obtain a plurality of pre-processed documents comprising a plurality of sentences, a plurality of words within the plurality of sentences, a plurality of Part-of-speech (POS) tags, a plurality of dependency trees and a plurality of wordnet based features. The program further includes labelling the plurality of words from the plurality of pre-processed documents as one of a task headword and a no-task headword, via the one or more hardware processors, wherein the plurality of words is labelled based on the plurality of sentences using a plurality of linguistic rules. The program further includes training a word-level weakly supervised classification model for extraction of tasks, via the one or more hardware processors, using the task headword and the no-task headword labelled using the plurality of linguistic rules.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG.1 illustrates an exemplary system for extraction of tasks from documents using weakly supervision according to some embodiments of the present disclosure.
FIG.2 is a functional block diagram of a system for extraction of tasks from documents using weakly supervision according to some embodiments of the present disclosure.
FIG.3 is a flow diagram illustrating a method (300) for training a word-level data model for extraction of tasks from documents using weakly supervision in accordance with some embodiments of the present disclosure.
FIG.4 is a flow diagram illustrating a method (400) for extraction of tasks using a word-level weakly supervision task extraction technique in accordance with some embodiments of the present disclosure.
FIG.5 illustrates an example task head word analysis from a sentence for extraction of tasks from documents using weakly supervision, by the system of FIG. 1, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Task extraction is a process of identification of mentions of a "task" in a document. Syntactically, the task can be mentioned as a verb phrase (example: implemented a model for weather prediction) or as a noun phrase (example: model implementation for weather prediction) in a sentence. Further, during task extraction, the extent of a task mention should be such that the complete meaning expressed by the task is captured. In an example scenario, from the sentence "The researcher implemented a model for weather prediction", the complete phrase "implemented a model for weather prediction" should be extracted; the shorter phrase "implemented a model" is a valid task mention but does not capture the entire meaning.
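By way of editorial illustration only, the sketch below shows how such dependency-based expansion of a task headword into a full task phrase could look. spaCy as the parser and the specific arcs followed (objects and prepositional attachments) are assumptions; the disclosure's actual phrase expansion technique may use different rules.

```python
# Minimal sketch: expanding a task headword into a task phrase using the
# dependency tree. spaCy and the arc labels followed here are illustrative
# assumptions, not the disclosure's exact phrase expansion rules.
import spacy

nlp = spacy.load("en_core_web_sm")

def expand_task_phrase(head):
    """Collect the headword plus its object/prepositional subtrees."""
    tokens = [head]
    for child in head.children:
        # Follow objects and prepositional attachments so that
        # "implemented a model for weather prediction" is captured whole.
        if child.dep_ in ("dobj", "obj", "prep", "pobj", "dative"):
            tokens.extend(child.subtree)
    tokens = sorted(tokens, key=lambda t: t.i)  # restore sentence order
    return " ".join(t.text for t in tokens)

doc = nlp("The researcher implemented a model for weather prediction.")
head = next(t for t in doc if t.lemma_ == "implement")
print(expand_task_phrase(head))
# -> "implemented a model for weather prediction"
```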
The state-of-the-art techniques do not explicitly perform task extraction; however, there are several prior arts performing "event extraction". Considering an existing state-of-the-art technique (event extraction by Xiang and Wang, 2019), an event is defined as a specific occurrence of something happening at a certain time and place which involves one or more participants and can often be described as a change of state. Although events are similar to tasks in some respects, there are certain crucial distinctions that are to be considered to differentiate a task from an event, and hence it is important to define and address task extraction as a separate problem. The similarities and distinctions between tasks and events span several aspects, including a non-negation aspect, a tense aspect, a genericity aspect, and a modality aspect. These aspects, along with a note on the similarities and distinctions between tasks and events, are listed below:
Non-negation aspect:
Events should be explicitly mentioned as having occurred, and typically there is no direct negation used to describe an event. Tasks are similar to events in this aspect. Considering an example scenario, in the sentence "He did not implement the solution", neither an event nor a task is mentioned.
Tense aspect:
Events must be in the past or present tense. Tasks, however, describe a general category, as against specific events which describe a specific occurrence such as "L&T engineers built this bridge". In event extraction, only specific events are considered, whereas tasks can be generic.
Modality aspect:
Only realistic events which have actually occurred are considered as events. All other modalities, such as belief, hypothesis and desire, are not considered events, but these can be considered as tasks. In the sentence "Engineers are supposed to build bridges which last for years", a task "build bridges" is mentioned, but it is not an event.
A task is a well-defined knowledge-based action carried out volitionally with expertise for a specific goal within a pre-defined time by a single person, a group of persons, a device, or a system. However, based on the definition of a task and on the above listed aspects, the volitionality and the expertise aspects are not explicitly covered.
Further, the inventors have previously attempted to perform task extraction based on linguistic patterns. One of the advantages of linguistic patterns for task extraction is that there is no need for any training data. The linguistic patterns for task extraction identify an action noun and an action verb in the sentences. However, linguistic pattern extraction has several disadvantages: the presence of action verbs or nouns is merely a necessary condition, not a sufficient condition, for extraction of tasks. Further, the two important aspects of tasks, volitionality and the need for expertise, are not checked explicitly. Moreover, there is the challenge of polysemy, which is not handled explicitly: a verb (or noun) may be an action verb (or an action noun) in one particular sense but not in another sense. For example, "synthetic data generation" is a valid task but "next generation of chip technology" is not a valid task, because of the different senses of the noun "generation". Hence, based on the discussed existing state of the art, which is also previous work of the inventors, the volitionality and the expertise aspects, along with the challenge of polysemy, should be addressed for efficient task extraction. Further, during task extraction, the extent of a task mention should be such that the complete meaning expressed by the task is captured considering the sentence, and not just the bare extracted "task". Hence the disclosure is an improvement over the inventors' previous research on task extraction using linguistic patterns, as the disclosure addresses all the challenges faced by the linguistic patterns technique.
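As an editorial illustration of the polysemy challenge, the sketch below lists the WordNet senses of the noun "generation" and tests whether any sense is an action sense. NLTK's WordNet interface and the noun.act test are assumptions; the disclosure does not specify how its WordNet based features are computed.

```python
# Illustrative sketch of the polysemy challenge: the noun "generation" has
# an action sense (WordNet lexicographer file noun.act) as well as
# non-action senses (e.g., noun.group, noun.time). The NLTK interface and
# the noun.act test are assumptions, not the disclosure's exact method.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

for synset in wn.synsets("generation", pos=wn.NOUN):
    print(synset.name(), synset.lexname(), "-", synset.definition())

# A naive test: treat the word as a candidate action noun only if at least
# one of its WordNet senses is an action sense. Real disambiguation would
# also need the word's context, which is what a trained classifier provides.
is_action_noun = any(s.lexname() == "noun.act"
                     for s in wn.synsets("generation", pos=wn.NOUN))
print("candidate action noun:", is_action_noun)
```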
Referring now to the drawings, and more particularly to FIG. 1 through FIG.5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG.1 is an exemplary block diagram of a system 100 for extraction of tasks from documents using weakly supervision in accordance with some embodiments of the present disclosure. The task extraction is identification of mentions of a task in a document, wherein a task is a well-defined knowledge-based action carried out volitionally with expertise for a specific goal within a pre-defined time by a single person, a group of persons, a device, or a system.
In an embodiment, the system 100 includes a processor(s) 104, communication interface device(s), alternatively referred to as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware processors is configured to execute functions of one or more functional blocks of the system 100.
Referring to the components of the system 100, in an embodiment, the processor(s) 104 can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, a touch user interface (TUI) and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface(s) 106 can include one or more ports for connecting a number of devices (nodes) of the system 100 to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
Further, the memory 102 may include a database 108 configured to include information regarding extraction of tasks, labelling of words, and pre-processing of received documents. The memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. In an embodiment, the database 108 may be external (not shown) to the system 100 and coupled to the system via the I/O interface 106.
Functions of the components of the system 100 are explained in conjunction with the functional overview of the system 100 in FIG.2 and the flow diagram of FIG.3 for extraction of tasks from documents using weakly supervision.
The system 100 supports various connectivity options such as BLUETOOTH®, USB, ZigBee and other cellular services. The network environment enables connection of various components of the system 100 using any communication link including Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 100 is implemented to operate as a stand-alone device. In another embodiment, the system 100 may be implemented to work as a loosely coupled device to a smart computing environment. The components and functionalities of the system 100 are described further in detail.
FIG.2 is an example functional block diagram of the various modules of the system of FIG.1, in accordance with some embodiments of the present disclosure. As depicted in the architecture, FIG.2 illustrates the functions of the modules of the system 100 that include extraction of tasks from documents using weakly supervision.
The system 200 for extraction of tasks from documents using weakly supervision functions in two phases, a training phase and a testing phase, based on a user requirement. The training phase comprises training a word-level weakly supervised classification model for extraction of tasks. The testing phase comprises using the word-level weakly supervised classification model for extraction of tasks. Hence, in the training phase the word-level weakly supervised classification model is generated and trained, and in the testing phase the generated word-level weakly supervised classification model is used for performing the task extraction on user documents. The extraction of tasks from documents is based on weak supervision, wherein the word-level weakly supervised classification model is trained by creating a labelled training dataset based on the plurality of sentences using a plurality of linguistic rules.
During the training phase, the system 200 for extraction of tasks from documents using weakly supervision is configured to receive a plurality of documents from a plurality of sources, via the one or more hardware processors 104. During the testing phase, the system 200 is configured to receive the plurality of user documents for the extraction of tasks, via the one or more hardware processors 104.
The system 200 further comprises a pre-processor 202 configured for pre-processing the plurality of documents using a plurality of pre-processing techniques, to obtain a plurality of pre-processed documents. The plurality of pre-processed documents comprises a plurality of sentences, a plurality of words within the plurality of sentences, a plurality of Part-of-speech (POS) tags, a plurality of dependency trees and a plurality of wordnet based features. During the testing phase, the pre-processor 202 is configured to pre-process the plurality of user documents using the plurality of pre-processing techniques to obtain a plurality of pre-processed user documents. The plurality of pre-processed user documents comprises a plurality of user sentences and a plurality of user words within the plurality of user sentences.
The system 200 further comprises a linguistic labelling module 204 configured for labelling the plurality of words from the plurality of pre-processed documents. Each of the plurality of words is labelled as one of a task headword and a no-task headword, wherein the plurality of words is labelled based on the plurality of sentences using a plurality of linguistic rules. The labelled plurality of words is used as training data.
The system 200 further comprises a word-level weakly supervised classification model 206, which is trained for extraction of tasks during the training phase. The word-level weakly supervised classification model is trained using the task headword and the no-task headword labels obtained using the plurality of linguistic rules. During the testing phase, the word-level weakly supervised classification model 206 is configured for labelling the plurality of user words from the plurality of pre-processed user documents, wherein each of the plurality of user words is labelled as one of a task headword and a no-task headword.
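Purely as a sketch, such a word-level classifier could be realized as a token classification head over BERT, which the experimental section later refers to as the weakly supervised BERT-based task extractor. The base model, label encoding and word-level aggregation below are assumptions, and the fine-tuning loop on the weak labels is omitted.

```python
# Minimal sketch of a word-level task-headword classifier as BERT token
# classification (Hugging Face transformers). The base model and the label
# encoding (0 = no-task headword, 1 = task headword) are assumptions; the
# fine-tuning loop on the weakly labelled data is omitted, so an untrained
# head produces arbitrary predictions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

words = "The researcher implemented a model for weather prediction .".split()
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits          # shape: (1, num_subwords, 2)
pred = logits.argmax(-1)[0].tolist()

# Map subword predictions back to words: the first subword of each word wins.
seen = set()
for sub_idx, w_idx in enumerate(enc.word_ids(0)):
    if w_idx is not None and w_idx not in seen:
        seen.add(w_idx)
        print(words[w_idx], "->", "task headword" if pred[sub_idx] else "no-task")
```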
The system 200 further comprises a task phrase module 208 configured for expanding the labelled user task headword to obtain a task phrase. The task phrase is obtained based on a phrase expansion technique and the plurality of dependency trees, wherein the task phrase represents the extracted task.
The various modules of the system 100 and the functional blocks in FIG.2 configured for extraction of tasks from documents using weakly supervision are implemented as at least one of a logically self-contained part of a software program, a self-contained hardware component, and/or a self-contained hardware component with a logically self-contained part of a software program embedded into each of the hardware components, which when executed perform the above described method.
Functions of the components of the system 200 are explained in conjunction with the functional modules of the system 100 stored in the memory 102 and further explained in conjunction with the flow diagram of FIG.3. FIG.3, with reference to FIG.1, is an exemplary flow diagram illustrating a method 300 for extraction of tasks from documents using weakly supervision using the system 100 of FIG.1, according to an embodiment of the present disclosure.
The steps of the method of the present disclosure will now be explained with reference to the components of the system (100) for extraction of tasks from documents using weakly supervision and the modules (202-208) as depicted in FIG.2, and the flow diagram as depicted in FIG.3. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
At step 302 of the method 300, a plurality of documents is received from a plurality of sources, via the one or more hardware processors 104.
In an embodiment, the plurality of documents includes any textual data. In an example scenario, the plurality of documents includes resumes, news articles, user manuals, product reviews, patent documents, etc.
Further, the plurality of sources includes sources that produce or publish the textual documents. In an example scenario, the plurality of sources includes newspaper organizations for news articles, patent websites for patent documents, etc.
At step 304 of the method 300, the plurality of documents is pre-processed using a plurality of pre-processing techniques at the pre-processor 202 to obtain a plurality of pre-processed documents. The plurality of pre-processed documents comprises a plurality of sentences, a plurality of words within the plurality of sentences, a plurality of Part-of-speech (POS) tags, a plurality of dependency trees and a plurality of wordnet based features.
The plurality of pre-processing techniques to obtain the plurality of pre-processed documents comprises:
(a) a sentence-splitting technique for identification of the plurality of sentences,
(b) a tokenization technique for identification of the plurality of words within the plurality of sentences,
(c) a Part-of-speech (POS) tagging technique for identification of the plurality of POS tags for each word in the plurality of words,
(d) a dependency parsing technique for identification of the plurality of dependency trees for the plurality of sentences, and
(e) a WordNet based features identification technique for identification of the plurality of wordnet based features for the plurality of sentences, as sketched below.
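By way of illustration only, the following sketch shows how steps (a) to (e) could be realized, assuming spaCy for sentence splitting, tokenization, POS tagging and dependency parsing, and NLTK's WordNet interface for the WordNet based features; the disclosure does not prescribe particular libraries.

```python
# Minimal sketch of the pre-processing pipeline (a)-(e). spaCy and NLTK's
# WordNet are assumptions; the disclosure does not name specific libraries.
import spacy
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

nlp = spacy.load("en_core_web_sm")

def preprocess(document_text):
    doc = nlp(document_text)
    processed = []
    for sent in doc.sents:                        # (a) sentence splitting
        words = [t.text for t in sent]            # (b) tokenization
        pos = [t.pos_ for t in sent]              # (c) POS tags
        deps = [(t.text, t.dep_, t.head.text)     # (d) dependency tree
                for t in sent]
        wn_feats = {t.text: [s.lexname() for s in wn.synsets(t.lemma_)]
                    for t in sent if not t.is_punct}   # (e) WordNet features
        processed.append({"words": words, "pos": pos,
                          "deps": deps, "wordnet": wn_feats})
    return processed

for s in preprocess("Amazon has said the number of demands increased."):
    print(s["words"], s["pos"])
```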
In an embodiment, an example scenario of a document from the plurality of documents, pre-processed using the plurality of pre-processing techniques to obtain the pre-processed document, is shared below:
===========================================
Document:
===========================================
Amazon has said the number of demands for user data made by U.S. federal and local law enforcement have increased more during the first half of 2020 than during the same period a year earlier. The disclosure came in the company's latest transparency report, published Thursday.
===========================================
After Sentence-splitting and Tokenization:
==========================================
Sentence 1: Amazon has said the number of demands for user data made by U.S. federal and local law enforcement have increased more during the first half of 2020 than during the same period a year earlier.
Sentence 2: The disclosure came in the company’s latest transparency report, published Thursday.
==========================================
POS-tagging and Dependency parsing (the information is shown embedded through XML-like tags; however, a complex data structure is used to store this information for each sentence):
==========================================
Sentence 1: <nsubj_said>Amazon</nsubj_said> has said ...
| Labeling function | Coverage (Cov) | Overlap | Conflict |
|---|---|---|---|
| non action nouns/verbs | 0.625 | 0.625 | 0.0000 |
| negation modifier | 0.63 | 0.629 | 0.006 |
| animate/org agent | 0.68 | 0.667 | 0.021 |
| non-animate agent | 0.638 | 0.634 | 0.01 |
| volition marker | 0.625 | 0.625 | 0.0000 |
| non-volition marker | 0.625 | 0.625 | 0.0000 |
| explicit expertise marker | 0.629 | 0.629 | 0.001 |
| corpus expertise score | 0.673 | 0.66 | 0.013 |
| direct object | 0.799 | 0.687 | 0.013 |
| adjectival clause | 0.799 | 0.687 | 0.013 |
| no object or pp | 0.708 | 0.651 | 0.029 |
| compound modifier | 0.641 | 0.625 | 0.0000 |
| number-like modifier | 0.625 | 0.625 | 0.0000 |
Table 1: Analysis of labeling functions over the training dataset
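The Coverage, Overlap and Conflict columns in Table 1 are the statistics conventionally reported for labeling functions in weak-supervision frameworks such as Snorkel. As an illustration only, one rule from Table 1, the negation modifier function, might look as sketched below; the Snorkel interface, the candidate representation (x.sentence, x.head_index) and the exact rule logic are assumptions, not the disclosure's verbatim implementation.

```python
# Illustrative sketch of the "negation modifier" labeling function from
# Table 1 in a Snorkel-style interface. The candidate fields x.sentence and
# x.head_index are hypothetical; the disclosure's own rule may differ.
import spacy
from snorkel.labeling import labeling_function

nlp = spacy.load("en_core_web_sm")
TASK, NO_TASK, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_negation_modifier(x):
    """Label a candidate headword NO_TASK if it is modified by a negation."""
    doc = nlp(x.sentence)
    head = doc[x.head_index]
    if any(child.dep_ == "neg" for child in head.children):
        return NO_TASK      # e.g. "He did not implement the solution"
    return ABSTAIN          # this rule says nothing otherwise
```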
Further, to create ground truth for evaluating various task extraction techniques, 20 documents from each of the 4 datasets are manually annotated with gold-standard task head words and complete task phrases. The ground truth dataset consists of 1869 sentences, in which 607 tasks are annotated, as shown below in Table 2:
| Dataset | Sentences | Tasks |
|---|---|---|
| Resumes | 1297 | 167 |
| TechCrunch | 292 | 251 |
| Reuters | 178 | 89 |
| Patents | 102 | 100 |
| Total | 1869 | 607 |
Table 2: Details of ground truth dataset
For the purpose of performance comparison, recent event extraction techniques are used as baselines. The first event extraction technique, "EvExtB1", is a literary event extraction technique trained on a literature dataset using a BiLSTM based model with BERT token representations. The second event extraction technique, "EvExtB2", is an Open Domain Event Extraction technique that uses a BiLSTM based supervised event detection model trained on distantly generated training data. For both event extraction techniques, the event triggers identified are considered as task head words, and complete task phrases are identified using the phrase expansion rules. Any gold-standard task phrase is counted as a true positive (TP) if there is a "matching" predicted task; otherwise it is counted as a false negative (FN). Here, two task phrases are considered to be "matching" if there is at least 80% string similarity between them for strict evaluation and at least 50% for lenient evaluation. All the remaining predicted tasks which are not TPs are counted as false positives (FP). In addition, similar to event triggers, TPs, FPs and FNs are computed considering only task head words. Precision, recall and F1-score are then computed for each of these three evaluation strategies (strict evaluation, lenient evaluation, and considering only task head words):
P = TP / (TP + FP), R = TP / (TP + FN), F1 = (2 · P · R) / (P + R) (7)
where,
P represents Precision,
R represents Recall,
TP represents a true positive,
FN represents a false negative, and
FP represents a false positive.
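A small sketch of this evaluation protocol is given below. difflib's SequenceMatcher stands in for the unspecified string-similarity measure, and the greedy one-to-one matching between gold and predicted phrases is an assumption.

```python
# Sketch of the evaluation described above: a predicted task phrase
# "matches" a gold phrase when string similarity crosses a threshold
# (0.8 strict, 0.5 lenient). difflib is an assumed stand-in for the
# unspecified string-similarity measure.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def evaluate(gold_phrases, predicted_phrases, threshold):
    matched_preds = set()
    tp = 0
    for gold in gold_phrases:
        # A gold phrase is a TP if some unmatched prediction is similar enough.
        best = max((p for p in predicted_phrases if p not in matched_preds),
                   key=lambda p: similarity(gold, p), default=None)
        if best is not None and similarity(gold, best) >= threshold:
            tp += 1
            matched_preds.add(best)
    fn = len(gold_phrases) - tp          # unmatched gold phrases
    fp = len(predicted_phrases) - tp     # unmatched predictions
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["implemented a model for weather prediction"]
pred = ["implemented a model"]
print(evaluate(gold, pred, threshold=0.8))   # strict: no match
print(evaluate(gold, pred, threshold=0.5))   # lenient: match
```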
The technique for task extraction using the system 200, the weakly supervised BERT-based task extractor, is evaluated on 4 different datasets using the 3 evaluation strategies. The performance of the proposed technique (the BERT extractor) is compared with the two event extraction baselines (EvExtB1 and EvExtB2) and with the inventors' previous work on linguistic patterns. The detailed results are shared in Table 3 below.
| Dataset | Technique | Headword P | Headword R | Headword F1 | Lenient P | Lenient R | Lenient F1 | Strict P | Strict R | Strict F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Resumes | EvExtB1 | 0.553 | 0.166 | 0.255 | 0.380 | 0.103 | 0.162 | 0.314 | 0.086 | 0.136 |
| Resumes | EvExtB2 | 0.335 | 0.669 | 0.447 | 0.373 | 0.714 | 0.490 | 0.232 | 0.454 | 0.307 |
| Resumes | Linguistic pattern | 0.582 | 0.771 | 0.663 | 0.551 | 0.730 | 0.628 | 0.429 | 0.568 | 0.488 |
| Resumes | Disclosed technique | 0.552 | 0.675 | 0.607 | 0.505 | 0.589 | 0.544 | 0.311 | 0.368 | 0.337 |
| TechCrunch | EvExtB1 | 0.354 | 0.222 | 0.273 | 0.343 | 0.229 | 0.274 | 0.217 | 0.144 | 0.173 |
| TechCrunch | EvExtB2 | 0.312 | 0.763 | 0.442 | 0.294 | 0.734 | 0.419 | 0.187 | 0.476 | 0.268 |
| TechCrunch | Linguistic pattern | 0.404 | 0.510 | 0.451 | 0.420 | 0.542 | 0.473 | 0.239 | 0.310 | 0.270 |
| TechCrunch | Disclosed technique | 0.449 | 0.732 | 0.556 | 0.422 | 0.694 | 0.524 | 0.262 | 0.439 | 0.328 |
| Reuters | EvExtB1 | 0.323 | 0.370 | 0.345 | 0.294 | 0.364 | 0.325 | 0.139 | 0.170 | 0.153 |
| Reuters | EvExtB2 | 0.188 | 0.716 | 0.297 | 0.188 | 0.761 | 0.302 | 0.095 | 0.386 | 0.152 |
| Reuters | Linguistic pattern | 0.210 | 0.358 | 0.265 | 0.218 | 0.364 | 0.272 | 0.122 | 0.205 | 0.153 |
| Reuters | Disclosed technique | 0.314 | 0.716 | 0.436 | 0.296 | 0.682 | 0.412 | 0.161 | 0.375 | 0.225 |
| Patents | EvExtB1 | 0.533 | 0.075 | 0.132 | 0.556 | 0.085 | 0.148 | 0.267 | 0.034 | 0.061 |
| Patents | EvExtB2 | 0.371 | 0.774 | 0.502 | 0.370 | 0.752 | 0.496 | 0.179 | 0.385 | 0.244 |
| Patents | Linguistic pattern | 0.420 | 0.472 | 0.444 | 0.515 | 0.590 | 0.550 | 0.220 | 0.248 | 0.233 |
| Patents | Disclosed technique | 0.524 | 0.830 | 0.642 | 0.522 | 0.803 | 0.633 | 0.268 | 0.419 | 0.327 |
| Average | EvExtB1 | 0.441 | 0.208 | 0.251 | 0.393 | 0.195 | 0.227 | 0.234 | 0.109 | 0.131 |
| Average | EvExtB2 | 0.302 | 0.731 | 0.422 | 0.306 | 0.740 | 0.427 | 0.173 | 0.425 | 0.243 |
| Average | Linguistic pattern | 0.404 | 0.528 | 0.456 | 0.426 | 0.557 | 0.481 | 0.253 | 0.333 | 0.286 |
| Average | Disclosed technique | 0.460 | 0.738 | 0.560 | 0.436 | 0.692 | 0.528 | 0.251 | 0.400 | 0.304 |
Table 3: Detailed results of comparative task extraction performance (the "Headword" columns consider only task head words)
As shown in Table 3, except on the Resumes dataset, the BERT-based task extractor outperforms all the other techniques on all the datasets. Considering the macro-average across datasets, the BERT-based task extractor turns out to be the best overall technique, and it also performs consistently across datasets. Further, an ablation analysis is conducted to evaluate the contribution of the POS tag and WordNet-based features, and it is observed that these features have a minor positive contribution.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The embodiments of the present disclosure herein provide a solution to the unsolved problem of extraction of tasks from documents, wherein extraction of tasks is identification of mentions of tasks in a document. The disclosed extraction of tasks is based on word-level weak supervision. There are several prior arts addressing the problem of extraction of events; however, due to crucial distinctions between events and tasks, task extraction stands as a separate problem. The disclosure explicitly defines specific characteristics of tasks and creates labelled data at a word level based on a plurality of linguistic rules, which is used to train a word-level weakly supervised model for task extraction. The labelled data is created based on the plurality of linguistic rules for a non-negation aspect, a volitionality aspect, an expertise aspect and a plurality of generic aspects. Further, the disclosure includes a phrase expansion technique to capture the complete meaning expressed by the task, instead of extracting a shorter task mention that may not capture the entire meaning of the sentence.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
| # | Name | Date |
|---|---|---|
| 1 | 202121058475-STATEMENT OF UNDERTAKING (FORM 3) [15-12-2021(online)].pdf | 2021-12-15 |
| 2 | 202121058475-REQUEST FOR EXAMINATION (FORM-18) [15-12-2021(online)].pdf | 2021-12-15 |
| 3 | 202121058475-PROOF OF RIGHT [15-12-2021(online)].pdf | 2021-12-15 |
| 4 | 202121058475-FORM 18 [15-12-2021(online)].pdf | 2021-12-15 |
| 5 | 202121058475-FORM 1 [15-12-2021(online)].pdf | 2021-12-15 |
| 6 | 202121058475-FIGURE OF ABSTRACT [15-12-2021(online)].jpg | 2021-12-15 |
| 7 | 202121058475-DRAWINGS [15-12-2021(online)].pdf | 2021-12-15 |
| 8 | 202121058475-DECLARATION OF INVENTORSHIP (FORM 5) [15-12-2021(online)].pdf | 2021-12-15 |
| 9 | 202121058475-COMPLETE SPECIFICATION [15-12-2021(online)].pdf | 2021-12-15 |
| 10 | Abstract1.jpg | 2022-03-17 |
| 11 | 202121058475-FORM-26 [20-04-2022(online)].pdf | 2022-04-20 |
| 12 | 202121058475-Power of Attorney [18-08-2022(online)].pdf | 2022-08-18 |
| 13 | 202121058475-Form 1 (Submitted on date of filing) [18-08-2022(online)].pdf | 2022-08-18 |
| 14 | 202121058475-Covering Letter [18-08-2022(online)].pdf | 2022-08-18 |
| 15 | 202121058475-CORRESPONDENCE(IPO)(WIPO DAS)-22-09-2022.pdf | 2022-09-22 |
| 16 | 202121058475-FORM 3 [30-05-2023(online)].pdf | 2023-05-30 |
| 17 | 202121058475-FER.pdf | 2023-11-17 |
| 18 | 202121058475-OTHERS [05-04-2024(online)].pdf | 2024-04-05 |
| 19 | 202121058475-Information under section 8(2) [05-04-2024(online)].pdf | 2024-04-05 |
| 20 | 202121058475-FORM 3 [05-04-2024(online)].pdf | 2024-04-05 |
| 21 | 202121058475-FER_SER_REPLY [05-04-2024(online)].pdf | 2024-04-05 |
| 22 | 202121058475-DRAWING [05-04-2024(online)].pdf | 2024-04-05 |
| 23 | 202121058475-CLAIMS [05-04-2024(online)].pdf | 2024-04-05 |
| 24 | 202121058475-RELEVANT DOCUMENTS [06-04-2024(online)].pdf | 2024-04-06 |
| 25 | 202121058475-PETITION UNDER RULE 137 [06-04-2024(online)].pdf | 2024-04-06 |