Abstract: ABSTRACT SYSTEM AND METHOD FOR CONTENT SELECTION FOR ABSTRACTIVE TEXT SUMMARIZATION Existing works train models on reference summaries, which are noisy, to identify summary-worthy sentences, wherein the trained models learn the underlying behavior and pattern. The present disclosure receives a source document D containing one or more document units. A plurality of summary-worthy segments is identified from the received source document D by computing a plurality of linguistic and topic-informed metric scores, including informativeness, relevance and redundancy, for each document unit of the one or more document units comprised in one or more topic-segments of the source document. The system further leverages document unit ranking systems on the identified plurality of summary-worthy segments to dynamically select one or more document units from the source document. A fluency of the dynamically selected document units is further measured. A content summarization model is trained using the identified documents and summary documents. The trained content summarization model is validated on a plurality of datasets to evaluate an accuracy of the trained content summarization model. [To be published with FIG. 2]
Description: FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
SYSTEM AND METHOD FOR CONTENT SELECTION FOR ABSTRACTIVE TEXT SUMMARIZATION
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to abstractive text summarization and, more particularly, to a system and method for content selection for abstractive text summarization.
BACKGROUND
Text summarization aims at generating compact outline text covering salient information from a source document. In general, text summarization can be conceptualized with two strategies including an extractive approach, where important information is extracted and directly considered as a part of the summary, and an abstractive approach, where salient information is paraphrased to form the summary. The abstractive text summarization is the task of generating a short and concise summary that captures the salient ideas of the source text. The generated summaries potentially contain new phrases and sentences that may not appear in the source text.
Selecting summary-worthy sentences (henceforth referred to as SWORTS) from the source document is a crucial step in ensuring quality outputs. A majority of the existing works, while training, use reference summaries to evaluate and identify candidate sentences. In the case of extractive summarization, identifying SWORTS is typically considered as a sequence labeling task (as known in the literature), with a greedy search approach (known in the art) as the most popular way for labeling sentences. Further, some of the recent works have demonstrated various issues such as lead bias, underfitting, and monolingual bias with the greedy labeling approach. In this regard, various existing techniques, including an attention-based sentence selection, a semantic salience-based selection, and a policy-based reinforcement learning, are implicitly used to identify SWORTS. Different forms of existing attention-based approaches include cross-attention, graph-based attention, and a copy mechanism. The aforementioned approaches use reference summaries while training to decide the set of SWORTS.
The existing works consider reference summaries and use only topical distribution and sentence inter-relationships to identify summary-worthy sentences. Further, as obtaining high-quality reference summaries is a tedious task, many of the existing datasets contain noisy reference summaries, and models trained on such datasets learn the underlying behavior and pattern from these datasets. Collectively, it is evident from the above explanation that these approaches do not precisely plan for better informativeness or relevance while selecting candidate segments.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for content selection for abstractive text summarization is provided. The method includes receiving, via one or more hardware processors, a source document D containing one or more document units and a summary document S, wherein the summary document S is used as a reference document for training a content summarization model; identifying, via the one or more hardware processors, a plurality of summary-worthy segments (SWORTS) from the received source document D by computing a plurality of linguistic and topic-informed metric scores for each document unit of the one or more document units comprised in one or more topic-segments of the source document D by at least one of: defining an informativeness of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of: (i) one or more u-grams in a first order in the one or more topic-segments contributing at least a percentage with respect to the informativeness of the one or more document units; and (ii) one or more u-grams in a second order in the one or more topic-segments signifying an importance pertaining to a key-information, and wherein the informativeness measures a degree to which the one or more document units provide non-trivial information; defining a relevance of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of: (i) computing the relevance of the one or more document units comprised in the source document D against one or more preceding document units comprised in the one or more topic segments of the source document D; (ii) computing the relevance of one or more leading document units comprised in the source document D against the one or more leading document units of the remaining one or more topic segments of the source document D; and (iii) computing the relevance of the one or more leading document units comprised in the source document D against a last document unit of a previous topic segment comprised in the one or more topic segments of the source document D to capture a continuum of a discussion across one or more topics; defining a redundancy of the one or more document units comprised in the one or more topic-segments of the source document D based on one or more words comprised in the one or more document units which occur frequently in the one or more topic-segments; leveraging, via the one or more hardware processors, a plurality of document unit ranking systems on the identified plurality of summary-worthy segments to dynamically select the one or more document units from the source document D, wherein each of the plurality of document unit ranking systems considers one or more combinations of the plurality of linguistic and topic-informed metric scores; measuring, via the one or more hardware processors, a fluency of the dynamically selected one or more document units to identify one or more best document units by computing a perplexity of the dynamically selected one or more document units using a language model; training, via the one or more hardware processors, the content summarization model using the identified one or more best document units and the summary document S, wherein the trained content summarization model is used to summarize the source document D; and validating, via the one or more hardware processors, the trained content summarization model on a plurality of datasets to evaluate an accuracy of the trained content summarization model.
In another aspect, there is provided a system for content selection for abstractive text summarization. The system comprises: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a source document D containing one or more document units and a summary document S, wherein the summary document S is used as a reference document for training a content summarization model. The system further comprises identifying, via the one or more hardware processors, a plurality of summary-worthy segments (SWORTS) from the received source document D by computing a plurality of linguistic and topic-informed metric scores for each document unit of the one or more document units comprised in one or more topic-segments of the source document D by at least one of: defining an informativeness of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of: (i) one or more u-grams in a first order in the one or more topic-segments contributing at least a percentage with respect to the informativeness of the one or more document units; and (ii) one or more u-grams in a second order in the one or more topic-segments signifying an importance pertaining to a key-information, and wherein the informativeness measures a degree to which the one or more document units provide non-trivial information; defining a relevance of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of: (i) computing the relevance of the one or more document units comprised in the source document D against one or more preceding document units comprised in the one or more topic segments of the source document D; (ii) computing the relevance of one or more leading document units comprised in the source document D against the one or more leading document units of the remaining one or more topic segments of the source document D; and (iii) computing the relevance of the one or more leading document units comprised in the source document D against a last document unit of a previous topic segment comprised in the one or more topic segments of the source document D to capture a continuum of a discussion across one or more topics; defining a redundancy of the one or more document units comprised in the one or more topic-segments of the source document D based on one or more words comprised in the one or more document units which occur frequently in the one or more topic-segments; leveraging, via the one or more hardware processors, a plurality of document unit ranking systems on the identified plurality of summary-worthy segments to dynamically select the one or more document units from the source document D, wherein each of the plurality of document unit ranking systems considers one or more combinations of the plurality of linguistic and topic-informed metric scores; measuring, via the one or more hardware processors, a fluency of the dynamically selected one or more document units to identify one or more best document units by computing a perplexity of the dynamically selected one or more document units using a language model; training, via the one or more hardware processors, the content summarization model using the identified one or more best document units and the summary document S, wherein the trained content summarization model is used to summarize the source document D; and validating, via the one or more hardware processors, the trained content summarization model on a plurality of datasets to evaluate an accuracy of the trained content summarization model.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause receiving, via the one or more hardware processors, a source document D containing one or more document units and a summary document S, wherein the summary document S is used as a reference document for training a content summarization model; identifying, via the one or more hardware processors, a plurality of summary-worthy segments (SWORTS) from the received source document D by computing a plurality of linguistic and topic-informed metric scores for each document unit of the one or more document units comprised in one or more topic-segments of the source document D by at least one of: defining an informativeness of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of: (i) one or more u-grams in a first order in the one or more topic-segments contributing at least a percentage with respect to the informativeness of the one or more document units; and (ii) one or more u-grams in a second order in the one or more topic-segments signifying an importance pertaining to a key-information, and wherein the informativeness measures a degree to which the one or more document units provide non-trivial information; defining a relevance of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of: (i) computing the relevance of the one or more document units comprised in the source document D against one or more preceding document units comprised in the one or more topic segments of the source document D; (ii) computing the relevance of one or more leading document units comprised in the source document D against the one or more leading document units of the remaining one or more topic segments of the source document D; and (iii) computing the relevance of the one or more leading document units comprised in the source document D against a last document unit of a previous topic segment comprised in the one or more topic segments of the source document D to capture a continuum of a discussion across one or more topics; defining a redundancy of the one or more document units comprised in the one or more topic-segments of the source document D based on one or more words comprised in the one or more document units which occur frequently in the one or more topic-segments; leveraging, via the one or more hardware processors, a plurality of document unit ranking systems on the identified plurality of summary-worthy segments to dynamically select the one or more document units from the source document D, wherein each of the plurality of document unit ranking systems considers one or more combinations of the plurality of linguistic and topic-informed metric scores; measuring, via the one or more hardware processors, a fluency of the dynamically selected one or more document units to identify one or more best document units by computing a perplexity of the dynamically selected one or more document units using a language model; training, via the one or more hardware processors, the content summarization model using the identified one or more best document units and the summary document S, wherein the trained content summarization model is used to summarize the source document D; and validating, via the one or more hardware processors, the trained content summarization model on a plurality of datasets to evaluate an accuracy of the trained content summarization model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary system for content selection for abstractive text summarization, according to some embodiments of the present disclosure.
FIG. 2 is a functional block diagram of the system for content selection for abstractive text summarization, according to some embodiments of the present disclosure.
FIGS. 3A and 3B are flow diagrams illustrating the steps involved in the method for content selection for abstractive text summarization, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
The present disclosure provides a system and method for content selection for abstractive text summarization. The present disclosure defines topic-aware formulations for three metrics, namely, an informativeness (whether a sentence is adding any information), a relevance (whether a sentence is contextually relevant), and a redundancy (whether a sentence is adding redundant information). The present disclosure implements a topic-informed and reference-free pipeline (LIMES, a linguistically informed pipeline for Summary-Worthy Segments (SWORTS)) to identify the Summary-Worthy Segments (SWORTS). The present disclosure investigates three variations of LIMES for content selection and reports extensive experiments to understand the effect of content selection in various settings including fine-tuning, few-shot training, zero-shot inference, domain adaptation, and self-training.
Referring now to the drawings, and more particularly to FIG. 1 through FIG. 3, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary system for content selection for abstractive text summarization, according to some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104 that are hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, graphics controllers, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) are configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, one or more modules (not shown) of the system 100 can be stored in the memory 102.
FIG. 2, with reference to FIG. 1, illustrates a functional block diagram of the system for content selection for abstractive text summarization, according to some embodiments of the present disclosure. In an embodiment, the system 200 includes an input module 202, a summary-worthy segments identification module 204, a document units ranking systems module 212, a fluency measurement module 214, a content summarization model training module 216, and a validation module 218. The summary-worthy segments identification (SWORTS) module 204 of the system 200 further includes an informativeness defining module 206, a relevancy defining module 208, and a redundancy defining module 210.
FIGS. 3A and 3B are flow diagrams illustrating a method for content selection for abstractive text summarization using the system 100 of FIGS. 1-2, according to some embodiments of the present disclosure. Steps of the method of FIG. 3 shall be described in conjunction with the components of FIG. 2. At step 302 of the method 300, the one or more hardware processors 104 receive a source document D containing one or more document units and a summary document S which is represented by the input module 202 of the system 200. The summary document S is used as a reference document for training a content summarization model.
At step 304 of the method 300, the one or more hardware processors 104 identify a plurality of summary-worthy segments (SWORTS) from the received source document D by computing a plurality of linguistic and topic-informed metric scores for each document unit of the one or more document units comprised in one or more topic-segments of the source document D, which is represented by the summary-worthy segments identification (SWORTS) module 204 of the system 200. The plurality of linguistic and topic-informed metric scores includes the informativeness, the relevance and the redundancy. Firstly, the informativeness of the one or more document units comprised in the one or more topic-segments of the source document D is defined. For defining the informativeness, the present disclosure considers that one or more u-grams in a first order (e.g., a low order) comprised in the one or more topic-segments contribute at least a percentage with respect to the informativeness of the one or more document units. In an embodiment, the one or more u-grams in a low order comprised in the one or more topic-segments that contribute less with respect to the informativeness of the one or more document units are considered. The present disclosure further considers one or more u-grams in a second order (e.g., a high order) comprised in the one or more topic-segments signifying an importance pertaining to a key-information. In an embodiment, u-grams in a high order in the topic segments signifying a high importance pertaining to the key information are considered. The informativeness measures a degree to which the one or more document units provide non-trivial information. For defining the relevance, the present disclosure computes the relevance of the one or more document units comprised in the source document D against one or more preceding document units comprised in the one or more topic segments of the source document D. Herein, the idea is to measure the relevance of the document unit in the context of the previous document units. The present disclosure further computes the relevance of one or more leading document units comprised in the source document D against the one or more leading document units of the remaining one or more topic segments of the source document D. Herein, the idea is to measure the relevance of the leading document unit against the leading document units of the other topic segments in the source document D. The present disclosure further computes the relevance of the one or more leading document units comprised in the source document D against a last document unit of a previous topic segment to capture the continuum of a discussion across one or more topics. The present disclosure defines a redundancy of the one or more document units comprised in the one or more topic-segments of the source document D based on one or more words comprised in the one or more document units which occur frequently in the one or more topic-segments and lead to a high redundancy, making the one or more document units less summary-worthy.
In an example embodiment of the present disclosure, the following use case example illustrates a document from the CNN-DM dataset. The following example comprises five topic segments namely topic segment 1, topic segment 2, topic segment 3, topic segment 4 and topic segment 5. The one or more topic segments comprises the one or more document units as depicted below.
Topic Segment 1:
Document unit 1: Myanmar warplanes fighting rebels dropped a bomb at a sugarcane field in China, killing four civilians, the latter's state media reported Saturday.
Document unit 2: In addition to the fatalities, nine others were wounded, according to Xinhua news agency.
Topic Segment 2:
Document unit 1: Shortly after the incident Friday, China sent fighter jets to patrol over their shared border.
Document unit 2: The jets are there to "track, monitor, warn and chase away" Myanmar military planes, China's air force told state media.
Topic Segment 3:
Document unit 1: China summoned Myanmar's ambassador in Beijing after the incident in the border city of Lincang.
Document unit 2: Liu Zhenmin, the vice foreign minister for China, called on Myanmar to investigate and bring those behind the attack to justice.
Topic Segment 4:
Document unit 1: Myanmar forces have been battling ethnic separatist rebels in the rugged border region across from Yunnan province.
Document unit 2: In recent incidents, stray gunfire has damaged property on the Chinese side of the border, prompting Beijing to warn Myanmar to ensure safety.
Topic Segment 5:
Document unit 1: There was no immediate reaction from Myanmar.
An approach to identify summary-worthy segments (SWORTS) from the source document D for the abstractive text summarization task is presented by the present disclosure. More specifically, the present disclosure implements a formulation for three metrics, namely informativeness, relevance, and redundancy, and leverages the three metrics for SWORTS selection from the source document D in a reference-free setting.
Notations and problem formulation: Given a source document D = {d_1, d_2, ..., d_n} containing n sentences and the document summary S = {s_1, s_2, ..., s_m} containing m sentences, the proposed approach identifies a set D' of k sentences such that D' ⊂ D and k ≤ n. The set D' of k sentences is the SWORTS selected data from the source document D. Furthermore, the SWORTS selection algorithm, which is represented by Algorithm 1, considers the source document D as the set of p topics, i.e., D = {t_1, t_2, ..., t_p}, where each topic t_i contains a list of sentences in the same order as they appear in the source document D.
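By way of a non-limiting illustration, the notation above may be realized in software as a nested list, with one inner list of document units per topic segment. The following Python sketch shows one such representation using the first two topic segments of the use case example; the variable names and the choice of a plain nested list are assumptions made only for illustration, and the topic segmentation step itself is assumed to have been performed beforehand by any suitable segmenter.

```python
# Illustrative only: one possible in-memory representation of the source document D
# as an ordered list of topic segments, each holding its document units (sentences)
# in their original order.
source_document = [
    # topic segment t_1
    ["Myanmar warplanes fighting rebels dropped a bomb at a sugarcane field in China, "
     "killing four civilians, the latter's state media reported Saturday.",
     "In addition to the fatalities, nine others were wounded, according to Xinhua news agency."],
    # topic segment t_2
    ["Shortly after the incident Friday, China sent fighter jets to patrol over their shared border.",
     "The jets are there to \"track, monitor, warn and chase away\" Myanmar military planes, "
     "China's air force told state media."],
    # ... remaining topic segments t_3 .. t_p of the use case example
]

n = sum(len(segment) for segment in source_document)  # total number of document units in D
p = len(source_document)                              # number of topic segments
```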
In an embodiment of the present disclosure, three linguistically informed metrics and their formulation for SWORTS selection have been defined. With each formulation, a score is assigned to a sentence in the source document D. A sentence i in D is referred to as a document unit (d_i). The formulation for the informativeness, one of the three metrics, is explained below.
Informativeness: Given a source document D, firstly the informativeness of each of the one or more document units comprised therein is measured. To a reader, the one or more document units is informative if it offers a change in the knowledge about the one or more document units. Intuitively, informativeness measures the degree to which the one or more document units is surprising. The one or more document units is less informative if the one or more document units contains the likely semantic units (words or phrases). To measure the informativeness, the one or more document units is considered to be made up of semantic units (here, u-grams), where each semantic unit contributes to the overall informativeness of the one or more document units. Formally, the informativeness of a document unit d_i in a topic-segment t_j is defined as:
Info(d_i, t_j) = ( Σ_{u=1}^{U} w(u, d_i) · f_rep(u, d_i, t_j) ) / n    (1)
such that,
w(u, d_i) = 1 / |Unique u-grams in d_i|    (2)
where f_rep(u, d_i, t_j) represents the repetitiveness of u-grams by measuring the count of u-grams in d_i that occur in the remaining document units in t_j. Intuitively, a lower-order u-gram is likely to be repeated in the one or more topic segments and contributes less towards the overall informativeness score of d_i due to the relatively lower w(u, d_i) score, whereas the repetition of higher-order u-grams in the one or more topic-segments signifies the high importance of those u-grams in conveying the key-information. w(u, d_i) controls the degree with which the repetitive behavior of the u-gram in the one or more topic-segments denotes its informativeness.
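The informativeness formulation of Equations (1) and (2) may be illustrated with a short Python sketch. The sketch below reflects one possible reading of the equations; the tokenization scheme, the maximum u-gram order U, and counting u-gram repetitions against the other document units of the same topic segment are assumptions made for illustration and are not mandated by the present disclosure.

```python
from typing import List, Tuple

def u_grams(tokens: List[str], order: int) -> List[Tuple[str, ...]]:
    """All u-grams of a given order from a tokenized document unit."""
    return [tuple(tokens[k:k + order]) for k in range(len(tokens) - order + 1)]

def informativeness(segment_tokens: List[List[str]], unit_index: int,
                    n_total_units: int, max_order: int = 3) -> float:
    """Sketch of Equations (1) and (2) for document unit d_i inside topic segment t_j.

    segment_tokens : tokenized document units of the topic segment t_j
    unit_index     : position of d_i inside t_j
    n_total_units  : n, the total number of document units in the source document D
    max_order      : U, the highest u-gram order considered (assumed value)
    """
    unit_tokens = segment_tokens[unit_index]
    other_units = [t for k, t in enumerate(segment_tokens) if k != unit_index]
    score = 0.0
    for order in range(1, max_order + 1):                      # u = 1 .. U
        grams = u_grams(unit_tokens, order)
        unique = set(grams)
        if not unique:
            continue
        w = 1.0 / len(unique)                                   # Equation (2)
        seen_elsewhere = set()
        for other in other_units:
            seen_elsewhere.update(u_grams(other, order))
        f_rep = sum(1 for g in grams if g in seen_elsewhere)    # repetitiveness of u-grams in t_j
        score += w * f_rep
    return score / n_total_units                                # Equation (1)
```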
In an embodiment of the present disclosure, the formulation for the relevance, one of the three metrics, is explained below.
Relevance: While approximating a document, the key content needs to be kept with minimum loss of information. Leveraging the relevance of each document unit of the one or more document units, the source document D is condensed to deliver the key message. The relevance of a document unit d_i in a topic-segment t_j is formally defined as:
Rel(d_i, t_j) =
    R_lead(d_i, t_j),   if D starts with d_i
    R_topic(d_i, t_j),  if t_j starts with d_i
    R(d_i, t_j),        otherwise
such that,
R_lead(d_i, t_j) = ( Σ_{v=1}^{p} f_csim(d̄_i, d̄_1^v) ) / p    (3)
R_topic(d_i, t_j) = f_csim(d̄_i, d̄_{-1}^{j-1})    (4)
R(d_i, t_j) = ( Σ_{x=1}^{i-1} f_csim(d̄_i, d̄_x) ) / (i - 1)    (5)
where f_csim(.) represents the cosine similarity between two vectors. In Equation 3, d̄_1^v represents the vector corresponding to the first document unit of topic segment v. Similarly, d̄_{-1}^{j-1} represents the vector corresponding to the last document unit of topic segment j-1. In the present disclosure, the sentence embedding representation is considered as the document unit vector. Further, the relevance of the one or more document units is considered with respect to the preceding document units in the same topic segment. Furthermore, the relevance of one or more leading document units of the one or more documents is computed with respect to the one or more leading document units of the remaining topic segments of the source document D. Herein, in the embodiment of the present disclosure, the leading document unit means the first document unit of the document. The first document unit of a document covers the information that will be discussed across the topic segments in the document. The relevance of the one or more leading document units of the one or more topic segments is computed with respect to the last document unit of the previous topic segment to capture the continuum of the discussion across topics.
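To make the relevance formulation of Equations (3) to (5) concrete, the following Python sketch computes Rel(d_i, t_j) from precomputed sentence-embedding vectors. The particular sentence encoder is left open, as in the disclosure; the nested-list layout of the vectors and the small epsilon added to the cosine denominator are illustrative assumptions.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """f_csim(.): cosine similarity between two sentence-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def relevance(doc_vecs, seg_index: int, unit_index: int) -> float:
    """Sketch of Equations (3)-(5). doc_vecs is a list of topic segments, each a list of
    precomputed sentence-embedding vectors (one per document unit, in document order)."""
    d_i = doc_vecs[seg_index][unit_index]
    if seg_index == 0 and unit_index == 0:
        # Equation (3): D starts with d_i, compare against the leading unit of every segment
        leads = [segment[0] for segment in doc_vecs]
        return sum(cosine_sim(d_i, lead) for lead in leads) / len(leads)
    if unit_index == 0:
        # Equation (4): t_j starts with d_i, compare against the last unit of segment j-1
        return cosine_sim(d_i, doc_vecs[seg_index - 1][-1])
    # Equation (5): otherwise, average similarity with the preceding units of the same segment
    preceding = doc_vecs[seg_index][:unit_index]
    return sum(cosine_sim(d_i, v) for v in preceding) / len(preceding)
```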
In an embodiment of the present disclosure, the formulation for the redundancy, one of the three metrics, is explained below.
Redundancy: Redundancy in the one or more document units leads to overlap between various semantic units and increased repetitiveness in the source document D. The redundancy further leads to reduced information coverage per semantic unit of the source document D. To measure the redundancy, the one or more document units is considered to be made up of Q words. Formally, the redundancy of a document unit d_i in a topic-segment t_j is defined as:
Red(d_i, t_j) = ( Σ_{q=1}^{Q} g_rep(q, t_j) ) / (Q · |t_j|)    (6)
where g_rep(q, t_j) represents the repetitiveness of the word q by measuring the count of the one or more document units in t_j in which the word q occurs. Intuitively, the one or more document units comprising one or more words that occur frequently in the one or more topic-segments lead to a high redundancy, which results in the one or more document units being less summary-worthy.
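The redundancy formulation of Equation (6) can likewise be sketched in a few lines of Python. Word-level tokenization and the decision to count the document unit d_i itself among the units of t_j in which a word occurs are assumptions of this sketch; the latter is consistent with the use case example below, where the single-unit topic segment 5 obtains a redundancy of 1.

```python
from typing import List

def redundancy(segment_tokens: List[List[str]], unit_index: int) -> float:
    """Sketch of Equation (6): for each of the Q words of d_i, count the document units
    of t_j in which that word occurs, normalised by Q * |t_j|."""
    unit_words = segment_tokens[unit_index]
    q_total = len(unit_words)                      # Q
    if q_total == 0:
        return 0.0
    unit_word_sets = [set(tokens) for tokens in segment_tokens]
    g_rep_sum = sum(
        sum(1 for word_set in unit_word_sets if word in word_set)   # g_rep(q, t_j)
        for word in unit_words
    )
    return g_rep_sum / (q_total * len(segment_tokens))               # Q * |t_j|
```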
The following use case example illustrates the three metrics scores namely the informativeness, the relevance and the redundancy computed for each document unit of the one or more document units.
Topic Segment 1:
Document unit 1: Myanmar warplanes fighting rebels dropped a bomb at a sugarcane field in China, killing four civilians, the latter's state media reported Saturday.
Document unit 2: In addition to the fatalities, nine others were wounded, according to Xinhua news agency.
Topic Segment 2:
Document unit 1: Shortly after the incident Friday, China sent fighter jets to patrol over their shared border.
Document unit 2: The jets are there to "track, monitor, warn and chase away" Myanmar military planes, China's air force told state media.
Topic Segment 3:
Document unit 1: China summoned Myanmar's ambassador in Beijing after the incident in the border city of Lincang.
Document unit 2: Liu Zhenmin, the vice foreign minister for China, called on Myanmar to investigate and bring those behind the attack to justice.
Topic Segment 4:
Document unit 1: Myanmar forces have been battling ethnic separatist rebels in the rugged border region across from Yunnan province.
Document unit 2: In recent incidents, stray gunfire has damaged property on the Chinese side of the border, prompting Beijing to warn Myanmar to ensure safety.
Topic Segment 5:
Document unit 1: There was no immediate reaction from Myanmar.
The informativeness for the one or more document units comprised in the topic segment 1, topic segment 2, topic segment 3, topic segment 4 and topic segment 5 is calculated using the formulation provided in the equation 1 and equation 2 for measuring the informativeness.
Informativeness: [0.009, 0.014, 0.027, 0.02, 0.027, 0.019, 0.024, 0.026, 0.0]
In the above example, one of the three metric scores, namely the informativeness, is illustrated. In the above example, 0.009 and 0.014 represent the informativeness scores for document units 1 and 2 of topic segment 1, 0.027 and 0.02 represent the informativeness scores for document units 1 and 2 of topic segment 2, 0.027 and 0.019 represent the informativeness scores for document units 1 and 2 of topic segment 3, 0.024 and 0.026 represent the informativeness scores for document units 1 and 2 of topic segment 4, and 0.0 represents the informativeness score for document unit 1 of topic segment 5.
The relevance for the one or more document units comprised in the topic segment 1, topic segment 2, topic segment 3, topic segment 4 and topic segment 5 is calculated using the formulation provided in the equation 3, equation 4 and equation 5 for measuring the relevance.
Relevance: [0.080, 0.0, -0.181, 0.320, 0.068, 0.157, -0.156, 0.359, 0.022]
In the above example, one of the three metric scores, namely the relevance, is illustrated. In the above example, 0.080 and 0.0 represent the relevance scores for document units 1 and 2 of topic segment 1, -0.181 and 0.320 represent the relevance scores for document units 1 and 2 of topic segment 2, 0.068 and 0.157 represent the relevance scores for document units 1 and 2 of topic segment 3, -0.156 and 0.359 represent the relevance scores for document units 1 and 2 of topic segment 4, and 0.022 represents the relevance score for document unit 1 of topic segment 5.
The redundancy for the one or more document units comprised in the topic segment 1, topic segment 2, topic segment 3, topic segment 4 and topic segment 5 is calculated using the formulation provided in the equation 6 for measuring the redundancy.
Redundancy: [0.143, 0.273, 0.333, 0.273, 0.3, 0.2, 0.25, 0.167, 1]
In the above example, one of the three metric scores, namely the redundancy, is illustrated. In the above example, 0.143 and 0.273 represent the redundancy scores for document units 1 and 2 of topic segment 1, 0.333 and 0.273 represent the redundancy scores for document units 1 and 2 of topic segment 2, 0.3 and 0.2 represent the redundancy scores for document units 1 and 2 of topic segment 3, 0.25 and 0.167 represent the redundancy scores for document units 1 and 2 of topic segment 4, and 1 represents the redundancy score for document unit 1 of topic segment 5.
At step 306 of the method 300, the one or more hardware processors 104 leverage a plurality of document unit ranking systems on the identified plurality of summary worthy segments to dynamically select the one or more document units from the source document D which is represented by the document units ranking systems 212 of the system 200. Further, each of the plurality of document unit ranking systems considers one or more combinations of the plurality of linguistic and topic-informed metric scores.
In an embodiment of the present disclosure, the linguistically informed pipeline for SWORTS selection (LIMES) exploits the three metrics, namely the informativeness, the relevance and the redundancy. Further, LIMES consists of four major components including scoring the one or more document units, importance ranking of the one or more document units, SWORTS selection, and fluency ranking of the SWORTS selected data.
Scoring the document unit: For a given source document D, the three linguistic and topic-informed metric scores for each document unit of the one or more document units are computed using the formulation explained in the previous sections. The 3-dimensional metric score for d_i is represented as M^{d_i} such that:
M^{d_i} = {Info(d_i), Rel(d_i), Red(d_i)}    (7)
Importance ranking of document units: The present disclosure proposes four different document unit ranking systems leveraging metric combinations from M^{d_i}. In each ranking system, a different combination of the proposed metrics is considered. Formally, the four importance ranking systems for a document unit d_i are defined as follows:
Imp_α(d_i) = N_Info(Info(d_i)) + N_Rel(Rel(d_i))    (8)
Imp_β(d_i) = N_Rel(Rel(d_i)) - N_Red(Red(d_i))    (9)
Imp_γ(d_i) = N_Info(Info(d_i)) - N_Red(Red(d_i))    (10)
Imp_δ(d_i) = N_Info(Info(d_i)) + N_Rel(Rel(d_i)) - N_Red(Red(d_i))    (11)
where N_Info(.), N_Rel(.), and N_Red(.) are the functions to scale the metric scores between 0 and 1. Further, the redundancy score is subtracted in Equations 9, 10, and 11 as the redundancy of the document unit is undesirable while identifying the summary-worthy content.
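As an illustration of Equations (8) to (11), the sketch below derives the four importance scores from per-unit arrays of the three metric scores. Min-max scaling is used here as one possible choice for the scaling functions N_Info(.), N_Rel(.), and N_Red(.); the disclosure only requires that the scores be scaled between 0 and 1.

```python
import numpy as np

def min_max_scale(scores: np.ndarray) -> np.ndarray:
    """One possible choice for N_Info / N_Rel / N_Red: min-max scaling to [0, 1]."""
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def importance_rankings(info, rel, red):
    """Sketch of Equations (8)-(11) over per-document-unit metric score arrays."""
    n_info, n_rel, n_red = (min_max_scale(np.asarray(x, dtype=float)) for x in (info, rel, red))
    return {
        "alpha": n_info + n_rel,           # Equation (8)
        "beta":  n_rel - n_red,            # Equation (9)
        "gamma": n_info - n_red,           # Equation (10)
        "delta": n_info + n_rel - n_red,   # Equation (11)
    }
```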
SWORTS selection: In the present disclosure, the four document unit ranking systems are used to identify and select the SWORTS. For the source document D, each ranking system independently generates the SWORTS selected data (i.e., D'_α, D'_β, D'_γ and D'_δ). The SWORTS selection procedure is presented in Algorithm 1. The present disclosure adopts and modifies the trigram blocking (known in the art) strategy to dynamically select the one or more document units from the source document D. Specifically, in addition to the trigram phrases, the noun phrases (comprising more than one token) are considered in the blocking strategy (see details in Algorithm 1). For a document unit d_j, if a trigram phrase or a noun phrase from d_j exists in the already selected document units, d_j is dropped and d_j is not considered as SWORTS.
Algorithm 1 SWORTS selection for LIMES
1: procedure BLOCKING(d_j, Selected_d)
2:     Flag = 0
3:     for i in range(len(Selected_d)) do
4:         if Trigram phrase(Selected_d[i] ∩ d_j) then
5:             Flag = 1
6:             Break
7:         if Noun phrase(Selected_d[i] ∩ d_j) then
8:             Flag = 1
9:             Break
10:    return Flag

1: procedure SWORTS_SELECT(D, M)
2:     Selected_d ← []
3:     for i in range(1, n) do
4:         c ← max(Imp_α(d_i))    ▷ Imp_α can be replaced with another importance ranking method
5:         j = Imp_α(d_i).index(c)
6:         if BLOCKING(d_j, Selected_d) == False then
7:             Selected_d.append(d_j)
8:     return Selected_d
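A runnable Python reading of Algorithm 1 is sketched below. It is equivalent to visiting the document units in decreasing importance score and keeping a unit only if it shares no trigram phrase and no multi-token noun phrase with the units already selected. The dictionary fields ('tokens', 'noun_phrases', 'position') and the use of a single sorted pass instead of repeated max-and-index lookups are assumptions of this sketch; noun phrases could, for example, be obtained with spaCy's doc.noun_chunks.

```python
from typing import Dict, List, Set, Tuple

def trigram_phrases(tokens: List[str]) -> Set[Tuple[str, ...]]:
    """All trigram phrases of a tokenized document unit."""
    return {tuple(tokens[k:k + 3]) for k in range(len(tokens) - 2)}

def blocked(candidate: Dict, selected: List[Dict]) -> bool:
    """Mirror of procedure BLOCKING: block the candidate if it shares a trigram phrase
    or a multi-token noun phrase with any already selected document unit."""
    cand_trigrams = trigram_phrases(candidate["tokens"])
    for unit in selected:
        if cand_trigrams & trigram_phrases(unit["tokens"]):
            return True
        if candidate["noun_phrases"] & unit["noun_phrases"]:
            return True
    return False

def sworts_select(units: List[Dict], importance: List[float]) -> List[Dict]:
    """Mirror of procedure SWORTS_SELECT: visit document units in decreasing importance
    score (any of Imp_alpha .. Imp_delta) and keep those that pass the blocking check.
    Each unit dict is assumed to carry 'tokens', multi-token 'noun_phrases' (as a set of
    strings), and 'position' (index of the unit in the source document)."""
    selected: List[Dict] = []
    for j in sorted(range(len(units)), key=lambda k: importance[k], reverse=True):
        if not blocked(units[j], selected):
            selected.append(units[j])
    return sorted(selected, key=lambda u: u["position"])   # restore original document order
```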
In an example embodiment of the present disclosure, the following use case example illustrates the scores assigned to each document unit by four different document unit ranking systems based on the three metric scores computed in the earlier sections.
Topic Segment 1:
Document unit 1: Myanmar warplanes fighting rebels dropped a bomb at a sugarcane field in China, killing four civilians, the latter's state media reported Saturday.
Document unit 2: In addition to the fatalities, nine others were wounded, according to Xinhua news agency.
Topic Segment 2:
Document unit 1: Shortly after the incident Friday, China sent fighter jets to patrol over their shared border.
Document unit 2: The jets are there to "track, monitor, warn and chase away" Myanmar military planes, China's air force told state media.
Topic Segment 3:
Document unit 1: China summoned Myanmar's ambassador in Beijing after the incident in the border city of Lincang.
Document unit 2: Liu Zhenmin, the vice foreign minister for China, called on Myanmar to investigate and bring those behind the attack to justice.
Topic Segment 4:
Document unit 1: Myanmar forces have been battling ethnic separatist rebels in the rugged border region across from Yunnan province.
Document unit 2: In recent incidents, stray gunfire has damaged property on the Chinese side of the border, prompting Beijing to warn Myanmar to ensure safety.
Topic Segment 5:
Document unit 1: There was no immediate reaction from Myanmar.
Document Unit Ranking System 1: [0.937, 0.727, 0.848, 1.047, 0.768, 0.957, 0.906, 1.192, 0.022]
In the above example, the scores assigned to the one or more document units comprised in the topic segment 1, topic segment 2, topic segment 3, topic segment 4 and topic segment 5 by document unit ranking system 1, based on the formulations provided in equations 8 to 11, are illustrated. In the above example, 0.937 and 0.727 represent the scores for document units 1 and 2 of topic segment 1, 0.848 and 1.047 represent the scores for document units 1 and 2 of topic segment 2, 0.768 and 0.957 represent the scores for document units 1 and 2 of topic segment 3, 0.906 and 1.192 represent the scores for document units 1 and 2 of topic segment 4, and 0.022 represents the score for document unit 1 of topic segment 5.
Document Unit Ranking System 2: [0.413, 0.518, 1.181, 1.061, 1.068, 0.861, 1.045, 1.322, 0.022]
In the above example, the scores assigned to the one or more document units comprised in the topic segment 1, topic segment 2, topic segment 3, topic segment 4 and topic segment 5 by document unit ranking system 2, based on the formulations provided in equations 8 to 11, are illustrated. In the above example, 0.413 and 0.518 represent the scores for document units 1 and 2 of topic segment 1, 1.181 and 1.061 represent the scores for document units 1 and 2 of topic segment 2, 1.068 and 0.861 represent the scores for document units 1 and 2 of topic segment 3, 1.045 and 1.322 represent the scores for document units 1 and 2 of topic segment 4, and 0.022 represents the score for document unit 1 of topic segment 5.
Document Unit Ranking System 3: [0.476, 0.791, 1.333, 1.013, 1.3, 0.903, 1.138, 1.129, 1]
In the above example, the scores assigned to the one or more document units comprised in the topic segment 1, topic segment 2, topic segment 3, topic segment 4 and topic segment 5 by document unit ranking system 3, based on the formulations provided in equations 8 to 11, are illustrated. In the above example, 0.476 and 0.791 represent the scores for document units 1 and 2 of topic segment 1, 1.333 and 1.013 represent the scores for document units 1 and 2 of topic segment 2, 1.3 and 0.903 represent the scores for document units 1 and 2 of topic segment 3, 1.138 and 1.129 represent the scores for document units 1 and 2 of topic segment 4, and 1 represents the score for document unit 1 of topic segment 5.
Document Unit Ranking System 4: [1.270, 1.245, 1.848, 1.788, 1.768, 1.661, 1.795, 2.155, 0.022]
In the above example, the scores assigned to the one or more document units comprised in the topic segment 1, topic segment 2, topic segment 3, topic segment 4 and topic segment 5 by document unit ranking system 4, based on the formulations provided in equations 8 to 11, are illustrated. In the above example, 1.270 and 1.245 represent the scores for document units 1 and 2 of topic segment 1, 1.848 and 1.788 represent the scores for document units 1 and 2 of topic segment 2, 1.768 and 1.661 represent the scores for document units 1 and 2 of topic segment 3, 1.795 and 2.155 represent the scores for document units 1 and 2 of topic segment 4, and 0.022 represents the score for document unit 1 of topic segment 5.
At step 308 of the method 300, the one or more hardware processors 104 measure a fluency of the dynamically selected one or more document units to identify one or more best document units by computing a perplexity of the dynamically selected one or more document units using a language model, which is represented by the fluency measurement module 214 of the system 200.
In an embodiment of the present disclosure, for the source document D, four SWORTS selected documents are obtained. Next, the four documents are ranked to select the best candidate based on the fluency of each document unit of the one or more document units. The fluency of a document is measured using the perplexity computed with an auto-regressive language model. Among D'_α, D'_β, D'_γ and D'_δ, the document with the least perplexity is selected as the final SWORTS selected document (D') for D.
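For the fluency ranking, the perplexity of each candidate document can be computed with any auto-regressive language model. The sketch below uses GPT-2 through the Hugging Face transformers library purely as an example; the specific model, the truncation length, and the placeholder candidate strings are assumptions and not part of the disclosure.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def perplexity(text: str, model, tokenizer) -> float:
    """Perplexity of a candidate document under an auto-regressive language model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(enc["input_ids"], labels=enc["input_ids"])
    return float(torch.exp(out.loss))

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Placeholder strings standing in for the four SWORTS selected candidate documents.
candidates = {
    "alpha": "Text of the candidate document selected by ranking system alpha.",
    "beta":  "Text of the candidate document selected by ranking system beta.",
    "gamma": "Text of the candidate document selected by ranking system gamma.",
    "delta": "Text of the candidate document selected by ranking system delta.",
}
best = min(candidates, key=lambda name: perplexity(candidates[name], model, tokenizer))
print("final SWORTS selected document:", best)
```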
At step 310 of the method 300, the one or more hardware processors 104 train the content summarization model using the identified one or more best document units and the summary document S, which is represented by the content summarization model training module 216 of the system 200. The trained content summarization model is used to summarize the source document D.
At step 312 of the method 300, the one or more hardware processors 104 validate the trained content summarization model on a plurality of datasets to evaluate an accuracy of the trained content summarization model, which is represented by the validation module 218 of the system 200. The plurality of datasets to evaluate the accuracy of the trained content summarization model includes a Cable News Network-DailyMail (CNN-DM) dataset and an eXtreme Summarization (XSUM) dataset. However, the proposed method can also be applied and tested with one or more other existing datasets specific to the summarization domain.
In an example embodiment of the present disclosure, the following use case example illustrates the document after the content selection is done using the proposed approach.
Topic Segment 1:
Document unit 1: Myanmar warplanes fighting rebels dropped a bomb at a sugarcane field in China, killing four civilians, the latter's state media reported Saturday.
Document unit 2: In addition to the fatalities, nine others were wounded, according to Xinhua news agency.
Topic Segment 2:
Document unit 1: Shortly after the incident Friday, China sent fighter jets to patrol over their shared border.
Document unit 2: The jets are there to "track, monitor, warn and chase away" Myanmar military planes, China's air force told state media.
Topic Segment 3:
Document unit 1: China summoned Myanmar's ambassador in Beijing after the incident in the border city of Lincang.
Document unit 2: Liu Zhenmin, the vice foreign minister for China, called on Myanmar to investigate and bring those behind the attack to justice.
Topic Segment 4:
Document unit 1: Myanmar forces have been battling ethnic separatist rebels in the rugged border region across from Yunnan province.
Document unit 2: In recent incidents, stray gunfire has damaged property on the Chinese side of the border, prompting Beijing to warn Myanmar to ensure safety.
Topic Segment 5:
Document unit 1: There was no immediate reaction from Myanmar.
Myanmar warplanes fighting rebels dropped a bomb at a sugarcane field in China, killing four civilians, the latter's state media reported Saturday. In addition to the fatalities, nine others were wounded, according to Xinhua news agency. Shortly after the incident Friday, China sent fighter jets to patrol over their shared border. The jets are there to "track, monitor, warn and chase away" Myanmar military planes, China's air force told state media. Liu Zhenmin, the vice foreign minister for China, called on Myanmar to investigate and bring those behind the attack to justice. Myanmar forces have been battling ethnic separatist rebels in the rugged border region across from Yunnan province. In recent incidents, stray gunfire has damaged property on the Chinese side of the border, prompting Beijing to warn Myanmar to ensure safety. There was no immediate reaction from Myanmar.
From the above use case example, it should be noted that the document unit 1 from the topic segment 3 is dropped as per the current method.
In an embodiment of the present disclosure, the application of the LIMES pipeline to a dataset containing multiple document-summary pairs results in the one or more documents with a dynamic number of document units identified per document. On average, K document units per document are obtained in the dataset. Leveraging the capability of LIMES to dynamically prune the source document D, three variants of the LIMES pipeline are selected for content selection. In the first variant of the LIMES pipeline (also known as Lead-K (known in the art)), the leading K document units from the source documents are selected. In the next variant (also known as Oracle), the top-K document units from the source document D are selected based on the ROUGE-1 score (known in the art) between the document units and the reference summary. Lastly, a dynamic version of the second variant (also known as Dynamic Oracle or D-Oracle) is proposed. Similar to the Oracle method, the one or more document units are selected based on the ROUGE-1 score (known in the art) with the reference summary. But instead of selecting K document units for each document, the number of document units to select is dynamically identified based on the LIMES pipeline. For a document, the top-K1 document units are selected, where K1 is the number of document units identified with the LIMES pipeline.
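The Oracle and D-Oracle variants described above rank document units by their ROUGE-1 score against the reference summary. The Python sketch below uses a minimal unigram-overlap F1 as a stand-in for a full ROUGE-1 implementation (any standard ROUGE package could be substituted); the function names and the tie-breaking behaviour of the sort are assumptions of this sketch.

```python
from collections import Counter
from typing import List

def rouge1_f1(candidate: str, reference: str) -> float:
    """Minimal unigram-overlap F1, standing in for a full ROUGE-1 implementation."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def d_oracle_select(document_units: List[str], reference_summary: str, k1: int) -> List[str]:
    """D-Oracle variant: keep the top-k1 document units by ROUGE-1 against the reference,
    where k1 is the number of units the LIMES pipeline selected for the same document."""
    ranked = sorted(range(len(document_units)),
                    key=lambda i: rouge1_f1(document_units[i], reference_summary),
                    reverse=True)
    keep = set(ranked[:k1])
    return [u for i, u in enumerate(document_units) if i in keep]   # original order
```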
In an embodiment of the present disclosure, the experimental setup to evaluate the effectiveness of the content selection methodology of the present disclosure for the abstractive text summarization task is described herein. In three different configurations, experiments are conducted with Bidirectional and Auto-Regressive Transformers (BART (known in the art)) and Pre-training with Extracted Gap-sentences for Abstractive Summarization Sequence-to-sequence models (PEGASUS (known in the art)) trained on the CNN-DM (Cable News Network-DailyMail (known in the art)) and the eXtreme Summarization (XSUM (known in the art)) datasets. Both the CNN-DM (known in the art) and the XSUM datasets are popularly used as benchmarks for the abstractive news summarization task. The CNN-DM dataset contains an approximately 3-4 sentence long summary along with the Cable News Network-DailyMail (CNN-DailyMail or CNN-DM) news article, whereas the eXtreme Summarization (XSUM (known in the art)) dataset contains a single sentence summary along with a British Broadcasting Corporation (BBC) news article. Further, the three experimental setups are explained in detail in the below sections.
In an embodiment of the present disclosure, as a first setup, an experiment with self-training an existing summarization model is conducted. The self-training is conducted on the SWORTS selected data as the source document D along with the original reference summary. Formally, if a summarization model X is trained on the dataset D̂, the model X is self-trained such that:
X̂ = T_self(X, LIMES(d_i), s_i); ∀(d_i, s_i) ∈ D̂    (12)
where T_self denotes the self-training of X on the SWORTS selected document and reference summary pairs from D̂. LIMES(d_i) represents the content selection from d_i using LIMES or its variants. In the present disclosure, for experiment purposes, the BART and the PEGASUS models are trained separately on the CNN-DM and the XSUM datasets and these models are self-trained on the SWORTS selected datasets. Furthermore, the self-trained model’s performance is evaluated on the Factual Inconsistency Benchmark (FIB) (known in the art). The FIB benchmark contains article-summary pairs from the CNN-DM and the XSUM datasets where the reference summaries are manually corrected to remove the factual inconsistency. Further, the model’s performance is evaluated against the factually corrected reference summaries.
In the next setup, an experiment with cross-training an existing summarization model on a novel/new/unseen dataset has been conducted. A model Y is cross-trained on a randomly sampled set of r document and summary pairs from D̂. In this experiment, the model Y is originally trained on dataset(s) other than D̂. Formally,
Ŷ = T_cross(Y, r, LIMES(d_i), s_i); ∀(d_i, s_i) ∈ D̂    (13)
where T_cross denotes the cross-training of Y on the SWORTS selected document and reference summary pairs for r samples from D̂. LIMES(d_i) represents the content selection from d_i using LIMES or its variants. Given that the model Y is not originally trained on the dataset D̂, Y can be cross-trained on D̂ without the content selection such that:
Ŷ = T_cross(Y, r, d_i, s_i); ∀(d_i, s_i) ∈ D̂    (14)
In the present disclosure, the experiment conducted cross-trains the BART and the PEGASUS models (trained originally on the XSUM dataset) on the CNN-DM dataset and vice-versa. Further, experiments with four different sample sizes (r), i.e., 0.1%, 1%, 10%, and 100% of the training dataset, are conducted.
In an embodiment of the present disclosure, a final experiment is conducted wherein the zero-shot adaptation capability of the summarization models is evaluated. In the aforementioned previous experiments, the summaries are generated with the self-trained and cross-trained models at the inference stage by passing the entire document with no content selection. This allows the system and method to compare the zero-shot summarization capability of these models against the pre-trained models. Here, the pre-trained BART and the PEGASUS models on the XSUM and the CNN-DM datasets are compared against the self-trained versions of these models on the same datasets. Further, the self-trained models were evaluated on the datasets from four different domains:
NARRASUM (known in the art): It is a narrative summarization dataset to summarize the plot description of movies and TV episodes abstractively. The data is collected from various movie websites and encyclopedias such as Wikipedia® and Internet Movie Database® (IMDb). In the present disclosure, the zero-shot summaries are generated for 6121 narrative-summary pairs from the test set and report the findings.
PENS (known in the art): It is a personalized news headline generation dataset. The dataset is curated from the Microsoft® News website and contains English news articles and summaries, other textual content, and user impression logs. In the present disclosure, the zero-shot summaries are generated for 5000 news articles from the dataset and compared against the reference summary.
Reddit® (known in the art): This dataset is collected from the Reddit® online discussion forum. In contrast to the usual summarization datasets with formal documents as the source, this dataset contains informal discussions from a subreddit forum along with a TL;DR (too long; didn't read) as the summary. In the present disclosure, experiments with 5000 samples from the Reddit® dataset were conducted and the findings reported.
WikiSum® (known in the art): The dataset is collected from the article-summary pairs appearing on the WikiHow® website. The WikiSum® documents are written in simple English, and the summaries (written by the document author) provide non-obvious tips that mimic the advice a knowledgeable, empathetic friend might give. In the present disclosure, an experiment with 2000 article-summary pairs from the test set of the dataset was conducted.
The present disclosure analyzed the results from the three experiments discussed in above sections. The following metrics were used for evaluating different setups:
1. ROUGE (known in the art): The model-generated summaries are evaluated against the reference summary using the ROUGE-1 (known in the art), ROUGE-2 (known in the art), and ROUGE-L (known in the art) scores.
2. ΔROUGE (known in the art): The performance of two models is compared based on the percentage change in the ROUGE score (known in the art). For two systems with ROUGE-1 scores (known in the art) as R and R̂, ΔR1 is defined as:
ΔR1 = 100 × (R − R̂)/R (15)
Similarly, the ΔR2 and ΔRL scores are defined for the ROUGE-2 (known in the art) and ROUGE-L (known in the art) metrics, respectively.
3. Aggregate Performance Change (APC (known in the art)): While comparing two systems, the ΔROUGE metric gives three different scores, i.e., ΔR1, ΔR2, and ΔRL. An aggregate change in the performance with a model M̂1 as compared to the BASE model M1 is reported as:
APC(M1, M̂1) = (ΔR1 + ΔR2 + ΔRL)/3 (16)
4. ΔTime: Any two models are compared based on the time required to train a model on a dataset. For two models with training times T and T̂, ΔTime (or ΔT) is defined as:
ΔTime = 100 × (T − T̂)/T (17)
where the ΔR1, ΔR2, and ΔRL scores in equation (16) are computed between the summaries generated by the models M1 and M̂1.
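For illustration only, the following Python sketch computes the ΔROUGE, APC, and ΔTime metrics defined above. The orientation of the differences (improvements and time savings computed with the earlier model's value in the denominator) is an assumption chosen to be consistent with the signs reported in the result tables.

```python
def delta_rouge(r_base: float, r_new: float) -> float:
    """Percentage change in a ROUGE score (Eq. 15). The orientation is an
    assumption: positive values mean r_new improves on r_base, matching the
    signs used in Tables 1-6."""
    return 100.0 * (r_new - r_base) / r_base


def apc(base: dict, new: dict) -> float:
    """Aggregate Performance Change (Eq. 16): mean of the three ΔROUGE scores."""
    deltas = [delta_rouge(base[k], new[k]) for k in ("rouge1", "rouge2", "rougeL")]
    return sum(deltas) / len(deltas)


def delta_time(t_base: float, t_new: float) -> float:
    """Relative training-time change (Eq. 17). Negative values indicate time
    saved, matching the ΔRT columns in Tables 4 and 5 (again an assumed sign)."""
    return 100.0 * (t_new - t_base) / t_base


# Worked example with the Table 1 CNN-DM/BART rows (BASE vs. LIMES):
base = {"rouge1": 44.16, "rouge2": 21.28, "rougeL": 40.90}
limes = {"rouge1": 44.53, "rouge2": 21.12, "rougeL": 41.56}
print(round(apc(base, limes), 2))  # ≈ 0.57, in line with the +0.56 reported in Table 1
```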
In this section, the self-training results of the BART and the PEGASUS models on the SWORTS-selected data from the CNN-DM and the XSUM datasets are presented. In Table 1, the results of the self-training experiment are reported, together with the ROUGE and the Aggregate Performance Change (APC) scores for each model. The APC score for the LIMES model (and its variants) against the BASE model (the BART and the PEGASUS models on the CNN-DM and the XSUM datasets) is reported. Further, the performance of the self-trained models on the Factual Inconsistency Benchmark (FIB) is reported. Some key observations are:
Focused training with SWORTS improves model performance. As shown in Table 1, all LIMES variations show comparable performance to the BASE model, especially in the case of the CNN-DM dataset. The CNN-DM dataset has comparatively longer and higher-quality reference summaries than the XSUM dataset (known in the art), which further helps the self-training to generate high-quality summaries (see the examples of model-generated summaries).

Shorter and more abstractive summarization needs attention. For the XSUM dataset, all experiments show comparable but negative APC, which is believed to be due to the inherent complexity of the XSUM dataset, defined as extreme summarization. In other words, self-training may be of limited use in the case of more abstractive and shorter summary generation, which is in line with the observations made with the existing extractive summarization works on this dataset (known in the art). Further experiments with more training, additional data, or external data will be useful in identifying contributing factors for better performance.

Noisy data needs to be cleaned. The present disclosure considers the FIB Benchmark dataset consisting of verified and corrected data samples. Table 2 reports self-training results on this dataset. Due to the limited size of the Factual Inconsistency Benchmark (FIB) dataset, the zero-shot inference results are used for comparison. As can be seen in Table 2, all LIMES variations give better results than the BASE model in the case of the CNN-DM dataset, whereas the BASE model performs better in the case of the XSUM dataset, similar to the observations in Table 1. The present disclosure hypothesizes that this is because the articles, as well as the corresponding summaries, in the CNN-DM dataset are comparatively larger and less noisy than the ones in the XSUM dataset.
| Model | CNN-DM BART R1 | CNN-DM BART R2 | CNN-DM BART RL | CNN-DM BART APC | CNN-DM PEGASUS R1 | CNN-DM PEGASUS R2 | CNN-DM PEGASUS RL | CNN-DM PEGASUS APC | XSUM BART R1 | XSUM BART R2 | XSUM BART RL | XSUM BART APC | XSUM PEGASUS R1 | XSUM PEGASUS R2 | XSUM PEGASUS RL | XSUM PEGASUS APC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BASE | 44.16 | 21.28 | 40.9 | - | 44.17 | 21.47 | 41.11 | - | 45.14 | 22.27 | 37.25 | - | 47.21 | 24.56 | 39.25 | - |
| LIMES | 44.53 | 21.12 | 41.56 | +0.56 | 44.68 | 21.67 | 41.63 | +1.11 | 44.20 | 20.87 | 35.28 | -4.28 | 46.72 | 23.92 | 38.67 | -1.70 |
| Lead-K | 44.09 | 21.11 | 41.16 | -0.10 | 44.01 | 21.07 | 41.03 | -0.80 | 44.86 | 21.54 | 36.29 | -2.15 | 47.26 | 24.32 | 39.09 | -0.42 |
| Oracle | 44.39 | 21.26 | 41.46 | +0.59 | 44.32 | 21.39 | 41.33 | +0.16 | 44.58 | 21.29 | 35.98 | -3.01 | 46.83 | 24.11 | 38.81 | -1.25 |
| D-oracle | 44.46 | 21.34 | 41.52 | +0.82 | 44.14 | 21.25 | 41.10 | -0.37 | 44.51 | 21.23 | 35.95 | -3.18 | 46.91 | 24.15 | 38.86 | -1.09 |
Table 1. Self-training results.
Table 1 depicts the self-training results. The row with the maximum APC among all the LIMES variants is highlighted in Table 1. The ROUGE metric scores are marked in bold where a model outperforms the corresponding BASE model (the BART and the PEGASUS models on the CNN-DM and the XSUM datasets). The BASE model represents the model trained on the corresponding dataset with no content selection.
| Model | CNN-DM BART R1 | CNN-DM BART R2 | CNN-DM BART RL | CNN-DM BART APC | CNN-DM PEGASUS R1 | CNN-DM PEGASUS R2 | CNN-DM PEGASUS RL | CNN-DM PEGASUS APC | XSUM BART R1 | XSUM BART R2 | XSUM BART RL | XSUM BART APC | XSUM PEGASUS R1 | XSUM PEGASUS R2 | XSUM PEGASUS RL | XSUM PEGASUS APC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BASE | 32.15 | 12.93 | 29.12 | - | 32.01 | 13.21 | 27.07 | - | 44.01 | 19.95 | 36.90 | - | 44.99 | 20.93 | 38.08 | - |
| LIMES | 33.48 | 13.03 | 30.54 | +3.26 | 35.01 | 14.40 | 31.86 | +12.02 | 42.34 | 18.48 | 35.17 | -5.28 | 44.56 | 20.84 | 37.80 | -0.70 |
| Lead-K | 34.23 | 14.03 | 30.68 | +6.77 | 33.27 | 13.35 | 30.54 | +5.93 | 42.63 | 18.70 | 35.19 | -4.67 | 43.93 | 20.03 | 37.41 | -2.80 |
| Oracle | 33.81 | 13.91 | 30.73 | +6.09 | 34.22 | 13.62 | 31.12 | +8.32 | 43.34 | 19.07 | 36.28 | -2.53 | 44.26 | 20.19 | 37.22 | -2.42 |
| D-oracle | 35.69 | 15.58 | 32.20 | +14.02 | 34.28 | 14.52 | 31.38 | +10.97 | 43.07 | 18.98 | 36.07 | -3.08 | 44.61 | 20.37 | 20.37 | -1.73 |
Table 2. Zero-shot inference results on the FIB benchmark with the self-trained models.
Table 2 depicts the zero-shot inference results on the FIB benchmark with the self-trained models. The row with the maximum APC among all the LIMES variants is highlighted. The ROUGE metric scores are marked in bold where a model outperforms the corresponding BASE model. The BASE model represents the model trained on the corresponding dataset with no content selection.
In the present disclosure, the results for the sampled cross-training experiments with the BART and the PEGASUS models are presented in Tables 3, 4, and 5. Some key observations are:
No need to worry about the training time. In Table 3, the cross-training results on the original source dataset without any LIMES variation for content selection are presented. Further, results are reported on four different sample sizes of the training dataset. Table 4 and Table 5 give a more detailed understanding of how the training time and performance can be optimized. It is observed that only 10% of the dataset may be sufficient to get comparable performance on the ROUGE score (known in the art) with almost 90% saving on the training time. In such a setting, the content selection with LIMES helps to reduce the training time against training on the source dataset with no content selection. In many real-life applications, the time spent on getting a system ready is important and hence this is acceptable.

Pre-trained summarization models are easily customizable in similar domains. Cross-training involves fine-tuning a model pre-trained on a dataset from the same or a similar domain. Results in Tables 4 and 5 show that as little as 10% of the dataset is sufficient to get comparable performance on the ROUGE score (known in the art) and APC, with additional time saved with content selection. It is believed that the models do learn intrinsic properties about the domain and the task, which makes it easier to adapt them to different datasets in similar and related domains.
| Model | r | CNN-DM BART R1 | CNN-DM BART R2 | CNN-DM BART RL | CNN-DM PEGASUS R1 | CNN-DM PEGASUS R2 | CNN-DM PEGASUS RL | XSUM BART R1 | XSUM BART R2 | XSUM BART RL | XSUM PEGASUS R1 | XSUM PEGASUS R2 | XSUM PEGASUS RL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BASE | - | 44.16 | 21.28 | 40.9 | 44.17 | 21.47 | 41.11 | 45.14 | 22.27 | 37.25 | 47.21 | 24.56 | 39.25 |
| ZERO-SHOT | - | 24.99 | 7.42 | 21.55 | 20.84 | 6.93 | 18.13 | 20.73 | 3.51 | 16.64 | 21.15 | 3.89 | 14.24 |
| SOURCE | 0.1% | 38.49 | 16.47 | 35.32 | 30.95 | 12.52 | 27.19 | 29.11 | 9.87 | 21.87 | 24.97 | 6.66 | 17.76 |
| SOURCE | 1% | 41.51 | 18.86 | 38.53 | 39.4 | 17.64 | 35.96 | 31.26 | 11.66 | 23.53 | 36.7 | 14.8 | 27.72 |
| SOURCE | 10% | 42.82 | 20.05 | 39.84 | 41 | 18.95 | 37.89 | 32.93 | 13.14 | 25.02 | 39.93 | 17.44 | 30.83 |
| SOURCE | 100% | 44.08 | 21.14 | 41.12 | 42.71 | 20.19 | 39.62 | 34.74 | 14.93 | 26.51 | 41.58 | 19.16 | 32.89 |
Table 3. Sampled cross-training results with no content selection.
Table 3 depicts the sampled cross-training results with no content selection. In Table 3, r denotes the sample size from the original source training dataset with no content selection.
| r | Model | BART ΔR1 | BART ΔR2 | BART ΔRL | BART APC | BART ΔRT | PEGASUS ΔR1 | PEGASUS ΔR2 | PEGASUS ΔRL | PEGASUS APC | PEGASUS ΔRT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1% | SOURCE | -12.68 | -22.09 | -14.10 | -16.29 | -99.91 | -27.53 | -37.98 | -31.37 | -32.39 | -99.92 |
| 0.1% | LIMES | -12.11 | -21.71 | -13.95 | -15.92 | -99.92 | -27.69 | -38.28 | -31.60 | -32.42 | -99.93 |
| 0.1% | LEAD-K | -13.81 | -22.65 | -15.53 | -17.33 | -99.91 | -28.02 | -38.43 | -31.85 | -32.76 | -99.92 |
| 0.1% | ORACLE | -10.93 | -18.73 | -12.28 | -13.98 | -99.91 | -26.94 | -36.20 | -30.38 | -31.17 | -99.92 |
| 0.1% | D-ORACLE | -11.43 | -19.34 | -12.67 | -14.48 | -99.91 | -27.81 | -36.55 | -31.07 | -31.81 | -99.92 |
| 1% | SOURCE | -5.83 | -10.78 | -6.29 | -7.63 | -99.12 | -7.74 | -12.63 | -9.23 | -9.86 | -99.09 |
| 1% | LIMES | -5.17 | -10.12 | -5.73 | -7.01 | -99.18 | -7.84 | -13.47 | -9.54 | -10.28 | -99.20 |
| 1% | LEAD-K | -6.57 | -11.44 | -6.97 | -8.32 | -99.18 | -7.67 | -12.63 | -9.06 | -9.78 | -99.20 |
| 1% | ORACLE | -5.19 | -9.46 | -5.59 | -6.74 | -99.18 | -7.25 | -11.98 | -8.55 | -9.26 | -99.16 |
| 1% | D-ORACLE | -5.33 | -9.50 | -5.73 | -6.85 | -99.17 | -6.88 | -11.58 | -8.25 | -8.90 | -99.14 |
| 10% | SOURCE | -2.85 | -5.15 | -3.11 | -3.70 | -90.42 | -4 | -6.14 | -4.36 | -4.83 | -90.51 |
| 10% | LIMES | -3.10 | -5.62 | -3.45 | -4.05 | -91.21 | -3.46 | -5.79 | -3.96 | -4.40 | -91.59 |
| 10% | LEAD-K | -3.74 | -6.48 | -3.89 | -4.70 | -91 | -3.48 | -5.54 | -3.83 | -4.28 | -91.51 |
| 10% | ORACLE | -3.19 | -5.58 | -3.40 | -4.05 | -90.85 | -3.39 | -5.10 | -3.73 | -4.07 | -90.91 |
| 10% | D-ORACLE | -2.90 | -5.25 | -3.11 | -3.75 | -90.88 | -3.44 | -5.29 | -3.81 | -4.18 | -90.88 |
| 100% | SOURCE | - | - | - | - | - | - | - | - | - | - |
| 100% | LIMES | -0.79 | -1.75 | -0.89 | -1.14 | -8.74 | 0 | +0.44 | -0.10 | +0.11 | -8.82 |
| 100% | LEAD-K | -0.58 | -0.80 | -0.51 | -0.63 | -7.19 | -0.35 | -0.24 | -0.47 | -0.12 | -6.11 |
| 100% | ORACLE | 0 | -0.04 | +0.02 | -0.01 | -3.16 | +0.09 | +0.59 | +0.02 | +0.23 | -3.02 |
| 100% | D-ORACLE | +0.02 | -0.14 | +0.04 | -0.02 | -4.09 | +0.09 | +0.59 | +0.02 | +0.23 | -3.53 |
Table 4. Sampled cross-training results for the CNN-DM dataset.
Table 4 depicts the sampled cross-training results for the CNN-DM dataset. For a model, the metric scores are computed against the model trained on the 100% training dataset with no content selection. The row with the maximum APC score is highlighted among all five models. For a given r and the BASE model, the individual best scores for the metrics are highlighted in bold.
| r | Model | BART ΔR1 | BART ΔR2 | BART ΔRL | BART APC | BART ΔRT | PEGASUS ΔR1 | PEGASUS ΔR2 | PEGASUS ΔRL | PEGASUS APC | PEGASUS ΔRT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1% | SOURCE | -16.20 | -33.89 | -17.50 | -22.53 | -99.90 | -39.94 | -65.24 | -46 | -50.39 | -99.92 |
| 0.1% | LIMES | -16.63 | -34.56 | -18.52 | -23.23 | -99.92 | -39.70 | -64.71 | -45.69 | -50.03 | -99.93 |
| 0.1% | LEAD-K | -16 | -32.15 | -18.48 | -22.21 | -99.92 | -37.68 | -61.84 | -43.72 | -47.74 | -99.93 |
| 0.1% | ORACLE | -17.21 | -35.09 | -18.59 | -23.63 | -99.92 | -39 | -63.83 | -45.02 | -49.28 | -99.93 |
| 0.1% | D-ORACLE | -16.75 | -34.62 | -18.18 | -23.18 | -99.92 | -36.60 | -60.22 | -42.41 | -46.41 | -99.93 |
| 1% | SOURCE | -10.01 | -21.90 | -11.24 | -14.38 | -99.13 | -11.73 | -22.75 | -15.71 | -16.73 | -99.07 |
| 1% | LIMES | -10.24 | -21.90 | -11.58 | -14.57 | -99.27 | -12.55 | -23.69 | -16.47 | -17.57 | -99.17 |
| 1% | LEAD-K | -10.67 | -22.77 | -11.69 | -15.04 | -99.24 | -12.09 | -23.17 | -15.93 | -17.06 | -99.14 |
| 1% | ORACLE | -11.08 | -23.30 | -12.22 | -15.53 | -99.22 | -12.69 | -24.06 | -16.60 | -17.78 | -99.09 |
| 1% | D-ORACLE | -10.67 | -22.97 | -11.99 | -15.21 | -99.25 | -12.24 | -23.53 | -16.20 | -17.20 | -99.16 |
| 10% | SOURCE | -5.21 | -11.98 | -5.62 | -7.60 | -90.10 | -3.96 | -8.97 | -6.26 | -6.39 | -90.60 |
| 10% | LIMES | -5.81 | -12.86 | -6.26 | -8.31 | -91.90 | -4.56 | -10.02 | -6.99 | -7.19 | -91.87 |
| 10% | LEAD-K | -5.38 | -11.92 | -6.11 | -7.80 | -91.21 | -3.99 | -9.34 | -6.47 | -6.6 | -91.39 |
| 10% | ORACLE | -5.87 | -13.52 | -6.63 | -8.67 | -90.61 | -4.59 | -10.17 | -6.68 | -7.14 | -90.92 |
| 10% | D-ORACLE | -6.04 | -13.46 | -6.71 | -8.73 | -90.97 | -4.61 | -10.59 | -7.05 | -7.41 | -91.58 |
| 100% | SOURCE | - | - | - | - | - | - | - | - | - | - |
| 100% | LIMES | -0.94 | -2.07 | -1.01 | -1.34 | -16.22 | -1.13 | -1.82 | -1.21 | -1.38 | -12.43 |
| 100% | LEAD-K | -0.28 | -0.46 | -0.75 | -0.49 | -8.43 | -0.60 | -0.73 | -0.51 | -0.61 | -6.64 |
| 100% | ORACLE | -0.28 | -2.27 | -1.20 | -1.45 | -6.55 | -1.34 | -2.19 | -1.55 | -1.69 | -4.63 |
| 100% | D-ORACLE | -1.09 | -2.27 | -1.35 | -1.57 | -8.44 | -1.01 | -1.35 | -1.12 | -1.16 | -6.45 |
Table 5. Sampled cross-training results for the XSUM dataset.
Table 5 depicts the sampled cross-training results for the XSUM dataset. For a model, the metric scores are computed against the model trained on the 100% training dataset with no content selection. The row with the maximum APC score is highlighted among all five models. For a given r and the BASE model, the individual best scores for the metrics are highlighted in bold.
In order to evaluate the domain generalization capabilities, the present disclosure considers self-trained LIMES variations of the BART (known in the art) and the PEGASUS (known in the art) models. The results are presented in Table 6. It is observed that a self-trained model with content selection shows improved zero-shot performance on novel datasets from unseen domains. D-Oracle consistently performs better than any other LIMES variation and the base model. Extrapolating the present disclosure's observations from the cross-training and self-training experiments, it is believed that a small subset of the dataset as training data and a single training epoch may be sufficient to get strong results on domain adaptation, with additional time saved in training such models with content selection.
| Self-trained model | Model | NARRASUM R1 | NARRASUM R2 | NARRASUM RL | NARRASUM APC | PENS R1 | PENS R2 | PENS RL | PENS APC | Reddit R1 | Reddit R2 | Reddit RL | Reddit APC | WikiSum R1 | WikiSum R2 | WikiSum RL | WikiSum APC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BART (CNN-DM) | BASE | 28.18 | 5.72 | 24.59 | - | 17.96 | 6.65 | 16.21 | - | 17.23 | 3.78 | 14.50 | - | 32.03 | 8.74 | 28.78 | - |
| BART (CNN-DM) | LIMES | 28.40 | 5.80 | 25.14 | +1.47 | 18.08 | 6.75 | 16.43 | +1.17 | 17.68 | 3.80 | 14.91 | +1.98 | 33.97 | 9.46 | 30.52 | +6.78 |
| BART (CNN-DM) | LEAD-K | 28.85 | 5.81 | 25.34 | +2.33 | 17.28 | 6.39 | 15.67 | -3.67 | 17.86 | 4.00 | 15.06 | +4.44 | 33.52 | 9.30 | 30.13 | +5.24 |
| BART (CNN-DM) | ORACLE | 28.27 | 5.76 | 24.84 | +0.67 | 17.75 | 6.62 | 16.17 | -0.62 | 18.38 | 4.01 | 15.48 | +6.50 | 33.37 | 9.39 | 30.03 | +5.32 |
| BART (CNN-DM) | D-ORACLE | 28.80 | 5.87 | 25.30 | +2.56 | 17.87 | 6.69 | 16.21 | +0.03 | 17.92 | 3.87 | 15.12 | +3.55 | 33.55 | 9.50 | 30.26 | +6.19 |
| PEGASUS (CNN-DM) | BASE | 25.16 | 4.97 | 19.97 | - | 19.37 | 7.43 | 16.12 | - | 17.35 | 3.69 | 13.15 | - | 27.68 | 7.36 | 22.73 | - |
| PEGASUS (CNN-DM) | LIMES | 25.50 | 5.25 | 22.34 | +6.28 | 20.01 | 7.63 | 18.01 | +5.90 | 18.02 | 3.84 | 14.98 | +7.28 | 28.99 | 7.84 | 25.94 | +8.45 |
| PEGASUS (CNN-DM) | LEAD-K | 26.04 | 5.15 | 22.77 | +7.04 | 19.41 | 7.28 | 17.42 | +2.08 | 17.40 | 3.68 | 14.49 | +3.40 | 29.23 | 7.76 | 26.14 | +8.67 |
| PEGASUS (CNN-DM) | ORACLE | 25.89 | 5.20 | 22.71 | +7.08 | 19.75 | 7.52 | 17.82 | +4.57 | 17.84 | 3.74 | 14.85 | +5.70 | 28.62 | 7.76 | 25.65 | +7.22 |
| PEGASUS (CNN-DM) | D-ORACLE | 26.30 | 5.31 | 23.03 | +8.89 | 19.66 | 7.53 | 17.63 | +4.07 | 17.42 | 3.60 | 14.51 | +2.76 | 29.38 | 8.02 | 26.33 | +10.31 |
| BART (XSUM) | BASE | 16.25 | 2.83 | 13.53 | - | 24.62 | 8.57 | 20.81 | - | 17.87 | 3.65 | 13.91 | - | 15.59 | 4.57 | 13.77 | - |
| BART (XSUM) | LIMES | 15.99 | 2.30 | 13.29 | -7.36 | 18.67 | 5.01 | 15.35 | -30.64 | 14.74 | 2.17 | 11.43 | -25.29 | 18.25 | 4.22 | 15.51 | +7.34 |
| BART (XSUM) | LEAD-K | 17.63 | 2.80 | 14.55 | +4.99 | 20.29 | 5.97 | 16.96 | -22.14 | 16.85 | 3.13 | 13.14 | -8.49 | 16.20 | 4.63 | 14.33 | +3.09 |
| BART (XSUM) | ORACLE | 17.50 | 2.88 | 14.45 | +5.41 | 21.98 | 6.83 | 18.23 | -14.47 | 18.21 | 3.69 | 13.99 | +1.19 | 16.48 | 4.80 | 14.50 | +5.34 |
| BART (XSUM) | D-ORACLE | 17.40 | 2.83 | 14.34 | +4.35 | 21.05 | 6.32 | 17.49 | -18.90 | 18.19 | 3.51 | 14.00 | -3.75 | 17.50 | 5.08 | 15.39 | +11.72 |
| PEGASUS (XSUM) | BASE | 14.24 | 2.50 | 11.93 | - | 20.36 | 7.28 | 17.38 | - | 14.36 | 2.43 | 11.71 | - | 16.80 | 4.85 | 14.81 | - |
| PEGASUS (XSUM) | LIMES | 13.77 | 2.27 | 11.55 | -5.22 | 19.07 | 6.42 | 16.22 | -8.27 | 14.17 | 2.21 | 11.58 | -3.82 | 14.29 | 3.78 | 12.62 | -17.26 |
| PEGASUS (XSUM) | LEAD-K | 14.30 | 2.46 | 12.02 | -0.14 | 18.97 | 6.46 | 16.14 | -8.40 | 14.24 | 2.14 | 11.56 | -4.68 | 13.29 | 3.78 | 11.78 | -21.13 |
| PEGASUS (XSUM) | ORACLE | 14.94 | 2.62 | 12.49 | +4.80 | 19.03 | 6.48 | 16.18 | -8.14 | 14.57 | 2.40 | 11.96 | +0.78 | 13.38 | 3.78 | 11.89 | -20.71 |
| PEGASUS (XSUM) | D-ORACLE | 15.14 | 2.72 | 12.67 | +7.10 | 20.51 | 7.20 | 17.46 | +0.03 | 14.90 | 2.48 | 12.19 | +3.30 | 13.83 | 4.03 | 12.28 | -17.22 |
Table 6. Zero-Shot adaptation results.
Table 6 depicts the zero-shot adaptation results with the self-trained models. The APC against the BASE model is computed. The row with the maximum APC score among all the LIMES variants is highlighted. For a given dataset, the individual best scores for the metrics are marked in bold.
In the previous section, several merits of using the content selection over the naive training of the summarization model with an unmodified dataset are highlighted. Competitive performance of the trained models against the base models is observed in several configurations. Further, the time saved in training a model with the content selection is observed, which is useful in a real-world setting. Additionally, the challenges in the content selection model that guides the LIMES variants are also recognized. One major limitation is the additional time required for content selection with the LIMES pipeline. To mitigate this, it is posited that the threshold K, which is identified with LIMES on the entire training dataset, could instead be identified on a very small subset of the dataset and then used with the Lead-K (known in the art) and Oracle (known in the art) variants, as sketched below. It would also be interesting to combine the LIMES content selection pipeline with other extractive summarization techniques, including MATCHSUM (known in the art) and BERT-Ext (Bidirectional Encoder Representations from Transformers, known in the art).
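For illustration only, a minimal Python sketch of this idea is given below. The limes_select callable is a hypothetical placeholder for the LIMES pipeline, and estimating K as the median number of units retained on a small subset is an assumption made solely for this sketch.

```python
import statistics
from typing import Callable, List


def estimate_k(sample_docs: List[List[str]],
               limes_select: Callable[[List[str]], List[str]]) -> int:
    """Estimate the selection threshold K on a small subset: run the
    (hypothetical) LIMES selector on each sampled document and take the
    median number of retained units as K."""
    counts = [len(limes_select(units)) for units in sample_docs]
    return max(1, int(statistics.median(counts)))


def lead_k(units: List[str], k: int) -> List[str]:
    """Lead-K selection: keep only the first K document units."""
    return units[:k]


# Usage: K is estimated on, say, 1% of the training documents and then applied
# everywhere with the cheap Lead-K rule instead of the full LIMES pipeline.
# k = estimate_k(small_subset, limes_select=my_limes_fn)
# selected = [lead_k(doc_units, k) for doc_units in all_documents]
```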
Australia leg-spinner Stuart MacGill has announced he will quit international cricket at the end of the ongoing second Test against West Indies. MacGill will retire after 10 years of Test cricket, in which he has taken 207 wickets. The 37-year-old made his Test debut against South Africa 10 years ago and has since gone on to take 207 wickets at an average of 28.28 over 43 Test matches. “Unfortunately now my time is up,” MacGill said. “I am incredibly lucky that as well as providing me with amazing opportunities off the field, my job allows me to test myself in one of Australia’s most highly scrutinised sporting environments. Bowling with some of cricket’s all-time greats such as Glenn McGrath, Shane Warne, Jason Gillespie and Brett Lee has made my job a lot easier. I want to be sure that exciting young bowlers like Mitchell Johnson enjoy the same privilege,” he added. MacGill took the only wicket to fall on a rain-interrupted third day of the Test in Antigua. He had Ramnaresh Sarwan brilliantly caught at slip by Michael Clarke for a well-constructed 65, but otherwise drew blank on a frustrating day for the tourists. The ever dependable Shivnarine Chanderpaul (55 not out) and Dwayne Bravo (29 not out) took the West Indies to the close on 255 for four wickets. They were replying to Australia’s 479 for seven declared and with only two days remaining a draw looks the likely outcome in MacGill’s farewell appearance. Australia won the first Test in Jamaica by 95 runs.
Table 7. Example from the CNN-DM dataset.
Table 7 shows an example from the CNN-DM dataset. The sentences identified as non-summary-worthy by the proposed SWORTS selection approach are highlighted. The underlined phrases cover the facts presented in the first highlighted sentence. The second highlighted sentence captures the sentiment associated with the event (i.e., retirement) and presents very little factual information.
Most climbers who try don’t succeed in summiting the 29,035-foot-high Mount Everest, the world’s tallest peak. But they do leave their trash. Thousands of pounds of it. That’s why an experienced climbing group from the Indian army plans to trek up the 8,850-meter mountain to pick up at least 4,000 kilograms (more than 8,000 pounds) of waste from the high-altitude camps, according to India Today. The mountain is part of the Himalaya Mountain range on the border between Nepal and the Tibet region. The 34-member team plans to depart for Kathmandu on Saturday and start the ascent in mid-May. The upcoming trip marks the 50th anniversary of the first Indian team to scale Mount Everest. “Sadly, Mount Everest is now ... called the world’s highest junkyard” Maj. Ranveer Singh Jamval, the team leader, told India Today. “We will target the mountaineering waste from Camp 1 (19,695 feet) to the summit,” said Jamval, who has scaled Mount Everest twice. “There are old cylinders, tents, tins, packets, equipment and other mountaineering waste. Apart from our own haversacks weighing 10 kg each, we intend to bring in another 10 kg each on the trip.” More than 200 climbers have died attempting to climb the peak, part of a UNESCO World Heritage Site. The Indian expedition isn’t the first attempt to clean up the trash left by generations of hikers. Among the cleanup efforts is the Eco Everest Expedition, an annual trip launched in 2008 that is all about climbing ”in an eco-sensitive manner,” bringing old refuse, in addition to that generated during the trip, down for disposal, according to the Asian Trekking website. Last year, Nepalese tourism authorities started to require hikers to carry out an extra 18 pounds of garbage, in addition to their own trash and human waste, according to the New York Times.
BASE: The 34-member team plans to depart for Kathmandu on Saturday and start the ascent in mid-May. The upcoming trip marks the 50th anniversary of the first Indian team to scale Mount Everest. The mountain is part of the Himalaya mountain range on the border between Nepal and the Tibet region.
LIMES: An Indian army team plans to trek up Mount Everest to pick up tons of waste from high-altitude camps. The trip marks the 50th anniversary of the first Indian team to scale Mount Everest. “Sadly, Mount Everest is now ... called the world’s highest junkyard,” the team leader says.
Lead-K: An experienced climbing group from the Indian army plans to trek up Mount Everest. The 34-member team plans to depart for Kathmandu on Saturday and start the ascent in mid-May. More than 200 climbers have died attempting to climb the 8,850-meter mountain.
Oracle: A 34-member team from the Indian army plans to trek up Mount Everest to pick up at least 4,000 kilograms of waste. The trip marks the 50th anniversary of the first Indian team to scale the mountain. “Sadly, Mount Everest is now ... called the world’s highest junkyard,” team leader says.
D-Oracle: A 34-member team from the Indian army plans to trek up Mount Everest to pick up 4,000 kilograms of waste. The trip marks the 50th anniversary of the first Indian team to scale the mountain. More than 200 climbers have died attempting to climb the 29,035-foot-high mountain.
Table 8. Example from the CNN-DM dataset.
Table 8 shows an example from the CNN-DM dataset. The BASE model is BART CNN-DM. The LIMES and its variant models are the self-trained versions of BART CNN-DM. The text fragment in the source document from where the BASE model has copied the content to generate the summary is highlighted. The phrases in the generated summaries that add the reasoning behind the plan to trek up the mountain are highlighted.
Sportscotland says it has yet to decide where the cuts will fall amid concerns that elite athletes could suffer. The cuts are being blamed on reduced government spending and a drop in National Lottery ticket sales - a major contributor to sports backing. Governing bodies are bracing themselves for potential job losses. About 60% of funding is made up from Scottish Government money, with lottery sales making up the rest. In the year ending 2015, total revenue was £65.1m, which is divided up and awarded to 52 sports. By the end of next year, that figure will have fallen to £51.8m - a cut to the Scottish sporting budget of 20% in just three years. sportscotland chairman Mel Young said: “It’s heartbreaking to me because I know the effort the sport governing bodies and the community that’s around it; the volunteers and the mums and dads. And, to have to say that we’re having to cut some money back is, I believe, not the right way to go.” Former badminton player Susan Egelstaff, a double bronze medallist at the Commonwealth Games [in 2002 and 2006] and an Olympian, fears the cuts will have “a huge impact”. “What that means in the future is that Scotland will be constantly playing catch-up,” she said. “It’s almost impossible to catch up if you fall too far behind the leading nations.” The Scottish Government said sport and physical activity play a “key role” in a healthy Scotland and that “significant” investment would continue in those areas. “Having successfully delivered the Commonwealth Games [in 2014] we are now focusing on protecting or raising investment in areas intended to decrease health inequality and improve life chances, and the small reduction in the sport budget allows us to support those priorities,” Minister for Sport Aileen Campbell said. “There is on-going support for active lifestyles through capital investment in cycling and walking, and over the last 10 years we have invested £168m in sport infrastructure - from grassroots to the elite performance which has increased the facilities and opportunities to get people active and achieve on the world stage at both the Olympics and Commonwealth Games.”
BASE: Scotland’s sport budget is set to be cut by £10m over the next three years, it has been announced.
LIMES: Sport funding in Scotland is to be cut by 20% over the next three years, according to the body that represents the country’s sporting bodies.
Lead-K: Sport Scotland’s budget is to be cut by 20% from next year, it has been announced.
Oracle: Scotland’s sport governing bodies are bracing themselves for a “heartbreaking” cut to the funding of almost £50m over the next three years.
D-Oracle: Scottish sport governing bodies are bracing themselves for a “heartbreaking” cut to their budgets next year.
Table 9. Example from the XSUM dataset.
Table 9 shows an example from the XSUM dataset. The BASE model is BART XSUM. The LIMES and its variant models are the self-trained versions of BART XSUM. The text fragment in the summaries which contains factually incorrect information are highlighted.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The present disclosure provides the system 100 that implements a topic-aware and reference-free approach for identifying summary-worthy segments (content selection) from a document. Experiments with multiple variations of content selection in self-training, cross-training, and domain adaptation settings were conducted. The present disclosure leverages three metrics (namely informativeness, relevance, and redundancy) to identify SWORTS from a document; a simplified illustration of this selection is sketched below. Further, it is found that content selection universally improves the model performance. Specifically, it is observed that dynamic content selection (D-Oracle (known in the art)) performs the best across all variations. From the self-training and cross-training experiments, it is observed that it is possible to achieve strong and comparable results with less data and in less training time. Zero-shot domain adaptation observations show that content selection helps in improving model performance beyond the training dataset (and domain). With extensive experimentation, the importance of content selection for better summarization is reiterated.
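For illustration only, the following minimal Python sketch outlines such a reference-free selection step. The scoring callables and the simple additive combination of scores used here are hypothetical placeholders; the actual informativeness, relevance, redundancy, and perplexity computations are as described in the disclosure and are not reproduced in this sketch.

```python
from typing import Callable, List


def select_sworts(
    units: List[str],
    informativeness: Callable[[str], float],  # placeholder scorers; the actual metric
    relevance: Callable[[str], float],        # definitions follow the disclosure and
    redundancy: Callable[[str], float],       # are not reproduced in this sketch
    perplexity: Callable[[str], float],       # e.g. from an auto-regressive language model
    top_n: int = 5,
    max_ppl: float = 200.0,
) -> List[str]:
    """Reference-free content selection: rank document units by an illustrative
    combination of linguistic and topic-informed scores, then keep only fluent
    units (low perplexity) as the summary-worthy segments (SWORTS)."""
    scored = []
    for u in units:
        # Higher informativeness and relevance are preferred; redundancy is penalised.
        score = informativeness(u) + relevance(u) - redundancy(u)
        scored.append((score, u))
    ranked = [u for _, u in sorted(scored, key=lambda x: x[0], reverse=True)]
    # Fluency gate: discard units the language model finds implausible.
    fluent = [u for u in ranked if perplexity(u) <= max_ppl]
    return fluent[:top_n]
```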
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims:
We Claim:
1. A processor implemented method (300), comprising:
receiving, via one or more hardware processors, a source document D containing one or more document units and a summary document S, wherein the summary document S is used as a reference document for training a content summarization model (302);
identifying, via the one or more hardware processors, a plurality of summary-worthy segments (SWORTS) from the received source document D by computing a plurality of linguistic and topic-informed metric scores for each document unit of the one or more document units comprised in one or more topic-segments of the source document D by at least one of (304):
defining an informativeness of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of:
one or more u-grams in a first order in the one or more topic-segments contributes at least a percentage with respect to the informativeness of the one or more document units; and
one or more u-grams in a second order in the one or more topic-segments signifying an importance pertaining to a key-information, and wherein the informativeness measures a degree to which the one or more document units provides a non-trivial information;
defining a relevance of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of:
computing the relevance of the one or more document units comprised in the source document D against one or more preceding document units comprised in the one or more topic segments of the source document D;
computing the relevance of one or more leading document units comprised in the source document D against the one or more leading document units of the remaining one or more topic segments of the source document D; and
computing the relevance of the one or more leading document units comprised in the source document D against a last document unit of a previous topic segment comprised in the one or more topic segments of the source document D to capture a continuum of a discussion across one or more topics.
defining a redundancy of the one or more document units comprised in the one or more topic-segments of the source document D based on one or more words comprised in the one or more document units which occur frequently in the one or more topic-segments;
leveraging, via the one or more hardware processors, a plurality of document units ranking systems on the identified plurality of summary worthy segments to dynamically select the one or more document units from the source document D, wherein each of the plurality of document units ranking systems considers one or more combinations of the plurality of linguistic and topic-informed metric scores (306);
measuring, via the one or more hardware processors, a fluency of the dynamically selected one or more document units to identify one or more best document units by computing a perplexity of the dynamically selected one or more document units using a language model (308);
training, via the one or more hardware processors, the content summarization model using the identified one or more best documents and the summary document S, wherein the trained content summarization model is used to summarize the source document D (310); and
validating, via the one or more hardware processors, the trained content summarization model on a plurality of datasets to evaluate an accuracy of the trained content summarization model (312).
2. The processor implemented method as claimed in claim 1, wherein the plurality of datasets to evaluate the accuracy of the trained content summarization model includes a CNN-DailyMail dataset and a XSUM dataset.
3. The processor implemented method as claimed in claim 1, wherein the language model for computing the perplexity of the dynamically selected one or more document units includes an auto-regressive language model.
4. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a source document D containing one or more document units and a summary document S, wherein the summary document S is used as a reference document for training a content summarization model;
identify a plurality of summary-worthy segments (SWORTS) from the received source document D by computing a plurality of linguistic and topic-informed metric scores for each document unit of the one or more document units comprised in one or more topic-segments of the source document D by at least one of:
define an informativeness of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of:
one or more u-grams in a first order in the one or more topic-segments contributes at least a percentage with respect to the informativeness of the one or more document units; and
one or more u-grams in a second order in the one or more topic-segments signifying an importance pertaining to a key-information, and wherein the informativeness measures a degree to which the one or more document units provides a non-trivial information;
define a relevance of the one or more document units comprised in the one or more topic-segments of the source document D based on at least one of:
compute the relevance of the one or more document units comprised in the source document D against one or more preceding document units comprised in the one or more topic segments of the source document D;
compute the relevance of one or more leading document units comprised in the source document D against the one or more leading document units of the remaining one or more topic segments of the source document D; and
compute the relevance of the one or more leading document units comprised in the source document D against a last document unit of a previous topic segment comprised in the one or more topic segments of the source document D to capture a continuum of a discussion across one or more topics.
define a redundancy of the one or more document units comprised in the one or more topic-segments of the source document D based on one or more words comprised in the one or more document units which occur frequently in the one or more topic-segments;
leverage a plurality of document unit ranking systems on the identified plurality of summary worthy segments to dynamically select the one or more document units from the source document D, wherein each of the plurality of document unit ranking systems considers one or more combinations of the plurality of linguistic and topic-informed metric scores;
measure a fluency of the dynamically selected one or more document units to identify one or more best document units by computing a perplexity of the dynamically selected one or more document units using a language model;
train the content summarization model using the identified one or more best documents and the summary document S, wherein the trained content summarization model is used to summarize the source document D; and
validate the trained content summarization model on a plurality of datasets to evaluate an accuracy of the trained content summarization model.
5. The system as claimed in claim 4, wherein the plurality of datasets to evaluate the accuracy of the trained content summarization model includes a CNN-DailyMail dataset and a XSUM dataset.
6. The system as claimed in claim 4, wherein the language model for computing the perplexity of the dynamically selected one or more document units includes an auto-regressive language model.
Dated this 18th Day of September 2023
Tata Consultancy Services Limited
By their Agent & Attorney
(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086