Methods And Systems For Annotating Semi-Structured Documents Using Knowledge Graph Schema

Abstract: The disclosure herein relates to methods and systems for annotating semi-structured documents using a knowledge graph (KG) schema. Conventional techniques in the art for the annotation of semi-structured documents are limited. The methods and systems of the present disclosure address this technical problem through a two-stage approach. The first stage uses a probabilistic graphical model (PGM) with structural parameters to capture the complexity of document structure templates (DSTs) and the variations within each DST, by recovering the latent (true) semantic structure of the semi-structured documents. In the second stage, a probabilistic logic model (PLM) is used to jointly annotate the nodes and the paths in the recovered document structures with the entity types and relation types, respectively, present in the KG schema. Further, the methods and systems of the present disclosure discover new entity types and relation types by analyzing patterns in the unannotated document parts.

Patent Information

Application #:
Filing Date: 12 March 2021
Publication Number: 37/2022
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email:
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2025-03-26
Renewal Date:

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai - 400021, Maharashtra, India

Inventors

1. KUNDU, Arpita
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata - 700160, West Bengal, India
2. BHATTACHARYA, Indrajit
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata - 700160, West Bengal, India
3. GHOSH, Subhasish
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata - 700160, West Bengal, India
4. PRAMANICK, Aniket
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata - 700160, West Bengal, India
5. AGARWAL, Puneet
Tata Consultancy Services Limited, Plot No. A-44 & A-45, Ground, 1st to 5th Floor & 10th floor, Block - C & D, Sector - 62, Noida - 201309, Uttar Pradesh, India
6. PATIDAR, Mayur
Tata Consultancy Services Limited, Plot No. A-44 & A-45, Ground, 1st to 5th Floor & 10th floor, Block - C & D, Sector - 62, Noida - 201309, Uttar Pradesh, India

Specification

Claims:
1. A processor-implemented method (200) for annotating one or more semi-structured documents using a knowledge graph (KG) schema, the method (200) comprising the steps of:
receiving, via one or more hardware processors, one or more semi-structured documents to be annotated, wherein each semi-structured document of the one or more semi-structured documents to be annotated comprises one or more textual contents, each textual content of the one or more textual contents comprises a sequence of text tokens (202);
obtaining, via the one or more hardware processors, an observed structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, wherein the observed structure graph for each semi-structured document comprises a plurality of observing nodes and a plurality of observing edges, each observing node of the plurality of observing nodes represents the textual content present in the corresponding semi-structured document, and each observing edge of the plurality of observing edges represents a relation between two observing nodes, and each observing node of the plurality of observing nodes acts as one or more of (i) a parent node, and (ii) a child node, based on the one or more observing edges connected with the corresponding observing node (204);
generating, via the one or more hardware processors, a latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, based on the corresponding observed structure graph, using a trained probabilistic graphical model (PGM), wherein the latent structure graph for each semi-structured document comprises a plurality of latent nodes and a plurality of latent edges, wherein two or more latent edges that are connected consecutively form a latent path to get one or more latent paths, each latent edge of the plurality of latent edges is associated with the relation between two latent nodes, each latent node of the plurality of latent nodes is one of (i) a header node type, and (ii) a value node type, defined based on the associated node type value, and each latent node of the plurality of latent nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more latent edges connected with the corresponding latent node (206); and
annotating, via the one or more hardware processors, jointly, each latent node of the plurality of latent nodes of header node type, and each latent path of the one or more latent paths having two or more latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of a plurality of entity types and one relation type of a plurality of relation types, respectively, present in the KG schema, using a trained probabilistic logic model (PLM) (208).
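
For intuition, the claimed two-stage pipeline can be summarized in a short Python sketch. All names below (Node, StructureGraph, recover, annotate) are illustrative assumptions rather than identifiers from the disclosure; the trained PGM and PLM are treated as opaque objects with one inference method each.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class Node:
        nid: int
        text: str                 # textual content: a sequence of text tokens
        node_type: str = "value"  # "header" or "value"; assigned in the latent graph

    @dataclass
    class StructureGraph:
        nodes: Dict[int, Node] = field(default_factory=dict)
        edges: List[Tuple[int, int]] = field(default_factory=list)  # (parent, child)

        def children(self, nid: int) -> List[int]:
            return [c for p, c in self.edges if p == nid]

    def annotate_documents(observed_graphs, pgm, plm):
        """Two-stage pipeline of claim 1: recover the latent structure of each
        document with the trained PGM (step 206), then jointly annotate header
        nodes and header paths with the trained PLM (step 208)."""
        results = []
        for g_obs in observed_graphs:            # one observed graph per document (204)
            g_lat = pgm.recover(g_obs)           # latent structure graph (206)
            results.append(plm.annotate(g_lat))  # entity/relation type labels (208)
        return results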

2. The method as claimed in claim 1, wherein the trained PGM is generated by:
receiving a plurality of semi-structured documents for training, wherein each semi-structured document of the plurality of semi-structured documents for training, is associated with a document structure template (DST) of one or more document structure templates (DSTs), and comprises the one or more training textual contents arranged according to the associated document structure template, each training textual content of the one or more training textual contents comprises the sequence of text tokens (206a1);
obtaining a training observed structure graph for each semi-structured document of the plurality of semi-structured documents for training, wherein the training observed structure graph for each semi-structured document for training comprises a plurality of training observing nodes and a plurality of training observing edges, each training observing node of the plurality of training observing nodes represents the training textual content present in the corresponding semi-structured document for training, and each training observing edge of the plurality of training observing edges represents the relation between two training observing nodes, and each training observing node of the plurality of training observing nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more training observing edges connected with the associated training observing node (206a2);
receiving a training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training, wherein the training latent structure graph for each semi-structured document for training comprises a plurality of training latent nodes and a plurality of training latent edges, wherein two or more training latent edges that are connected consecutively form a training latent path to get one or more training latent paths, each training latent edge of the plurality of training latent edges is associated with the relation between two training latent nodes, each training latent node of the plurality of training latent nodes is one of (i) the header node type, and (ii) the value node type, defined based on the associated node type value, and each training latent node of the plurality of training latent nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more training latent edges connected with the corresponding training latent node (206a3);
generating the trained PGM by training a PGM with the plurality of semi-structured documents for training, using (i) the training observed structure graphs, and (ii) the training latent structure graphs (206a4), wherein training the PGM with the plurality of semi-structured documents for training comprises:
(a) assigning: (i) an initial DST value for each semi-structured document of the plurality of semi-structured documents for training, (ii) an initial node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training, wherein the initial node type value is one of (i) an initial header node type value and (ii) an initial value node type value (206a4a);
(b) computing an initial parameter value for each model parameter of a set of model parameters, for the plurality of semi-structured documents for training, wherein the set of model parameters comprises: (i) a probability of the initial node type value for each training latent node being a child node of another initial node type value in the corresponding training latent structure graph, (ii) a probability of the initial node type value for each training latent node appearing under a particular initial DST value, (iii) a probability of the initial node type value for each training latent node having a specific training textual content, and (iv) a probability of the initial node type value for each training latent node being left out in the corresponding training observed structure graph (206a4b);
(c) inferring a successive DST value, for each semi-structured document of the plurality of semi-structured documents for training, based on (i) the corresponding training latent structure graph, (ii) the initial node type value for each training latent node present in corresponding training latent structure graph, (iii) the initial parameter value for each model parameter of the set of model parameters, and (iv) a value of score function, wherein the score function is defined using the set of model parameters (206a4c);
(d) inferring a successive node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training, based on (i) the corresponding training latent structure graph, (ii) the corresponding successive DST value, (iii) the initial parameter value for each model parameter of the set of model parameters, and (iv) the value of score function (206a4d);
(e) computing a successive parameter value for each model parameter of the set of model parameters, for the plurality of semi-structured documents for training (206a4e);
(f) repeating the steps (c) through (e), by taking the successive DST value as the initial DST value, the successive node type value as the initial node type value, for each semi-structured document of the plurality of semi-structured documents for training, and the successive parameter value for each model parameter as the initial parameter value, until no change in (i) the successive DST value and (ii) the successive node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training (206a4f); and
(g) extracting (i) the DST value for each semi-structured document of the plurality of semi-structured documents for training, and (ii) the node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training (206a4g).
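
For intuition, the hard-EM loop of steps (a) through (g) is sketched below in Python. To stay short, the sketch keeps only model parameter (ii), the probability of a node's textual content appearing under a given DST value, so it reduces to clustering documents over their header texts with a smoothed multinomial per DST; the full model also estimates parameters (i), (iii) and (iv), and infers per-node type values in step (d). The function name, the input shape (one list of header strings per document) and the smoothing constant are assumptions made for illustration.

    import math
    import random
    from collections import Counter

    def train_pgm(docs, num_dsts, max_iters=50, seed=0, alpha=1.0):
        rng = random.Random(seed)
        dst = [rng.randrange(num_dsts) for _ in docs]                 # (a) initial DST values
        vocab = sorted({h for doc in docs for h in doc})
        for _ in range(max_iters):
            counts = [Counter() for _ in range(num_dsts)]             # (b)/(e) parameter estimates
            for doc, t in zip(docs, dst):
                counts[t].update(doc)
            totals = [sum(c.values()) + alpha * len(vocab) for c in counts]

            def log_score(doc, t):                                    # score function over parameters
                return sum(math.log((counts[t][h] + alpha) / totals[t]) for h in doc)

            new_dst = [max(range(num_dsts), key=lambda t: log_score(doc, t))
                       for doc in docs]                               # (c) re-infer DST values
            if new_dst == dst:                                        # (f) stop when nothing changes
                break
            dst = new_dst
        return dst                                                    # (g) extract assignments

    # Usage: cluster three toy documents into two DSTs.
    docs = [["Name", "Age"], ["Name", "Age", "City"], ["Title", "Author"]]
    print(train_pgm(docs, num_dsts=2))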

3. The method as claimed in claim 1, wherein generating the latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, based on the corresponding observed structure graph, using the trained PGM, further comprising:
(A) initializing the corresponding observed structure graph as the initial latent structure graph (206b1);
(B) assigning (i) an initial node type value for each latent node present in the initial latent structure graph, and (ii) an initial document structure template (DST) value, based on the initial node type value for each latent node present in the initial latent structure graph, wherein the initial node type value is one of (i) an initial header node type value, and (ii) an initial value node type value (206b2);
(C) finding one or more missing header node type values for the corresponding semi-structured document, that are missing in the initial latent structure graph, but present in the training latent structure graph of at least one semi-structured document having the same DST value, out of a plurality of semi-structured documents for training (206b3);
(D) inferring a successive DST value for the corresponding semi-structured document, based on (i) the initial latent structure graph, and (ii) the initial node type value for each latent node present in the initial latent structure graph (206b4);
(E) inferring a successive node type value for each latent node present in the initial latent structure graph, based on (i) the initial latent structure graph, and (ii) the successive DST value for the corresponding semi-structured document (206b5);
(F) inferring one or more successive parent nodes for each latent node present in the initial latent structure graph, based on (i) the corresponding successive node type value for each latent node, and (ii) the successive DST value for the corresponding semi-structured document (206b6);
(G) inserting a new latent node for each missing header node type value, in the initial latent structure graph, based on an associated value of score function determined by the trained PGM (206b7);
(H) repeating the steps (C) through (G), by taking the successive DST value as the initial DST value, and the successive node type value as the initial node type value, for each latent node present in the initial latent structure graph, until no change in (i) the successive DST value, (ii) the successive node type value for each latent node present in the initial latent structure graph and (iii) the one or more successive parent nodes for each latent node present in the initial latent structure graph (206b8); and
(I) considering the initial latent structure graph as the latent structure graph for the corresponding semi-structured document (206b9).
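
The fixed-point iteration of steps (A) through (I) has the following skeleton, reusing the StructureGraph sketched after claim 1. Every method on the hypothetical pgm object (init_assignments, missing_headers, infer_dst and so on) is an illustrative stand-in for the corresponding inference routine of the trained PGM, not an identifier from the disclosure.

    import copy

    def recover_latent_graph(g_obs, pgm):
        """Recover the latent structure graph of one document (claim 3)."""
        g = copy.deepcopy(g_obs)                                    # (A) start from observed graph
        ntype, dst = pgm.init_assignments(g)                        # (B) initial node types and DST
        while True:
            missing = pgm.missing_headers(g, dst)                   # (C) headers seen in training only
            new_dst = pgm.infer_dst(g, ntype)                       # (D)
            new_ntype = pgm.infer_node_types(g, new_dst)            # (E)
            new_parents = pgm.infer_parents(g, new_ntype, new_dst)  # (F)
            for header_value in missing:                            # (G) insert scored new nodes
                pgm.insert_if_score_improves(g, header_value)
            converged = (new_dst == dst and new_ntype == ntype
                         and new_parents == g.edges)                # (H) fixed point reached?
            dst, ntype, g.edges = new_dst, new_ntype, new_parents
            if converged:
                return g                                            # (I)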

4. The method as claimed in claim 1, wherein the trained PLM is generated by:
receiving the KG schema comprising the plurality of entity types and the plurality of relation types, wherein each relation type represents a type of relation between two entity types (208a1);
receiving a plurality of entity type annotations and a plurality of relation type annotations, associated with a plurality of semi-structured documents for training, wherein each entity type annotation of the plurality of entity type annotations represents an annotation of a training latent node of header node type, present in a training latent structure graph of the associated semi-structured document of the plurality of semi-structured documents for training, with the corresponding entity type present in the KG schema, and each relation type annotation represents the annotation of a training latent path having training latent nodes of header node type, present in the training latent structure graph of the associated semi-structured document of the plurality of semi-structured documents for training, with the corresponding relation type present in the KG schema (208a2);
generating the trained PLM by training a PLM with (i) the training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training, (ii) the plurality of entity types and the plurality of relation types present in the KG schema, and (iii) the plurality of entity type annotations and the plurality of relation type annotations associated with the plurality of semi-structured documents for training (208a3), by:
providing each training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training in terms of training grounded facts comprising (i) the grounded facts related to each training latent edge present between two training latent nodes, and (ii) the grounded facts related to node type value for each training latent node, wherein the node type value for each training latent node is one of: (i) a header node type value and (ii) a value node type value (208a3a);
providing training uncertain fact parameters comprising: (i) a probability of the header node type value for each training latent node present in the training latent structure graph, being one of the entity types present in the KG schema, and (ii) a probability of each relation type being present between two entity types present in the KG schema (208a3b);
providing training query fact parameters comprising: (i) a probability of each entity type annotation being associated with the training latent node of header node type, and (ii) a probability of each relation type annotation being associated with the training latent path having latent nodes of header node type (208a3c);
providing training constraints in terms of rules comprising: (i) if the training latent path present in the training latent structure graph for each semi-structured document is annotated with one of the relation types present in the KG schema, then, end training latent nodes of the associated training latent path must be annotated with the respective entity types associated with the corresponding relation type, and (ii) if the training latent path present in the training latent structure graph for each semi-structured document is annotated with one of the relation types present in the KG schema, then, internal training latent nodes present in the associated training latent path must not be annotated with any entity type present in the KG schema (208a3d); and
computing a training uncertain fact parameter value for each training uncertain fact parameter, using an expectation maximization algorithm, based on the plurality of entity type annotations and the plurality of relation type annotations associated with the plurality of semi-structured documents for training, as evidence (208a3e).
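
As a rough illustration of step 208a3e: when every entity type and relation type annotation in the training set is observed, the EM update for the uncertain fact parameters collapses to relative-frequency counting, as sketched below. The input shapes (pairs and triples of strings) are assumptions; a real probabilistic logic engine would additionally enforce the constraint rules of step 208a3d during inference.

    from collections import Counter, defaultdict

    def fit_uncertain_facts(node_annotations, path_annotations):
        """node_annotations: (header text value, entity type) pairs;
        path_annotations: (relation type, end entity 1, end entity 2) triples."""
        # (i) P(entity type | header node type value)
        by_header = defaultdict(Counter)
        for header_value, entity_type in node_annotations:
            by_header[header_value][entity_type] += 1
        p_entity = {h: {e: c / sum(cnt.values()) for e, c in cnt.items()}
                    for h, cnt in by_header.items()}
        # (ii) P(relation type between an entity type pair)
        rel_counts = Counter(path_annotations)
        total = sum(rel_counts.values())
        p_rel = {triple: c / total for triple, c in rel_counts.items()}
        return p_entity, p_rel

    # Usage on toy annotations:
    nodes = [("Name", "Person"), ("Name", "Person"), ("Org", "Organization")]
    paths = [("worksAt", "Person", "Organization")]
    print(fit_uncertain_facts(nodes, paths))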

5. The method as claimed in claim 1, wherein jointly annotating each latent node of the plurality of latent nodes of header node type, and each latent path of the plurality of latent paths having latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of the plurality of entity types and one relation type of the plurality of relation types, respectively, present in the KG schema, using the trained PLM, further comprising:
providing the corresponding latent structure graph in terms of grounded facts comprising (i) the grounded facts related to each latent edge present between the two latent nodes, and (ii) the grounded facts related to node type value for each latent node, wherein the node type value for each latent node is one of: (i) the header node type value and (ii) the value node type value (208b1); and
inferring the probability of each relation type annotation being associated with each latent path having latent nodes of header node type, present in the latent structure graph, to: (i) annotate the associated latent path with the associated relation type annotation having the maximum probability; and (ii) annotate end latent nodes of header node type present in the associated latent path, with the entity type annotation corresponding to the associated relation type annotation (208b2).
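
A minimal sketch of this inference step, assuming the trained PLM exposes, for each header path, a distribution over relation types, and that the KG schema maps each relation type to its pair of end entity types (both inputs below are hypothetical):

    def annotate_paths(p_rel_given_path, rel_signature):
        """Claim 5: annotate each header path with its maximum-probability
        relation type, then label the path's end nodes with the entity
        types that relation connects (step 208b2)."""
        annotations = {}
        for path, dist in p_rel_given_path.items():
            rel = max(dist, key=dist.get)          # (i) argmax over relation types
            annotations[path] = rel
            head_type, tail_type = rel_signature[rel]
            annotations[path[0]] = head_type       # (ii) end latent nodes get the
            annotations[path[-1]] = tail_type      #      corresponding entity types
        return annotations

    # Usage on a toy header path n1 -> n2 -> n3:
    dist = {("n1", "n2", "n3"): {"worksAt": 0.8, "locatedIn": 0.2}}
    sig = {"worksAt": ("Person", "Organization"), "locatedIn": ("Organization", "City")}
    print(annotate_paths(dist, sig))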

6. The method as claimed in claim 5, further comprising:
(A) identifying, out of the plurality of latent nodes, one or more latent nodes having the same header node type value that are not annotated with one of the entity types present in the KG schema (208c1);
(B) discovering a new entity type for the identified one or more latent nodes having the same header node type value, if a number of the identified one or more latent nodes having the same header node type value is greater than a first threshold value (208c2);
(C) annotating the identified one or more latent nodes having the same header node type value, with the discovered new entity type, and adding the discovered new entity type to the KG schema (208c3);
(D) identifying one or more latent paths having the same header node type value sequence that are not annotated with one of the relation types present in the KG schema, and whose internal latent nodes are not annotated with any entity type present in the KG schema (208c4);
(E) discovering a new relation type for the associated header node type value sequence, if a number of the identified one or more latent paths having the same header node type value sequence is greater than a second threshold value (208c5);
(F) annotating the identified one or more latent paths having the same header node type value sequence, with the discovered new relation type, and adding the discovered new relation type to the KG schema (208c6); and
(G) repeating the steps (A) through (F), until (i) no new entity type and (ii) no new relation type is discovered (208c7).
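
Steps (A) and (B) amount to thresholded grouping of the still-unannotated header values; new relation types in steps (D) and (E) are discovered analogously over header value sequences. A hedged sketch, with an assumed input of header text values and a hypothetical naming scheme for the minted types:

    from collections import Counter

    def discover_new_entity_types(unannotated_header_values, first_threshold):
        """Mint a new entity type for every header value that labels more
        unannotated latent nodes than the first threshold (steps A-B)."""
        counts = Counter(unannotated_header_values)
        return {value: f"NewEntity::{value}"      # hypothetical naming scheme
                for value, n in counts.items() if n > first_threshold}

    # Usage: "Reviewer" heads 3 unannotated nodes, "Misc" only 1.
    print(discover_new_entity_types(["Reviewer"] * 3 + ["Misc"], first_threshold=2))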

7. A system (100) for annotating semi-structured documents using a knowledge graph (KG) schema, the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive one or more semi-structured documents to be annotated, wherein each semi-structured document of the one or more semi-structured documents to be annotated comprises one or more textual contents, each textual content of the one or more textual contents comprises a sequence of text tokens;
obtain an observed structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, wherein the observed structure graph for each semi-structured document comprises a plurality of observing nodes and a plurality of observing edges, each observing node of the plurality of observing nodes represents the textual content present in the corresponding semi-structured document, and each observing edge of the plurality of observing edges represents a relation between two observing nodes, and each observing node of the plurality of observing nodes acts as one or more of (i) a parent node, and (ii) a child node, based on the one or more observing edges connected with the corresponding observing node;
generate a latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, based on the corresponding observed structure graph, using a trained probabilistic graphical model (PGM), wherein the latent structure graph for each semi-structured document comprises a plurality of latent nodes and a plurality of latent edges, wherein two or more latent edges that are connected consecutively form a latent path to get one or more latent paths, each latent edge of the plurality of latent edges is associated with the relation between two latent nodes, each latent node of the plurality of latent nodes is one of (i) a header node type, and (ii) a value node type, defined based on the associated node type value, and each latent node of the plurality of latent nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more latent edges connected with the corresponding latent node; and
annotate jointly, each latent node of the plurality of latent nodes of header node type, and each latent path of the one or more latent paths having two or more latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of a plurality of entity types and one relation type of a plurality of relation types, respectively, present in the KG schema, using a trained probabilistic logic model (PLM).

8. The system as claimed in claim 7, wherein the one or more hardware processors (104) are further configured to generate the trained PGM, by:
receiving a plurality of semi-structured documents for training, wherein each semi-structured document of the plurality of semi-structured documents for training, is associated with a document structure template (DST) of one or more document structure templates (DSTs), and comprises the one or more training textual contents arranged according to the associated document structure template, each training textual content of the one or more training textual contents comprises the sequence of text tokens;
obtaining a training observed structure graph for each semi-structured document of the plurality of semi-structured documents for training, wherein the training observed structure graph for each semi-structured document for training comprises a plurality of training observing nodes and a plurality of training observing edges, each training observing node of the plurality of training observing nodes represents the training textual content present in the corresponding semi-structured document for training, and each training observing edge of the plurality of training observing edges represents the relation between two training observing nodes, and each training observing node of the plurality of training observing nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more training observing edges connected with the associated training observing node;
receiving a training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training, wherein the training latent structure graph for each semi-structured document for training comprises a plurality of training latent nodes and a plurality of training latent edges, wherein two or more training latent edges that are connected consecutively form a training latent path to get one or more training latent paths, each training latent edge of the plurality of training latent edges is associated with the relation between two training latent nodes, each training latent node of the plurality of training latent nodes is one of (i) the header node type, and (ii) the value node type, defined based on the associated node type value, and each training latent node of the plurality of training latent nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more training latent edges connected with the corresponding training latent node;
generating the trained PGM by training a PGM with the plurality of semi-structured documents for training, using (i) the training observed structure graphs, and (ii) the training latent structure graphs, wherein training the PGM with the plurality of semi-structured documents for training comprises:
(a) assigning: (i) an initial DST value for each semi-structured document of the plurality of semi-structured documents for training, (ii) an initial node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training, wherein the initial node type value is one of (i) an initial header node type value and (ii) an initial value node type value;
(b) computing an initial parameter value for each model parameter of a set of model parameters, for the plurality of semi-structured documents for training, wherein the set of model parameters comprises: (i) a probability of the initial node type value for each training latent node being a child node of another initial node type value in the corresponding training latent structure graph, (ii) a probability of the initial node type value for each training latent node appearing under a particular initial DST value, (iii) a probability of the initial node type value for each training latent node having a specific training textual content, and (iv) a probability of the initial node type value for each training latent node being left out in the corresponding training observed structure graph;
(c) inferring a successive DST value, for each semi-structured document of the plurality of semi-structured documents for training, based on (i) the corresponding training latent structure graph, (ii) the initial node type value for each training latent node present in corresponding training latent structure graph, (iii) the initial parameter value for each model parameter of the set of model parameters, and (iv) a value of score function, wherein the score function is defined using the set of model parameters;
(d) inferring a successive node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training, based on (i) the corresponding training latent structure graph, (ii) the corresponding successive DST value, (iii) the initial parameter value for each model parameter of the set of model parameters, and (iv) the value of score function;
(e) computing a successive parameter value for each model parameter of the set of model parameters, for the plurality of semi-structured documents for training;
(f) repeating the steps (c) through (e), by taking the successive DST value as the initial DST value, the successive node type value as the initial node type value, for each semi-structured document of the plurality of semi-structured documents for training, and the successive parameter value for each model parameter as the initial parameter value, until no change in (i) the successive DST value and (ii) the successive node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training; and
(g) extracting (i) the DST value for each semi-structured document of the plurality of semi-structured documents for training, and (ii) the node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training.

9. The system as claimed in claim 7, wherein the one or more hardware processors (104) are further configured to generate the latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, based on the corresponding observed structure graph, using the trained PGM, by:
(A) initializing the corresponding observed structure graph as the initial latent structure graph;
(B) assigning (i) an initial node type value for each latent node present in the initial latent structure graph, and (ii) an initial document structure template (DST) value, based on the initial node type value for each latent node present in the initial latent structure graph, wherein the initial node type value is one of (i) an initial header node type value, and (ii) an initial value node type value;
(C) finding one or more missing header node type values for the corresponding semi-structured document, that are missing in the initial latent structure graph, but present in the training latent structure graph of at least one semi-structured document having the same DST value, out of a plurality of semi-structured documents for training;
(D) inferring a successive DST value for the corresponding semi-structured document, based on (i) the initial latent structure graph, and (ii) the initial node type value for each latent node present in the initial latent structure graph;
(E) inferring a successive node type value for each latent node present in the initial latent structure graph, based on (i) the initial latent structure graph, and (ii) the successive DST value for the corresponding semi-structured document;
(F) inferring one or more successive parent nodes for each latent node present in the initial latent structure graph, based on (i) the corresponding successive node type value for each latent node, and (ii) the successive DST value for the corresponding semi-structured document;
(G) inserting a new latent node for each missing header node type value, in the initial latent structure graph, based on an associated value of score function determined by the trained PGM;
(H) repeating the steps (C) through (G), by taking the successive DST value as the initial DST value, and the successive node type value as the initial node type value, for each latent node present in the initial latent structure graph, until no change in (i) the successive DST value, (ii) the successive node type value for each latent node present in the initial latent structure graph and (iii) the one or more successive parent nodes for each latent node present in the initial latent structure graph; and
(I) considering the initial latent structure graph as the latent structure graph for the corresponding semi-structured document.

10. The system as claimed in claim 7, wherein the one or more hardware processors (104) are further configured to generate the trained PLM, by:
receiving the KG schema comprising the plurality of entity types and the plurality of relation types, wherein each relation type represents a type of relation between two entity types;
receiving a plurality of entity type annotations and a plurality of relation type annotations, associated with a plurality of semi-structured documents for training, wherein each entity type annotation of the plurality of entity type annotations represents an annotation of a training latent node of header node type, present in a training latent structure graph of the associated semi-structured document of the plurality of semi-structured documents for training, with the corresponding entity type present in the KG schema, and each relation type annotation represents the annotation of a training latent path having training latent nodes of header node type, present in the training latent structure graph of the associated semi-structured document of the plurality of semi-structured documents for training, with the corresponding relation type present in the KG schema;
generating the trained PLM by training a PLM with (i) the training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training, (ii) the plurality of entity types and the plurality of relation types present in the KG schema, and (iii) the plurality of entity type annotations and the plurality of relation type annotations associated with the plurality of semi-structured documents for training, by:
providing each training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training in terms of training grounded facts comprising (i) the grounded facts related to each training latent edge present between two training latent nodes, and (ii) the grounded facts related to node type value for each training latent node, wherein the node type value for each training latent node is one of: (i) a header node type value and (ii) a value node type value;
providing training uncertain fact parameters comprising: (i) a probability of the header node type value for each training latent node present in the training latent structure graph, being one of the entity types present in the KG schema, and (ii) a probability of each relation type being present between two entity types present in the KG schema;
providing training query fact parameters comprising: (i) a probability of each entity type annotation being associated with the training latent node of header node type, and (ii) a probability of each relation type annotation being associated with the training latent path having latent nodes of header node type;
providing training constraints in terms of rules comprising: (i) if the training latent path present in the training latent structure graph for each semi-structured document is annotated with one of the relation types present in the KG schema, then, end training latent nodes of the associated training latent path must be annotated with the respective entity types associated with the corresponding relation type, and (ii) if the training latent path present in the training latent structure graph for each semi-structured document is annotated with one of the relation types present in the KG schema, then, internal training latent nodes present in the associated training latent path must not be annotated with any entity type present in the KG schema; and
computing a training uncertain fact parameter value for each training uncertain fact parameter, using an expectation maximization algorithm, based on the plurality of entity type annotations and the plurality of relation type annotations associated with the plurality of semi-structured documents for training, as evidence.

11. The system as claimed in claim 7, wherein the one or more hardware processors (104) are further configured to jointly annotate each latent node of the plurality of latent nodes of header node type, and each latent path of the plurality of latent paths having latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of the plurality of entity types and one relation type of the plurality of relation types, respectively, present in the KG schema, using the trained PLM, by:
providing the corresponding latent structure graph in terms of grounded facts comprising (i) the grounded facts related to each latent edge present between the two latent nodes, and (ii) the grounded facts related to node type value for each latent node, wherein the node type value for each latent node is one of: (i) the header node type value and (ii) the value node type value; and
inferring the probability of each relation type annotation being associated with each latent path having latent nodes of header node type, present in the latent structure graph, to: (i) annotate the associated latent path with the associated relation type annotation having the maximum probability; and (ii) annotate end latent nodes of header node type present in the associated latent path, with the entity type annotation corresponding to the associated relation type annotation.

12. The system as claimed in claim 11, wherein the one or more hardware processors (104) are further configured to:
(A) identify, out of the plurality of latent nodes, one or more latent nodes having the same header node type value that are not annotated with one of the entity types present in the KG schema;
(B) discover a new entity type for the identified one or more latent nodes having the same header node type value, if a number of the identified one or more latent nodes having the same header node type value is greater than a first threshold value;
(C) annotate the identified one or more latent nodes having the same header node type value, with the discovered new entity type, and add the discovered new entity type to the KG schema;
(D) identify one or more latent paths having the same header node type value sequence that are not annotated with one of the relation types present in the KG schema, and whose internal latent nodes are not annotated with any entity type present in the KG schema;
(E) discover a new relation type for the associated header node type value sequence, if a number of the identified one or more latent paths having the same header node type value sequence is greater than a second threshold value;
(F) annotate the identified one or more latent paths having the same header node type value sequence, with the discovered new relation type, and add the discovered new relation type to the KG schema; and
(G) repeat the steps (A) through (F), until (i) no new entity type and (ii) no new relation type is discovered.
Description:

FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:
METHODS AND SYSTEMS FOR ANNOTATING SEMI-STRUCTURED DOCUMENTS USING KNOWLEDGE GRAPH SCHEMA

Applicant:
Tata Consultancy Services Limited
A company incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to the field of annotation, and, more particularly, to methods and systems for annotating semi-structured documents using a knowledge graph schema.

BACKGROUND
Semi-structured documents such as web-pages, electronic reports, slide decks, and so on, are important information sources and have seen much interest in recent times. The important information may be present in the form of textual contents. The structure of such documents, in general, may contain significant evidence about semantic relations between the textual blocks present in the documents. However, in the real world, the structure of such documents is more complex and interesting than that of plain text documents.
One view of such documents is that of complex tables, with cells that span multiple rows and columns; however, most of the annotation literature has considered only simple tables. A second view is a directed acyclic graph (DAG) or tree-like structural view of the semi-structured documents. The conventional techniques in the art are limited in dealing with the document structure of the semi-structured documents and ignore much of its complexity. Firstly, the observed visual structure may not be the true semantic structure. The textual contents are not always related to their closest neighbors, and the observed visual structure also often leaves out header textual contents. Secondly, the documents in any corpus usually correspond to a small number of document structure templates (DSTs), but the DST for an individual document is often implicit. Moreover, the structure of the individual documents within a DST contains minor variations within the theme (template).
Further, the annotation process for such semi-structured documents may utilize a knowledge graph (KG) schema that contains entity types and binary relation types between the entity types. In practice, the structure of KG schemas may be quite complex, and they generally take the form of directed multi-graphs. Because of the complex document structures of the semi-structured documents and the complexity of the KG schemas, the nature of the annotation also becomes complex. For example, not all the nodes in the document structures correspond to entity types. As a result, individual edges in the document structures cannot be annotated with the relation types present in the KG schema; instead, the relation types correspond to document structure paths. This may lead to structural constraints on the annotation, which may call for joint annotation of the nodes and paths in the document structure with the entity types and the relation types present in the KG schema.
Furthermore, the available KG schemas for a domain may be precise, but far from complete. A typical document corpus mentions many entity types and relation types that may not be specified in the KG schema. Such new entity and relation types need to be discovered from the corpus and incorporated into the KG schema.
Conventional techniques in the art for the annotation of semi-structured documents are limited, especially in dealing with the complexities involved. Further, the conventional techniques in the table annotation literature consider only simple table structures, where the table structure is completely observed and there may be no notion of document structure templates. Some conventional techniques in the art for the annotation of the semi-structured documents are limited to simple star-shaped schemas, for which independent annotation of individual document edges is sufficient.

SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
In an aspect, there is provided a processor-implemented method for annotating one or more semi-structured documents using a knowledge graph (KG) schema, the method comprising the steps of: receiving one or more semi-structured documents to be annotated, wherein each semi-structured document of the one or more semi-structured documents to be annotated comprises one or more textual contents, each textual content of the one or more textual contents comprises a sequence of text tokens; obtaining an observed structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, wherein the observed structure graph for each semi-structured document comprises a plurality of observing nodes and a plurality of observing edges, each observing node of the plurality of observing nodes represents the textual content present in the corresponding semi-structured document, and each observing edge of the plurality of observing edges represents a relation between two observing nodes, and each observing node of the plurality of observing nodes acts as one or more of (i) a parent node, and (ii) a child node, based on the one or more observing edges connected with the corresponding observing node; generating a latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, based on the corresponding observed structure graph, using a trained probabilistic graphical model (PGM), wherein the latent structure graph for each semi-structured document comprises a plurality of latent nodes and a plurality of latent edges, wherein two or more latent edges that are connected consecutively form a latent path to get one or more latent paths, each latent edge of the plurality of latent edges is associated with the relation between two latent nodes, each latent node of the plurality of latent nodes is one of (i) a header node type, and (ii) a value node type, defined based on the associated node type value, and each latent node of the plurality of latent nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more latent edges connected with the corresponding latent node; and annotating jointly, each latent node of the plurality of latent nodes of header node type, and each latent path of the one or more latent paths having two or more latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of a plurality of entity types and one relation type of a plurality of relation types, respectively, present in the KG schema, using a trained probabilistic logic model (PLM).
In another aspect, there is provided a system for annotating semi-structured documents using a knowledge graph (KG) schema, the system comprising: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: receive one or more semi-structured documents to be annotated, wherein each semi-structured document of the one or more semi-structured documents to be annotated comprises one or more textual contents, each textual content of the one or more textual contents comprises a sequence of text tokens; obtain an observed structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, wherein the observed structure graph for each semi-structured document comprises a plurality of observing nodes and a plurality of observing edges, each observing node of the plurality of observing nodes represents the textual content present in the corresponding semi-structured document, and each observing edge of the plurality of observing edges represents a relation between two observing nodes, and each observing node of the plurality of observing nodes acts as one or more of (i) a parent node, and (ii) a child node, based on the one or more observing edges connected with the corresponding observing node; generate a latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, based on the corresponding observed structure graph, using a trained probabilistic graphical model (PGM), wherein the latent structure graph for each semi-structured document comprises a plurality of latent nodes and a plurality of latent edges, wherein two or more latent edges that are connected consecutively form a latent path to get one or more latent paths, each latent edge of the plurality of latent edges is associated with the relation between two latent nodes, each latent node of the plurality of latent nodes is one of (i) a header node type, and (ii) a value node type, defined based on the associated node type value, and each latent node of the plurality of latent nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more latent edges connected with the corresponding latent node; and annotate jointly, each latent node of the plurality of latent nodes of header node type, and each latent path of the one or more latent paths having two or more latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of a plurality of entity types and one relation type of a plurality of relation types, respectively, present in the KG schema, using a trained probabilistic logic model (PLM).
In yet another aspect, there is provided a computer program product comprising a non-transitory computer readable medium having a computer readable program embodied therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: receive one or more semi-structured documents to be annotated, wherein each semi-structured document of the one or more semi-structured documents to be annotated comprises one or more textual contents, each textual content of the one or more textual contents comprises a sequence of text tokens; obtain an observed structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, wherein the observed structure graph for each semi-structured document comprises a plurality of observing nodes and a plurality of observing edges, each observing node of the plurality of observing nodes represents the textual content present in the corresponding semi-structured document, and each observing edge of the plurality of observing edges represents a relation between two observing nodes, and each observing node of the plurality of observing nodes acts as one or more of (i) a parent node, and (ii) a child node, based on the one or more observing edges connected with the corresponding observing node; generate a latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, based on the corresponding observed structure graph, using a trained probabilistic graphical model (PGM), wherein the latent structure graph for each semi-structured document comprises a plurality of latent nodes and a plurality of latent edges, wherein two or more latent edges that are connected consecutively form a latent path to get one or more latent paths, each latent edge of the plurality of latent edges is associated with the relation between two latent nodes, each latent node of the plurality of latent nodes is one of (i) a header node type, and (ii) a value node type, defined based on the associated node type value, and each latent node of the plurality of latent nodes acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more latent edges connected with the corresponding latent node; and annotate jointly, each latent node of the plurality of latent nodes of header node type, and each latent path of the one or more latent paths having two or more latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of a plurality of entity types and one relation type of a plurality of relation types, respectively, present in the KG schema, using a trained probabilistic logic model (PLM).
In an embodiment, the trained probabilistic graphical model (PGM) is generated by: receiving a plurality of semi-structured documents for training, wherein each semi-structured document of the plurality of semi-structured documents for training, is associated with a document structure template (DST) of one or more document structure templates (DSTs), and comprises the one or more training textual contents arranged according to the associated document structure template, each training textual content of the one or more training textual contents comprises the sequence of text tokens; obtaining a training observed structure graph for each semi-structured document of the plurality of semi-structured documents for training, wherein the training observed structure graph for each semi-structured document for training comprises a plurality of training observing nodes and a plurality of training observing edges, each training observing node of the plurality of training observing nodes represents the training textual content present in the corresponding semi-structured document for training, and each training observing edge of the plurality of training observing edge represents the relation between two training observing nodes, and each training observing node of the plurality of training observing nodes act as one or more of (i) the parent node, and (ii) the child node, based on the one or more training observing edges connected with the associated training observing node; receiving a training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training, wherein the training latent structure graph for each semi-structured document for training comprises the plurality of training latent nodes and a plurality of training latent edges, wherein two or more training latent edges that are connected consecutively forms a training latent path to get one or more training latent paths, each training latent edge of the plurality of training latent edges is associated with the relation between two training latent nodes, each training latent node of the plurality of training latent nodes is one of (i) the header node type, and (ii) the value node type, defined based on the associated node type value, and each training latent node of the plurality of training latent nodes act as one or more of (i) the parent node, and (ii) the child node, based on the one or more latent edges connected between the corresponding latent node; generating the trained PGM by training a PGM with the plurality of semi-structured documents for training, using (i) the training observed structure graphs, and (ii) the training latent structure graphs, wherein training the PGM with the plurality of semi-structured documents for training comprises: (a) assigning: (i) an initial DST value for each semi-structured document of the plurality of semi-structured documents for training, (ii) an initial node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training, wherein the initial node type value is one of (i) an initial header node type value and (ii) an initial value node type value; (b) computing an initial parameter value for each model parameter of a set of model parameters, for the plurality of semi-structured documents for training, wherein the set of model parameters comprises: (i) a probability of the initial node type value for each training latent node being a 
child node of another initial node type value in the corresponding training latent structure graph, (ii) a probability of the initial node type value for each training latent node appearing under a particular initial DST value, (iii) a probability of the initial node type value for each training latent node having a specific training textual content, and (iv) a probability of the initial node type value for each training latent node being left out in the corresponding training observed structure graph; (c) inferring a successive DST value, for each semi-structured document of the plurality of semi-structured documents for training, based on (i) the corresponding training latent structure graph, (ii) the initial node type value for each training latent node present in the corresponding training latent structure graph, (iii) the initial parameter value for each model parameter of the set of model parameters, and (iv) a value of a score function, wherein the score function is defined using the set of model parameters; (d) inferring a successive node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training, based on (i) the corresponding training latent structure graph, (ii) the corresponding successive DST value, (iii) the initial parameter value for each model parameter of the set of model parameters, and (iv) the value of the score function; (e) computing a successive parameter value for each model parameter of the set of model parameters, for the plurality of semi-structured documents for training; (f) repeating the steps (c) through (e), by taking the successive DST value as the initial DST value, the successive node type value as the initial node type value, for each semi-structured document of the plurality of semi-structured documents for training, and the successive parameter value for each model parameter as the initial parameter value, until there is no change in (i) the successive DST value and (ii) the successive node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training; and (g) extracting (i) the DST value for each semi-structured document of the plurality of semi-structured documents for training, and (ii) the node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training.
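The training procedure of steps (a) through (g) can be illustrated with a short Python sketch. This is a minimal sketch only, assuming hypothetical helpers infer_dst, infer_node_types, and estimate_parameters that stand in for the score-function-based computations described above; it is not the claimed implementation.

def train_pgm(docs, infer_dst, infer_node_types, estimate_parameters):
    # Steps (a)-(b): initial DST values, node type values, and parameters
    dst = {d["id"]: d["initial_dst"] for d in docs}
    types = {d["id"]: dict(d["initial_node_types"]) for d in docs}
    params = estimate_parameters(docs, dst, types)
    while True:
        changed = False
        for d in docs:
            # Step (c): infer a successive DST value for the document
            new_dst = infer_dst(d, types[d["id"]], params)
            # Step (d): infer a successive node type value per latent node
            new_types = infer_node_types(d, new_dst, params)
            if new_dst != dst[d["id"]] or new_types != types[d["id"]]:
                changed = True
            dst[d["id"]], types[d["id"]] = new_dst, new_types
        # Step (e): recompute the parameter value of each model parameter
        params = estimate_parameters(docs, dst, types)
        if not changed:           # Step (f): stop when nothing changes
            break
    return dst, types, params     # Step (g): extract the final assignments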
In an embodiment, the latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated is generated, based on the corresponding observed structure graph, using the trained probabilistic graphical model (PGM), by: (A) initializing the corresponding observed structure graph as the initial latent structure graph; (B) assigning (i) an initial node type value for each latent node present in the initial latent structure graph, and (ii) an initial document structure template (DST) value, based on the initial node type value for each latent node present in the initial latent structure graph, wherein the initial node type value is one of (i) an initial header node type value, and (ii) an initial value node type value; (C) finding one or more missing header node type values for the corresponding semi-structured document, that are missing in the initial latent structure graph but present in the training latent structure graph of at least one semi-structured document having the same DST value, out of a plurality of semi-structured documents for training; (D) inferring a successive DST value for the corresponding semi-structured document, based on (i) the initial latent structure graph, and (ii) the initial node type value for each latent node present in the initial latent structure graph; (E) inferring a successive node type value for each latent node present in the initial latent structure graph, based on (i) the initial latent structure graph, and (ii) the successive DST value for the corresponding semi-structured document; (F) inferring one or more successive parent nodes for each latent node present in the initial latent structure graph, based on (i) the corresponding successive node type value for each latent node, and (ii) the successive DST value for the corresponding semi-structured document; (G) inserting a new latent node for each missing header node type value, in the initial latent structure graph, based on an associated value of the score function determined by the trained PGM; (H) repeating the steps (C) through (G), by taking the successive DST value as the initial DST value, and the successive node type value as the initial node type value, for each latent node present in the initial latent structure graph, until there is no change in (i) the successive DST value, (ii) the successive node type value for each latent node present in the initial latent structure graph, and (iii) the one or more successive parent nodes for each latent node present in the initial latent structure graph; and (I) considering the initial latent structure graph as the latent structure graph for the corresponding semi-structured document.
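For illustration, the inference loop of steps (A) through (I) may be sketched in Python as below; the trained_pgm interface (initial_node_types, initial_dst, infer_dst, infer_node_types, infer_parents, find_missing_headers, insertion_score) and the graph methods are hypothetical placeholders for the operations described in this embodiment, not the claimed API.

def recover_latent_structure(observed_graph, trained_pgm):
    graph = observed_graph.copy()                         # Step (A)
    node_types = trained_pgm.initial_node_types(graph)    # Step (B)
    dst = trained_pgm.initial_dst(graph, node_types)
    while True:
        # Step (C): header node type values seen in training documents of
        # the same DST but missing from the current latent structure graph
        missing = trained_pgm.find_missing_headers(graph, dst)
        new_dst = trained_pgm.infer_dst(graph, node_types)            # Step (D)
        new_types = trained_pgm.infer_node_types(graph, new_dst)      # Step (E)
        new_parents = trained_pgm.infer_parents(graph, new_types, new_dst)  # Step (F)
        for header in missing:                                        # Step (G)
            if trained_pgm.insertion_score(graph, header) > 0:
                graph.insert_header_node(header)
        # Step (H): repeat until DST, node types and parent nodes are stable
        if (new_dst == dst and new_types == node_types
                and new_parents == graph.parents()):
            return graph                                              # Step (I)
        dst, node_types = new_dst, new_types
        graph.set_parents(new_parents)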
In an embodiment, the trained probabilistic logic model (PLM) is generated by: receiving the KG schema comprising the plurality of entity types and the plurality of relation types, wherein each relation type represents a type of relation between two entity types; receiving a plurality of entity type annotations and a plurality of relation type annotations, associated with a plurality of semi-structured documents for training, wherein each entity type annotation of the plurality of entity type annotations represents an annotation of a training latent node of header node type, present in a training latent structure graph of the associated semi-structured document of the plurality of semi-structured documents for training, with the corresponding entity type present in the KG schema, and each relation type annotation represents the annotation of a training latent path having training latent nodes of header node type, present in the training latent structure graph of the associated semi-structured document of the plurality of semi-structured documents for training, with the corresponding relation type present in the KG schema; generating the trained PLM by training a PLM with (i) the training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training, (ii) the plurality of entity types and the plurality of relation types present in the KG schema, and (iii) the plurality of entity type annotations and the plurality of relation type annotations associated with the plurality of semi-structured documents for training, by: providing each training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training in terms of training grounded facts comprising (i) the grounded facts related to each training latent edge present between two training latent nodes, and (ii) the grounded facts related to the node type value for each training latent node, wherein the node type value for each training latent node is one of: (i) a header node type value and (ii) a value node type value; providing training uncertain fact parameters comprising: (i) a probability of the header node type value for each training latent node present in the training latent structure graph being one of the entity types present in the KG schema, and (ii) a probability of each relation type being present between two entity types present in the KG schema; providing training query fact parameters comprising: (i) a probability of each entity type annotation being associated with the training latent node of header node type, and (ii) a probability of each relation type annotation being associated with the training latent path having latent nodes of header node type; providing training constraints in terms of rules comprising: (i) if the training latent path present in the training latent structure graph for each semi-structured document is annotated with one of the relation types present in the KG schema, then the end training latent nodes of the associated training latent path must be annotated with the respective entity types associated with the corresponding relation type, and (ii) if the training latent path present in the training latent structure graph for each semi-structured document is annotated with one of the relation types present in the KG schema, then the internal training latent nodes present in the associated training latent path must not be annotated with any entity type present in the KG schema; and computing a training uncertain fact
parameter value for each training uncertain fact parameter, using an expectation maximization algorithm, based on the plurality of entity type annotations and the plurality of relation type annotations associated with the plurality of semi-structured documents for training, as evidence.
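The two training constraints can be made concrete with a small Python check. This is a minimal sketch, assuming simple dictionary and list structures for the annotations and schema; all names here are illustrative, not the claimed PLM encoding.

def satisfies_constraints(path, relation_type, entity_annotation, kg_schema):
    """path: list of latent node ids (all of header node type).
    entity_annotation: dict mapping node id -> entity type or None.
    kg_schema: dict mapping relation type -> (head entity, tail entity)."""
    head_entity, tail_entity = kg_schema[relation_type]
    # Constraint (i): the end nodes carry the relation's entity types
    ends_ok = (entity_annotation.get(path[0]) == head_entity
               and entity_annotation.get(path[-1]) == tail_entity)
    # Constraint (ii): internal nodes carry no entity type annotation
    internals_ok = all(entity_annotation.get(n) is None for n in path[1:-1])
    return ends_ok and internals_ok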
In an embodiment, each latent node of the plurality of latent nodes of header node type, and each latent path of the plurality of latent paths having latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, are annotated jointly with one entity type of the plurality of entity types and one relation type of the plurality of relation types, respectively, present in the KG schema, using the trained probabilistic logic model (PLM), by: providing the corresponding latent structure graph in terms of grounded facts comprising (i) the grounded facts related to each latent edge present between the two latent nodes, and (ii) the grounded facts related to the node type value for each latent node, wherein the node type value for each latent node is one of: (i) the header node type value and (ii) the value node type value; and inferring the probability of each relation type annotation being associated with each latent path having latent nodes of header node type, present in the latent structure graph, to: (i) annotate the associated latent path with the associated relation type annotation having the maximum probability; and (ii) annotate the end latent nodes of header node type present in the associated latent path, with the entity type annotations corresponding to the associated relation type annotation.
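A minimal sketch of this joint annotation step follows, assuming a hypothetical trained_plm.relation_probabilities interface that returns the inferred probability of each relation type annotation for a given latent path:

def annotate(latent_paths, trained_plm, kg_schema):
    """latent_paths: lists of header-node ids; kg_schema maps each
    relation type to its (head entity, tail entity) pair."""
    relation_of, entity_of = {}, {}
    for path in latent_paths:
        # Probability of each relation type annotation for this path
        probs = trained_plm.relation_probabilities(path)
        best = max(probs, key=probs.get)   # maximum-probability relation
        relation_of[tuple(path)] = best
        head_entity, tail_entity = kg_schema[best]
        entity_of[path[0]] = head_entity   # annotate the end latent nodes
        entity_of[path[-1]] = tail_entity
    return relation_of, entity_of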
In an embodiment, the annotation of each latent node of the plurality of latent nodes of header node type, and each latent path of the plurality of latent paths having latent nodes of header node type, present in the latent structure graph of each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of the plurality of entity types and one relation type of the plurality of relation types, respectively, present in the KG schema, using the trained probabilistic logic model (PLM), further comprises: (A) identifying one or more latent nodes having the same header node type value out of the plurality of latent nodes, that are not annotated with one of the entity types present in the KG schema; (B) discovering a new entity type for the identified one or more latent nodes having the same header node type value, if a number of the identified one or more latent nodes having the same header node type value is greater than a first threshold value; (C) annotating the identified one or more latent nodes having the same header node type value with the discovered new entity type, and adding the discovered new entity type to the KG schema; (D) identifying one or more latent paths having the same header node type value sequence, that are not annotated with one of the relation types present in the KG schema, and whose internal latent nodes are not annotated with any entity type present in the KG schema; (E) discovering a new relation type for the associated header node type value sequence, if a number of the identified one or more latent paths having the same header node type value sequence is greater than a second threshold value; (F) annotating the identified one or more latent paths having the same header node type value sequence with the discovered new relation type, and adding the discovered new relation type to the KG schema; and (G) repeating the steps (A) through (F), until (i) no new entity type and (ii) no new relation type is discovered.
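The discovery loop of steps (A) through (G) may be sketched in Python as below. The node and path representations, the threshold names, and the naming scheme for discovered types are assumptions made for illustration only.

from collections import Counter

def discover_new_types(nodes, paths, entity_of, relation_of,
                       kg_schema, t_entity, t_relation):
    """nodes: dicts with "id", "is_header", "type_value";
    paths: dicts with "id" and "type_sequence" (a tuple of header
    node type values); entity_of / relation_of: current annotations."""
    discovered = True
    while discovered:                    # Step (G): repeat until stable
        discovered = False
        # Steps (A)-(C): frequent unannotated header node type values
        counts = Counter(n["type_value"] for n in nodes
                         if n["is_header"] and n["id"] not in entity_of)
        for type_value, count in counts.items():
            if count > t_entity:         # first threshold value
                new_entity = f"NewEntity_{type_value}"
                kg_schema["entities"].add(new_entity)
                for n in nodes:
                    if (n["is_header"] and n["type_value"] == type_value
                            and n["id"] not in entity_of):
                        entity_of[n["id"]] = new_entity
                discovered = True
        # Steps (D)-(F): frequent unannotated header type value sequences
        path_counts = Counter(p["type_sequence"] for p in paths
                              if p["id"] not in relation_of)
        for seq, count in path_counts.items():
            if count > t_relation:       # second threshold value
                new_relation = "NewRelation_" + "_".join(seq)
                kg_schema["relations"].add(new_relation)
                for p in paths:
                    if p["type_sequence"] == seq and p["id"] not in relation_of:
                        relation_of[p["id"]] = new_relation
                discovered = True
    return kg_schema, entity_of, relation_of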
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the embodiments of the present disclosure, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 is an exemplary block diagram of a system for annotating semi-structured documents using a knowledge graph schema, in accordance with some embodiments of the present disclosure.
FIG. 2A through FIG. 2B illustrate exemplary flow diagrams of a processor-implemented method for annotating semi-structured documents using the knowledge graph schema, in accordance with some embodiments of the present disclosure.
FIG. 2C through FIG. 2F illustrate exemplary flow diagrams for generating a trained probabilistic graphical model (PGM), in accordance with some embodiments of the present disclosure.
FIG. 2G through FIG. 2H illustrate exemplary flow diagrams for generating a latent structure graph for each semi-structured document to be annotated using the trained PGM, in accordance with some embodiments of the present disclosure.
FIG. 2I through FIG. 2K illustrate exemplary flow diagrams for generating a trained probabilistic logic model (PLM), in accordance with some embodiments of the present disclosure.
FIG. 2L illustrates an exemplary flow diagram for jointly annotating each latent node of a plurality of latent nodes of header node type, and each latent path of a plurality of latent paths having latent nodes of header node type, present in the latent structure graph of each semi-structured document to be annotated, with one entity type of a plurality of entity types and one relation type of a plurality of relation types, respectively, present in the KG schema, using the trained PLM, in accordance with some embodiments of the present disclosure.
FIG. 2M illustrates an exemplary flow diagram for discovering one or more new entity types and one or more new relation types, for annotation, in accordance with some embodiments of the present disclosure.
FIG. 3A shows two exemplary semi-structured documents to be annotated, in accordance with some embodiments of the present disclosure.
FIG. 3B shows observed structure graphs for the two exemplary semi-structured documents to be annotated shown in FIG. 3A, in accordance with some embodiments of the present disclosure.
FIG. 3C shows latent structure graphs for the two exemplary semi-structured documents to be annotated shown in FIG. 3A, in accordance with some embodiments of the present disclosure.
FIG. 3D shows initial latent structure graphs for the two exemplary semi-structured documents to be annotated shown in FIG. 3A, in accordance with some embodiments of the present disclosure.
FIG. 4 shows annotating of one or more latent nodes and one or more latent paths present in the latent structure graph of the two exemplary semi-structured documents to be annotated, shown in FIG. 3A, with one or more entity types and one or more relation types present in the KG schema, in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
The present disclosure herein provides methods and systems for annotating semi-structured documents using the knowledge graph schema, which solve the technical problems of such annotation by proposing a two-stage approach. The first stage uses a probabilistic graphical model (PGM) with structural parameters to capture the complexity of document structure templates (DSTs) and variations within the DST, by recovering the latent (true) semantic structure of the semi-structured documents. In the second stage, a probabilistic logic model (PLM) is used to jointly annotate the nodes and the paths in the recovered document structures with entity types and relation types respectively, present in the KG schema. This allows learning of annotation patterns while satisfying logical annotation constraints over the document structures. Further, the methods and systems of the present disclosure discover new entity types and relation types by analyzing the patterns in the unannotated document parts.
Referring now to the drawings, and more particularly to FIG. 1 through FIG. 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary systems and/or methods.
FIG. 1 is an exemplary block diagram of a system 100 for annotating semi-structured documents using a knowledge graph schema, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes or is otherwise in communication with one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more hardware processors 104, the memory 102, and the I/O interface(s) 106 may be coupled to a system bus 108 or a similar mechanism.
The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a plurality of sensor devices, a printer, and the like. Further, the I/O interface(s) 106 may enable the system 100 to communicate with other devices, such as web servers and external databases.
The I/O interface(s) 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For this purpose, the I/O interface(s) 106 may include one or more ports for connecting a number of computing systems or devices with one another or to another server.
The one or more hardware processors 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, portable computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 includes a plurality of modules 102a and a repository 102b for storing data processed, received, and generated by one or more of the plurality of modules 102a. The plurality of modules 102a may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
The plurality of modules 102a may include programs or computer-readable instructions or coded instructions that supplement applications or functions performed by the system 100. The plurality of modules 102a may also be used as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 102a can be implemented by hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. In an embodiment, the plurality of modules 102a can include various sub-modules (not shown in FIG. 1). Further, the memory 102 may include information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure.
The repository 102b may include a database or a data engine. Further, the repository 102b, amongst other things, may serve as a database or include a plurality of databases for storing the data that is processed, received, or generated as a result of the execution of the plurality of modules 102a. Although the repository 102b is shown internal to the system 100, it will be noted that, in alternate embodiments, the repository 102b can also be implemented external to the system 100, where the repository 102b may be stored within an external database (not shown in FIG. 1) communicatively coupled to the system 100. The data contained within such an external database may be periodically updated. For example, new data may be added into the external database and/or existing data may be modified and/or non-useful data may be deleted from the external database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). In another embodiment, the data stored in the repository 102b may be distributed between the system 100 and the external database.
Referring to FIG. 2A through FIG. 2B, components and functionalities of the system 100 are described in accordance with an example embodiment of the present disclosure. For example, FIG. 2A through FIG. 2B illustrate exemplary flow diagrams of a processor-implemented method 200 for annotating semi-structured documents using the knowledge graph schema, in accordance with some embodiments of the present disclosure. Although steps of the method 200 including process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any practical order. Further, some steps may be performed simultaneously, or some steps may be performed alone or independently.
At step 202 of the method 200, the one or more hardware processors 104 of the system 100 are configured to receive one or more semi-structured documents to be annotated. Each semi-structured document of the one or more semi-structured documents includes one or more textual contents. The one or more textual contents present in each semi-structured document may be represented in a particular structure that defines the document structure of the associated semi-structured document. In an embodiment, the document structure of one semi-structured document may be similar or identical to the document structure of another semi-structured document, out of the one or more semi-structured documents to be annotated. In another embodiment, the document structure of one semi-structured document may differ from the document structure of another semi-structured document, out of the one or more semi-structured documents to be annotated. In other words, some semi-structured documents out of the one or more semi-structured documents to be annotated may share a similar or identical document structure, while other semi-structured documents out of the one or more semi-structured documents to be annotated have a different document structure.
Each textual content of the one or more textual contents present in each semi-structured document of the one or more semi-structured documents includes a sequence of text tokens. However, the document structure of each semi-structured document of the one or more semi-structured documents to be annotated is not known and is to be derived based on the corresponding one or more textual contents.
In an embodiment, the one or more semi-structured documents to be annotated may be web-pages, electronic reports, business reports, office reports, or a combination thereof. In an embodiment, the one or more semi-structured documents to be annotated may belong to one organization, multiple organizations, or a combination thereof. In an embodiment, the organizations may include, but are not limited to, business entities, educational institutes, medical organizations, and so on.
FIG. 3A shows two exemplary semi-structured documents to be annotated, in accordance with some embodiments of the present disclosure. Both the exemplary semi-structured documents are associated with the banking industry, and in each semi-structured document, the text present in each rectangular block represents the textual contents. For example, ‘ABC Bank Designs Digital Banking’, ‘Client’, ‘Industries’, ‘Challenge’, and so on from the first semi-structured document, and ‘SUCCESS STORY’, ‘XYZ Automates Transaction’, ‘Challenges’, and so on from the second semi-structured document. Both the exemplary semi-structured documents show the success stories of the respective banks, namely ‘ABC Bank’ and ‘XYZ Bank’.
At step 204 of the method 200, the one or more hardware processors 104 of the system 100 are configured to obtain an observed structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, received at step 202 of the method 200. The observed structure graph for each semi-structured document represents the visually observed structure of the associated semi-structured document. The observed structure shows the arrangement and connections of the textual blocks present in the associated semi-structured document, as apparent at first glance.
In an embodiment, the observed structure graph for each semi-structured document may be obtained by treating the semi-structured document as a tabular structure in which the textual blocks present in the semi-structured document are placed under rows and columns. Each textual block is then considered as an observing node, and an observing edge is formed between each pair of observing nodes based on the closeness of the corresponding textual blocks.
In another embodiment, the observed structure graph for each semi-structured document may be obtained by first extracting the X and Y coordinates of the textual blocks present in the semi-structured document. Each textual block is then considered as an observing node, and an observing edge is formed between each pair of observing nodes based on the closeness of the X and Y coordinates of the corresponding textual blocks. The observed structure graph for each semi-structured document may be in the form of a directed acyclic graph (DAG) structure. The observed structure graph includes a plurality of observing nodes and a plurality of observing edges connecting two or more observing nodes of the plurality of observing nodes. The plurality of observing nodes are the nodes present in the observed structure graph and, similarly, the plurality of observing edges are the edges present in the observed structure graph.
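The coordinate-based construction can be sketched as follows in Python. This is a minimal sketch under the assumption that each textual block comes with bounding-box coordinates; the closeness rule, the max_gap threshold, and the single-parent choice are illustrative simplifications, not the disclosed method itself.

from dataclasses import dataclass

@dataclass
class Block:
    text: str
    x: float   # left coordinate of the textual block
    y: float   # top coordinate of the textual block

def observed_structure_graph(blocks, max_gap=50.0):
    """Return directed edges (parent index, child index) forming a DAG."""
    # Sort blocks in reading order (top-to-bottom, then left-to-right) so
    # that edges always point from an earlier block to a later one (no cycles).
    order = sorted(range(len(blocks)), key=lambda i: (blocks[i].y, blocks[i].x))
    edges = []
    for pos, j in enumerate(order):
        # Candidate parents: earlier blocks close to this block in X or Y
        candidates = [i for i in order[:pos]
                      if abs(blocks[j].x - blocks[i].x) <= max_gap
                      or abs(blocks[j].y - blocks[i].y) <= max_gap]
        if candidates:
            # Pick the closest earlier block as the parent node
            parent = min(candidates,
                         key=lambda i: (abs(blocks[j].x - blocks[i].x)
                                        + abs(blocks[j].y - blocks[i].y)))
            edges.append((parent, j))
    return edges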
Each observing node of the plurality of observing nodes represents one textual content present in the corresponding semi-structured document. Each observing edge represents a relation between two textual contents (observing nodes). Each observing node may act as one or more of: (i) a parent node, and (ii) a child node, based on the one or more observing edges connected with the corresponding observing node.
For example, FIG. 3B shows observed structure graphs for the two exemplary semi-structured documents to be annotated shown in FIG. 3A, in accordance with some embodiments of the present disclosure. The textual contents present in each semi-structured document are represented in the form of observing nodes, and the relation between each pair of textual contents is represented by observing edges. The textual contents ‘ABC bank’, ‘Few customers are taking…’, ‘As people continue to expect more…’, and so on from the first exemplary semi-structured document, and the textual contents ‘SUCCESS STORY’, ‘XYZ bank wanted to…’, and so on from the second exemplary semi-structured document, are represented in the form of observing nodes, and the observing edges are formed based on the relation between the two observing nodes. For example, ‘ABC bank’ is of type ‘industries’, so the observing edge is formed between the two observing nodes ‘ABC bank’ and ‘industries’ in the first exemplary semi-structured document. Similarly, the textual content ‘XYZ automates..’ belongs to ‘SUCCESS STORY’, hence the observing edge is formed between the two observing nodes ‘XYZ automates..’ and ‘SUCCESS STORY’ in the second exemplary semi-structured document.
Each observing node present in the observed structure graph acts either as a parent node or a child node or both, depending on the one or more observing edges connected with the associated observing node. If there is an observing edge going out from a first observing node and coming into a second observing node, then the second observing node acts as the child node of the first observing node, and the first observing node acts as the parent node of the second observing node. But if there is also an observing edge going out from the second observing node and coming into a third observing node, then the second observing node acts as the child node of the first observing node and the parent node of the third observing node. All the observing nodes that are leaf nodes, which do not have any observing edges going out from them, act only as child nodes. In the DAG structure, there is at least one observing node for which there are no observing edges coming into it, i.e., the root node. The root node and the other observing nodes for which there are no observing edges coming into them act only as parent nodes. The rest of the observing nodes, for which there are observing edges both going out from them and coming into them, act as both the parent node and the child node. For example, in the first exemplary semi-structured document of FIG. 3B, there are no observing edges coming into the observing node with textual content ‘ABC Bank..’ and it is the root node, hence it acts only as the parent node. The observing nodes with textual contents ‘Banking’ and ‘As people continue….’ are leaf nodes and hence they act only as child nodes. Similarly, the observing node with textual content ‘ABC Bank’ has both the parent node ‘Client’ and the child node ‘Challenge’, and hence it acts as both the parent node and the child node.
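The role of each observing node follows mechanically from its incoming and outgoing edges, as this short Python sketch shows (the (parent, child) index-pair edge representation is an assumption for illustration):

def node_roles(num_nodes, edges):
    """edges: iterable of (parent index, child index) pairs."""
    has_incoming = {child for _, child in edges}
    has_outgoing = {parent for parent, _ in edges}
    roles = {}
    for n in range(num_nodes):
        if n in has_outgoing and n in has_incoming:
            roles[n] = "parent and child"
        elif n in has_outgoing:
            roles[n] = "parent only"   # root-like nodes: no incoming edges
        elif n in has_incoming:
            roles[n] = "child only"    # leaf nodes: no outgoing edges
        else:
            roles[n] = "isolated"      # no edges at all
    return roles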
In an embodiment, the observed structure graph of one semi-structured document may be similar to the observed structure graph of a second semi-structured document. In another embodiment, the observed structure graph of one semi-structured document may be the same as the observed structure graph of a second semi-structured document. Further, in another embodiment, the observed structure graph of one semi-structured document may be different from the observed structure graph of a second semi-structured document. In other words, the observed structure of one semi-structured document may be similar, the same, or different with respect to the observed structure of a second semi-structured document. The semi-structured documents having the same or a similar observed structure may belong to a particular DST. Similarly, the semi-structured documents having different observed structures belong to different DSTs. For example, in FIG. 3B, both the exemplary semi-structured documents reveal the success stories of two banks, and hence both the exemplary semi-structured documents may belong to the same DST.
However, the observed structure obtained in the form of the observed structure graph of the semi-structured document may not reveal the true semantic relationships between the textual contents. For example, in both exemplary semi-structured documents, the textual contents ‘Challenges’ and ‘Challenge’ appear from the associated observed structure graphs to be related to two client names. But actually, both are directly related to the high-level success stories, which are the root nodes of the semi-structured documents. Further, due to these differences, the actual DST for each semi-structured document may not be derived accurately.
At step 206 of the method 200, the one or more hardware processors 104 of the system 100 are configured to generate a latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, received at step 202 of the method 200. The latent structure graph for each semi-structured document represents the actual or true structure of the associated semi-structured document and is generated based on the corresponding observed structure graph.
The generated latent structure graph for each semi-structured document may be in the form of the directed acyclic graph (DAG) structure and includes a plurality of latent nodes and a plurality of latent edges connecting two or more latent nodes of the plurality of latent nodes. The plurality of latent nodes are the nodes present in the latent structure graph, and similarly, the plurality of latent edges are the edges present in the latent structure graph. Each latent node may represent the textual content present in the semi-structured document. Each latent edge represents the relation between two latent nodes (two textual contents).
Each latent node present in the latent structure graph acts either as the parent node or the child node or both, depending on the one or more latent edges connected with the associated latent node. If there is a latent edge going out from a first latent node and coming into a second latent node, then the second latent node acts as the child node of the first latent node, and the first latent node acts as the parent node of the second latent node. But if there is also a latent edge going out from the second latent node and coming into a third latent node, then the second latent node acts as the child node of the first latent node and the parent node of the third latent node. All the latent nodes that are leaf nodes, which do not have any latent edges going out from them, act only as child nodes. In the DAG structure, there is at least one latent node for which there are no latent edges coming into it, i.e., the root node. The root node and the other latent nodes for which there are no latent edges coming into them act only as parent nodes. The rest of the latent nodes, for which there are latent edges both going out from them and coming into them, act as both the parent node and the child node.
Each latent node present in the latent structure graph may act as either a header node or a value node, based on the associated node type. In other words, each observing node present in the corresponding observed structure graph may act as either the header node or the value node in the corresponding latent structure graph. Further, two or more latent edges that are connected consecutively between two or more latent nodes form a latent path, and the latent structure graph for each semi-structured document may include one or more such latent paths. FIG. 3C shows latent structure graphs for the two exemplary semi-structured documents to be annotated shown in FIG. 3A, in accordance with some embodiments of the present disclosure. H1, H2, H3, and H4 present in the first exemplary latent structure graph are the header nodes and, similarly, V1, V2, V3, and V4 are the value nodes. For example, in the latent structure graph of the first exemplary semi-structured document, the textual content ‘ABC Bank…’ is represented as the value node ‘V1’, the textual content ‘Client’ is represented as the header node ‘H2’, and so on. Most importantly, in the latent structure graph of the second exemplary semi-structured document, the textual content ‘SUCCESS STORY’ is represented as the header node ‘H1’, but the textual content ‘SUCCESS STORY’ is missing in the first semi-structured document. As both the exemplary semi-structured documents are of the same type (success stories), the header node ‘H1’ must be inserted in the latent structure graph of the first exemplary semi-structured document.
Further, the latent nodes which have either similar (or the same) textual content and have similar node types in their neighbors (the same node type in the parent or child of the corresponding node) share the same node type. For example, the latent node with textual content ‘ABC Bank..’ in the first latent structure graph of the first exemplary semi-structured document and the latent node with textual content ‘XYZ automates..’ in the second latent structure graph of the second exemplary semi-structured document do not have the same textual content, but they have similar neighbor node types (they are success story names and their parent node should contain the textual content ‘success story’, which is missing in the first example), and hence both have the same node type and are represented by ‘V1’. Similarly, the nodes with textual content ‘banking’ have the same textual content and also have similar neighbor types (their parent is about industries, and the parent is missing in the second example), hence both are represented with the value node ‘V3’. Similarly, the nodes with textual contents ‘Few customers are taking…’ and ‘As people continue to expect more…’ from the first exemplary semi-structured document, and the node with textual content ‘XYZ bank wanted to…’ from the second exemplary semi-structured document, are children of the same type of node (‘Challenges’/‘Challenge’). Hence these nodes are represented with the value node type ‘V4’.
In an embodiment, the latent structure graph for each semi-structured document is generated using a trained PGM, based on the corresponding observed structure graph. FIG. 2C through FIG. 2F illustrate exemplary flow diagrams for generating a trained probabilistic graphical model (PGM), in accordance with some embodiments of the present disclosure.
At step 206a1 of the method 200, the one or more hardware processors 104 of the system 100 are configured to receive a plurality of semi-structured documents for training. Each semi-structured document of the plurality of semi-structured documents for training, is associated with the DST of one or more DSTs. Each semi-structured document includes the one or more training textual contents arranged according to the associated document structure template and each training textual content of the one or more training textual contents includes the sequence of text tokens. The one or more training textual contents are the textual contents that are present in each semi-structured document for training.
At step 206a2 of the method 200, the one or more hardware processors 104 of the system 100 are configured to obtain a training observed structure graph for each semi-structured document of the plurality of semi-structured documents for training. The training observed structure graph is the observed structure graph for each semi-structured document for training. The training observed structure graph for each semi-structured document for training represents the visually observed structure as mentioned at step 204 of the method 200, and is in the form of a directed acyclic graph (DAG) structure. The training observed structure graph for each semi-structured document for training includes a plurality of training observing nodes and a plurality of training observing edges. The plurality of training observing nodes are the nodes present in the training observed structure graph and, similarly, the plurality of training observing edges are the edges present in the training observed structure graph.
Each training observing node of the plurality of training observing nodes represents the training textual content present in the corresponding semi-structured document for training, and each training observing edge represents the relation between two training observing nodes. Each training observing node acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more training observing edges connected with the associated training observing node, as mentioned at step 204 of the method 200.
At step 206a3 of the method 200, the one or more hardware processors 104 of the system 100 are configured to receive a training latent structure graph for each semi-structured document of the plurality of semi-structured documents for training. The training latent structure graph for each semi-structured document for training is received in the form of the directed acyclic graph (DAG) structure. The training latent structure graph is the latent structure graph for each semi-structured document for training. The training latent structure graph includes a plurality of training latent nodes and a plurality of training latent edges. The plurality of training latent nodes are the nodes present in the associated training latent structure graph. Similarly, the plurality of training latent edges are the edges present in the associated training latent structure graph. Each training latent edge is associated with the relation between two training latent nodes. Each training latent node acts as one or more of (i) the parent node, and (ii) the child node, based on the one or more training latent edges connected to the associated training latent node.
Each training latent node may represent the training textual content present in the semi-structured document for training. Each training latent edge represents the relation between two training latent nodes (two training textual contents present in the semi-structured document for training).
Further, each training latent node present in the training latent structure graph of the semi-structured document for training may be of either the header node type or the value node type, based on the corresponding node type value. In other words, each training observing node present in the corresponding observed structure graph may act as either the header node or the value node, based on the corresponding node type, in the corresponding training latent structure graph. Further, two or more training latent edges that are connected consecutively between two or more training latent nodes form a training latent path, and the training latent structure graph for each semi-structured document for training may include one or more such training latent paths.
At step 206a4 of the method 200, the one or more hardware processors 104 of the system 100 are configured to generate the trained PGM by training a PGM with the plurality of semi-structured documents for training received at step 206a1 of the method 200, using (i) the corresponding training observed structure graphs obtained at step 206a2 of the method 200, and (ii) the corresponding training latent structure graphs received at step 206a3 of the method 200. In an embodiment, the PGM may be a generative PGM. The training of the PGM is further described below to generate the trained PGM.
Let the training observed structure graph, the DST, and the training latent structure graph of the ith semi-structured document for training be represented as Si, Ti, and S*i respectively, where the training textual content, the parent node, and the node type value of the jth training latent node of the ith semi-structured document are w*ij, p*ij, and tij respectively. Also, the set of training textual contents, the parent nodes, and the node type values for all training latent nodes in the ith semi-structured document are represented by w*i, p*i, and ti, hence S*i = (w*i, p*i, ti). Moreover, the correspondence between the training observed structure graph Si and the training latent structure graph S*i of the ith semi-structured document is represented by ci = {cij}j, where cij = -1 implies that the jth training latent node in the ith document is left out of the corresponding training observed structure graph Si.
At step 206a4a of the method 200, the one or more hardware processors 104 of the system 100 are configured to assign: (i) an initial DST value for each semi-structured document of the plurality of semi-structured documents for training, (ii) an initial node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training.
The initial DST value represents the initial DST for the associated semi-structured document of the plurality of semi-structured documents for training. The semi-structured documents having the same DST value include the same document structure template. Similarly, the semi-structured documents having different DST values include different document structure templates.
The initial DST value is a positive integer value selected from the sequence 1, 2, 3, …, K, where K is a finite positive number. At the time of initialization, at step 206a4a of the method 200, each semi-structured document of the plurality of semi-structured documents for training is assigned, in sequential order, a unique DST value starting from 1, then 2, and so on till ‘K’, assuming that the DST for each semi-structured document of the plurality of semi-structured documents for training is unknown. More specifically, the initial DST value for the semi-structured document received first in the first iteration is assigned as ‘1’.
Each training latent node present in the training latent structure graph of each semi-structured document for training is assigned the initial node type value. The initial node type value is one of (i) an initial header node type value, and (ii) an initial value node type value. In an embodiment, each training latent node present in the training latent structure graph is classified as one of: (i) the header node type, and (ii) the value node type, based on (i) the length of the corresponding training textual content, and (ii) whether the associated training latent node is a leaf node or not. The length of the corresponding training textual content is defined based on the number of characters (including numbers and special characters) present in the corresponding training textual content.
If the training latent node is a leaf node, then the corresponding training latent node is classified as the value node type. If the length of the training textual content present in the training latent node is less than or equal to a predefined length threshold and the training latent node is not a leaf node, then the corresponding training latent node is classified as the header node type. Else, if the length of the training textual content present in the training latent node is greater than the predefined length threshold, then the corresponding training latent node is classified as the value node type. In an embodiment, the predefined length threshold may be less than 5.
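This classification heuristic can be expressed directly in a few lines of Python; the threshold value of 5 follows the example given above and is an assumption rather than a fixed requirement:

def initial_node_type(text, is_leaf, length_threshold=5):
    # Leaf nodes are always of the value node type
    if is_leaf:
        return "value"
    # Short, non-leaf textual contents are treated as header nodes
    if len(text) <= length_threshold:
        return "header"
    # Long textual contents are treated as value nodes
    return "value"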
During the initialization, each training latent node of header node type, present in the training latent structure graph of the semi-structured document for training received first in the first iteration, is assigned the initial header node type value selected from the sequence H1, H2, H3, …, Hc, where H denotes ‘header node type’ and ‘c’ represents the total number of header node type values for the plurality of semi-structured documents for training. More specifically, the first training latent node of header node type is assigned ‘H1’, the second training latent node of header node type is assigned ‘H2’, and so on, in sequential order.
Similarly, each training latent node of value node type, present in the training latent structure graph of the semi-structured document for training received first in the first iteration, is assigned the initial value node type value selected from the sequence V1, V2, V3, …, Vd, where V denotes ‘value node type’ and ‘d’ represents the total number of value node type values for the plurality of semi-structured documents for training. More specifically, the first training latent node of value node type is assigned ‘V1’, the second training latent node of value node type is assigned ‘V2’, and so on, in sequential order. Further, if the parent node of a training latent node is of header node type and has the initial header node type value ‘Hk’, then the initial value node type value for the corresponding training latent node is assigned as ‘Vk’.
In the next step, an initial parameter value for each model parameter of a set of model parameters, as mentioned at step 206a4b, and a value of the score function, as mentioned at step 206a4c, are computed for the semi-structured document received first in the first iteration. Then, for each semi-structured document received from the second time onwards in the same iteration, the initial node type value for each training latent node is assigned based on (i) the associated node type, (ii) the model parameters, and (iii) the value of the score function. More specifically, the initial header node type value for each training latent node of header node type is either assigned from the possible initial header node type values (H1, H2, …, Hk, assigned for the semi-structured document received first in the first iteration) that are already seen, or assigned a new initial header node type value ‘Hk+1’. Similarly, the initial value node type value for each training latent node of value node type is either assigned from the possible initial value node type values (V1, V2, …, Vk, assigned for the semi-structured document received first in the first iteration) that are already seen, or assigned a new initial value node type value ‘Vk+1’.
Further, the initial DST value for each semi-structured document received from the second time onwards in the same iteration is assigned based on (i) the initial node type values of the training latent nodes present in the corresponding training latent structure graph, (ii) the model parameters, and (iii) the value of the score function. More specifically, the initial DST value for the semi-structured document received second is assigned one of the possible values, i.e., ‘1’ (already seen) or a new DST value ‘2’. The initial DST value for the semi-structured document received third is assigned one of the possible values ‘1’ or ‘2’ (if already seen), or a new DST value ‘3’, and so on. At the end of step 206a4a of the method 200, the initial parameter value for each model parameter of the set of model parameters is computed.
At step 206a4b of the method 200, the one or more hardware processors 104 of the system 100 are configured to compute the initial parameter value for each model parameter of the set of model parameters, for the plurality of semi-structured documents for training. The set of model parameters includes: (i) a probability of the initial node type value for each training latent node being a child node of another initial node type value in the corresponding training latent structure graph, which is denoted by µ, (ii) a probability of the initial node type value for each training latent node appearing under a particular initial DST value, which is denoted by β, (iii) a probability of the initial node type value for each training latent node having a specific training textual content, which is denoted by P(w*ij|tij), and (iv) a probability of the initial node type value for each training latent node being left out in the corresponding training observed structure graph, which is denoted by f. After the computation, the initial parameter values of the set of model parameters, for the plurality of semi-structured documents for training, may be stored in a matrix or vector format. In an embodiment, when node locations are provided, the probability of the location (X and Y coordinate) difference between a node type and its parent node type is also considered as a model parameter.
At step 206a4c of the method 200, the one or more hardware processors 104 of the system 100 are configured to infer a successive DST value for each semi-structured document of the plurality of semi-structured documents for training. The successive DST value for each semi-structured document is inferred based on (i) the corresponding training latent structure graph, (ii) the initial node type value for each training latent node present in the corresponding training latent structure graph, (iii) the initial parameter value for each model parameter of the set of model parameters, and (iv) the value of the score function, wherein the score function is defined using the set of model parameters. The successive DST value for each semi-structured document may be a DST value that was already seen in the previous iteration, or a new DST value.
The score function, or the joint distribution, defined for the generative PGM is P(Si, Ti, S*i, ci) = P(Ti) * P(ci|S*i) * P(S*i|Ti) * P(Si|S*i, ci, Ti), where P(Ti) follows a categorical distribution with parameter α, and P(ci|ti) = ∏j P(cij|tij), where P(cij = -1|tij = t) is defined as ft. P(Si|S*i, ci, Ti) defines the probability of getting the training observed structure graph given the training latent structure graph. In an embodiment, the term P(Ti) follows a uniform categorical distribution. When the training latent node has a corresponding training observing node (i.e., cij ≠ -1), the training textual content of the associated training latent node is copied to the training observing node. P(S*i|Ti) is further defined as: P(S*i|Ti) = ∏j P(p*ij, tij|ti,<j, Ti) * P(w*ij|tij). The first term is the probability of the parent nodes of each training latent node and its node type, conditioned on the DST and the node types of the previous training latent nodes in the topological order. The probability P(p*ijj′ = 1, tij = k|tij′ = k′, Ti = t) of training latent node j with type k having node j′ with type k′ as parent, for DST t, is defined as µkk′βtk′. The pattern for parent and child types is captured by µ, and that for DST and node types by β. The second term is the probability of the training textual content of the training latent node given the associated node type. For the probability P(w*ij|tij) of training textual contents for a node type, sequence embeddings of the node types are used and fine-tuned using BERT. Each node type t is also assumed to have an (unknown) embedding ht, so that P(w*ij|tij) = cosine(h(w*ij), ht), where h(w*ij) is the embedding for w*ij and cosine(h(w*ij), ht) is the cosine similarity.
To account for a variable number Nt of DSTs, and variable numbers of node types Nh for header nodes and Nv for value nodes, Dirichlet Process priors Nt ~ DP(αt), Nh ~ DP(αh), Nv ~ DP(αv) are used so that these numbers grow slowly with increasing observations. When the node locations are provided, terms are added to the joint distribution for the probability of the distances between a node and its parent nodes. This is modeled as a Gaussian centered at the location of the node.
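Purely for readability, the joint distribution and priors above may be restated in LaTeX notation (same content as the preceding paragraphs; ∏ denotes a product over the latent nodes j of document i):

P(S_i, T_i, S_i^*, c_i) = P(T_i)\, P(c_i \mid S_i^*)\, P(S_i^* \mid T_i)\, P(S_i \mid S_i^*, c_i, T_i)
P(c_i \mid t_i) = \prod_j P(c_{ij} \mid t_{ij}), \qquad P(c_{ij} = -1 \mid t_{ij} = t) = f_t
P(S_i^* \mid T_i) = \prod_j P(p_{ij}^*, t_{ij} \mid t_{i,<j}, T_i)\, P(w_{ij}^* \mid t_{ij})
P(p_{ijj'}^* = 1,\ t_{ij} = k \mid t_{ij'} = k',\ T_i = t) = \mu_{kk'}\, \beta_{tk'}
P(w_{ij}^* \mid t_{ij} = t) = \operatorname{cosine}\big(h(w_{ij}^*),\ h_t\big)
N_t \sim \mathrm{DP}(\alpha_t), \qquad N_h \sim \mathrm{DP}(\alpha_h), \qquad N_v \sim \mathrm{DP}(\alpha_v)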
To infer the DST value for the ith document, P(Ti|Si, S*i, ci) = P(Ti, Si, S*i, ci) / P(Si, S*i, ci) ∝ P(Ti, Si, S*i, ci) is calculated for each possible value of Ti, and the successive value of Ti is chosen which gives the maximum value of P(Ti|Si, S*i, ci). If the inferred successive DST value is different from the initial DST value, then the inferred successive DST value replaces the initial DST value for the corresponding semi-structured document for training. In other words, if the inferred successive DST value is different from the initial DST value, then the DST of the corresponding semi-structured document for training is different from the assumed initial DST (the initial DST value), based on the value of the score function.
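The DST inference thus amounts to an argmax over candidate DST values, as in this hypothetical Python sketch (score_joint stands in for the computation of P(Ti, Si, S*i, ci); DST values are assumed to be consecutive positive integers and seen_dst_values is assumed non-empty):

def infer_dst(doc, seen_dst_values, score_joint):
    # Candidates: every DST value already seen, plus one new value
    # (the Dirichlet Process prior permits opening a new DST)
    candidates = sorted(seen_dst_values) + [max(seen_dst_values) + 1]
    scores = {t: score_joint(doc, t) for t in candidates}
    # Choose the DST value maximizing the (unnormalized) posterior
    return max(scores, key=scores.get)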
At step 206a4d of the method 200, the one or more hardware processors 104 of the system 100 are configured to infer a successive node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training. The successive node type value for each training latent node is inferred based on (i) the corresponding training latent structure graph, (ii) the corresponding successive DST value, (iii) the initial parameter value for each model parameter of the set of model parameters, and (iv) the value of the score function. Here t_i represents the set of node types of all training latent nodes of the ith document, and since S*_i = (w*_i, p*_i, t_i), then for the jth node of the ith document, P(t_ij | S_i, w*_i, p*_i, c_i, t_ik: k≠j) = P(S_i, w*_i, p*_i, t_i, c_i) / P(S_i, w*_i, p*_i, c_i) ∝ P(S_i, w*_i, p*_i, t_i, c_i) is calculated for each possible value of t_ij, where the other nodes (other than the jth node) keep their current assignment of node type value, and the successive value of t_ij is chosen as the one which gives the maximum value of P(t_ij | S_i, w*_i, p*_i, c_i, t_ik: k≠j).
If the inferred successive node type value is different from the initial node type value, then the inferred successive node type value replaces the initial node type value for the corresponding training latent node present in the training latent structure graph. In other words, if the inferred successive node type value is different from the initial node type value, then the node type of the corresponding training latent node is different from the assumed initial node type (the initial node type value), based on the value of the score function. Here, the node type of a training latent node may change from the header node type to the value node type, or vice versa. The successive node type value for each training latent node may be chosen from the node type values that were already seen in the previous iteration, or may be a new node type value.
At step 206a4e of the method 200, the one or more hardware processors 104 of the system 100 are configured to compute a successive parameter value for each model parameter of the set of model parameters, for the plurality of semi-structured documents for training. The successive parameter value for each model parameter may change either due to the change in the DST of the semi-structured document (due to the change in the DST value at step 206a4c) or due to the change in the node type (due to the change in the node type value at step 206a4d) that results in a change in the corresponding training latent structure graph. After the computation, the successive parameter values of the set of model parameters, for the plurality of semi-structured documents for training, may be stored in a matrix or vector format.
At step 206a4f of the method 200, the one or more hardware processors 104 of the system 100 are configured to repeat the steps 206a4c through 206a4e, by taking the successive DST value as the initial DST value and the successive node type value as the initial node type value, for each semi-structured document of the plurality of semi-structured documents for training, and the successive parameter value for each model parameter as the initial parameter value, until there is no change in (i) the successive DST value and (ii) the successive node type value for each training latent node present in the training latent structure graph of each semi-structured document of the plurality of semi-structured documents for training.
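The alternating updates in steps 206a4c through 206a4f amount to a coordinate-ascent (hard-EM style) loop. A minimal sketch follows; infer_dst, infer_node_type and update_parameters are hypothetical helpers that respectively maximize the score function over candidate DST values, maximize it over candidate node-type values for a single node, and recompute the model parameters from the current hard assignments.

def train_pgm(docs, params, max_iters=100):
    # Sketch of steps 206a4c-206a4f: iterate until no assignment changes.
    for _ in range(max_iters):
        changed = False
        for doc in docs:
            new_dst = infer_dst(doc, params)             # step 206a4c
            if new_dst != doc.dst:
                doc.dst, changed = new_dst, True
            for j in range(len(doc.nodes)):              # step 206a4d
                new_type = infer_node_type(doc, j, params)
                if new_type != doc.node_types[j]:
                    doc.node_types[j], changed = new_type, True
        params = update_parameters(docs)                 # step 206a4e
        if not changed:                                  # step 206a4f: converged
            break
    return docs, params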
At step 206a4g of the method 200, the one or more hardware processors 104 of the system 100 are configured to extract, after the training, (i) the DST value for each semi-structured document of the plurality of semi-structured documents for training, and (ii) the node type value for each training latent node present in the training latent structure graph of each such semi-structured document. The DST value for each semi-structured document is extracted from the corresponding successive DST value obtained after the step 206a4f of the method 200. Similarly, the node type value for each training latent node is extracted from the corresponding successive node type value obtained after the step 206a4f of the method 200.
The semi-structured documents out of the plurality of semi-structured documents for training having the same DST value share the same DST. In an embodiment, the semi-structured documents for training having the same DST may have the same or similar training latent structure graphs (latent structures). In another embodiment, the semi-structured documents for training having the same DST may have small variations in their training latent structure graphs.
After the step 206a4g of the method 200, the trained PGM model is generated. The trained PGM model learns the number of DSTs associated with the plurality of semi-structured documents for training from the maximum successive DST value associated with any semi-structured document out of the plurality of semi-structured documents for training. For example, if the maximum successive DST value associated with any semi-structured document is ‘15’, then the number of DSTs associated with the plurality of semi-structured documents for training is ‘15’. Similarly, the trained PGM model learns the number of node types present in each semi-structured document and across the plurality of semi-structured documents for training. The initial node type value or the successive node type value defines the node type of the corresponding training latent node. If the node type value is the same for two training latent nodes, then those two training latent nodes have the same node type. The number of node types includes the number of header node types and the number of value node types present in each semi-structured document and across the plurality of semi-structured documents for training.
The trained PGM model is used to generate the latent structure of the semi-structured document. Further, it is used to identify the most likely DST of the semi-structured document, the node types for each training latent node and the associated correspondences, conditioned on the training observed structure graph, and may also accommodate the possibility of new node types and new DSTs.
Further, the trained PGM model is validated using a validation set comprising a set of semi-structured documents for validation, the associated validation observed structure graphs and the associated validation latent structure graphs, to fine-tune the model hyper-parameters before being used to generate the latent structure graph of the semi-structured document to be annotated.
The trained PGM model obtained at step 206a4 is used to generate the latent structure graph for each semi-structured document of the one or more semi-structured documents to be annotated, based on the corresponding observed structure graph received at step 204 of the method 200. FIG. 2G through FIG. 2H illustrate exemplary flow diagrams for generating the latent structure graph for each semi-structured document to be annotated using the trained PGM, in accordance with some embodiments of the present disclosure. The generation of the latent structure graph for each semi-structured document is further described in the below steps. At step 206b1 of the method 200, the one or more hardware processors 104 of the system 100 are configured to initialize the corresponding observed structure graph received at step 204 of the method 200, as the initial latent structure graph.
At step 206b2 of the method 200, the one or more hardware processors 104 of the system 100 are configured to assign (i) an initial node type value for each latent node present in the initial latent structure graph, and (ii) an initial DST value, based on the initial node type value for each latent node present in the initial latent structure graph, wherein the initial node type value is one of (i) an initial header node type value, and (ii) an initial value node type value. In an embodiment, the initial DST value for the corresponding semi-structured document is chosen from the number of DST values observed during the training. In an embodiment, the initial node type value for each latent node present in the initial latent structure graph is assigned based on the node type values that are observed during the training.
FIG. 3D shows the initial latent structure graphs for the two exemplary semi-structured documents to be annotated shown in FIG. 3A, in accordance with some embodiments of the present disclosure. In the first initial latent structure graph, initial header node value ‘H2’ is assigned to the textual content ‘Client’ and the initial value node value ‘V1’ is assigned to the textual content ‘ABC Bank’ as they may be associated with the respective DST. Similarly, in the second initial latent structure graph, initial header node value ‘H1’ is assigned to the textual content ‘SUCCESS STORY’ and the initial value node value ‘V1’ is assigned to the textual content ‘XYZ automates’ as they may be associated with the respective DST.
At step 206b3 of the method 200, the one or more hardware processors 104 of the system 100 are configured to find one or more missing header node type values for the corresponding semi-structured document, that are missing in the initial latent structure graph, but present in the training latent structure graph of at least one semi-structured document having the same DST value, out of the plurality of semi-structured documents for training, received at step 206a1 of the method 200.
As observed at step 206b2, the header node ‘H1’ is missing in the first initial latent structure graph of FIG. 3D. Hence the header node ‘H1’ is considered as the missing latent node, if the same is present in the training latent structure graph of at least one semi-structured document having the same DST value, out of the plurality of semi-structured documents for training. Such one or more missing header node type values for the corresponding semi-structured document are found in this step.
At step 206b4 of the method 200, the one or more hardware processors 104 of the system 100 are configured to infer the successive DST value for the corresponding semi-structured document, based on (i) the initial latent structure graph, (ii) the initial node type value for each latent node present in the initial latent structure graph. At step 206b5 of the method 200, the one or more hardware processors 104 of the system 100 are configured to infer the successive node type value for each latent node present in the initial latent structure graph, based on (i) the initial latent structure graph, and (ii) the successive DST value for the corresponding semi-structured document.
Further, at step 206b6 of the method 200, the one or more hardware processors 104 of the system 100 are configured to infer one or more successive parent nodes for each latent node present in the initial latent structure graph, based on (i) the corresponding successive node type value for each latent node, and (ii) the successive DST value for the corresponding semi-structured document.
At step 206b7 of the method 200, the one or more hardware processors 104 of the system 100 are configured to insert a new latent node for each missing header node type value found at step 206b3 of the method 200, in the initial latent structure graph, based on the associated value of the score function determined by the trained PGM. In the first step, the new latent node is inserted, and then in the second step, the one or more successive parent nodes for each latent node are inferred as specified at step 206b6 of the method 200. Then, the value of the score function with respect to the latent structure graph is computed, to check whether it increases relative to its previous value; the new latent node is retained only if it does. For example, for the missing node ‘H1’, the new latent node is inserted if the associated value of the score function determined by the trained PGM increases relative to its previous value.

At step 206b8 of the method 200, the one or more hardware processors 104 of the system 100 are configured to repeat the steps 206b3 through 206b7, by taking the successive DST value as the initial DST value, and the successive node type value as the initial node type value, for each latent node present in the initial latent structure graph, until there is no change in (i) the successive DST value, (ii) the successive node type value for each latent node present in the initial latent structure graph, and (iii) the one or more successive parent nodes for each latent node present in the initial latent structure graph.

At step 206b9 of the method 200, the one or more hardware processors 104 of the system 100 are configured to consider the initial latent structure graph generated after the step 206b8 of the method 200 as the latent structure graph for the corresponding semi-structured document. Further, the DST for the corresponding semi-structured document is obtained based on the successive DST value obtained after the step 206b8 of the method 200.
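Putting steps 206b1 through 206b9 together, structure recovery for a new document can be sketched as a greedy search that keeps an inserted node only when it raises the PGM score; every helper name below is an assumption for the sketch, not the disclosed implementation.

def recover_latent_structure(doc, params, training_dsts):
    # Sketch of steps 206b1-206b9 (greedy search using the trained PGM).
    graph = doc.observed_graph.copy()                        # step 206b1
    assign_initial_types_and_dst(graph, training_dsts)       # step 206b2
    while True:
        changed = False
        changed |= update_dst(graph, params)                 # step 206b4
        changed |= update_node_types(graph, params)          # step 206b5
        changed |= update_parents(graph, params)             # step 206b6
        for h in missing_header_types(graph, training_dsts): # step 206b3
            candidate = graph.with_inserted_node(h)          # step 206b7
            if score(candidate, params) > score(graph, params):
                graph, changed = candidate, True             # keep only if score rises
        if not changed:                                      # step 206b8: converged
            return graph                                     # step 206b9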
At step 208 of the method 200, the one or more hardware processors 104 of the system 100 are configured to jointly annotate each latent node of the plurality of latent nodes of header node type, and each latent path of the one or more latent paths having two or more latent nodes of header node type, present in the latent structure graph generated by the trained PGM for each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of a plurality of entity types and one relation type of a plurality of relation types, respectively, present in the KG schema. Each latent node of the plurality of latent nodes of header node type present in the latent structure graph of each semi-structured document is annotated with one entity type of the plurality of entity types present in the KG schema. Similarly, each latent path of the one or more latent paths having two or more latent nodes of header node type present in the latent structure graph of each semi-structured document is annotated with one relation type of the plurality of relation types present in the KG schema.
In an embodiment, the annotation of each latent node of the plurality of latent nodes of header node type, and each latent path of the one or more latent paths having two or more latent nodes of header node type, present in the latent structure graph of each semi-structured document, is carried out using a trained PLM, based on the plurality of entity types and the plurality of relation types present in the KG schema. FIG. 2I through FIG. 2K illustrate exemplary flow diagrams for generating the trained probabilistic logic model (PLM), in accordance with some embodiments of the present disclosure.
At step 208a1 of the method 200, the one or more hardware processors 104 of the system 100 are configured to receive the KG schema having the plurality of entity types and the plurality of relation types. Each relation type of the plurality of relation types represents a type of relation between two entity types of the plurality of entity types. In an embodiment, the received KG schema may be for a particular domain such as the banking industry, an educational institute, and so on. In an embodiment, the KG schema may be received in the form of a directed multi-graph. There may be multiple directed relation paths between the same pair of entities.
Let G be the KG schema that specifies the plurality of entity types and the plurality of relation types for the given domain. Let E be the set of the plurality of entity types. Let R be the set of the plurality of binary relation types between two entity types in E.
At step 208a2 of the method 200, the one or more hardware processors 104 of the system 100 are configured to receive a plurality of entity type annotations and a plurality of relation type annotations associated with the plurality of semi-structured documents for training received at step 206a1 of the method 200. Each entity type annotation of the plurality of entity type annotations represents an annotation of a training latent node of header node type, present in the training latent structure graph of the associated semi-structured document of the plurality of semi-structured documents for training, with the corresponding entity type present in the KG schema. Similarly, each relation type annotation represents the annotation of a training latent path having training latent nodes of header node type, present in the training latent structure graph of the associated semi-structured document of the plurality of semi-structured documents for training, with the corresponding relation type present in the KG schema.
At step 208a3 of the method 200, the one or more hardware processors 104 of the system 100 are configured to generate the trained PLM by training a PLM. The PLM is trained with (i) the latent structure graph for each semi-structured document of the plurality of semi-structured documents for training received at step 206a1 of the method 200, (ii) the plurality of entity types and the plurality of relation types present in the KG schema received at step 208a1 of the method 200, and (iii) the plurality of entity type annotations and the plurality of relation type annotations associated with the plurality of semi-structured documents for training, received at step 208a2 of the method 200. The training of the PLM to generate the trained PLM is further described below.
At step 208a3a of the method 200, the one or more hardware processors 104 of the system 100 are configured to provide to the PLM each latent structure graph for each semi-structured document of the plurality of semi-structured documents for training, in terms of training grounded facts. The training grounded facts include (i) the grounded facts related to each training latent edge present between two training latent nodes, represented by edge(n_ij, n_ij’) = 1 if there is a training latent edge from n_ij to n_ij’ or from n_ij’ to n_ij in the training latent structure graph of the ith semi-structured document for training, and (ii) the grounded facts related to the node type value for each training latent node, represented by type(t_ij, t) = 1 if the jth training latent node of the training latent structure graph of the ith semi-structured document for training has node type value t. The node type value for each latent node is one of: (i) the header node type value, and (ii) the value node type value.
At step 208a3b of the method 200, the one or more hardware processors 104 of the system 100 are configured to provide training uncertain fact parameters to the PLM. The training uncertain fact parameters include: (i) the probability of each header node type value present in the training latent structure graph being an entity type present in the KG schema, represented by θ_et :: eLabel(e, t), i.e. the probability of mapping header node type value t to entity type e, and (ii) the probability of each relation type being present between two entity types present in the KG schema, represented by θ_re1e2 :: rLabel(r, e1, e2), i.e. the probability of relation type r being present between the two entity types e1 and e2.
At step 208a3c of the method 200, the one or more hardware processors 104 of the system 100 are configured to provide training query fact parameters to the PLM. The training query fact parameters include: (i) the probability of each entity type annotation being associated with a training latent node of header node type, represented by eAnnot(e, n_ij), i.e. the probability of annotating the jth training latent node of the training latent structure graph of the ith semi-structured document for training with entity type e, and (ii) the probability of each relation type annotation being associated with a training latent path having training latent nodes of header node type, represented by rAnnot(r, p_il), i.e. the probability of annotating the lth training latent path of the training latent structure graph of the ith semi-structured document for training with relation type r.
Further, at step 208a3d of the method 200, the one or more hardware processors 104 of the system 100 are configured to provide training constraints, in terms of rules, to the PLM. The rules include: (i) if a training latent path present in the training latent structure graph of a semi-structured document is annotated with one of the relation types present in the KG schema, then the end training latent nodes of the associated training latent path must be annotated with the respective entity types associated with the corresponding relation type, and (ii) if a training latent path present in the training latent structure graph of a semi-structured document is annotated with one of the relation types present in the KG schema, then the internal training latent nodes present in the associated training latent path must not be annotated with any entity type present in the KG schema.
The end training latent nodes of the associated training latent path are the pair of training latent nodes present at the exterior ends of the training latent path. For example, if A, B, C, D, E are the training latent nodes and A->B->C->D->E is the training latent path, then the training latent nodes A and E are considered to be the end training latent nodes. Similarly, the internal training latent nodes present in the associated training latent path are the training latent nodes present in between the end training latent nodes. In the example of the training latent path A->B->C->D->E, the training latent nodes B, C, D are considered to be the internal training latent nodes.
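As a small illustration of the end/internal node distinction, the exemplary path above can be computed as follows (Python notation, with a path represented as a list of node labels):

def end_nodes(path):
    # For ['A', 'B', 'C', 'D', 'E'] returns ('A', 'E').
    return path[0], path[-1]

def internal_nodes(path):
    # For ['A', 'B', 'C', 'D', 'E'] returns ['B', 'C', 'D'].
    return path[1:-1]

print(end_nodes(['A', 'B', 'C', 'D', 'E']))       # ('A', 'E')
print(internal_nodes(['A', 'B', 'C', 'D', 'E']))  # ['B', 'C', 'D']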
If the training latent paths are represented as lists in the PLM, and for notational convenience the functions head(l), tail(l) and last(l) are assumed for lists, then the rules in the PLM are:
eAnnot(e, nij) := type(t, nij), eLabel(e, t).
rAnnot(r, p) := path(p), eAnnot(e1, head(p)), eAnnot(e2, last(p)), rLabel(r, e1, e2 ).
path(p) := len(p, 2), edge(head(p), last(p)).
path(p) := edge(head(p), head(tail(p))), path(tail(p)), eAnnot(head(tail(p)), -1).
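The rules above are in a probabilistic-logic syntax close to that of ProbLog. As an illustration only, the toy model below grounds a two-node graph and evaluates the annotation queries with ProbLog's Python API; every fact, node identifier and probability in it is invented for this sketch, and the recursive path rule is simplified to direct edges.

from problog.program import PrologString
from problog import get_evaluatable

model = PrologString("""
% grounded facts from one latent structure graph (step 208a3a)
edge(n1, n2).
type(h2, n1).
type(h3, n2).

% uncertain fact parameters (step 208a3b): theta :: eLabel / rLabel
0.9::eLabel(customer, h2).
0.8::eLabel(industry, h3).
0.7::rLabel(serves, customer, industry).

% annotation rules (step 208a3d), restricted to length-2 paths
eAnnot(E, N) :- type(T, N), eLabel(E, T).
rAnnot(R, N1, N2) :- edge(N1, N2), eAnnot(E1, N1), eAnnot(E2, N2),
                     rLabel(R, E1, E2).

query(eAnnot(_, n1)).
query(rAnnot(_, n1, n2)).
""")

# Each query atom comes back with its inferred probability (cf. step 208b2).
for atom, probability in get_evaluatable().create_from(model).evaluate().items():
    print(atom, probability)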
At step 208a3e of the method 200, the one or more hardware processors 104 of the system 100 are configured to compute training uncertain fact parameter value for each of the training uncertain fact parameters using an expectation maximization algorithm, based on the plurality of entity type annotations and the plurality of relation type annotations associated with the plurality of semi-structured documents for training, as evidences, to obtain the trained PLM.
More specifically, the probability of each header node type value present in the training latent structure graph, being the entity type present in the KG schema, is computed with respect to each entity type annotation of the plurality of entity type annotations. Similarly, the probability of each relation type being present between two entity types present in the KG schema is computed with respect to each relation type annotation of the plurality of relation type annotations. The trained PLM includes the trained uncertain fact parameter value for each of the training uncertain fact parameters, after the training.
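One concrete way to realize the EM fitting of step 208a3e is ProbLog's learning-from-interpretations (LFI) mode, in which a learnable probability is declared with the t(_):: prefix and the annotations serve as evidence. The fragment below is a toy sketch assuming that API, with invented atoms rather than the actual training data of the disclosure.

from problog.logic import Term
from problog.program import PrologString
from problog.learning import lfi

model = PrologString("""
t(_)::eLabel(customer, h2).     % theta_et, to be estimated by EM
type(h2, n1).
eAnnot(E, N) :- type(T, N), eLabel(E, T).
""")

# One interpretation per training document: the observed annotations
# (the query facts of step 208a3c) given as evidence.
examples = [[(Term('eAnnot', Term('customer'), Term('n1')), True)]]

score, weights, atoms, iterations, lfi_problem = lfi.run_lfi(model, examples)
print(lfi_problem.get_model())  # the PLM with learned parameter values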
The trained PLM generated at step 208a3 is used to jointly annotate each latent node of the plurality of latent nodes of header node type, and each latent path of the one or more latent paths having two or more latent nodes of header node type, present in the latent structure graph generated by the trained PGM for each semi-structured document of the one or more semi-structured documents to be annotated, with one entity type of the plurality of entity types and one relation type of the plurality of relation types, respectively, present in the KG schema received at step 208a1 of the method 200. Note here that only the latent nodes of header node type are annotated with an entity type of the plurality of entity types, and only the latent paths having latent nodes of header node type are annotated with a relation type of the plurality of relation types.
FIG. 2L illustrates an exemplary flow diagram for jointly annotating each latent node of the plurality of latent nodes of header node type, and each latent path of the plurality of latent paths having latent nodes of header node type, present in the latent structure graph of each semi-structured document to be annotated, with one entity type of the plurality of entity types and one relation type of the plurality of relation types, respectively, present in the KG schema, using the trained PLM, in accordance with some embodiments of the present disclosure.
At step 208b1 of the method 200, the one or more hardware processors 104 of the system 100 are configured to provide the corresponding latent structure graph of the semi-structured document to be annotated in terms of grounded facts. The grounded facts include (i) the grounded facts related to each latent edge present between two latent nodes, represented by edge(n_ij, n_ij’) = 1 if there is a latent edge from n_ij to n_ij’ or from n_ij’ to n_ij in the latent structure graph of the ith semi-structured document to be annotated, and (ii) the grounded facts related to the node type value for each latent node, represented by type(t_ij, t) = 1 if the jth latent node of the latent structure graph of the ith semi-structured document to be annotated has node type value t. The node type value for each latent node is one of: (i) the header node type value and (ii) the value node type value.
At step 208b2 of the method 200, the one or more hardware processors 104 of the system 100 are configured to infer the probability of each relation type annotation, i.e. to calculate rAnnot(r, p_il) for each latent path having latent nodes of header node type, present in the latent structure graphs, and for each relation type present in the KG schema, to: (i) annotate the associated latent path with the relation type annotation having the maximum probability; and (ii) annotate the end latent nodes of header node type present in the associated latent path with the entity type annotations corresponding to the associated relation type annotation. The end latent nodes of the associated latent path are the pair of latent nodes present at the exterior ends of the latent path, as mentioned at step 208a3d of the method 200.

FIG. 4 shows the annotation of the one or more latent nodes and the one or more latent paths present in the latent structure graph of the two exemplary semi-structured documents to be annotated, shown in FIG. 3A, with the one or more entity types and the one or more relation types present in the KG schema, in accordance with some embodiments of the present disclosure. In FIG. 4, the top portion shows the exemplary KG schema, where the nodes in ovals, such as ‘Customer’, ‘Industry’, and so on, represent the entity types, and the relations between the entity types, such as ‘shares’, ‘has’, and so on, represent the relation types. The bottom portion shows the header sub-graph (the latent nodes of header node type and the corresponding latent edges) of the latent structure graph of the two exemplary semi-structured documents to be annotated. Note here that the header sub-graph of the latent structure graph of each of the two exemplary semi-structured documents to be annotated is the same, because their DST is the same. Further, the latent nodes with node types ‘H2’, ‘H3’ and ‘H4’ are annotated with the entity types ‘Customer’, ‘Industry’, and ‘Problems’ respectively. Moreover, the latent edge from the latent node with node type ‘H2’ to the latent node with node type ‘H3’ is annotated with the relation type ‘Serves’, and the latent path having nodes with node types ‘H2’, ‘H1’ and ‘H4’ is mapped to the ‘Faces’ relation type.
In general, the KG schema received at step 208a1 of the method 200 may be incomplete and may not include all the entity types and all the relation types. If the KG schema is incomplete, the annotation of some latent nodes and some latent paths is not possible, as the respective entity types and relation types are not available. Hence the KG schema has to be updated such that it includes all the entity types and all the relation types that are sufficient for the annotation.
FIG. 2M illustrates an exemplary flow diagram for discovering one or more new entity types and one or more new relation types for annotation, in accordance with some embodiments of the present disclosure. At step 208c1 of the method 200, the one or more hardware processors 104 of the system 100 are configured to identify one or more latent nodes, out of the plurality of latent nodes, having the same header node type value but not annotated with any of the entity types present in the KG schema. The step 208c1 is used to identify one or more missing latent nodes that cannot be annotated with any entity type present in the KG schema received at step 208a1 of the method 200.
At step 208c2 of the method 200, the one or more hardware processors 104 of the system 100 are configured to discover a new entity type for the identified one or more latent nodes having the same header node type value, if the number of the identified one or more latent nodes having the same header node type value, identified at step 208c1 of the method 200, is greater than a first threshold value. In an embodiment, the first threshold value may be 0.2 times the total number of latent nodes present in the latent structure graphs of the semi-structured documents to be annotated.
At step 208c3 of the method 200, the one or more hardware processors 104 of the system 100 are configured to annotate the identified one or more latent nodes having the same header node type value, identified at step 208c1 of the method 200, with the discovered new entity type obtained at step 208c2. The discovered new entity type obtained at step 208c2 is added to the KG schema to obtain the updated KG schema.
Similarly, at step 208c4 of the method 200, the one or more hardware processors 104 of the system 100 are configured to identify one or more latent paths having the same header node type value sequence, but not annotated with any of the relation types present in the KG schema, and whose internal latent nodes are not annotated with any entity type present in the KG schema. The header node type value sequence represents the sequence of header node type values along the path. For example, if H1, H2, H3, H4 are the header node type values, then an exemplary header node type value sequence may be H1->H2->H3->H4. The step 208c4 is used to identify one or more missing latent paths that cannot be annotated with any relation type present in the KG schema received at step 208a1 of the method 200.
At step 208c5 of the method 200, the one or more hardware processors 104 of the system 100 are configured to discover a new relation type for the associated header node type value sequence obtained at step 208c4 of the method 200, if the number of the identified one or more latent paths having the same header node type value sequence is greater than a second threshold value. In an embodiment, the second threshold value may be 0.2 times the total number of latent edges present in the latent structure graphs of the semi-structured documents to be annotated. In an embodiment, the second threshold value may be the same as or different from the first threshold value mentioned at step 208c2.
At step 208c6 of the method 200, one or more hardware processors 104 of the system 100 are configured to annotate the identified one or more latent paths having the same header node type value sequence, identified at step 208c4 of the method 200, with the discovered new relation type, obtained at step 208c5. The discovered new relation type obtained at step 208c5 is added to the KG schema to obtain the updated KG schema.
At step 208c7 of the method 200, the one or more hardware processors 104 of the system 100 are configured to repeat the steps 208c1 through 208c6, until (i) no new entity type and (ii) no new relation type is discovered. From FIG. 4, the discovered new entity type ‘Industry’ is missing in the KG schema received at step 208a1 of the method 200. Also, the newly discovered relation type ‘Solves’ is missing in the KG schema received at step 208a1 of the method 200. After completion of the step 208c7 of the method 200, the completed KG schema may have all the entity types and all the relation types that are sufficient for the annotation of the given semi-structured documents. Further, all the latent nodes and all the latent paths present in the latent structure graph of the given semi-structured document to be annotated are annotated using the respective entity types and relation types out of all the entity types and relation types present in the completed KG schema.
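The discovery loop above is essentially threshold-based grouping. A minimal sketch for the entity side (steps 208c1 through 208c3) follows, with assumed data shapes and an invented naming scheme for new types; the relation side follows the same pattern, keyed on the header node type value sequence of each unannotated latent path instead of a single header node type value.

from collections import defaultdict

def discover_new_entity_types(header_nodes, annotations, fraction=0.2):
    # header_nodes: list of (node_id, header_type_value) pairs across the
    # latent structure graphs; annotations: node_id -> entity type or None.
    unannotated = defaultdict(list)
    for node_id, header_type in header_nodes:              # step 208c1
        if annotations.get(node_id) is None:
            unannotated[header_type].append(node_id)
    threshold = fraction * len(header_nodes)               # first threshold value
    discovered = {}
    for header_type, nodes in unannotated.items():
        if len(nodes) > threshold:                         # step 208c2
            new_entity = 'NEW_ENTITY_' + str(header_type)  # invented naming scheme
            discovered[new_entity] = nodes                 # step 208c3: annotate group
    return discovered   # new entity types to be added to the KG schema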
In accordance with the present disclosure, the methods and systems effectively annotate the semi-structured documents based on the KG schema, using the combination of the PGM and PLM models. The trained PGM effectively recovers the latent structure, in the form of the latent structure graph, along with the latent nodes missing with respect to the observed structure. The trained PLM effectively and jointly annotates the latent nodes (header nodes) and the latent paths present in the recovered latent structure using the entity types and the relation types, respectively, present in the KG schema. Further, the present disclosure discovers new entity types and new relation types that are missing in the given KG schema. Hence the complete KG schema is generated with all possible entity types and relation types for the domain. As a result, all the latent nodes and the latent paths present in the latent structure graph are annotated without missing any annotations. The annotation of the semi-structured documents has the potential to significantly improve performance in downstream applications such as information retrieval, question answering, and so on.
Though the present disclosure describes annotating semi-structured documents such as web-pages, electronic reports, slide decks, and so on, the scope of the present disclosure is not limited to these and extends to web tables and any other semi-structured documents.

Example scenario:
To validate the performance of the present disclosure, the performance of the proposed invention (hereinafter referred to as ‘PGM-PLM’) for the annotation of the semi-structured documents is compared against conventional state-of-the-art baselines.
Dataset: The three real datasets below, of different types, are used for evaluating the performance of the ‘PGM-PLM’:
Web Complex: This dataset is created from the website of a large multinational company and has all the discussed complexities of semi-structured documents. It includes 640 webpages following 6 DSTs. The HTML of the webpages is converted to a DAG structure containing the locations of the text units. The DAG structure of each webpage contains 6-18 nodes (latent nodes). Each webpage is annotated with a gold-standard structure. The latent nodes are annotated using 40 node types (20 header nodes, 20 value nodes). The KG schema associated with the large multinational company includes 11 entity types and 13 relation types. The latent nodes and the latent paths in the gold-standard structure are annotated with entity types and relation types respectively. Each webpage has 3-6 entity type annotations and 2-6 relation type annotations. 15% of the pages are used for training, 15% for validation and 70% for the actual testing.
SWDE: This is a benchmark webpage dataset for entity annotation discussed in the paper Lockard, C., Dong, X.L., Einolghozati, A., Shiralkar, P.: Ceres: Distantly supervised relation extraction from the semi-structured web. Proc. VLDB Endow. The SWDE dataset includes gold annotations of 4-5 predicates of a single entity type for 4 categories, with 10 websites in each category and 200-2000 pages for each website. The KG schema essentially has a simple star shape, with a central entity type and 4-5 secondary entity types. The train-test setting specified in the paper Gulhane, P., Madaan, A., Mehta, R., Ramamirtham, J., Rastogi, R., Satpal, S., Sengamedu, S., Tengli, A., Tiwari, C.: Web-scale information extraction with vertex, is considered. The document object model (DOM) tree structure of each webpage is converted to a DAG structure with only the text units as nodes. This dataset does not differentiate between observed and latent structure graphs and does not identify DSTs.
Tab Complex: This dataset contains real complex tables from the pharmaceuticals industry. It is used to demonstrate that the proposed invention is not restricted to webpages but can handle document structures with only the spatial locations of text units and no associated HTML. This dataset has 76 tables of 9 DSTs. The number of tables for each type ranges from 4 to 14. The tables are complex, with a 2-level hierarchy in the value cells within some columns. The total number of cells per table ranges from 13 to 8292, corresponding to 36 different cell types. The latent structure graph is the same as the observed structure graph. The KG schema has 18 entity types and 26 relation types. Each table contains gold-standard entity and relation type annotations. A table has 2-5 entity annotations and 1-6 relation annotations. 15% of the data is used for the training, 15% for the validation, and the remainder for the testing.
Baselines: The baselines of the art below are used for evaluating the performance of the ‘PGM-PLM’. The model hyper-parameters of the ‘PGM-PLM’, namely the Dirichlet Process priors a_t, a_h and a_v, and the entity and relation discovery thresholds, are tuned using the validation sets.
CERES: The CERES is trained using distant supervision, considers HTML features for annotation, does not recover latent structure, and only handles a star schema. The performance of the CERES for the SWDE dataset is reported in the paper Lockard, C., Dong, X.L., Einolghozati, A., Shiralkar, P.: Ceres: Distantly supervised relation extraction from the semi-structured web. Proc. VLDB Endow. For the Web Complex and Tab Complex datasets, the following adaptation of the CERES is considered. A best-possible star approximation of the schema and the gold-standard document structure are provided as input. The entity annotation classifier only uses text-based features, since HTML features are not available for the Web Complex and Tab Complex datasets. The CERES uses the same training data as the other algorithms.
XTPath: Since the CERES uses distant supervision rather than the actual training data for the SWDE dataset, the performance of XTPath, mentioned in the paper Cohen, J.P., Ding, W., Bagherjeiran, A.: Semi-supervised web wrapper repair via recursive tree matching. CoRR (2015), is also reported; XTPath uses the same training data as the other algorithms used here.
Limaye: This is a traditional baseline for the annotation of simple tables (Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow.) that performs joint inference using a probabilistic graphical model. This approach does not recognize DSTs or latent structures. Since this body of work does not consider complex tables, the datasets are converted to simple tables as faithfully as possible. A complex table is defined as a tree where each leaf node is a simple table, and each internal node has the structure of a simple table but each cell points to another internal or leaf node. Notably, sibling tables (internal or leaf) do not necessarily have the same dimensions. Such tables are converted to simple tables bottom-up. An internal table is simplified only when all its children are simple tables. The children of an internal table are ordered and merged sequentially and pairwise. The merge operation takes all combinations of the rows of the two tables, and the columns from both.
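The simplification just described can be sketched as follows; this is one reading of the conversion (tables modeled as lists of rows, internal nodes as dicts from cells to subtrees), not the authors' code.

from itertools import product

def merge_simple_tables(left, right):
    # All combinations of the rows of the two tables, columns from both.
    return [lrow + rrow for lrow, rrow in product(left, right)]

def simplify(table):
    # Bottom-up simplification of a complex table (sketch). A leaf is a
    # simple table (a list of rows, each row a list of cells); an internal
    # node is a dict mapping its own cells to child subtrees. An internal
    # table is simplified only once all of its children are simple.
    if isinstance(table, list):
        return table
    merged = None
    for cell, child in table.items():       # ordered cells, merged pairwise
        rows = [[cell] + row for row in simplify(child)]
        merged = rows if merged is None else merge_simple_tables(merged, rows)
    return merged

# Toy example: an internal table with two cells, each pointing to a leaf.
complex_table = {'Q1': [['a', 1], ['b', 2]], 'Q2': [['x'], ['y']]}
print(simplify(complex_table))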
Annotation Experiments: Table 1 shows the results for the entity (E) and the relation (R) annotations for the three datasets. The performance is reported using macro-averaged F1, which is the average of the F1 scores for the different classes (entity types or relation types). For Web Complex and Tab Complex, the results are from the experiments defined above. For the SWDE dataset, the performance of CERES and XTPath is taken from the CERES paper [Lockard, C., Dong, X.L., Einolghozati, A., Shiralkar, P.: Ceres: Distantly supervised relation extraction from the semi-structured web. Proc. VLDB Endow.], which does not report relation annotation performance separately. Limaye is defined only for tables and cannot annotate HTML structures such as those in the SWDE dataset.
Model     | Web Complex | Tab Complex | SWDE Movie  | SWDE NBA Player | SWDE University | SWDE Book
          |  E     R    |  E     R    |  E     R    |  E     R        |  E     R        |  E     R
Limaye    | 0.78  0.73  | 0.95  0.93  |  -     -    |  -     -        |  -     -        |  -     -
CERES     | 0.65  0.43  | 0.71  0.60  | 0.99   -    | 0.98   -        | 0.94   -        | 0.76   -
XTPath    |  -     -    |  -     -    | 0.94   -    | 0.98   -        | 0.98   -        | 0.97   -
PGM-PLM   | 0.98  0.96  | 0.99  0.99  | 0.99  0.99  | 1.00  1.00      | 0.99  0.99      | 0.98  0.97
Table 1: Entity (E) and relation (R) annotation performance (macro-averaged F1).
Observing the results for the Web Complex dataset more closely, which has the most complex structure and KG schema, the PGM-PLM significantly outperforms both the Limaye and the CERES. This reflects the usefulness of structure recovery. The Limaye performs better than the CERES since it is able to handle arbitrary schemas, while the CERES takes in a star-approximation. Unlike the Web Complex dataset, the Tab Complex dataset does not have any latent structure different from the observed structure. As a result, the performance of the Limaye and the CERES improves, with the CERES still lagging the Limaye, again because of the star-approximation of the KG schema. However, the PGM-PLM is still better than both of these. The smaller gap between the PGM-PLM and the Limaye also reflects the lower structural complexity of the Tab Complex dataset compared to the Web Complex dataset. For the SWDE dataset, the CERES performs very well due to the simple nature of the KG schema and the lack of complexity in the document structure. However, the PGM-PLM is still able to outperform the CERES. Note that for the SWDE dataset, the CERES uses distant supervision. Hence, the performance of XTPath is also included, which is the best supervised model in the CERES paper and uses the same training data as the PGM-PLM. The PGM-PLM outperforms the XTPath as well.
Document Structure Recovery: Next, the performance for document structure recovery is observed. Specifically, the goodness of parent node identification, node type identification (at the secondary level), and DST identification is evaluated. Note that parent node identification is a binary classification task, so it is evaluated using F1. The node type identification and the DST identification are clustering tasks, with no supervision provided during the training; accordingly, normalized mutual information (NMI) over pair-wise clustering decisions is used for the evaluation. The other baselines do not address this task; therefore, ablations of the PGM-PLM are compared. This task does not involve the PLM stage, so the ablations are of the PGM. Ablation -T does not use the textual content, -L does not use the locations of the text units, and -S does not use the structural patterns between parent and child types as captured by the parameter ß. The performance is reported only for the Web Complex dataset, since it has gold standards for all of these aspects of the structure. Table 2 shows the structure recovery performance for the ablations of the PGM-PLM.
Model     | Parent (F1) | Node Type (NMI) | DST (NMI)
-T        |    0.61     |      0.48       |   0.2
-L        |    0.87     |      0.84       |   0.94
-S        |    0.65     |      0.83       |   0.97
PGM-PLM   |    0.98     |      0.95       |   0.99
Table 2: Structure recovery performance for ablations of the PGM-PLM (Web Complex dataset).
From Table 2, the importance of each of these aspects for structure recovery is seen. The PGM is able to recover the structure almost perfectly. Not surprisingly, the text content has the biggest individual impact, most of all on the DST. The structural pattern has the next biggest impact on parent identification and node types, but not on the DST. This implies that for the DST, once the node types, their content and location are known, the structural pattern does not provide much additional information. Of the three aspects, the location seems to have the smallest individual impact, but still results in a significant difference in performance.
New entity and new relation discovery: Further, the performance for new entity and new relation discovery is reported. In this experiment, some entities and all their relations are hidden from the training schema and the training annotations. How well occurrences of these entities and relations are recognized in the test documents is evaluated. This is again a clustering task, since these entity and relation labels are not seen during the training; accordingly, the performance is again evaluated using NMI. Note that the other baselines cannot address this task, so only the PGM-PLM is evaluated. Since the SWDE dataset has a simple star schema to begin with, only the Web Complex dataset and the Tab Complex dataset are evaluated.
% Hidden  | Web Complex | Tab Complex
          |  E     R    |  E     R
30        | 0.97  0.95  | 0.98  0.96
60        | 0.96  0.94  | 0.89  0.87
100       | 0.95  0.94  | 0.87  0.82
Table 3: New entity (E) and relation (R) discovery performance (NMI) for varying percentages of hidden entities.
Table 3 shows the performance for varying percentages of hidden entities during the training. The performance is extremely robust for the Web Complex dataset. The entities and the relations are detected very accurately, based on the document structure properties, even when no initial schema is provided. Even for the Tab Complex dataset, the NMI is higher than 0.80 for both the entities and the relations.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The experimental results also demonstrate that the methods and systems of the present disclosure effectively annotate the semi-structured documents based on the KG schema, using the combination of the PGM and PLM models. Further, the present disclosure also discovers new entity types and new relation types that are missing in the given KG schema. Hence the complete KG schema is generated with all possible entity types and relation types for the domain. As a result, all the latent nodes and the latent paths present in the latent structure graph are annotated without missing any annotations.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims (when included in the specification), the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.

Documents

Application Documents

# Name Date
1 202121010546-STATEMENT OF UNDERTAKING (FORM 3) [12-03-2021(online)].pdf 2021-03-12
2 202121010546-REQUEST FOR EXAMINATION (FORM-18) [12-03-2021(online)].pdf 2021-03-12
3 202121010546-FORM 18 [12-03-2021(online)].pdf 2021-03-12
4 202121010546-FORM 1 [12-03-2021(online)].pdf 2021-03-12
5 202121010546-FIGURE OF ABSTRACT [12-03-2021(online)].jpg 2021-03-12
6 202121010546-DRAWINGS [12-03-2021(online)].pdf 2021-03-12
7 202121010546-DECLARATION OF INVENTORSHIP (FORM 5) [12-03-2021(online)].pdf 2021-03-12
8 202121010546-COMPLETE SPECIFICATION [12-03-2021(online)].pdf 2021-03-12
9 202121010546-Proof of Right [24-06-2021(online)].pdf 2021-06-24
10 202121010546-FORM-26 [14-10-2021(online)].pdf 2021-10-14
11 Abstract1.jpg 2022-02-17
12 202121010546-FER.pdf 2023-01-10
13 202121010546-OTHERS [15-05-2023(online)].pdf 2023-05-15
14 202121010546-FER_SER_REPLY [15-05-2023(online)].pdf 2023-05-15
15 202121010546-COMPLETE SPECIFICATION [15-05-2023(online)].pdf 2023-05-15
16 202121010546-CLAIMS [15-05-2023(online)].pdf 2023-05-15
17 202121010546-US(14)-HearingNotice-(HearingDate-13-03-2025).pdf 2025-02-24
18 202121010546-Correspondence to notify the Controller [07-03-2025(online)].pdf 2025-03-07
19 202121010546-FORM-26 [08-03-2025(online)].pdf 2025-03-08
20 202121010546-FORM-26 [08-03-2025(online)]-1.pdf 2025-03-08
21 202121010546-Written submissions and relevant documents [26-03-2025(online)].pdf 2025-03-26
22 202121010546-PatentCertificate26-03-2025.pdf 2025-03-26
23 202121010546-IntimationOfGrant26-03-2025.pdf 2025-03-26

Search Strategy

1 SearchStrategyE_02-01-2023.pdf
2 SearchHistoryamended(13)AE_25-06-2024.pdf

ERegister / Renewals

3rd: 19 Jun 2025

From 12/03/2023 - To 12/03/2024

4th: 19 Jun 2025

From 12/03/2024 - To 12/03/2025

5th: 19 Jun 2025

From 12/03/2025 - To 12/03/2026