
Methods And Systems For Annotating Tables With Entities And Entity Types

Abstract: The disclosure herein relates to methods and systems for annotating tables present in a webpage or in a document with entities and entity types. Conventional approaches for table annotation either capture only the syntactic structure of the tables, or use embeddings of the table elements without accounting for the complete structure and without end-to-end training. The present disclosure builds an efficient end-to-end model for annotating the tables by integrating the complete structure of the tables, the knowledge elements, and the available annotations, using a graph convolution network (GCN). The embeddings of the table elements and the knowledge elements are trained jointly using the available annotations, including the entity annotations and the entity type annotations. The trained embeddings enable the end-to-end model to annotate the remaining unannotated table elements with the available or newly discovered entities and entity types. [To be published with FIG. 4]


Patent Information

Application #:
Filing Date: 29 October 2020
Publication Number: 24/2022
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: kcopatents@khaitanco.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2025-06-30
Renewal Date:

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai Maharashtra India 400021

Inventors

1. PRAMANICK, Aniket
Tata Consultancy Services Limited Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal India 700160
2. BHATTACHARYA, Indrajit
Tata Consultancy Services Limited Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal India 700160

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHODS AND SYSTEMS FOR ANNOTATING TABLES WITH ENTITIES
AND ENTITY TYPES
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description: The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The disclosure herein generally relates to the field of table annotation, and, more particularly, to methods and systems for annotating tables with entities and entity types using a graph convolutional network.
BACKGROUND
[002] Tables, especially web-tables, are used in webpages and electronic documents to present data in a structured format. A plurality of entities and entity types may be available as background knowledge in the form of knowledge graphs, such as Yago, DBPedia, Freebase, and so on. Annotating table elements such as columns, cells, and rows using the available background knowledge in the form of entities and entity types helps in better understanding and semantic interpretation of the tables. Since tables use their structure to present facts in a significantly clearer form compared to unstructured text, table annotation has the potential to significantly improve performance in downstream applications such as information retrieval, question answering, and so on.
[003] Conventional approaches for table annotation are limited to either capturing the syntactic structure of the tables using probabilistic graphical models or using embeddings of the table elements without accounting for the complete structure. Further, capturing the syntactic structure of the tables using the probabilistic graphical models requires specification of hand-crafted semantic features, which is a challenging and time-consuming task. Also, the approaches that use embeddings of the table elements largely discard the syntactic structure of the tables.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
[005] In an aspect, there is provided a processor-implemented method for annotating one or more tables with entities and entity types, the method comprising the steps of: receiving the one or more tables, one or more entities and one or more entity types associated with the one or more tables, background knowledge elements comprising one or more observed entities and one or more observed entity types present in the one or more tables, and one or more available annotations for each table in the one or more tables, wherein each table of the one or more tables comprises one or more columns and one or more rows with individual cells, each column of each table corresponds to an entity type of the one or more entity types and each individual cell of each table corresponds to an entity of the one or more entities, each entity of the one or more entities is associated with the entity type of the one or more entity types, and the one or more available annotations for each table comprise one or more available entity annotations and one or more available entity type annotations; constructing a graph from the one or more tables, the background knowledge elements and the one or more available annotations for each table in the one or more tables, to obtain a constructed graph, wherein the constructed graph comprises one or more table nodes for each table in the one or more tables, one or more knowledge nodes, and one or more edges; building an end-to-end model from the constructed graph using a graph convolution network (GCN), wherein building the end-to-end model comprises: initiating one or more model parameters of the GCN, randomly; generating (i) an embedding for each table node of the one or more table nodes for each table to obtain table embeddings for the one or more table nodes, and (ii) an embedding for each knowledge node of the one or more knowledge nodes to obtain knowledge embeddings for the one or more knowledge nodes, using the GCN, wherein the table embeddings comprise an embedding of each row, each column, each cell, and a self-table, and the knowledge embeddings comprise an embedding for each observed entity and each observed entity type; obtaining (i) an entity type embedding for each column node, by projecting the associated column embedding to an entity type space using an entity type projection matrix, and (ii) an entity embedding for each cell node, by projecting the associated cell embedding to an entity space using an entity projection matrix; obtaining (i) a probability distribution over the one or more observed entity types by using a soft-max layer on associated entity type embeddings, and (ii) the probability distribution over the one or more observed entities by using the soft-max layer on associated entity embeddings; learning (i) entity type parameters for each entity type embedding from the corresponding entity type projection matrix and the corresponding soft-max layer, and (ii) entity parameters for each entity embedding from the corresponding entity projection matrix and the corresponding soft-max layer; and building the end-to-end model by training the GCN with (i) the one or more model parameters, (ii) the entity type parameters for each entity type embedding, and (iii) the entity parameters for each entity embedding, and based on the one or more available entity annotations and the one or more available entity type annotations; and annotating each table of the one or more tables, using the built end-to-end model.
[006] In another aspect, there is provided a system for annotating one or more tables with entities and entity types, the system comprising: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: receive the one or more tables, one or more entities and one or more entity types associated with the one or more tables, background knowledge elements comprising one or more observed entities and one or more observed entity types present in the one or more tables, and one or more available annotations for each table in the one or more tables, wherein each table of the one or more tables comprises one or more columns and one or more rows with individual cells, each column of each table corresponds to an entity type of the one or more entity types and each individual cell of each table corresponds to an entity of the one or more entities, each entity of the one or more entities is associated with the entity type of the one or more entity types, and the one or more available annotations for each table comprise one or more available entity annotations and one or more available entity type annotations; construct a graph from the one or more tables, the background knowledge elements and the one or more available annotations for each table in the one or more tables, to obtain a constructed graph, wherein the constructed graph comprises one or more table nodes for each table in the one or more tables, one or more knowledge nodes, and one or more edges; build an end-to-end model from the constructed graph using a graph convolution network (GCN), wherein the end-to-end model is built by: initiating one or more model parameters of the GCN, randomly; generating (i) an embedding for each table node of the one or more table nodes for each table to obtain table embeddings for the one or more table nodes, and (ii) an embedding for each knowledge node of the one or more knowledge nodes to obtain knowledge embeddings for the one or more knowledge nodes, using the GCN, wherein the table embeddings comprise an embedding of each row, each column, each cell, and a self-table, and the knowledge embeddings comprise an embedding for each observed entity and each observed entity type; obtaining (i) an entity type embedding for each column node, by projecting the associated column embedding to an entity type space using an entity type projection matrix, and (ii) an entity embedding for each cell node, by projecting the associated cell embedding to an entity space using an entity projection matrix; obtaining (i) a probability distribution over the one or more observed entity types by using a soft-max layer on associated entity type embeddings, and (ii) the probability distribution over the one or more observed entities by using the soft-max layer on associated entity embeddings; learning (i) entity type parameters for each entity type embedding from the corresponding entity type projection matrix and the corresponding soft-max layer, and (ii) entity parameters for each entity embedding from the corresponding entity projection matrix and the corresponding soft-max layer; and building the end-to-end model by training the GCN with (i) the one or more model parameters, (ii) the entity type parameters for each entity type embedding, and (iii) the entity parameters for each entity embedding, and based on the one or more available entity annotations and the one or more available entity type annotations; and annotate each table of the one or more tables using the built end-to-end model.
[007] In yet another aspect, there is provided a computer program product comprising a non-transitory computer readable medium having a computer readable program embodied therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: receive the one or more tables, one or more entities and one or more entity types associated with the one or more tables, background knowledge elements comprising one or more observed entities and one or more observed entity types present in the one or more tables, and one or more available annotations for each table in the one or more tables, wherein each table of the one or more tables comprises one or more columns and one or more rows with individual cells, each column of each table corresponds to an entity type of the one or more entity types and each individual cell of each table corresponds to an entity of the one or more entities, each entity of the one or more entities is associated with the entity type of the one or more entity types, and the one or more available annotations for each table comprise one or more available entity annotations and one or more available entity type annotations; construct a graph from the one or more tables, the background knowledge elements and the one or more available annotations for each table in the one or more tables, to obtain a constructed graph, wherein the constructed graph comprises one or more table nodes for each table in the one or more tables, one or more knowledge nodes, and one or more edges; build an end-to-end model from the constructed graph using a graph convolution network (GCN), wherein the end-to-end model is built by: initiating one or more model parameters of the GCN, randomly; generating (i) an embedding for each table node of the one or more table nodes for each table to obtain table embeddings for the one or more table nodes, and (ii) an embedding for each knowledge node of the one or more knowledge nodes to obtain knowledge embeddings for the one or more knowledge nodes, using the GCN, wherein the table embeddings comprise an embedding of each row, each column, each cell, and a self-table, and the knowledge embeddings comprise an embedding for each observed entity and each observed entity type; obtaining (i) an entity type embedding for each column node, by projecting the associated column embedding to an entity type space using an entity type projection matrix, and (ii) an entity embedding for each cell node, by projecting the associated cell embedding to an entity space using an entity projection matrix; obtaining (i) a probability distribution over the one or more observed entity types by using a soft-max layer on associated entity type embeddings, and (ii) the probability distribution over the one or more observed entities by using the soft-max layer on associated entity embeddings; learning (i) entity type parameters for each entity type embedding from the corresponding entity type projection matrix and the corresponding soft-max layer, and (ii) entity parameters for each entity embedding from the corresponding entity projection matrix and the corresponding soft-max layer; and building the end-to-end model by training the GCN with (i) the one or more model parameters, (ii) the entity type parameters for each entity type embedding, and (iii) the entity parameters for each entity embedding, and based on the one or more available entity annotations and the one or more available entity type annotations; and annotate each table of the one or more tables using the built end-to-end model.
[008] In an embodiment, annotating each table of the one or more tables using the built end-to-end model comprises: determining whether one or more unannotated columns correspond to the one or more observed entity types and one or more unannotated individual cells correspond to one or more observed entities, to obtain (i) one or more novel unannotated columns and one or more novel unannotated individual cells, and (ii) one or more non-novel unannotated columns and one or more non-novel unannotated individual cells; annotating the one or more non-novel unannotated columns with the one or more observed entity types and the one or more non-novel unannotated individual cells with the one or more observed entities, using the built end-to-end model through a forward propagation; discovering one or more new entity types for the one or more novel unannotated columns by clustering the one or more entity type embeddings of the columns and assigning a new entity type to each cluster, and one or more new entities for the one or more novel unannotated individual cells by clustering the one or more entity embeddings of the cells and assigning a new entity to each cluster, using a clustering technique; and annotating the one or more novel unannotated columns with the one or more new entity types, and the one or more novel unannotated individual cells with the one or more new entities.
[009] In an embodiment, the one or more observed entities are those entities, of the one or more entities, that are present in at least one table of the one or more tables, and the one or more observed entity types are those entity types, of the one or more entity types, that are present in at least one table of the one or more tables.
[010] In an embodiment, the one or more available entity annotations correspond to the annotation of each individual cell present in the corresponding table with an observed entity of the one or more observed entities, and the one or more available entity type annotations correspond to the annotation of each column present in the corresponding table with an observed entity type of the one or more observed entity types.
[011] In an embodiment, the one or more table nodes for each table comprise a column node for each column, a row node for each row, a cell node for each cell, and a self-table node for the corresponding table.
[012] In an embodiment, the one or more knowledge nodes comprise an observed entity node for each observed entity of the one or more observed entities, and an observed entity type node for each observed entity type of the one or more observed entity types.
[013] In an embodiment, the one or more edges comprise one or more table edges for each table, one or more knowledge edges, and one or more annotation edges.
[014] In an embodiment, the one or more table edges for each table comprise (i) a cell-column edge between each cell node and a corresponding column node, (ii) a cell-row edge between each cell node and a corresponding row node, (iii) a column-table edge between each column node and a corresponding self-table node, and (iv) a row-table edge between each row node and the corresponding self-table node.

[015] In an embodiment, the one or more knowledge edges include an observed entity-observed entity type edge between each observed entity node and a corresponding observed entity type node.
[016] In an embodiment, the one or more annotation edges comprise (i) an entity annotation edge for each available entity annotation between a cell node and an observed entity node, and (ii) an entity type annotation edge for each available entity type annotation between a column node and an observed entity type node.
[017] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the embodiments of the present disclosure, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[018] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[019] FIG. 1 is an exemplary block diagram of a system for annotating tables with entities and entity types, in accordance with some embodiments of the present disclosure.
[020] FIG. 2A through FIG. 2B illustrate exemplary flow diagrams of a processor-implemented method for annotating tables with entities and entity types, in accordance with some embodiments of the present disclosure.
[021] FIG. 3 shows a schematic diagram of an exemplary constructed graph for an exemplary table, in accordance with some embodiments of the present disclosure.
[022] FIG. 4 shows an architecture of a graph convolution network for annotating tables with entities and entity types, in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS
[023] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[024] Annotation of tables, especially web-tables that are present in webpages, documents, and so on, using knowledge graphs of known entities and entity types, is useful for many applications such as information retrieval, question answering, and so on. However, the web-tables may not adhere to any standard format, schema, or convention. Additionally, the knowledge graphs may typically be incomplete, as the entities and the entity types associated with the web-tables may not always exist in the knowledge graphs. Therefore, annotating the web-tables with incomplete knowledge graphs is quite challenging and often not effective.
[025] Conventional approaches for annotating the web-tables generally fall into two categories. The first category recognizes the importance of capturing the syntactic structure of the web-tables for the annotation and may use graphical models for capturing the syntactic structure and performing joint inference for entity and entity type annotations. However, capturing the syntactic structure requires specification of hand-crafted semantic features based on domain knowledge, which is a tedious and time-consuming task. The second category transforms table elements such as columns, cells, and rows into embeddings (word sequences) to circumvent the feature construction problem, but without accounting for the syntactic structure of the web-tables. Further, the conventional approaches for annotating the web-tables are not trained end-to-end with the available annotations.

[026] The present disclosure herein provides methods and systems for annotating tables with entities and entity types that solve the above technical problems by building an efficient end-to-end model for annotating the web-tables. The present disclosure integrates the complete structure of the web-tables, the knowledge elements, and the available annotations to build the end-to-end model for annotating the web-tables, using a graph convolution network (GCN). The embeddings of the table elements and the knowledge elements are trained jointly using the available annotations, including the entity annotations and the entity type annotations. The trained embeddings enable the end-to-end model to annotate the remaining unannotated table elements with the available or newly discovered entities and entity types.
[027] In the present disclosure, the expressions 'tables' and 'one or more tables' are used interchangeably based on the context; however, such expressions refer to the tables present in the webpage or in the document, which are to be annotated. Further, the expression 'table elements' refers to the elements present in the table, such as rows, columns, and cells (or 'individual cells').
[028] Referring now to the drawings, and more particularly to FIG. 1 through FIG. 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary systems and/or methods.
[029] FIG. 1 is an exemplary block diagram of a system for annotating tables with entities and entity types, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes or is otherwise in communication with one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more hardware processors 104, the memory 102, and the I/O interface(s) 106 may be coupled to a system bus 108 or a similar mechanism.

[030] The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a plurality of sensor devices, a printer, and the like. Further, the I/O interface(s) 106 may enable the system 100 to communicate with other devices, such as web servers and external databases.
[031] The I/O interface(s) 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For this purpose, the I/O interface(s) 106 may include one or more ports for connecting a number of computing systems or devices to one another or to another server.
[032] The one or more hardware processors 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, portable computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[033] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 includes a plurality of modules 102a and a repository 102b for storing data processed, received, and generated by one or more of the plurality of modules 102a. The plurality of modules 102a may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
[034] The plurality of modules 102a may include programs or computer-readable instructions or coded instructions that supplement applications or functions performed by the system 100. The plurality of modules 102a may also be used as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 102a can be implemented by hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. In an embodiment, the plurality of modules 102a can include various sub-modules (not shown in FIG. 1). Further, the memory 102 may include information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure.
[035] The repository 102b may include a database or a data engine. Further, the repository 102b, amongst other things, may serve as a database or include a plurality of databases for storing the data that is processed, received, or generated as a result of the execution of the plurality of modules 102a. Although the repository 102b is shown internal to the system 100, it will be noted that, in alternate embodiments, the repository 102b can also be implemented external to the system 100, where the repository 102b may be stored within an external database (not shown in FIG. 1) communicatively coupled to the system 100. The data contained within such an external database may be periodically updated. For example, new data may be added into the external database and/or existing data may be modified and/or non-useful data may be deleted from the external database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory or a Relational Database Management System (RDBMS). In another embodiment, the data stored in the repository 102b may be distributed between the system 100 and the external database.
[036] Referring to FIG. 2A through FIG. 2B, components and functionalities of the system 100 are described in accordance with an example embodiment of the present disclosure. For example, FIG. 2A through FIG. 2B illustrate exemplary flow diagrams of a processor-implemented method 200 for annotating tables with entities and entity types, in accordance with some embodiments of the present disclosure. Although steps of the method 200 including process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any practical order. Further, some steps may be performed simultaneously, or some steps may be performed alone or independently.
[037] At step 202 of the method 200, the one or more hardware processors 104 of the system 100 are configured to receive one or more tables, one or more entities and one or more entity types associated with the one or more tables, background knowledge elements including one or more observed entities and one or more observed entity types present in the one or more tables, and one or more available annotations for each table in the one or more tables. Each table in the one or more tables includes one or more columns and one or more rows with individual cells, wherein each column of each table corresponds to an entity type of the one or more entity types, and each individual cell of each table corresponds to an entity of the one or more entities of that entity type. Each entity of the one or more entities is associated with the entity type of the one or more entity types. The one or more tables may be present in the webpage or in the document.

[038] The background knowledge elements, including the one or more observed entities and the one or more observed entity types present in the one or more tables, may be present in the knowledge graph of the one or more tables. The one or more observed entities are those entities of the one or more entities that are present in at least one table of the one or more tables. Similarly, the one or more observed entity types are those entity types of the one or more entity types that are present in at least one table of the one or more tables.
[039] The one or more available annotations for each table include one or more available entity annotations and one or more available entity type annotations. The one or more available entity annotations correspond to the annotation of each individual cell present in the corresponding table with the observed entity of the one or more observed entities of that table. Similarly, the one or more available entity type annotations correspond to the annotation of each column present in the corresponding table with the observed entity type of the one or more observed entity types of that table.
[040] For example, consider a table set S including one or more tables. The kth table S_k ∈ S may include m_k rows and n_k columns of individual cells. The individual cell in the ith row and jth column of the kth table may be denoted as x^k_ij. In the present disclosure, only textual tables are considered for the annotation; hence, each x^k_ij takes a string as value. Also, the ith row may be denoted as R^k_i and the jth column may be denoted as C^k_j. The one or more tables present in the table set S include background knowledge of the one or more entities and the one or more entity types. Let T denote the set of entity types, and E the set of entities. Each entity E is associated with an entity type T(E). For each entity E, there is also an entity description or lemma L(E).
[041] The one or more tables present in the table set S may contain information about the entities E and the entity types T. Specifically, each column of each table corresponds to a single entity type, and hence a plurality of entity type annotations may be available. Similarly, each cell corresponds to a specific entity of that entity type, and a plurality of entity annotations may be available. Let A_e be the set of all entity annotations of the cells, and A_t that of all entity type annotations of the columns. However, only a subset A_e^o ⊂ A_e of entity annotations may be available. Similarly, only a subset A_t^o ⊂ A_t of entity type annotations may be available. Let T_o denote the one or more observed entity types seen in A_t^o, and E_o denote the one or more observed entities seen in A_e^o. As an additional challenge, in the incomplete knowledge setting, T_o is a strict subset of T (T_o ⊂ T), indicating that all the entity types may not be seen in the available entity type annotations. Similarly, all the entities may not be seen in the available entity annotations (E_o ⊂ E). So there is a need for a mechanism to annotate the unannotated cells and unannotated columns of the tables using the available annotations as training data.
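By way of a non-limiting illustration, the above setting may be sketched in Python as follows; the class and field names are merely exemplary and are not part of the specification:

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    """One web table S_k: a grid of textual cells x_ij (exemplary container)."""
    cells: list[list[str]]  # cells[i][j] holds the string value of cell x_ij

@dataclass
class Knowledge:
    """Background knowledge: observed entities E_o with types T(E) and lemmas L(E)."""
    entity_type: dict[str, str]  # T(E): entity id -> entity type id
    lemma: dict[str, str]        # L(E): entity id -> textual description

@dataclass
class Annotations:
    """Partially observed annotations A_e^o (cells) and A_t^o (columns) for one table."""
    cell_entity: dict[tuple[int, int], str] = field(default_factory=dict)  # (i, j) -> entity id
    column_type: dict[int, str] = field(default_factory=dict)              # j -> entity type id
```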
[042] At step 204 of the method 200, the one or more hardware processors 104 of the system 100 are configured to construct a graph from the one or more tables, the background knowledge elements and the one or more available annotations for each table in the one or more tables, to obtain a constructed graph. The constructed graph includes one or more table nodes for each table in the one or more tables, one or more knowledge nodes, and one or more edges. In an embodiment, the constructed graph may be a directed graph.
[043] FIG. 3 shows a schematic diagram of an exemplary constructed graph for an exemplary table, in accordance with some embodiments of the present disclosure. The one or more table nodes for each table are represented in rounded rectangular boxes and the one or more knowledge nodes are represented in ovals, to form a complete node set V. The one or more knowledge nodes (in ovals) show exemplary entities and entity types. For example, entity E123 has associated entity type T12: 'Person' and lemma 'Tamara K'.
[044] The one or more table nodes for each table include a column node for each column C^k_j, a row node for each row R^k_i, a cell node for each cell x^k_ij, and a self-table node for the corresponding table S_k itself. The one or more knowledge nodes include an observed entity node for each observed entity of the one or more observed entities E_o, and an observed entity type node for each observed entity type of the one or more observed entity types T_o.
[045] The one or more edges R are bi-directional and reflect the connections between the nodes present in the constructed graph. The one or more edges R include one or more table edges R_t for each table, one or more knowledge edges R_k, and one or more annotation edges R_a. The one or more table edges R_t for each table comprise (i) a cell-column edge between each cell node x^k_ij and the corresponding column node C^k_j, (ii) a cell-row edge between each cell node x^k_ij and the corresponding row node R^k_i, (iii) a column-table edge between each column node C^k_j and the corresponding self-table node S_k, and (iv) a row-table edge between each row node R^k_i and the corresponding self-table node S_k. The one or more table edges R_t are represented by fine solid lines in FIG. 3.
[046] The one or more knowledge edges R_k include an observed entity-observed entity type edge between each observed entity node in E_o and the corresponding observed entity type node in T_o. The one or more knowledge edges R_k are represented by thick solid lines in FIG. 3. The one or more annotation edges R_a include (i) an entity annotation edge for each available entity annotation in A_e^o, between a cell node x^k_ij and an observed entity node, and (ii) an entity type annotation edge for each available entity type annotation in A_t^o, between a column node C^k_j and an observed entity type node. The one or more annotation edges R_a are represented by dashed lines in FIG. 3. Let T(C^k_j) denote the entity type annotation associated with column C^k_j, and E(x^k_ij) the entity annotation associated with the cell x^k_ij. For example, the cell x_11 is annotated with entity E123, and column C_1 is annotated with type T12: 'Person'.
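By way of a non-limiting illustration, the construction of the labeled edge set described above may be sketched as follows, building on the exemplary containers shown earlier; the edge label strings are merely exemplary:

```python
def build_graph(table, annotations, knowledge, k):
    """Construct the labeled, bi-directional edge set for table S_k (step 204 sketch)."""
    m, n = len(table.cells), len(table.cells[0])
    edges = []

    def link(u, v, label):
        # every edge is stored together with its inverse edge
        edges.append((u, v, label))
        edges.append((v, u, label + "_inv"))

    for i in range(m):
        link(("row", k, i), ("table", k), "row-table")           # row-table edges
        for j in range(n):
            link(("cell", k, i, j), ("col", k, j), "cell-column")
            link(("cell", k, i, j), ("row", k, i), "cell-row")
    for j in range(n):
        link(("col", k, j), ("table", k), "column-table")        # column-table edges
    for e, t in knowledge.entity_type.items():                   # knowledge edges E_o -- T_o
        link(("entity", e), ("type", t), "entity-type")
    for (i, j), e in annotations.cell_entity.items():            # entity annotation edges
        link(("cell", k, i, j), ("entity", e), "entity-annotation")
    for j, t in annotations.column_type.items():                 # entity type annotation edges
        link(("col", k, j), ("type", t), "type-annotation")
    return edges
```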
[047] At step 206 of the method 200, the one or more hardware processors 104 of the system 100 are configured to build an end-to-end model from the constructed graph obtained at step 204 of the method 200, using a graph convolution network (GCN).
[048] At step 206a of the method 200, the one or more hardware processors 104 of the system 100 are configured to initiate one or more model parameters of the GCN, randomly. The one or more model parameters are associated with each layer present in the GCN. FIG. 4 shows an architecture of the graph convolution network for annotating tables with entities and entity types, in accordance with some embodiments of the present disclosure.
[049] At step 206b of the method 200, the one or more hardware processors 104 of the system 100 are configured to generate an embedding for each table node of the one or more table nodes for each table, to obtain table embeddings for the one or more table nodes present in the constructed graph obtained at step 204 of the method 200, using the GCN. Also, the one or more hardware processors 104 of the system 100 are configured to generate an embedding for each knowledge node of the one or more knowledge nodes, to obtain knowledge embeddings for the one or more knowledge nodes present in the constructed graph obtained at step 204 of the method 200, using the GCN. The table embeddings include an embedding of each row node, each column node, each cell node, and the self-table node itself. The knowledge embeddings include an embedding for each observed entity node and each observed entity type node.
[050] For example, consider the constructed graph G = (V, R), where V indicates a set of n nodes (the table nodes and the knowledge nodes) and R indicates a set of directed edges. The edge between nodes u, v ∈ V with a label (annotation) L_uv is denoted as (u, v, L_uv) ∈ R. The set of directed edges includes an inverse edge for each edge and a self-loop for each node. An input feature matrix X ∈ R^(m×n) contains an input feature representation x_u ∈ R^m of each node u ∈ V in its columns. The output embedding of the node v at the kth layer of the GCN is given by:

h_v^(k+1) = f( Σ_{(u, v, L_uv) ∈ R} ( W_{L_uv}^(k) h_u^(k) + b_{L_uv}^(k) ) )

where W_{L_uv}^(k) and b_{L_uv}^(k) are label (annotation) specific model parameters at the kth layer, h_u^(1) = x_u, and the function f(·) is a non-linear activation; ReLU is used as the activation in the GCN.
[051] A self-loop is added for each node associated with textual input, specifically the cells and the entities with lemmas. For the input representation of such nodes, the mean of pre-trained word embeddings, such as GloVe, over the associated constituent tokens is used.
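By way of a non-limiting illustration, one layer of the above update may be sketched in NumPy as follows, assuming nodes are indexed 0 to n-1 and the edge set already contains the inverse edges and self-loops:

```python
import numpy as np

def gcn_layer(h, edges, W, b):
    """One GCN layer: h_v <- ReLU( sum over (u, v, L) in R of W[L] @ h[u] + b[L] )."""
    d_out = next(iter(W.values())).shape[0]
    out = np.zeros((h.shape[0], d_out))
    for u, v, label in edges:            # aggregate label-specific messages into node v
        out[v] += W[label] @ h[u] + b[label]
    return np.maximum(out, 0.0)          # f(.) = ReLU
```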
[052] The final GCN layer K generates the embedding h_v^(K) for each node v. The embedding for each node generated at the final layer of the GCN contains information about both entities and entity types. A classification layer is used in the GCN, which classifies each node embedding generated at the final layer into the entity or the entity type.
[053] For the entity type classification, the one or more hardware processors 104 of the system 100 are configured at step 206c of the method 200 to first obtain the entity type embedding h_t(c) for each column node c, by projecting the associated column embedding h_c^(K) to the entity type space using an entity type projection matrix P_t, i.e., h_t(c) = P_t h_c^(K). Then, the one or more hardware processors 104 of the system 100 are configured at step 206d of the method 200 to obtain a probability distribution g_t(c) over the one or more observed entity types T_o by using a soft-max layer on the associated entity type embedding h_t(c) with a sigmoid weight matrix θ_t, based on the equation: g_t(c) = σ(h_t(c); θ_t). Next, the one or more hardware processors 104 of the system 100 are configured at step 206e of the method 200 to learn the entity type parameters for each entity type embedding. The entity type parameters for each entity type embedding include the associated entity type projection matrix P_t and the associated sigmoid weight matrix θ_t.
[054] For the entity classification, the one or more hardware processors 104 of the system 100 are configured at step 206c of the method 200 to first obtain the entity embedding h_e(x) for each cell node x, by projecting the associated cell embedding h_x^(K) to the entity space using an entity projection matrix P_e, i.e., h_e(x) = P_e h_x^(K). Then, the one or more hardware processors 104 of the system 100 are configured at step 206d of the method 200 to obtain a probability distribution g_e(x) over the one or more observed entities E_o by using the soft-max layer on the associated entity embedding h_e(x) with a sigmoid weight matrix θ_e, based on the equation: g_e(x) = σ(h_e(x); θ_e). Next, the one or more hardware processors 104 of the system 100 are configured at step 206e of the method 200 to learn the entity parameters for each entity embedding. The entity parameters for each entity embedding include the associated entity projection matrix P_e and the associated sigmoid weight matrix θ_e.
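By way of a non-limiting illustration, the two classification heads of steps 206c and 206d may be sketched as follows; the soft-max formulation shown is one exemplary realization:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def type_distribution(h_c, P_t, theta_t):
    """g_t(c): distribution over observed entity types T_o for a column embedding h_c."""
    return softmax(theta_t @ (P_t @ h_c))   # project to the type space, then soft-max

def entity_distribution(h_x, P_e, theta_e):
    """g_e(x): distribution over observed entities E_o for a cell embedding h_x."""
    return softmax(theta_e @ (P_e @ h_x))   # project to the entity space, then soft-max
```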
[055] At step 206f of the method 200, the one or more hardware processors 104 of the system 100 are configured to build the end-to-end model by jointly training the GCN with (i) the one or more model parameters, (ii) the entity type parameters for each entity type embedding, and (iii) the entity parameters for each entity embedding, based on the one or more available entity annotations for the cells and the one or more available entity type annotations for the columns. Specifically, an entity type classification loss is computed between (i) the probability distribution g_t(c) and (ii) the observed entity type annotation T(c). Similarly, an entity classification loss is computed between (i) the probability distribution g_e(x) and (ii) the observed entity annotation E(x). In an embodiment, a weighted combination of the entity prediction loss and the entity type prediction loss is considered while training the GCN. In an embodiment, cross-entropy is used as the classification loss function and Adam is used for optimization while training the GCN.
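By way of a non-limiting illustration, the weighted joint objective may be sketched as follows; the weight w and the direction of the 1:2 ratio mentioned in the experiments below are exemplary assumptions:

```python
import numpy as np

def joint_loss(type_probs, type_labels, ent_probs, ent_labels, w=2.0):
    """Weighted sum of the entity type and entity cross-entropy losses (a sketch)."""
    def cross_entropy(probs, labels):
        # probs: node -> predicted distribution; labels: node -> true class index
        return -np.mean([np.log(probs[n][labels[n]] + 1e-12) for n in labels])
    return cross_entropy(type_probs, type_labels) + w * cross_entropy(ent_probs, ent_labels)
```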
[056] At step 208 of the method 200, the one or more hardware processors 104 of the system 100 are configured to annotate each table of the one or more tables, using the built end-to-end model obtained at step 206 of the method 200. Firstly, the one or more hardware processors 104 of the system 100 are configured to determine whether the one or more unannotated columns correspond to the one or more observed entity types, to obtain one or more novel unannotated columns and one or more non-novel unannotated columns. To decide whether an unannotated column c corresponds to one of the one or more observed entity types T_o, the associated entity type embedding is used. The column c corresponds to a new entity type if its entity type space embedding h_t(c) = P_t h(c) is far away from the entity type space embedding h_t(T) = P_t h(T) for all T ∈ T_o. More specifically, δ(max_{T ∈ T_o} cos(P_t h(c), P_t h(T)) ≤ ε_t) is used, where ε_t is a novel entity type threshold and δ(·) is the Kronecker delta function.
[057] Similarly, whether the one or more unannotated individual cells correspond to the one or more observed entities is determined, to obtain one or more novel unannotated individual cells and one or more non-novel unannotated individual cells. To decide whether an unannotated cell x corresponds to one of the one or more observed entities E_o, the associated entity embedding is used. The cell x corresponds to a new entity if its entity space embedding h_e(x) = P_e h(x) is far away from the entity space embedding h_e(E) = P_e h(E) for all E ∈ E_o. More specifically, δ(max_{E ∈ E_o} cos(P_e h(x), P_e h(E)) ≤ ε_e) is used, where ε_e is a novel entity threshold and δ(·) is the Kronecker delta function.
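By way of a non-limiting illustration, the novelty test described above may be sketched as follows:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_novel(h_projected, observed_projected, eps):
    """Novel if the projected embedding is far from every observed one: max cosine <= eps."""
    return max(cos(h_projected, h_o) for h_o in observed_projected) <= eps
```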
[058] Then, each non-novel unannotated column of the one or more non-novel unannotated columns is annotated with the associated observed entity type of the one or more observed entity types, using the built end-to-end model through a forward propagation. Similarly, each non-novel unannotated individual cell of the one or more non-novel unannotated individual cells is annotated with the associated observed entity of the one or more observed entities, using the built end-to-end model through the forward propagation in the trained network on the embeddings of the corresponding nodes. The entity type probability distribution g_t(c) of the column c is obtained as g_t(c) = σ(P_t h(c); θ_t). Similarly, the entity probability distribution g_e(x) of the cell x is obtained as g_e(x) = σ(P_e h(x); θ_e).

[059] One or more new entity types are discovered for the one or more novel unannotated columns by clustering the one or more entity type embeddings of the columns and assigning a new entity type to each cluster, using a clustering technique. Specifically, the entity type embeddings h_t(c) of all novel columns c are clustered, and each resulting cluster forms a new entity type. In an embodiment, silhouette clustering is used as a representative non-parametric clustering algorithm.
[060] Similarly, one or more new entities are discovered for the one or more novel unannotated individual cells by clustering the one or more entity embeddings of the cells and assigning a new entity to each cluster, using the clustering technique. Specifically, the entity embeddings h_e(x) of all novel cells x are clustered, and each resulting cluster forms a new entity. In an embodiment, the silhouette clustering is used as the representative non-parametric clustering algorithm.
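By way of a non-limiting illustration, the clustering step may be realized, for example, with k-means where the number of clusters is selected by the silhouette score; the use of k-means here is an exemplary assumption, not a requirement of the disclosure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def discover_clusters(embeddings, k_max=10):
    """Cluster novel embeddings; the number of clusters is chosen by silhouette score."""
    X = np.asarray(embeddings)           # assumes at least 3 novel embeddings
    best_k, best_score = 2, -1.0
    for k in range(2, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```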
[061] Then, each novel unannotated column of the one or more novel unannotated columns is annotated with the associated new entity type of the discovered one or more new entity types, using the built end-to-end model through the forward propagation. Similarly, each novel unannotated individual cell of the one or more novel unannotated individual cells is annotated with the associated new entity of the discovered one or more new entities, using the built end-to-end model through the forward propagation.
[062] In accordance with the methods and systems of the present disclosure, the end-to-end model is built using the GCN for effectively annotating the one or more tables present in the webpage or in the electronic document. The present disclosure captures the syntactic structure of the tables, the knowledge elements, as well as the available annotations, and learns using both the entity loss and the entity type loss, jointly and in an end-to-end fashion. Hence the built end-to-end model unifies the benefits of probabilistic graphical model based approaches and embedding based approaches for the table annotation. Using the embeddings of the table and knowledge elements thus learnt, the end-to-end model is efficient over the conventional approaches for annotating the web tables. Also, the end-to-end model discovers the new entities and the new entity types for annotating the novel unannotated columns and the novel unannotated individual cells.
[063] The method 200 of the present disclosure explained the steps for annotating the unannotated individual cells and the unannotated columns, as the available annotations are only provided for the cells and the columns during the training of the GCN for building the end-to-end model. However, embeddings are available for the rows and the tables as well after training, which may be used for different tasks. Since the rows and the tables are not associated with entity type spaces or entity spaces, the associated embeddings are used directly for the tasks. For example, consider three different tasks: (i) a table clustering task, (ii) a row-to-table assignment task, and (iii) an anomalous row detection task. Table clustering is the task of grouping together semantically related tables; the table embeddings h(S) of all tables are clustered, and in an embodiment the silhouette clustering algorithm is used for consistency. Row-to-table assignment is the task of assigning a row to its most appropriate table. For this, a row R with embedding h(R) is assigned to the table S_k with the 'closest' embedding h(S_k), more specifically, to S* = argmax_{S_k} cos(h(R), h(S_k)). If R is a row from a table provided during the training, then the associated embedding is readily available; however, if it is a new row, then its embedding is created by convolving over the input embeddings of its constituent cells using the model parameters W_L for the cell-row edge. Finally, anomalous row detection is the task of determining, for each row R^k_j of the table S_k, whether it is an anomalous row in that table. The row R^k_j is anomalous if its embedding h(R^k_j) is 'far away' from the embedding h(S_k) of its table.
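By way of a non-limiting illustration, the row-to-table assignment may be sketched as follows:

```python
import numpy as np

def assign_row_to_table(h_row, table_embeddings):
    """Return the index k of the table S_k whose embedding is closest to h(R)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(range(len(table_embeddings)),
               key=lambda k: cos(h_row, table_embeddings[k]))
```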
Example scenario:
[064] To validate the performance of the present disclosure, the performance of the end-to-end model (hereinafter referred to as 'TabGCN') is compared against conventional state-of-the-art baselines.

[065] Data: Five benchmark web table datasets are used: (i) Wiki Manual (Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow., 3(1-2):1338-1347) is a small dataset of simple non-infobox tables from Wikipedia articles; (ii) Web Manual (Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow., 3(1-2):1338-1347) contains tables fetched by web-crawling using the Wiki Manual tables as queries; these tables are manually annotated with entities and entity types from YAGO; (iii) Limaye (Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ian Horrocks, and Charles Sutton. 2019. Colnet: Embedding the semantics of web tables for column type prediction. In AAAI; Vasilis Efthymiou, Oktie Hassanzadeh, Mariano Rodriguez-Muro, and Vassilis Christophides. 2017. Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In International Semantic Web Conference (ISWC)) corrects many incorrect annotations in Wiki Manual using the entities and the entity types from DBPedia 2015; (iv) T2Dv2 (Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ian Horrocks, and Charles Sutton. 2019. Colnet: Embedding the semantics of web tables for column type prediction. In AAAI) contains tables from a common web crawl; and (v) Wikipedia is a publicly available subset of the data used by Vasilis Efthymiou, Oktie Hassanzadeh, Mariano Rodriguez-Muro, and Vassilis Christophides. 2017. Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In International Semantic Web Conference (ISWC). The datasets T2Dv2 and Wikipedia are annotated using DBPedia. However, T2Dv2 and Wikipedia contain only entity type annotations; the entity annotations are not available. For all the five datasets, 30% of the unique entities and entity types are first set aside as the unseen knowledge elements and their annotations are hidden during the training. Out of the remaining annotations, again 30% is removed during the training.
[066] Conventional state-of-the-art baselines: Five existing table annotation approaches are compared with the present disclosure. The first is PGM (Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow., 3(1-2):1338-1347), which uses a probabilistic graphical model and hand-crafted features to calculate the potential functions of the model. Since the entity type set is flat instead of being a hierarchical graph, in the present disclosure, a logistic regression is used instead of a structural support vector machine (SVM) to estimate the model parameters.
[067] The second is MeiMei (Kunihiro Takeoka, Masafumi Oyamada, Shinji Nakadai, and Takeshi Okadome. 2019. Meimei: An efficient probabilistic approach for semantically annotating tables. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pages 281-288), which uses a Markov random field but embeds the entities and entity types for fast computation of clique potentials. For both PGM and MeiMei, relation annotations for column-pairs and the associated potential functions are ignored, since relation detection is not relevant in the present disclosure.
[068] The third is Table2Vec (Li Deng, Shuo Zhang, and Krisztian Balog. 2019. Table2vec: Neural word and entity embeddings for table population and retrieval. In SIGIR), which learns various word2vec-based embeddings of tables for entity annotation but does not address entity type annotation. However, like other embedding-based table annotation models, Table2Vec focuses on relational tables and associates a single entity with each row via the associated core column. Table2Vec is adapted here for web tables containing an entity for each cell. The Table2VecE version is used, which models each row as a sequence of all entities that appear within cells of that row. After training the word2vec model, instead of considering the embedding for the entire row as in Table2VecE, cell-specific embeddings are created from only the tokens for that cell.
[069] The fourth is ColNet (Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ian Horrocks, and Charles Sutton. 2019. Colnet: Embedding the semantics of web tables for column type prediction. In AAAI), which models each column as a sequence of words from the cells under the column and uses a CNN (Convolutional Neural Network) to predict the column type.
[070] The fifth is an adaptation of TaBERT (Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. In Annual Meeting of the Association for Computational Linguistics (ACL)), which trains a joint language model for retrieval from tables given an utterance. The same approach is adapted to independently linearize each row and each column of the table as a sequence of cells under that row or column, and to obtain column and row embeddings as the mean of the corresponding cell embeddings. For the cells, pre-trained BERT embeddings are considered.
[071] Hyper-parameters: One-hot encodings are used as inputs for the cells and the entities with lemmas. ReLU is used as the non-linear GCN activation, and 2 GCN layers with 512 and 256 dimensions are used. As a result, the 7 GCN weight matrices are V×512 (where V is the vocabulary size) in the first GCN layer and 512×256 in the second GCN layer. The entity and entity type space embeddings contain 256 dimensions, so that the entity and entity type projection matrices are both 256×256. During the training, a learning rate of 0.001, a dropout of 0.5, and 1000 epochs are used. The weight ratio for combining the entity type and entity losses is 1:2 for all datasets, optimized manually. The experiments are performed on a dual-core Intel Core i5 processor with 8 GB RAM. The average training time per epoch ranges from 1.3 secs for Wiki Manual to 35.4 secs for T2Dv2.
[072] Detection results: For the entity and entity type detection, all the five baselines predict entity annotations for all cells and entity type annotations for all columns in the test set. However, evaluation is only for those cells and columns whose true entity and entity type annotations are contained in the observed entities E_o and the observed types T_o, respectively. A micro-averaged F1 is used during the evaluation, which takes the weighted average of the F1 scores over the entities and the entity types. Table 1 shows the entity type and entity detection performance of the five baselines and the TabGCN on the five datasets.

Model     Wiki Manual          Web Manual           T2Dv2                Limaye               Wikipedia
          Entity type  Entity  Entity type  Entity  Entity type  Entity  Entity type  Entity  Entity type  Entity
PGM       55%          70%     57%          75%     -            -       60%          75%     -            -
MeiMei    40%          -       62%          -       -            -       58%          -       -            -
Tab2Vec   -            20%     -            50%     -            -       -            50%     -            -
ColNet    20%          -       49%          -       59%          -       47%          -       59%          -
TaBERT    20%          -       49%          -       59%          -       47%          -       59%          -
TabGCN    33%          24%     84%          83%     82%          -       84%          79%     85%          -

Table 1
[073] From the performance results, the TabGCN significantly outperforms the baselines on both detection tasks. The improvements are particularly large for entity type detection. The only exception is the Wiki Manual dataset, where the graphical model based approaches with handcrafted potential functions outperform the representation learning approaches, possibly on account of the small size of the dataset. Among the embedding based approaches, the TabGCN performs the best.
[074] Novelty classification results: The novelty classification is an unsupervised task, where a model makes a binary decision for each unannotated column (novel entity type classification) and for each unannotated cell (novel entity classification). Since the decision depends on the thresholds for the entity type (εt) and the entity (εe), the F1 score is plotted on the y-axis against the corresponding threshold on the x-axis. Out of the five baselines, only the PGM can address this novelty classification, but it outputs NONE annotations for the entities and the entity types. The TabGCN reaches an F1 score of around 0.8 on the Web Manual dataset.
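One plausible reading of this thresholded decision, sketched in PyTorch, is to flag a cell or column as novel when its maximum soft-max probability over the observed entities or entity types falls below the corresponding threshold; this decision rule is an assumption, not a quotation of the disclosure.

```python
import torch

def is_novel(logits, eps):
    # logits: scores over the observed entities (or entity types) for each
    # unannotated cell (or column); eps: threshold, i.e., eps_e or eps_t.
    probs = torch.softmax(logits, dim=-1)
    return probs.max(dim=-1).values < eps   # True where the element is novel
```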
[075] New entity type and entity discovery results: The final annotation tasks are new entity type discovery, where all unannotated columns that do not correspond to observed types in To need to be clustered into distinct new entity types, and new entity discovery, where all unannotated cells that do not correspond to observed entities in Eo need to be clustered into distinct new entities. A normalized mutual information (NMI) score, NMI(Y, C) = 2·I(Y; C) / (H(Y) + H(C)), is computed between the assigned cluster annotations (Y) and the true class annotations (C), where I(·; ·) denotes mutual information and H(·) denotes entropy. Table 2 shows the performance of the TabGCN for discovering the one or more new entities and the one or more new entity types. From Table 2, the TabGCN performs consistently above 80% and significantly outperforms the ColNet and the TaBERT for the new entity and the new entity type discovery across the datasets. The PGM, MeiMei, and Tab2Vec are not considered in this experiment, as they are not capable of discovering new entities and new entity types.
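A minimal sketch of the discovery step and its evaluation, assuming scikit-learn's KMeans as the clustering technique (the disclosure does not name a specific algorithm here) and its NMI implementation; the embeddings and gold classes below are randomly generated placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

embeddings = np.random.rand(40, 256)                  # entity embeddings of novel cells
assigned = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings)  # Y
true_classes = np.random.randint(0, 5, size=40)       # C (illustrative gold classes)
print(normalized_mutual_info_score(true_classes, assigned))
```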

Model      Wiki Manual          Web Manual           T2Dv2                Limaye               Wikipedia
           Entity type  Entity  Entity type  Entity  Entity type  Entity  Entity type  Entity  Entity type  Entity
ColNet     76%          -       64%          -       77%          -       63%          -       60%          -
TaBERT     76%          -       62%          -       76%          -       61%          -       60%          -
TabGCN     88%          86%     84%          84%     90%          -       84%          83%     82%          -
Table 2

[076] Table and row assignment performance results: Finally, the table and row related inference tasks are evaluated. Table 3 shows the row-to-table assignment performance using Hits@1 and MRR for the TabGCN, Tab2Vec, and TaBERT on all five datasets. Hits@1 measures the fraction of rows for which the correct table is ranked first. Mean Reciprocal Rank (MRR) is the mean of the reciprocal rank of the correct table over all rows; its perfect value is 1.0. From the results, the TabGCN again shows very good performance across the five datasets. A short illustration of computing both metrics is given after Table 3.

Model      Wiki Manual     Web Manual      T2Dv2           Limaye          Wikipedia
           Hits@1   MRR    Hits@1   MRR    Hits@1   MRR    Hits@1   MRR    Hits@1   MRR
Tab2Vec    0.11     0.16   0.13     0.21   0.32     0.39   0.09     0.18   0.29     0.35
TaBERT     0.13     0.18   0.51     0.58   0.32     0.40   0.46     0.57   0.38     0.46
TabGCN     0.18     0.21   0.71     0.77   0.67     0.70   0.71     0.77   0.71     0.74
Table 3
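For clarity, both metrics can be computed from the rank of the correct table per row; the ranks below are invented for illustration:

```python
def hits_at_1_and_mrr(ranks):
    # ranks[i] is the rank of the correct table for row i (1 = top rank).
    hits1 = sum(r == 1 for r in ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return hits1, mrr

print(hits_at_1_and_mrr([1, 2, 1, 4]))  # (0.5, 0.6875)
```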
[077] The present disclosure explores the usefulness of learning the embeddings of the table elements and the knowledge elements jointly, using both the entity and entity type losses, in an end-to-end fashion across a variety of tasks on the datasets. Based on the experimental results, the TabGCN significantly outperforms the mentioned baselines on the entity and entity type annotation tasks. The present disclosure also shows good performance for the novelty classification and for discovering the new entities and entity types.
[078] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject

matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[079] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[080] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

[081] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims (when included in the specification), the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[082] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.

[083] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.

WE CLAIM:
1. A processor-implemented method (200) for annotating one or more tables with
entities and entity types, the method (200) comprising the steps of:
receiving, via one or more hardware processors, the one or more tables, one or more entities and one or more entity types associated with the one or more tables, background knowledge elements comprising one or more observed entities and one or more observed entity types present in the one or more tables, and one or more available annotations for each table in the one or more tables, wherein each table of the one or more tables comprises one or more columns and one or more rows with individual cells, each column of each table corresponds to an entity type of the one or more entity types and each individual cell of each table corresponds to an entity of the one or more entities, each entity of the one or more entities is associated with the entity type of the one or more entity types, and the one or more available annotations for each table comprises one or more available entity annotations and one or more available entity type annotations (202);
constructing, via the one or more hardware processors, a graph from the one or more tables, the background knowledge elements and the one or more available annotations for each table in the one or more tables, to obtain a constructed graph, wherein the constructed graph comprises one or more table nodes for each table in the one or more tables, one or more knowledge nodes, and one or more edges (204);
building, via the one or more hardware processors, an end-to-end model, from the constructed graph using a graph convolution network (GCN) (206), wherein building the end-to-end model comprises:
initiating one or more model parameters of the GCN, randomly
(206a);

generating (i) an embedding for each table node of the one or more table nodes for each table to obtain table embeddings for the one or more table nodes, and (ii) an embedding for each knowledge node of the one or more knowledge nodes to obtain knowledge embeddings for the one or more knowledge nodes, using the GCN, wherein the table embeddings comprises embedding of each row, each column, each cell, and a self-table, and the knowledge embeddings comprises embedding for each observed entity and each observed entity type (206b);
obtaining (i) an entity type embedding for each column node, by projecting the associated column embedding to an entity type space using an entity type projection matrix, and (ii) an entity embedding for each cell node, by projecting the associated cell embedding to an entity space using an entity projection matrix (206c);
obtaining (i) a probability distribution over the one or more observed entity types by using a soft-max layer on associated entity type embeddings, and (ii) the probability distribution over the one or more observed entities by using the soft-max layer on associated entity embeddings (206d);
learning (i) entity type parameters for each entity type embedding from the corresponding entity type projection matrix and the corresponding soft-max layer, and (ii) entity parameters for each entity embedding, from the corresponding entity projection matrix and the corresponding soft-max layer (206e); and
building the end-to-end model by training the GCN with (i) the one or more model parameters, (ii) the entity type parameters for each entity type embedding, and (iii) the entity parameters for each entity embedding, and based on the one or more available entity annotations and the one or more available entity type annotations (206f); and

annotating, via the one or more hardware processors, each table of the one or more tables, using the built end-to-end model (208).
2. The method as claimed in claim 1, wherein annotating each table of the one or
more tables using the built end-to-end model, further comprises:
determining whether one or more unannotated columns correspond to the one or more observed entity types and one or more unannotated individual cells correspond to one or more observed entities, to obtain (i) one or more novel unannotated columns and one or more novel unannotated individual cells, and (ii) one or more non-novel unannotated columns and one or more non-novel unannotated individual cells;
annotating the one or more non-novel unannotated columns with the one or more observed entity types and the one or more non-novel unannotated individual cells with the one or more observed entities, using the built end-to-end model through a forward propagation;
discovering one or more new entity types for the one or more novel unannotated columns by clustering the one or more entity type embeddings of the columns and assigning a new entity type to each cluster, and one or more new entities for the one or more novel unannotated individual cells by clustering the one or more entity embeddings of the cells and assigning a new entity to each cluster, using a clustering technique; and
annotating the one or more novel unannotated columns with the one or more new entity types, and the one or more novel unannotated individual cells with the one or more new entities.
3. The method as claimed in claim 1, wherein the one or more observed entities
are some of entities of the one or more entities, that are present in at least one table of
the one or more tables, and the one or more observed entity types are some of entity

types of the one or more entity types, that are present in at least one table of the one or more tables.
4. The method as claimed in claim 1, wherein the one or more available entity annotations correspond to the annotation of each individual cell present in the corresponding table with an observed entity of the one or more observed entities, and the one or more available entity type annotations correspond to the annotation of each column present in the corresponding table with an observed entity type of the one or more observed entity types.
5. The method as claimed in claim 1, wherein the one or more table nodes for each table comprises a column node for each column, a row node for each row, a cell node for each cell, and a self-table node for the corresponding table.
6. The method as claimed in claim 1, wherein the one or more knowledge nodes comprises an observed entity node for each observed entity of the one or more observed entities, and an observed entity type node for each observed entity type of the one or more observed entity types.
7. The method as claimed in claim 1, wherein the one or more edges comprises one or more table edges for each table, one or more knowledge edges, and one or more annotation edges.
8. The method as claimed in claim 7, wherein the one or more table edges for each table comprises (i) a cell-column edge between each cell node and a corresponding column node, (ii) a cell-row edge between each cell node and a corresponding row node, (iii) a column-table edge between each column node and a corresponding self-table node, and (iv) a row-table edge between each row node and the corresponding self-table node.
9. The method as claimed in claim 7, wherein the one or more knowledge edges include an observed entity-observed entity type edge between each observed entity node and a corresponding observed entity type node.
10. The method as claimed in claim 7, wherein the one or more annotation edges comprises (i) an entity annotation edge for each available entity annotation between a cell node and an observed entity node, and (ii) an entity type annotation edge for each available entity type annotation between a column node and an observed entity type node.
11. A system (100) for annotating one or more tables with entities and entity types, the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the
one or more I/O interfaces (106), wherein the one or more hardware processors
(104) are configured by the instructions to:
receive the one or more tables, one or more entities and one or more entity types associated with the one or more tables, background knowledge elements comprising one or more observed entities and one or more observed entity types present in the one or more tables, and one or more available annotations for each table in the one or more tables, wherein each table of the one or more tables comprises one or more columns and one or more rows with individual cells, each column of each table corresponds to an entity type of the one or more entity types and each individual cell of each table corresponds to an

entity of the one or more entities, each entity of the one or more entities is associated with the entity type of the one or more entity types, and the one or more available annotations for each table comprises one or more available entity annotations and one or more available entity type annotations;
construct a graph from the one or more tables, the background knowledge elements and the one or more available annotations for each table in the one or more tables, to obtain a constructed graph, wherein the constructed graph comprises one or more table nodes for each table in the one or more tables, one or more knowledge nodes, and one or more edges;
build an end-to-end model, from the constructed graph using a graph convolution network (GCN), wherein the end-to-end model is built by:
initiating one or more model parameters of the GCN, randomly;
generating (i) an embedding for each table node of the one or more table nodes for each table to obtain table embeddings for the one or more table nodes, and (ii) an embedding for each knowledge node of the one or more knowledge nodes to obtain knowledge embeddings for the one or more knowledge nodes, using the GCN, wherein the table embeddings comprises embedding of each row, each column, each cell, and a self-table, and the knowledge embeddings comprises embedding for each observed entity and each observed entity type;
obtaining (i) an entity type embedding for each column node, by projecting the associated column embedding to an entity type space using an entity type projection matrix, and (ii) an entity embedding for each cell node, by projecting the associated cell embedding to an entity space using an entity projection matrix;
obtaining (i) a probability distribution over the one or more observed entity types by using a soft-max layer on associated entity type embeddings, and (ii) the probability distribution over the one or more

observed entities by using the soft-max layer on associated entity embeddings;
learning (i) entity type parameters for each entity type embedding from the corresponding entity type projection matrix and the corresponding soft-max layer, and (ii) entity parameters for each entity embedding, from the corresponding entity projection matrix and the corresponding soft-max layer; and
building the end-to-end model by training the GCN with (i) the one or more model parameters, (ii) the entity type parameters for each entity type embedding, and (iii) the entity parameters for each entity embedding, and based on the one or more available entity annotations and the one or more available entity type annotations; and annotate each table of the one or more tables using the built end-to-end model.
12. The system as claimed in claim 11, wherein the one or more hardware
processors (104) are further configured to annotate each table of the one or more tables using the built end-to-end model, by:
determining whether one or more unannotated columns correspond to the one or more observed entity types and one or more unannotated individual cells correspond to one or more observed entities, to obtain (i) one or more novel unannotated columns and one or more novel unannotated individual cells, and (ii) one or more non-novel unannotated columns and one or more non-novel unannotated individual cells;
annotating the one or more non-novel unannotated columns with the one or more observed entity types and the one or more non-novel unannotated individual cells with the one or more observed entities, using the built end-to-end model through a forward propagation;
discovering one or more new entity types for the one or more novel unannotated columns by clustering the one or more entity type embeddings of the columns and

assigning a new entity type to each cluster, and one or more new entities for the one or more novel unannotated individual cells by clustering the one or more entity embeddings of the cells and assigning a new entity to each cluster, using a clustering technique; and
annotating the one or more novel unannotated columns with the one or more new entity types, and the one or more novel unannotated individual cells with the one or more new entities.
13. The system as claimed in claim 11, wherein the one or more observed entities are some of entities of the one or more entities, that are present in at least one table of the one or more tables, and the one or more observed entity types are some of entity types of the one or more entity types, that are present in at least one table of the one or more tables.
14. The system as claimed in claim 11, wherein the one or more available entity annotations correspond to the annotation of each individual cell present in the corresponding table with an observed entity of the one or more observed entities, and the one or more available entity type annotations correspond to the annotation of each column present in the corresponding table with an observed entity type of the one or more observed entity types.
15. The system as claimed in claim 11, wherein the one or more table nodes for each table comprises a column node for each column, a row node for each row, a cell node for each cell, and a self-table node for the corresponding table.
16. The system as claimed in claim 11, wherein the one or more knowledge nodes comprises an observed entity node for each observed entity of the one or more observed entities, and an observed entity type node for each observed entity type of the one or more observed entity types.

17. The system as claimed in claim 11, wherein the one or more edges comprises one or more table edges for each table, one or more knowledge edges, and one or more annotation edges.
18. The system as claimed in claim 17, wherein the one or more table edges for each table comprises (i) a cell-column edge between each cell node and a corresponding column node, (ii) a cell-row edge between each cell node and a corresponding row node, (iii) a column-table edge between each column node and a corresponding self-table node, and (iv) a row-table edge between each row node and the corresponding self-table node.
19. The system as claimed in claim 17, wherein the one or more knowledge edges include an observed entity-observed entity type edge between each observed entity node and a corresponding observed entity type node.
20. The system as claimed in claim 17, wherein the one or more annotation edges comprises (i) an entity annotation edge for each available entity annotation between a cell node and an observed entity node, and (ii) an entity type annotation edge for each available entity type annotation between a column node and an observed entity type node.

Documents

Application Documents

# Name Date
1 202021047314-STATEMENT OF UNDERTAKING (FORM 3) [29-10-2020(online)].pdf 2020-10-29
2 202021047314-REQUEST FOR EXAMINATION (FORM-18) [29-10-2020(online)].pdf 2020-10-29
3 202021047314-PROOF OF RIGHT [29-10-2020(online)].pdf 2020-10-29
4 202021047314-FORM 18 [29-10-2020(online)].pdf 2020-10-29
5 202021047314-FORM 1 [29-10-2020(online)].pdf 2020-10-29
6 202021047314-FIGURE OF ABSTRACT [29-10-2020(online)].jpg 2020-10-29
7 202021047314-DRAWINGS [29-10-2020(online)].pdf 2020-10-29
8 202021047314-DECLARATION OF INVENTORSHIP (FORM 5) [29-10-2020(online)].pdf 2020-10-29
9 202021047314-COMPLETE SPECIFICATION [29-10-2020(online)].pdf 2020-10-29
10 Abstract1.jpg 2021-10-19
11 202021047314-FORM-26 [21-10-2021(online)].pdf 2021-10-21
12 202021047314-FER.pdf 2022-06-23
13 202021047314-OTHERS [12-08-2022(online)].pdf 2022-08-12
14 202021047314-FER_SER_REPLY [12-08-2022(online)].pdf 2022-08-12
15 202021047314-DRAWING [12-08-2022(online)].pdf 2022-08-12
16 202021047314-COMPLETE SPECIFICATION [12-08-2022(online)].pdf 2022-08-12
17 202021047314-CLAIMS [12-08-2022(online)].pdf 2022-08-12
18 202021047314-US(14)-HearingNotice-(HearingDate-03-06-2025).pdf 2025-05-08
19 202021047314-Correspondence to notify the Controller [27-05-2025(online)].pdf 2025-05-27
20 202021047314-FORM-26 [30-05-2025(online)].pdf 2025-05-30
21 202021047314-Written submissions and relevant documents [12-06-2025(online)].pdf 2025-06-12
22 202021047314-PatentCertificate30-06-2025.pdf 2025-06-30
23 202021047314-IntimationOfGrant30-06-2025.pdf 2025-06-30

Search Strategy

1 SEARCH_STRATEGY_1505AE_15-05-2023.pdf
2 SearchHistoryE_22-06-2022.pdf

ERegister / Renewals

3rd: 02 Jul 2025 (from 29/10/2022 to 29/10/2023)
4th: 02 Jul 2025 (from 29/10/2023 to 29/10/2024)
5th: 02 Jul 2025 (from 29/10/2024 to 29/10/2025)
6th: 24 Sep 2025 (from 29/10/2025 to 29/10/2026)