Abstract: This disclosure relates generally to molecule design. Conventional systems suffer from the absence of a message-passing mechanism that leverages local-node and periphery graph neighbor information to effectively capture the prominent sub-structure patterns in a molecular graph while learning discriminative graph-level embeddings, from a lack of latent information incorporated in the message-passing mechanism to update node-level and graph-level embeddings, and so on. The disclosed system overcomes these limitations by providing a molecular graph encoder to determine molecular graph-level embeddings, a clique tree encoder to compute tree-level embeddings, and a clique tree decoder to retrieve the tree representation of molecular graphs. Virtual Hyperedge Replacement Graph Grammars corresponding to the virtual junction graph are extracted to aid in molecular graph decoding. The graph decoder retrieves the original molecular graphs. The disclosed system improves molecule validity and aids in the property optimization task in both limited and unlimited scenarios.
Claims:
1. A processor-implemented method (200) for molecule design using a Virtual Graph Grammars driven Variational Generative model, the method comprising:
employing (202), via one or more hardware processors, a Virtual Graph Grammars driven Variational Generative model for molecule design, the Virtual Graph Grammars driven Variational Generative model comprising a Molecular Graph Encoder, Molecular Clique Tree Encoder, Molecular Clique Tree Decoder, Virtual Hyperedge Replacement Graph Grammars and a Molecular Graph Decoder;
accessing (204), via the one or more hardware processors, a database comprising a plurality of variable-sized molecular graphs associated with a plurality of molecules and a plurality of labels indicative of chemical properties of the plurality of the variable-sized molecular graphs, wherein each molecular graph of the plurality of variable-sized molecular graphs comprises a plurality of nodes and a plurality of edges connecting a plurality of neighboring nodes from amongst the plurality of nodes;
augmenting (206), via the one or more hardware processors, each molecular graph from amongst the plurality of variable-sized molecular graphs with a virtual node described by auxiliary information to obtain an augmented molecular graph, wherein the virtual node of the molecular graph is bidirectionally connected to the plurality of nodes of the molecular graph through a virtual edge-type;
learning (208), via the one or more hardware processors, expressive representations of the augmented molecular graph using the Molecular graph encoder, wherein the expressive representations comprise a node-level embedding and a graph-level embedding of the augmented molecular graph;
performing (210), via the one or more hardware processors, tree decomposition of the augmented molecular graph to obtain a clique tree representation of the augmented molecular graph;
computing (212), via the one or more hardware processors, a tree-level embedding of the clique tree representation by message-passing schemes on valid subgraphs of the molecular graph, through a molecular Clique tree encoder, by operating on the topology of the clique tree augmented with a virtual node;
reconstructing (214), via the one or more hardware processors, the clique tree representation of the molecular graph from the tree level embedding of the clique tree representation by using the Molecular Clique tree decoder;
extracting (216), via the one or more hardware processors, a plurality of virtual hyperedge replacement graph grammar (VHRG) latent vectors of a plurality of production rules associated with the clique tree virtual graph to aid in molecular graph decoding using a sequence-to-sequence learning (Seq2Seq) model, wherein the plurality of production rules describe graphical rewriting rules to match hyperedges and replace the hyperedges with hypergraph fragments in the clique tree virtual graph; and
retrieving (218), via the one or more hardware processors, the molecular graph, by the molecular graph-decoder, wherein the molecular graph-decoder utilizes the reconstructed clique tree, the graph-level embeddings of the molecular graph, and the plurality of VHRG latent vectors of the clique tree virtual graph for retrieval.
2. The processor implemented method of claim 1, wherein learning the node-level embedding and the graph-level embedding of each augmented molecular graph from amongst the plurality of augmented molecular graphs comprises:
in a plurality of message passing iterations, iteratively subjecting each augmented molecular graph through representation learning to determine latent information incorporated in node-level embeddings of the augmented molecular graph, wherein each iteration of the plurality of iterations comprises:
dispatching, by a set of local-graph neighbors of a source node in the molecular graph, a long-range dependencies embedded neural-message vector and a local-information aware neural message vector to the source node, wherein the long-range dependencies embedded neural-message vector is modeled by a gating mechanism to capture prominent substructures in the molecular graph, and wherein the local-information aware neural message vector is modeled by a feed-forward neural network to encapsulate local-graph structure information;
computing a weighted neural message vector to be perceived by the source node by refining the long-range dependencies embedded neural-message vector and the local-information aware neural message vector by using a feature vector of the source node and a static edge vector characterizing the edge between the source node and the sink node;
aggregating local-neighbors information comprising the weighted neural message vector from the plurality of message-passing iterations to compute a refined neural message vector to be dispatched from the plurality of source nodes to the sink node;
updating node level embedding of the sink node based on the refined neural message vector as perceived by the sink node, wherein the node level embedding of the sink node embeds structure and local-feature information about its local graph neighbors;
determining a fixed-size graph-level embedding by performing a node-ordering invariant sum-pooling operation on the node-level embeddings of the sink node; and
sharing learning parameters of iterative graph message-passing mechanism across the plurality of nodes of varying local neighborhood size in the molecular graph and across the iteration steps.
3. The processor implemented method of claim 1, wherein computing the tree-level embedding of the clique tree representation of the molecular graph using the tree encoder comprises:
performing a plurality of tree based message-passing iterations to obtain a plurality of tree node-level embeddings of the plurality of clique trees of the molecular graph to refine each of the plurality of clique tree level embeddings, wherein each iteration of the plurality of tree based message-passing iterations comprises:
bidirectionally connecting a virtual node to the plurality of nodes of the molecular clique tree through an arbitrary local edge type to aid in subgraph representation learning;
communicating neural messages across the local-tree neighbors in the molecular clique tree, wherein the local edges connecting the local-tree neighbors in the local clique tree are characterized with an empty static edge-feature vector;
selecting a random leaf node as a root source node of the corresponding clique tree of the annotated molecular graph; and
communicating the structure-aware neural messages from one of the root source node and the intermediate source nodes to all the sink nodes in a top-down mode and a bottom-up mode, wherein in the top-down mode, neural messages are dispatched from the root source node to all the leaf sink nodes according to the computational graph associated with the root node, and wherein in the bottom-up mode, neural messages are dispatched and refined iteratively from the leaf source nodes towards the root sink node of the clique tree.
4. The processor implemented method of claim 1, wherein reconstructing the clique tree representation from the tree level embeddings of the clique tree to obtain reconstructed clique tree comprises:
determining a label of the root node from a set of valid vocabulary obtained from tree-decomposition of the molecular graphs; and
iterating, by operating the tree decoder in a pre-order traversal technique, over a set of nodes of the clique tree commencing from the root node and bringing about adding a child node to a current node from amongst the set of nodes in a depth-first order and extending up to a set of terminal nodes belonging to the set of nodes, wherein each iteration comprises:
adding the child node to the current node in the clique tree at a current iteration-step;
determining a label of the child node; and
decoding the clique tree incrementally based on a knowledge assembled from local-tree neighbors embeddings aware neural messages, wherein each child node in the reconstructed clique tree acquires structure-aware neural messages from the local-clique tree neighbors at every iteration for topological predictions, and wherein decoding the clique tree comprises reconstructing the clique tree.
5. The processor implemented method of claim 1, wherein extracting the plurality of VHRG latent vectors of the production rules, comprises:
transforming an undirected clique tree into a rooted clique tree with a random leaf node designated as a root node, wherein the nodes of the rooted clique tree are subgraphs of the molecular graph, and wherein the rooted clique tree is associated with a natural ordering of the child nodes of an immediate parent node from a top-to-bottom and from a left-to-right approach;
adding virtual edges characterized by empty static edge feature attributes by traversing level-by-level, connecting one or more sibling nodes of each node visited in a current iteration in the rooted clique tree, to obtain a clique tree virtual graph;
decomposing the molecular clique tree virtual graph to obtain the virtual clique tree, wherein the virtual clique tree is a constrained representation of the molecular clique tree virtual graph;
extracting the VHRG from the virtual clique tree to generate molecular clique tree virtual graph production rules; and
reconstructing the isomorphic clone of the molecular clique tree virtual graph to learn VHRG latent vectors.
6. The processor implemented method of claim 1, wherein reconstructing the isomorphic clone of the molecular graph associated with the reconstructed clique tree using the molecular graph decoder comprises:
providing molecular graph-level embedding of the molecular graph, the plurality of VHRG latent vectors and the reconstructed clique tree as input to the molecular graph decoder;
selecting a molecular graph having an equivalent clique tree as the reconstructed clique tree from amongst a set of realizable molecular graphs based on a scoring function, wherein the scoring function facilitates in evaluating the realizable molecular graphs; and
merging the local-tree nodes in the clique tree and transforming it into a valid molecular graph.
7. A system (602) for molecule design using a Virtual Graph Grammars driven Variational Generative model, the system comprising:
a memory (615) storing instructions;
one or more communication interfaces (603); and
one or more hardware processors (602) coupled to the memory (615) via the one or more communication interfaces (603), wherein the one or more hardware processors (602) are configured by the instructions to:
employ a Virtual Graph Grammars driven Variational Generative model for molecule design, the Virtual Graph Grammars driven Variational Generative model comprising a Molecular Graph Encoder, Molecular Clique Tree Encoder, Molecular Clique Tree Decoder, Virtual Hyperedge Replacement Graph Grammars and a Molecular Graph Decoder;
access a database comprising a plurality of variable-sized molecular graphs associated with a plurality of molecules and a plurality of labels indicative of chemical properties of the plurality of the variable-sized molecular graphs, wherein each molecular graph of the plurality of variable-sized molecular graphs comprises a plurality of nodes and a plurality of edges connecting a plurality of neighboring nodes from amongst the plurality of nodes;
augment each molecular graph from amongst the plurality of variable-sized molecular graphs with a virtual node described by auxiliary information to obtain an augmented molecular graph, wherein the virtual node of the molecular graph is bidirectionally connected to the plurality of nodes of the molecular graph through a virtual edge-type;
learn expressive representations of the augmented molecular graphs using the Molecular graph encoder, wherein the expressive representations comprise a node-level embedding and a graph-level embedding of the augmented molecular graphs;
perform tree decomposition of the augmented molecular graph to obtain a clique tree representation of the augmented molecular graph;
compute a tree-level embedding of the clique tree representation by message-passing schemes on valid subgraphs of the molecular graph, through a molecular Clique tree encoder, by operating on the topology of the clique tree augmented with a virtual node;
reconstruct the clique tree representation of the molecular graphs from the tree level embeddings of the clique tree representation using the Molecular Clique tree decoder;
extract a plurality of virtual hyperedge replacement graph grammar (VHRG) latent vectors of a plurality of production rules associated with the clique tree virtual graph to aid in molecular graph decoding using a sequence-to-sequence learning (Seq2Seq) model, wherein the plurality of production rules describe graphical rewriting rules to match hyperedges and replace the hyperedges with hypergraph fragments in the clique tree virtual graph; and
retrieve the molecular graph, by the molecular graph-decoder, wherein the molecular graph-decoder utilizes the reconstructed clique tree, the graph-level embeddings of the molecular graph, and the plurality of VHRG latent vectors of the clique tree virtual graph for retrieval.
8. The system of claim 7, wherein to learn the node-level embedding and the graph-level embedding of each augmented molecular graph from amongst the plurality of augmented molecular graphs, the one or more hardware processors are configured by the instructions to:
in a plurality of message passing iterations, iteratively subject each augmented molecular graph through representation learning to determine latent information incorporated in node-level embeddings of the augmented molecular graph, wherein to execute each iteration of the plurality of iterations, the one or more hardware processors are configured by the instructions to:
dispatch, by a set of local-graph neighbors of a source node in the molecular graph, a long-range dependencies embedded neural-message vector and a local-information aware neural message vector to the source node, wherein the long-range dependencies embedded neural-message vector is modeled by a gating mechanism to capture prominent substructures in the molecular graph, and wherein the local-information aware neural message vector is modeled by a feed-forward neural network to encapsulate local-graph structure information;
compute a weighted neural message vector to be perceived by the source node by refining the long-range dependencies embedded neural-message vector and the local-information aware neural message vector by using a feature vector of the source node and a static edge vector characterizing the edge between the source node and the sink node;
aggregate local-neighbors information comprising the weighted neural message vector from the plurality of message-passing iterations to compute a refined neural message vector to be dispatched from the plurality of source nodes to the sink node;
update node level embedding of the sink node based on the refined neural message vector as perceived by the sink node, wherein the node level embedding of the sink node embeds structure and local-feature information about its local graph neighbors;
determine a fixed-size graph-level embedding by performing a node-ordering invariant sum-pooling operation on the node-level embeddings of the sink node; and
share learning parameters of iterative graph message-passing mechanism across the plurality of nodes of varying local neighborhood size in the molecular graph and across the iteration steps.
9. The system of claim 7, wherein to compute the tree-level embedding of the clique tree representation of the molecular graph using the tree encoder, the one or more hardware processors are configured by the instructions to:
perform a plurality of tree based message-passing iterations to obtain a plurality of tree node-level embeddings of the plurality of clique trees of the molecular graph to refine each of the plurality of clique tree level embeddings, wherein to perform each iteration of the plurality of tree based message-passing iterations, the one or more hardware processors are configured by the instructions to:
bidirectionally connect a virtual node to the plurality of nodes of the molecular clique tree through an arbitrary local edge type to aid in subgraph representation learning;
communicate neural messages across the local-tree neighbors in the molecular clique tree, wherein the local edges connecting the local-tree neighbors in the local clique tree are characterized with an empty static edge-feature vector;
select a random leaf node as a root source node of the corresponding clique tree of the annotated molecular graph; and
communicate the structure-aware neural messages from one of the root source node and the intermediate source nodes to all the sink nodes in a top-down mode and a bottom-up mode, wherein in the top-down mode, neural messages are dispatched from the root source node to all the leaf sink nodes according to the computational graph associated with the root node, and wherein in the bottom-up mode, neural messages are dispatched and refined iteratively from the leaf source nodes towards the root sink node of the clique tree.
10. The system of claim 7, wherein to reconstruct the clique tree representation from the tree level embeddings of the clique tree to obtain reconstructed clique tree, the one or more hardware processors are configured by the instructions to:
determine a label of the root node from a set of valid vocabulary obtained from tree-decomposition of the molecular graphs; and
iterate in a plurality of iterations, by operating the tree decoder in a pre-order traversal technique, over a set of nodes of the clique tree commencing from the root node and bringing about adding a child node to a current node from amongst the set of nodes in a depth-first order and extending up to a set of terminal nodes belonging to the set of nodes, wherein to perform each iteration from amongst the plurality of iterations, the one or more hardware processors are configured by the instructions to:
add the child node to the current node in the clique tree at a current iteration-step;
determine a label of the child node; and
decode the clique tree incrementally based on a knowledge assembled from local-tree neighbors embeddings aware neural messages, wherein each child node in the reconstructed clique tree acquires structure-aware neural messages from the local-clique tree neighbors at every iteration for topological predictions, and wherein decoding the clique tree comprises reconstructing the clique tree.
11. The system of claim 7, wherein to extract the plurality of VHRG latent vectors of the plurality of production rules, the one or more hardware processors are configured by the instructions to:
transform an undirected clique tree into a rooted clique tree with a random leaf node designated as a root node, wherein the nodes of the rooted clique tree are subgraphs of the molecular graph, and wherein the rooted clique tree is associated with a natural ordering of the child nodes of an immediate parent node from a top-to-bottom and from a left-to-right approach;
add virtual edges characterized by empty static edge feature attributes by traversing level-by-level, connecting one or more sibling nodes of each node visited in a current iteration in the rooted clique tree, to obtain a clique tree virtual graph;
decompose the molecular clique tree virtual graph to obtain the virtual clique tree, wherein the virtual clique tree is a constrained representation of the molecular clique tree virtual graph;
extract the VHRG from the virtual clique tree to generate molecular clique tree virtual graph production rules; and
reconstruct the isomorphic clone of the molecular clique tree virtual graph to learn VHRG latent vectors.
12. The system of claim 7, wherein to reconstruct the isomorphic clone of the molecular graph associated with the reconstructed clique tree using the molecular graph decoder, the one or more hardware processors are configured by the instructions to:
provide molecular graph-level embedding of the molecular graph, the plurality of VHRG latent vectors and the reconstructed clique tree as input to the molecular graph decoder;
select a molecular graph having an equivalent clique tree as the reconstructed clique tree from amongst a set of realizable molecular graphs based on a scoring function, wherein the scoring function facilitates in evaluating the realizable molecular graphs; and
merge the local-tree nodes (cliques or clusters, subgraphs in the molecular graph) in the clique tree and transform it into a valid molecular graph.
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
SYSTEM AND METHOD FOR MOLECULE DESIGN USING A VIRTUAL GRAPH GRAMMARS DRIVEN VARIATIONAL GENERATIVE MODEL
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to the field of molecule design, and, more particularly, to a system and a method for molecule design using a virtual graph grammars driven variational generative model.
BACKGROUND
The key desideratum of artificial intelligence in molecule design is to assist in computational molecule discovery by moving humans-out-of-the-loop to pave the way for automation of discovering relevant novel task-specific molecules with desired properties.
The existing models for molecule design/search/generation are classified based on neural-network architecture, such as Variational Autoencoder (VAE), Generative Adversarial Networks (GAN), Normalizing Flows, Reinforcement Learning, etc. However, known neural-network architectures such as the Junction Tree VAE (JTVAE) suffer from certain limitations. For example, the JTVAE lacks a message-passing mechanism that leverages both local-node and periphery graph neighbor information to effectively capture the prominent sub-structure patterns in the molecular graph when learning discriminative graph-level embeddings. Another limitation of the conventional JTVAE for molecule design is the lack of latent information incorporated in the message-passing mechanism to update node-level and graph-level embeddings. Moreover, due to its non-deterministic molecular graph decoding, the VAE suffers from lower molecular graph reconstruction accuracy from its latent vectors and sub-optimal results in property optimization.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for molecule design using a Virtual Graph Grammars driven variational generative model is provided. The method includes employing, via one or more hardware processors, a Virtual Graph Grammars driven Variational Generative model for molecule design, the Virtual Graph Grammars driven Variational Generative model comprising a Molecular Graph Encoder, Molecular Clique Tree Encoder, Molecular Clique Tree Decoder, Virtual Hyperedge Replacement Graph Grammars and a Molecular Graph Decoder. Further, the method includes accessing, via the one or more hardware processors, a database comprising a plurality of variable-sized molecular graphs associated with a plurality of molecules and a plurality of labels indicative of chemical properties of the plurality of the variable-sized molecular graphs, wherein each molecular graph of the plurality of variable-sized molecular graphs comprises a plurality of nodes and a plurality of edges connecting a plurality of neighboring nodes from amongst the plurality of nodes. Furthermore, the method includes augmenting, via the one or more hardware processors, each molecular graph from amongst the plurality of variable-sized molecular graphs with a virtual node described by auxiliary information to obtain an augmented molecular graph, wherein the virtual node of the molecular graph is bidirectionally connected to the plurality of nodes of the molecular graph through a virtual edge-type. Also, the method includes learning, via the one or more hardware processors, expressive representations of the augmented molecular graphs using the Molecular graph encoder, wherein the expressive representations comprise a node-level embedding and a graph-level embedding of the augmented molecular graphs. Moreover, the method includes performing, via the one or more hardware processors, tree decomposition of the molecular graphs to obtain a clique tree representation of the molecular graphs. Also, the method includes computing, via the one or more hardware processors, a discriminative tree-level embedding of the clique tree representation by message-passing schemes on valid subgraphs of the molecular graph, through a molecular Clique tree encoder, by operating on the topology of the clique tree augmented with a virtual node. Further, the method includes reconstructing, via the one or more hardware processors, the clique tree representation of the molecular graphs from tree level embeddings of the clique tree using the Molecular Clique tree decoder. Further, the method includes extracting, via the one or more hardware processors, a plurality of virtual hyperedge replacement graph grammar (VHRG) latent vectors of the production rules associated with the clique tree virtual graph to aid in molecular graph decoding using a sequence-to-sequence learning (Seq2Seq) model, wherein the production rules describe graphical rewriting rules to match hyperedges and replace the hyperedges with hypergraph fragments in the clique tree virtual graph.
The method also includes retrieving, via the one or more hardware processors, the molecular graph, by the molecular graph-decoder, wherein the molecular graph-decoder utilizes the reconstructed clique tree, the graph-level embeddings of the molecular graph, and the plurality of VHRG latent vectors of the clique tree virtual graph for retrieval.
In another aspect, a system for molecule design using a Virtual Graph Grammars driven variational generative model is provided. The system includes a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to employ a Virtual Graph Grammars driven Variational Generative model for molecule design, the Virtual Graph Grammars driven Variational Generative model comprising a Molecular Graph Encoder, Molecular Clique Tree Encoder, Molecular Clique Tree Decoder, Virtual Hyperedge Replacement Graph Grammars and a Molecular Graph Decoder. Further, the one or more hardware processors are configured by the instructions to access a database comprising a plurality of variable-sized molecular graphs associated with a plurality of molecules and a plurality of labels indicative of chemical properties of the plurality of the variable-sized molecular graphs, wherein each molecular graph of the plurality of variable-sized molecular graphs comprises a plurality of nodes and a plurality of edges connecting a plurality of neighboring nodes from amongst the plurality of nodes. Furthermore, the one or more hardware processors are configured by the instructions to augment each molecular graph from amongst the plurality of variable-sized molecular graphs with a virtual node described by auxiliary information to obtain an augmented molecular graph, wherein the virtual node of the molecular graph is bidirectionally connected to the plurality of nodes of the molecular graph through a virtual edge-type. Also, the one or more hardware processors are configured by the instructions to learn expressive representations of the augmented molecular graphs using the Molecular graph encoder, wherein the expressive representations comprise a node-level embedding and a graph-level embedding of the augmented molecular graphs. Moreover, the one or more hardware processors are configured by the instructions to perform tree decomposition of the molecular graphs to obtain a clique tree representation of the molecular graphs. Also, the one or more hardware processors are configured by the instructions to compute a discriminative tree-level embedding of the clique tree representation by message-passing schemes on valid subgraphs of the molecular graph, through a molecular Clique tree encoder, by operating on the topology of the clique tree augmented with a virtual node. Further, the one or more hardware processors are configured by the instructions to reconstruct the clique tree representation of the molecular graphs from tree level embeddings of the clique tree using the Molecular Clique tree decoder. Further, the one or more hardware processors are configured by the instructions to extract a plurality of virtual hyperedge replacement graph grammar (VHRG) latent vectors of the production rules associated with the clique tree virtual graph to aid in molecular graph decoding using a sequence-to-sequence learning (Seq2Seq) model, wherein the production rules describe graphical rewriting rules to match hyperedges and replace the hyperedges with hypergraph fragments in the clique tree virtual graph.
The one or more hardware processors are configured by the instructions to retrieve, via the one or more hardware processors, the molecular graph, by the molecular graph-decoder, wherein the molecular graph-decoder utilizes the reconstructed clique tree, the graph-level embeddings of the molecular graph, and the plurality of VHRG latent vectors of the clique tree virtual graph for retrieval.
In yet another aspect, a non-transitory computer readable medium for executing a method for molecule design using a Virtual Graph Grammars driven variational generative model is provided. The non-transitory computer readable medium includes a plurality of instructions, which when executed, cause molecule design via the following method. The method includes employing, via one or more hardware processors, a Virtual Graph Grammars driven Variational Generative model for molecule design, the Virtual Graph Grammars driven Variational Generative model comprising a Molecular Graph Encoder, Molecular Clique Tree Encoder, Molecular Clique Tree Decoder, Virtual Hyperedge Replacement Graph Grammars and a Molecular Graph Decoder. Further, the method includes accessing, via the one or more hardware processors, a database comprising a plurality of variable-sized molecular graphs associated with a plurality of molecules and a plurality of labels indicative of chemical properties of the plurality of the variable-sized molecular graphs, wherein each molecular graph of the plurality of variable-sized molecular graphs comprises a plurality of nodes and a plurality of edges connecting a plurality of neighboring nodes from amongst the plurality of nodes. Furthermore, the method includes augmenting, via the one or more hardware processors, each molecular graph from amongst the plurality of variable-sized molecular graphs with a virtual node described by auxiliary information to obtain an augmented molecular graph, wherein the virtual node of the molecular graph is bidirectionally connected to the plurality of nodes of the molecular graph through a virtual edge-type. Also, the method includes learning, via the one or more hardware processors, expressive representations of the augmented molecular graphs using the Molecular graph encoder, wherein the expressive representations comprise a node-level embedding and a graph-level embedding of the augmented molecular graphs. Moreover, the method includes performing, via the one or more hardware processors, tree decomposition of the molecular graphs to obtain a clique tree representation of the molecular graphs. Also, the method includes computing, via the one or more hardware processors, a discriminative tree-level embedding of the clique tree representation by message-passing schemes on valid subgraphs of the molecular graph, through a molecular Clique tree encoder, by operating on the topology of the clique tree augmented with a virtual node. Further, the method includes reconstructing, via the one or more hardware processors, the clique tree representation of the molecular graphs from tree level embeddings of the clique tree using the Molecular Clique tree decoder. Further, the method includes extracting, via the one or more hardware processors, a plurality of virtual hyperedge replacement graph grammar (VHRG) latent vectors of the production rules associated with the clique tree virtual graph to aid in molecular graph decoding using a sequence-to-sequence learning (Seq2Seq) model, wherein the production rules describe graphical rewriting rules to match hyperedges and replace the hyperedges with hypergraph fragments in the clique tree virtual graph.
The method also includes retrieving, via the one or more hardware processors, the molecular graph, by the molecular graph-decoder, wherein the molecular graph-decoder utilizes the reconstructed clique tree, the graph-level embeddings of the molecular graph, and the plurality of VHRG latent vectors of the clique tree virtual graph for retrieval.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates a network implementation of a system for molecule design using a virtual graph grammars driven variational generative model according to some embodiments of the present disclosure.
FIGS. 2A and 2B illustrate a flow diagram of a method for molecule design using a virtual graph grammars driven variational generative model in accordance with some embodiments of the present disclosure.
FIG. 3A illustrates an example representation of determination of molecular graph-level embedding by a molecular graph encoder according to some embodiments of the present disclosure.
FIG. 3B illustrates an example representation of a method for reconstruction of clique tree representation of the molecular graph, in accordance with an example embodiment of present disclosure.
FIG. 3C illustrates an example representation of latent vectors of the hyperedge replacement graph grammars according to some embodiments of the present disclosure.
FIG. 3D illustrates an example representation of a method for retrieving the molecular graph, by the molecular graph-decoder according to some embodiments of the present disclosure.
FIGS. 4A-4C illustrate an example representation of a method for extracting VHRG latent vectors of the production rules associated with clique tree virtual graph, according to some embodiments of the present disclosure.
FIGS. 5A-5D illustrate an example representation of merging subgraphs or nodes in the clique tree, according to some embodiments of the present disclosure.
FIG. 6 is a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
The existing models for molecule design/search/generation are classified based on neural-network architecture, such as Variational Autoencoder (VAE), Generative Adversarial Networks (GAN), Normalizing Flows, Reinforcement Learning, etc. However, known neural-network architectures such as the JTVAE suffer from certain limitations. For example, the JTVAE lacks a message-passing mechanism that leverages both local-node and periphery graph neighbor information to effectively capture the prominent sub-structure patterns in the molecular graph when learning discriminative graph-level embeddings. Another limitation of the conventional JTVAE for molecule design is the lack of latent information incorporated in the message-passing mechanism to update node-level and graph-level embeddings. This is attributed to the absence of a virtual node to aid in performing inference on the molecular graphs. Moreover, due to its non-deterministic molecular graph decoding, the VAE suffers from lower molecular graph reconstruction accuracy from its latent vectors and sub-optimal results in property optimization.
Various embodiments described herein provide a method and a system for molecule design that overcome the aforementioned and other limitations of the VAE. For example, in one embodiment, a system for molecule design is proposed that utilizes a virtual graph grammars driven variational generative model.
The disclosed system puts forth a message-passing mechanism that leverages both local-node and periphery graph neighbor information to effectively capture the prominent sub-structure patterns in the molecular graph and the clique tree when learning discriminative graph-level embeddings. The disclosed system incorporates latent information in the message-passing mechanism to update node-level and graph-level embeddings by utilizing the virtual node. The molecular graph decoder leverages the VHRG latent vectors to improve molecular graph reconstruction accuracy and avoid sub-optimal results in property optimization.
Organic molecules are treated as undirected molecular graphs. It is assumed that the set of feasible, accessible, and relevant molecules represented as molecular graphs is sampled from the valid chemical space denoted by M^G. The finite dataset of molecules with predetermined ground-truth labels is given by {G_1, …, G_M} ⊂ M^G. The subscript M denotes the finite number of molecules in the dataset. Each undirected molecular graph in the dataset is described by G_i = (V_i, E_i). V_i and E_i denote the finite vertex and edge set, respectively. |V_i| and |E_i| denote the finite number of nodes and edges in the homogeneous molecular graph, G_i. Each node, v_i ∈ V_i(G_i), in the molecular graph, G_i, is associated with a feature vector, F_i ∈ ℝ^C. It describes the node label, valence of the node, and other additional properties. C denotes the characteristic dimension of the node feature vector. Each undirected edge, e_ji ∈ E_i(G_i), in the molecular graph, G_i, is characterized by the edge information vector, E_ji ∈ ℝ^Z. E_ji describes the edge type and spatial information of the local-graph neighbors connected through the edge, e_ji. Z denotes the characteristic dimension of the static edge feature vector.
Assume v_i, v_j ∈ V_i(G_i) are local-graph neighbors in the molecular graph, G_i. They are connected by an arbitrary edge, e_ji ∈ E_i(G_i) ⇔ e_ij ∈ E_i(G_i), ∀ v_j ∈ N(i). N(i) denotes the local-graph neighbors of a node, v_i. The connectivity of the molecular graph, G_i, is described by the adjacency matrix, G_A^((i)) ∈ ℝ^(|V_i|×|V_i|), with G_A^((i))[j,i] = 1 if e_ji ∈ E_i(G_i), and G_A^((i))[j,i] = 0 otherwise. Each molecular graph, G_i, is characterized by a node feature matrix, F ∈ ℝ^(|V_i|×C), an edge feature information matrix, E ∈ ℝ^(|E_i|×Z), and the adjacency matrix, G_A^((i)). The performance of the Virtual Graph Grammars driven Variational Generative model is evaluated and benchmarked on the tasks of molecule reconstruction and validity, and property optimization. For property optimization, both global and local molecule optimization approaches are considered. Bayesian optimization is performed to determine the best molecule with specific desired properties in the molecule latent space. The local-optimization technique is leveraged to perform gradient ascent over the latent space to improve specific properties of a given set of molecules in a constrained scenario. The molecular reconstruction task objective is to retrieve the input molecular graphs from the corresponding latent vectors sampled from the molecule latent space, M^G. Assume E_VAE and D_VAE denote the pair of encoder and decoder in the model, with E_VAE: G_i ∈ M^G → ℝ^D and D_VAE: ℝ^D → G_i′ ∈ M^G. The mathematical formulation for molecule reconstruction is described by,
G_i′ = D_VAE(E_VAE(G_i))    (1)
If G_i′ and G_i are isomorphic, the reconstruction has succeeded. The mathematical formulation for molecule validity is given by D_VAE(z_G), where the latent vectors, z_G ∈ ℝ^D, are sampled from the prior distribution. D_VAE(z_G) succeeds if it is chemically valid (RDKit is used to parse the SMILES string). The molecular global property optimization task is to determine G* ∈ M^G such that G* = argmax_(G_i ∈ M^G) f(G_i). G* is acquired by,
G* = D_VAE(argmax_(z_G ∈ ℝ^D) f(D_VAE(z_G)))    (2)
f: G → ℝ is a machine learning function approximation model that predicts the target chemical property of the organic molecule. The local molecular optimization aims to improve the property of a given molecule by constraining the degree of deviation from the original molecule, as:
m* = D_VAE(argmax_(z_G: sim(m_G, D_VAE(z_G)) ≥ t) f(D_VAE(z_G)))    (3)
Here, sim(m_G, m_G′) computes the similarity between molecules m_G and m_G′, and t is a similarity threshold. Tanimoto similarity over Morgan fingerprints (radius = 2) is used.
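By way of a non-limiting illustration, the validity check and the similarity constraint of equation (3) may be computed with RDKit as sketched below; the function names and example SMILES strings are hypothetical, and only the choice of Tanimoto similarity over Morgan fingerprints of radius 2 is taken from the description above.
# Illustrative sketch: RDKit-based validity check and Tanimoto similarity
# over Morgan fingerprints (radius 2). Function names and the example
# SMILES strings are hypothetical.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def is_valid(smiles):
    # A decoded molecule is treated as chemically valid if RDKit parses it.
    return Chem.MolFromSmiles(smiles) is not None

def tanimoto_similarity(smiles_a, smiles_b):
    fps = []
    for s in (smiles_a, smiles_b):
        mol = Chem.MolFromSmiles(s)
        fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

print(is_valid("c1ccccc1O"))                          # True (phenol)
print(tanimoto_similarity("c1ccccc1O", "c1ccccc1N"))  # phenol vs. aniline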
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
Referring now to the drawings, and more particularly to FIG. 1 through 6, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates a network implementation 100 of a system 102 for molecule design using a virtual graph grammars driven variational generative model according to some embodiments of the present disclosure. The disclosed system assists in computational molecule discovery by moving humans-out-of-the-loop to pave the way for automation of discovering relevant novel task-specific molecules with desired properties. The system 102 facilitates molecule design/molecule search/molecule generation for discovering molecules with high potency. The system 102 includes a molecular graph encoder, a molecular clique tree encoder, a molecular clique tree decoder, virtual hyperedge replacement graph grammars and a molecular graph decoder.
Although the present disclosure is explained considering that the system 102 is implemented on a server, it may be understood that the system 102 may also be implemented in a variety of computing systems 104, such as a laptop computer, a desktop computer, a notebook, a workstation, a cloud-based computing environment and the like. It will be understood that the system 102 may be accessed through one or more devices 106-1, 106-2... 106-N, collectively referred to as devices 106 hereinafter, or applications residing on the devices 106. Examples of the devices 106 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, a smartphone, a tablet computer, a workstation and the like. The devices 106 are communicatively coupled to the system 102 through a network 108.
In an embodiment, the network 108 may be a wireless or a wired network, or a combination thereof. In an example, the network 108 can be implemented as a computer network, as one of the different types of networks, such as virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 108 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, the network 108 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices. The network devices within the network 108 may interact with the system 102 through communication links.
As discussed above, the system 102 may be implemented in a computing device 104, such as a hand-held device, a laptop or other portable computer, a tablet computer, a mobile phone, a PDA, a smartphone, and a desktop computer. The system 102 may also be implemented in a workstation, a mainframe computer, a server, and a network server. In an embodiment, the system 102 may be coupled to a database, for example, a database 112. The database 112 may store data processed, received, and generated by the system 102. Additionally, the database 112 includes a plurality of molecular graphs associated with a plurality of molecules and a plurality of labels indicative of chemical properties of the plurality of the molecular graphs. In an alternate embodiment, the system 102 may include the database 112.
The network implementation 100 supports various connectivity options such as BLUETOOTH®, USB, ZigBee and other cellular services. The network environment enables connection of devices 106 such as Smartphone with the server 104, and accordingly with the database 112 using any communication link including Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 102 is implemented to operate as a stand-alone device. In another embodiment, the system 102 may be implemented to work as a loosely coupled device to a smart computing environment.
FIGS. 2A and 2B illustrate a flow chart of a method 200 for molecule design using a virtual graph grammars driven variational generative model, in accordance with an example embodiment of present disclosure. Operations of the flowchart, and combinations of operation in the flowchart, may be implemented by various means, such as hardware, firmware, processor, circuitry and/or other device associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described in various embodiments may be embodied by computer program instructions. In an example embodiment, the computer program instructions, which embody the procedures, described in various embodiments may be stored by at least one memory device of a system and executed by at least one processor in the system. Any such computer program instructions may be loaded onto a computer or other programmable system (for example, hardware) to produce a machine, such that the resulting computer or other programmable system embody means for implementing the operations specified in the flowchart. It will be noted herein that the operations of the method 200 are described with help of system 102. However, the operations of the method 200 can be described and/or practiced by using any other system.
At 202, the method 200 includes employing a Virtual Graph Grammars driven Variational Generative model for molecule design. The Virtual Graph Grammars driven Variational Generative model includes a Molecular Graph Encoder, a Molecular Clique Tree Encoder, a Molecular Clique Tree Decoder, Virtual Hyperedge Replacement Graph Grammars and a Molecular Graph Decoder. The molecular graph encoder determines the molecular graph-level embeddings of a plurality of molecules stored in a database. The tree encoder computes the tree-level embeddings. The tree decoder retrieves the tree representation of the molecular graphs. The Virtual Hyperedge Replacement Graph Grammars corresponding to the virtual junction graph are extracted to aid in molecular graph decoding. The graph decoder retrieves the original molecular graphs. An algorithm (Algorithm 1) depicts the working of the Generative model submodules, as described below:
Algorithm 1: Pseudocode for Generative Model.
for i=1 to M do
Molecular Graph Encoder.
Molecular Clique Tree Encoder.
Molecular Clique Tree Decoder.
Virtual Hyperedge Replacement Graph Grammars.
Molecular Graph Decoder.
end for
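For orientation only, the per-molecule pipeline of Algorithm 1 may be organized as in the following Python sketch; the class and method names are illustrative placeholders rather than the disclosed implementation, and each submodule is detailed in the sections that follow.
# Structural sketch of Algorithm 1 (names are illustrative placeholders).
class VirtualGraphGrammarVAE:
    def __init__(self, graph_encoder, tree_encoder, tree_decoder,
                 vhrg_module, graph_decoder):
        self.graph_encoder = graph_encoder  # molecular graph -> graph-level latent z_G
        self.tree_encoder = tree_encoder    # clique tree -> tree-level latent z_T
        self.tree_decoder = tree_decoder    # z_T -> reconstructed clique tree
        self.vhrg_module = vhrg_module      # production rules -> VHRG latent vectors
        self.graph_decoder = graph_decoder  # (tree, z_G, VHRG latents) -> molecular graph

    def forward(self, mol_graph, clique_tree, production_rules):
        z_graph = self.graph_encoder(mol_graph)
        z_tree = self.tree_encoder(clique_tree)
        recon_tree = self.tree_decoder(z_tree)
        vhrg_latents = self.vhrg_module(production_rules)
        return self.graph_decoder(recon_tree, z_graph, vhrg_latents)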
The function of each of the Molecular Graph Encoder, the Molecular Clique Tree Encoder, the Molecular Clique Tree Decoder, the Virtual Hyperedge Replacement Graph Grammars and the Molecular Graph Decoder is described further with reference to the FIGS. 2A-2B, 3A-3D and description below.
At 204, the database comprising a plurality of variable-sized molecular graphs associated with the plurality of molecules and a plurality of labels indicative of chemical properties of the plurality of the variable-sized molecular graphs is accessed. Each molecular graph of the plurality of variable-sized molecular graphs includes a plurality of nodes and a plurality of edges connecting a plurality of neighboring nodes from amongst the plurality of nodes.
As illustrated in FIG. 3A, the molecular graph encoder determines the molecular graph-level embeddings (for example, a molecular graph-level embedding 304) of the plurality of molecular graphs (for example, a molecular graph 302) corresponding to the plurality of molecules stored in the database. The objective of the molecular graph encoder is to compute the low-dimensional discriminative graph-level embedding, h_(G_i), of the molecular graph, G_i. In an embodiment, a node-ordering invariant graph message-passing neural network is utilized to model the complex non-linear molecular graphs and to determine the low-dimensional representation of G_i whilst maximally preserving the topology and properties of the graph. It is a non-spectral, graph-convolution based algorithmic approach to determine the expressive node-level embeddings, h_i (∀ v_i ∈ V_i(G_i)), and graph-level embedding, h_(G_i), of the molecular graph, G_i.
At 206, each molecular graph from amongst the plurality of variable-sized molecular graphs is augmented with a virtual node described by auxiliary information to obtain an augmented molecular graph. The auxiliary information includes empty feature attributes and a label representing the virtual node. The virtual node of the augmented molecular graph is bidirectionally connected to the plurality of nodes of G_i through a virtual edge-type.
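As a minimal sketch of this augmentation step, assuming an adjacency-matrix representation and zero-valued ("empty") feature attributes for the virtual node, the operation may be implemented as follows; the representation and function name are assumptions for illustration only.
# Sketch of step 206: append a virtual node bidirectionally connected to
# every real node. Adjacency-matrix form and zero features are assumptions.
import numpy as np

def augment_with_virtual_node(adj, feats):
    n = adj.shape[0]
    aug_adj = np.zeros((n + 1, n + 1), dtype=adj.dtype)
    aug_adj[:n, :n] = adj
    aug_adj[n, :n] = 1  # virtual node -> every real node (virtual edge-type)
    aug_adj[:n, n] = 1  # every real node -> virtual node
    virtual_feat = np.zeros((1, feats.shape[1]), dtype=feats.dtype)  # empty attributes
    return aug_adj, np.vstack([feats, virtual_feat])

adj = np.array([[0, 1], [1, 0]])  # toy two-node molecular graph
feats = np.ones((2, 4))           # toy node feature matrix, C = 4
aug_adj, aug_feats = augment_with_virtual_node(adj, feats)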
At 208, the method 200 includes learning expressive representations of the augmented molecular graphs using the Molecular graph encoder. The expressive representations include a node-level embedding and a graph-level embedding of the augmented molecular graphs. In an embodiment, learning the expressive representations of the augmented molecular graphs includes iteratively subjecting each augmented molecular graph through representation learning, in a plurality of message passing iterations, to determine latent information incorporated in node-level embeddings of the augmented molecular graph. Each iteration of the plurality of message passing iterations includes dispatching a long-range dependencies embedded neural-message vector and a local-information aware neural message vector by a set of local-graph neighbors of a source node in the molecular graph to the source node, as described further.
For example, the augmented molecular graphs (from step 206) are put through representation learning to determine the latent-information-incorporated node-level embeddings. Each node, v_i ∈ V_i(G_i), in the molecular graph sends and receives messages through the edges connecting it to its respective local-graph neighbors in the molecular graph, G_i. At first, the source node, v_j, receives neural messages from its local-graph neighbors, v_w ∈ N(j)\v_i. Later, the source node, v_j ∈ N(i), sends a neural message to the sink node, v_i. The local-graph neighbors of the source node, v_w ∈ N(j)\v_i, dispatch the long-range dependencies embedded neural-message vector, F̂_wj^((t)), and the local-information aware neural message vector, p_wj^((t)), to the source node, v_j. F̂_wj^((t)) is modeled by the gating mechanism to capture the prominent substructures in the molecular graph. p_wj^((t)) is modeled by the feed-forward neural network to encapsulate the rich local-graph structure information.
The neural message vectors, F̂_wj^((t)) and p_wj^((t)), perceived by the source node, v_j, are refined by the feature vector, F_j, of the source node, v_j, and the static edge vector, E_ji, to compute the weighted neural-message vector, ν_ji^((t)).
Algorithm 2: Iterative Graph Message-Passing Mechanism
Initialize messages: ν_ji^((0)) = 0
for t = 1 to T do
F̂_wj^((t)) = GRU([F_w, W_2 E_wj], {ν_wj^((t-1))}_(w∈N(j)\i))
p_wj^((t)) = Φ_(T′)([F_w, W_2 E_wj, ∑_(w∈N(j)\i) ν_wj^((t-1))])
ν_ji^((t)) = τ(W_1 F_j + W_2 E_ji + (1-α) F̂_wj^((t)) + α p_wj^((t)))
end for
h_i = τ((1-β) U_1 F_i + β ∑_(j∈N(i)) U_2 ν_ji^((T)))
Return h_(G_i) = ∑_i h_i / |V_i(G_i)|.
Algorithm 3: Gating Mechanism on Graph Network
s^((t-1)) = ∑_(w∈N(j)\i) ν_wj^((t-1))
z_wj^((t-1)) = σ(W^z [F_w, W_2 E_wj] + U^z s^((t-1)) + b^z)
r_wj^((t-1)) = σ(W^r [F_w, W_2 E_wj] + U^r ν_wj^((t-1)) + b^r)
F̃_wj^((t-1)) = tanh(W [F_w, E_wj] + U ∑_(w∈N(j)\i) (r_wj^((t-1)) ⊙ ν_wj^((t-1))))
F̂_wj^((t)) = (1 - z_wj^((t-1))) ⊙ s^((t-1)) + z_wj^((t-1)) ⊙ F̃_wj^((t-1))
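A PyTorch sketch of the gating update of Algorithm 3 is given below, under the weight-sharing assumption W^z = W^r = W and U^z = U^r = U noted later in this section; the tensor shapes and module structure are assumptions, not the disclosed implementation.
# Sketch of Algorithm 3 with shared gate weights (W^z = W^r = W, U^z = U^r = U).
import torch
import torch.nn as nn

class GatedMessageUpdate(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, hidden_dim, bias=False)   # acts on [F_w, W_2 E_wj]
        self.U = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.b_z = nn.Parameter(torch.zeros(hidden_dim))
        self.b_r = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x_wj, incoming):
        # x_wj: (in_dim,) concatenated node/edge features; incoming: (k, hidden_dim)
        s = incoming.sum(dim=0)                                        # s^(t-1)
        z = torch.sigmoid(self.W(x_wj) + self.U(s) + self.b_z)         # update gate
        r = torch.sigmoid(self.W(x_wj) + self.U(incoming) + self.b_r)  # per-message reset gates
        f_tilde = torch.tanh(self.W(x_wj) + self.U((r * incoming).sum(dim=0)))
        return (1 - z) * s + z * f_tilde                               # gated message F̂_wj^(t)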
The message vectors, ν_ji^((t)) and ν_ij^((t)) (ν_ji^((t)) ≠ ν_ij^((t))), denote the local-graph neighborhood aware neural messages sent from node v_j to v_i and from v_i to v_j at the t-th message-passing iteration step. Each message-passing iterative step enlarges the receptive field by one hop; iterating for a plurality of iteration steps (T) expands the receptive field to the set of T-hop local-graph neighbors. Performing the plurality of iteration steps (T) of message passing across the nodes of the molecular graph, G_i, aggregates local-neighbor information (T-hop local-neighbor information) from the plurality of message-passing iterations to compute the refined neural message vector, ν_ji^((T)), to be dispatched from the plurality of source nodes to the sink node. ν_ji^((T)) is perceived by the sink node, v_i, to update its node-level embedding, h_i. The sink node representation, h_i, embeds structure and local-feature information about its T-hop (or local graph) neighborhood. A fixed-size graph-level embedding, h_(G_i), is determined by performing a node-ordering invariant sum-pooling operation on the node-level embeddings, ∑_i h_i / |V_i(G_i)|. The mean, μ_(G_i), and log variance, log σ_(G_i), of the variational posterior approximation of the unobserved latent vector, z_(G_i), are utilized to perform inference on the molecular graphs. μ_(G_i) and log σ_(G_i) are determined from h_(G_i) with two distinct affine layers. ε is sampled from a Gaussian distribution, N(μ_(z_(G_i)), σ_(z_(G_i))).
z_(G_i) = μ_(G_i) + σ_(G_i) ⊙ ε, where ε ~ N(μ_(z_(G_i)), σ_(z_(G_i)))    (4)
Herein, a weight-sharing technique, W^z = W^r = W and U^z = U^r = U, may be employed for the gating mechanism. The parameters of the iterative graph message-passing and gating mechanism are shared across nodes of varying local neighborhood size in the molecular graph, G_i, and across the iteration steps. τ denotes the rectified linear unit. W_1 and W_2 are the trainable parameters of the network. Φ_(T′) is a feed-forward neural network with learnable parameters depicted by T′. α and β are the hyper-parameters of the graph encoder. [·] and ⊙ denote the concatenation operation and the Hadamard product, respectively.
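The readout and sampling step above may be sketched as follows in PyTorch; for simplicity, the noise is drawn from a standard normal as in a conventional VAE reparameterization, which is an assumption, and the layer names are illustrative.
# Sketch of the sum-pooled graph readout and the sampling of Eq. (4).
# Layer names are illustrative; eps ~ N(0, I) is a standard-VAE assumption.
import torch
import torch.nn as nn

class GraphReadout(nn.Module):
    def __init__(self, hidden_dim, latent_dim):
        super().__init__()
        self.mu_layer = nn.Linear(hidden_dim, latent_dim)      # affine layer for the mean
        self.logvar_layer = nn.Linear(hidden_dim, latent_dim)  # affine layer for log variance

    def forward(self, node_embeddings):
        # Node-ordering invariant readout: h_(G_i) = sum_i h_i / |V_i(G_i)|
        h_graph = node_embeddings.mean(dim=0)
        mu = self.mu_layer(h_graph)
        logvar = self.logvar_layer(h_graph)
        eps = torch.randn_like(mu)
        z_graph = mu + torch.exp(0.5 * logvar) * eps           # Eq. (4)
        return z_graph, mu, logvar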
At 210, the method 200 includes performing tree decomposition of the molecular graphs to obtain a clique tree representation of the molecular graphs by the molecular clique tree encoder and retrieving the molecular clique tree from its corresponding clique tree representation (as illustrated in FIG. 3B). Herein, the tree decomposition of the molecular graph, G_i (for example, the molecular graph 302), is performed to obtain its clique tree representation, T_{G_i} (for example, a clique tree 304). A clique (or junction) tree representation of the molecular graph, T_{G_i}, is a hierarchical coarse representation of the molecular graph substructures and the topology of subgraphs, and is obtained by performing the tree decomposition of the molecular graph. The purpose of the clique tree encoder is to compute the discriminative tree-level embedding, h^{T_{G_i}}, of the clique tree, T_{G_i}. h^{T_{G_i}} learns the expressive and interpretable subgraph features of the molecular graph, G_i.
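To make the decomposition step concrete, the following is a simplified, illustrative RDKit/networkx sketch. The disclosure's decomposition (in particular, how bridged or overlapping rings are merged into cliques) may differ; the function name and the ring/bond clique choices are assumptions.

```python
from rdkit import Chem
import networkx as nx

def clique_tree(smiles):
    """Simplified tree decomposition: cliques are the SSSR rings plus every
    non-ring bond; cliques sharing atoms are linked, and a spanning tree of
    that clique graph (weighted by atom overlap) is returned."""
    mol = Chem.MolFromSmiles(smiles)
    cliques = [set(ring) for ring in Chem.GetSymmSSSR(mol)]
    cliques += [{b.GetBeginAtomIdx(), b.GetEndAtomIdx()}
                for b in mol.GetBonds() if not b.IsInRing()]
    g = nx.Graph()
    g.add_nodes_from(range(len(cliques)))
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            shared = cliques[i] & cliques[j]
            if shared:                        # link cliques that share atoms
                g.add_edge(i, j, weight=len(shared))
    return cliques, nx.maximum_spanning_tree(g)

cliques, tree = clique_tree("CC1=CC(=O)C=CC1=O")  # e.g., a quinone
```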
At 212, the method 200 includes computing a discriminative tree-level embedding 306 of the clique tree representation by message-passing schemes on valid subgraphs of the molecular graph, through the molecular clique tree encoder, by operating on the topology of the clique tree augmented with a virtual node. In an embodiment, a tree message-passing network is utilized to operate on the topology of T_{G_i} and compute the low-dimensional, nonlinear, and informative tree-level embedding, h^{T_{G_i}}, of the junction tree, T_{G_i}. Each node of the cycle-free clique tree, C_i, is an induced subgraph of the molecular graph, G_i. C_i is described by its node embedding, h_i^{T_{G_i}}, which is refined by the iterative tree message-passing scheme to be more discriminative. Neural messages are communicated across the local-tree neighbors in the junction tree, T_{G_i}. In contrast to the graph message-passing mechanism, the edges connecting the local-tree neighbors (cliques) in the junction tree are characterized by an empty static edge-feature vector. Neural messages m_ij^(t) and m_ji^(t) are dispatched from the node C_i to the node C_j through the edge (C_i, C_j), and vice versa, in the junction tree at iteration step t; the subscripts ij and ji indicate the message direction. A virtual node, connected bidirectionally to all the nodes of the molecular junction tree through an arbitrary virtual edge type, is utilized to aid in subgraph representation learning. m_ji^(t) is determined by sum-pooling the weighted neural messages GRU(h_j^{T_{G_i}}, {m_kj^(t−1)}_{k∈N(j)\i}) and d_{T̂}([h_j^{T_{G_i}}, Σ_{k∈N(j)\i} m_kj^(t−1)]). The latter weighted neural message summarizes the local-tree neighbor embeddings, and the former weighted neural message is computed by leveraging the gating mechanism to encapsulate long-range dependencies among the hidden representations of periphery tree nodes. A random leaf node is selected as the root node of the corresponding clique tree of the annotated molecular graph, G_i. The structure-aware neural messages are communicated in two distinct mechanisms. In a top-down mechanism, messages are dispatched from the root source node to all the leaf sink nodes according to the computational graph associated with the root node. In the bottom-up mechanism, messages are dispatched and refined iteratively from the intermediate source nodes towards the root sink node of the clique tree.
Algorithm 4: Iterative Tree-Message Passing Mechanism
Initialize messages: m_ji^(0) = 0
for t = 1 to T do
  m_ji^(t) = (1−γ)·GRU(h_j^{T_{G_i}}, {m_kj^(t−1)}_{k∈N(j)\i}) + γ·d_{T̂}([h_j^{T_{G_i}}, Σ_{k∈N(j)\i} m_kj^(t−1)])
  h_i^{T_{G_i}} = τ((1−η)·W^o h_i^{T_{G_i}} + η Σ_{k∈N(i)} U^o m_ki^(t−1))
end for
Return h^{T_{G_i}} = h_root^{T_{G_i}} (the root-node embedding)
Algorithm 5: Gating Mechanism on Tree-Structure
s^(t−1) = Σ_{k∈N(j)\i} m_kj^(t−1)
z_kj^(t−1) = σ(W^z h_j^T + U^z s^(t−1) + b^z)
r_kj^(t−1) = σ(W^r h_j^T + U^r m_kj^(t−1) + b^r)
m̃_kj^(t−1) = tanh(W h_j^T + U Σ_{k∈N(j)\i} r_kj^(t−1) ⊙ m_kj^(t−1))
m̂_kj^(t) = (1 − z_kj^(t−1)) ⊙ s^(t−1) + z_kj^(t−1) ⊙ m̃_kj^(t−1)
After performing T tree-based message-passing iterations, every tree node-level embedding incorporates information about its local T-hop neighborhood in the clique tree. The tree node-level embeddings, h_i^{T_{G_i}}, are transformed through the bottom-up approach, and the tree decoder utilizes both mechanisms to retrieve T_{G_i} from the tree latent vector, h^{T_{G_i}}. The tree-level embedding, h^{T_{G_i}}, is the root-node embedding, h_root^{T_{G_i}}. This is in contrast to the sum-pooling operation on the node-level embeddings that determines the graph-level embedding in the graph message-passing scheme. γ and η are hyper-parameters of the learning algorithm. W^o and U^o are trainable weight parameters of the algorithm. d_{T̂} is implemented by a feed-forward neural network whose learnable parameters are denoted by T̂. A weight-sharing mechanism, W^z = W^r = W and U^z = U^r = U, is employed in the gating mechanism, and its configurable parameters are distinct from those of the graph encoder GRU. The trainable parameters of the network are shared across the nodes of the clique tree and across the set of variable-sized junction trees associated with the training labeled molecular graph datasets.
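A compact, illustrative sketch of Algorithm 4 follows. The symbols γ and η, the hidden size, and the dictionary-based message bookkeeping are assumptions; note the root-node readout, in contrast to the sum pooling of the graph encoder.

```python
import torch
import torch.nn as nn

class TreeMessagePassingEncoder(nn.Module):
    """Illustrative sketch of Algorithm 4 with a simplified gating step."""
    def __init__(self, d, gamma=0.5, eta=0.5):
        super().__init__()
        self.d, self.gamma, self.eta = d, gamma, eta
        self.gru = nn.GRUCell(d, d)                  # gating (Algorithm 5)
        self.ffn = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())  # d_T̂
        self.Wo = nn.Linear(d, d, bias=False)
        self.Uo = nn.Linear(d, d, bias=False)

    def forward(self, h, edges, neighbors, root, T=3):
        # h: [n, d] clique embeddings; edges: directed (j, i) pairs
        m = {e: torch.zeros(self.d) for e in edges}  # m_ji^(0) = 0
        for _ in range(T):
            m_new = {}
            for (j, i) in edges:
                agg = sum((m[(k, j)] for k in neighbors[j] if k != i),
                          torch.zeros(self.d))
                gated = self.gru(h[j].unsqueeze(0),
                                 agg.unsqueeze(0)).squeeze(0)
                ff = self.ffn(torch.cat([h[j], agg]))
                m_new[(j, i)] = (1 - self.gamma) * gated + self.gamma * ff
            m = m_new
            h = torch.stack([torch.relu(
                    (1 - self.eta) * self.Wo(h[i]) + self.eta * self.Uo(
                        sum((m[(k, i)] for k in neighbors[i]),
                            torch.zeros(self.d))))
                for i in range(h.size(0))])
        return h[root]        # h^{T_{G_i}} is the root-node embedding
```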
At 214, the method 200 includes reconstructing the clique tree representation 304 of the molecular graphs from the tree-level embeddings 306 of the clique tree using the molecular clique tree decoder, to obtain a reconstructed (or decoded) clique tree 308 (FIG. 3B). The objective of the molecular clique tree decoder is to retrieve the junction tree, T_{G_i}, from its hidden latent vector representation, h^{T_{G_i}}. The clique tree, T̂_{G_i}, is reconstructed in a top-down approach. First, the label of the root node is determined; the node label is determined from the set of valid vocabulary (cliques or clusters) obtained from the tree decomposition of the molecular graphs. The tree decoder operates in a pre-order traversal technique by iterating over a set of nodes of the clique tree, commencing from the root node, adding child nodes to the visited (or current) node (from amongst the set of nodes) in a depth-first order, and extending up to the set of terminal nodes belonging to the set of nodes. The objective of the molecular clique tree decoder is two-fold: (1) for a given visited node, it computes topological predictions, and (2) it predicts the label of the visited node to reconstruct the clique tree. The tree decoder is modeled by a neural-network architecture to determine whether the visited node has a child node to generate. A single child node is added to the immediate parent node (or the current node) in the clique tree at a current iteration step. The label of the child node is determined, and then both of its subtrees are traversed. An exploration is performed along every branch before backtracking if the corresponding visited node has no children to generate. The junction tree is decoded/reconstructed incrementally based on the knowledge assembled from the local-tree-neighbor-embedding-aware neural messages, m̂_ji. Each visited child node, Ĉ_i, in the current decoded clique tree acquires structure-aware neural messages from the local clique tree neighbors, Ĉ_j ∈ N(i), at every computational/iteration step for topological predictions. Let Ê = {(Ĉ_i^(1), Ĉ_j^(1)), …, (Ĉ_i^(l), Ĉ_j^(l))} be the edges traversed in the depth-first traversal over T̂_{G_i} = (V̂^T, Ê^T); here, l = 2|Ê^T|, because the structure-aware neural messages are dispatched as well as received between two local-tree nodes, Ĉ_i and Ĉ_j, through the same edge (Ĉ_i, Ĉ_j) connecting them, and thus every edge is traversed twice. The tree decoder function visits node Ĉ_i^(s) at computational step s. Let Ê_s denote the first s edges traversed in Ê from the root node. The structure-aware neural message m̂_{(Ĉ_i^(s), Ĉ_j^(s))} is determined as follows,
p_1^s = GRU(h_{Ĉ_i^(s)}^T, {m̂_{(Ĉ_k, Ĉ_i^(s))}}_{(Ĉ_k, Ĉ_i^(s))∈Ê_s, Ĉ_k≠Ĉ_j^(s)}) (5)
p_2^s = Γ_{T′}([h_{Ĉ_i^(s)}^T, Σ_{Ĉ_k∈N(i^(s))\Ĉ_j^(s)} m̂_{(Ĉ_k, Ĉ_i^(s))}]) (6)
m̂_{(Ĉ_i^(s), Ĉ_j^(s))} = (1−γ)·p_1^s + γ·p_2^s (7)
For the topological prediction, the algorithm predicts the probability that further child nodes are to be added to the immediate parent node, Ĉ_i^(s), of the decoded clique tree. It is computed by a feed-forward neural network that takes into account the tree-level embedding, h^{T_{G_i}} (obtained from encoding the junction tree, T_{G_i}), the node embedding, h_{Ĉ_i^(s)}^T, and the inward neural messages, m̂_{(Ĉ_k, Ĉ_i^(s))}. The output is further transformed by a non-linear sigmoid activation function to give rise to binary-valued predictions.
p_s = σ(u^d · τ(W_1^d h_{Ĉ_i^(s)}^T + W_2^d h^{T_{G_i}} + W_3^d Σ_{(Ĉ_k, Ĉ_i^(s))∈Ê_s} m̂_{(Ĉ_k, Ĉ_i^(s))}))
u^d is a trainable weight parameter of the algorithm. For label prediction, the child node, Ĉ_j, is added to the clique tree through an arbitrary edge with an empty attribute to its immediate parent node, Ĉ_i, and its label is determined as follows,
q_{Ĉ_j} = softmax(U^l τ(W_1^l h^{T_{G_i}} + W_2^l m̂_{(Ĉ_i^(s), Ĉ_j^(s))})) (8)
q_{Ĉ_j} is a probability distribution over the valid vocabulary components, X, obtained from the tree decomposition of the molecular graphs. If Ĉ_j is the root node, its immediate parent, Ĉ_i, is an imaginary node; in this scenario, m̂_{(Ĉ_i^(s), Ĉ_j^(s))} = 0. The junction tree decoder is trained to learn the optimum learnable weights, W_1^d, W_2^d, and U^l, with the purpose of maximizing the likelihood p(T̂_{G_i} | T_{G_i}). Let p̂_s ∈ {0, 1} and q̂_{Ĉ_j} be the ground-truth topological and label values. The tree decoder is trained to minimize the following cross-entropy loss,
L_c(T) = Σ_s L^d(p_s, p̂_s) + Σ_j L^l(q_{Ĉ_j}, q̂_{Ĉ_j}) (9)
A teacher-forcing technique, commonly leveraged for training recurrent neural networks, is also used to train the auto-regressive tree decoder architecture: the topological and label predictions from the tree decoder model at a given step are replaced with their associated ground-truth values, which are fed as input to the tree decoder at the next step. This is in contrast to the Professor Forcing methodology, which is an adversarial method for learning generative models. During inference, in the absence of the specific training guidance available in the well-guided training phase, the molecular clique tree decoder retrieves the clique tree by operating recursively as recommended by the topological predictions, to make certain that the reconstructed junction tree can be transformed into a valid molecular graph. Here, the set X_{Ĉ_i} represents the node labels (clique tags) that are chemically consistent with the parent node, Ĉ_i, and its current adjacent local-tree neighbors. In the scenario where node Ĉ_i generates a child node Ĉ_j, the child label is sampled from a renormalized distribution q_{Ĉ_j} over X_{Ĉ_i}, by informing the feed-forward neural network layers that certain labels are invalid and thus should be skipped (masked out) when making predictions.
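The two prediction heads of the tree decoder, including the label masking described above, can be sketched as follows; the layer names, sizes, and the boolean-mask interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TreeDecoderStep(nn.Module):
    """Illustrative sketch of the topological and label heads (Eqs. 5-9)."""
    def __init__(self, d, n_vocab):
        super().__init__()
        self.Wd1, self.Wd2, self.Wd3 = (nn.Linear(d, d, bias=False)
                                        for _ in range(3))
        self.ud = nn.Linear(d, 1, bias=False)        # u^d
        self.Wl1 = nn.Linear(d, d, bias=False)
        self.Wl2 = nn.Linear(d, d, bias=False)
        self.Ul = nn.Linear(d, n_vocab, bias=False)  # U^l

    def topological(self, h_node, h_tree, msg_in):
        # p_s: probability that the visited node expands a new child
        x = torch.relu(self.Wd1(h_node) + self.Wd2(h_tree)
                       + self.Wd3(msg_in))
        return torch.sigmoid(self.ud(x))

    def label(self, h_tree, msg, valid_mask):
        # q_{Ĉ_j}: distribution over the vocabulary, renormalized over
        # chemically consistent labels (invalid ones are masked out)
        logits = self.Ul(torch.relu(self.Wl1(h_tree) + self.Wl2(msg)))
        logits = logits.masked_fill(~valid_mask, float("-inf"))
        return torch.softmax(logits, dim=-1)         # Eq. (8)
```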
At 216, the method 200 includes extracting a plurality of virtual hyperedge replacement graph grammar (VHRG) latent vectors of the production rules associated with the clique tree virtual graph to aid in molecular graph decoding using sequence-to-sequence (Seq2Seq) learning. The method of extracting the VHRG is illustrated in FIGS. 4A-4C. Herein, the production rules describe graphical rewriting rules that match hyperedges and replace them with hypergraph fragments in the clique tree virtual graph. First, the tree decomposition of the molecular graph, G_i, is performed to obtain the junction tree representation of the molecular graph, T_{G_i}, as seen in FIG. 4A. The undirected junction tree, T_{G_i}, is transformed into a rooted junction tree, T_{G_i}^R, with a random leaf node designated as the root node. The nodes of the rooted junction tree, T_{G_i}^R, are subgraphs of the molecular graph, G_i. T_{G_i}^R is associated with a natural ordering of the child nodes of each immediate parent node, from top to bottom and from left to right, as seen in FIG. 4B. Virtual edges, characterized by empty static edge-feature attributes, are added by traversing level by level, connecting the sibling nodes of each visited (parent) node in the rooted junction tree, T_{G_i}^R, to obtain a junction tree virtual graph, G_i^V, as seen in FIG. 4C. The molecular junction tree virtual graph, G_i^V, is decomposed to obtain the virtual junction tree, T_{G_i}^V, which is a constrained representation of the molecular junction tree virtual graph, G_i^V. The VHRG is extracted from the virtual junction tree, T_{G_i}^V, to generate the molecular junction tree virtual graph production rules. Herein, the put-forth VHRG offers a complementary approach to hyperedge replacement grammars for graph generation.
Let η be an inner node that has at least a single child node or sibling non-leaf nodes on the same hierarchical level in the junction tree, T_{G_i}^V. Let η′ be its immediate parent node, a supernode of the given node η, and let η_1, …, η_k be the child local subnodes of the parent node, η. The inner node, η, of T_{G_i}^V is associated with a VHRG production rule, A → R. A, on the LHS, is a labeled nonterminal hyperedge with arity |A| = |V_{η′} ∩ V_η|. Note that η′ and η are subgraphs in G_i^V. The hypergraph, R, on the RHS is realized by cloning an isomorphic copy of the nodes in V_η and the edges in E_η. The nodes in V_{η′} ∩ V_η are marked as external nodes in R; the remaining nodes in R are nominated as internal nodes. Here, k nonterminal hyperedges are added in the hypergraph, R, since the inner node has k child nodes. The nodes in V_η ∩ V_{η_j}, j ∈ {1, …, k}, are connected with the corresponding nonterminal hyperedge. A VHRG production rule with no nonterminal hyperedge on its RHS is a terminal assembling rule.
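As an illustration of the objects a VHRG extractor would emit, one possible container for a single production rule A → R is sketched below. All field names are assumptions; only the semantics (arity, external nodes, k nonterminal hyperedges, terminal assembling rules) follow the description above.

```python
from dataclasses import dataclass, field

@dataclass
class VHRGRule:
    """Illustrative container for one VHRG production rule A -> R."""
    lhs_label: str                # nonterminal hyperedge label A
    arity: int                    # |A| = |V_eta' ∩ V_eta| (external nodes)
    rhs_nodes: list = field(default_factory=list)     # cloned V_eta
    rhs_edges: list = field(default_factory=list)     # cloned E_eta
    external: list = field(default_factory=list)      # V_eta' ∩ V_eta
    nonterminals: list = field(default_factory=list)  # one per child node

    @property
    def is_terminal(self) -> bool:
        # a rule with no nonterminal hyperedge on its RHS assembles terminals
        return not self.nonterminals
```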
The VHRG is put forth as an elementary approach to junction tree virtual graph generation and is presented as a (hyper)graph-grammatical counterpart to context-free string grammars. A VHRG, extracted from the junction tree virtual graph, describes the graphical rewriting rules that match hyperedges and replace them with hypergraph fragments. A Variational Autoencoder (VAE) driven by an attention mechanism that learns to align and translate jointly is utilized for sequence-to-sequence learning of the production rules corresponding to the molecular graphs. The extracted production rules (the output of the decoder) are leveraged to reconstruct or generate an isomorphic clone of the molecular junction tree virtual graph. The attention mechanism utilized is the well-known additive attention, in contrast to the multiplicative attention mechanism; it brings about a linear combination of the encoder states and the decoder states to map a variable-length input sequence to an output production-rule sequence. The neural-network architecture leveraged is a Seq2Seq-Bidirectional-Bahdanau-Attention-Scheduled-Sampling-based algorithm. The production rules, which are well described by the internal state vectors of the algorithm, T_{G_i}^*, are learned representations of the VHRG of the junction tree virtual graphs. The linearly transformed learned representation of the VHRG associated with each molecular graph, W^* T_{G_i}^*, is fused with the molecular graph-level representation, h_{G_i}. The fused vector, (h_{G_i} + W^* T_{G_i}^*), is fed as input to the graph decoder. In general, a molecular junction tree is associated with diverse feasible molecular graphs (a one-to-many mapping). The purpose of fusing the learned transformed representations of the VHRG into the molecular graph representation, h_{G_i}, is to aid molecular graph decoding as an injective (one-to-one) mechanism, to augment the molecular graph reconstruction accuracy, and to perform inference on it. W^* is a learnable parameter of the algorithm.
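The additive (Bahdanau) alignment used by the Seq2Seq grammar model scores each encoder state against the current decoder state as e = v^T tanh(W_enc h + W_dec s). A minimal sketch follows, with all sizes and names as assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Illustrative Bahdanau (additive) attention for the Seq2Seq model."""
    def __init__(self, d_enc, d_dec, d_att):
        super().__init__()
        self.W_enc = nn.Linear(d_enc, d_att, bias=False)
        self.W_dec = nn.Linear(d_dec, d_att, bias=False)
        self.v = nn.Linear(d_att, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: [L, d_enc]; dec_state: [d_dec]
        scores = self.v(torch.tanh(self.W_enc(enc_states)
                                   + self.W_dec(dec_state))).squeeze(-1)
        weights = torch.softmax(scores, dim=0)   # alignment over rule steps
        return weights @ enc_states              # context vector
```

At each decoding step the context vector is combined with the decoder state to emit the next production rule; scheduled sampling gradually replaces ground-truth rules with the model's own predictions during training.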
At 218, the method 200 includes retrieving the molecular graph by the molecular graph decoder. The molecular graph decoder utilizes the reconstructed clique tree 316, the graph-level embeddings 304 of the molecular graph, and the plurality of VHRG latent vectors 312 of the clique tree virtual graph for retrieval, as illustrated in FIG. 3D. The objective of the auto-regressive molecular graph decoder is to reconstruct an isomorphic clone of the input molecular graph, G_i, with which the retrieved clique (or junction) tree, T̂_{G_i} = (V̂^T, Ê^T), is associated. The inputs to the molecular graph decoder are the molecular graph-level embedding, h_{G_i}, the plurality of VHRG latent vectors, and the reconstructed clique (or junction) tree, T̂_{G_i}; the output is the reconstructed molecular graph, Ĝ_i. The purpose of the molecular graph decoder is to merge the local-tree nodes (cliques or clusters; subgraphs in G_i (for example, 502, 504)), Ĉ_i and Ĉ_j, in the junction tree and transform them into the valid molecular graph, Ĝ_i (for example, 316), as illustrated in FIGS. 5A-5D. The above step is nondeterministic: uncertainty prevails over many realizable molecular graphs, G_k ∈ M^G, G_k ≠ G_i, having an equivalent clique tree, T_{G_i}. The stochasticity exists in the way the local-tree nodes, Ĉ_i and Ĉ_j, in the junction tree, T̂_{G_i}, are connected to each other as subgraphs in Ĝ_i. The set of molecular graphs associated with the junction tree, T_{G_i}, is described by G(T_{G_i}). Decoding the molecular graph Ĝ_i from T̂_{G_i} = (V̂^T, Ê^T) is a molecular graph structure prognostication,
Ĝ_i = argmax_{G_i′ ∈ G(T_{G_i})} s_f^a(G_i′) (10)
s_f^a is a scoring function to evaluate the feasible molecular graphs. The distribution over the scoring function output depends only on the way a node, Ĉ_i, and its local-tree node neighbors, Ĉ_j ∈ N(i), in the junction tree, T̂_{G_i}, are merged to form the subgraph, Ĝ_i. The chemical subgraph, Ĝ_i, is assembled by conglomerating one local-tree neighbor at every step, abiding by the order in which the junction tree, T̂_{G_i}, was decoded, as described in the subsection, C. The merging of subgraphs begins with the selection of the root node in the junction tree, T̂_{G_i}, and its associated local-tree neighbors in the junction tree, based on their attained scores. This cycle is recurred to merge the local-tree neighbors and their corresponding local-tree neighbors (independent of the stochasticity set by the root node selection in T̂_{G_i}) to obtain Ĝ_i.
Every resulting merged molecular graph is scored by s_f^a, as described below. Let Ĝ_i be the subgraph derived by a certain procedure of coalescing a node (cluster or clique), Ĉ_i, in the junction tree, T̂_{G_i}, with its local-tree neighbor, Ĉ_j ∈ N(i). The resulting subgraph, Ĝ_i, is scored by first deriving a feature vector representation, ĥ_{Ĝ_i}, and later evaluating the subgraph score, s_f^a(Ĝ_i) = ĥ_{Ĝ_i} · (h_{G_i} + W^* T_{G_i}^*). Now, let m and n represent atoms in the merged subgraph Ĝ_i, and let a_n = i if n ∈ Ĉ_i and a_n = j if n ∈ Ĉ_j\Ĉ_i. The indices a_n and a_m are associated with the atom types of n and m, respectively, and are utilized to mark the positions of the atoms in the clique tree, T̂_{G_i}. The indices are utilized to make use of the structure-aware neural messages, m̂_{Ĉ_i,Ĉ_j}, which encapsulate the subtree under Ĉ_i through the edge (Ĉ_i, Ĉ_j), acquired by performing the tree encoding algorithm. Message-passing schemes are leveraged to compute the node-level embeddings in the subgraph, Ĝ_i; additionally, m̂_{Ĉ_i,Ĉ_j} is leveraged for learning discriminative node-level embeddings. Neural messages are communicated across the nodes of the subgraph, Ĝ_i, through the edges connecting the local-node neighbors. The neural messages, which encapsulate the fine-grained details of the graph structure, the local-graph neighbor feature information, and the edge attributes in the subgraph, Ĝ_i, are acquired and maximally retained in ĥ_{Ĝ_i}.
µ_mn^(t) = (1−θ)·τ(F_m + W_1^gd E_mn + W_2^gd Σ_{w∈N(m)\n} µ̃_wm^(t−1)) + θ·Γ_{T̂}([F_m, W_1^gd E_mn, W_2^gd Σ_{w∈N(m)\n} µ̃_wm^(t−1)])
µ̃_wm^(t−1) = GRU([F_w, W_1^gd E_wm], W_2^gd Σ_{w∈N(m)\n} µ_wm^(t−1)), if a_m = a_n
µ̃_wm^(t−1) = GRU([F_w, W_1^gd E_wm], W_2^gd Σ_{w∈N(m)\n} µ_wm^(t−1)) + m̂_{a_m,a_n}, if a_m ≠ a_n
(1−θ) and θ regulate the information flow from the gating mechanism and the feed-forward-network-evaluated neural message vectors, respectively, to determine the single aggregated weighted neural message vector, µ_mn^(t), to be perceived by the sink node, n. The single aggregated weighted neural message vector, µ_mn^(t), is perceived by the sink node, n, to refine its node-level embedding. It is obtained by sum-pooling the gating-mechanism-transformed neural message vector and the feed-forward-network-evaluated neural message vector from its local-graph neighbors. The gating-mechanism-refined neural message vector augments the encapsulation of long-range dependencies, and is described by (1−θ)·τ(F_m + W_1^gd E_mn + W_2^gd Σ_{w∈N(m)\n} µ̃_wm^(t−1)). The feed-forward-network-updated neural message vector describes the fine details of the local features and the graph structure, and is described by θ·Γ_{T̂}([F_m, W_1^gd E_mn, W_2^gd Σ_{w∈N(m)\n} µ̃_wm^(t−1)]). The graph decoder model incorporates the tree message, m̂_{a_m,a_n}, obtained by running the tree encoder over the predicted tree, T̂_{G_i}, when determining the neural message vector, µ̃_wm^(t−1). It is applied in scenarios where the sink node under consideration receives a neural message from a local node belonging to a different clique in the junction tree. The tree messages are augmented into the subgraph message-passing network in the graph decoder to avoid local isomorphism. µ̃_wm^(t−1) is the neural message vector sent from the local-tree neighbors, N(m)\n, to the source node, m. The node, m, can deliver a message to its one-hop neighbor, node n, if and only if it has acquired all the needed incoming neural messages from its local-tree neighbors, N(m)\n. The neural message, µ̃_wm^(t−1), dispatched from the local-tree neighbors, N(m)\n, of node m is computed through the gating mechanism, GRU([F_w, W_1^gd E_wm], W_2^gd Σ_{w∈N(m)\n} µ_wm^(t−1)).
Algorithm 6: Graph Decoder
Initialize neural messages: µ_mn^(0) = 0
for t = 1 to T do
  Compute neural messages, µ_mn^(t)
end for
h_{Ĝ_i}^n = τ(W_1^a F_n + Σ_{w∈N(n)} W_2^a µ_wn^(T))
Return ĥ_{Ĝ_i} = Σ_n h_{Ĝ_i}^n / |V(Ĝ_i)|
The subgraph, Ĝ_i, is scored by utilizing the scoring function, s_f^a, to rank the feasible subgraphs within the set given by G(T_{G_i}). As seen above, first, a graph message-passing network is applied over the subgraph, Ĝ_i, to compute the node-level abstract representations, h_{Ĝ_i}^n, n ∈ V(Ĝ_i). A graph-level representation of Ĝ_i is obtained by the sum-pooling mechanism, ĥ_{Ĝ_i} = Σ_n h_{Ĝ_i}^n / |V(Ĝ_i)|. The potential subgraph, Ĝ_i, is scored by computing the dot product between (h_{G_i} + W^* T_{G_i}^*) and the encoded subgraph vector, ĥ_{Ĝ_i}: s_f^a(Ĝ_i) = ĥ_{Ĝ_i} · (h_{G_i} + W^* T_{G_i}^*). The graph decoder's learnable parameters are trained to maximize the log-likelihood of predicting the correct subgraphs, Ĝ_i, of the original molecular graph, G_i, at each tree node: L_g(G) = Σ_i [s_f^a(G_i) − log Σ_{G_i′∈G(T_{G_i})} exp(s_f^a(G_i′))], where G_i′ ranges over the set of feasible potential subgraphs at tree vertex i. During training, the teacher-forcing technique is leveraged to train the graph decoder by supplying the observed (ground-truth) sequence as input at the next step.
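The candidate-ranking step and its log-likelihood objective reduce to a dot product against the fused vector followed by a log-sum-exp. A minimal sketch follows, assuming the correct candidate sits at index 0 and all tensor shapes are illustrative.

```python
import torch

def score_candidates(h_candidates, h_G, T_star, W_star):
    """Illustrative scoring s_f^a(Ĝ) = ĥ_Ĝ · (h_G + W* T*) plus the
    per-tree-node training loss; shapes and names are assumptions."""
    target = h_G + W_star @ T_star        # fused target vector
    scores = h_candidates @ target        # one score per candidate subgraph
    # log-likelihood loss with the correct candidate at index 0
    loss = -(scores[0] - torch.logsumexp(scores, dim=0))
    return scores.argmax(), loss
```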
FIG. 6 is a block diagram of an exemplary computer system 601 for implementing embodiments consistent with the present disclosure. The computer system 601 may be implemented alone or in combination with components of the system 202 (FIG. 2). Variations of computer system 601 may be used for implementing the devices included in this disclosure. Computer system 601 may comprise a central processing unit (“CPU” or “hardware processor”) 602. The hardware processor 602 may comprise at least one data processor for executing program components for executing user- or system-generated requests. The processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. The processor may include a microprocessor, such as AMD Athlon™, Duron™ or Opteron™, ARM's application, embedded or secure processors, IBM PowerPC™, Intel's Core, Itanium™, Xeon™, Celeron™ or other line of processors, etc. The processor 602 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application-specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc. The processor 602 may be a multi-core multi-threaded processor.
The processor 602 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 603. The I/O interface 603 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11 a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
Using the I/O interface 603, the computer system 601 may communicate with one or more I/O devices. For example, the input device 604 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc.
Output device 605 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc. In some embodiments, a transceiver 606 may be disposed in connection with the processor 602. The transceiver may facilitate various types of wireless transmission or reception. For example, the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.
In some embodiments, the processor 602 may be disposed in communication with a communication network 608 via a network interface 607. The network interface 607 may communicate with the communication network 608. The network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 608 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc. Using the network interface 607 and the communication network 608, the computer system 601 may communicate with devices 609 and 610. These devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like. In some embodiments, the computer system 601 may itself embody one or more of these devices.
In some embodiments, the processor 602 may be disposed in communication with one or more memory devices (e.g., RAM 613, ROM 614, etc.) via a storage interface 612. The storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.
The memory devices may store a collection of programs or database components, including, without limitation, an operating system 616, user interface application 617, user/application data 618 (e.g., any data variables or data records discussed in this disclosure), etc. The operating system 616 may facilitate resource management and operation of the computer system 601. Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like. User interface 617 may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 601, such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems’ Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.
In some embodiments, computer system 601 may store user/application data 618, such as the data, variables, records, etc. as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among various computer systems discussed above. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
Additionally, in some embodiments, the server, messaging and instructions transmitted or received may emanate from hardware, including the operating system, and program code (i.e., application code) residing in a cloud implementation. Further, it should be noted that one or more of the systems and methods provided herein may be suitable for cloud-based implementation. For example, in some embodiments, some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.
Example:
For the purpose of validation, experiments were conducted using the ZINC 250k dataset. The first objective of the experiment was to reconstruct the molecular graphs from their encoded latent vectors and then to sample molecular graphs from the learnt latent space. Reconstruction accuracy and prior validity results on the ZINC 250k molecule dataset are reported in Table 1.
| Method | % Reconstruction | % Valid Prior |
|---|---|---|
| CVAE | 44.60% | 0.70% |
| GVAE | 53.70% | 7.20% |
| SD-VAE | 76.20% | 43.50% |
| JT-VAE | 76.70% | 100% |
| MHG-VAE | 94.80% | 100% |
| Ours | 93.60% | 100% |

Table 1: Reconstruction accuracy and prior validity on the ZINC 250k dataset.
For the property optimization task, results are reported for both global and local optimization. In global optimization, the objective is to find the best molecule in the entire latent space, and Bayesian optimization is performed. In local (constrained) optimization, gradient ascent is performed over the latent space. The target chemical properties to optimize are the Penalized logP score and the QED score. The Penalized logP score evaluates the octanol-water partition coefficient while taking into consideration both the ring size and the synthetic accessibility, and the QED score quantifies the drug-likeness of an organic molecule. The results of the following evaluations are reported: (1) Bayesian property optimization with limited oracle property evaluations (refer to Table 2), (2) Bayesian property optimization with unlimited oracle property evaluations (refer to Table 3), and (3) the constrained property optimizations (refer to Table 4).
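For reference, the two target properties can be computed with RDKit as sketched below. The synthetic-accessibility and ring-size penalty terms of the full Penalized logP score are omitted here because they rely on the RDKit contrib `sascorer` module, which is assumed available in the full pipeline.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def property_scores(smiles):
    """Illustrative property computation for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    logp = Descriptors.MolLogP(mol)   # octanol-water partition coefficient
    qed = QED.qed(mol)                # drug-likeness score in [0, 1]
    return logp, qed

logp, qed = property_scores("CCO")    # e.g., ethanol
```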
Table 2: Bayesian property optimization with limited oracle property evaluations
Table 3: Bayesian property optimization with unlimited oracle property evaluations
| Method | Penalized logP (1st) | 2nd | 3rd | 50th | Top-50 Avg | Validity |
|---|---|---|---|---|---|---|
| JT-VAE | 1.69 | 1.68 | 1.60 | -9.93 | -1.33 | 100% |
| GCPN | 2.77 | 2.73 | 2.34 | 0.91 | 1.36 | 100% |
| MHG-VAE | 5.24 | 5.06 | 4.91 | 4.25 | 4.53 | 100% |
| MSO | 2.96 | 2.91 | 2.75 | 2.49 | 2.54 | 100% |
| MNCE-RL | 9.88 | 9.82 | 9.75 | 7.28 | 8.31 | 100% |
| Disclosed system | 7.03 | 6.84 | 6.63 | 6.05 | 6.37 | 100% |
For constrained optimization:
m* = D_VAE(argmax_{z_G : sim(m_G, D_VAE(z_G)) ≥ δ} f(D_VAE(z_G)))
Here, δ is a similarity threshold.
The metrics reported are the success rate (how often a modification succeeds), the average improvement in the property, and the molecular similarity, sim(m, m′), between the original and modified molecules.
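A minimal sketch of the constrained local optimization loop follows; `decode`, `prop` (a differentiable property surrogate), and `sim` are assumed callables, and the step size, step count, and δ = 0.4 are illustrative.

```python
import torch

def constrained_optimize(z_orig, decode, prop, sim,
                         delta=0.4, steps=80, lr=0.05):
    """Illustrative gradient ascent over the latent space, keeping only
    candidates whose similarity to the original molecule exceeds delta."""
    m_orig = decode(z_orig)
    z = z_orig.clone().detach().requires_grad_(True)
    best, best_score = None, float("-inf")
    for _ in range(steps):
        score = prop(z)                   # surrogate property value
        score.backward()
        with torch.no_grad():
            z += lr * z.grad              # ascend the property landscape
            z.grad.zero_()
        mol = decode(z.detach())
        if sim(mol, m_orig) >= delta and score.item() > best_score:
            best, best_score = mol, score.item()
    return best
```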
Table 4: Results on constrained property optimizations
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
Various embodiments disclosed herein provide a method and system for molecule design using a virtual graph grammars driven variational generative model. The disclosed method automates the discovery of relevant, novel, task-specific molecules with desired properties by incorporating the virtual graph grammars driven variational generative model. Said model facilitates learning expressive representations of molecular graphs augmented with virtual nodes (using a molecular graph encoder). Further, the disclosed model provides a tree representation learning framework for message-passing schemes on valid subgraphs of the molecular graph by operating on the topology of the junction tree (also referred to as the clique tree) augmented with a virtual node, and offers a complementary representation of molecular graphs. Furthermore, a tree-decoder framework associated with the model retrieves the junction tree representation of the molecular graphs from its latent vectors. The model leverages the sequence-to-sequence learning (Seq2Seq) approach to generate the latent vectors of the novel virtual hyperedge replacement graph grammar (VHRG) derived production rules associated with the virtual junction tree graph to aid in molecular graph decoding. Finally, the model includes an auto-regressive graph-decoder framework to retrieve the original molecular graph from the reconstructed junction tree, the graph-level embeddings, and the latent vectors of the production rules of the virtual junction graph, respectively. The augmented framework improves molecule validity and aids in the property optimization task in both the limited and unlimited oracle scenarios. The framework is simple yet effective, and can be leveraged for designing an auto-regressive algorithmic approach to the generation of new molecules with application-driven goals and for property optimization, to expedite advances in generating novel structured molecules.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.