Abstract: Identifying potential adverse drug reactions of drug candidates in the early stage of the drug development pipeline can improve drug safety, reduce risks for the patients and save money for the pharmaceutical companies. A system and a method for the prediction of side effects of a compound, especially a drug compound, are provided. The disclosure provides a comprehensive, unbiased and cost-effective method of prediction of side effects for compounds whose gene expression can be measured. The disclosure further provides in-silico side effect prediction during the drug development process. The system predicts side effects using an ensemble of a matrix of models trained to predict adverse events at different levels of a vocabulary hierarchy.
Claims:
1. A processor implemented method (400) for prediction of side effects of a compound, the method comprising:
generating, via one or more hardware processors, a first dataset of gene expression profiles for a plurality of compounds at a plurality of time points from a gene expression dataset, wherein the first dataset comprises a name of the plurality of compounds, corresponding simplified molecular-input line-entry (SMILE) annotations of the plurality of compounds and gene expression information of the plurality of compounds (402);
generating, via the one or more hardware processors, a second dataset of a chemical information data using one of the SMILE annotations or curated from a chemical database (404);
generating, via the one or more hardware processors, a third dataset of a plurality of side effects of the plurality of compounds using a side effects database, wherein the third dataset comprises the name of the plurality of compounds, the SMILE annotations and the corresponding side effects of the plurality of compounds (406);
generating, via the one or more hardware processors, a master dataset in a tabular form by linking the first dataset, the second dataset and the third dataset using the SMILE annotations, wherein the master dataset comprises mapping of the side effects of the plurality of compounds with the gene expressions information, wherein the side effects are represented either as present or absent corresponding to each compound (408);
splitting, via the one or more hardware processors, the master dataset vertically, a first predefined number of times (m) based on a plurality of levels of hierarchy of a set of terms for side effects, splitting results in the generation of a first predefined number of input sheets (410);
splitting, via the one or more hardware processors, each of the first predefined number of input sheets horizontally in a second predefined number of times (n) based on a set of factors depending on requirements (412);
training, via the one or more hardware processors, a plurality of models based on the horizontal split (HS(n)) for each vertical split (VS(m)) (414);
providing, via the one or more hardware processors, the compound for the prediction of side effects (416);
calculating, via the one or more hardware processors, a probability score corresponding to each trained model of the plurality of models for each side effect present in the side effect database (418);
calculating, via the one or more hardware processors, a final score for each side effect based on each of the calculated probability score, wherein the final score is calculated using a weightage given to the first predefined number of vertical splits at the plurality of levels of hierarchy (420); and
selecting, via the one or more hardware processors, top side effects from among the side effects present in the side effects database as the side effects of the compound based on the final score, wherein the number of top side effects is decided by a user (422).
2. The method of claim 1, wherein the sheets at each level are split based on a plurality of factors, wherein the plurality of factors is decided based on the requirements.
3. The method of claim 1, wherein the side effects database is configured to provide the name of the compounds, their respective chemical ID, the SMILE structure and the corresponding side effects.
4. The method of claim 1, wherein the master dataset after mapping contains the drug name, the chemical properties, the gene expression and the side effects in a binary format.
5. The method of claim 1, wherein the predefined number of times for vertical split is decided based on ontology levels and the predefined number of times for horizontal split is decided by the user based on the requirements.
6. A system (100) for prediction of side effects of a compound, the system comprises:
an input/output interface (104);
one or more hardware processors (108); and
a memory (110) in communication with the one or more hardware processors, wherein the one or more hardware processors are configured to execute programmed instructions stored in the memory, to:
generate a first dataset of gene expression profiles for a plurality of compounds at a plurality of time points from a gene expression dataset, wherein the first dataset comprises a name of the plurality of compounds, corresponding simplified molecular-input line-entry (SMILE) annotations of the plurality of compounds and gene expression information of the plurality of compounds;
generate a second dataset of a chemical information data using one of the SMILE annotations or curated from a chemical database;
generate a third dataset of a plurality of side effects of the plurality of compounds using a side effects database, wherein the third dataset comprises the name of the plurality of compounds, the SMILE annotations and the corresponding side effects of the plurality of compounds;
generate a master dataset in a tabular form by linking the first dataset, the second dataset and the third dataset using the SMILE annotations, wherein the master dataset comprises mapping of the side effects of the plurality of compounds with the gene expressions information, wherein the side effects are represented either as present or absent corresponding to each compound;
split the master dataset vertically, a first predefined number of times (m) based on a plurality of levels of hierarchy of a set of terms for side effects, splitting results in the generation of the first predefined number of input sheets;
split each of the first predefined number of input sheets horizontally in a second predefined number of times (n) based on a set of factors depending on requirements;
train a plurality of models based on the horizontal split (HS(n)) for each vertical split (VS(m));
provide the compound for the prediction of side effects;
calculate a probability score corresponding to each trained model of the plurality of models for each side effect present in the side effect database;
calculate a final score for each side effect based on each of the calculated probability score, wherein the final score is calculated using a weightage given to the first predefined number of vertical splits at the plurality of levels of hierarchy; and
select top side effects out of the side effects present in the side effects database as the side effects of the compound based on the final score, wherein the number of top side effects is decided by a user.
7. The system of claim 6, wherein the sheets at each level are split based on a plurality of factors, wherein the plurality of factors is decided based on the requirements.
8. The system of claim 6, wherein the side effects database is configured to provide the name of the compounds, the respective chemical ID, the SMILE structure and the corresponding side effects.
9. The system of claim 6, wherein the predefined number of times for vertical split is decided based on ontology levels and the predefined number of times for horizontal split is decided by a user based on the requirements.
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR PREDICTION OF SIDE EFFECTS OF A COMPOUND
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to the field of analyzing drug safety and, more particularly, to a method and system for prediction of side effects of a compound such as a drug compound.
BACKGROUND
The World Health Organization (WHO) defines an adverse drug reaction (ADR) as an unintended and harmful reaction suspected to be caused by a drug taken under normal conditions. It has been recognized that ADRs represent a significant public health problem all over the world. Identifying potential ADRs of drug candidates in the early stage of the drug development pipeline can improve drug safety, reduce risks for the patients and save money for the pharmaceutical companies. Most of the serious side effects are identified during preclinical and clinical trials, while some of them are reported during the post-approval monitoring. The uncertainty about the potential side effects of new drugs is a concern for not only pharmaceutical companies but also patients.
The information available in the early stages of drug development is mainly the chemical structure of the drug. Many existing studies on ADR prediction have been devoted to analyzing the chemical properties of drug molecules. Though the mechanisms of ADRs are complicated and may not be well understood, machine learning techniques or computational techniques are promising solutions to understand and analyze such complicated problems.
Computational methods hold great promise for mitigating the health and financial risks of drug development by predicting possible side effects before entering the clinical trials. Several learning-based methods have been proposed for predicting the side effects of drugs.
Recently, deep learning models have been employed to predict side effects. One of the methods uses biological, chemical and semantic information on drugs in addition to clinical notes and case reports. Another method uses various chemical fingerprints extracted using deep architectures to compare the side effect prediction performance. While these methods have proven useful for predicting adverse drug reactions, the features they use are solely based on external knowledge about the drugs (i.e., drug-protein interactions, etc.) and are not cell or condition (i.e., dosage) specific.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a system for prediction of side effects of a compound is provided. The system comprises an input/output interface, one or more hardware processors and a memory in communication with the one or more hardware processors. The one or more hardware processors are configured to execute programmed instructions stored in the memory, to: generate a first dataset of gene expression profiles for a plurality of compounds at a plurality of time points from a gene expression dataset, wherein the first dataset comprises a name of the plurality of compounds, corresponding simplified molecular-input line-entry (SMILE) annotations of the plurality of compounds and gene expression information of the plurality of compounds; generate a second dataset of a chemical information data using one of the SMILE annotations or curated from a chemical database; generate a third dataset of a plurality of side effects of the plurality of compounds using a side effects database, wherein the third dataset comprises the name of the plurality of compounds, the SMILE annotations and the corresponding side effects of the plurality of compounds; generate a master dataset in a tabular form by linking the first dataset, the second dataset and the third dataset using the SMILE annotations, wherein the master dataset comprises mapping of the side effects of the plurality of compounds with the gene expressions information, wherein the side effects are represented either as present or absent corresponding to each compound; split the master dataset vertically, a first predefined number of times (m) based on a plurality of levels of hierarchy of a set of terms for side effects, wherein the splitting results in the generation of the first predefined number of input sheets;
split each of the first predefined number of input sheets horizontally in a second predefined number of times (n) based on a set of factors depending on requirements; train a plurality of models based on the horizontal split (HS(n)) for each vertical split (VS(m)); provide the compound for the prediction of side effects; calculate a probability score corresponding to each trained model of the plurality of models for each side effect present in the side effect database; calculate a final score for each side effect based on each of the calculated probability score, wherein the final score is calculated using a weightage given to the first predefined number of vertical splits at the plurality of levels of hierarchy; and select top side effects out of the side effects present in the side effects database as the side effects of the compound based on the final score, wherein the number of top side effects is decided by a user.
In another aspect, a method for prediction of side effects of a compound is provided. Initially, a first dataset of gene expression profiles is generated for a plurality of compounds at a plurality of time points from a gene expression dataset, wherein the first dataset comprises a name of the plurality of compounds, corresponding simplified molecular-input line-entry (SMILE) annotations of the plurality of compounds and gene expression information of the plurality of compounds. In the next step, a second dataset of a chemical information data is generated using one of the SMILE annotations or curated from a chemical database. Further, a third dataset of a plurality of side effects of the plurality of compounds is generated using a side effects database, wherein the third dataset comprises the name of the plurality of compounds, the SMILE annotations and the corresponding side effects of the plurality of compounds. Further, a master dataset is generated in a tabular form by linking the first dataset, the second dataset and the third dataset using the SMILE annotations, wherein the master dataset comprises mapping of the side effects of the plurality of compounds with the gene expressions information, wherein the side effects are represented either as present or absent corresponding to each compound. Further, the master dataset is split vertically, a first predefined number of times (m) based on a plurality of levels of hierarchy of a set of terms for side effects, wherein the splitting results in the generation of a first predefined number of input sheets. In the next step, each of the first predefined number of input sheets is split horizontally a second predefined number of times (n) based on a set of factors depending on requirements. Further, a plurality of models is trained based on the horizontal split (HS(n)) for each vertical split (VS(m)). Later, the compound is provided for the prediction of side effects.
Further, a probability score is calculated corresponding to each trained model of the plurality of models for each side effect present in the side effect database. In the next step, a final score is calculated for each side effect based on each of the calculated probability scores, wherein the final score is calculated using a weightage given to the first predefined number of vertical splits at the plurality of levels of hierarchy. And finally, top side effects are selected from among the side effects present in the side effects database as the side effects of the compound based on the final score, wherein the number of top side effects is decided by a user.
In yet another aspect, one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause prediction of side effects of a compound are provided. Initially, a first dataset of gene expression profiles is generated for a plurality of compounds at a plurality of time points from a gene expression dataset, wherein the first dataset comprises a name of the plurality of compounds, corresponding simplified molecular-input line-entry (SMILE) annotations of the plurality of compounds and gene expression information of the plurality of compounds. In the next step, a second dataset of a chemical information data is generated using one of the SMILE annotations or curated from a chemical database. Further, a third dataset of a plurality of side effects of the plurality of compounds is generated using a side effects database, wherein the third dataset comprises the name of the plurality of compounds, the SMILE annotations and the corresponding side effects of the plurality of compounds. Further, a master dataset is generated in a tabular form by linking the first dataset, the second dataset and the third dataset using the SMILE annotations, wherein the master dataset comprises mapping of the side effects of the plurality of compounds with the gene expressions information, wherein the side effects are represented either as present or absent corresponding to each compound. Further, the master dataset is split vertically, a first predefined number of times (m) based on a plurality of levels of hierarchy of a set of terms for side effects, wherein the splitting results in the generation of a first predefined number of input sheets. In the next step, each of the first predefined number of input sheets is split horizontally a second predefined number of times (n) based on a set of factors depending on requirements.
Further, a plurality of models is trained based on the horizontal split (HS(n)) for each vertical split (VS(m)). Later, the compound is provided for the prediction of side effects. Further, a probability score is calculated corresponding to each trained model of the plurality of models for each side effect present in the side effect database. In the next step, a final score is calculated for each side effect based on each of the calculated probability scores, wherein the final score is calculated using a weightage given to the first predefined number of vertical splits at the plurality of levels of hierarchy. And finally, top side effects are selected from among the side effects present in the side effects database as the side effects of the compound based on the final score, wherein the number of top side effects is decided by a user.
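By way of a non-limiting illustration, the final scoring and selection steps summarized above may be sketched as follows. The disclosure states only that a weightage is given to the vertical splits; the weighted linear combination used here, the function names final_scores and top_k, the term names and all numeric values are assumptions made purely for illustration.

```python
# Illustrative sketch only: the weighted sum below is an ASSUMED way of
# applying "a weightage given to the vertical splits"; it is not asserted
# to be the exact formula of the disclosure. All names are hypothetical.

def final_scores(probs, parents, weights):
    """Combine per-level probabilities into one score per PT-level side effect.

    probs   : {"PT": {term: p}, "HLGT": {term: p}, "SOC": {term: p}}
    parents : {pt_term: (hlgt_term, soc_term)}
    weights : {"PT": w1, "HLGT": w2, "SOC": w3}
    """
    scores = {}
    for pt, p_pt in probs["PT"].items():
        hlgt, soc = parents[pt]
        scores[pt] = (weights["PT"] * p_pt
                      + weights["HLGT"] * probs["HLGT"][hlgt]
                      + weights["SOC"] * probs["SOC"][soc])
    return scores


def top_k(scores, k):
    """Return the k side effects with the highest final score (k is user-chosen)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical probabilities produced by the trained models for one compound.
probs = {
    "PT": {"pt_a": 0.9, "pt_b": 0.2, "pt_c": 0.6},
    "HLGT": {"hlgt_x": 0.8, "hlgt_y": 0.3},
    "SOC": {"soc_1": 0.7},
}
parents = {
    "pt_a": ("hlgt_x", "soc_1"),
    "pt_b": ("hlgt_x", "soc_1"),
    "pt_c": ("hlgt_y", "soc_1"),
}
weights = {"PT": 0.5, "HLGT": 0.3, "SOC": 0.2}

scores = final_scores(probs, parents, weights)
selected = top_k(scores, 2)
```

In this sketch a side effect predicted with moderate probability at the PT level can still rank highly when its parent groups are predicted with high probability, which is one plausible reading of how the hierarchy weightage may counter label sparsity.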
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates a network diagram of a system for prediction of side effects of a compound according to some embodiments of the present disclosure.
FIG. 2 illustrates a schematic representation of horizontal split for prediction of side effects of the compound according to some embodiments of the present disclosure.
FIG. 3 shows a flowchart illustrating steps involved in prediction of side effects of the compound according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Prediction of adverse drug reactions (ADRs) is an important field of research worldwide. It has been recognized that ADRs represent a significant public health problem all over the world. Identifying potential ADRs of drug candidates in the early stage of the drug development pipeline can improve drug safety, reduce risks for the patients and save money for the pharmaceutical companies. Various computational and machine learning methods have been used for the prediction of side effects.
The major hurdles in developing computational models are scarcity of data and sparsity of the available data. The adverse events being predicted do not all occur at the same frequency, and there is a huge bias (some occur quite often, and some may occur only once or twice in the whole dataset). Handling such biased data has always been a problem in prediction modelling. Existing methodologies of multi-label classification were tried, and none were able to reduce this bias of prediction.
The disclosure provides a system and a method for the prediction of side effects of a compound, especially a drug compound. The disclosure provides a comprehensive, unbiased and cost-effective method of prediction of side effects for compounds whose gene expression can be measured. The disclosure further provides in-silico side effect prediction during the drug development process. The disclosure provides a concrete system for prediction of side effects using an ensemble of a matrix of models trained to predict adverse events at different levels of a vocabulary hierarchy.
Referring now to the drawings, and more particularly to FIG. 1 through FIG. 3, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
According to an embodiment of the disclosure, a system 100 for prediction of side effects of a compound is shown in the network diagram of FIG. 1. The system 100 is configured to predict clinically relevant side effects without testing on subjects; it may also be used to study whether an approved drug may show a side effect later. The system 100 is configured to build a model for robust prediction of side effects of a compound based on its chemical properties and the transcriptomic perturbation it causes.
Although the present disclosure is explained considering that the system 100 is implemented on a server, it may also be present elsewhere such as a local machine or an edge or cloud. It may be understood that the system 100 comprises one or more computing devices 102, such as a laptop computer, a desktop computer, a notebook, a workstation, a cloud-based computing environment and the like. It will be understood that the system 100 may be accessed through one or more input/output interfaces 104, collectively referred to as I/O interface 104. Examples of the I/O interface 104 may include, but are not limited to, a user interface, a portable computer, a personal digital assistant, a handheld device, a smartphone, a tablet computer, a workstation and the like. The I/O interface 104 is communicatively coupled to the system 100 through a network 106.
In an embodiment, the network 106 may be a wireless or a wired network, or a combination thereof. In an example, the network 106 can be implemented as a computer network, as one of the different types of networks, such as virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 106 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices. The network devices within the network 106 may interact with the system 100 through communication links.
The system 100 may be implemented in a workstation, a mainframe computer, a server, or a network server. In an embodiment, the computing device 102 further comprises one or more hardware processors 108, one or more memories 110, hereinafter referred to as a memory 110, and a data repository 112, for example, a repository 112. The memory 110 is in communication with the one or more hardware processors 108, wherein the one or more hardware processors 108 are configured to execute programmed instructions stored in the memory 110, to perform various functions as explained in the later part of the disclosure. The repository 112 may store data processed, received, and generated by the system 100.
The system 100 supports various connectivity options such as BLUETOOTH®, USB, ZigBee and other cellular services. The network environment enables connection of various components of the system 100 using any communication link including Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 100 is implemented to operate as a stand-alone device. In another embodiment, the system 100 may be implemented to work as a loosely coupled device to a smart computing environment. The components and functionalities of the system 100 are described further in detail.
According to an embodiment of the disclosure, the system 100 uses a gene expression database 114, a chemical information data and a side effects database 116. In an example, the LINCS L1000 has been used as the gene expression database 114. The LINCS L1000 project has collected gene expression profiles for thousands of perturbagens at a variety of time points, doses, and cell lines.
A landmark gene is one whose gene expression has been determined as being informative to characterize the transcriptome and which is measured directly in the L1000 assay. The "L" in L1000 denotes that the assay targets a set of landmark genes. Landmark genes were selected as those widely expressed across lineage and were found to have good predictive power for inferring the expression of other genes that are not directly measured in the assay (already done in LINCS).
In an example, the measurements of 978 "landmark genes" are applied to an inference algorithm to infer the expression of 11,350 additional genes in the transcriptome. Through simulation, it was observed that this reduced representation of the transcriptome is able to recapitulate approximately 80% of the relationships found when measuring all transcripts directly. The expression values of the 978 landmark genes were obtained in CSV format.
In an embodiment of the disclosure, the chemical information data is generated as a CSV file using a software tool called PaDEL, while the side effects database 116 is SIDER, downloaded as a TSV file.
FIG. 3 illustrates an example flow chart of a method 300 for prediction of side effects of the compound, in accordance with an example embodiment of the present disclosure. The method 300 depicted in the flow chart may be executed by a system, for example, the system 100 of FIG. 1. In an example embodiment, the system 100 may be embodied in the computing device.
Operations of the flowchart, and combinations of operations in the flowchart, may be implemented by various means, such as hardware, firmware, processor, circuitry and/or other device associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described in various embodiments may be embodied by computer program instructions. In an example embodiment, the computer program instructions, which embody the procedures, described in various embodiments may be stored by at least one memory device of a system and executed by at least one processor in the system. Any such computer program instructions may be loaded onto a computer or other programmable system (for example, hardware) to produce a machine, such that the resulting computer or other programmable system embody means for implementing the operations specified in the flowchart. It will be noted herein that the operations of the method 300 are described with help of system 100. However, the operations of the method 300 can be described and/or practiced by using any other system.
Initially at step 302 of the method 300, a first dataset of gene expression profiles is generated for a plurality of compounds at a plurality of time points from a gene expression dataset. The first dataset comprises a name of the plurality of compounds, corresponding simplified molecular-input line-entry (SMILE) annotations of the plurality of compounds and gene expression information of the plurality of compounds. As mentioned above, the LINCS L1000 dataset contains transcriptomic data for chemical perturbagens. In an example, the gene expression values for all FDA approved drugs (available in GEO) are fetched under the category ‘cmpds’ for the ‘978 landmark genes’. All of the data were level 5 data, with expression values ranging from -10 to +10. The metadata contains the SMILE structure of each compound used. All the data was available as a CSV file, though the data can also be fetched in any other format.
At step 304 of the method 300, a second dataset of a chemical information data is generated using one of the SMILE annotations or curated from the chemical database. This is done to obtain the chemical properties of the compound, such as its molecular weight, boiling point and melting point. In the present example, a molecular descriptor tool, an online open source software called ‘PaDEL’, was used to generate a total of 1832 physicochemical properties of the compounds. It should be appreciated that multiple sources and types of information could be used.
At step 306 of the method 300, a third dataset of a plurality of side effects of the plurality of compounds is generated using a side effects database. The third dataset comprises the name of the plurality of compounds, the SMILE annotations and the corresponding side effects of the plurality of compounds. The side effects database was used for the training purpose. In an example, the TSV file of SIDER database was downloaded which contains reported side effects of FDA approved drugs. It has a column mentioning the CID (chemical ID) of each drug. Using the CID, the SMILE structure of each compound in SIDER is obtained from PubChem database.
At step 308 of the method 300, a master dataset is generated in a tabular form by linking the first dataset, the second dataset and the third dataset using the SMILE annotations. It should be appreciated that the linking can be done using any column or an ID. The master dataset comprises mapping of the side effects of the plurality of compounds with the gene expressions information, wherein the side effects are represented either as present or absent corresponding to each compound. The side effects are converted into columns with binary labels (1 for present, 0 for absent). Only those side effects were chosen for which there is at least one occurrence in at least one drug. It is to be noted that only the compounds common to SIDER and LINCS (585 in total) are taken. This becomes the master dataset.
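As a minimal, non-limiting sketch of step 308, the linking and binary encoding may be expressed with the pandas library as shown below. The toy tables, the column names (name, smiles, gene_1, mol_weight, side_effect) and the helper build_master are hypothetical and are chosen only for illustration; the actual datasets contain 978 gene expression columns and 1832 chemical descriptors.

```python
import pandas as pd

# Hypothetical toy inputs; real data would come from LINCS L1000, PaDEL and SIDER.
expr = pd.DataFrame({
    "name": ["drugA", "drugB", "drugC"],
    "smiles": ["CCO", "CCN", "CCC"],
    "gene_1": [0.5, -1.2, 3.0],
    "gene_2": [-0.3, 2.1, 0.0],
})
chem = pd.DataFrame({
    "smiles": ["CCO", "CCN", "CCC"],
    "mol_weight": [46.07, 45.08, 44.10],
})
# Side effects in long form: one row per (compound, side effect) pair.
side = pd.DataFrame({
    "smiles": ["CCO", "CCO", "CCN", "CCCC"],
    "side_effect": ["nausea", "headache", "nausea", "rash"],
})


def build_master(expr, chem, side):
    """Link the three datasets on SMILES and encode side effects as 0/1 columns."""
    # Inner join keeps only compounds present in both feature tables.
    features = expr.merge(chem, on="smiles", how="inner")
    # Restrict side effects to compounds that have features, so every
    # remaining side effect column has at least one occurrence.
    side = side[side["smiles"].isin(features["smiles"])]
    # One column per side effect: 1 = present, 0 = absent.
    labels = pd.crosstab(side["smiles"], side["side_effect"]).clip(upper=1)
    labels = labels.reset_index()
    # Inner join again keeps only compounds common to all sources.
    return features.merge(labels, on="smiles", how="inner")


master = build_master(expr, chem, side)
```

Here the compound with no recorded side effects and the side effect reported only for an unprofiled compound both drop out, mirroring the restriction to common compounds and to side effects with at least one occurrence.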
At step 310 of the method 300, the master dataset is split vertically, a first predefined number of times (m) based on a plurality of levels of hierarchy of a set of terms for side effects. The splitting results in the generation of the first predefined number of input sheets. The first splitting mechanism is based on the hierarchy of terms for side effects. The side effects in SIDER are all from the MedDRA ontology (preferred terms, PT). This vocabulary arranges its terms in hierarchical levels, three of which are used in the present example; hence three splits were made as follows.
PT - Preferred Term (level 1) (1056 PTs)
HLGT - High Level Group Term (level 2) (198 HLGTs)
SOC - System Organ Class (level 3) (27 SOCs)
So, three different input sheets were created. The master dataset mentioned above is the one at PT level. The PT level side effects were grouped, based on MedDRA HLGT terms, into HLGT level side effects. For example, HLGT cardiac arrhythmias has cardiac asthma, pulmonary edema, etc., which are all PTs. So, in the first sheet (PT level) there will be a ‘1’ in the respective column if a PT is present or a ‘0’ if it is not present. But in sheet 2 (HLGT level), there will be a ‘1’ in column ‘cardiac arrhythmias’ if any of these PTs is present, hence grouped.
Similarly, this was done for the SOC level (level 3), giving a total of three master datasets having:
Drug name, gene expression, PT
Drug name, gene expression, HLGT
Drug name, gene expression, SOC
Drug name and gene expression will be common in all three sheets. This is how vertical splitting is done based on vocabulary ontology. The number of levels is chosen by the user; there can be anywhere from 1 to N levels.
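The grouping of PT level labels into a higher level can be sketched as a logical OR over the PT columns belonging to each group. This is a minimal illustration; the term names (`pt_a`, `hlgt_x`, etc.) and the mapping are hypothetical placeholders, and a real mapping would have to come from the MedDRA hierarchy.

```python
def roll_up(pt_terms, pt_labels, pt_to_group):
    """Collapse PT-level binary labels into group-level labels.

    A group column is 1 for a compound if any PT in that group is 1.
    """
    groups = sorted(set(pt_to_group[t] for t in pt_terms))
    index = {g: i for i, g in enumerate(groups)}
    rolled = []
    for row in pt_labels:
        out = [0] * len(groups)
        for term, value in zip(pt_terms, row):
            if value == 1:
                out[index[pt_to_group[term]]] = 1
        rolled.append(out)
    return groups, rolled


# Hypothetical PT-to-HLGT mapping and labels for three compounds.
pt_terms = ["pt_a", "pt_b", "pt_c"]
pt_to_hlgt = {"pt_a": "hlgt_x", "pt_b": "hlgt_x", "pt_c": "hlgt_y"}
pt_labels = [[1, 0, 0], [0, 0, 1], [0, 0, 0]]

hlgt_terms, hlgt_labels = roll_up(pt_terms, pt_labels, pt_to_hlgt)
```

Applying the same function again with an HLGT-to-SOC mapping would give the SOC level sheet.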
In the next step 312 of the method 300, each of the first predefined number of input sheets is split horizontally a second predefined number of times (n) based on a set of factors depending on requirements, as shown in the schematic diagram of FIG. 2. In the present example, vertical splitting yields 3 sheets, one for each level, each having all the side effects for that level (e.g., 1056 side effect columns for the PT level). Here, for each level, the sheet with all the side effects is split based on certain factors, e.g., the 1056 side effects are split into groups. The factors can be many depending on the use case, such as frequency of the side effect, severity, relatedness, etc. The user is free to choose the number of splits and the factors for splitting. In the present example, splitting is done based on frequency, so for the PT level, three horizontal splits were made based on the frequency of side effects in the total drug set:
Very frequent (180 PT)
Least frequent (620 PT)
Intermediate (256 PT)
Similarly, there are 2 horizontal splits for the HLGT level and a single ‘split’ (i.e., no split) for the SOC level. The drug name and gene expression would be the same in all sheets, with a varying number of side effects based on the split criteria. This results in the generation of the following sheets.
a) Vertical split1: PT level
Horizontal split1: highly frequent
Horizontal split2: least frequent
Horizontal split3: intermediate
b) Vertical split2: HLGT level
Horizontal split1: highly frequent
Horizontal split2: less frequent
c) Vertical split3: SOC level
Horizontal split1: all SOC level side effects
Further at step 314 of the method 300, a plurality of models is trained based on the horizontal split (HS (n)) for each vertical split (VS (m)). In the present example, the 6 sheets listed above each have drug names, gene expression and the respective side effects based on the split. Any machine learning or deep learning model can be used for training on these. In the present example, a deep neural network has been used. So, six different models were built and optimized, one for each sheet. Each model is a multi-label classifier which uses the gene expression and the chemical properties to predict side effects.
Once the training and validation is done for each model, all the side effects are combined. The output would be an individual probability for each side effect in range 0-1 for the respective side effects being predicted. For the horizontal splits, the probabilities of all the side effects for a level is given by just concatenating all the side effects from all the horizontal levels for that vertical level. For example,
Vertical split1: PT level
Horizontal split1: gives probability of 180 side effects trained for
Horizontal split2: gives probability of 620 side effects trained for
Horizontal split3: gives probability of 256 side effects trained for
Hence total 1056 PT predictions for PT level vertical splitting. Similarly, 198 are there for HLGT level vertical split and 27 for SOC level vertical split. This results in generation of a total of 3 prediction sheets, one for each vertical level having corresponding number of side effects predicted for that level (concatenating all horizontal splits for a level).
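The concatenation of the horizontal-split outputs into one prediction sheet per vertical level can be sketched as below. This is a minimal illustration, assuming each model returns a dictionary of term to probability; the function name `concatenate_level`, the term names and the probabilities are hypothetical.

```python
def concatenate_level(split_predictions):
    """Merge {term: probability} dictionaries from the horizontal splits of one level."""
    merged = {}
    for probs in split_predictions.values():
        overlap = merged.keys() & probs.keys()
        if overlap:
            # Each side effect should be predicted by exactly one horizontal split.
            raise ValueError(f"terms predicted by more than one split: {sorted(overlap)}")
        merged.update(probs)
    return merged


# Hypothetical outputs of three horizontal-split models at one vertical level.
pt_level = concatenate_level({
    "h1": {"pt_a": 0.91},
    "h2": {"pt_b": 0.40, "pt_c": 0.12},
    "h3": {"pt_d": 0.03},
})
```

The overlap check is a design choice of this sketch: since the horizontal splits partition the side effects, a term appearing in two splits would indicate a data preparation error.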
In the next step 316 of the method 300, the compound for the prediction of side effects is provided. The compound is the sample drug compound. At step 318, a probability score is calculated corresponding to each trained model of the plurality of models for each side effect present in the side effect database.
Further at step 320, a final score is calculated for each side effect based on each of the calculated probability score, wherein the final score is calculated using a weightage given to the first predefined number of vertical splits at the plurality of levels of hierarchy.
Each level has one sheet for the prediction of a certain number of side effects for that level. These are collapsed into one score for each PT level term (PT level term because the side effects were reported at PT level in SIDER). For example, in the present example the PT pulmonary edema belongs to the HLGT cardiac arrhythmias, which is a subclass of the SOC cardiac disorders according to the MedDRA vocabulary.
A single probability score was needed for the term ‘pulmonary edema’, combined from all three levels. For this, a weightage was calculated for each vertical level. The weightage was calculated based on the accuracy at each vertical level (the average accuracy over all test drugs used for prediction). For accuracy, the predicted label was compared with the one present in SIDER, i.e., the presence of the predicted label in SIDER at that level. Based on the respective weightages, a single probability is calculated, which is the final probability of that side effect. The method of ensemble scoring is left to the user and can be completely different.
And finally, at step 322 of the method 300, top side effects are selected out of the side effects present in the side effects database as the side effects of the compound based on the final score, wherein the number of top side effects is decided by a user. Though only the top side effects are collected in this step, it should be appreciated that the system 100 is configured to determine the probability of all the side effects.
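The selection of the top side effects can be sketched with the standard library. This is a minimal illustration; the function name `top_side_effects`, the term names and the scores are hypothetical.

```python
import heapq


def top_side_effects(final_scores, k):
    """Return the k side effects with the highest final score, best first."""
    return heapq.nlargest(k, final_scores.items(), key=lambda item: item[1])


# Hypothetical final scores for four side effects; k is chosen by the user.
final_scores = {"pt_a": 0.82, "pt_b": 0.35, "pt_c": 0.67, "pt_d": 0.05}
top2 = top_side_effects(final_scores, 2)
```

Because the full dictionary of scores is kept, the probability of every side effect remains available alongside the selected ones.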
According to an embodiment of the disclosure, the system 100 can also be explained with the help of the following technical implementation. The example of side effects from the SIDER database is taken into consideration. SIDER uses the MedDRA ontology for the side effects. Further, three levels of hierarchy (SOC, HLGT and PT) have been used, and SIDER reports side effects at the PT level. After merging the required gene expression (GEx) with the side effects (SE) from SIDER, a matrix, GEx-SE, is obtained as shown in TABLE 1.
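| G1 | G2 | ... | G(g) | PT1 | PT2 | ... | PT(m) |
|---|---|---|---|---|---|---|---|
| 0.23 | 3.34 | | | 1 | 1 | | |
| -1.2 | 0.2 | | | 0 | 0 | | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| N | N | ... | N | N | N | ... | N |
| X (columns G1 to G(g)) | | | | y1 | y2 | ... | ym |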
where X represents the feature matrix (N x g) which essentially contains g gene expression values (G1,G2,..,G(g)) for N perturbagens. y1, y2,.., ym are m side effect binary labels for corresponding perturbagens. This can be written as shown in equation (1):
$$
\begin{matrix}
X_1 \rightarrow \{ y_{11}^{l}, y_{12}^{l}, y_{13}^{l}, \ldots, y_{1m}^{l} \} \\
X_2 \rightarrow \{ y_{21}^{l}, y_{22}^{l}, y_{23}^{l}, \ldots, y_{2m}^{l} \} \\
X_3 \rightarrow \{ y_{31}^{l}, y_{32}^{l}, y_{33}^{l}, \ldots, y_{3m}^{l} \} \\
\vdots \\
X_N \rightarrow \{ y_{N1}^{l}, y_{N2}^{l}, y_{N3}^{l}, \ldots, y_{Nm}^{l} \}
\end{matrix} \tag{1}
$$
$$
y = \begin{cases} 1, & \text{if present} \\ 0, & \text{otherwise} \end{cases}
$$
where l denotes level 1 (PT here).
Horizontal splitting:
Based on the frequency of occurrence of each SE in a given dataset, the m side effects are split into three groups having a, b and c side effects respectively, such that (a + b + c) = m. This is termed as horizontal splitting (HS) of the dataset at level 1 (PT). This splitting can also be done based on other parameters; frequency has been used as the parameter here. The frequency of each side effect is given by equation (2).
$$
f_i = \frac{\sum_{1}^{N} y_i}{N} \tag{2}
$$
The split is done based on the following criteria:
h1, f > 0.7 (a number of SE)
h2, 0.3 ≤ f ≤ 0.7 (b number of SE)
h3, f < 0.3 (c number of SE)
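The frequency computation of equation (2) and the three-way split can be sketched as follows. This is a minimal illustration; the function names and toy labels are hypothetical, and assigning the boundary values f = 0.3 and f = 0.7 to h2 is an assumption made so that every side effect lands in exactly one group.

```python
def side_effect_frequencies(labels):
    """labels: N rows of m binary values; returns the fraction of rows with a 1 per column."""
    n = len(labels)
    return [sum(column) / n for column in zip(*labels)]


def horizontal_split(terms, labels, high=0.7, low=0.3):
    """Assign each side effect to h1 (frequent), h2 (intermediate) or h3 (rare)."""
    groups = {"h1": [], "h2": [], "h3": []}
    for term, f in zip(terms, side_effect_frequencies(labels)):
        if f > high:
            groups["h1"].append(term)
        elif f < low:
            groups["h3"].append(term)
        else:
            groups["h2"].append(term)  # boundary values fall here (assumption)
    return groups


# Hypothetical labels: 4 compounds (rows) by 3 side effects (columns).
terms = ["pt_a", "pt_b", "pt_c"]
labels = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
groups = horizontal_split(terms, labels)
```

Here the three frequencies are 1.0, 0.5 and 0.25, so each toy side effect falls into a different group and a + b + c equals m.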
For the given level (PT here), equation (1) can be written as shown in equations (3a), (3b) and (3c):
for h1:
$$
\begin{matrix}
X_1 \rightarrow \{ y_{11}^{l}, y_{12}^{l}, y_{13}^{l}, \ldots, y_{1a}^{l} \} \\
X_2 \rightarrow \{ y_{21}^{l}, y_{22}^{l}, y_{23}^{l}, \ldots, y_{2a}^{l} \} \\
X_3 \rightarrow \{ y_{31}^{l}, y_{32}^{l}, y_{33}^{l}, \ldots, y_{3a}^{l} \} \\
\vdots \\
X_N \rightarrow \{ y_{N1}^{l}, y_{N2}^{l}, y_{N3}^{l}, \ldots, y_{Na}^{l} \}
\end{matrix} \tag{3a}
$$
for h2:
$$
\begin{matrix}
X_1 \rightarrow \{ y_{11}^{l}, y_{12}^{l}, y_{13}^{l}, \ldots, y_{1b}^{l} \} \\
X_2 \rightarrow \{ y_{21}^{l}, y_{22}^{l}, y_{23}^{l}, \ldots, y_{2b}^{l} \} \\
X_3 \rightarrow \{ y_{31}^{l}, y_{32}^{l}, y_{33}^{l}, \ldots, y_{3b}^{l} \} \\
\vdots \\
X_N \rightarrow \{ y_{N1}^{l}, y_{N2}^{l}, y_{N3}^{l}, \ldots, y_{Nb}^{l} \}
\end{matrix} \tag{3b}
$$
for h3:
$$
\begin{matrix}
X_1 \rightarrow \{ y_{11}^{l}, y_{12}^{l}, y_{13}^{l}, \ldots, y_{1c}^{l} \} \\
X_2 \rightarrow \{ y_{21}^{l}, y_{22}^{l}, y_{23}^{l}, \ldots, y_{2c}^{l} \} \\
X_3 \rightarrow \{ y_{31}^{l}, y_{32}^{l}, y_{33}^{l}, \ldots, y_{3c}^{l} \} \\
\vdots \\
X_N \rightarrow \{ y_{N1}^{l}, y_{N2}^{l}, y_{N3}^{l}, \ldots, y_{Nc}^{l} \}
\end{matrix} \tag{3c}
$$
where a, b, c represent the respective number of side effects in each horizontal split hn.
Each of the datasets mentioned in equations (3a) – (3c) is used to train and optimize a separate model considering the respective parameter bias, for example frequency in the given example. The predictions from all the sub-models in the horizontal split are then concatenated to form the prediction labels of that level (PT here), (a + b + c = m).
Vertical Splitting:
If the data at this level is too sparse, the sparsity can be decreased. One such way is to club the side effects into the next upper level of the hierarchy. This is termed as vertical splitting (VS). The prediction labels are reduced by going one level up in the ontology hierarchy. For example, in MedDRA the HLGT level is one level up from the PT level, and the SOC level is one level up from the HLGT level. The number of side effects to be predicted reduces because of this “level-up.” Thus, there are three VS in this example:
PT, having m number of PT terms
HLGT, all PT terms are grouped into n number of HLGT terms
SOC, all HLGT terms are grouped into o number of SOC terms
Similarly there can be any number of vertical splits as required. For each VS there may exist any number of horizontal splits. Then each vertical split can be represented as shown in equation (4a), (4b), and (4c).
VS1: PT (lth level)
$$
\begin{matrix}
X_1 \rightarrow \{ y_{11}^{l}, y_{12}^{l}, y_{13}^{l}, \ldots, y_{1m}^{l} \} \\
X_2 \rightarrow \{ y_{21}^{l}, y_{22}^{l}, y_{23}^{l}, \ldots, y_{2m}^{l} \} \\
X_3 \rightarrow \{ y_{31}^{l}, y_{32}^{l}, y_{33}^{l}, \ldots, y_{3m}^{l} \} \\
\vdots \\
X_N \rightarrow \{ y_{N1}^{l}, y_{N2}^{l}, y_{N3}^{l}, \ldots, y_{Nm}^{l} \}
\end{matrix}
\qquad
y = \begin{cases} 1, & \text{if present} \\ 0, & \text{otherwise} \end{cases} \tag{4a}
$$
VS2: HLGT ((l+1)th level)
$$
\begin{matrix}
X_1 \rightarrow \{ y_{11}^{l+1}, y_{12}^{l+1}, y_{13}^{l+1}, \ldots, y_{1n}^{l+1} \} \\
X_2 \rightarrow \{ y_{21}^{l+1}, y_{22}^{l+1}, y_{23}^{l+1}, \ldots, y_{2n}^{l+1} \} \\
X_3 \rightarrow \{ y_{31}^{l+1}, y_{32}^{l+1}, y_{33}^{l+1}, \ldots, y_{3n}^{l+1} \} \\
\vdots \\
X_N \rightarrow \{ y_{N1}^{l+1}, y_{N2}^{l+1}, y_{N3}^{l+1}, \ldots, y_{Nn}^{l+1} \}
\end{matrix}
\qquad
y = \begin{cases} 1, & \text{if any } y^{l} = 1 \text{ in the corresponding HLGT group} \\ 0, & \text{otherwise} \end{cases} \tag{4b}
$$
VS3: SOC ((l+2)th level)
$$
\begin{matrix}
X_1 \rightarrow \{ y_{11}^{l+2}, y_{12}^{l+2}, y_{13}^{l+2}, \ldots, y_{1o}^{l+2} \} \\
X_2 \rightarrow \{ y_{21}^{l+2}, y_{22}^{l+2}, y_{23}^{l+2}, \ldots, y_{2o}^{l+2} \} \\
X_3 \rightarrow \{ y_{31}^{l+2}, y_{32}^{l+2}, y_{33}^{l+2}, \ldots, y_{3o}^{l+2} \} \\
\vdots \\
X_N \rightarrow \{ y_{N1}^{l+2}, y_{N2}^{l+2}, y_{N3}^{l+2}, \ldots, y_{No}^{l+2} \}
\end{matrix}
\qquad
y = \begin{cases} 1, & \text{if any } y^{l+1} = 1 \text{ in the corresponding SOC group} \\ 0, & \text{otherwise} \end{cases} \tag{4c}
$$
Each VS can have any number of HS. For example, each VS will have similar HS levels like equation (3a) to (3c).
Model training and optimization:
Each set (e.g., equation (3a), (3b), (3c)) is used to train a different machine learning model, i.e., 9 models in this example (three horizontal splits for each of the three vertical splits). Deep neural networks (DNN) have been used in the present example. Each set is divided into a train set, a test set and a blind set, and the model is built, optimized and trained. A neural network for this purpose can be generally represented as equation (5).
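$$
\underbrace{\begin{matrix} X_1 \\ X_2 \\ X_3 \\ \vdots \\ X_N \end{matrix}}_{\text{Input}}
\rightarrow
\underbrace{[L_1][L_2][L_3] \ldots [L_l]}_{\text{Hidden layers}}
\rightarrow
\underbrace{\begin{matrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{matrix}}_{\text{Output (predictions)}} \tag{5}
$$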
The computation performed by the kth hidden unit in the lth hidden layer is given by equation (6).
$$
y_k^{(l)} = f\left( \sum_j w_{jk}^{(l)} \cdot f\left( \sum_i w_{ij}^{(l)} \cdot x_i + b_j^{(l-1)} \right) + b_k^{(l)} \right) \tag{6}
$$
where w, b are the respective weights and biases.
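The nested computation of equation (6), in which each unit applies an activation to a weighted sum of the previous layer's outputs plus a bias, can be sketched as below. This is a minimal illustration; the activation f is not specified in the description, so ReLU for the hidden layer is an assumption, and the layer sizes, weights and biases are hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: out[k] = activation(sum_j inputs[j] * weights[j][k] + biases[k])."""
    return [
        activation(sum(inputs[j] * weights[j][k] for j in range(len(inputs))) + biases[k])
        for k in range(len(biases))
    ]


# Hypothetical network: 2 inputs, 2 hidden units, 1 output unit.
x = [1.0, 2.0]
w1 = [[0.5, -1.0], [0.25, 1.0]]  # w1[i][j]: input i to hidden unit j
b1 = [0.0, 0.5]
w2 = [[1.0], [-1.0]]             # w2[j][k]: hidden unit j to output unit k
b2 = [0.0]

hidden = dense(x, w1, b1, relu)          # inner f(...) of equation (6)
output = dense(hidden, w2, b2, sigmoid)  # outer f(...) of equation (6)
```

In a multi-label setting the output layer would have one unit per side effect in the split being modelled.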
The output of the last layer is passed through the sigmoid function to obtain a probability for each binary label. To minimize the bias caused by the unbalanced dataset, focal loss has been used as the loss function. Following the above splitting and training, the probabilities are obtained at the different vertical levels (VS), after concatenating all horizontal levels at each vertical level, as shown in equation (7).
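A binary focal loss of the kind mentioned above can be sketched as follows. This is a minimal illustration of the standard formulation, in which the cross-entropy of each label is scaled by (1 - p_t) raised to gamma so that well-classified labels contribute less; the values gamma = 2.0 and alpha = 0.25 are commonly used defaults and are assumptions here, since the description does not state the settings used.

```python
import math


def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)             # avoid log(0)
        p_t = p if y == 1 else 1.0 - p              # probability given to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


# Confident correct predictions are penalized far less than poor ones.
easy = focal_loss([1, 0], [0.95, 0.05])
hard = focal_loss([1, 0], [0.30, 0.70])
```

With gamma set to 0 the expression reduces to an alpha-weighted binary cross-entropy, which is a convenient sanity check.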
$$
\begin{matrix}
VS_1 \rightarrow P(1), P(2), P(3), \ldots, P(m) \\
VS_2 \rightarrow P(1), P(2), P(3), \ldots, P(n) \\
VS_3 \rightarrow P(1), P(2), P(3), \ldots, P(o) \\
\vdots \\
VS_x \rightarrow P(1), P(2), P(3), \ldots, P(T)
\end{matrix} \tag{7}
$$
where P(x) represents the probability of each side effect at the xth vertical level having T terms after grouping of the m PT terms up to the xth level.
Weightage calculation:
Each vertical split predicts the probability of the side effect either directly (PT) or for the clubbed group (HLGT, SOC, etc.). Based on the prediction accuracy of each vertical split on the test set, a weightage parameter (W_VS(x)) is calculated, which is used to calculate the final Composite Probability (CP). The accuracy of prediction of the ith side effect at any VS(x), a_i, is given by equation (9), and the weightage by equation (10).
$$
a_i^{VS(x)} = \frac{\text{Total number of correct predictions}}{N} \tag{9}
$$
$$
W_{VS(x)} = \frac{\sum a_i^{VS(x)}}{\text{Total number of side effects in the } x^{th} \text{ VS}} \tag{10}
$$
Thus, one weightage corresponding to each VS is obtained.
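Equations (9) and (10) can be sketched as a per-side-effect accuracy followed by an average over the side effects of one vertical split. This is a minimal illustration, assuming the predictions have already been converted to binary labels (the threshold used for that conversion is not stated in the description); the function name and toy labels are hypothetical.

```python
def level_weight(y_true, y_pred):
    """Return (per-side-effect accuracies a_i, weightage W) for one vertical split.

    y_true, y_pred: N rows (test compounds) of binary labels, one column per side effect.
    """
    n = len(y_true)
    accuracies = [
        sum(1 for t, p in zip(true_col, pred_col) if t == p) / n
        for true_col, pred_col in zip(zip(*y_true), zip(*y_pred))
    ]
    weight = sum(accuracies) / len(accuracies)
    return accuracies, weight


# Hypothetical test set: 4 compounds, 2 side effects.
y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 0], [1, 1], [1, 1]]
acc, weight = level_weight(y_true, y_pred)
```

Calling the function once per vertical split gives the one weightage per VS used in the composite score.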
Composite Probability calculation:
The probability of occurrence of ith side effect (at PT level) is then calculated by the formula given in equation (11).
$$
CP_i = \sum_{1}^{x} P_i^{VS(x)} \times W_{VS(x)} \tag{11}
$$
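The composite probability of equation (11) can be sketched as a weighted sum over the vertical levels, where the probability used at each higher level is that of the group containing the PT term. This is a minimal illustration; the function name, term names, parent mappings, probabilities and weightages are all hypothetical.

```python
def composite_probability(pt_term, level_probs, level_weights, parents):
    """CP for one PT term: sum over levels of (probability at level) * (level weightage)."""
    cp = 0.0
    for level, probs in level_probs.items():
        # At the PT level use the term itself; otherwise use its parent group.
        term = pt_term if level == "PT" else parents[level][pt_term]
        cp += probs[term] * level_weights[level]
    return cp


# Hypothetical predictions and weightages for one compound.
level_probs = {
    "PT": {"pt_a": 0.6},
    "HLGT": {"hlgt_x": 0.8},
    "SOC": {"soc_q": 0.9},
}
level_weights = {"PT": 0.5, "HLGT": 0.25, "SOC": 0.25}
parents = {"HLGT": {"pt_a": "hlgt_x"}, "SOC": {"pt_a": "soc_q"}}

cp = composite_probability("pt_a", level_probs, level_weights, parents)
```

The toy weightages are chosen to sum to 1 so that the result stays in the range 0-1; equation (11) as written does not normalize the weightages, so whether to rescale them is left as a choice for the implementer.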
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The embodiments of present disclosure herein address unresolved problems related to accurate and cost-effective prediction of side effects of the drug. The embodiment thus provides a method and system for prediction of side effects of a compound, especially drug compounds.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
| # | Name | Date |
|---|---|---|
| 1 | 202121040025-STATEMENT OF UNDERTAKING (FORM 3) [03-09-2021(online)].pdf | 2021-09-03 |
| 2 | 202121040025-REQUEST FOR EXAMINATION (FORM-18) [03-09-2021(online)].pdf | 2021-09-03 |
| 3 | 202121040025-FORM 18 [03-09-2021(online)].pdf | 2021-09-03 |
| 4 | 202121040025-FORM 1 [03-09-2021(online)].pdf | 2021-09-03 |
| 5 | 202121040025-FIGURE OF ABSTRACT [03-09-2021(online)].jpg | 2021-09-03 |
| 6 | 202121040025-DRAWINGS [03-09-2021(online)].pdf | 2021-09-03 |
| 7 | 202121040025-DECLARATION OF INVENTORSHIP (FORM 5) [03-09-2021(online)].pdf | 2021-09-03 |
| 8 | 202121040025-COMPLETE SPECIFICATION [03-09-2021(online)].pdf | 2021-09-03 |
| 9 | 202121040025-Proof of Right [06-09-2021(online)].pdf | 2021-09-06 |
| 10 | 202121040025-FORM-26 [08-04-2022(online)].pdf | 2022-04-08 |
| 11 | Abstract1.jpg | 2022-04-16 |