FIELD OF THE INVENTION
The present invention generally relates to bioinformatics, proteomics, molecular modeling, computer-aided molecular design (CAMD), and more specifically computer-aided drug design (CADD) and computational modeling of molecular combinations.
BACKGROUND OF THE INVENTION
An explanation of conventional drug discovery processes and their limitations is useful for understanding the present invention.
Discovering a new drug to treat or cure some biological condition is a lengthy and expensive process, typically taking on average 12 years and $800 million per drug, and possibly taking up to 15 years or more and $1 billion to complete in some cases.
A goal of a drug discovery process is to identify and characterize a chemical compound or ligand biomolecule that affects the function of one or more other biomolecules (i.e., a drug "target") in an organism, usually a biopolymer, via a potential molecular interaction or combination. Herein the term biopolymer refers to a macromolecule that comprises one or more of a protein, nucleic acid (DNA or RNA), peptide or nucleotide sequence or any portions or fragments thereof. Herein the term biomolecule refers to a chemical entity that comprises one or more of a biopolymer, carbohydrate, hormone, or other molecule or chemical compound, either inorganic or organic, including, but not limited to, synthetic, medicinal, drug-like, or natural compounds, or any portions or fragments thereof.
The target molecule is typically a disease-related target protein or nucleic acid for which it is desired to effect a change in function, structure, and/or chemical activity in order to aid in the treatment of a patient disease or other disorder. In other cases, the target is a biomolecule found in a disease-causing organism, such as a virus, bacterium, or parasite, that when affected by the drug will affect the survival or activity of the infectious organism. In yet other cases, the target is a biomolecule of a defective or harmful cell such as a cancer cell. In yet other cases the target is an antigen or other environmental chemical agent that may induce an allergic reaction or other undesired immunological or biological response.
The ligand is typically a small molecule drug or chemical compound with desired drug-like properties in terms of potency, low toxicity, membrane permeability, solubility, chemical / metabolic stability, etc. In other cases, the ligand may be a biologic such as an injected protein-based or peptide-based drug or even another full-fledged protein. In yet other cases the ligand may be a chemical substrate of a target enzyme. The ligand may even be covalently bound to the target or may in fact be a portion of the protein, e.g., protein secondary structure component, protein domain containing or near an active site, protein subunit of an appropriate protein quaternary structure, etc.
Throughout the remainder of the background discussion, unless otherwise specifically differentiated, a (potential) molecular combination will feature one ligand and one target, the ligand and target will be separate chemical entities, and the ligand will be assumed to be a chemical compound while the target will be a biological protein (mutant or wild type). Note that the frequency of nucleic acids (both DNA and RNA) as targets will likely increase in coming years as advances in gene therapy and pathogenic microbiology progress. Also the term "molecular complex" will refer to the bound state between the target and ligand when interacting with one another in the midst of a suitable (e.g., aqueous) environment. A "potential" molecular complex refers to a bound state that may occur albeit with low probability and therefore may or may not actually form under normal conditions.
The drug discovery process itself typically includes four different subprocesses: (1) target validation; (2) lead generation / optimization; (3) preclinical testing; and (4) clinical trials and approval.
Target validation includes determination of one or more targets that have disease relevance and usually takes two-and-a-half years to complete. Results of the target validation phase might include a determination that the presence or action of the target molecule in an organism causes or influences some effect that initiates, exacerbates, or contributes to a disease for which a cure or treatment is sought. In some cases a natural binder or substrate for the target may also be determined via experimental methods.
Lead generation typically involves the identification of lead compounds that can bind to the target molecule and thereby alter the effects of the target through either activation, deactivation, catalysis, or inhibition of the function of the target, in which case the lead would be viewed as a suitable candidate ligand to be used in the drug application process. Lead optimization involves the chemical and structural refinement of lead candidates into drug precursors in order to improve binding affinity to the desired target, increase selectivity, and also to address basic issues of toxicity, solubility, and metabolism. Together, lead generation and lead optimization typically take about three years to complete and might result in one or more chemically distinct leads for further consideration.
In preclinical testing, biochemical assays and animal models are used to test the selected leads for various pharmacokinetic factors related to drug absorption, distribution, metabolism, excretion, toxicity, side effects, and required dosages. This preclinical testing takes approximately one year. After the preclinical testing period, clinical trials and approval take another six to eight or more years during which the drug candidates are tested on human subjects for safety and efficacy.
Rational drug design generally uses structural information about drug targets (structure-based) and/or their natural ligands (ligand-based) as a basis for the design of effective lead candidate generation and optimization. Structure-based rational drug design generally utilizes a three-dimensional model of the structure for the target. For target proteins or nucleic acids such structures may be the result of X-ray crystallography / NMR or other measurement procedures or may result from homology modeling, analysis of protein motifs and conserved domains, and/or computational modeling of protein folding or the nucleic acid equivalent. Model-built structures are often all that is available when considering many membrane-associated target proteins, e.g., GPCRs and ion-channels. The structure of a ligand may be generated in a similar manner or may instead be constructed ab initio from a known 2-D chemical representation using fundamental physics and chemistry principles, provided the ligand is not a biopolymer.
Rational drug design may incorporate the use of any of a number of computational components ranging from computational modeling of target-ligand molecular interactions and combinations to lead optimization to computational prediction of desired drug-like biological properties. The use of computational modeling in the context of rational drug design has been largely motivated by a desire to both reduce the required time and to improve the focus and efficiency of drug research and development, by avoiding often time consuming and costly efforts in biological "wet" lab testing and the like.
Computational modeling of target-ligand molecular combinations in the context of lead generation may involve the large-scale in-silico screening of compound libraries (i.e., library screening), whether the libraries are virtually generated and stored as one or more compound structural databases or constructed via combinatorial chemistry and organic synthesis, using computational methods to rank a selected subset of ligands based on computational prediction of bioactivity (or an equivalent measure) with respect to the intended target molecule.
Throughout the text, the term "binding mode" refers to the 3-D molecular structure of a potential molecular complex in a bound state at or near a minimum of the binding energy (i.e., maximum of the binding affinity), where the term "binding energy" (sometimes interchanged with "binding affinity" or "binding free energy") refers to the change in free energy of a molecular system upon formation of a potential molecular complex, i.e., the transition from an unbound to a (potential) bound state for the ligand and target. Here, the term "free energy" generally refers to both enthalpic and entropic effects as the result of physical interactions between the constituent atoms and bonds of the molecules between themselves (i.e., both intermolecular and intramolecular interactions) and with their surrounding environment. Examples of the free energy are the Gibbs free energy encountered in the canonical or grand canonical ensembles of equilibrium statistical mechanics. In general, the optimal binding free energy of a given target-ligand pair directly correlates to the likelihood of formation of a potential molecular complex between the two molecules in chemical equilibrium, though, in truth, the binding free energy describes an ensemble of (putative) complexed structures and not one single binding mode. However, in computational modeling it is usually assumed that the change in free energy is dominated by a single structure corresponding to a minimal energy. This is certainly true for tight binders (pK ~ 0.1 to 10 nanomolar) but questionable for weak ones (pK ~ 10-100 micromolar). The dominating structure is usually taken to be the binding mode. In some cases it may be necessary to consider more than one alternative binding mode when the associated system states are nearly degenerate in terms of energy.
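To give a feel for the energy scale separating tight and weak binders, the following minimal sketch converts a dissociation constant into a standard binding free energy using the textbook relation ΔG° = RT ln(K_d / 1 M). It assumes that the nanomolar and micromolar figures quoted above are dissociation constants, a 1 M standard state, and a temperature of 298.15 K; the function name is illustrative only.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from a dissociation
    constant in mol/L, assuming a 1 M standard state."""
    if kd_molar <= 0:
        raise ValueError("dissociation constant must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)

# Assumed example values: a 1 nM (tight) and a 10 uM (weak) binder.
tight = binding_free_energy(1e-9)
weak = binding_free_energy(1e-5)
print(f"1 nM : {tight:.2f} kcal/mol")
print(f"10 uM: {weak:.2f} kcal/mol")
```

Under these assumptions the tight binder sits several kcal/mol below the weak one, which is consistent with the remark that a single low-energy structure is more likely to dominate for tight binders.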
It is desirable in the drug discovery process to identify quickly and efficiently the optimal docking configurations, i.e., binding modes, of two molecules or parts of molecules. Efficiency is especially relevant in the lead generation and lead optimization stages for a drug discovery pipeline, where it is desirable to accurately predict the binding mode for possibly millions of potential molecular complexes, before submitting promising candidates to further analysis.
Binding modes and binding affinity are of direct interest to drug discovery and rational drug design because they often help indicate how well a potential drug candidate may serve its purpose. Furthermore, where the binding mode is determinable, the action of the drug on the target can be better understood. Such understanding may be useful when, for example, it is desirable to further modify one or more characteristics of the ligand so as to improve its potency (with respect to the target), binding specificity (with respect to other targets), or other chemical and metabolic properties.
A number of laboratory methods exist for measuring or estimating affinity between a target molecule and a ligand. Often the target might be first isolated and then mixed with the ligand in vitro and the molecular interaction assessed experimentally such as in the myriad biochemical and functional assays associated with high throughput screening. However, such methods are most useful where the target is simple to isolate, the ligand is simple to manufacture, and the molecular interaction is easily measured, but are more problematic when the target cannot be easily isolated, isolation interferes with the biological process or disease pathway, the ligand is difficult to synthesize in sufficient quantity, or where the particular target or ligand is not well characterized ahead of time. In the latter case, many thousands or millions of experiments might be needed for all possible combinations of the target and ligands, making the use of laboratory methods unfeasible.
While a number of attempts have been made to resolve this bottleneck by first using specialized knowledge of various chemical and biological properties of the target (or even related targets such as protein family members) and/or one or more already known natural binders or substrates to the target, to reduce the number of combinations required for lab processing, this is still impractical and too expensive in most cases. Instead of actually combining molecules in a laboratory setting and measuring experimental results, another approach is to use computers to simulate or characterize molecular interactions between two or more molecules (i.e., molecular combinations modeled in silico). The use of computational methods to assess molecular combinations and interactions is usually associated with one or more stages of rational drug design, whether structure-based, ligand-based, or both.
The computational prediction of one or more binding modes and/or the computational assessment of the nature of a molecular combination and the likelihood of formation of a potential molecular complex is generally associated with the term "docking" in the art. To date, conventional "docking" methods have included a wide variety of computational techniques as described in the forthcoming section entitled "REFERENCES AND PRIOR ART".
Whatever the choice of computational docking method, there are inherent trade-offs between the computational complexity of both the underlying molecular models and the intrinsic numerical algorithms, and the amount of compute resources (time, number of CPUs, number of simulations) that must be allocated to process each molecular combination. For example, while highly sophisticated molecular dynamics simulations (MD) of the two molecules surrounded by explicit water molecules and evolved over trillions of time steps may lead to higher accuracy in modeling the potential molecular combination, the resultant computational cost (i.e., time and compute power) is so enormous that such simulations are intractable for use with more than just a few molecular combinations.
One major distinction amongst docking methods as applied to computational modeling of molecular combinations is whether the ligand and target structures remain rigid throughout the course of the simulation (i.e., rigid-body docking) vs. the ligand and/or target being allowed to change their molecular conformations (i.e., flexible docking). In general, the latter scenario involves more computational complexity, though flexible docking may often achieve higher accuracy than rigid-body docking when modeling various molecular combinations.
That being said, rigid-body docking can provide valuable insight into the nature of a molecular combination and/or the likelihood of formation of a potential molecular complex and has many potential uses within the context of rational drug discovery. For instance, rigid-body docking may be appropriate for docking small, rigid molecules (or molecular fragments) to a simple protein with a well-defined, nearly rigid active site. As another example, rigid-body docking may also be used to more efficiently and rapidly screen out a subset of likely nonactive ligands in a molecule library for a given target, and then applying more onerous flexible docking procedures to the surviving candidate molecules. Rigid-body docking may also be suitable for de novo ligand design and combinatorial library design.
Moreover, in order to better predict the binding mode and better assess the nature and/or likelihood of a molecular combination when one or both molecules are likely to be flexible, rigid-body docking can be used in conjunction with a process for generating likely yet distinct molecular conformers of one or both molecules for straightforward and efficient virtual screening of a molecule library against a target molecule. However, as will be discussed, even rigid body docking of molecular combinations can be computationally expensive and thus there is a clear need for better and more efficient computational methods based on rigid body docking when assessing the nature and/or likelihood of molecular combinations.
As outlined in the section entitled "REFERENCES AND PRIOR ART", conventional computational methods for predicting binding modes and assessing the nature and/or likelihood of molecular combinations in the context of rigid-body docking include a wide variety of techniques. These include methods based on pattern matching (often graph-based), maximization of shape complementarity (i.e., shape correlations), geometric hashing, pose clustering, and even the use of one or more flexible docking methods with the simplifying run-time condition that both molecules are rigid.
Of special interest to this invention is a class of rigid-body docking techniques based on the maximization of shape complementarity via evaluation of the spatial correlation between two representative molecular surfaces at different relative positions and orientations. Here the term "shape complementarity" measures the geometric fit or correlation between the molecular shapes of two molecules. The concept can be generalized to any two objects. For example, two pieces of a jigsaw puzzle that fit each other exhibit strong shape complementarity.
Shape complementarity based methods, while typically treating molecules as rigid and thus perhaps less rigorous than their flexible docking counterparts, especially in the context of flexible molecules, are still potentially valuable for the fast, efficient screening of two molecules in order to make a preliminary assessment of the nature and/or likelihood of formation of a potential molecular complex of the two molecules or to make an initial prediction of the preferred binding mode for the molecular combination. Such a preliminary assessment may significantly reduce the number of candidates that must be further screened in silico by another more computationally costly docking method.
One example includes the "FTDOCK" docking software of the Cambridge Crystallographic Data Center based on computation of spatial correlations in the Fourier domain and described in Aloy, P., Moont, G., Gabb, H. A., Querol, E., Aviles, F. X., and Sternberg, M. J. E., "Modeling Protein Docking using Shape Complementarity, Electrostatics and Biochemical Information," (1998), Proteins: Structure, Function, and Genetics, 33(4) 535-549; all of which is hereby incorporated by reference in its entirety. However, the number of computations associated with this method renders the process impractical for use with conventional computer software and hardware configurations when performing large-scale screening. Moreover, the method is impractical for high accuracy prediction of the binding mode due to the requirement of a high resolution of the associated sampling space.
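The idea of scoring shape complementarity as a spatial correlation over relative positions can be illustrated with a toy two-dimensional grid. The sketch below is not the FTDOCK algorithm: it evaluates the correlation directly for each translation rather than in the Fourier domain (Fourier-based methods obtain the same translational correlation for all shifts at once), it ignores rotations, and the weighting scheme (a core penalty of -15 and contact counts on the surrounding cells) is an assumption chosen only for illustration.

```python
CORE_PENALTY = -15  # illustrative penalty for overlapping the receptor interior

def build_receptor_grid(core_cells):
    """Map grid cells to weights: a penalty inside the core, and the
    number of adjacent core cells for each empty cell around it."""
    grid = {cell: CORE_PENALTY for cell in core_cells}
    for (x, y) in core_cells:
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb not in core_cells:
                grid[nb] = grid.get(nb, 0) + 1
    return grid

def correlation_score(receptor, ligand, shift):
    """Sum the receptor weights over the cells the shifted ligand occupies."""
    dx, dy = shift
    return sum(receptor.get((x + dx, y + dy), 0) for (x, y) in ligand)

def best_translation(receptor, ligand, search):
    """Exhaustively scan translations and return the highest-scoring one."""
    shifts = [(dx, dy) for dx in search for dy in search]
    return max(shifts, key=lambda s: correlation_score(receptor, ligand, s))

# Toy receptor: a 6 x 3 block with a two-cell notch (the "pocket") on top.
core = {(x, y) for x in range(6) for y in range(3)} - {(2, 2), (3, 2)}
receptor = build_receptor_grid(core)
ligand = {(0, 0), (1, 0)}  # a two-cell ligand that fits the notch

best = best_translation(receptor, ligand, range(-3, 8))
print(best, correlation_score(receptor, ligand, best))
```

In this toy case the scan places the ligand in the notch, much like the jigsaw-puzzle picture used earlier. The cost of the direct scan grows with the number of sampled shifts times the number of ligand cells, which is the kind of cost that motivates Fourier-domain evaluation.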
Another example is the Patchdock docking software based on least-square minimization (or equivalent minimization) of separation distances between critical surface and/or fitting points that represent the molecular surfaces of the two molecules and written by the Nussinov-Wolfson Structural Bioinformatics Group at Tel-Aviv University, based on principles described in Lin, S. L., Nussinov, R., Fischer, D., and Wolfson, H. J., "Molecular Surface Representations by Sparse Critical Points", Proteins: Structure, Function, and Genetics 18, 94-101 (1994); all of which is hereby incorporated by reference in its entirety. However, this method often suffers from degraded accuracy, especially when the molecular surface geometry is complex or when the ligand molecule is very small relative to the protein receptor and/or characterized by poor binding affinities. Moreover, the method can be computationally expensive for a high resolution sampling space of relative positions and orientations of the system, and even the cost of computing the surface critical points is often itself quite expensive.
Yet another example is the "Hex" docking software developed for the efficient estimation of shape complementarity based on the decomposition of two volumetric functions describing a representative molecular surface for each molecule onto an appropriate orthogonal basis set, such as a radial-spherical harmonics expansion. The "Hex" docking software is described in Ritchie, D. W. and Kemp, G. J. L., "Protein Docking Using Spherical Polar Fourier Correlations", (2000), Proteins: Structure, Function, and Genetics, 39, 178-194 (hereinafter, "Ritchie et al."), all of which is hereby incorporated by reference in its entirety.
The chief advantage of this type of method is that the required number of calculations scales linearly with the desired number of sampled configurations, thus allowing for a dense sampling of the geometric shape complementarity. Moreover, the compute time is roughly invariant with respect to the sizes of the two molecules and is thus suitable for protein-protein docking as has been demonstrated with respect to multiple protein-protein systems, including both enzyme-inhibitor and antibody-antigen, as shown in Ritchie et al. However, to achieve high accuracy for complex molecular surface geometries, it is necessary to perform the orthogonal basis expansion with a large expansion order and as such the total compute time can be quite large. Furthermore, current methods such as those outlined in Ritchie et al. are not amenable to implementation in customized or other application specific hardware for use in large-scale screening.
While high shape complementarity alone is often a positive indicator of a favorable binding mode, additional electrostatic interactions involving the charges comprising the two molecules, as well as charges or ions in an ambient environment, are also important physical aspects of any potential molecular complex. In fact, many potential molecular combinations involving a false positive lead candidate (i.e., a lead that in reality does not bind strongly to the target) may exhibit high shape complementarity but poor electrostatic affinity. It is also possible, though relatively rare, for a potential molecular combination to demonstrate such a high electrostatic affinity that even if the shape complementarity is relatively poor, the two molecules may indeed have a high likelihood of forming a valid molecular complex. Further discussion of the importance of electrostatic interactions in biology and chemistry can be found in the review article by Honig et al. [35].
Throughout the description the term "charge" will refer to either the conventional definition related to the "net charge" of a molecular component, i.e., its electrostatic monopole moment, or a "partial charge" representing various nonvanishing higher moments of an electrostatic multipole expansion, e.g., dipoles, quadrupoles, etc., whether in the classical or quantum mechanical regime.
To illustrate this point, Fig. 1a shows two molecules 110 and 120 of a potential molecular complex with high shape complementarity. However, the positive charges 125 of molecule 120 in or near the active site, denoted by region 130, are in unfavorable close electrostatic contact with the positive charges 115 located on molecule 110 in close proximity to region 130. The system in Fig. 1a then exhibits poor electrostatic affinity and is unlikely to form a favorable molecular combination, yet computation of shape complementarity alone would not have detected this and instead overestimated the likelihood of combination of molecules 110 and 120.
Observe that molecule 110 may contain charges other than 115 and may in fact be overall electrically neutral. The same potentially applies to molecule 120, but as illustrated in Fig. 1a, it is both the arrangement and magnitude of the charges on one molecule relative to another in a given relative orientation and position of the two molecules that most heavily impact the electrostatic affinity of the system.
Fig. 1b shows two molecules 140 and 150 with identical molecular shapes as molecules 110 and 120, respectively, of Fig. 1a. However, the positive charges 155 of molecule 150, in or near the corresponding active site, denoted by region 160, are now in favorable close electrostatic contact with negative charges 145 on molecule 140 in close proximity to region 160. The system in Fig. 1b, unlike the system in Fig. 1a, then exhibits good electrostatic affinity in addition to high shape complementarity, and thus molecules 140 and 150 are more likely to form an energetically favorable molecular complex.
Thus, in order to accurately analyze and characterize the nature and/or likelihood of a molecular combination, including estimation of the binding affinity and prediction of the binding mode (or even additional alternative modes) for the system, it is desirable to estimate the electrostatic affinity, i.e., the change in the electrostatic energy, of the system upon formation of a potential molecular complex. As used herein, "electrostatic energy" is a quantitative measure of the electrostatic affinity. Typically, when the electrostatic affinity is high, the change in electrostatic energy is very negative, whereas when the electrostatic affinity is poor, the change in electrostatic energy is near zero or may even be positive. As used herein, the "change in electrostatic energy" refers to the net difference in energy between a state where the two molecules are mutually interacting and a reference state where the two molecules are very far apart and thus their mutual electrostatic interaction is negligible. Computing the electrostatic affinity, as opposed to the final absolute electrostatic energy of the system, is significant since the two molecules in their initial reference state may already exhibit favorable self-electrostatic energies.
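The definition of the change in electrostatic energy as a difference against a far-separated reference state can be made concrete with a minimal sketch. It assumes rigid molecules represented as lists of point charges, a plain vacuum Coulomb law, and hypothetical coordinates that loosely mimic the like-charge and opposite-charge contacts of Fig. 1a and Fig. 1b; none of these modeling choices come from the text.

```python
import math
from itertools import combinations

COULOMB_KCAL = 332.06371  # Coulomb constant in kcal*Angstrom/(mol*e^2)

def total_energy(charges):
    """Vacuum Coulomb energy summed over all unordered pairs of point charges."""
    return sum(COULOMB_KCAL * q1 * q2 / math.dist(p1, p2)
               for (q1, p1), (q2, p2) in combinations(charges, 2))

def translate(molecule, offset):
    """Rigidly shift every charge of a molecule by the given offset."""
    return [(q, tuple(c + d for c, d in zip(pos, offset))) for q, pos in molecule]

def electrostatic_energy_change(mol_a, mol_b, far=1.0e6):
    """Energy of the interacting state minus that of a reference state
    in which mol_b has been moved very far away from mol_a."""
    bound = total_energy(mol_a + mol_b)
    reference = total_energy(mol_a + translate(mol_b, (far, 0.0, 0.0)))
    return bound - reference

# Hypothetical charges: (charge in e, position in Angstrom).
ligand = [(+1.0, (4.0, 0.0, 0.0))]
target_like = [(+1.0, (0.0, 0.0, 0.0)), (-1.0, (-3.0, 0.0, 0.0))]      # like charges face each other
target_opposite = [(-1.0, (0.0, 0.0, 0.0)), (+1.0, (-3.0, 0.0, 0.0))]  # opposite charges face each other

unfavorable = electrostatic_energy_change(target_like, ligand)
favorable = electrostatic_energy_change(target_opposite, ligand)
print(f"like-charge contact:     {unfavorable:+.2f} kcal/mol")
print(f"opposite-charge contact: {favorable:+.2f} kcal/mol")
```

Both hypothetical targets are electrically neutral and have the same self-energy, which drops out of the difference; only the arrangement of charges relative to the ligand determines the sign of the change.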
In general, however, it is computationally expensive to estimate the electrostatic affinity of a molecular combination in a suitable environment for a large sequence of relative positions and orientations of the constituent molecules via conventional means, especially when one or more of the molecules is a macromolecule such as a protein or a long nucleotide sequence. Full treatment in the quantum mechanical regime is extremely impractical when the molecules in question are comprised of fifty or more atoms, due to the high complexity of suitable Hamiltonian functions. This is true even when considering only one relative position and orientation of the system, let alone the possibly millions of relative positions and orientations for each of possibly thousands or even millions of molecular combinations encountered during library screening. The reader is referred to Labanowski et al. [58] for a review of quantum mechanical calculations of electrostatic interactions.
Full treatment in the classical regime usually entails numerical solutions to second order partial differential equations [49][50] with appropriately chosen boundary conditions such as the Poisson-Boltzmann equation [45][47][48]. Such numerical solutions generally require many computations and exhibit high memory overhead. Moreover, new solutions must be generated for each distinct relative configuration of the two molecules because of the corresponding change in boundary conditions. Such classical methods are then also highly unsuitable for fast molecular docking and/or library screening.
Some attempts have been made to overcome this computational bottleneck by implementing a simple Coulombic energy model with a distance-dependent dielectric function $\varepsilon(r_{ij})$, as follows:

$$E = \sum_{i<j} \frac{q_i\, q_j}{\varepsilon(r_{ij})\, r_{ij}} \qquad \text{[Eqn. 1]}$$

where $r_{ij}$ is the distance between charges $q_i$ and $q_j$. The reader is referred to Mehler et al. [46] for a comparison of dielectric models. Examples of methods using the formulation of Eqn. 1 include those of Luty et al. [26] and AutoDock [31].
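A Coulombic sum with a distance-dependent dielectric, of the kind the text refers to as Eqn. 1, can be sketched as follows. The linear form ε(r) = 4r used as the default is only one simple choice assumed here for illustration; it is not claimed to be the dielectric model used by any of the cited methods. The sum runs over cross pairs between two hypothetical molecules given as lists of point charges.

```python
import math

COULOMB_KCAL = 332.06371  # Coulomb constant in kcal*Angstrom/(mol*e^2)

def coulomb_energy(mol_a, mol_b, dielectric=lambda r: 4.0 * r):
    """Sum q_i*q_j / (eps(r_ij) * r_ij) over all cross pairs, where each
    molecule is a list of (charge in e, (x, y, z) in Angstrom) tuples."""
    energy = 0.0
    for qa, pa in mol_a:
        for qb, pb in mol_b:
            r = math.dist(pa, pb)
            energy += COULOMB_KCAL * qa * qb / (dielectric(r) * r)
    return energy

# Hypothetical pair of opposite unit charges 5 Angstrom apart.
mol_a = [(+1.0, (0.0, 0.0, 0.0))]
mol_b = [(-1.0, (5.0, 0.0, 0.0))]

screened = coulomb_energy(mol_a, mol_b)                           # eps(r) = 4r
vacuum = coulomb_energy(mol_a, mol_b, dielectric=lambda r: 1.0)   # constant dielectric
print(f"distance-dependent dielectric: {screened:.3f} kcal/mol")
print(f"vacuum:                        {vacuum:.3f} kcal/mol")
```

The screened interaction is much weaker than the vacuum value, which is the intended effect of the dielectric function. Note that the nested loop visits every pair of charges, so the cost grows with the product of the two charge counts.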
However, Eqn. 1 is ad hoc at best when the two molecules are not in vacuum (i.e., $\varepsilon = \text{constant} = \varepsilon_0$) and thus often does not well represent the nature of electrostatic interactions for charges embedded in a polarizable medium (e.g., an appropriate solvent like water or even salt water). Moreover, even for the simple model represented by Eqn. 1, the computational effort still scales quadratically with the number of charges in the two molecules, e.g., if the molecule has N point charges, there are N(N-1) distinct pairs which need to be calculated. This can be a real problem, especially when considering possibly millions of relative orientations and translations of the two molecules during a high-resolution search of the electrostatic affinity space.
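The quadratic scaling can be turned into a rough operation count. The sketch below follows the text's count of N(N-1) pairs (ordered pairs; counting each unordered pair once would halve it) and multiplies by the number of sampled configurations and the number of molecular combinations. The specific numbers are hypothetical and chosen only to show the order of magnitude.

```python
def ordered_pairs(n_charges):
    """Number of ordered pairs among n point charges, N(N-1)."""
    return n_charges * (n_charges - 1)

def screening_cost(n_charges, n_configurations, n_combinations):
    """Pairwise evaluations needed to score every sampled configuration
    of every molecular combination with a direct pairwise sum."""
    return ordered_pairs(n_charges) * n_configurations * n_combinations

# Hypothetical workload: 1,000 charges, a million configurations per
# combination, and a million combinations in the library.
cost = screening_cost(1_000, 10**6, 10**6)
print(f"{cost:.3e} pairwise evaluations")
```

Even at a billion pairwise evaluations per second, a count of this size would take decades on a single processor, which illustrates why the direct sum becomes a bottleneck in library screening.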
Further computational savings can be obtained by introducing a cutoff radius, $r_{\text{cutoff}}$, beyond which the interaction is ignored, such as in Luty et al. [26]. However, due to the long distance scales inherent in the electrostatic interaction, this may lead to numerical inaccuracies, and thus accurate prediction of the electrostatic affinity may still require large cutoff radii, e.g., $r_{\text{cutoff}} \approx 10$-$20$ Å, thereby significantly reducing the savings provided by the cutoff radius.
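The trade-off introduced by a cutoff radius can be seen in a small sketch: fewer pairs are evaluated, but slowly decaying contributions from distant charges are dropped. The example assumes a vacuum Coulomb law and a hypothetical target made of a row of small like charges; the geometry is illustrative only.

```python
import math

COULOMB_KCAL = 332.06371  # Coulomb constant in kcal*Angstrom/(mol*e^2)

def coulomb_with_cutoff(mol_a, mol_b, r_cutoff):
    """Cross-pair Coulomb energy that skips pairs farther apart than
    r_cutoff; returns the energy and the number of pairs evaluated."""
    energy, used = 0.0, 0
    for qa, pa in mol_a:
        for qb, pb in mol_b:
            r = math.dist(pa, pb)
            if r <= r_cutoff:
                energy += COULOMB_KCAL * qa * qb / r
                used += 1
    return energy, used

# Hypothetical target: 30 charges of +0.1 e spaced 2 Angstrom apart along x.
target = [(0.1, (2.0 * k, 0.0, 0.0)) for k in range(30)]
ligand = [(1.0, (0.0, 3.0, 0.0))]

e_full, n_full = coulomb_with_cutoff(target, ligand, math.inf)
e_cut, n_cut = coulomb_with_cutoff(target, ligand, 10.0)
print(f"no cutoff:   {e_full:.2f} kcal/mol from {n_full} pairs")
print(f"10 A cutoff: {e_cut:.2f} kcal/mol from {n_cut} pairs")
```

With this geometry the 10 Å cutoff keeps only a handful of pairs and visibly underestimates the total, because the 1/r terms from the remaining charges do not become negligible quickly.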
Methods based on the use of a Generalized Born approximation [51][52], whether based on computations of volume or surface integrals, are viable alternatives for the calculation of charge-charge interactions for pairs of atoms or ions in the presence of continuum solvent. In general, however, though less computationally demanding, the Generalized Born approximation is not as accurate in estimating electrostatic energies as the Poisson-Boltzmann equation, and is still more costly than the distance-dependent dielectric Coulombic model of Eqn. 1.
Recently, a new approach as described in Ritchie et al. has been developed for the estimation of electrostatic affinity based on the decomposition of two volumetric functions describing respectively the charge distributions, and the electrostatic potential generated thereby, for each molecule onto an appropriate orthogonal basis set. As with the computations of shape complementarity based on similar techniques, the required number of calculations scales linearly with the desired number of sampled configurations, thus allowing for a dense sampling of the search space associated with the electrostatic affinity of the two molecules.
However, the implementation described by Ritchie et al. has several drawbacks. First, it requires a complicated and computationally intensive representation of the electrostatic potential in terms of a Green's function expansion as applied to a general solution to Poisson's equation. Second, all solvent effects are ignored and the calculations are performed in vacuum, thereby limiting the applicable scope of the model. Third, their characterization of atomic charges as point charges, represented by Dirac delta functions, leads to sharp discontinuities in the charge distribution. Such discontinuities are very difficult to model accurately in a spherical harmonics expansion, thereby requiring a very large expansion order for the underlying basis expansions, and thus leading to large computational costs. Altogether, this leads to an imprecise estimate for the electrostatic affinity of most molecular complexes, unless the expansion order is so high as to make the calculations intractable in the context of a library screening. Furthermore, this approach is not amenable to implementation in customized or other application specific hardware for use in large-scale screening, due to the large number of required computations and exorbitant requirements for both memory storage and / or memory or I/O bandwidth.
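The difficulty of representing a point charge in a truncated basis expansion can be illustrated in a simplified setting. The sketch below uses a one-dimensional Legendre expansion in x = cos θ (the axially symmetric analog of a spherical harmonics expansion), not the radial-spherical harmonics expansion of Ritchie et al. It compares the coefficients of a unit point charge at the pole with those of a smooth unit-integral density proportional to exp(κ(x - 1)); the smooth density and the value κ = 10 are assumptions made purely for comparison.

```python
import math

def legendre(l, x):
    """Legendre polynomial P_l(x) via the three-term recurrence."""
    p_prev, p = 1.0, x
    if l == 0:
        return p_prev
    for n in range(1, l):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p

def delta_coefficient(l):
    """Coefficient a_l = (2l+1)/2 * P_l(1) for a unit point charge at x = 1."""
    return (2 * l + 1) / 2.0

def smooth_coefficient(l, kappa, n=20000):
    """Coefficient a_l for the unit-integral density proportional to
    exp(kappa * (x - 1)), computed with a midpoint rule on [-1, 1]."""
    h = 2.0 / n
    xs = [-1.0 + (i + 0.5) * h for i in range(n)]
    norm = sum(math.exp(kappa * (x - 1.0)) for x in xs) * h
    integral = sum(math.exp(kappa * (x - 1.0)) * legendre(l, x) for x in xs) * h
    return (2 * l + 1) / 2.0 * integral / norm

# Ratio of smooth to point-charge coefficients at a few expansion orders.
ratios = {l: smooth_coefficient(l, 10.0) / delta_coefficient(l) for l in (0, 5, 20)}
for l, ratio in ratios.items():
    print(f"l = {l:2d}: point-charge a_l = {delta_coefficient(l):5.1f}, smooth/point ratio = {ratio:.2e}")
```

The point-charge coefficients keep growing with the order l, so no finite truncation captures the distribution, whereas the coefficients of the smooth density fall off rapidly relative to them. This is the sense in which sharp, delta-like charges demand a very large expansion order.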
In summary, it is desirable in the drug discovery process to identify quickly and efficiently the optimal configurations, i.e., binding modes, of two molecules or parts of molecules. Efficiency is especially relevant in the lead generation and lead optimization stages for a drug discovery pipeline, where it may be desirable to accurately predict the binding mode and binding affinity for possibly millions of potential target-ligand molecular combinations, before submitting promising candidates to further analysis. There is a clear need then to have more efficient systems and methods for computational modeling of the molecular combinations with reasonable accuracy.
In general, the present invention relates to an effici«it computational method an analysis of molecular combinations based on maximization of electrostatic affinity (i.e., minimization of electrostatic energy relative to an isolated reference state) over a set of configurations of a molecular combination through computation of a basis expansion representing charge density and electrostatic potential functions associated with the molecules in a coordinate system. Here, the analysis ofthe molecular combination may involve the prediction of likelihood of formation of a potential molecular complex, the prediction of the binding mode (or even additional alternative modes) for the combination, the characterization of the nature of the interaction or binding of various components of the molecular combination, or even an approximation of binding affinity for the molecular combination based on an electrostatic affinity score or an equivalent measure. The teadiing of this disclosure might also be used in conjunction with other methods for computation of shape complemratarity, including the disclosure described in Kita 11, in ordex to generate a composite affinity score reflecting both shape complementarity and electrostatic affinity, for one or more configurations of a molecular combination. The invention also addresses and solves various hurdles and bottlenecks associated with efficient hardware implementation of the invention.
REFERENCES AND PRIOR ART
Prior art in the field of the current invention is heavily documented; the following attempts to summarize it.
Drews [1] provides a good overview of the current state of drug discovery. In [2], Abagyan and Totrov show the state of high-throughput docking and scoring and its applications. Lamb et al. [3] further teach a general approach to the design, docking, and virtual screening of multiple combinatorial libraries against a family of proteins. Finally, Waszkowycz et al. [4] describe the use of multiple computers to accelerate virtual screening of a large ligand library against a specific target by assigning groups of ligands to specific computers.
[1] J. Drews, "Drug Discovery: A Historical Perspective", Science 287, 1960-1964 (2000).
[2] Ruben Abagyan and Maxim Totrov, "High-throughput docking for lead generation", Current Opinion in Chemical Biology 2001, 5:375-382.
[3] Lamb, M. L.; Burdick, K. W.; Toba, S.; Young, M. M.; Skillman, A. G.; et al., "Design, docking, and evaluation of multiple libraries against multiple targets", Proteins 2001, 42, 296-318.
[4] Waszkowycz, B., Perkins, T. D. J., Sykes, R. A., Li, J., "Large-scale virtual screening for discovering leads in the post-genomic era", IBM Systems Journal, Vol. 40, No. 2 (2001).
There are a number of examples of software tools currently used to perform docking simulations. These methods involve a wide range of computational techniques, including use of a) rigid-body pattern-matching algorithms, either based on surface correlations, use of geometric hashing, pose clustering, or graph pattern-matching; b) fragmental-based methods, including incremental construction or "place and join" operators; c) stochastic optimization methods including use of Monte Carlo, simulated annealing, or genetic (or memetic) algorithms; d) molecular dynamics simulations; or e) hybrid strategies derived thereof.
The earliest docking software tool was a graph-based rigid-body pattern-matching algorithm called DOCK [6], developed at UCSF back in 1982 (v1.0) and now up to v5.0 (with extensions to include incremental construction). Other examples of graph-based pattern-matching algorithms include CLIX (which in turn uses GRID), FLOG and LIGIN.
[5] Shoichet, B. K., Bodian, D. L., and Kuntz, I. D., "Molecular docking using shape descriptors", J. Comp. Chem., Vol. 13, No. 3, 380-397 (1992).
[6] Meng, E. C., Gschwend, D. A., Blaney, J. M., and Kuntz, I. D., "Orientational sampling and rigid-body minimization in molecular docking", Proteins: Structure, Function, and Genetics, Vol. 17, 266-278 (1993).
[7] Ewing, T. J. A. and Kuntz, I. D., "Critical Evaluation of Search Algorithms for Automated Molecular Docking and Database Screening", J. Computational Chemistry, Vol. 18, No. 9, 1175-1189 (1997).
[8] Lawrence, M. C. and Davis, P. C., "CLIX: A Search Algorithm for Finding Novel Ligands Capable of Binding Proteins of Known Three-Dimensional Structure", Proteins, Vol. 12, 31-41 (1992).
[9] Kastenholz, M. A., Pastor, M., Cruciani, G., Haaksma, E. E. J., Fox, T., "GRID/CPCA: A new computational tool to design selective ligands", J. Medicinal Chemistry, Vol. 43, 3033-3044 (2000).
[10] Miller, M. D., Kearsley, S. K., Underwood, D. J. and Sheridan, R. P., "FLOG: a system to select 'quasi-flexible' ligands complementary to a receptor of known three-dimensional structure", J. Computer-Aided Molecular Design, Vol. 8, No. 2, 153-174 (1994).
[11] Sobolev, V., Wade, R. C., Vriend, G. and Edelman, M., "Molecular docking using surface complementarity", Proteins, Vol. 25, 120-129 (1996).
Other rigid-body pattern-matching docking software tools include the shape-based correlation methods of FTDOCK and HEX [13], the geometric hashing of Fischer et al., or the pose clustering of Rarey et al.
[12] Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A. A., Aflalo, C., and Vakser, I. A., "Molecular surface recognition: Determination of geometric fit between proteins and their ligands by correlation techniques", Proceedings of the National Academy of Sciences of the United States of America, Vol. 89, No. 6, 2195-2199 (1992).
[13] Ritchie, D. W. and Kemp, G. J. L., "Fast Computation, Rotation, and Comparison of Low Resolution Spherical Harmonic Molecular Surfaces", J. Computational Chemistry, Vol. 20, No. 4, 383-395 (1999).
[14] Fischer, D., Norel, R., Wolfson, H. and Nussinov, R., "Surface motifs by a computer vision technique: searches, detection, and implications for protein-ligand recognition", Proteins, Vol. 16, 278-292 (1993).
[15] Rarey, M., Wefing, S., and Lengauer, T., "Placement of medium-sized molecular fragments into active sites of proteins", J. Computer-Aided Molecular Design, Vol. 10, 41-54 (1996).
In general, rigid-body pattern-matching algorithms assume that both the target and ligand are rigid (i.e., not flexible) and hence may be appropriate for docking small, rigid molecules (or molecular fragments) to a simple protein with a well-defined, nearly rigid active site. Thus this class of docking tools may be suitable for de novo ligand design, combinatorial library design, or straightforward rigid-body screening of a molecule library containing multiple conformers per ligand.
Incremental construction based docking software tools include FlexX from Tripos (licensed from EMBL), Hammerhead, DOCK v4.0 (as an option), and the nongreedy, backtracking algorithm of Leach et al. Programs using incremental construction in the context of de novo ligand design include LUDI [20] (from Accelrys) and GrowMol. Docking software tools based on "place and join" strategies include DesJarlais et al.
[16] Kramer, B., Rarey, M. and Lengauer, T., "Evaluation of the FlexX incremental construction algorithm for protein-ligand docking", Proteins, Vol. 37, 228-241 (1999).
[17] Rarey, M., Kramer, B., Lengauer, T., and Klebe, G., "A Fast Flexible Docking Method Using An Incremental Construction Algorithm", J. Mol. Biol., Vol. 261, 470-489 (1996).
[18] Welch, W., Ruppert, J. and Jain, A. N., "Hammerhead: Fast, fully automated docking of flexible ligands to protein binding sites", Chemistry & Biology, Vol. 3, 449-462 (1996).
[19] Leach, A. R., Kuntz, I. D., "Conformational Analysis of Flexible Ligands in Macromolecular Receptor Sites", J. Comp. Chem., Vol. 13, 730-748 (1992).
[20] Bohm, H. J., "The computer program LUDI: a new method for the de novo design of enzyme inhibitors", J. Computer-Aided Molecular Design, Vol. 6, 61-78 (1992).
[21] Bohacek, R. S. and McMartin, C., "Multiple Highly Diverse Structures Complementary to Enzyme Binding Sites: Results of Extensive Application of a de Novo Design Method Incorporating Combinatorial Growth", J. American Chemical Society, Vol. 116, 5560-5571 (1994).
[22] DesJarlais, R. L., Sheridan, R. P., Dixon, J. S., Kuntz, I. D., and Venkataraghavan, R., "Docking Flexible Ligands to Macromolecular Receptors by Molecular Shape", J. Med. Chem., Vol. 29, 2149-2153 (1986).
Incremental construction algorithms may be used to model docking of flexible ligands to a rigid target molecule with a well-characterized active site. They may be used when screening a library of flexible ligands against one or more targets. They are often comparatively less compute intensive, yet consequently less accurate, than many of their stochastic optimization based competitors. However, even FlexX may take on the order of < 1-2 minutes to process one target-ligand combination and thus may still be computationally onerous depending on the size of the library (e.g., tens of millions or more compounds). Recently FlexX was extended to FlexE [23] to attempt to account for partial flexibility of the target molecule's active site via use of user-defined ensembles of certain active site rotamers.
[23] Claussen, H., Buning, C., Rarey, M., and Lengauer, T., "FlexE: Efficient Molecular Docking Considering Protein Structure Variations", J. Molecular Biology, Vol. 308, 377-395 (2001).
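To put the throughput concern raised above for incremental construction tools in concrete terms, the following back-of-the-envelope sketch converts a per-combination docking time into total single-processor time. The library size of ten million and the bounds of 1 and 2 minutes are illustrative values read from the text; the function name `cpu_days` is a hypothetical helper.

```python
def cpu_days(n_combinations, minutes_each):
    """Total single-processor time, in days, to dock every combination once."""
    return n_combinations * minutes_each / (60 * 24)

library_size = 10_000_000  # "tens of millions" taken at its low end
for minutes in (1, 2):
    days = cpu_days(library_size, minutes)
    print(f"{minutes} min each: {days:,.0f} CPU-days (~{days / 365:.0f} CPU-years)")
```

Even at one minute per combination, a single processor would need roughly 19 years for such a library, which is why a screen of this size must be distributed across many machines or served by a faster method.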
Computational docking software tools based on stochastic optimization include ICM [24] (from MolSoft), GLIDE [25] (from Schrodinger), and LigandFit [26] (from Accelrys), all based on modified Monte Carlo techniques, and AutoDock v.2.5 [27] (from Scripps Institute) based on simulated annealing. Others based on genetic or memetic algorithms include GOLD, DARWIN, and AutoDock v.3.0 [31] (also from Scripps).
[24] Abagyan, R. A., Totrov, M. M., and Kuznetsov, D. N., "Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins", J. Comp. Chem., Vol. 15, 488-506 (1994).
[25] Halgren, T. A., Murphy, R. B., Friesner, R. A., Beard, H. S., Frye, L. L., Pollard, W. T., and Banks, J. L., "Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening", J. Med. Chem., Vol. 47, No. 7, 1750-1759 (2004).
[26] Luty, B. A., Wasserman, Z. R., Stouten, P. F. W., Hodge, C. N., Zacharias, M., and McCammon, J. A., "Molecular Mechanics/Grid Method for the Evaluation of Ligand-Receptor Interactions", J. Comp. Chem., Vol. 16, 454-464 (1995).
[27] Goodsell, D. S. and Olson, A. J., "Automated Docking of Substrates to Proteins by Simulated Annealing", Proteins: Structure, Function, and Genetics, Vol. 8, 195-202 (1990).
[28] Jones, G., Willett, P. and Glen, R. C., "Molecular Recognition of Receptor Sites using a Genetic Algorithm with a Description of Desolvation", J. Mol. Biol., Vol. 245, 43-53 (1995).
[29] Jones, G., Willett, P., Glen, R. C., Leach, A. R. and Taylor, R., "Development and Validation of a Genetic Algorithm for Flexible Docking", J. Mol. Biol., Vol. 267, 727-748 (1997).
[30] Taylor, J. S. and Burnett, R. M., Proteins, Vol. 41, 173-191 (2000).
[31] Morris, G. M., Goodsell, D. S., Halliday, R. S., Huey, R., Hart, W. E., Belew, R. K. and Olson, A. J., "Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function", J. Comp. Chem., Vol. 19, 1639-1662 (1998).
Stochastic optimization-based methods may be used to model docking of flexible ligands to a target molecule. They generally use a molecular-mechanics based formulation of the affinity function and employ various strategies to search for one or more favorable system energy minima. They are often more compute intensive, yet also more robust, than their incremental construction competitors. As they are stochastic in nature, different runs or simulations may often result in different predictions. Traditionally most docking software tools using stochastic optimization assume the target to be nearly rigid (i.e., hydrogen bond donor and acceptor groups in the active site may rotate), since otherwise the combinatorial complexity increases rapidly, making the problem difficult to solve robustly in reasonable time.
Molecular dynamics simulations have also been used in the context of computational modeling of target-ligand combinations. A molecular dynamics simulation refers to a simulation method devoted to the calculation of the time-dependent behavior of a molecular system in order to investigate the structure, dynamics and thermodynamics of molecular systems. Examples include the implementations presented in Di Nola et al. [32], Mangoni et al. [33] and Luty et al. [26] (along with Monte Carlo). In principle, molecular dynamics simulations may be able to model protein flexibility to an arbitrary degree. On the other hand, they may also require evaluation of many fine-grained time steps and are thus often very time-consuming (on the order of hours or even days per target-ligand combination). They also often require user interaction for selection of valid trajectories. Use of molecular dynamics simulations in lead discovery is therefore more suited to local minimization of predicted complexes featuring a small number of promising lead candidates.
[32] Di Nola, A., Berendsen, H. J. C., and Roccatano, D., "Molecular Dynamics Simulation of the Docking of Substrates to Proteins", Proteins, Vol. 19, 174-182 (1994).
[33] Mangoni, M., Roccatano, D., and Di Nola, A., "Docking of Flexible Ligands to Flexible Receptors in Solution by Molecular Dynamics Simulation", Proteins: Structure, Function, and Genetics, Vol. 35, 153-162 (1999).
Hybrid methods may involve use of rigid-body pattern matching techniques for fast screening of selected low-energy ligand conformations, followed by Monte Carlo torsional optimization of surviving poses, and finally even molecular dynamics refinement of a few choice ligand structures in combination with a (potentially) flexible protein active site. An example of this type of docking software strategy is Wang et al. [34].
[34] Wang, J., Kollman, P. A. and Kuntz, I. D., "Flexible ligand docking: A multistep strategy approach", Proteins, Vol. 36, 1-19 (1999).
A review discussing the importance of electrostatic interactions in biology and chemistry can be found in Honig et al. [35]. When modeling electrostatic interactions between molecules in an environment, especially in the context of molecular dynamics simulations or other molecular-mechanics-based methods, the assignment of charges (full or partial) to various molecular components must be addressed. An example of a commonly used, and classically derived, method for assignment of partial charges is PARSE, described in Sitkoff et al. [36]. Examples of commonly used quantum mechanical based software packages for the assignment of partial charges are MOPAC [37] and GAMESS [38].
[35] Honig, B., and Nicholls, A., "Classical Electrostatics in Biology and Chemistry", Science, Vol. 268, 1144-1148 (1995).
[36] Sitkoff, D., Sharp, K. A., and Honig, B., "Accurate Calculation of Hydration Free Energies Using Macroscopic Solvent Models", J. Phys. Chem., Vol. 98, 1978-1988 (1994).
[37] J. J. P. Stewart, "MOPAC: A General Molecular Orbital Package", Quantum Chemistry Program Exchange, Vol. 10, No. 86 (1990).
[38] Schmidt, M. W., Baldridge, K. K., Boatz, J. A., Elbert, S. T., Gordon, M. S., Jensen, J. H., Koseki, S., Matsunaga, N., Nguyen, K. A., Su, S., Windus, T. L., Dupuis, M., Montgomery, J. A., "General atomic and molecular electronic structure system", J. Comput. Chem., Vol. 14, 1347-1363 (1993).
Partial charges may also be assigned to each covalently bound or other electrically neutral atom of a molecule as per a molecular mechanics all-atom force field, especially for macromolecules such as proteins, DNA/RNA, etc. Such force fields may be used to assign various other atomic, bond, and/or other chemical or physical descriptors associated with components of molecules including, but not limited to, such items as vdW radii, solvation dependent parameters, and equilibrium bond constants. Examples of such force fields include AMBER [39][40], OPLS [41], MMFF [42], CHARMM [43], and the general-purpose Tripos force field [44] of Clark et al.
[39] Pearlman, D. A., Case, D. A., Caldwell, J. C., Ross, W. S., Cheatham III, T. E., Ferguson, D. M., Seibel, G. L., Singh, U. C., Weiner, P., Kollman, P. A., AMBER 4.1, University of California, San Francisco (1995).
[40] Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, K. M., Ferguson, D. M., Spellmeyer, D. C., Fox, T., Caldwell, J. W., Kollman, P. A., "A second-generation force field for the simulation of proteins, nucleic acids, and organic molecules", J. American Chemical Society, Vol. 117, 5179-5197 (1995).
[41] Jorgensen, W. L., and Tirado-Rives, J., J. American Chemical Society, Vol. 110, 1657-1666 (1988).
[42] Halgren, T. A., "Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94", J. Comp. Chem., Vol. 17, 490-519 (1996).
[43] Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S. and Karplus, M., "CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations", J. Comp. Chem., Vol. 4, 187-217 (1983).
[44] Clark, M., Cramer, R. D., Opdenbosch, N. V., "Validation of the General Purpose Tripos 5.2 Force Field", J. Comp. Chem., Vol. 10, 982-1012 (1989).
A discussion of the calculation of total electrostatic energies involved in the formation of a potential molecular complex can be found in Gilson et al. [45]. Computational solutions of electrostatic potentials in the classical regime range from simpler formulations, like those involving distance-dependent dielectric functions [46], to more complex formulations, like those involving solution of the Poisson-Boltzmann equation [47][48], a second order, generally nonlinear, elliptic partial differential equation. A review of numerical solvers for second order partial differential equations with appropriate boundary conditions can be found in Press et al. [49] and Arfken et al. [50].
Other classical formalisms that attempt to model electrostatic desolvation include those based on the Generalized Born solvation model [51][52], methods that involve representation of reaction field effects via additional solvent accessible or fragmental volume terms [53][54], or explicit representation of solvent in the context of molecular dynamics simulations [55][56][57]. A lengthy review of full quantum mechanical treatment of electrostatic interactions can be found in Labanowski et al. [58].
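As a minimal sketch of the simplest classical formulation mentioned above, the following computes a pairwise Coulomb interaction energy between two sets of point charges using a distance-dependent dielectric function. The form eps(r) = 4r is one commonly used choice and is an assumption here, as are the function and variable names; the constant 332.0637 converts e^2/Angstrom to kcal/mol.

```python
import math

COULOMB_KCAL = 332.0637  # kcal*Angstrom/(mol*e^2)

def coulomb_energy(atoms_a, atoms_b, dielectric=lambda r: 4.0 * r):
    """Sum q_i*q_j / (eps(r)*r) over all pairs between two molecules.
    Each atom is (x, y, z, charge); distances in Angstrom, charges in e.
    The default eps(r) = 4r is an assumed, illustrative dielectric model."""
    total = 0.0
    for xa, ya, za, qa in atoms_a:
        for xb, yb, zb, qb in atoms_b:
            r = math.dist((xa, ya, za), (xb, yb, zb))
            total += COULOMB_KCAL * qa * qb / (dielectric(r) * r)
    return total

ligand = [(0.0, 0.0, 0.0, +1.0)]
target = [(2.0, 0.0, 0.0, -1.0)]
screened = coulomb_energy(ligand, target)                 # eps(r) = 4r
vacuum = coulomb_energy(ligand, target, lambda r: 1.0)    # eps = 1
print(screened, vacuum)
```

The distance-dependent form damps the interaction relative to vacuum without solving any differential equation, which is what makes it cheap; it does not capture the molecular shape effects that a Poisson-Boltzmann treatment accounts for.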
[45] Gilson, M. K., and Honig, B., "Calculation of the Total Electrostatic Energy of a Macromolecular System: Solvation Energies, Binding Energies, and Conformational Analysis", Proteins, Vol. 4, 7-18 (1988).
[46] Mehler, E. L. and Solmajer, T., "Electrostatic effects in proteins: comparison of dielectric and charge models", Protein Engineering, Vol. 4, 903-910 (1991).
[47] Holst, M., Baker, N., and Wang, F., "Adaptive Multilevel Finite Element Solution of the Poisson-Boltzmann Equation I. Algorithms and Examples", J. Comp. Chem., Vol. 21, No. 15, 1319-1342 (2000).
[48] Nicholls, A., and Honig, B., "A Rapid Finite Difference Algorithm, Utilizing Successive Over-Relaxation to Solve Poisson-Boltzmann Equation", J. Comp. Chem., Vol. 12, No. 4, 435-445 (1991).
[49] Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T., "Numerical Recipes in C: The Art of Scientific Computing", Cambridge University Press (1993).
[50] Arfken, G. B., and Weber, H. J., "Mathematical Methods for Physicists", Harcourt / Academic Press (lOWi).
[51] Still, W. C., Tempczyk, A., Hawley, R. C. and Hendrickson, T., "A General Treatment of Solvation for Molecular Mechanics", J. Am. Chem. Soc., Vol. 112, 6127-6129 (1990).
[52] Ghosh, A., Rapp, C. S., and Friesner, R. A., "A Generalized Born Model Based on Surface Integral Formulation", J. Physical Chemistry B, Vol. 102, 10983-10 (1998).
[53] Eisenberg, D., and McLachlan, A. D., "Solvation Energy in Protein Folding and Binding", Nature, Vol. 31, 3086 (1986).
[54] Privalov, P. L., and Makhatadze, G. I., "Contribution of hydration to protein folding thermodynamics", J. Mol. Biol., Vol. 232, 660-679 (1993).
[55] Bash, P., Singh, U. C., Langridge, R., and Kollman, P., "Free Energy Calculation by Computer Simulation", Science, Vol. 236, 564 (1987).
[56] Jorgensen, W. L., Briggs, J. M., and Contreras, M. L., "Relative Partition Coefficients for Organic Solutes from Fluid Simulations", J. Phys. Chem., Vol. 94, 1683-1686 (1990).
[57] Jackson, R. M., Gabb, H. A., and Sternberg, M. J. E., "Rapid Refinement of Protein Interfaces Incorporating Solvation: Application to the Docking Problem", J. Mol. Biol., Vol. 276, 265-285 (1998).
[58] Labanowski and J. Andzelm, editors, "Density Functional Methods in Chemistry", Springer-Verlag, New York (1991).
BRIEF SUMMARY OF THE INVENTION
Aspects of the present invention relate to a method and apparatus for an analysis of molecular combinations featuring two or more molecular subsets, wherein either one or both molecular subsets are from a plurality of molecular subsets selected from a molecule library, based on computation of the electrostatic affinity of the system via utilization of a basis expansion representing charge density and electrostatic potential functions associated with the first and second molecular subsets in a coordinate system. Sets of transformed expansion coefficients are calculated for a sequence of different configurations, i.e., relative positions and orientations, of the first molecular subset and the second molecular subset using coordinate transformations. The sets of transformed expansion coefficients are constructed via the application of translation and rotation operators to a reference set of expansion coefficients. Then an electrostatic affinity, representing a correlation of the charge density and electrostatic potential functions of the first and second molecular subsets, is computed over the sequence of different sampled configurations for the molecular combination, where each sampled configuration differs in both the relative positions and orientations of the first and second molecular subsets. Aspects of the invention will also be discussed relating to its use in conjunction with other methods for computation of shape complementarity, including the method described in Kita II, in determining a composite or augmented score reflecting both electrostatic affinity and shape complementarity for configurations of a molecular combination. Various embodiments of the invention relating to efficient implementation of the invention in the context of a hardware apparatus are also discussed.
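The idea of scoring configurations directly from transformed expansion coefficients can be sketched with a deliberately simplified analogue. The example below is an assumption-laden toy, not the disclosed method: it replaces the three-dimensional basis with a discrete Fourier basis on a one-dimensional periodic grid, and it models only translation (a phase factor on each coefficient), not rotation. The names `rho` (sampled charge density of the first subset), `phi` (sampled potential of the second subset), and all function names are hypothetical.

```python
import cmath
import math

def dft(samples):
    """Reference expansion coefficients of a sampled periodic function."""
    m = len(samples)
    return [sum(samples[j] * cmath.exp(-2j * math.pi * n * j / m) for j in range(m))
            for n in range(m)]

def translate_coeffs(coeffs, shift):
    """Translation operator: shifting by `shift` grid steps multiplies
    coefficient n by a phase factor, with no resampling of the function."""
    m = len(coeffs)
    return [c * cmath.exp(-2j * math.pi * n * shift / m) for n, c in enumerate(coeffs)]

def score_from_coeffs(rho_coeffs, phi_coeffs):
    """Overlap of two real functions computed purely from coefficients (Parseval)."""
    m = len(rho_coeffs)
    return sum(a * b.conjugate() for a, b in zip(rho_coeffs, phi_coeffs)).real / m

def score_direct(rho, phi, shift):
    """Same overlap computed on the grid, for comparison."""
    m = len(rho)
    return sum(rho[j] * phi[(j - shift) % m] for j in range(m))

rho = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0]
phi = [0.5, -0.2, 0.1, 0.0, 0.0, 0.3, 0.0, -0.1]
rho_c, phi_c = dft(rho), dft(phi)   # computed once, reused for every configuration
scores = [score_from_coeffs(rho_c, translate_coeffs(phi_c, s)) for s in range(len(rho))]
print(scores)
```

The design point this illustrates is that the reference coefficients are computed once per molecular subset, after which each sampled configuration costs only an operator application and a sum over coefficients.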
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete appreciation of the invention and many of the advantages thereof will be readily obtained, as the same becomes better understood by reference to the detailed description when considered in connection with the accompanying drawings, wherein:
Figs. 1a and 1b show illustrations of two distinct molecular combinations, with labels for positive and negative charges, both demonstrating high shape complementarity but respectively showing poor and high electrostatic affinity;
Fig. 2 is a block diagram view of an embodiment of a system that utilizes the present invention in accordance with analysis of molecular combinations based on computations of electrostatic affinity over a set of sampled configurations;
Figs. 3a, 3b, and 3c respectively show a "ball and stick" representation of an input pose for a methotrexate molecule, a digital representation in the form of a pdb formatted file, and another digital representation in the form of a mol2 formatted file, both files containing structural and chemical information for the molecule depicted in Fig. 3a;
Fig. 4 shows an illustration of the thermodynamic cycle associated with the formation of a potential molecular complex in a solvent environment as relates to an understanding of the concept of electrostatic affinity for a molecular combination;
Fig. 5 shows a flow diagram of an exemplary method 500 for estimating the electrostatic affinity associated with analysis of a molecular combination, performed in accordance with embodiments of the present invention;
Figs. 6a and 6b show illustrations of two molecular subsets in two different configurations as assessed in accordance with embodiments of the present invention;
Fig. 7 shows a representation of a molecular subset in discrete space for generating discretized charge density functions and electrostatic potential fields, in accordance with embodiments of the present invention;
Fig. 8 illustrates how a 2-D continuous shape is discretized in accordance with embodiments of the present invention;
Fig. 9 shows coordinate-based representations of two molecular subsets in a joint coordinate system, in accordance with embodiments of the present invention;
Figs. 10a, 10b, and 10c show representations of various coordinate systems used in the present invention;
Fig. 11 shows a representation of Euler angles as used in various embodiments of the present invention;
Fig. 12 shows a spherical sampling scheme used in various embodiments of the present invention;
Fig. 13 is an illustration of spherical harmonics functions;
Fig. 14 shows two molecular subsets in various configurations, i.e., having various relative translations and orientations to one another, for computing electrostatic affinity scores, in accordance with embodiments of the present invention;
Fig. 15 shows a flow diagram of a novel and efficient method for computing an electrostatic affinity score for a configuration of a molecular combination based on transformation and combination of basis expansion coefficients for associated charge density functions and electrostatic potential fields in accordance with embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The present invention has many applications, as will be apparent after reading this disclosure. In describing an embodiment of a computational system according to the present invention, only a few of the possible variations are described. Other applications and variations will be apparent to one of ordinary skill in the art, so the invention should not be construed as narrowly as the examples, but rather in accordance with the appended claims.
Embodiments of the invention will now be described, by way of example, not limitation. It is to be understood that the invention is of broad utility and may be used in many different contexts.
A molecular subset is a whole or parts of the components of a molecule, where the components can be single atoms or bonds, groups of atoms and/or bonds, amino acid residues, nucleotides, etc. A molecular subset might include a molecule, a part of a molecule, a chemical compound composed of one or more molecules (or other bio-reactive agents), a protein, one or more subsets or domains of a protein, a nucleic acid, one or more peptides, or one or more oligonucleotides. In another embodiment of the present invention, a molecular subset may also include one or more ions, individual atoms, or whole or parts of other simple molecules such as salts, gas molecules, water molecules, radicals, or even organic compounds like alcohols, esters, ketones, simple sugars, etc. In yet another embodiment, the molecular subset may also include organic molecules, residues, nucleotides, carbohydrates, inorganic molecules, and other chemically active items including synthetic, medicinal, drug-like, or natural compounds.
In yet another embodiment, the molecular subset may already be bound or attached to the target through one or more covalent bonds. In another embodiment the molecular subset may in fact include one or more structural components of the target, such as secondary structure elements that make up a tertiary structure of a protein or subunits of a protein quaternary structure. In another embodiment the molecular subset may include one or more portions of a target molecule, such as protein domains that include the whole or part of an active site, one or more spatially connected subsets of the protein structure that are selected based on proximity to one or more protein residues, or even disconnected protein subsets that feature catalytic or other surface residues that are of interest for various molecular interactions. In another embodiment, the molecular subset may include the whole of or part of an existing molecular complex, meaning a molecular combination between two or more other molecular subsets, as, for example, an activated protein or an allosterically bound protein.
A molecular combination (or combination) is a collection of two or more molecular subsets that may potentially bind, form a molecular complex, or otherwise interact with one another. A combination specifies at the very least the identities of the two or more interacting molecular subsets.
A molecular pose is the geometric state of a molecular subset described by its position and orientation within the context of a prescribed coordinate system. A molecular configuration (or configuration) of a molecular combination represents the joint poses of all constituent molecular subsets of a molecular combination. Different configurations are denoted by different relative positions and orientations of the molecular subsets with respect to one another. Linear coordinate transformations that do not change the relative position or orientation of constituent molecular subsets will not result in different configurations.
For the purposes of the invention, different configurations of a molecular combination are obtained by the application of rigid body transformations, including relative translation and rotation, to one or more molecular subsets. For the purposes of the invention, such rigid body transformations are expected to preserve the conformational structure, as well as the stereochemistry and/or tautomerism (if applicable), of each molecular subset. In regard to the invention it is contemplated that when analyzing distinct conformations or stereoisomers of a molecular subset, each distinct conformation or stereoisomer will appear in a distinct molecular combination, each with its own attendant analysis. In this way, molecular combinations featuring flexible molecular subsets may be better analyzed using the invention based on consideration of multiple combinations comprising distinct conformations and/or stereoisomers.
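The statement that a rigid body transformation changes the pose of a molecular subset while preserving its conformation can be checked with a short sketch. This is a simplified illustration under stated assumptions: the rotation is restricted to the z-axis (a general rotation would need, for example, three Euler angles), the three-atom coordinates are made up, and the function names are hypothetical.

```python
import math

def rigid_transform(coords, angle, offset):
    """Rotate points about the z-axis by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy, dz = offset
    return [(c * x - s * y + dx, s * x + c * y + dy, z + dz) for x, y, z in coords]

def internal_distances(coords):
    """All pairwise interatomic distances, a simple fingerprint of conformation."""
    return [math.dist(p, q) for i, p in enumerate(coords) for q in coords[i + 1:]]

# Hypothetical three-atom ligand in its reference pose (Angstrom).
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3)]
new_pose = rigid_transform(ligand, math.radians(90), (5.0, 0.0, -2.0))

before = internal_distances(ligand)
after = internal_distances(new_pose)
print(max(abs(a - b) for a, b in zip(before, after)))  # numerically negligible
```

Because the internal distances are unchanged, the transformed ligand is a new pose of the same conformation; pairing it with a fixed target pose yields a different configuration of the same molecular combination.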
In many of the forthcoming examples and explanations, the molecular combination will represent the typical scenario of two molecular subsets where a ligand biomolecule (first molecular subset) interacts with a target biomolecule (usually a biopolymer; second molecular subset). Thus in regard to the present invention, an analysis of a molecular combination may seek to determine whether, in what fashion (i.e., binding mode), and/or to what degree, a ligand will interact with a target molecule based on computations of electrostatic affinity of one or more configurations. A detailed discussion of the concept of electrostatic affinity will be forthcoming in the description. It should be understood that, unless otherwise indicated, such examples and explanations could more generally apply to molecular combinations wherein more than two molecular subsets bind or interact with one another, representing the whole of, or portion(s) of, one or more target molecules and/or one or more ligands.
As an example, in one embodiment of the present invention the molecular combination may represent a target interacting with a ligand (i.e., target-ligand pair) where one molecular subset is from the protein and the other the ligand. In a further embodiment, the molecular combination may represent a target-ligand pair where one molecular subset is the entire ligand biomolecule but the other molecular subset is a portion of a target biopolymer containing one or more relevant active sites.
In yet another embodiment, the molecular combination may feature more than two molecular subsets, one representing a target (whole or part) and the other two correspond to two distinct ligands into^acting with the same target at the same time, such as in the case of competitive thermodynamic equilibrium between a possible inhibitor and a natural bindo- of a protein. In yet another embodiment the previous example may be turned around such that the molecular combination features two target molecules in competition with one ligand biomolecule.
As another example, in one embodiment the molecular combination may represent a protein-protein interaction in which there are two molecular subsets, each representing the whole or a relevant portion of one protein. In a further embodiment, the molecular combinations may also represent a protein-protein interaction, but now with potentially more than two molecular subsets, each representing an appropriate protein domain.
As a further example, the molecular combination may feature two molecular subsets representing a target-ligand pair but also additional molecular subsets representing other atoms or molecules (hetero-atoms or hetero-molecules) relevant to the interaction, such as, but not limited to, one or more catalytic or structural metal ions, one or more ordered, bound, or structural water molecules, one or more salt molecules, or even other molecules such as various lipids, carbohydrates, acids, bases, mRNA, ATP/ADP, etc. In yet another embodiment, the molecular combination may feature two molecular subsets representing a target-ligand pair but also one or more added molecular subsets representing a whole or portion of a cell membrane, such as a section of a lipid bi-layer, nuclear membrane, etc., or a whole or portion of an organelle such as a mitochondrion, a ribosome, endoplasmic reticulum, etc.
In another embodiment, the molecular combination may feature two or more molecular subsets, with one or more molecular subsets representing various portions of a molecular complex and another subset representing the ligand interacting with the complex at an unoccupied active site, such as for proteins complexed with an allosteric activator or for proteins containing multiple, distinct active sites.
In another embodiment, the molecular combination may feature two or more molecular subsets representing protein chains or subunits interacting noncovalently as per a quaternary protein structure. In another embodiment, the molecular combination may feature two or more molecular subsets representing protein secondary structure elements interacting as per a tertiary structure of a polypeptide chain, induced for example by protein folding or mutagenesis.
In many of the forthcoming examples and explanations, the molecular combination will represent the typical scenario of a target-ligand pair interacting with one another. As already mentioned in regard to the present invention, an analysis of a molecular combination may seek to determine whether, in what fashion, and/or to what degree or with what likelihood, a ligand will interact with a target molecule based on computations of electrostatic affinity. In another embodiment, the analysis may involve a plurality of molecular combinations, each corresponding to a different ligand, selected, for example, from a molecule library (virtual or otherwise), in combination with the same target molecule, in order to find one or more ligands that demonstrate high electrostatic affinity with the target, and are therefore likely to bind or otherwise react with the target. In such cases, it may be necessary to assign a score or ranking to each analyzed molecular combination based on the estimated maximal electrostatic affinity across a set of different configurations for each combination, in order to achieve relative comparison of relevant predicted bioactivity.
In such a scenario where each target-ligand pair is an individual combination, and if there are N ligands to be tested against one target, then there will be N distinct molecular combinations involved in the analysis. For sufficiently large molecule libraries, it may be necessary to analyze millions or more potential molecular combinations for a single target protein. In yet another embodiment, the analysis may be reversed and the plurality of molecular combinations represents a plurality of target molecules, each in combination with the same ligand biomolecule in the same environment. In other embodiments, the molecular combinations may represent multiple ligands and/or targets reacting simultaneously, i.e., more than just a target-ligand pair, and may also include various heteroatoms or molecules as previously discussed.
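The library-screening scenario described above, in which each of N ligands is scored against one target and ranked by its maximal score over sampled configurations, might be sketched as follows. This is a minimal illustration only: the function name `rank_ligands`, the ligand identifiers, the toy scores, and the convention that a larger score means higher electrostatic affinity are all assumptions made for the example and are not taken from the described system.

```python
from typing import Dict, List, Tuple


def rank_ligands(scores: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank ligands by their best (maximal) score over sampled configurations.

    `scores` maps a hypothetical ligand identifier to the affinity scores of
    its sampled configurations against one target. A larger score is ASSUMED
    to indicate higher electrostatic affinity.
    """
    # One (ligand, best score) entry per molecular combination; ligands with
    # no scored configurations are skipped.
    best = [(ligand, max(values)) for ligand, values in scores.items() if values]
    # Sort descending by best score; ties are broken by ligand name.
    return sorted(best, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy numbers, purely illustrative.
    toy_scores = {
        "ligand_A": [0.2, 1.5, 0.9],
        "ligand_B": [2.1, 0.3],
        "ligand_C": [1.0, 1.1, 1.2],
    }
    for name, score in rank_ligands(toy_scores):
        print(name, score)
```

With N ligands in the input there are N entries in the output, one per molecular combination, which mirrors the counting argument in the preceding paragraph.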
Fig. 2 illustrates a modeling system 200 for the analysis of molecular combinations including computations of electrostatic affinity across a set of configurations for the molecular combination. As shown, a configuration analyzer 202 receives one or more input (or reference) configuration records 206, including relevant structural, chemical, and physical data associated with input structures for both molecular subsets from an input molecular combination database 204. The configuration analyzer 202 comprises a configuration data transformation engine 208 and an electrostatic affinity engine 209. Results from the configuration analyzer 202 are output as configuration results records 211 to a configuration results database 210.
Modeling system 200 may be used to efficiently analyze molecular combinations via computations of electrostatic affinity. In some embodiments, this may include, but is not limited to, prediction of likelihood of formation of a potential molecular complex, or a proxy thereof, the estimation of the binding affinity between molecular subsets in the molecular combination, the prediction of the binding mode (or even additional alternative modes) for the molecular combination, or the rank prioritization of a collection of molecular subsets (e.g., ligands) based on maximal electrostatic affinity with a target molecular subset across sampled configurations of the combination, and would therefore also include usage associated with computational target-ligand docking.
Furthermore, the method provides for performing a dense search in the configurational space of two or more molecular subsets having rigid bodies, that is, assessing relative orientations and translations of the constituent molecular subsets. The method can also be used in conjunction with a process for generating likely yet distinct conformations of one or both molecular subsets, in order to better analyze those molecular combinations where one or both of the molecular subsets are flexible.
In a typical operation, many molecular combinations, each featuring many different configurations, may be analyzed. Since the total possible number of configurations may be enormous, the modeling system 200 may sample a subset of configurations during the analysis procedure according to an appropriate sampling scheme as will be discussed later. However, the sampled subset may still be very large (e.g., millions or even possibly billions of configurations per combination). An electrostatic affinity score is generated for each sampled configuration and the results for one or more configurations recorded in a storage medium.
The molecular combination may then be assessed by examination of the set of configuration results including the corresponding computed electrostatic affinity scores. Once the cycle of computation is complete for one molecular combination, modeling of the next molecular combination may ensue. Alternatively, in some embodiments of the modeling system 200, multiple molecular combinations may be modeled in parallel. Likewise, in some embodiments, during modeling of a molecular combination, more than one configuration may be processed in parallel as opposed to simply in sequence.
In one embodiment, modeling system 200 may be implemented on a dedicated microprocessor, ASIC, or FPGA. In another embodiment, modeling system 200 may be implemented on an electronic or system board featuring multiple microprocessors, ASICs, or FPGAs. In yet another embodiment, modeling system 200 may be implemented on or across multiple boards housed in one or more electronic devices. In yet another embodiment, modeling system 200 may be implemented across multiple devices containing one or more microprocessors, ASICs, or FPGAs on one or more electronic boards and the devices connected across a network.
In some embodiments, modeling system 200 may also include one or more storage media devices for the storage of various required data elements used in or produced by the analysis. Alternatively, in some other embodiments, some or all of the storage media devices may be externally located but networked or otherwise connected to the modeling system 200. Examples of external storage media devices may include one or more database servers or file systems. In some embodiments involving implementations featuring one or more boards, the modeling system 200 may also include one or more software processing components in order to assist the computational process. Alternatively, in some other embodiments, some or all of the software processing components may be externally located but networked or otherwise connected to the modeling system 200.
In some embodiments, results records from database 210 may be further subjected to a configuration selector 212 during which one or more configurations may be selected based on various results criteria and then resubmitted to the configuration analyzer 202 (possibly under different operational conditions) for further scrutiny (i.e., a feedback cycle). In such embodiments, the molecular configurations are transmitted as inputs to the configuration analyzer 202 in the form of selected configuration records 214. In another embodiment, the configuration selector 212 may examine the results records from database 210 and construct other configurations to be subsequently modeled by configuration analyzer 202. For example, if the configuration analyzer modeled ten target-ligand configurations for a given target-ligand pair and two of the configurations had substantially higher estimated electrostatic affinity than the other eight, then the configuration selector 212 may generate further additional configurations that are highly similar to the top two high-scoring configurations and then schedule the new configurations for processing by configuration analyzer 202.
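The feedback cycle in the ten-configuration example above might be sketched as follows. This is only one possible reading: the representation of a configuration as a tuple of pose parameters, the uniform random perturbation used to generate "highly similar" configurations, and the names `select_and_refine`, `top_k`, `neighbors_per_config`, and `step` are all assumptions introduced for illustration.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical pose parameters, e.g. translations and rotation angles.
Config = Tuple[float, ...]


def select_and_refine(
    results: Sequence[Tuple[Config, float]],
    top_k: int = 2,
    neighbors_per_config: int = 3,
    step: float = 0.1,
    seed: int = 0,
) -> List[Config]:
    """Pick the top_k highest-scoring configurations and propose nearby ones.

    `results` holds (configuration, score) pairs; a larger score is ASSUMED
    to mean higher estimated electrostatic affinity.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    new_configs: List[Config] = []
    for config, _score in ranked[:top_k]:
        for _ in range(neighbors_per_config):
            # Perturb every pose parameter by a small uniform offset.
            new_configs.append(
                tuple(x + rng.uniform(-step, step) for x in config)
            )
    return new_configs
```

The returned list plays the role of the newly scheduled configurations; in the system described they would be passed back to the configuration analyzer for scoring.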
In some embodiments, once analysis of a molecular combination is completed (i.e., all desired configurations assessed) a combination postprocessor 216 may be used to select one or more configuration results records from database 210 in order to generate one or more either qualitative or quantitative measures for the combination, such as a combination score, a combination summary, a combination grade, etc., and the resultant combination measures are then stored in a combination results database 218. In one embodiment, the combination measure may reflect the configuration record stored in database 210 with the best-observed electrostatic affinity. In another embodiment, multiple configurations with high electrostatic affinity are submitted to the combination postprocessor 216 and a set of combination measures written to the combination results database 218. In another embodiment, the selection of multiple configurations for use by the combination postprocessor 216 may involve one or more thresholds or other decision-based criteria.
In a further embodiment, the combination measures output to the combination results database 218 are based on various statistical analyses of a sampling of possibly a large number of configuration results records stored in database 210. In other embodiments, the selection sampling itself may be based on statistical methods (e.g., principal component analysis, multidimensional clustering, multivariate regression, etc.) or on pattern-matching methods (e.g., neural networks, support vector machines, etc.).
In another embodiment, the combination postprocessor 216 may be applied dynamically (i.e., on-the-fly) to the configuration results database 210 in parallel with the analysis of the molecular combination as configuration results records become available. In yet another embodiment, the combination postprocessor 216 may be used to rank different configurations in order to store a sorted list of either all or a subset of the configurations stored in database 210 that are associated with the combination in question. In yet other embodiments, once the final combination results records, reflecting the complete analysis of the molecular combination by the configuration analyzer 202, have been stored in database 218, some or all of the configuration records in database 210 may be removed or deleted in order to conserve storage in the context of a library screen involving possibly many different molecular combinations. Alternatively, some form of garbage collection may be used in other embodiments to dynamically remove poor configuration results records from database 210.
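One way to picture the on-the-fly ranking and the dynamic removal of poor configuration results records is a bounded collection that keeps only the best records seen so far. The sketch below uses a min-heap for this; the class name `TopResults`, the integer configuration identifiers, and the fixed-capacity eviction policy are assumptions chosen for illustration rather than a description of how database 210 is actually managed.

```python
import heapq
from typing import List, Tuple


class TopResults:
    """Keep only the `capacity` best-scoring configuration results seen so far.

    A larger score is ASSUMED to mean higher electrostatic affinity.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Min-heap of (score, config_id); the worst retained record is on top.
        self._heap: List[Tuple[float, int]] = []

    def add(self, config_id: int, score: float) -> None:
        """Offer a new result; it is kept only if it beats the current worst."""
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (score, config_id))
        elif score > self._heap[0][0]:
            # Evict the current worst record in favor of the better one.
            heapq.heapreplace(self._heap, (score, config_id))

    def ranked(self) -> List[Tuple[int, float]]:
        """Return (config_id, score) pairs sorted best first."""
        return [(cid, s) for s, cid in sorted(self._heap, reverse=True)]
```

Because each insertion costs O(log capacity), such a structure can be updated as results records become available, without first storing every sampled configuration.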
In one embodiment, the molecular combination record database 204 may comprise one or more molecule records databases (e.g., flat file, relational, object oriented, etc.) or file systems and the configuration analyzer 202 receives an input molecule record corresponding to an input structure for each molecular subset of the combination. In another embodiment, when modeling target protein-ligand molecular combinations, the molecular combination record database 204 is replaced by an input target record database and an input ligand (or drug candidate) record database. In a further embodiment, the input target molecular records may be based on either experimentally derived (e.g., X-ray crystallography, NMR, etc.), energy minimized, or model-built 3-D protein structures. In another embodiment, the input ligand molecular records may reflect energy minimized or randomized 3-D structures or other 3-D structures converted from a 2-D chemical representation, or even a sampling of low energy conformers of the ligand in isolation. In yet another embodiment, the input ligand molecular records may correspond to naturally existing compounds or even to virtually generated compounds, which may or may not be synthesizable.
In order to better illustrate an example of an input structure and the associated input molecule record(s) that may form an input configuration record submitted to configuration analyzer 202 we refer the reader to Figs. 3a, 3b, and 3c.
Fig. 3a shows a "ball-and-stick" rendering of a pose 305 of a methotrexate molecule 300 with chemical formula C20H22N8O5. The depicted molecular subset consists of a collection of atoms 320 and bonds 330. The small, black atoms, as indicated by item 313, represent carbon atoms. The tiny, white atoms, as indicated by item 316, represent hydrogen atoms, whereas the slightly larger dark atoms (item 310) are oxygen atoms and the larger white atoms (item 329) are nitrogen atoms. Continuing in Fig. 3a, item 323 denotes a circle containing a benzene ring (C6H4), and item 325 a circle containing a carboxyl group (COO⁻), and item 327 another circle containing a methyl group (CH3). Item 333 denotes a covalent bond connecting the benzene ring 323 to the ester group that includes the methyl group 327. Item 335 denotes a covalent bond connecting the carbon atom 313 to the carboxyl group 325. Lastly item 337 denotes a covalent bond connecting the methyl group 327 to a nitrogen atom 329.
Fig. 3b shows a PDB file representation 340 of a chemical structure for the methotrexate ligand pose described in Fig. 3a, including a general header 350, a section 360 composed of atom type and coordinate information, and a section 365 regarding bond connectivity information. The header section 350 may contain any annotation or other information desired regarding the identity, source, or characteristics of the molecular subset and its conformation and/or stereochemistry. Section 360 shows a list of all 33 nonhydrogen atoms of methotrexate and for each atom it includes a chemical type (e.g., atomic element) and three spatial coordinates. For instance, the line for atom 6 shows that it is a nitrogen atom with name NA4 in a compound (or residue if a protein) named MTX in chain A with compound (or residue) ID of 1 and with (x, y, z) coordinates (20.821, 57.440, 21.075) in a specified Cartesian coordinate system. Note that the compound or residue name field may be more relevant for amino or nucleic acid residues in biopolymers.
Section 365 of the PDB file 340, sometimes called the connect record of a PDB file, describes a list of the bonds associated with each atom. For instance, the first line of this section shows that atom 1 is bonded to atoms (2) and (12), whereas the second line shows that atom 2 is bonded to atoms (1), (3), and (4). Notice also how in this example hydrogens are missing and as such the bond connections for each atom may not be complete. Of course, completed variants of the PDB file representation are possible if the positions of hydrogen atoms are already specified, but in many cases where the chemical structure originates from experimental observations the positions of hydrogens may be very uncertain or missing altogether.
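To make the structure of sections 360 and 365 concrete, the following minimal sketch reads atom and connectivity records from PDB-formatted text. It assumes the standard fixed-column layout of ATOM/HETATM and CONECT records; the names `Atom` and `parse_pdb` are hypothetical, and the sketch ignores every other record type as well as fields such as occupancy and temperature factor.

```python
from typing import Dict, List, NamedTuple, Tuple


class Atom(NamedTuple):
    serial: int      # atom index, e.g. 6
    name: str        # atom name, e.g. "NA4"
    res_name: str    # compound or residue name, e.g. "MTX"
    chain: str       # chain identifier, e.g. "A"
    res_id: int      # compound or residue ID, e.g. 1
    x: float
    y: float
    z: float


def parse_pdb(text: str) -> Tuple[Dict[int, Atom], Dict[int, List[int]]]:
    """Parse ATOM/HETATM and CONECT records, assuming standard PDB columns."""
    atoms: Dict[int, Atom] = {}
    bonds: Dict[int, List[int]] = {}
    for line in text.splitlines():
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            atom = Atom(
                serial=int(line[6:11]),
                name=line[12:16].strip(),
                res_name=line[17:20].strip(),
                chain=line[21:22].strip(),
                res_id=int(line[22:26]),
                x=float(line[30:38]),
                y=float(line[38:46]),
                z=float(line[46:54]),
            )
            atoms[atom.serial] = atom
        elif record == "CONECT":
            # Serial numbers sit in consecutive 5-character fields.
            fields = [line[i:i + 5].strip() for i in range(6, len(line), 5)]
            serials = [int(f) for f in fields if f]
            if serials:
                # First serial is the atom, the rest are its bonded partners.
                bonds.setdefault(serials[0], []).extend(serials[1:])
    return atoms, bonds
```

Applied to a file like the one in Fig. 3b, the first returned mapping would hold one entry per nonhydrogen atom and the second would map each atom index to the indices of the atoms it is bonded to.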
Fig. 3c shows a Tripos mol2 file containing various additional chemical descriptors above and beyond the information shown in the PDB file in Fig. 3b. Column 370 lists an index for each atom; column 373 lists an atom name (may be nonunique) for each atom; columns 375, 377, and 379 respectively list x, y, z coordinates for each atom in an internal coordinate system; column 380 lists a SYBYL atom type according to the Tripos force field [44] for each atom that codifies information for hybridization states, chemical type, bond connectivity, hydrogen bond capacity, aromaticity, and in some cases chemical group; and columns 382 and 385 list a residue ID and a residue name for each atom (relevant for proteins, nucleic acids, etc.). Section 390 lists all bonds in the molecular subset. Column 391 lists a bond index for each bond; columns 392 and 393 the atom indices of the two atoms connected by the bond; and column 395 the bond type, which may be single, double, triple, delocalized, amide, aromatic, or other specialized covalent bonds. In other embodiments such information may also represent noncovalent bonds such as salt bridges or hydrogen bonds. In this example, notice how the hydrogen atoms have now been included.
In one embodiment the configuration data transformation engine 208 may directly transform one or more input molecular configurations into one or more other new configurations by application of various rigid body transformations. In other embodiments, the configuration data transformation engine 208 may instead apply rigid body transformations to sets of basis expansion coefficients representing charge density and electrostatic potential functions associated with reference poses for each molecular subset as will be discussed in more detail later in the technical description. In some embodiments, the set of configurations visited during the course of an analysis of a molecular combination may be determined according to a schedule or sampling scheme specified in accordance with a search of the permitted configuration space for the molecular combination.
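A rigid body transformation applied directly to structural coordinates, as in the first embodiment above, might look like the following sketch. It uses the Rodrigues axis-angle formula for the rotation and then adds a translation; the function names `rotation_matrix` and `transform` and the axis-angle parameterization are assumptions for illustration. The sketch does not cover the alternative embodiment in which basis expansion coefficients are transformed.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rotation_matrix(axis: Vec3, angle: float) -> List[List[float]]:
    """Rodrigues rotation matrix for `angle` radians about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("rotation axis must be non-zero")
    ux, uy, uz = (a / norm for a in axis)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + ux * ux * t, ux * uy * t - uz * s, ux * uz * t + uy * s],
        [uy * ux * t + uz * s, c + uy * uy * t, uy * uz * t - ux * s],
        [uz * ux * t - uy * s, uz * uy * t + ux * s, c + uz * uz * t],
    ]


def transform(
    coords: Sequence[Vec3], rot: List[List[float]], shift: Vec3
) -> List[Vec3]:
    """Rotate every coordinate by `rot`, then translate it by `shift`."""
    out: List[Vec3] = []
    for p in coords:
        rotated = [sum(rot[i][j] * p[j] for j in range(3)) for i in range(3)]
        out.append((rotated[0] + shift[0],
                    rotated[1] + shift[1],
                    rotated[2] + shift[2]))
    return out
```

Because the same rotation and translation are applied to every atom, all interatomic distances within the molecular subset are preserved, which is what makes the transformation rigid.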
In some embodiments, whether generated by direct transformation of structural coordinates or by transformation of sets of basis expansion coefficients, the configuration data transformation engine 208 may produce new configurations (or new sets of basis expansion coefficients corresponding to new configurations) sequentially and feed them to the electrostatic affinity engine 209 in a sequential manner, or may instead produce them in parallel and submit them in parallel to the electrostatic affinity engine 209.
[...] $\rho(\vec{r})$, representing a charge distribution. The "electrostatic potential function" generated by a charge distribution represented by $\rho(\vec{r})$, in vacuum, is given by:

$$\Phi(\vec{r}) = \frac{1}{4\pi\varepsilon_0}\int \frac{\rho(\vec{r}\,')}{|\vec{r}-\vec{r}\,'|}\,d\vec{r}\,' \qquad \text{[Eqn. 2]}$$

where $\varepsilon_0$ is the dielectric constant of vacuum, $\vec{r}\,'$ is the 3-D coordinate of a differential volume element, $d\vec{r}\,'$, within the domain of the charge distribution, and $\vec{r}$ is the 3-D coordinate of the point of evaluation of the electrostatic potential function. Note that Eqn. 2 only applies when the surrounding medium is vacuum or the charges are point charges and the dielectric medium is isotropic. The electrostatic potential function is not to be confused with the electric field, a vector function given by $\vec{E}(\vec{r}) = -\nabla\Phi(\vec{r})$. The electrostatic potential function associated with Eqn. 2 is herein referred to as a Coulomb potential function, given its basis in Coulomb's law of electrostatics.
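For a charge distribution made of point charges, the Coulomb potential function reduces to a sum of q_i / (4 pi eps0 |r - r_i|) over the charges. The sketch below evaluates that sum in SI units. The function name `coulomb_potential` and the representation of each charge as a `(q, (x, y, z))` pair are assumptions for illustration, and the point-charge form is a special case rather than the general volume integral.

```python
import math
from typing import Iterable, Tuple

Vec3 = Tuple[float, float, float]

EPSILON_0 = 8.8541878128e-12  # vacuum permittivity in F/m (CODATA 2018)


def coulomb_potential(
    point: Vec3, charges: Iterable[Tuple[float, Vec3]]
) -> float:
    """Potential in volts at `point` from point charges in vacuum.

    `charges` holds (q, position) pairs with q in coulombs and positions
    in meters. This is the discrete, point-charge analog of the Coulomb
    potential integral.
    """
    total = 0.0
    for q, pos in charges:
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, pos)))
        if distance == 0.0:
            raise ValueError("potential is singular at a point-charge location")
        total += q / distance
    return total / (4.0 * math.pi * EPSILON_0)
```

Since the sum is linear in the charges, the potential of a combined set of charges equals the sum of the potentials of its parts, which is the superposition property used in the energy expressions for the two-molecule system.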
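The electrostatic energy, $E$, for a charge distribution represented by $\rho(\vec{r})$ and with an associated electrostatic potential function, $\Phi(\vec{r})$, is generally given as follows:

$$E = \frac{1}{2}\int \rho(\vec{r})\,\Phi(\vec{r})\,d\vec{r} \qquad \text{[Eqn. 3]}$$

where the integral is over all points $\vec{r}$ in the charge distribution. Eqn. 3 reduces to the following form when Eqn. 2 is applicable and $E$ is taken to be the self-electrostatic energy of the charge distribution:

$$E = \frac{1}{8\pi\varepsilon_0}\iint \frac{\rho(\vec{r}_i)\,\rho(\vec{r}_j)}{|\vec{r}_i-\vec{r}_j|}\,d\vec{r}_i\,d\vec{r}_j \qquad \text{[Eqn. 4]}$$

where $\vec{r}_i$ and $\vec{r}_j$ are 3-D coordinates of corresponding differential volume elements, $d\vec{r}_i$ and $d\vec{r}_j$, within the domain of the charge distribution.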
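For point charges, the self-electrostatic energy in vacuum becomes a sum over distinct pairs, with each pair counted once (equivalent to one half of the sum over ordered pairs, and with the divergent self-terms i = j excluded). The following sketch computes it; the name `self_energy`, the `(q, position)` representation, and the optional prefactor `k` (defaulting to 1 / (4 pi eps0) in SI units) are assumptions made for the example.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

# 1 / (4 pi eps0) in SI units, using the CODATA 2018 vacuum permittivity.
COULOMB_K = 1.0 / (4.0 * math.pi * 8.8541878128e-12)


def _distance(a: Vec3, b: Vec3) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_energy(
    charges: Sequence[Tuple[float, Vec3]], k: float = COULOMB_K
) -> float:
    """Pairwise Coulomb energy of a set of point charges in vacuum.

    Each unordered pair is counted once, which equals one half of the sum
    over ordered pairs i != j. Self-terms are excluded because they diverge
    for point charges.
    """
    energy = 0.0
    for (qi, ri), (qj, rj) in combinations(charges, 2):
        energy += k * qi * qj / _distance(ri, rj)
    return energy
```

Passing `k=1.0` gives the energy in reduced units, which can be convenient when only relative comparisons between configurations matter.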
When a charge distribution corresponding to a solute molecule is embedded in an anisotropic dielectric medium, Eqn. 3 still holds but Eqns. 2 and 4 are no longer applicable, as the electrostatic potential function, $\Phi(\vec{r})$, should now include the effects of electrostatic desolvation. As used herein, "solvent" refers to the plurality of atoms, ions, and/or simple molecules (e.g., water, salt, sugars) that comprise an ambient medium, polarizable or otherwise, and "electrostatic desolvation" refers to the interaction of a polar or charged solute entity in the presence of a polarizable medium having solvent entities. Herein "solvent entity" refers to individual solvent atoms, solvent molecules, and/or ions of the ambient medium and "solute entity" refers to polar or charged atoms or chemical groups that comprise the charge distributions associated with one or more molecules. Generally, the presence of solvent surrounding the charge distribution requires solution to either the Poisson equation or the Poisson-Boltzmann equation [36][45], depending on the presence of one or more solvent ionic components, as will be discussed below.
For the purposes of the invention, we are interested in the change in the total electrostatic energy of a system comprising charge distributions associated with two molecules embedded in a solvent environment upon formation of a potential molecular complex. A discussion on the calculation of total electrostatic energies involved in the formation of a potential molecular complex can be found in Gilson et al. [45].
Fig. 4 depicts the four thermodynamic states of the relevant thermodynamic cycle for such a system. In Fig. 4, the two molecules are represented by 410 and 420 respectively, region 430 refers to a region of vacuum, region 435 refers to a region in a solvent environment, and the particles labeled 437 refer to solvent entities comprising region 435. The circles labeled 415 represent the charge distribution of molecule 410, i.e., $\rho_1$, and those circles labeled 425 represent the charge distribution of molecule 420, i.e., $\rho_2$.
In Fig. 4, state 440 refers to a thermodynamic state where the molecules 410 and 420 are isolated and in vacuum, i.e., situated in region 430, whereas state 445 refers to the same two molecules, 410 and 420, in region 435, i.e., embedded as solutes in a solvent medium. State 450 refers to a thermodynamic state wherein molecules 410 and 420 are situated in the vacuum region 430, but are no longer in isolation with respect to one another and may in fact be forming a molecular complex. State 455 is the state analogous to state 450 but now residing in the solvent region 435.
As used herein, "isolation" refers to a state where the two molecules do not interact with one another, i.e., are infinitely far apart. In practice "isolation" would more likely correspond to the case that the two molecules are very far apart relative to their respective characteristic dimensions, and thus their mutual electrostatic interaction energy is negligible.
In Fig. 4, Eqn. 3 for the electrostatic energy is applicable to all four depicted thermodynamic states with $\rho(\vec{r}) = \rho_1(\vec{r}) + \rho_2(\vec{r})$, but depending on the particular state the form for the electrostatic potential, $\Phi(\vec{r})$, changes. For example, for vacuum state 440, the electrostatic energy, $E_{440}$, is given by:

$$E_{440} = \frac{1}{2}\int \rho_1(\vec{r}_1)\,\Phi_1(\vec{r}_1)\,d\vec{r}_1 + \frac{1}{2}\int \rho_2(\vec{r}_2)\,\Phi_2(\vec{r}_2)\,d\vec{r}_2 = U_1 + U_2 \qquad \text{[Eqn. 5]}$$

where $\vec{r}_1$ is any point in a volume containing the charge distribution of molecule 410 ($\rho_1$) and similarly $\vec{r}_2$ is any point in a volume containing the charge distribution of molecule 420 ($\rho_2$). In Eqn. 5, $\Phi_1$ refers to the Coulomb electrostatic potential of molecule 410 in vacuum as defined in Eqn. 2, and similarly $\Phi_2$ refers to the Coulomb electrostatic potential of molecule 420 in vacuum, also given by Eqn. 2. Note that Eqn. 5 was derived from Eqn. 3 utilizing the joint charge distribution $\rho(\vec{r}) = \rho_1(\vec{r}) + \rho_2(\vec{r})$, linear superposition of the electrostatic potentials, i.e., $\Phi(\vec{r}) = \Phi_1(\vec{r}) + \Phi_2(\vec{r})$, and the condition of isolation for both molecules.
In Eqn. 5, $U_1$ refers to the self-electrostatic energy of molecule 410 in isolation and is equivalent to the first integral in Eqn. 5 and is also equivalent to Eqn. 4 when substituting $\rho$ with $\rho_1$. Similarly $U_2$ refers to the self-electrostatic energy of molecule 420 in isolation and is equivalent to the second integral in Eqn. 5 and is also equivalent to Eqn. 4 when substituting $\rho$ with $\rho_2$.
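For vacuum state 450, the electrostatic energy, $E_{450}$, is given by:

$$E_{450} = \frac{1}{2}\int \rho_1(\vec{r}_1)\,\Phi_1(\vec{r}_1)\,d\vec{r}_1 + \frac{1}{2}\int \rho_2(\vec{r}_2)\,\Phi_2(\vec{r}_2)\,d\vec{r}_2 + \frac{1}{2}\int \rho_1(\vec{r}_1)\,\Phi_2(\vec{r}_1)\,d\vec{r}_1 + \frac{1}{2}\int \rho_2(\vec{r}_2)\,\Phi_1(\vec{r}_2)\,d\vec{r}_2 = U_1 + U_2 + U_{12} \qquad \text{[Eqn. 6]}$$

where $\vec{r}_1$, $\vec{r}_2$, $\rho_1$, $\rho_2$, $\Phi_1$, $\Phi_2$, $U_1$, and $U_2$ are all defined as before in regard to Eqn. 5. Once again, Eqn. 6 was derived from Eqn. 3 using the joint charge distribution, $\rho(\vec{r}) = \rho_1(\vec{r}) + \rho_2(\vec{r})$, and linear superposition of the electrostatic potentials, i.e., $\Phi(\vec{r}) = \Phi_1(\vec{r}) + \Phi_2(\vec{r})$, but now the two molecules are not in isolation and hence the mutual electrostatic interaction energy, $U_{12}$, represented by the third and fourth integrals cannot be ignored. In fact the change in electrostatic energy going from the isolated vacuum state 440 to the potential molecular complex in vacuum of state 450, as depicted by arrow 470, is:

$$E_{450} - E_{440} = U_{12} = \frac{1}{2}\int \rho_1(\vec{r}_1)\,\Phi_2(\vec{r}_1)\,d\vec{r}_1 + \frac{1}{2}\int \rho_2(\vec{r}_2)\,\Phi_1(\vec{r}_2)\,d\vec{r}_2 \qquad \text{[Eqn. 7]}$$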
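The vacuum relation between the isolated and complexed states can be checked numerically for point charges: the pairwise energy of the joint charge set, minus the pairwise energies of each molecule alone, equals the sum of all cross-pair Coulomb terms, i.e., the mutual interaction energy. The sketch below illustrates this under stated assumptions: point charges instead of continuous densities, a `(q, position)` representation, and the hypothetical names `self_energy` and `interaction_energy`.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Charge = Tuple[float, Vec3]

# 1 / (4 pi eps0) in SI units, using the CODATA 2018 vacuum permittivity.
COULOMB_K = 1.0 / (4.0 * math.pi * 8.8541878128e-12)


def _distance(a: Vec3, b: Vec3) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_energy(charges: Sequence[Charge], k: float = COULOMB_K) -> float:
    """Pairwise Coulomb energy within one set of point charges."""
    return sum(
        k * qi * qj / _distance(ri, rj)
        for (qi, ri), (qj, rj) in combinations(charges, 2)
    )


def interaction_energy(
    set1: Sequence[Charge], set2: Sequence[Charge], k: float = COULOMB_K
) -> float:
    """Mutual Coulomb energy: every charge in set1 with every charge in set2."""
    return sum(
        k * q1 * q2 / _distance(r1, r2)
        for q1, r1 in set1
        for q2, r2 in set2
    )
```

For any two non-overlapping charge sets `m1` and `m2`, `self_energy(list(m1) + list(m2)) - self_energy(m1) - self_energy(m2)` agrees with `interaction_energy(m1, m2)` up to floating-point rounding, which is the discrete counterpart of the energy difference between the complexed and isolated vacuum states.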
For state 445, representing molecules 410 and 420 in isolation but embedded in solvent, the total electrostatic energy, $E_{445}$, is given by:

$$E_{445} = \frac{1}{2}\int \rho_1(\vec{r}_1)\,\Phi_1'(\vec{r}_1)\,d\vec{r}_1 + \frac{1}{2}\int \rho_2(\vec{r}_2)\,\Phi_2'(\vec{r}_2)\,d\vec{r}_2 = U_1' + U_2' \qquad \text{[Eqn. 8]}$$

where $\vec{r}_1$, $\vec{r}_2$, $\rho_1$, and $\rho_2$ are as before in Eqns. 5-7, but $\Phi_1'$ and $\Phi_2'$ refer respectively to the electrostatic potential of molecule 410 in solvent in isolation and molecule 420 in solvent in isolation. $\Phi_i'$ differs from the corresponding in vacuo Coulombic electrostatic potential $\Phi_i$ as a result of electrostatic desolvation, i.e., the interaction of a polar or charged solute entity in the presence of a polarizable medium having solvent entities. As such, Eqn. 2 no longer applies to $\Phi_i'$ and hence Eqn. 4 no longer applies to $U_i'$.
Electrostatic desolvation can be broken into three kinds of interactions: (1) a reaction field term representing the favorable interaction of a solute entity with the induced polarization charge near the solvent-solute interface, (2) a solvent screening effect reflecting the reduction of a Coulombic electrostatic interaction between a pair of solute entities due to intervening solvent containing solvent entities with net charge and/or nonvanishing dipole moments, and (3) interaction of solute entities with an ionic atmosphere comprised of one or more electrolytes in the solvent (if present).
All three effects are dependent on the degree of solvent accessibility for each solute entity, i.e., the size and geometry of the solute-solvent interface. The distance-dependent dielectric / Coulomb model of Eqn. 1 is an attempt to approximate the solvent screening effects. In some cases an additional term based on solvent accessible surface area or fragmental volumes is added in order to approximate the reaction field effect [53][54]. Typically, the presence of an ionic atmosphere is ignored altogether, though previous work, as described in Sitkoff et al. [36], has shown that such simplistic approximations are often consistently inaccurate representations of the role of electrostatic desolvation in the formation of a potential molecular complex.
Returning to Fig. 4, for state 455, representing molecules 410 and 420 no longer in isolation but still embedded in solvent, the electrostatic energy, $E_{455}$, is given by:

$$E_{455} = \frac{1}{2}\int \rho_1(\vec{r}_1)\,\Phi''(\vec{r}_1)\,d\vec{r}_1 + \frac{1}{2}\int \rho_2(\vec{r}_2)\,\Phi''(\vec{r}_2)\,d\vec{r}_2 \qquad \text{[Eqn. 9]}$$

where $\vec{r}_1$, $\vec{r}_2$, $\rho_1$, and $\rho_2$ are as before in Eqns. 4-8. In general, the total electrostatic potential, $\Phi''$, of the potential molecular complex in the presence of a polarizable medium, such as that represented by solvent region 435, cannot be represented in terms of two separate electrostatic potential functions. However, as will be discussed later, in some cases it is possible to approximate the total electrostatic potential, $\Phi''$, in Eqn. 9 as two linearly super-imposable potentials, $\Phi_1''$ and $\Phi_2''$, each generated separately by the charge distribution of one of the molecules, while the charge distribution on the other molecule is ignored, though both potentials will be different from their counterparts, $\Phi_1'$ and $\Phi_2'$, in state 445.
Note that in Eqn. 9 the first integral represents the total electrostatic energy of the charge distribution of molecule 410 in state 455, including both its self-electrostatic energy and its mutual interaction with the charge distribution of molecule 420 as mediated by the ambient solvent of region 435. Similarly, the second integral of Eqn. 9 represents the total electrostatic energy of the charge distribution of molecule 420 in state 455.
As depicted by arrow 475, going from state 440 to 445, the difference in total electrostatic energy is solely the result of electrostatic desolvation of each molecule in isolation. Going from state 450 to state 455, as depicted by arrow 480, once again the difference in total electrostatic energy is solely the result of electrostatic desolvation, but now of the two molecules together in close proximity during formation of a potential molecular complex. Lastly, going from state 445 to state 455, as depicted by arrow 485, the difference in total electrostatic energy is the result of bringing the two molecules closer to one another, thereby bringing the charge distributions on each molecule closer together and also changing the electrostatic desolvation of the two molecules as the solvent accessibilities of solute entities on each molecule are altered.
For the purposes of an analysis of molecular combinations based on computations of electrostatic affinity, the most relevant change of states is arrow 485, and thus the most relevant change in total electrostatic energy that must be calculated is given by $\Delta E = E_{455} - E_{445}$. In practice, due to the complexity of Eqn. 9 and the inclusion of electrostatic desolvation effects in the total electrostatic potential, $\Phi''$, $\Delta E$ may be difficult to accurately compute, especially for a large plurality of distinct molecular configurations. However, as shown in Fig. 4, as indicated by arrow 490, $\Delta E$ may be computed in four steps as follows: (1) compute $-\Delta E_1 = E_{440} - E_{445}$, (2) compute $+\Delta E_2 = E_{450} - E_{440}$, (3) compute $+\Delta E_3 = E_{455} - E_{450}$, and (4) form the sum $\Delta E = -\Delta E_1 + \Delta E_2 + \Delta E_3$. Thus in the syntax of Fig. 4, embodiments of the present invention directly provide for an accurate and efficient estimation of the electrostatic affinity, i.e., $\Delta E = E_{455} - E_{445}$, for a large plurality of different molecular configurations of one or more molecular combinations.
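That the four-step route around the thermodynamic cycle reproduces the direct energy difference follows from a telescoping sum, written out here as a short worked check using only the state energies already defined for Fig. 4:

```latex
\begin{aligned}
\Delta E &= -\Delta E_1 + \Delta E_2 + \Delta E_3 \\
         &= (E_{440} - E_{445}) + (E_{450} - E_{440}) + (E_{455} - E_{450}) \\
         &= E_{455} - E_{445}.
\end{aligned}
```

The vacuum energies $E_{440}$ and $E_{450}$ cancel in pairs, so only the two solvated end states remain.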
Fig. 5 shows a flow diagram of an exemplary method 500 of analyzing a molecular combination based on computations of electrostatic affinity across a set of configurations, performed in accordance with embodiments of the present invention. The method 500 of Fig. 5 is described with reference to Figs. 6-15. As explained below, the method 500 generally involves computing a basis expansion representing charge density and electrostatic potential functions associated with constituent molecular subsets, computing transformed expansion coefficients for different configurations (i.e., relative positions and orientations) of the molecular subsets, and computing a correlation function representing an electrostatic affinity of the two molecular subsets using the transformed expansion coefficients. Embodiments of this method incorporate various combinations of hardware, software, and firmware to perform the steps described below.
In Fig. 5, in step 502, a first molecular subset 602 and a second molecular subset 652 are provided, as shown in Fig. 6a. The molecular subsets used herein are generally stored as digital representations in a molecule library or other collection such as the molecular combination database 204 of modeling system 200 in Fig. 2. Each molecular subset has a molecular shape, as illustrated in Fig. 6a, wherein the term "molecular shape" generally refers to a volumetric function representing the structure of a molecular subset comprising a plurality of atoms and bonds. Molecular subsets 602 and 652 respectively include a plurality of atoms 604 and 654 that are generally connected by chemical bonds in order to define the structure of each molecular subset. Those skilled in the art will appreciate that the molecular subsets 602 and 652 may have various shapes other than those shown in Fig. 6a. Often the first molecular subset 602 is a ligand, and the second molecular subset 652 is a protein. However, as already discussed in regard to the definition of a molecular subset, the molecular subsets 602 and 652 can have various compositions.
In Fig. 5, in step 503, a first charge distribution is defined for first molecular subset 602, and a second charge distribution may be defined for second molecular subset 652 using a similar methodology. Charges can be conceived of as concentrated at points, but more generally, they are distributed over surfaces or through volumes. For a molecule, the charge distribution typically refers to the plurality of solute entities, representing the charges, partial or otherwise, assigned to constituent atoms and/or chemical groups.
In one embodiment, each solute entity is assigned a charge according to a supplied energy parameter set. An "energy parameter", as used herein, is a numerical quantity representing a particular physical or chemical attribute of a solute entity in the context of the specified energy model for an energy term. An "energy model", as used herein, is a mathematical formulation of one or more energy terms, wherein an energy term represents a particular type of physical and/or chemical interaction. An "energy parameter set", as used herein, is a catalog of energy parameters as pertains to a wide range of chemical species of atoms and bonds for a canonical set of energy models for a number of different energy terms.
An example of an energy term is the electrostatic interaction, which refers to the interaction between two or more solute entities, i.e., ionic, atomic or molecular charges, whether integral or partial charges. As discussed above, a "partial charge" refers to a numerical quantity that represents the effective electrostatic behavior of an otherwise electrically neutral atom, chemical group, or molecule in the presence of an electrical field or another charge distribution. An example of an energy model is a Coulombic electrostatic energy model representing the electrostatic interaction between a pair of atomic or molecular charges in a dielectric medium.
An energy parameter may depend solely on the chemical identity of a solute entity or on the chemical identities of a pair or more of solute entities associated with the given interaction type represented by the energy model; or on the location of the solute entity within the context of a chemical group, a molecular substructure such as an amino acid in a polypeptide, a secondary structure such as an alpha helix or a beta sheet in a protein, or of the molecule as a whole; or on any combination thereof. An example would be the value of charge assigned to a nitrogen atom involved in the peptide bond on the backbone of a lysine residue of a protein with regard to a Coulombic electrostatic energy model. Herein, energy parameter sets include those defined in conjunction with all-atom or unified atom force fields, often used to estimate the change in energy as a function of the change in conformation of a molecule. Examples of energy parameter sets include those described in AMBER [39][40], OPLS [41], MMFF [42], and CHARMM [43].
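As a minimal sketch of how an energy parameter set and a Coulombic energy model fit together, the Python fragment below looks up partial charges by residue and atom name and evaluates a pairwise Coulomb energy in a uniform dielectric. The table `CHARGES`, its numerical values, and the function names are hypothetical placeholders; they are not taken from AMBER, OPLS, MMFF, CHARMM, or the specification. The conversion constant is an approximate value that assumes charges in units of e, distances in Angstroms, and energies in kcal/mol.

```python
# Hypothetical parameter set: (residue, atom name) -> partial charge in units of e.
# Placeholder values for illustration only; not from any published force field.
CHARGES = {
    ("LYS", "N"): -0.35,
    ("LYS", "H"): 0.27,
    ("LYS", "NZ"): -0.39,
}

COULOMB_K = 332.06  # approximate conversion factor, kcal*Angstrom/(mol*e^2)


def assign_charge(residue: str, atom: str) -> float:
    """Look up the partial charge of a solute entity by its chemical context."""
    return CHARGES[(residue, atom)]


def coulomb_energy(q1: float, q2: float, r: float, eps: float = 1.0) -> float:
    """Pairwise Coulomb energy of two point charges in a uniform dielectric eps."""
    if r <= 0.0:
        raise ValueError("separation must be positive")
    return COULOMB_K * q1 * q2 / (eps * r)


# Backbone amide N and H of a lysine residue, separated by 1.0 Angstrom.
e_nh = coulomb_energy(assign_charge("LYS", "N"), assign_charge("LYS", "H"), 1.0)
```

Keying the table on the residue as well as the atom name mirrors the point made above that a parameter may depend on the location of the solute entity within a chemical group, not only on its element.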
In Fig. 5, in step 504, a first molecular surface is defined for first molecular subset 602, and a second molecular surface is defined for second molecular subset 652. The first molecular surface is defined by recognition that some of atoms 604 are situated along a border or boundary of molecular subset 602. Such atoms are herein referred to as "surface atoms". The surface atoms are proximal to and define a molecular surface 606 for the first molecular subset 602 based on their location.
In one embodiment, the molecular surface 606 can be a solvent accessible molecular surface, which is generally the surface traced by the center of a small sphere rolling over the molecular surface 606. Computational methods for generation of solvent accessible surfaces include the method presented in Connolly, M. L., "Analytical molecular surface calculation", (1983), J. Applied Crystallography, 16, 548-558, which is hereby incorporated in its entirety. In another embodiment, the molecular surface is defined based on the solute-solvent interface. A second molecular surface 656 is defined in a similar manner for molecular subset 652 based on the subset of atoms 654 that are surface atoms for molecular subset 652.
In Fig. 5, in step 505, a first charge density function is generated as a representation of the first charge distribution of the first molecular subset 602 over a subset of a volume enclosed by the first molecular surface 606. A charge density function is a volume function representing the charge distribution of a molecular subset and is given by ρ(r). In Fig. 6a, a position 614 exists within or near one of the atoms 604 which define molecular subset 602 and another position 616 exists outside of and not near any of the atoms 604 of molecular subset 602. Generally, when defining a first charge density function, ρ1, for molecular subset 602, for position 614, and others like it, the charge density function will be nonzero, i.e., ρ1 ≠ 0. On the other hand, generally, when defining a first charge density function, ρ1, for molecular subset 602, for position 616, and others like it, the charge density function will be very small or even zero, i.e., ρ1 ≈ 0.
In one embodiment, the first charge density function is defined in terms of a Dirac delta function in order to represent point charges and is given by

$$\rho_1(\mathbf{r}) = \sum_i c_i \, q_i \, \delta(\mathbf{r} - \mathbf{r}_i)$$

where i denotes the i-th solute entity, r_i is its position, q_i is the net charge of solute entity i, and c_i is a constant dependent on the chemical identity of the i-th solute entity.
In another embodiment, the first charge density function for the first molecular subset 602 is defined as a union of a set of kernel functions. As used herein, a "kernel function" is a 3-D volumetric function with finite support that is associated with points in a localized neighborhood about a solute entity.
In one embodiment, each kernel function is dependent on the chemical identity of the associated solute entity. In another embodiment, each kernel function is dependent on the location of the associated solute entity within a chemical group. In yet another embodiment, each kernel function is dependent on the charge of the associated solute entity.
In yet another embodiment, in the case that the solute entity is an atom, the kernel function is a nonzero constant for points within a Van der Waals (vdw) sphere and zero at other points. As used herein, a "vdw sphere" is a sphere with radius equal to the Van der Waals radius of an atom of the given type and centered on the atom. In one example, this nonzero constant has a value of unity. In another example, the nonzero constant has a value such that, when multiplied by the volume of the vdw sphere, it equals the charge assigned to the atom.
Alternatively, as in another embodiment, each kernel function is the charge of the associated solute entity multiplied by a 3-D probability distribution function centered on the solute entity. In yet another embodiment, each kernel function is the charge of the associated solute entity multiplied by a 3-D Gaussian probability distribution function centered on the solute entity and with a known variance. In one example, this variance is a function of the chemical identity of the solute entity or the location of the solute entity within a chemical group.
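The vdw-sphere and Gaussian kernels described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the Gaussian is taken to be isotropic, and the kernels are combined by simple summation, which is one plausible reading of the "union" of kernel functions rather than a construction confirmed by the specification.

```python
import math


def vdw_kernel(point, center, radius, charge):
    """Constant inside the vdw sphere and zero outside. The constant is
    charge / sphere volume, so the kernel integrates to the atom's charge."""
    volume = 4.0 / 3.0 * math.pi * radius ** 3
    return charge / volume if math.dist(point, center) <= radius else 0.0


def gaussian_kernel(point, center, sigma, charge):
    """Charge times an isotropic 3-D Gaussian pdf with per-axis variance sigma**2."""
    d2 = math.dist(point, center) ** 2
    norm = (2.0 * math.pi * sigma ** 2) ** -1.5
    return charge * norm * math.exp(-d2 / (2.0 * sigma ** 2))


def charge_density(point, atoms, kernel):
    """Sum kernel contributions (an assumed combination rule).
    atoms is a list of (center, width, charge); width is a radius or a sigma."""
    return sum(kernel(point, c, w, q) for c, w, q in atoms)
```

With the Gaussian kernel, the width `sigma` could in turn be looked up from the chemical identity of the solute entity, as the last example in the paragraph above suggests.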
In yet another embodiment, each kernel function can be the charge of the associated solute entity multiplied by an orthonormal radial basis function. In one embodiment, each kernel function is the charge of the associated solute entity multiplied by a scaled (or unscaled) Laguerre polynomial-based radial basis function. In another embodiment, each kernel function is the charge of the associated solute entity multiplied by a radial/spherical harmonic basis expansion of finite order representing the individual charge distribution of the solute entity. An example of this embodiment is a first charge density function given as follows:
where i denotes the i-th solute entity, r_i is the center of the nearest neighbor atom, q_i is the charge, partial or otherwise, of the i-th solute entity, S_10 is a scaled radial basis function S_nl for n=1 and l=0, and a_i is a scaling factor less than unity, which accounts for the truncation of the charge of the i-th solute entity to a small sphere centered on the i-th solute entity.
In another embodiment, each kernel function is a quantum mechanical wave function representing the charge distribution of the associated solute entity. As used herein, a "quantum mechanical wave function", Ψ(r,t), describes a quantum mechanical particle, e.g., an electron, an ion, an atom, or a molecule, such that its absolute square, |Ψ(r,t)|², corresponds to the probability density of finding the particle at position r and time t.
Continuing in Fig. 5, in step 505, a second charge density function, ρ2, is generated as a representation of the second charge distribution of the second molecular subset 652 over a subset of a volume enclosed by the second molecular surface 656 in a manner similar to that used in defining the first charge density function for molecular subset 602.
In Fig. 5, in step 506, a first electrostatic potential function is defined for the first molecular subset 602 when isolated from the second molecular subset 652, based on the electrostatic interaction of the first charge distribution of the first molecular subset 602 with itself and with various solvent entities in an ambient or aqueous environment of the first molecular subset 602 in isolation. As used herein, the term "ambient environment" refers to the 3-D volume occupied by various solvent entities and may be used interchangeably with ambient material, ambient medium, or even aqueous environment. Alternatively, there may be no solvent entities present in the ambient environment and instead the ambient environment is a vacuum with a dielectric permittivity of ε0.
In one embodiment, the first electrostatic potential function is defined according to the Coulombic electrostatic potential of Eqn. 2 with the assumption of an isotropic dielectric medium representing the polarizable ambient environment of molecular subset 602 in isolation, and the representation of the first charge density function is in terms of Dirac delta functions, as discussed above, for point charges.
In another embodiment, the first electrostatic potential function associated with molecular subset 602 in isolation is a solution to Poisson's equation, whose solution represents the electrostatic potential function for an arbitrary charge density function associated with a solute charge distribution embedded in a dielectric medium, whether isotropic or anisotropic in nature. Poisson's equation is a linear second order elliptic partial differential equation and is given as follows:
where Φ(r) is the electrostatic potential at r, ε(r) is the permittivity as a function of position (e.g., ε_m for points in the molecular subset, and ε_w for points in the solvent), and ρ(r) is the arbitrary charge density function. An accurate solution to Poisson's equation, followed by application of Eqn. 3, leads to the total electrostatic energy for the molecular subset that includes both the reaction field and solvent screening effects associated with electrostatic desolvation, as discussed in regard to states 445 and 455 of Fig. 4. However, the effects of an ionic atmosphere are completely ignored in Eqn. 11.
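For reference, one common way of writing Poisson's equation in terms of the potential, the position-dependent permittivity, and the charge density is shown below. The unit convention is an assumption: Gaussian units are used here, and in SI units with a relative permittivity the right-hand side would instead read −ρ(r)/ε0.

```latex
\nabla \cdot \left[ \varepsilon(\mathbf{r}) \, \nabla \Phi(\mathbf{r}) \right] = -4\pi \, \rho(\mathbf{r})
```

For a uniform permittivity this reduces to the familiar form ∇²Φ = −4πρ/ε.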
Generally, Eqn. 11 must be solved numerically and an appropriate set of boundary conditions must be specified, i.e., a set of initial values specified on a boundary. A discussion of the relevance of boundary conditions to solutions of partial differential equations (PDEs) and common examples including Neumann, Dirichlet, and Cauchy boundary conditions can be found in Arfken et al. [50].
In another embodiment, the first electrostatic potential function is a solution to the Poisson-Boltzmann equation with suitable boundary conditions and one or more nonzero Debye-Hückel parameters, and represents the electrostatic potential function for the first molecular subset in isolation, including all three types of electrostatic desolvation effects, including the effects of an ionic atmosphere. The Poisson-Boltzmann equation (PBE) is also a second order elliptic partial differential equation, but generally nonlinear, and is given by:
The linearized version of Eqn. 12 is given by expanding the hyperbolic sine in terms of Φ(r) and retaining only the linear component as follows:
In both Eqns. 12 and 13, Φ(r) is the electrostatic potential at r, ε(r) is the permittivity as a function of position (e.g., ε_m for points in the molecular subset, and ε_w for points in the solvent), ρ(r) is the arbitrary charge density function, κ(r) is the Debye-Hückel parameter, and e_c, k_B, and T are, respectively, the charge of an electron, the Boltzmann constant, and the temperature.
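For reference, a commonly quoted form of the nonlinear Poisson-Boltzmann equation for a 1:1 electrolyte, and its linearization, can be written with the symbols defined above as shown below. The unit convention (Gaussian) and the placement of ε(r) in the ionic term are assumptions; some references absorb ε(r) into a modified Debye-Hückel parameter.

```latex
% Nonlinear PBE (assumed form)
\nabla \cdot \left[ \varepsilon(\mathbf{r}) \nabla \Phi(\mathbf{r}) \right]
  - \varepsilon(\mathbf{r}) \, \kappa^2(\mathbf{r}) \, \frac{k_B T}{e_c}
    \sinh\!\left( \frac{e_c \, \Phi(\mathbf{r})}{k_B T} \right)
  = -4\pi \, \rho(\mathbf{r})

% Linearized PBE, using \sinh(x) \approx x for small x
\nabla \cdot \left[ \varepsilon(\mathbf{r}) \nabla \Phi(\mathbf{r}) \right]
  - \varepsilon(\mathbf{r}) \, \kappa^2(\mathbf{r}) \, \Phi(\mathbf{r})
  = -4\pi \, \rho(\mathbf{r})
```

Setting κ(r) = 0 everywhere recovers Poisson's equation, consistent with the statement that Eqn. 11 ignores the ionic atmosphere.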
The solutions to Eqns. 11-13 are computationally intensive and generally require high memory overhead for accurate solutions, as discussed in references [45][47][50]. Moreover, if the conformation of the molecular subset changes, or the molecular subset is brought in close contact to one or more other molecular subsets, the solution must in principle be recomputed as the solute-solvent interface has changed. In practice, however, if the conformational changes are small and the solvent accessibility of solute entities in the molecular subsets does not change appreciably, then previously computed solutions to Eqn. 11, or alternatively Eqns. 12 and 13, may be utilized as approximations to the electrostatic potential of the new configuration.
In another embodiment, the first electrostatic potential function is a solution obtained by employing a generalized Born solvation model, and represents the electrostatic potential function for the first molecular subset 602 in isolation, also including the effects of electrostatic desolvation between solute charges and a surrounding ambient environment comprising solvent entities. The Generalized Born (GB) approximation is an alternative approach to calculate charge-charge interactions W_ij for pairs of atoms or ions in the presence of continuum solvent, using the following formula:
where the first term is the energy in vacuum and the second term is the solvation energy, ε_w is the permittivity of the solvent, q_i is the charge of the i-th atom or ion, and f_ij^GB is a suitably chosen smooth function dependent on the evaluation of various volume integrals over the solvent-excluded volume (Still et al. [51]) or alternatively on various surface integrals over the solvent accessible surface (Ghosh et al. [52]). As already mentioned in the background, while computationally less expensive than numerical solutions to the PBE, the GB solvation models are more complex to evaluate than their Coulombic counterparts.
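A minimal Python sketch of a pairwise GB energy of the kind described above is given here. It assumes the widely used Still-type smoothing function and takes the effective Born radii `a_i` and `a_j` as given inputs, whereas in practice they would come from the volume or surface integrals mentioned above. The function names, the unit convention (Coulomb constant set to one), and the default solvent permittivity of 80 are illustrative assumptions.

```python
import math


def f_gb_still(r, a_i, a_j):
    """Still-type smoothing function; a_i and a_j are effective Born radii."""
    return math.sqrt(r * r + a_i * a_j * math.exp(-r * r / (4.0 * a_i * a_j)))


def gb_pair_energy(q_i, q_j, r, a_i, a_j, eps_w=80.0):
    """Vacuum Coulomb term plus GB solvation term for one pair of charges."""
    vacuum = q_i * q_j / r
    solvation = -(1.0 - 1.0 / eps_w) * q_i * q_j / f_gb_still(r, a_i, a_j)
    return vacuum + solvation
```

At large separation the smoothing function tends to r, so the pair energy tends to the screened Coulomb value q_i q_j / (ε_w r); with ε_w = 1 the solvation term vanishes and the vacuum energy is recovered.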
In another embodiment, the first electrostatic potential function is a solution obtained by employing a molecular dynamics simulation with an explicit solvent dipole model such as those described in [55][56][57].
In Fig. 5, also in step 506 and in a manner similar to that used when defining a first electrostatic potential function for molecular subset 602, a second electrostatic potential function is defined for the second molecular subset 652 when isolated from the first molecular subset 602, based on the electrostatic interaction of the second charge distribution of the second molecular subset 652 with itself and with solvent entities in an ambient environment of the second molecular subset 652 in isolation.
In Fig. 5, in step 508, a first electrostatic potential field is defined corresponding to a representation of the first electrostatic potential function, generated by the charge distribution associated with molecular subset 602 in isolation, over a first electrostatic potential computational domain 608. The term "electrostatic potential field" should not be confused with the electric field, a vector function given by E(r) = −∇Φ(r), where Φ(r) is the electrostatic potential function of a charge distribution.
As used herein, "computational domain" refers to a volume over which the associated function has nonnegligible values. For a function like the electrostatic potential function, which generally has both positive and negative values, "nonnegligible" means that for a point inside the domain the function value has an absolute magnitude above some a priori chosen threshold. Generally, the computational domain represents a topologically connected volume, whether continuous or discrete. The purpose of a finite volumetric computational domain is to reduce the number of computational operations, as well as the amount of required memory overhead and I/O or memory bandwidth. Those skilled in the art should understand that while Fig. 6a shows region 608 represented as a definite band, any volume related to molecular subset 602 can serve as the first electrostatic potential computational domain 608. In some embodiments the electrostatic potential computational domain may include points external to and proximal to the molecular surface 606. In other embodiments the computational domain may also include all of or a portion of the points lying within the volume enclosed by molecular surface 606. A second electrostatic potential computational domain 658 can be defined for molecular subset 652 in a similar manner.
In Fig. 6a, a region 610 outside of both domains 608 and 658 is shown. Position 620 represents a point that is not in either of the computational domains 608 and 658. Generally, point 620, and others like it, will be assigned a value of zero for both the first and second electrostatic potential fields.
In Fig. 6a, regions 608, 658, and 610, either in their entirety or any portions thereof, may lie within an ambient environment. In one embodiment, the ambient environment is a vacuum. In another embodiment, the ambient environment comprises a plurality of solvent entities. In yet another embodiment, the ambient environment includes water molecular subsets with nonvanishing dipole moments. In yet another embodiment, the ambient environment includes various salt ions. In yet another embodiment, the ambient environment includes an acid or a base such that the solvent is at a nonneutral pH. In yet another embodiment, the ambient environment includes various free radicals. In yet another embodiment, the ambient environment includes various electrolytes. In yet another embodiment, the ambient environment includes various fatty acids. In yet another embodiment, the ambient environment includes a heterogeneous mix of different kinds of solvent entities, such as those already listed.
Positions 617 and 618 represent two different points within the computational domain 608, the first being inside the molecular surface 606 and the latter being external to the volume enclosed by the molecular surface 606. Generally, point 618 and others like it will be assigned a nonzero value for the first electrostatic potential field. Similarly, points 667 and 668 in domain 658 will generally have a nonzero value for the second electrostatic potential field. Moreover, points 618, 668, and 620 will generally have a value of zero for both the first and the second charge density function. As will be shown in Fig. 6b, points that lie within both domains 608 and 658 when the two molecular subsets are not in isolation will generally be assigned nonzero values for both the first and the second electrostatic potential field.
In one embodiment, the computational domain 608 does not include points internal to the molecular surface 606 of molecular subset 602, in which case point 617 would be assigned a value of zero, being outside the computational domain for the first electrostatic potential field. In another embodiment, domain 608 is defined by moving a small sphere (or probe sphere) at positions proximal to and external to the first molecular surface 606. In another embodiment, region 608 is defined as the volume swept out by a probe sphere, the center of which moves along the entire first molecular surface 606. In another embodiment, the probe sphere moves along only a portion of the molecular surface 606. In one embodiment, the probe sphere has a constant radius. In Fig. 6a, a domain 608 constructed in this manner is shown and has a characteristic thickness 612 that is equal to the constant radius of the probe sphere.
In another embodiment, the radius of the probe sphere varies as a function of location on the molecular surface 606. Often the radius of the probe sphere is substantially larger than the radius of a typical atom 604 or other solute entity in molecular subset 602. In another embodiment, in order to faithfully represent an appropriate computational domain for the first electrostatic potential function of molecular subset 602, meaning none of the points in region 610 have a significant magnitude for the first electrostatic potential function of molecular subset 602 in isolation, the radius of the probe sphere is substantially larger than the radius, or other characteristic dimension, of a typical solvent entity in the ambient environment of molecular subset 602.
Domain 658 can be constructed for the second electrostatic potential field for molecular subset 652 in a manner similar to domain 608 for molecular subset 602.
In one embodiment, the first electrostatic potential field for molecular subset 602 refers to the function formed by truncating the first electrostatic potential function to the first electrostatic potential computational domain 608 as follows:

$$\tilde{\Phi}_1(\mathbf{r}) = \begin{cases} \Phi_1(\mathbf{r}), & \mathbf{r} \in D \\ 0, & \text{otherwise} \end{cases} \qquad (15)$$

where r is a point in 3-D space, Φ1 is the first electrostatic potential function, D is a volumetric computational domain containing points for which |Φ1(r)| exceeds a constant threshold, and Φ̃1 is the first electrostatic potential field. In another embodiment, Φ1 is multiplied in Eqn. 15 by a scalar value.
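The truncation of a potential function to a thresholded computational domain can be sketched as follows. The dictionary-based grid, the function name, and the optional `scale` argument are hypothetical choices made for illustration only.

```python
def truncate_to_domain(potential, threshold, scale=1.0):
    """Keep values whose magnitude exceeds the threshold; zero all others.

    potential maps a grid point (any hashable key) to a potential value.
    scale is an optional scalar multiplier applied to retained values.
    """
    return {
        point: (scale * value if abs(value) > threshold else 0.0)
        for point, value in potential.items()
    }


# Hypothetical sampled values along a line, in arbitrary units.
phi = {(0, 0, 0): -2.0, (1, 0, 0): 0.4, (2, 0, 0): 0.05}
phi_field = truncate_to_domain(phi, threshold=0.1)
```

Testing the absolute value matters because the potential takes both signs; a point with a strongly negative potential belongs to the domain just as one with a strongly positive potential does.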
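In another embodiment, the first electrostatic potential field, Φ̃1, is formed instead via the convolution of the first electrostatic potential function, Φ1, with another function H(r) as follows:

$$\tilde{\Phi}_1(\mathbf{r}) = \int \Phi_1(\mathbf{r}') \, H(\mathbf{r} - \mathbf{r}') \, d^3 r' \qquad (16)$$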
[...] The first electrostatic potential field, Φ̃1(r), for a particular grid cell is assigned a nonzero numerical value when the grid cell is inside the computational domain 708, i.e., the grid cell is occupied, and zero otherwise. Similarly, the first charge density function, ρ1(r), for a particular grid cell is assigned a zero value for points outside of domain 710 and nonzero otherwise. In one embodiment, the first charge density function is nonzero only for points in domain 710 which are in a localized neighborhood of one of the atoms 704. In another embodiment, the specification for the localized neighborhoods is made based on the functional form for the kernel function locally defining the first charge density function with respect to each atom 704 in molecular subset 702. In one embodiment, only a significant fraction of the grid cell must lie within the appropriate domain in order for the grid cell to be considered occupied. In yet another embodiment, those occupied grid cells that, for the relevant discretized function, correspond to an absolute value with magnitude below a chosen numerical threshold are instead relabeled as unoccupied and the corresponding discretized function value is set to zero.
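The grid-cell assignment just described can be sketched in Python as follows. The sketch makes simplifying assumptions that are not from the specification: the grid is an n × n × n cube centered at the origin, a cell counts as occupied when its center lies inside the domain (rather than when a significant fraction of its volume does), and all names are hypothetical.

```python
def discretize(func, in_domain, n, spacing, threshold=0.0):
    """Sample func at the centers of an n x n x n Cartesian grid.

    Cells whose center is outside the domain, or whose sampled magnitude
    does not exceed the threshold, are treated as unoccupied (value 0.0).
    """
    half = (n - 1) / 2.0
    grid = {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                center = tuple((idx - half) * spacing for idx in (i, j, k))
                value = func(center) if in_domain(center) else 0.0
                if abs(value) <= threshold:
                    value = 0.0  # relabel as unoccupied
                grid[(i, j, k)] = value
    return grid


# Unit-valued function on a spherical domain of radius 1 (illustrative only).
cells = discretize(lambda p: 1.0, lambda p: sum(c * c for c in p) <= 1.0, n=3, spacing=1.0)
occupied = sum(1 for v in cells.values() if v != 0.0)
```

The same routine can be reused for both the electrostatic potential field and the charge density function by passing a different function and domain predicate.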
While Fig. 7 shows a two-dimensional cross-sectional view of the charge density and electrostatic potential volumetric functions for the molecular subset 702 as projected onto a 2-D Cartesian grid, those skilled in the art should understand that the principles described above are equally applicable to three-dimensional and higher multidimensional spaces, as well as to other coordinate based representations, where the phrase "coordinate based representation" generally refers to representing a function in terms of coordinates of a coordinate system.
In one embodiment, a Cartesian coordinate based representation is used where each grid cell in three dimensions is a cuboid. The cuboid grid cells in Fig. 7 with nonzero values for the first electrostatic potential field, Φ̃1, are illustrated by example as cells 718 denoting grid cells within the confines of the domain 708. The cuboid grid cells in Fig. 7 with nonzero values for the first charge density function, ρ1, are illustrated by example as cells 710 containing one or more, or even parts of, the dark spheres representing atoms within the confines of the molecular surface 702. In this way, values are assigned to Φ̃1 and ρ1 for each grid cell so that the first electrostatic potential field and the first charge density function for molecular subset 702 are represented as a set of numbers for the entirety of grid cells. In one embodiment, these values are real numbers and can range over (−∞, +∞). In another embodiment, these values are of finite precision. In yet another embodiment, these values are in a fixed-point representation.
The embodiments described above, in regard to the discretization of the first charge density function and the first electrostatic potential field of the first molecular subset, also apply to the discretization of the second charge density function and the second electrostatic potential field of the second molecular subset 652 in step 508.
In Fig. 5, in step 510, individual coordinate-based representations for the first and second molecular subsets are defined such that each molecular subset is represented in a coordinate system. A three-dimensional coordinate system is a systematic way of describing points in three-dimensional space using sets of three numbers (or points in a plane using pairs of numbers for a two-dimensional space). An individual coordinate-based representation of the first molecular subset 602 includes the first charge distribution and first electrostatic potential field of the first molecular subset 602, using a first coordinate system 906, as shown in Fig. 9. An individual coordinate-based representation of the second molecular subset 652 includes the second charge distribution and second electrostatic potential field of the second molecular subset, using a second coordinate system 908, also shown in Fig. 9.
In one embodiment, the individual coordinate based representations are defined using a spherical polar coordinate system. The spherical polar coordinate system is a three-dimensional coordinate system where the coordinates are as follows: a distance from the origin r, and two angles θ and φ found by drawing a line from the given point to the origin and measuring the angles formed with a given plane and a given line in that plane. Angle θ is taken as the polar (co-latitudinal) coordinate with θ ∈ [0, π] and angle φ is the azimuthal (longitudinal) coordinate with φ ∈ [0, 2π]. An illustration is provided in Fig. 10a.
In another embodiment, the individual coordinate based representations are defined using a cylindrical coordinate system (Fig. 10b), which is another three-dimensional coordinate system where the coordinates are described in terms of (r, θ, z), where r and θ are the radial and angular components on the (x, y) plane and the z component is measured along the z-axis coming out of the plane.
In yet another embodiment, the individual coordinate based representations are defined using a Cartesian coordinate system. The Cartesian coordinate system describes any point in three-dimensional space using three numbers, by using a set of three axes at right angles to one another and measuring distance along these axes. The three axes of three-dimensional Cartesian coordinates, conventionally denoted the x-, y-, and z-axes, are chosen to be linear and mutually perpendicular. In three dimensions, the coordinates can lie anywhere in the interval (−∞, +∞). An illustration is provided in Fig. 10c. Note that the diagrams in Figs. 10a, 10b, and 10c are excerpted from web pages available at Eric Weisstein's World of Mathematics on the worldwide web at http://mathworld.wolfram.com/.
For practical purposes of computation in software and/or hardware, the individual coordinate based representations are generally discrete in nature. The individual coordinate based representations are used to compute a reference set of basis expansion coefficients as described below.
A point in space can be represented in many different coordinate systems. It is possible to convert from one type of coordinate based representation to another using a coordinate transformation. A coordinate transformation relabels the coordinates from one coordinate system to another coordinate system. For example, the following equations represent the transformation between Cartesian and spherical polar coordinate systems: {x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ}, and thus a Cartesian coordinate based representation for the first electrostatic potential field for molecular subset 602 can be converted to a spherical polar coordinate based representation for the first electrostatic potential field for molecular subset 602 by applying an appropriate Cartesian to spherical coordinate transform.
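The transformation between Cartesian and spherical polar coordinates quoted above can be sketched in Python as follows. The function names are hypothetical, and mapping the azimuthal angle into [0, 2π) and returning zero angles at the origin are conventions chosen for this sketch.

```python
import math


def spherical_to_cartesian(r, theta, phi):
    """x = r sin(theta) cos(phi), y = r sin(theta) sin(phi), z = r cos(theta)."""
    return (
        r * math.sin(theta) * math.cos(phi),
        r * math.sin(theta) * math.sin(phi),
        r * math.cos(theta),
    )


def cartesian_to_spherical(x, y, z):
    """Inverse transform; theta is the polar angle, phi the azimuthal angle."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return (0.0, 0.0, 0.0)  # angles are undefined at the origin; zero by convention
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # clamp against rounding error
    phi = math.atan2(y, x) % (2.0 * math.pi)
    return (r, theta, phi)
```

Applying the two functions in succession returns the original point to within floating-point error, which is a convenient check when converting a sampled potential field from one representation to the other.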
In Fig. 5, also in step 510, the individual coordinate based representations 906 and 908 of the first and second molecular subsets are then placed in a joint coordinate system, as shown in Fig. 9. The joint coordinate system is used to represent distinct configurations of the molecular combination. The joint coordinate system is also used to generate new configurations by translating and/or rotating the respective individual coordinate based representations of molecular subsets 602 and 652 relative to one another, as described below. In one embodiment, the joint coordinate system is also used to transform a reference set of basis expansion coefficients for each molecular subset as part of a process to generate shape complementarity scores for each configuration of a molecular combination, as described below.
In one embodiment, a first three-dimensional Cartesian frame 920 is provided for the first molecular subset 602, and a second three-dimensional Cartesian frame 922 is provided for the second molecular subset, as shown in Fig. 9. Herein the term "Cartesian frame" generally refers to the unit vectors in the Cartesian coordinate system, as illustrated in Fig. 10c.
In Fig. 9, the first and second Cartesian frames 920 and 922 are centered at respective molecular centers 902 and 904 of the first and second molecular subsets. The molecular center is generally a point in 3-D space that is designated as the center of the molecular subset. In one embodiment, the molecular center is the geometric center of mass of the molecular subset. In another embodiment, the molecular center is the centroid of the molecular subset. An intermolecular axis 912 is defined as the vector between the molecular centers 902 and 904, and the z-axes of the respective Cartesian frames 920 and 922 are both aligned with the intermolecular axis 912.
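A minimal Python sketch of the two choices of molecular center, and of the intermolecular axis between two centers, is given below. The function names and the plain-tuple coordinate representation are hypothetical.

```python
import math


def centroid(coords):
    """Unweighted mean position of a list of (x, y, z) atom coordinates."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def center_of_mass(coords, masses):
    """Mass-weighted mean position of the atoms."""
    total = sum(masses)
    return tuple(sum(m * c[k] for c, m in zip(coords, masses)) / total for k in range(3))


def intermolecular_axis(center1, center2):
    """Return (R, unit vector) for the vector from center1 to center2."""
    delta = tuple(b - a for a, b in zip(center1, center2))
    separation = math.sqrt(sum(d * d for d in delta))
    if separation == 0.0:
        raise ValueError("molecular centers coincide; the axis is undefined")
    return separation, tuple(d / separation for d in delta)
```

The returned unit vector is the direction with which the z-axes of both Cartesian frames would be aligned, and the returned separation corresponds to the intermolecular distance between the two centers.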
In principle, any rotation in three dimensions may be described using three angles. The three angles giving the three rotation matrices are called Euler angles. There are several conventions for Euler angles, depending on the axes about which the rotations are carried out. In a common convention described in Fig. 11, the first rotation is by angle φ about the z-axis, the second is by angle θ ∈ [0, π] about the x-axis, and the third is by angle ψ about the z-axis (again). If the rotations are written in terms of rotation matrices B, C and D, then a general rotation A can be written as A = BCD where B, C, and D are shown below and A is obtained by multiplication of the three matrices.
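The composition A = BCD can be sketched in Python as follows. The sketch assumes the z-x-z ("x-convention") form with passive, coordinate-frame rotations, in which D carries the first rotation (φ about z), C the second (θ about x), and B the third (ψ about z). Sign conventions differ between references, so the matrices below should be read as one consistent choice rather than as the exact matrices of Fig. 11.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def euler_zxz(phi, theta, psi):
    """Return A = B C D for the z-x-z Euler angle convention (passive form)."""
    d = [[math.cos(phi), math.sin(phi), 0.0],
         [-math.sin(phi), math.cos(phi), 0.0],
         [0.0, 0.0, 1.0]]                         # first rotation: phi about z
    c = [[1.0, 0.0, 0.0],
         [0.0, math.cos(theta), math.sin(theta)],
         [0.0, -math.sin(theta), math.cos(theta)]]  # second rotation: theta about x
    b = [[math.cos(psi), math.sin(psi), 0.0],
         [-math.sin(psi), math.cos(psi), 0.0],
         [0.0, 0.0, 1.0]]                         # third rotation: psi about z
    return matmul(b, matmul(c, d))
```

Because each factor is orthogonal, the product A is orthogonal as well, so multiplying A by its transpose should give the identity matrix for any choice of angles.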
The diagrams in Fig. 11 are excerpted from web pages available at Eric Weisstein's World of Mathematics on the worldwide web at http://mathworld.wolfram.com/.
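The composition A = BCD can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names (rot_z, rot_x, euler_zxz) are hypothetical, D is taken as the first rotation (by φ about z), C as the second (by θ about x), and B as the third (by ψ about z), and the sign convention used is that of a rotation of the coordinate axes, which is one common textbook choice and may differ from the figure.

```python
import math

def matmul(p, q):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot_z(a):
    """Rotation by angle a about the z-axis (axes-rotation sign convention)."""
    c, s = math.cos(a), math.sin(a)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(a):
    """Rotation by angle a about the x-axis (axes-rotation sign convention)."""
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]

def euler_zxz(phi, theta, psi):
    """General rotation A = B C D with D = rot_z(phi), C = rot_x(theta), B = rot_z(psi)."""
    return matmul(rot_z(psi), matmul(rot_x(theta), rot_z(phi)))

A = euler_zxz(0.3, 1.1, -0.7)
A_transpose = [list(row) for row in zip(*A)]
should_be_identity = matmul(A, A_transpose)
```

Because each factor is orthogonal, the product A is orthogonal as well, which the last line checks numerically.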
Another commonly used convention for Euler angles is the well-known "roll, pitch, and yaw" convention encountered in aeronautics. Herein, the roll Euler angle is the Euler angle representing a rotation, α, about the z-axis, the pitch Euler angle is the Euler angle representing a rotation, β, about the y-axis, and the yaw Euler angle is the Euler angle representing a rotation, γ, about the x-axis.
In one embodiment, as shown in Fig. 9, R is the intermolecular separation between the first molecular subset 602 and the second molecular subset 652. β1 and β2 refer to pitch Euler angles representing rotation of each corresponding molecular subset in the x-z plane (i.e., around the y-axis). γ1 and γ2 refer to yaw Euler angles representing rotations in the y-z plane (i.e., around the x-axis). Therefore, (β1, γ1) are polar and azimuthal Euler angles describing the pitch and yaw of the first molecular subset 602 with respect to the joint coordinate system, (β2, γ2) are polar and azimuthal Euler angles describing the pitch and yaw of the second molecular subset 652 with respect to the joint coordinate system, and α2 is a twist Euler angle describing the roll of the second molecular subset 652 with respect to the intermolecular axis. In this way, a set of six coordinates, (R, β1, γ1, α2, β2, γ2), completely specifies the configuration of the molecular combination, i.e., the relative position and orientation of the molecular subsets.
For practical purposes of computation in software and/or hardware, the coordinate variables of the joint coordinate system, (R, β1, γ1, α2, β2, γ2), are generally sampled as a discrete set of values. In other embodiments, the joint coordinate system may be characterized by a different set of parameters other than (R, β1, γ1, α2, β2, γ2). For example, (α2, β2, γ2) may be any one of several sets of permissible Euler angles for molecular subset 652. In another example the angular parameters are not Euler angles. In yet another example, the parameters of the joint coordinate system are expressed in terms of translation and rotation operators, defined below, as applied to (μ, ν, φ) of a prolate spheroidal coordinate system for each molecular subset.
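A discretely sampled joint coordinate system can be pictured as the set of all combinations of the sampled values. The Python sketch below is illustrative only: the Configuration class, the function name, and the sample values are hypothetical, and it assumes the (R, β1, γ1, α2, β2, γ2) parameterization with pitch and yaw sampled jointly as pairs.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Configuration:
    """One point of the joint coordinate system (R, beta1, gamma1, alpha2, beta2, gamma2)."""
    R: float
    beta1: float
    gamma1: float
    alpha2: float
    beta2: float
    gamma2: float

def enumerate_configurations(axial, sphere1, sphere2, roll2):
    """Yield every configuration formed from the four discrete sample sets.

    axial   : allowed separations R
    sphere1 : allowed (beta1, gamma1) pairs for the first subset
    sphere2 : allowed (beta2, gamma2) pairs for the second subset
    roll2   : allowed roll angles alpha2 for the second subset
    """
    for R, (b1, g1), a2, (b2, g2) in product(axial, sphere1, roll2, sphere2):
        yield Configuration(R, b1, g1, a2, b2, g2)

# Hypothetical, very coarse sample sets.
axial = [4.0, 5.0]
sphere1 = [(0.0, 0.0), (1.0, 0.5)]
sphere2 = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5)]
roll2 = [0.0, 1.57]
configs = list(enumerate_configurations(axial, sphere1, sphere2, roll2))
```

The number of configurations is the product of the sizes of the four sample sets, which is why the sampling schemes discussed next control the overall cost.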
In Fig. 5, as part of the sampling scheme definitions in step 512, an axial sampling scheme is defined in step 514. The axial sampling scheme has a plurality of axial sample points representing a sequence of positions distributed along the intermolecular axis 912 in Fig. 9. As used herein, a "sample point" generally refers to one of a sequence of elements defining the domain of a discretized function, and "sampling scheme" generally refers to a scheme for selecting a sequence of sample points.
An "axial sampling scheme" is a scheme for selecting sample points along an axis or a line (i.e., "axial sample points") and thus provides for relative translation of the individual coordinate based representation 602 of the first molecular subset with respect to the coordinate based representation 652 of the second molecular subset. The allowed values of the intermolecular separation, R, are defined by the axial sampling scheme. In another embodiment, the axial sampling scheme is a regular sampling scheme, which involves selecting sample points at regular intervals. In one embodiment, the axial sampling scheme is an irregular sampling scheme, which involves selecting sample points at irregular intervals according to a nonlinear mapping.
In one embodiment, the endpoints for the axial sampling scheme can be set based on geometric analysis of the electrostatic potential computational domains 60S and 658 of both the first and second molecular subsets. In another embodiment, the geometric analysis constitutes a determination of a maximum radial extent of each molecular subset, and the endpoints of the axial sampling scheme for the first molecular subset 602 are set based on a function of the maximum radial extents of each molecular subset.
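Regular and irregular axial sampling, together with endpoints derived from the radial extents, can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the rule "upper endpoint equals the sum of the two maximum radial extents" is an assumed example of "a function of the maximum radial extents", and the quadratic mapping is just one possible nonlinear mapping.

```python
def axial_endpoints(max_extent_1, max_extent_2, r_min=0.0):
    """Assumed rule: separations run from r_min to the sum of the two extents."""
    return r_min, max_extent_1 + max_extent_2

def regular_axial_samples(r_start, r_end, n):
    """n >= 2 sample points at regular intervals, endpoints included."""
    step = (r_end - r_start) / (n - 1)
    return [r_start + i * step for i in range(n)]

def irregular_axial_samples(r_start, r_end, n, mapping=lambda t: t * t):
    """n >= 2 sample points at irregular intervals.

    `mapping` is a nonlinear, increasing map of [0, 1] onto [0, 1];
    the default t**2 places points more densely at small separations.
    """
    return [r_start + (r_end - r_start) * mapping(i / (n - 1)) for i in range(n)]

r_lo, r_hi = axial_endpoints(12.0, 8.0)
regular = regular_axial_samples(r_lo, r_hi, 5)
irregular = irregular_axial_samples(r_lo, r_hi, 5)
```

With these hypothetical extents, the regular scheme yields equally spaced separations, while the irregular scheme concentrates samples near close contact, where scores typically vary fastest.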
In Fig. 5, in step 516, a first spherical sampling scheme is defined for the first molecular subset 602. The first spherical sampling scheme has a plurality of spherical sample points representing a sequence of positions distributed on a surface of a first unit sphere centered on the molecular center of the first molecular subset. In one embodiment, the allowed values of the pitch and yaw Euler angles, (β1, γ1), for molecular subset 602 are defined by the first spherical sampling scheme.
In one embodiment, the first spherical sampling scheme is the Cartesian product of a regular sampling of the pitch Euler angle (β1) and a regular sampling of the yaw Euler angle (γ1), where the Cartesian product of two sets A and B is a set of the ordered pairs, {(a, b) | a ∈ A, b ∈ B}, and either set is allowed to be a single element set. This is an example of an irregular sampling scheme in that spherical sample points near the poles will be closer together than at or near the equator.
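The crowding of points near the poles under a Cartesian-product scheme can be checked numerically. The sketch below is an illustration under assumptions: the function names are hypothetical, pitch is treated as a polar angle in [0, π] and yaw as an azimuthal angle in [0, 2π), and the point on the unit sphere is obtained by the usual spherical-to-Cartesian formula.

```python
import math
from itertools import product

def regular_angles(start, stop, n, endpoint=True):
    """n regularly spaced angles; a single-element set is allowed."""
    if n == 1:
        return [start]
    denom = (n - 1) if endpoint else n
    return [start + (stop - start) * i / denom for i in range(n)]

def product_spherical_samples(n_pitch, n_yaw):
    """Cartesian product of regular pitch samples and regular yaw samples."""
    pitches = regular_angles(0.0, math.pi, n_pitch)
    yaws = regular_angles(0.0, 2.0 * math.pi, n_yaw, endpoint=False)
    return list(product(pitches, yaws))

def to_unit_vector(pitch, yaw):
    """Point on the unit sphere for a (polar, azimuthal) angle pair."""
    return (math.sin(pitch) * math.cos(yaw),
            math.sin(pitch) * math.sin(yaw),
            math.cos(pitch))

def yaw_spacing(pitch, n_yaw):
    """Chord distance between two yaw-adjacent sample points at a given pitch."""
    a = to_unit_vector(pitch, 0.0)
    b = to_unit_vector(pitch, 2.0 * math.pi / n_yaw)
    return math.dist(a, b)

samples = product_spherical_samples(5, 8)
near_pole = yaw_spacing(math.pi / 4.0, 8)
at_equator = yaw_spacing(math.pi / 2.0, 8)
```

The spacing at a pitch of π/4 is smaller than at the equator by a factor of sin(π/4), which is the non-uniformity described above.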
In another embodiment, the first spherical sampling scheme is defined via an icosahedral mesh covering the two-dimensional surface of a sphere, where "icosahedral mesh" refers to the projection of all vertices and face centers of a many-sided icosahedron onto a unit sphere. In this way an evenly spaced 2-D grid can be constructed on the surface of the sphere as shown in the illustration in Fig. 12. This is an example of a regular sampling scheme in that each spherical sample point corresponds to the center of a 2-D surface element of approximately the same surface area. Similar icosahedral-based regular spherical sampling schemes are discussed in ref [13].
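The coarsest version of such a mesh can be built directly from a regular icosahedron. The Python sketch below is a simplified illustration: it uses only the 12 vertices and 20 face centers of the base icosahedron with no further subdivision, and the function names are hypothetical. A finer mesh would subdivide each face before projecting.

```python
import math
from itertools import combinations

def icosahedron_vertices():
    """The 12 vertices (0, +-1, +-g), (+-1, +-g, 0), (+-g, 0, +-1), g = golden ratio."""
    g = (1.0 + math.sqrt(5.0)) / 2.0
    verts = []
    for a in (-1.0, 1.0):
        for b in (-g, g):
            verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    return verts

def normalize(v):
    """Project a point radially onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def icosahedral_samples():
    """Vertices and face centers of the icosahedron, projected onto the unit sphere."""
    verts = icosahedron_vertices()

    def adjacent(p, q):
        # With these coordinates every edge has length exactly 2.
        return abs(math.dist(p, q) - 2.0) < 1e-9

    faces = [t for t in combinations(verts, 3)
             if all(adjacent(p, q) for p, q in combinations(t, 2))]
    centers = [tuple(sum(v[i] for v in t) / 3.0 for i in range(3)) for t in faces]
    return [normalize(p) for p in verts + centers]

points = icosahedral_samples()
```

This gives 32 nearly evenly distributed directions, each of which could be converted to a (pitch, yaw) pair for use as a spherical sample point.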
A second spherical sampling scheme for the second molecular subset 652 can be constructed in the same manner as the first spherical sampling scheme. In this way, the allowed values of the pitch and yaw Euler angles, (β2, γ2), for molecular subset 652 are defined by this second spherical sampling scheme.
In Fig. 5, in step 518, an angular sampling scheme is defined for the second molecular subset 652. The angular sampling scheme has a plurality of angular sample points representing a sequence of positions distributed on a circumference of a unit circle orthogonal to the intermolecular axis 912 of Fig. 9 that connects the molecular centers of each molecular subset. The allowed values of the roll Euler angle, α2, for molecular subset 652 are defined by this angular sampling scheme. In one embodiment, the angular sampling scheme is a regular sampling scheme, representing intervals with uniform arc length. In another embodiment, the angular sampling scheme is an irregular sampling scheme.
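A regular angular sampling scheme amounts to dividing the unit circle into arcs of equal length. A minimal Python sketch follows; the function names and the choice of 12 samples are hypothetical.

```python
import math

def regular_roll_samples(n):
    """n roll angles in [0, 2*pi) separated by equal arcs of the unit circle."""
    return [2.0 * math.pi * i / n for i in range(n)]

def arc_lengths(angles):
    """Arc length between consecutive samples on the unit circle, wrapping around."""
    following = angles[1:] + [angles[0] + 2.0 * math.pi]
    return [b - a for a, b in zip(angles, following)]

rolls = regular_roll_samples(12)
arcs = arc_lengths(rolls)
```

On a unit circle the arc length equals the angular increment, so every entry of `arcs` is 2π/12.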
In Fig. 5, in step 520, a basis expansion with a corresponding set of basis functions is provided. The "basis expansion," as used herein, refers to decomposition of a general function into a set of coefficients, each representing projection onto a particular basis function. One can express this decomposition in a mathematical form, i.e., a general function in M dimensions, f(x), can be written in terms of a set of basis functions, B_i(x), as
\[ f(x) = \sum_{i=0}^{\infty} a_i \, B_i(x) \qquad \text{(Eqn. 18)} \]
where i ∈ {0, 1, 2, ..., ∞} refers to a specific basis function, x is a set of M coordinates, and each basis function, B_i, is generally one member of a set of M-dimensional functions in a function space such that any general function in the function space can be expressed as a linear combination of them with appropriately chosen coefficients. In Eqn. 18, a_i is the expansion coefficient associated with the i-th basis function, B_i.
The choice of basis expansion and hence the choice of a set of basis functions is often dictated by the choice of coordinate system for representation of the general function in question. Characteristics and/or underlying symmetries of the given function can also influence the choice of basis expansion.
For practical purposes of computation in software and/or hardware, the upper limit of the summation in Eqn. 18 has a finite value, N. This upper limit is referred to as the order of the basis expansion. This leads to the following mathematical form for the basis expansion:
\[ f(x) \approx \sum_{i=1}^{N} a_i \, B_i(x) \qquad \text{(Eqn. 19)} \]
and the plurality of expansion coefficients {a_1, a_2, a_3, ..., a_N} are known as a set of expansion coefficients.
Such an approximation, as in Eqn. 19, necessitates the existence of representation errors because the basis expansion is now of finite order. However, in general, if N is chosen to be sufficiently large, the representation errors will be small for all but the most intransigent of functions. In one embodiment, the order of the expansion is predetermined and is much larger than unity, e.g., N ≥ 30. In another embodiment, the order of the expansion is adaptively determined based on a preliminary quantitative analysis of representation errors for trial values of the expansion order, and may therefore be of different magnitude for different pairs of molecular subsets 602 and 652 based on the characteristics of their respective charge density function and electrostatic potential field.
In one embodiment, the basis expansion is an orthogonal basis expansion comprising a plurality of mutually orthogonal basis functions. If the basis functions satisfy the following mathematical condition, they are called mutually orthogonal:
\[ \int B_i(x) \, B_j(x) \, dx = C_{ij} \, \delta_{ij} \qquad \text{(Eqn. 20)} \]
where C_ij is a constant (not necessarily unity when i = j), δ_ij is the usual Kronecker delta, and the integral is over the entire M-dimensional space.
For an orthogonal basis expansion, an expansion coefficient, a_i, corresponding to a particular basis function, B_i, can be written as follows:
\[ a_i = \frac{1}{C_{ii}} \int f(x) \, B_i(x) \, dx \qquad \text{(Eqn. 21)} \]
where C_ii is a constant.
However, once again for the practical purposes of computation, the expansion coefficients are discretized by converting the integral in Eqn. 21 to a finite summation. In the case of a set of expansion coefficients for an orthogonal basis expansion, the discretized expansion coefficient, a_i, for an orthonormal basis function, B_i, takes the following form:
where the summation is over the discrete points c, i.e., x_c is a sample point in the M-dimensional space represented here by x.
In another embodiment, the basis expansion is an orthonormal basis expansion comprising a plurality of mutually orthonormal basis functions. If the basis functions are mutually orthogonal and in Eqn. 22, C_ii is unity for all relevant basis functions, then the basis functions are said to be mutually orthonormal. This similarly simplifies the expressions for a_i in eqns. 21 and 22.
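The interplay between discretized coefficients, finite order, and representation error can be seen in a one-dimensional toy case. The Python sketch below is only an illustration: it assumes an orthonormal cosine basis on [0, 1] (not a basis used in this description), a midpoint-rule discretization with cell width dx standing in for the volume element, and hypothetical function names; `order` here counts basis functions starting from index 0.

```python
import math

def basis(i, x):
    """Orthonormal cosine basis on [0, 1]: B_0 = 1, B_i = sqrt(2) cos(i pi x)."""
    return 1.0 if i == 0 else math.sqrt(2.0) * math.cos(i * math.pi * x)

def coefficients(f, order, cells=2000):
    """Discretized projection: a_i = sum over cells of f(x_c) B_i(x_c) dx."""
    dx = 1.0 / cells
    centers = [(c + 0.5) * dx for c in range(cells)]
    return [sum(f(x) * basis(i, x) * dx for x in centers) for i in range(order)]

def reconstruct(coeffs, x):
    """Finite-order expansion evaluated at x."""
    return sum(a * basis(i, x) for i, a in enumerate(coeffs))

def max_error(f, coeffs, probes=201):
    """Largest absolute representation error over a set of probe points."""
    xs = [k / (probes - 1) for k in range(probes)]
    return max(abs(f(x) - reconstruct(coeffs, x)) for x in xs)

def choose_order(f, tolerance, trial_orders=(2, 4, 8, 16, 32)):
    """Adaptive choice: smallest trial order whose error meets the tolerance."""
    for n in trial_orders:
        if max_error(f, coefficients(f, n)) <= tolerance:
            return n
    return trial_orders[-1]

def f(x):
    return x * (1.0 - x)

err_low = max_error(f, coefficients(f, 4))
err_high = max_error(f, coefficients(f, 16))
```

Raising the order from 4 to 16 reduces the maximum error for this test function, and `choose_order` mirrors the idea of selecting the order from a preliminary error analysis over trial values.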
A general 3-D function in spherical polar coordinates can be represented in terms of a radial/spherical harmonics basis expansion comprising a plurality of basis functions, each basis function defined as the product of one of a set of orthonormal radial basis functions, R_nl(r), and one of a set of real-valued spherical harmonics basis functions, y_l^m(θ, φ), as follows:
\[ f(r, \theta, \phi) = \sum_{n=1}^{N} \sum_{l=0}^{n-1} \sum_{m=-l}^{l} a_{nlm} \, R_{nl}(r) \, y_l^m(\theta, \phi) \qquad \text{(Eqn. 23)} \]
where {a_nlm} is the set of radial / spherical harmonics expansion coefficients, (r, θ, φ) are the spherical coordinates of a point in 3-D space, n ∈ [1, N], integer, l ∈ [0, n-1], integer, and m ∈ [-l, l], integer.
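The index ranges determine how many coefficients a given order requires. The short Python sketch below enumerates the (n, l, m) triples, assuming n runs from 1 to N inclusive; the function name is hypothetical.

```python
def nlm_indices(order):
    """All (n, l, m) with 1 <= n <= order, 0 <= l <= n - 1, -l <= m <= l."""
    return [(n, l, m)
            for n in range(1, order + 1)
            for l in range(0, n)
            for m in range(-l, l + 1)]

indices = nlm_indices(30)
```

Each n contributes n^2 triples, so the total is N(N+1)(2N+1)/6, which is 9455 coefficients for N = 30.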
The usage of such an expansion is common practice in the quantum mechanical description of numerous atomic and molecular orbitals. Hence the indices n, l, and |m| ≥ 0 are often respectively referred to as the principal quantum number, angular quantum (or orbital) number and azimuthal (or magnetic moment) quantum number.
In Eqn. 23, each radial basis function, R_nl(r), is a 1-D orthonormal basis function depending solely on the radius, r.
The form for the radial basis functions is often chosen based on the problem at hand, e.g., the scaled hydrogen atom radial wave function in quantum mechanics is based for example on associated Laguerre polynomials (Arfken et al.) as follows:
where the square root term is the normalization factor, ρ is the scaled distance, ρ = r^2/k, k is the scaling parameter, Γ is the gamma function, and L() are the associated Laguerre polynomials; where a general Laguerre polynomial is a solution to the Laguerre differential equation given by:
and the associated Laguerre polynomials themselves are given explicitly as follows:
according to Rodrigues' formula.
Various radial basis functions can be used in accordance with embodiments of the present invention. In one embodiment, the radial basis functions include the scaled Laguerre polynomial-based functions of Eqn. 24. In another embodiment, the radial basis functions include unscaled forms of Eqn. 24 in terms of r (not ρ) and without the normalization constants. In yet another embodiment the radial basis functions include a Bessel function of the first kind (J_n(r)). In yet another embodiment the radial basis functions include a Hermite polynomial function (H_n(r)). In other embodiments, the radial basis functions include any mutually orthonormal set of basis functions that depend on the radius in a spherical coordinate system centered on the respective molecular center of the molecular subset in question.
In Eqn. 23, each real-valued spherical harmonic basis function, y_l^m(θ, φ), is a 2-D orthonormal basis function depending on the angular variables (θ, φ) of a spherical coordinate system centered on the molecular center of each molecular subset. Spherical harmonics satisfy a spherical harmonic differential equation, representing the angular part of Laplace's equation in a spherical coordinate system:
The spherical harmonics themselves are complex-valued, separable functions of θ and φ, and are given in terms of an associated Legendre polynomial, P_l^m(x), by the equation,
where the associated Legendre polynomial is given by:
Spherical harmonics can be used to represent 3-D molecular shapes in the context of molecular docking, as described in Kita II. An illustration of the first few spherical harmonics is shown in Fig. 13, in terms of their amplitudes.
Real-valued spherical harmonics functions, y_l^m(θ, φ), are defined as:
where the integral is over the extent of function f(r, θ, φ) in spherical coordinates and dV is a differential volume element in spherical coordinates.
The discretized analog of the expansion coefficients in Eqn. 31 is given by:
\[ a_{nlm} = \sum_{c} f(r_c, \theta_c, \phi_c) \, R_{nl}(r_c) \, y_l^m(\theta_c, \phi_c) \, \Delta V_c \qquad \text{(Eqn. 32)} \]
where the summation is over all grid cells, c, and (r_c, θ_c, φ_c) are the spherical coordinates of the center of grid cell c and ΔV_c is the volume of grid cell c. Eqn. 32 may then be used to represent various volumetric functions associated with a given molecular subset via corresponding sets of expansion coefficients.
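The sum over grid cells maps directly onto a loop. The Python sketch below is a minimal illustration, assuming each grid cell is supplied as a tuple (r, θ, φ, function value, cell volume) and that the radial and angular basis functions are passed in as callables; the function names and the toy data are hypothetical. The only basis function written out is the l = 0 real spherical harmonic, which is the constant 1/sqrt(4π).

```python
import math

def expansion_coefficient(cells, radial, angular):
    """Discretized coefficient: sum of f * R_nl(r) * y_lm(theta, phi) * dV over cells.

    cells   : iterable of (r, theta, phi, value, volume), one entry per grid cell
    radial  : callable r -> R_nl(r) for the chosen (n, l)
    angular : callable (theta, phi) -> y_lm(theta, phi) for the chosen (l, m)
    """
    return sum(value * radial(r) * angular(theta, phi) * volume
               for r, theta, phi, value, volume in cells)

def y00(theta, phi):
    """The l = 0, m = 0 real spherical harmonic, a constant."""
    return 1.0 / math.sqrt(4.0 * math.pi)

# Two hypothetical grid cells and a placeholder radial function equal to 1.
toy_cells = [(1.0, 0.5, 0.2, 2.0, 0.5),
             (2.0, 1.0, 3.0, 4.0, 0.25)]
a_100 = expansion_coefficient(toy_cells, lambda r: 1.0, y00)
```

For a real calculation the placeholder radial function would be replaced by the chosen orthonormal R_nl, and the loop would be repeated for every (n, l, m) up to the expansion order.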
Now instead, in one embodiment of the present invention, the following variants of Eqn. 32 that are displayed in eqns. 33 and 34 are used in step 520 of Fig. 5 to respectively describe the first electrostatic potential field and the first charge density function of molecular subset 602 in terms of corresponding sets of expansion coefficients:
\[ \varphi^{1}_{nlm} = \sum_{c} \varphi_1(r_c, \theta_c, \phi_c) \, R_{nl}(r_c) \, y_l^m(\theta_c, \phi_c) \, \Delta V_c \qquad \text{(Eqn. 33)} \]
\[ \rho^{1}_{nlm} = \sum_{c} \rho_1(r_c, \theta_c, \phi_c) \, R_{nl}(r_c) \, y_l^m(\theta_c, \phi_c) \, \Delta V_c \qquad \text{(Eqn. 34)} \]
where again the summation is over all grid cells, c, (r_c, θ_c, φ_c) are the spherical coordinates of the center of grid cell c and ΔV_c is the volume of grid cell c. But now φ1(r_c, θ_c, φ_c) and ρ1(r_c, θ_c, φ_c) are respectively the coordinate based representations based on a spherical polar coordinate system for the first electrostatic potential field and the first charge density function for molecular subset 602, and {φ^1_nlm} and {ρ^1_nlm} are corresponding sets of expansion coefficients for an initial pose of molecular subset 602, herein designated as reference sets of expansion coefficients for molecular subset 602.
In one embodiment of the present invention, the grid cells in Eqn. 32 are cuboids from a Cartesian coordinate system and the spherical 3-tuples (r_c, θ_c, φ_c) are converted 'on-the-fly' to Cartesian 3-tuples (x_c, y_c, z_c) using a suitable coordinate transformation, for easy addressing of the function, f, over a lattice representation stored in a computer-readable memory. In another embodiment the grid cells can be of varying volume, ΔV_c. In yet another embodiment the grid cells represent small volumes in a spherical coordinate system. In yet another embodiment the grid cells represent small volumes in a cylindrical coordinate system.
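The on-the-fly conversion and lattice addressing can be sketched as follows. This is an illustration under assumptions: θ is taken as the polar angle measured from the z-axis and φ as the azimuthal angle, the lattice is a uniform cubic grid described by a hypothetical origin and spacing, and the function names are not taken from the description.

```python
import math

def spherical_to_cartesian(r, theta, phi):
    """(r, theta, phi) -> (x, y, z), theta polar from +z, phi azimuthal from +x."""
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))

def cartesian_to_spherical(x, y, z):
    """Inverse transformation, with phi returned in [0, 2*pi)."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0.0 else 0.0
    phi = math.atan2(y, x) % (2.0 * math.pi)
    return r, theta, phi

def lattice_index(point, origin, spacing):
    """Integer (i, j, k) address of the cuboid cell that contains `point`."""
    return tuple(int(math.floor((p - o) / spacing)) for p, o in zip(point, origin))

xyz = spherical_to_cartesian(2.0, math.pi / 2.0, math.pi / 2.0)
cell = lattice_index(xyz, origin=(-4.0, -4.0, -4.0), spacing=0.5)
back = cartesian_to_spherical(*xyz)
```

The integer triple returned by `lattice_index` is what would be used to address a stored array of function values, and the inverse transformation recovers the spherical coordinates when they are needed again.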
Similarly, in step 520 of Fig. 5, the coordinate based representations for the second electrostatic potential field and the second charge density function for molecular subset 652, respectively φ2 and ρ2, are represented via use of eqns. 33 and 34 in terms of sets of expansion coefficients, namely {φ^2_nlm} and {ρ^2_nlm}, for an initial configuration of molecular subset 652, also designated herein as reference sets of expansion coefficients but now for molecular subset 652.
Thus in step 520 of Fig. 5, altogether there are four sets of reference expansion coefficients computed for the two molecular subsets 602 and 652, each corresponding to an electrostatic potential field or a charge density function for one of the molecular subsets. Once again, the reference sets for molecular subset 602 are designated as [...] {f(r_c, θ_c, φ_c)} for all grid cells c (both occupied and unoccupied) in a coordinate based representation for the first electrostatic potential field of molecular subset 602 are converted to a stream or an array of Cartesian-based values {f(x_c, y_c, z_c)} via a suitable coordinate transformation and stored on a computer-readable and recordable medium for future retrieval. Later, when the stored values are to be used in the context of Eqn. 33 to compute expansion coefficients, the stored values are first retrieved and then converted back into {f(r_c, θ_c, φ_c)} by an inverse coordinate transformation.
Similarly the values corresponding to the coordinate based representation for the second electrostatic potential field for molecular subset 652 can be precomputed, stored and retrieved in a similar manner, including the use of coordinate transformations to convert back and forth from a spherical polar coordinate based representation to a Cartesian coordinate based representation, more suitable for storage in a computer-addressable memory. The same embodiments can be applied to the charge density functions of either molecular subset in order to facilitate efficient computation of the reference sets of expansion coefficients {ρ^1_nlm} and {ρ^2_nlm} as per Eqn. 34.
In another embodiment, in the context of a hardware implementation of the invention, the possibly expensive precomputation step of {
0 is directly applied to the reference sets of expansion coefficients {ρ^1_nlm} and {φ^1_nlm} for the first charge density function and the first electrostatic potential field for molecular subset 602 according to the following rule:
where the resulting quantities are the new translated sets of expansion coefficients for the first charge density function and the first electrostatic potential field for molecular subset 602, (n, l, m) are quantum numbers for the new translated expansion coefficient, (n', l', m') are quantum numbers for one of the set of reference expansion coefficients, the summation is over all possible values of n' and l', δ_mm' is the standard Kronecker delta, and K_{nn'll'|m|} are matrix elements of a translation matrix function with values equal to resultant overlap integrals between two different basis functions of the radial / spherical harmonics expansion, with quantum numbers n, l, m and n', l', m', respectively, separated by a distance R, and which are nonzero only when m = m'.
The exact form for the translation matrix K_{nn'll'|m|} depends on the choice of radial basis functions used in Eqn. 23. Eqn. 35 has been used previously to efficiently derive a new set of translated expansion coefficients from a reference set of expansion coefficients as described in Danos, M., and Maximon, L. C., "Multipole matrix elements of the translation operator", J. Math. Phys., 6(1), 766-778, 1965; and Talman, J. D., "Special Functions: A Group Theoretical Approach", W. A. Benjamin Inc., New York, 1968; all of which are hereby incorporated by reference in their entirety.
In one embodiment, the entire set of translation matrix elements, K_{nn'll'|m|}, may be precomputed for all relevant values of n, n', l, l', and m for a finite order of expansion, N, and stored on a computer-readable and recordable medium for future retrieval as needed. This is advantageous since calculation of the overlap integrals which define each translation matrix element can be very costly and yet for a given finite order of expansion and a given form for the radial basis functions used in Eqn. 23, the calculations need to be done only once and the resultant K-matrix is applicable to any molecular subset regardless of size and shape. Moreover, for large values of N, the number of translation matrix elements is O(N^5).
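Applying a precomputed translation matrix, and the growth of its size with N, can be sketched in Python. This is an illustration under assumptions: coefficients are stored in a dictionary keyed by (n, l, m), the K-matrix for one separation R is a hypothetical dictionary keyed by (n, l, n', l', |m|) whose values would in practice come from the overlap integrals, and the function names are not taken from the description. The Kronecker delta is implemented by skipping reference coefficients with m' different from m.

```python
def apply_translation(coeffs, k_matrix):
    """Translate a reference coefficient set using a K-matrix for one separation R.

    coeffs   : {(n, l, m): value}
    k_matrix : {(n, l, n2, l2, abs_m): value}; missing entries are treated as zero
    """
    translated = {}
    for (n, l, m) in coeffs:
        total = 0.0
        for (n2, l2, m2), value in coeffs.items():
            if m2 != m:  # Kronecker delta: only m' == m contributes
                continue
            total += k_matrix.get((n, l, n2, l2, abs(m)), 0.0) * value
        translated[(n, l, m)] = total
    return translated

def count_translation_elements(order):
    """Number of K elements over n, n', l, l', |m| with |m| <= l, l' and l < n, l' < n'."""
    count = 0
    for abs_m in range(order):
        pairs = sum(1 for n in range(1, order + 1) for l in range(abs_m, n))
        count += pairs * pairs
    return count

# Sanity check with an identity K-matrix (zero translation in this toy model).
reference = {(1, 0, 0): 0.5, (2, 0, 0): -1.0, (2, 1, 1): 2.0, (2, 1, -1): 3.0}
identity_k = {(n, l, n, l, abs(m)): 1.0 for (n, l, m) in reference}
unchanged = apply_translation(reference, identity_k)
growth = count_translation_elements(40) / count_translation_elements(20)
```

Doubling the order from 20 to 40 multiplies the element count by roughly 28, approaching the factor of 2^5 = 32 expected for fifth-power growth as N increases.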
In Fig. 5, in step 524, the method continues with providing a first rotation operator representing rotation (change of orientation) of the coordinate based representation 906 of the first charge density function and the first electrostatic potential field for the first molecular subset 602 with respect to the Cartesian frame 920 co-located with the molecular center 902 of molecular subset 602 in the joint coordinate system of Fig. 9. Note in one embodiment of the present invention, the coordinate based representations for the first charge density function and the first electrostatic potential field of molecular subset 602 are rotated together in that they do not rotate relative to one another.
The term "rotation operator" generally refers to an operator that, when applied to a point, results in the point's rotation about an axis as defined by the rotation operator. The operator can be applied to any collection of points, e.g., a line, a curve, a surface, or a volume. As described with regard to Eqn. 17, any rotation in 3-D can be represented by a set of three Euler angles.
In one embodiment, different orientations of the coordinate based representation 906 of the first charge density function and the first electrostatic potential field for the first molecular subset 602 with respect to the Cartesian frame 920 are generally represented by a set of three Euler angles representing roll (α1), pitch (β1), and yaw (γ1), as shown in Fig. 9. In another embodiment the roll angle (α1), describing rotation with respect to the z-axis of the Cartesian frame 920, is ignored since with respect to the common z-axis of the joint coordinate system only the relative orientation between the two molecular subsets is relevant. Then the orientation of molecular subset 602 with respect to Cartesian frame 920 is fully described by a pair of Euler angles (β1, γ1). In another embodiment the angles need not be Euler angles and in fact depend on the choice of joint coordinate system.
In one embodiment, the first rotation operator is a matrix function of (α1, β1, γ1). Then the first rotation operator can be directly applied to the set of reference expansion coefficients for the first charge density function and the first electrostatic potential field of molecular subset 602. In one embodiment, using the joint (R, β1, γ1, α2, β2, γ2) coordinate system of Fig. 9, in conjunction with the radial/spherical harmonics expansion of Eqn. 23, the first rotation operator representing the rotation of the coordinate based representation of the first charge density function and the first electrostatic potential field for molecular subset 602 from (α1 = 0, β1 = 0, γ1 = 0) to arbitrary (α1, β1, γ1) is directly applied to the reference sets of expansion coefficients {ρ^1_nlm} and {φ^1_nlm} for the first charge density function and the first electrostatic potential field for molecular subset 602 according to the following rule:
where the resulting quantities are the new rotated sets of expansion coefficients, (n, l, m) are the quantum numbers for the new rotated expansion coefficient, m' denotes the magnetic moment quantum number for the set of reference expansion coefficients, the summation is over all possible values of m', and R^l_mm' are matrix elements of a block diagonal matrix such that each R^l denotes a (2l+1)×(2l+1) block sub matrix.
This property that the harmonic expansion coefficients transform amongst themselves under rotation in a similar way in which rotations transform the (x, y, z) coordinates in a Cartesian frame was first presented in the context of molecular shapes by Leicester, S. E., Finney, J. L., and Bywater, R. P., in "A Quantitative Representation of Molecular-Surface Shape. I. Theory and Development of the Method" (1994), J. Mathematical Chemistry, 16(3-4), 315-341; all of which is hereby incorporated by reference in its entirety.
For an arbitrary Euler rotation with angles (α, β, γ) and for a pair of positive magnetic moment quantum numbers, m and m', the individual matrix elements are computable in terms of Wigner rotation matrix elements, d^l_mm'(β), as follows:
where d^l_mm'(β), the elements of the Wigner rotation matrix, are given by:
with k1 = max(0, m - m'), k2 = min(l - m', l + m), and C(l, m, k) being a constant function. Similar forms exist for the other eight possible signed pairs of m and m'. For further details on Wigner matrix elements, refer to Su, Z., and Coppens, P., J. Applied Crystallography, 27, 89-91 (1994); all of which is hereby incorporated by reference in its entirety. In one embodiment, where α1 = 0 for all rotations of molecular subset 602, Eqn. 37 simplifies and the R^l_mm' matrix elements are functions of (β1, γ1) alone.
For basis expansions other than the radial / spherical harmonics expansion of Eqn. 23, eqns. 36 and 37 will be replaced by appropriate analogs depending on the choice of angular basis functions, with suitable indices representing each basis function.
In Fig. 5, also in step 524, the method continues with providing a second rotation operator representing rotation of the coordinate based representation 908 of the second charge density function and the second electrostatic potential field of molecular subset 652 with respect to the Cartesian frame 922 co-located with the molecular center 904 of molecular subset 652 in the joint coordinate system. Note in one embodiment of the present invention, the coordinate based representations for the second charge density function and the second electrostatic potential field of molecular subset 652 are rotated together in that they do not rotate relative to one another.
As with the first molecular subset 602, different orientations of the coordinate based representation 908 of the second charge density function and the second electrostatic potential field for the second molecular subset 652 with respect to the Cartesian frame 922 are generally represented by a set of three Euler angles representing roll (α2), pitch (β2), and yaw (γ2), as shown in Fig. 9. In another embodiment the angles need not be Euler angles and in fact depend on the choice of joint coordinate system.
In one embodiment, the second rotation operator is a matrix function of (α2, β2, γ2). Then the second rotation operator can be directly applied to the set of reference expansion coefficients for the second charge density function and the second electrostatic potential field of molecular subset 652 in a manner similar to the application of the first rotation operator to the set of reference expansion coefficients for the first charge density function and the first electrostatic potential field of molecular subset 602.
In one embodiment, the matrix function representing the second rotation operator can be split up into two distinct rotation operators, the first being a function of (β2, γ2) alone (i.e., α2 = 0) and the second being a function of the roll Euler angle, α2, alone (i.e., β2 = 0, γ2 = 0). Thus either of these two rotation operators can be applied first to the reference set of expansion coefficients in order to obtain an intermediate rotated set of coefficients and the remaining operator then applied in succession in order to generate a final resultant set of rotated coefficients. In such an embodiment, the two rotation operators are designated as the second and third rotation operators in order to avoid confusion regarding the first rotation operator for molecular subset 602. Moreover, in this embodiment, when used in conjunction with the radial / spherical harmonics expansion of Eqn. 23, eqns. 36 and 37 can be applied for determining the result of application of each rotation operator to the second molecular subset 652, in which case the application of the third rotation operator reduces to simple multiplication by constants and sines and cosines of the quantity (m'α2).
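The reduction of a pure roll to sines and cosines can be sketched for real spherical harmonic coefficients. The Python below is an illustration under stated assumptions, not the rule of eqns. 36 and 37: it assumes real harmonics whose azimuthal dependence is cos(mφ) for m > 0 and sin(|m|φ) for m < 0, an active rotation of the function by α about the z-axis, and hypothetical names. Other conventions change the signs of the sine terms.

```python
import math

def apply_roll(coeffs, alpha):
    """Rotate real-harmonic coefficients {(n, l, m): value} by alpha about z.

    Each pair (m, -m) mixes through cos(|m| alpha) and sin(|m| alpha);
    coefficients with m = 0 are unchanged.
    """
    keys = set(coeffs) | {(n, l, -m) for (n, l, m) in coeffs}
    rotated = {}
    for (n, l, m) in keys:
        value = coeffs.get((n, l, m), 0.0)
        partner = coeffs.get((n, l, -m), 0.0)
        k = abs(m)
        c, s = math.cos(k * alpha), math.sin(k * alpha)
        if m > 0:      # cosine-type component
            rotated[(n, l, m)] = value * c - partner * s
        elif m < 0:    # sine-type component
            rotated[(n, l, m)] = partner * s + value * c
        else:
            rotated[(n, l, m)] = value
    return rotated

start = {(1, 0, 0): 1.0, (2, 1, 1): 2.0}
quarter_turn = apply_roll(start, math.pi / 2.0)
round_trip = apply_roll(quarter_turn, -math.pi / 2.0)
```

Because only coefficients sharing n, l, and |m| mix, the cost of a roll is linear in the number of coefficients, which is what makes it attractive to apply this operator last or to defer it.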
In another embodiment, similar to that described in Kita II, the simplified form for the third rotation operator permits direct application of the third rotation operator to electrostatic affinity scores themselves, as described below, as opposed to intermediate rotated expansion coefficients for the second charge density function and the second electrostatic potential field associated with the second molecular subset 652.
In Fig. 5, in step 526, after the translation operators are defined, sets of translated expansion coefficients are constructed for the first molecular subset 602 from the sets of reference expansion coefficients for the first charge density function and the first electrostatic potential field of molecular subset 602. The term "translated expansion coefficients" generally refers to a set of expansion coefficients obtained by applying a translation operator to another set of expansion coefficients.
As discussed above, step 514 provides for an axial sampling scheme comprised of axial sample points, which delimit the allowed values of the intermolecular separation, R, in Fig. 9 as applied to the relative translation of the two molecular subsets. In order to account for all allowed relative translations of the two molecular subsets, it is necessary to compute a set of translated expansion coefficients for both the first charge density function and the first electrostatic potential field of the first molecular subset 602, corresponding to each distinct axial sample point, R_i, in the axial sampling scheme.
As discussed above, this is accomplished via direct application of a translation operator in the form of a matrix multiplication to the reference sets of expansion coefficients for the first molecular subset 602. In one embodiment, where the radial / spherical harmonics expansion of Eqn. 23 is utilized, Eqn. 35 governs the construction of the translated sets of expansion coefficients for all axial sample points. Any and all permutations of the order in which axial sample points are visited are permitted, so long as in the end the construction is completed for all axial sample points.
In another embodiment, molecular subset 602 is held fixed, and the translation operator is directly applied instead to the reference sets of expansion coefficients for the second charge density function and the second electrostatic potential field of the second molecular subset 652. Since only relative translation of the two molecular subsets is meaningful, it is necessary to apply the translation operator to the coordinate based representations for ρ and φ for only one of the two molecular subsets.
In Fig. 5, in step 528, after the rotation operators are defined, sets of rotated expansion coefficients are constructed for the second molecular subset 652 from the sets of reference expansion coefficients for the second charge density function and the second electrostatic potential field of molecular subset 652. The term "rotated expansion coefficients" generally refers to a set of expansion coefficients obtained by applying a rotation operator to another set of expansion coefficients. As discussed above, step 516 provides for a second spherical sampling scheme comprised of spherical sample points which delimit the allowed values of the pitch and yaw Euler angles, (β2, γ2), in Fig. 9 as applied to orientation of the second molecular subset 652. Also as discussed above, step 518 provides for an angular sampling scheme comprised of angular sample points which delimit the allowed values of the roll Euler angle, α2, in Fig. 9 as applied to rotation of the second molecular subset 652 with respect to the joint z-axis.
In order to account for all allowed orientations of the second molecular subset 652, it is necessary to compute a set of rotated expansion coefficients for the second charge density function and the second electrostatic potential field of the second molecular subset 652, corresponding to each distinct angular sample point, α2i, in the angular sampling scheme and each distinct spherical sample point, (β2j, γ2k), in the second spherical sampling scheme, i.e., (α2i, β2j, γ2k) ∈ the Cartesian product of the angular sampling scheme and the second spherical sampling scheme.
As discussed above, this computation is accomplished via direct application of a rotation operator in the form of a matrix multiplication to the reference sets of expansion coefficients for the second molecular subset 652. In one embodiment, where the radial / spherical harmonics expansion of Eqn. 23 is utilized, Eqn. 35 governs the construction of the rotated expansion coefficients for each (α2i, β2j, γ2k) ∈ Cartesian product of the angular sampling scheme and the second spherical sampling scheme. Any and all permutations of the order in which the (α2i, β2j, γ2k) are visited are permitted, so long as in the end the construction is completed for all permitted (α2i, β2j, γ2k). Also as discussed above, in one embodiment the construction can be accomplished by two distinct rotation operators, the first a function of the pitch and yaw Euler angles, (β2, γ2), and the second a function solely of the roll Euler angle, α2. Moreover, in another embodiment, the latter operator (designated previously as the "third rotation operator") can be deferred until generation of the electrostatic energy affinity, as described below.
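The enumeration described above can be sketched in a few lines. This is a minimal illustration only: the sample values, the variable names, and the helper `enumerate_orientations` are hypothetical, and the point is simply that every (α2, β2, γ2) triple of the Cartesian product is visited exactly once regardless of the visiting order.

```python
from itertools import product
import random

# Hypothetical sample values in degrees; real schemes would be far denser.
angular_samples = [0.0, 90.0, 180.0, 270.0]                   # roll, alpha2
spherical_samples = [(0.0, 0.0), (90.0, 0.0), (90.0, 120.0)]  # (beta2, gamma2)


def enumerate_orientations(angular, spherical, shuffle=False, seed=0):
    """Return every (alpha2, beta2, gamma2) triple in the Cartesian product."""
    triples = [(a, b, g) for a, (b, g) in product(angular, spherical)]
    if shuffle:
        # Any permutation of the visiting order is allowed.
        random.Random(seed).shuffle(triples)
    return triples


in_order = enumerate_orientations(angular_samples, spherical_samples)
shuffled = enumerate_orientations(angular_samples, spherical_samples, shuffle=True)

# Both orders cover the same 4 x 3 = 12 orientations.
print(len(in_order), set(in_order) == set(shuffled))
```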
In Fig. 5, in step 529, sets of transformed expansion coefficients are constructed for the first molecular subset 602 from the sets of translated expansion coefficients generated in Fig. 5, step 524, for the first charge density function and the first electrostatic potential field of molecular subset 602. The term "transformed expansion coefficients" generally refers to a set of expansion coefficients obtained by applying an operator representing an arbitrary linear transformation on another set of expansion coefficients. This linear transformation may be the composition of one or more translation and / or rotation operators.
As discussed above, step 516 provides for a first spherical sampling scheme comprised of spherical sample points which delimit the allowed values of the pitch and yaw Euler angles, (β1, γ1), in Fig. 9 as applied to orientation of the first molecular subset 602. As discussed above, in regard to step 524, each set of translated expansion coefficients corresponds to an axial sample point of an axial sampling scheme which delimits the allowed values of the intermolecular separation, R, in Fig. 9 as applied to the relative translation of the two molecular subsets. In order to account for all allowed configurations (both orientations and relative translation) of the first molecular subset 602, it is necessary to compute a set of transformed expansion coefficients for both the first charge density function and the first electrostatic potential field of the first molecular subset 602, corresponding to each distinct axial sample point, Ri, in the axial sampling scheme and each distinct spherical sample point, (β1j, γ1k), in the first spherical sampling scheme, i.e., (Ri, β1j, γ1k) ∈ Cartesian product of the axial sampling scheme and the first spherical sampling scheme.
As discussed above, this computation is accomplished via direct application of a first rotation operator in the form of a matrix multiplication to the translated sets of expansion coefficients of step 524 for the first molecular subset 602. In one embodiment, where the radial / spherical harmonics expansion of Eqn. 23 is utilized, a variant of Eqn. 35 governs the construction of the transformed expansion coefficients for the charge density function and the electrostatic potential field in terms, respectively, of the corresponding translated expansion coefficients, for each (Ri, β1j, γ1k) ∈ Cartesian product of the axial sampling scheme and the first spherical sampling scheme. Any and all permutations of the order in which the (Ri, β1j, γ1k) are visited are permitted, so long as in the end the construction is completed for all permitted (Ri, β1j, γ1k).
Due to commutativity, the transformed coefficients for the first molecular subset 602 can be generated by the application of the first rotation operator and the translation operator in any order. Operations are "commutative" if the order in which they are done does not affect the results of the operations. In one embodiment, as the first rotation operator commutes with the translation operator, the first rotation operator is applied to the set of reference expansion coefficients, in order to generate sets of rotated coefficients for the first charge density function and the first electrostatic potential field for the first molecular subset 602, in a manner similar to step 526 for the second molecular subset 652.
However, in general, it is more efficient in terms of computations (and potential storage) to generate sets of translated coefficients for one axial sample point at a time and to then subsequently apply the first rotation operator in order to generate the sets of transformed coefficients for the first molecular subset 602.
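The "one axial sample point at a time" strategy can be pictured as a generator that keeps only a single translated set in memory while the rotations for that axial point are applied. The sketch below is illustrative only: `toy_translate` and `toy_rotate` are hypothetical stand-ins that merely rescale a list of numbers, not the translation and rotation operators of the specification.

```python
def transformed_coefficients(reference, axial_samples, spherical_samples,
                             translate, rotate):
    """Yield ((R, beta1, gamma1), coeffs), holding one translated set at a time."""
    for R in axial_samples:
        translated = translate(reference, R)  # one translation per axial point
        for beta, gamma in spherical_samples:
            yield (R, beta, gamma), rotate(translated, beta, gamma)


calls = {"translate": 0}


def toy_translate(coeffs, R):
    """Hypothetical stand-in for the translation operator (not the real one)."""
    calls["translate"] += 1
    return [c / (1.0 + R) for c in coeffs]


def toy_rotate(coeffs, beta, gamma):
    """Hypothetical stand-in for the first rotation operator (not the real one)."""
    return [c * (1.0 + 0.01 * beta) for c in coeffs]


results = list(transformed_coefficients(
    [1.0, 2.0],                    # toy reference coefficients
    [1.0, 2.0, 3.0],               # axial sample points R_i
    [(0.0, 0.0), (90.0, 0.0)],     # spherical sample points (beta1, gamma1)
    toy_translate, toy_rotate))

# 3 axial x 2 spherical = 6 transformed sets, but only 3 translations.
print(len(results), calls["translate"])
```

With this ordering the translation, which is applied once per axial sample point, is never repeated for each orientation, and no more than one translated set needs to be stored at any moment.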
In Fig. 5, in step 530, an electrostatic energy affinity is defined. As used herein, the "electrostatic affinity score" is a representation of the change in total electrostatic energy of a system going from a state of two molecular subsets in isolation in an ambient environment (e.g., state 445 in Fig. 4) to the potential formation of a molecular complex comprised of the two molecular subsets in close proximity at a relative orientation and position to one another and embedded in the same ambient medium (e.g., state 455 in Fig. 4). In order for the present invention to perform a dense search in the conformational space of the two molecular subsets treated as rigid bodies, i.e., for possibly millions of relative orientations and translations of the two molecular subsets, in an efficient manner, the "electrostatic affinity score" is intended to be an accurate approximation of the relevant change in electrostatic energy, e.g., ΔE = E455 − E445 of Fig. 4 and eqns. 8-9.
In one embodiment, the self-electrostatic energies of both molecular subset 602 and molecular subset 652 are assumed to be nearly the same before (i.e., in relative isolation) and after the formation of a potential molecular complex. Such an approximation justifies the previously described embodiments where the computational domains 608 and 658 do not include points internal to their respective molecular surfaces 606 and 656.
However, in some cases it is plausible to approximate the total electrostatic potential, Φ″, in Eqn. 9 as two linearly superimposable potentials, Φ1″ and Φ2″, each generated separately by the charge distribution on one of the molecular subsets, while the charge distribution on the other molecular subset is ignored, though both potentials will be different from their counterparts, Φ1′ and Φ2′, in state 445.
Such an approximation of the total electrostatic potential, Φ″, is suitable when either (a) there is no ionic atmosphere comprised of one or more electrolytes present in the ambient medium, i.e., Eqn. 11 for the Poisson equation is applicable, (b) the interaction of solute entities on molecular subsets 602 and 652 with an existing ionic atmosphere in the ambient medium can be ignored due to the Debye-Hückel parameter, as defined in regard to eqns. 12-13, being very small at all points of the system, or (c) the original Φ″ is relatively small everywhere so that the linearized version of the Poisson-Boltzmann equation, as shown in Eqn. 13, is applicable.
To this effect, in some embodiments, the mutual electrostatic interaction between molecular subsets 602 and 652 in state 455 of Fig. 4 is approximated by representing the total electrostatic potential, Φ″, appearing in Eqn. 9, in terms of a linear superposition of two electrostatic potential functions, Φ1″ and Φ2″, respectively for molecular subsets 602 and 652. In another embodiment the aforementioned linear superposition is a direct sum.
In another embodiment, when superposition is appropriate, Φ1″ and Φ2″ are replaced by Φ1′ and Φ2′, the electrostatic potential functions for each molecular subset when in isolation as defined in regard to Eqn. 8 for state 445 of Fig. 4. Such an approximation means that changes in both the solvent screening and the self-reaction field of each charge distribution with the ambient polarizable medium as a result of changes in the relative position and orientation of the two molecular subsets are ignored.
In yet another embodiment, Φ1′ and Φ2′ are instead replaced by the coordinate based representations of the two electrostatic potential fields, respectively φ1 and φ2, defined in step 508 of Fig. 5 in regard to molecular subsets 602 and 652.
In one embodiment, the "electrostatic affinity score" is defined as follows:
where dV represents a differential volume element in the joint coordinate system of Fig. 9 and c1 and c2 are empirical constants. In Eqn. 39, ρ1 and ρ2 are, respectively, the coordinate based representations of the first and second charge density functions of molecular subsets 602 and 652 and, as described earlier, may take on various forms according to different embodiments. In Eqn. 39, φ1 and [...] poses for the first molecular subset and the set of sampled poses for the second molecular subset. The iteration over the set of sampled configurations for the molecular combination, for the purpose of generating the plurality of electrostatic affinity scores, can be performed in any order.
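For orientation, the symbols named above (dV, the two empirical constants, the two charge density functions, and the two electrostatic potential fields) are consistent with a weighted sum of two overlap integrals, each pairing the charge density of one molecular subset with the potential field of the other. The form below is an assumption offered for illustration only and is not asserted to be Eqn. 39 itself:

```latex
\Delta E \;\approx\; c_1 \int \rho_1 \, \varphi_2 \, dV \;+\; c_2 \int \rho_2 \, \varphi_1 \, dV
```

Under this assumed form, each term measures how much charge of one subset sits where the potential generated by the other subset is large, which is the usual picture of a mutual electrostatic interaction between two charge distributions.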
In one embodiment, the prediction of the binding mode is generally decided based on the particular configuration, i.e., relative position and orientation, that yields the highest electrostatic affinity score. In another embodiment, the magnitude of the best score, or the top x% of scores, determines the results of the analysis of the molecular combination of the two molecular subsets. In another embodiment, all electrostatic affinity scores below a preset numerical threshold are rejected, and only those configurations with passing scores are retained for further analysis. In yet another embodiment, the electrostatic affinity scores are filtered based on an adaptive threshold dependent on observed statistics of the scores as they are generated. In yet another embodiment, the statistical analysis of both passing score magnitudes, as well as multidimensional clustering of the relative position and orientation coordinates of passing configurations, is used to predict the binding mode and/or assess the nature and likelihood of the molecular combination. In one embodiment, in the context of a hardware or mixed software / hardware implementation of the invention, the passing scores and their corresponding states are selected and output to memory off hardware.
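The selection strategies listed above (a preset threshold, the top x% of scores, and an adaptive threshold based on statistics of the scores seen so far) can be sketched as follows. The function names, the toy scores, and the particular adaptive rule (running mean plus k standard deviations after a short warm-up) are assumptions chosen for illustration; the specification does not prescribe a specific statistic.

```python
import statistics


def passing_fixed(scores, threshold):
    """Keep configurations whose score is not below a preset threshold."""
    return [(cfg, s) for cfg, s in scores if s >= threshold]


def top_percent(scores, x):
    """Keep the best x% of configurations (at least one), highest score first."""
    k = max(1, round(len(scores) * x / 100.0))
    return sorted(scores, key=lambda cs: cs[1], reverse=True)[:k]


def passing_adaptive(scores, k=1.0, warmup=3):
    """Keep scores at least k standard deviations above the running mean.

    The cutoff is recomputed from the scores observed so far, so the filter
    can be applied while scores are still being generated.
    """
    kept, seen = [], []
    for cfg, s in scores:
        if len(seen) >= warmup:
            cutoff = statistics.mean(seen) + k * statistics.pstdev(seen)
            if s >= cutoff:
                kept.append((cfg, s))
        seen.append(s)
    return kept


# Toy (configuration label, score) pairs in generation order.
scores = [("c0", 1.0), ("c1", 5.0), ("c2", 2.0), ("c3", 9.0), ("c4", 3.0)]

print(passing_fixed(scores, 3.0))   # c1, c3 and c4 pass
print(top_percent(scores, 40))      # the two best: c3, then c1
print(passing_adaptive(scores))     # only c3 clears the running cutoff
```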
In yet another embodiment, the above strategies can be used when screening a collection of second molecular subsets against the same first molecular subset 602 in order to predict potential binding modes and estimate binding affinity based on computations of electrostatic affinity, in order to select promising candidates for further downstream processing in the drug discovery pipeline.
In one embodiment, the plurality of electrostatic affinity scores is calculated at one value for the order of the expansion, N1, and then the results are quantitatively analyzed according to certain decision criteria. In another embodiment, the decision criteria are based on a cluster analysis of the electrostatic affinity scores. As used herein, the term "cluster analysis" generally refers to a multivariate analysis technique that seeks to organize information about variables so that relatively homogeneous groups, or "clusters", can be formed. The clusters formed with this family of methods should be highly internally homogeneous (members from the same cluster are similar to one another) and highly externally heterogeneous (members of each cluster are not like members of other clusters).
A further plurality of electrostatic affinity scores may then be calculated at a higher value for the order of the expansion, N2 > N1, based on results of the quantitative analysis. The electrostatic affinity scores may be computed at the higher expansion order, N2, only at those sample points for which the corresponding shape complementarity score computed at the lower expansion order, N1, satisfies the decision criteria imposed by the aforementioned quantitative analysis.
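The two-pass scheme can be sketched as a coarse scoring pass at order N1 followed by rescoring of the survivors at order N2 > N1. Everything concrete here is hypothetical: `toy_score` is a truncated series standing in for a score computed from an order-N expansion, and the decision criterion is a plain threshold on the coarse score.

```python
def two_pass_scores(configs, score_at_order, n1, n2, passes):
    """Score all configs at order n1, then rescore only the survivors at n2."""
    coarse = {cfg: score_at_order(cfg, n1) for cfg in configs}
    survivors = [cfg for cfg in configs if passes(coarse[cfg])]
    fine = {cfg: score_at_order(cfg, n2) for cfg in survivors}
    return coarse, fine


evaluations = []  # records every (configuration, order) that was scored


def toy_score(cfg, order):
    """Hypothetical score: a series truncated at the given expansion order."""
    evaluations.append((cfg, order))
    return cfg * sum(1.0 / n ** 2 for n in range(1, order + 1))


coarse, fine = two_pass_scores(
    [0.5, 1.0, 2.0, 3.0],          # toy configurations, labeled by a number
    toy_score, n1=2, n2=4,
    passes=lambda s: s >= 2.0)     # assumed decision criterion

print(sorted(fine))       # configurations rescored at the higher order
print(len(evaluations))   # 4 coarse evaluations + 2 fine evaluations
```

The saving comes from the second pass touching only the configurations that satisfied the decision criteria at the cheaper order.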
In one embodiment, in order to accurately assess the likelihood of combination of molecular subsets 602 and 652 including those encountered in the context of screening against a series of molecular subsets from a molecule library, an augmented score, S′, is defined as follows:
where S is a shape complementarity score for the two molecular subsets, for example such as defined in regard to Kita II, ΔE is the electrostatic affinity score of Eqn. 39, and γ is a scalar constant, meant to weight the two scores relative to one another.
In one embodiment the augmented score, S′, is generated for a plurality of different configurations of the molecular combination in a manner similar to that described above in regard to the electrostatic affinity score. In another embodiment, the augmented scores are generated by separately generating the shape complementarity scores and the electrostatic affinity scores for a plurality of different configurations of the molecular combination. In another embodiment, the shape complementarity and electrostatic affinity scores are generated concurrently for a plurality of different configurations of the molecular combination, thereby determining the augmented scores.
To compute S and ΔE concurrently, one option is to perform the computation in parallel using parallel processing units or elements. Those skilled in the art will appreciate that additional processing elements, as well as possible additional memory overhead and memory and/or I/O bandwidth, must be used in order to perform the computations, e.g., twice as many processing elements to do roughly twice as much work in a similar amount of time. In one embodiment, the computation of shape complementarity and electrostatic affinity scores for a given configuration of the molecular combination is interleaved such that one type of score calculation is performed right after completion of another type of score calculation on one set of processing elements that support both kinds of score calculations.
In one embodiment, the above embodiments regarding the identification of passing electrostatic affinity scores can be applied to the augmented score by separately applying decision criteria to both the shape complementarity score, S, and the electrostatic affinity score, ΔE, and those molecular configurations with both passing S and passing ΔE are deemed to be passing in terms of the augmented score, S′. In another embodiment, the above embodiments regarding the identification of passing electrostatic affinity scores can be directly applied to the augmented score itself.
Similarly, the above embodiments involving computation of electrostatic affinity scores in two or more passes represented by different values for the order of the basis expansion can be directly applied to the augmented score as well.
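The difference between the two ways of identifying passing augmented scores can be shown with a small example. The sketch assumes, purely for illustration, that the augmented score is the linear combination S + γ·ΔE and that larger values are better for both scores; the thresholds, the value of γ, and the configuration labels are all hypothetical.

```python
def augmented(s, delta_e, gamma):
    """Assumed linear form of the augmented score: S + gamma * dE."""
    return s + gamma * delta_e


def passes_separately(s, delta_e, s_min, e_min):
    """Pass only if both component scores clear their own thresholds."""
    return s >= s_min and delta_e >= e_min


def passes_combined(s, delta_e, gamma, aug_min):
    """Pass if the augmented score itself clears a single threshold."""
    return augmented(s, delta_e, gamma) >= aug_min


# Toy (S, dE) pairs for three configurations.
configs = {"a": (10.0, 1.0), "b": (4.0, 8.0), "c": (9.0, 6.0)}
gamma, s_min, e_min, aug_min = 0.5, 5.0, 5.0, 8.0

separate = [k for k, (s, e) in configs.items()
            if passes_separately(s, e, s_min, e_min)]
combined = [k for k, (s, e) in configs.items()
            if passes_combined(s, e, gamma, aug_min)]

print(separate)  # only "c" passes both component tests
print(combined)  # "a", "b" and "c" all clear the combined threshold
```

Applying criteria to each component rejects a configuration that is strong in one score but weak in the other, whereas a threshold on the combined score lets one component compensate for the other.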
Generally, it is inefficient to directly evaluate Eqn. 39 in its integral form. By first applying a basis expansion to the coordinate based representations of the charge density functions and the electrostatic potential fields of the two molecular subsets to obtain reference sets of expansion coefficients and then using appropriate translation and rotation operators to generate sets of transformed expansion coefficients for the first molecular subset 602 corresponding to sampled poses of the first molecule, and to likewise generate sets of transformed expansion coefficients for the second molecular subset 652 corresponding to sampled poses of the second molecular subset, the electrostatic affinity score, and hence the augmented score of Eqn. 40, for a given configuration of the molecular combination can be computed efficiently and to arbitrary precision based on the magnitude of N, the order of the expansion.
In one embodiment, within the context of the joint coordinate system of Fig. 9 and using the radial / spherical harmonics expansion of Eqn. 23, and also the formulations of eqns. 33-36 to construct and transform reference sets of expansion coefficients for the coordinate based representations of both the charge density functions and electrostatic potential fields for both molecular subsets 602 and 652, Eqn. 39 can be rewritten as follows:
where ΔE′ is the term corresponding to the integral in Eqn. 31 (ignoring the constants c1 and c2). The transformed expansion coefficients for the first molecular subset are evaluated at a sample point (R = Ri, β1 = β1j, γ1 = γ1k), and the rotated expansion coefficients for the second molecular subset are evaluated at a sample point (α2 = α2i, β2 = β2j, γ2 = γ2k).
In the embodiment where the third rotation operator is directly applied to the computed electrostatic affinity scores themselves, the score may be computed in two steps. In the first step, two intermediate factors A^ and A* are computed,
where m̄ denotes the negative value of m, the transformed expansion coefficients for the first molecular subset are evaluated as before, but the rotated expansion coefficients for the second molecular subset are evaluated at a sample point based on application of the second rotation operator.
In the second step, the electrostatic affinity score is given by,
where m is the azimuthal quantum number, and α2 represents the angular value associated with the third rotation operator. Splitting off the third rotation operator in the above manner generally reduces the total computation significantly.
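The saving from splitting off the roll rotation can be checked numerically in a simplified setting. The sketch below uses complex-valued coefficients indexed by (n, l, m), for which a rotation about the z-axis by an angle α multiplies each coefficient by a phase exp(−i·m·α); this complex convention, the synthetic random coefficients, and all function names are assumptions for illustration, and the formulation in the specification (with its two intermediate factors) may differ in detail. The point shown is that per-m partial sums can be computed once and then reused for every roll angle.

```python
import cmath
import random


def make_coeffs(order, rng):
    """Synthetic complex coefficients keyed by (n, l, m); not real data."""
    return {(n, l, m): complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
            for n in range(1, order + 1)
            for l in range(n)
            for m in range(-l, l + 1)}


def rotate_z(coeffs, alpha):
    """Apply a z-axis rotation as a per-m phase factor (assumed convention)."""
    return {(n, l, m): c * cmath.exp(-1j * m * alpha)
            for (n, l, m), c in coeffs.items()}


def overlap(a, b):
    """Direct score: sum over all (n, l, m) of conj(a) * b."""
    return sum(a[k].conjugate() * b[k] for k in a)


def partial_sums(a, b):
    """Step 1: collapse the (n, l) sums once, leaving one factor per m."""
    sums = {}
    for (n, l, m), c in a.items():
        sums[m] = sums.get(m, 0j) + c.conjugate() * b[(n, l, m)]
    return sums


def score_from_partials(sums, alpha):
    """Step 2: combine the per-m factors with the roll-angle phases."""
    return sum(p * cmath.exp(-1j * m * alpha) for m, p in sums.items())


rng = random.Random(1)
a, b = make_coeffs(3, rng), make_coeffs(3, rng)
sums = partial_sums(a, b)          # computed once for all roll angles

for alpha in (0.0, 0.3, 1.7):
    direct = overlap(a, rotate_z(b, alpha))
    deferred = score_from_partials(sums, alpha)
    print(abs(direct - deferred) < 1e-9)
```

For each additional roll angle the deferred route costs one term per value of m rather than one term per (n, l, m) coefficient, which is where the reduction in computation comes from.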
As described above, a plurality of electrostatic affinity scores is generated for each element of the set of sampled configurations of the molecular combination, and can be generated in any order. In Fig. 9, this represents a sampling of electrostatic affinity scores over a six-dimensional space representing the relative positions and orientations of the two molecular subsets as given by {(R = Ri, β1 = β1j, γ1 = γ1k, α2 = α2l, β2 = β2m, γ2 = γ2n)}, where {Ri} refers to the elements of the axial sampling scheme of step 514, {β1j, γ1k} to the elements of the first spherical sampling scheme of step 516, {β2m, γ2n} to the elements of the second spherical sampling scheme (also of step 516), and {α2l} to the elements of the angular sampling scheme of step 518.
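The size of this six-dimensional sample set is the product of the sizes of the four sampling schemes. As a worked count, using the illustrative sizes quoted in this description (50 axial, 1000 first spherical, 1000 second spherical, and 100 angular sample points); the 4-byte storage figure is an added assumption, included only to give a sense of scale:

```python
from math import prod

# Number of sample points per sampling scheme (illustrative sizes).
scheme_sizes = {
    "axial (R)": 50,
    "first spherical (beta1, gamma1)": 1000,
    "second spherical (beta2, gamma2)": 1000,
    "angular (alpha2)": 100,
}

total_scores = prod(scheme_sizes.values())
print(f"{total_scores:,}")  # 5,000,000,000

# Assumption: one 4-byte value stored per score.
bytes_per_score = 4
print(f"{total_scores * bytes_per_score / 1e9:.0f} GB")  # 20 GB
```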
In another embodiment, a plurality of augmented scores, as defined in Eqn. 40, are generated in a similar manner, using Eqn. 41, or alternatively eqns. 42-43, for the electrostatic affinity score and analogous expressions for the shape complementarity score as discussed previously in regard to Kita II. As described above, the shape complementarity and electrostatic affinity scores may be constructed concurrently or separately, though the choice will have direct implications on the feasibility of various embodiments for identification of passing augmented scores, also discussed above.
For reasonable sampling resolution for each sampling scheme, the total number of electrostatic affinity scores can be very large. For example, if there are 50 axial sample points, each 1 Å apart, 1000 first spherical sample points from an icosahedral mesh, 1000 second spherical sample points from an icosahedral mesh, and 100 angular sample points, this represents approximately five billion scores. However, reduction of the sampling resolution can lead to unacceptable inaccuracies in the final prediction and characterization of the optimal binding mode for the two molecular subsets. Thus efficiency in performing the repeated computation of Eqn. 41, or alternatively eqns. 42-43, for ΔE′ = ΔE′(R, β1, γ1,