Abstract: Disclosed is a method (100) for the identification of seed sequences for enzyme engineering. The method (100) provides seed sequences that are most suitable for structure-based enzyme engineering, thereby helping to classify a new sequence into its respective subfamily. The method (100) provides a target sequence that is a suitable starting point for modification using information from a crystal structure and enzyme assays. The method (100) is widely applicable to the synthesis of active pharmaceutical ingredients, thereby providing a library of ω-transaminase enzymes for industrial applications. Figure 1
DESC:METHOD FOR THE IDENTIFICATION OF SEED SEQUENCES FOR ENZYME ENGINEERING
Field of the invention
The present invention relates to a method for the identification of seed sequences for enzyme engineering and more specifically, to the method for enzyme engineering to tailor the active site to accommodate unnatural substrates.
Background of the invention
Biocatalysis is more versatile than traditional organic chemistry processes in the asymmetric synthesis of chemical compounds. One of the major characteristics of biocatalytic enzymes is their stereoselectivity, which enables them to preferentially synthesize chiral compounds from prochiral precursors. Enantiomerically pure amino acids and amine compounds are essential intermediates that act as building blocks for the synthesis of many pharmaceuticals and fine chemicals. Transaminases are versatile enzymes for the synthesis of pure enantiomers. They catalyse the transfer of an amino group from an amine donor substrate, which is either an amino acid or an amine compound, to a carbonyl compound. Depending on whether the amine donor is an amino acid or an amine compound, transaminases are further grouped as α-transaminases and ω-transaminases (ω-TAs). While α-transaminases mediate the transamination of an amino group bound to the alpha carbon, ω-TA enzymes catalyse the transfer of an amino group attached to a carbon atom positioned beta or more distal to the C-alpha atom of an amino acid or an amine compound to a carbonyl compound, which is generally called the amino acceptor. More broadly, ω-transaminases are those that transaminate beta amino acids. These enzymes can handle primary amines that do not possess a carboxylate group. In addition, ω-TAs can handle a broad range of substrates, from keto acids to aldehydes and ketones. Moreover, their application in the kinetic resolution of racemic amines and in asymmetric amination to yield chiral amines makes ω-TAs attractive enzymes for the synthesis of enantiopure chiral amines.
Transaminases are ubiquitous in living organisms and play a key role in amino acid metabolism. These enzymes belong to the pyridoxal 5’-phosphate (PLP) dependent superfamily of enzymes. The PLP-dependent enzymes are classified into seven major structural groups based on fold types, representing five major evolutionary lineages. About 140 different catalytic functions are mediated by these enzymes, which therefore represent an example of divergent evolution. This extensive divergence in functionality makes it very difficult to assume a correlation between sequence and function. Percudani and Peracchi, 2009 developed a database compiling all the information on PLP-dependent enzymes. Among the seven major types of PLP-dependent enzymes, TAs are classified under fold types I and IV. Data mining of the Protein families database (Pfam) by Finn et al., 2010 suggested six subfamilies, wherein classes I, II, III and V belong to the same fold type I. Fold IV encompasses D-alanine transaminase (DATA), L-selective branched chain amino acid transaminase (BCAT), amino-deoxychorismate lyase (ADCL) and R-selective ω-transaminase (RATA). Aminotransferase fold IV thus provides a spectrum of specialized enzymes, viz. L-selective BCAT and D-selective DATA, both of which are α-amino acid transaminases, and RATA, which is an R-selective amine transaminase.
Irrespective of their affiliation to different fold types and subfamilies, the catalytic mechanism is the same for all transaminase enzymes. The enzymatic action follows a ping-pong mechanism characterized by two half reactions. The first half reaction starts with the internal aldimine intermediate, which constitutes the Schiff base covalent linkage between the strictly conserved lysine and the PLP cofactor. Upon entry of the amino donor compound into the active site, a transaldimination reaction mediates the formation of an external aldimine intermediate between the substrate and the PLP, releasing the lysine. A 1,3-prototropic rearrangement converts the external aldimine to a ketimine intermediate, which in turn is hydrolyzed in the next step to yield the alpha keto acid and pyridoxamine 5’-phosphate (PMP). The second half reaction is the reversal of the first half reaction and transfers the amino group to the keto acid, resulting in regeneration of PLP for the next round of catalysis.
Chiral amines are essential precursors for the synthesis of pharmaceutical drugs, fine chemicals and agro products. Chemical synthesis of enantiopure amines suffers from many limitations, such as harsh reaction conditions, toxic metal catalysts and poor enantioselectivity, and requires many additional purification steps that reduce the yield of the product. Biocatalysis is therefore a promising alternative to harmful chemical synthesis methods. Although a wide range of enzymes, such as hydrolases, monoamine oxidases and ω-TAs, is available for the production of chiral amines through kinetic resolution of racemic mixtures, stereoselective ω-transaminases are the most preferred and promising enzymes for the asymmetric synthesis of chiral amines.
Stereoselective ω-transaminases have been examined earlier. However, little was known about R-selective enzymes compared to their S-selective counterparts, which have been studied extensively. For a long time, R-amines were synthesized by kinetic resolution of racemic mixtures employing S-selective enzymes. Hohne et al., 2010 suggested a methodology to identify R-selective enzymes. From both a structural and a sequence perspective, the R- and S-selective enzymes are distinct. While S-selective enzymes are grouped under aminotransferase fold I, R-selective enzymes belong to fold IV. Unlike S-selective transaminases, which are abundant in nature, R-selective transaminases are rare. R-amines are key building blocks in the synthesis of many pharmaceutical drugs, such as Sitagliptin, an antihyperglycemic agent used in the treatment of Type II diabetes mellitus; Cinacalcet, used in the treatment of secondary hyperparathyroidism; Silodosin, for the treatment of benign prostatic hyperplasia; Tamsulosin, which is used in the treatment of difficult urination; Formoterol, for the management of asthma and chronic obstructive pulmonary disease; and the antihistamine Levocetirizine.
ATAs achieve broad substrate specificity and stereoselectivity through their characteristic small and large binding pockets, which are compartmentalized within the active site. Aminotransferase fold IV family proteins have two major domains: a small domain at the N-terminus characterized by an α/β structure, and a large domain at the C-terminus with a pseudo-barrel structure. The two domains are connected by a short interconnecting loop. The active site is located between the two domains, and the cofactor binds at the bottom of the active site. Additionally, two loops belonging to the small domain of the other subunit also line the active site. Thus the active site is formed by residues from both domains and is lined by residues from the two loops of the other subunit. More specifically, the characteristic small binding pocket is lined by the small domain and the large binding pocket is lined by the large domain. One of the two loops, the highly flexible capping loop, is situated at the entrance of the active site cavity and controls the access of substrates to the active site pocket. The interdomain loop hangs over the active site entrance and has some influence over substrate entry.
The location of the alpha carboxylate group relative to PLP in the active site determines the enantioselectivity of BCAT and DATA. In BCAT, the carboxylate group is oriented towards the phosphate side of the PLP, while in DATA it is oriented in the opposite direction, on top of the O3 atom of the PLP. In BCAT, the alpha carboxylate group is bound in the small binding pocket, whereas in DATA it is bound in the large binding pocket. In BCAT, the α-carboxylate group of the substrate is stabilized through interaction with a tyrosine, which in turn is polarized by an arginine; both residues contour the small binding pocket. In DATA, the stabilization is through interaction with a positively charged arginine along with a histidine and a tyrosine residue. The (D, R) enantio-preference is thought to have evolved into R-selective enzymes (RATA), wherein a similarly oriented arginine residue from the capping loop acts as a dual substrate recognition residue that stabilizes the carboxylate group of a D-amino acid substrate but moves away to provide a hydrophobic environment when an R-amine substrate enters the active site. Interestingly, S-ATAs also possess a ‘flipping’ arginine residue that resides in a loop near the active site and plays a key role in dual substrate recognition. Recently, Hohne suggested that R-selective enzymes may have evolved from L-selective BCAT by random mutagenesis that modified the small binding pocket, disabling the binding of carboxylate groups and promoting the binding of small hydrophobic moieties.
Iwasaki et al., 2003 isolated two species, Arthrobacter sp. KNK168 and Pseudomonas sp. KNK425, that mediated the amination of 3,4-dimethoxyphenylacetone using sec-butylamine as the amino donor to produce 3,4-dimethoxyamphetamine with opposite enantioselectivities. Arthrobacter sp. KNK168 yielded the R-enantiomer, while Pseudomonas sp. KNK425 gave the S-enantiomer, with more than 99% enantiomeric excess. Recently, four R-selective ω-transaminases, from Hyphomonas neptunium (HT-ωTA), Aspergillus terreus (AT-ωTA) and Arthrobacter sp. (Ar-ωTA), together with an evolved variant, ARmut11-ωTA, were shown to aminate prochiral ketones to yield R-amines. The R-selective enzyme ATA-117 from Arthrobacter sp. KNK168 was known to aminate methyl ketones and small cyclic ketones. It was also demonstrated that AT-ωTA aminates ketones with aliphatic side chains of up to 6 carbon atoms with high enantiomeric excess and yield. The yield is significantly lower when AT-ωTA converts aromatic substrates, more specifically ketones such as acetophenone in which an aromatic moiety is bonded to the carbonyl carbon. The yield is lowered further if a larger group replaces the methyl group adjacent to the carbonyl carbon. Mutti demonstrated that natural enzymes such as HT-ωTA, AT-ωTA and Ar-ωTA do not aminate α-tetralone, while the evolved variant ARmut11-ωTA successfully aminates this aromatic ketone. While studying the feasibility of synthesizing sitagliptin, Savile found that the R-selective ω-transaminase ATA-117 from Arthrobacter sp. is defective in binding the prochiral ketone prositagliptin. Docking studies showed severe steric hindrance in the characteristic small binding pocket of the enzyme when accommodating the bulky trifluorophenyl substituent bonded to the β-keto side. It was further suggested that there would be undesired interactions in the large binding pocket.
In the effort to develop an evolved variant that could bind prositagliptin, Savile et al., 2010 developed an engineered enzyme, ARmut11-ωTA, that carries 28 mutations at various positions spanning both the large and small binding pockets of ATA-117. Thus R-selective ω-TAs do not transaminate unnatural amine acceptors, which usually have bulkier aromatic substituent groups, and therefore require extensive enzyme engineering to produce evolved variants that accomplish the transamination.
Guan et al., 2015 validated, through structural studies, the rationale for one particular mutation, G136F, among the 28 mutations proposed by Savile et al., 2010. The G136F mutation altered the conformation of the loop that lines the active site of the adjacent subunit, which effectively modified the topology, volume and surface properties of the active site and enabled the binding of prositagliptin. This finding suggests that the loop could be an efficient target for a rational mutagenesis approach to generate evolved variants that bind unnatural substrates. Given the huge potential of R-amines in the synthesis of compounds of clinical and industrial importance, the search for novel R-selective ω-transaminases is of immense importance. As aminotransferase fold IV proteins share a highly conserved structural fold, their divergence lies in a very few differences in the active site residues. The subfamily classification of new sequences is therefore rather difficult and may even be misleading. The search is currently pursued in two complementary directions: one applies in silico methods to identify new sequences that are potentially R-selective, and the other is activity based. The activity-based approach employs metagenomic methods to screen microorganisms, either by enriching the growth media with different substrates of choice or by screening different organisms for their ability to convert the potential substrates. Several R-selective ω-transaminases have been identified by this approach. In silico tools use the presence of key amino acid residues to predict the putative functionality of a protein. Hohne et al., 2010 identified 17 ω-TAs, all R-selective, through the identification of signature sequence motifs that define the divergence of the aminotransferase fold IV family.
Pavkov-Keller et al., 2016 discovered, through an enrichment-based approach, two new sequences that show R-selectivity but do not show any similarity to the defined signature sequence motifs of the four already known subfamilies of this fold. Thus the two methods complement each other in identifying new R-selective ω-transaminase sequences. These methods, however, have limitations, as the search algorithm is based on only a few signature residues and there are instances where classification based on conservation of the signature does not correlate with activity. Pavkov-Keller et al., 2016 showed that a sequence from Curtobacterium pusillum (PDBid-5K3W) is phylogenetically related to ADCL although it is actually a RATA enzyme. Indeed, the algorithm of Hohne et al., 2010 does not specify any signature for the RATA family; RATA sequences are those in which all the defined signatures of the other three subfamilies are absent. Thus, there is a need for more robust search methodologies to identify novel R-ATA enzymes. Towards this goal, the shape and charge complementarity between the substrate of interest and the cavity topology of the enzyme were used to identify new R-ATA sequences. Through extensive data mining of the sequence database, 5603 sequences were obtained that share 30-90% sequence similarity with the non-redundant set of 25 aminotransferase fold IV structures spanning all four subfamilies. The initial classification was made using the algorithm of Hohne et al., and a sizable number of sequences that do not show conservation of signature residues were categorized as ‘Sequences of unknown function (SUF)’. All the 5603 sequences were modeled as cognate dimers and their cavity topology was mapped. Docking with relevant natural and unnatural substrates provided an assessment of the shape complementarity between the docked substrate and the cavity topology. The enzymes that best complemented the potential ligands were identified as RATA enzymes.
Accordingly, there exists a need to provide a method for the identification of seed sequences for enzyme engineering that could help in the classification of a new sequence into its respective subfamily, and a ‘substrate first’ approach for the identification of target enzyme sequences for a specific function or for protein engineering to optimize catalytic activity.
Objects of the invention
An object of the present invention is to identify target enzyme sequences for a specific function or protein engineering to optimize catalytic activity.
Another object of the present invention is to identify seed sequences for enzyme engineering that helps in classification of a new sequence to the respective subfamily.
Summary of the invention
Accordingly, the present invention provides a method for the identification of seed sequences for enzyme engineering. The method involves screening non-redundant aminotransferase structures from a standard database. The screened non-redundant aminotransferase structures are used to form a parent dataset. The parent dataset includes a representative structural dataset of 25 proteins sharing a complete sequence coverage along with a sequence identity range of 30-90%.
The parent dataset is pruned on the basis of sequence similarity to form a derived dataset. The derived dataset includes sequences that have sequence identity between 30-90%. The derived datasets are classified to form subfamily datasets. Specifically, the subfamily datasets are formed by classifying the derived dataset into four subclasses such as R-amino transferase, D-amino transferase, L-branched chain amino acid aminotransferase, and 4-amino-4-deoxychorismate lyase.
The method further involves modelling of substrates and evaluating chemical feasibility of the models. The chemical feasibility evaluation is carried out to determine appropriate distances between the substrate and cofactor binding sites as well as to determine an arrangement of catalytic residues. Further, the method involves mapping and validating cavity shape profiles against the structures in the derived dataset. Moreover, the method involves docking of substrates and deriving shape complementarity with the cavity topology. The substrates, such as natural and non-natural ligands, are selected from 3,4-dimethoxyphenylacetone, (2R)-4-oxo-4-[3-(trifluoromethyl)-5,6-dihydro[1,2,4]triazolo[4,3-A]pyrazin-7(8H)-YL]-1-(2,4,5-trifluorophenyl)butan-2-amine and similar compounds.
Brief description of the drawings
The objectives and advantages of the present invention will become apparent from the following description read in accordance with the accompanying drawing wherein,
Figure 1 is a flowchart of a method for the identification of seed sequences for enzyme engineering, in accordance with the present invention;
Figures 2a and 2b show a distribution of different enzyme classes in an enzyme library, in accordance with the present invention;
Figures 3a, 3b and 3c illustrate an experimental validation of homology models and dimeric arrangement, in accordance with the present invention;
Figure 4 illustrates an experimental validation of cavity shape profiles derived from dimeric enzyme models in the library, in accordance with the present invention;
Figures 5a-5d show how shape profiles differ among the four subclasses of transaminases in the library, in accordance with the present invention;
Figures 6a-6d show that the R-amino transferase sub-family has the largest diversity in cavity shape profiles, in accordance with the present invention;
Figure 7 shows ligand structures that are complementary to the cavity shape profiles, in accordance with the present invention; and
Figure 8 illustrates the complementarity of a docked ligand (substrate) to the cavity shape profile, in accordance with the present invention.
Detailed description of the invention
The foregoing objects of the present invention are accomplished and the problems and shortcomings associated with the prior art, techniques and approaches are overcome by the present invention as described below in the preferred embodiments.
Referring to figure 1, a method (100) for the identification of seed sequences for enzyme engineering in accordance with the present invention is shown. Specifically, the method (100) is applicable in the synthesis of (2R)-4-oxo-4-[3-(trifluoromethyl)-5,6-dihydro[1,2,4]triazolo[4,3-A]pyrazin-7(8H)-YL]-1-(2,4,5-trifluorophenyl)butan-2-amine and similar compounds with applications in agro-chemicals, nutraceuticals and therapeutics.
Figure 1 shows the detailed flow chart illustrating the method (100) for the identification of seed sequences for enzyme engineering from steps (10) to (70). At step (10), the method (100) involves screening non-redundant aminotransferase structures from a standard database. Specifically, the National Center for Biotechnology Information (NCBI) database is used for screening the sequences. The NCBI database includes 48,000,000 sequences. However, it is understood here that any other standard database can be used as per the intended application in other alternative embodiments of the method (100).
At step (20), the method (100) involves forming a parent dataset using the screened non-redundant aminotransferase structures. The parent dataset is formed by BLAST analysis of the screened non-redundant aminotransferase structures. In an embodiment, the parent dataset includes a representative structural dataset of 25 proteins sharing a complete sequence coverage along with a sequence identity range of 30-90%. The selected 25 sequences are bona fide transaminases with characterized crystal structures. The parent dataset is based on a semi-automated search for all sequences that are similar to the 25 seed sequences.
At step (30), the method (100) involves forming a derived dataset by pruning the parent dataset on the basis of sequence similarity. All sequences in the derived dataset have sequence identity between 30-90%. Specifically, the derived dataset includes the sequence signatures of different transaminase classes. However, it is understood here that the derived dataset may include other sequences as per intended application in other alternative embodiments of the method (100).
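The pruning in step (30) can be pictured as a simple window filter. The following Python sketch is illustrative only: the record layout (`Hit`, with percent identity and query coverage fields) and the example values are assumptions and not part of the disclosed method. The 30-90% identity window is the one stated above; the complete-coverage default is an assumption taken from the query coverage reported in Example 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST-style hit; the field names are hypothetical."""
    accession: str
    identity_pct: float   # percent identity to the seed structure
    coverage_pct: float   # percent of the query covered by the alignment


def prune_hits(hits, min_identity=30.0, max_identity=90.0, min_coverage=100.0):
    """Keep hits inside the identity window that also have complete coverage."""
    return [
        h for h in hits
        if min_identity <= h.identity_pct <= max_identity
        and h.coverage_pct >= min_coverage
    ]


# Invented example hits, for illustration only.
hits = [
    Hit("seq_A", 95.2, 100.0),  # too similar to the seed: dropped
    Hit("seq_B", 42.7, 100.0),  # inside the window: kept
    Hit("seq_C", 28.1, 100.0),  # too distant: dropped
    Hit("seq_D", 61.0, 87.5),   # incomplete coverage: dropped
]
kept = prune_hits(hits)
print([h.accession for h in kept])
```

The thresholds are keyword arguments so that a different window can be tried without changing the function.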
At step (40), the method (100) involves forming subfamily datasets by classifying the derived dataset into four subclasses. Specifically, the sequence signatures of different transaminase classes are utilized to further sub-classify the derived dataset sequences into four subclasses namely R-amino transferase (RATA), D-amino transferase (DATA), L-branched chain amino acid aminotransferase (BCAT), and 4-amino-4-deoxychorismate lyase (ADCL).
At step (50), the method (100) involves modelling of substrates and evaluating chemical feasibility of the models. Specifically, semi-automated methods are used to generate three-dimensional structural models of the monomeric proteins. These structural models are then manually evaluated to obtain an appropriate dimer structure. In an embodiment, the appropriate dimer structure includes two monomeric protein models arranged in a particular fashion. Thereafter, the chemical feasibility of the models is evaluated by manual checks. The evaluation is carried out to determine appropriate distances between the substrate and cofactor binding sites as well as to determine an arrangement of catalytic residues.
At step (60), the method (100) involves mapping and validating cavity shape profiles against the structures in the derived dataset. The models form the basis for the cavity shape profiles to be generated. This forms a ‘die-cast’ like description that is unique to each structural model. The uniqueness of the die-cast allows each sequence-structure entry in each sub-family formed at step (40) to be described purely on the basis of this shape, as opposed to the sequence and structure features of the parent protein.
At step (70), the method (100) involves docking of substrates such as natural and non-natural ligands and deriving shape complementarity with the cavity topology. The natural and non-natural ligands are selected from 3,4-dimethoxyphenylacetone, (2R)-4-oxo-4-[3-(trifluoromethyl)-5,6-dihydro[1,2,4]triazolo[4,3-A]pyrazin-7(8H)-YL]-1-(2,4,5-trifluorophenyl)butan-2-amine and similar compounds. The diverse substrate compounds are docked into the cavities or die-casts identified in step (60). The docking allows the identification of the enzyme sequence that is most likely to be effective for a given substrate.
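One way to picture the selection in step (70) is as a ranking of candidate cavities by how well a docked ligand fills them. The Python sketch below is a toy illustration and not the scoring used in the invention. It assumes that both the cavity die-cast and the docked ligand have been reduced to sets of occupied grid cells, and it scores complementarity as the fraction of ligand cells that lie inside the cavity. The function names, the grid representation and the example data are all hypothetical, and charge complementarity is ignored.

```python
def fit_fraction(ligand_cells, cavity_cells):
    """Fraction of ligand grid cells that fall inside the cavity (0.0 to 1.0)."""
    if not ligand_cells:
        raise ValueError("ligand has no occupied cells")
    return len(ligand_cells & cavity_cells) / len(ligand_cells)


def rank_candidates(ligand_cells, cavities):
    """Return (model id, score) pairs sorted from best to worst fit."""
    scores = {
        model_id: fit_fraction(ligand_cells, cells)
        for model_id, cells in cavities.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: integer (x, y, z) grid cells.
ligand = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)}
cavities = {
    # Cavity too small: only half of the ligand fits.
    "model_1": {(0, 0, 0), (1, 0, 0)},
    # Ligand fits completely.
    "model_2": {(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (3, 1, 0)},
    # One ligand cell falls outside the cavity.
    "model_3": {(0, 0, 0), (1, 0, 0), (2, 0, 0)},
}
ranking = rank_candidates(ligand, cavities)
print(ranking[0])
```

A set-based overlap was chosen here only because it is easy to follow; any real assessment would use the docking poses and the mapped cavity surfaces.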
The invention is further illustrated hereinafter by means of examples.
Examples:
Example 1: Building a unique structural and sequence dataset
A total of 1237 non-redundant aminotransferase structures were screened from the Protein Data Bank (PDB) database (135787 entries as on December 6, 2017), and 94 structures, categorized as amino-transferase class IV within PFAM, were selected for further analysis. Culling this 94-structure dataset with a PDB90 constraint, and excluding ligand-bound structures and structures of mutant proteins, yielded a representative structural dataset of 25 proteins, spanning all the four subfamilies, that share a complete sequence coverage along with a sequence identity range of 30-90% (Table 1). Table 1 also provides the chain length along with the statistical scores of these crystal solutions and their functional annotation.
By BLAST analysis of each entry in the structural dataset against the non-redundant sequence database, a sequence set of 15081 sequences was built. Culling this sequence set at 30-90% similarity and a query coverage of 100% retained a total of 5941 sequences. The first check was for the presence of the Schiff base forming lysine, and it was found that 93 sequences do not have the catalytic lysine. These were subsequently discarded. The next parameter examined was sequence length: sequences longer than 400 or shorter than 200 residues were discarded (243 sequences). 5603 sequences were subsequently used for sub-classification. An HMM-guided sequence profile was constructed for the extracted sequence dataset and the sequence set was clustered into four bins, viz. R-amino transferase (RATA), D-amino transferase (DATA), L-branched chain amino acid aminotransferase (BCAT), and 4-amino-4-deoxychorismate lyase (ADCL). The sequence fingerprints were verified to categorize the sequences in the derived dataset into the four subfamilies: BCAT (2025 sequences), DATA (853), RATA (163) and ADCL (385). The sequences that were not categorized into any subfamily were grouped as the Sequences of unknown function (SUF) dataset. Monomer models of the sequences in all five subsets were built using MODELLER. A sequence similarity matrix was constructed for all sequences against all the twenty five structural entries, and the most similar structure was taken as the modelling template for each sequence. An in-house modified MODELLER script was used to automate the process. As the capping loop from the adjacent protomer is involved in catalysis, the predicted models had to be dimerized. While SymmDock and GalaxyGemini were used to obtain the cognate dimer models, substantial differences were seen between the dimeric models and the biologically relevant dimer. Each model thus had to be manually evaluated.
The manually curated dataset of modelled dimeric ω-TA enzymes was used in the subsequent analysis.
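The template selection described above amounts to taking, for each sequence, the structure with the highest value in its row of the similarity matrix. A minimal Python sketch follows. The identifiers and similarity values are invented for illustration, the function name is hypothetical, and the in-house MODELLER script itself is not reproduced here.

```python
def pick_templates(similarity):
    """Map each sequence id to the structure id with the highest similarity.

    `similarity` is a dict of dicts: similarity[seq_id][structure_id] -> percent.
    """
    templates = {}
    for seq_id, row in similarity.items():
        if not row:
            raise ValueError(f"no structures scored for {seq_id}")
        # max() over the keys, compared by their similarity value.
        templates[seq_id] = max(row, key=row.get)
    return templates


# Hypothetical similarity matrix (percent similarity). A real run would have
# one column for each of the twenty five structural entries.
similarity = {
    "seq_001": {"struct_A": 41.0, "struct_B": 67.5, "struct_C": 33.2},
    "seq_002": {"struct_A": 58.9, "struct_B": 35.0, "struct_C": 52.4},
}
templates = pick_templates(similarity)
print(templates)
```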
Categorizing the sequence dataset
The curated sequence dataset containing 5603 protein sequences sharing 30 to 90% sequence similarity was classified into four classes, ADCL, BCAT, DATA and RATA, based on the algorithm of Hohne et al., 2010. The conserved catalytic lysine was selected as the key residue, and the sequences not encoding the lysine at the expected locus were discarded. The sequence fingerprints were verified to categorize the sequences in the derived dataset into the four subfamilies: BCAT (2025 sequences), DATA (853), RATA (163) and ADCL (385). The sequences that were not categorized into any subfamily were grouped as the Sequences of unknown function (SUF) dataset (2177). The classification workflow is shown in figure 1, and Table 1 below lists the number of sequences within each subfamily dataset. A pie chart depicting the share of sequences within each subfamily is shown in figures 2a and 2b.
Table 1: No. of sequences listed in each subfamily.
Subfamily Sequences grouped
BCAT 2025
DATA 853
RATA 163
ADCL 385
SUF 2177
Total 5603
As evident from Table 1, the RATA dataset has the least number of sequences, indicating the rarity of the class. The Sequences of unknown function (SUF) dataset had the largest number of sequences, which indicates the limitations of the current sequence motif based search protocols. Indeed, this emphasizes the need for more robust search strategies. All the 5603 sequences were modelled and dimerized. Dimers were manually curated and evaluated for correct assembly. As these enzymes are functionally active in their cognate dimeric state and the active site lies within the dimer interface, accuracy at this step is critical for further analysis.
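The shares depicted in the pie chart follow directly from Table 1. As a quick arithmetic check, the short Python snippet below recomputes the total and the percentage share of each subfamily from the tabulated counts. Only the variable names are introduced here; the counts are those of Table 1.

```python
# Sequence counts per subfamily, copied from Table 1.
counts = {"BCAT": 2025, "DATA": 853, "RATA": 163, "ADCL": 385, "SUF": 2177}

total = sum(counts.values())
shares = {name: 100.0 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} sequences ({pct:.1f}%)")
print(f"Total: {total}")
```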
Example 2: Active site cavity topology mapping
The program Hollow was used to map the cavity profiles for all the constructed dimers. The NZ atom of the Schiff base forming lysine was taken as the probe centre, and a probe radius of 12 Å was used to search for the void. Details such as volume and shape profile were determined. The range of these scores within and across the considered sub-families was further evaluated to analyze topological dissimilarity. As the cavity profile determined by Hollow does not map the electrostatic features of the molecular surface, CASTp was also used to complement the analysis and obtain the charge profile.
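The search region described above is a sphere centred on the lysine NZ atom. The Python sketch below is not the Hollow program and makes no claim about its internals. It only illustrates, with invented coordinates, how atoms lying within a 12 Å radius of a chosen centre could be selected. The function name, atom labels and coordinates are hypothetical.

```python
import math


def atoms_within(center, atoms, radius=12.0):
    """Return labels of atoms whose distance from `center` is at most `radius` (in Å)."""
    return [label for label, xyz in atoms.items() if math.dist(center, xyz) <= radius]


# Invented coordinates in Å; in practice these would come from a dimer model.
lys_nz = (10.0, 10.0, 10.0)
atoms = {
    "A:TYR:OH": (13.0, 14.0, 10.0),   # 5 Å from the centre: inside
    "B:ARG:NH1": (10.0, 21.0, 10.0),  # 11 Å from the centre: inside
    "A:GLU:OE1": (30.0, 10.0, 10.0),  # 20 Å from the centre: outside
}
lining = atoms_within(lys_nz, atoms)
print(lining)
```

Note that the second atom carries a chain B label, reflecting the point made earlier that loops from the other subunit also line the active site.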
The subfamily-specific cavity shape profile was most diverse for RATA proteins. The functional divergence among the aminotransferase fold IV protein subfamilies was evaluated through differences in active site topology. First, active site cavity mapping of all twenty five entries of the structural dataset was performed using Hollow, and the results were analysed for a conserved profile for each subfamily. Variations in active site topology were observed among the proteins within a subfamily. However, a consensus shape profile, unique to each subfamily, could be mapped for DATA, BCAT and ADCL proteins (refer figures 5a-5d). On assessing the cavity profiles of the proteins in each subfamily, the members of the RATA subfamily had the most divergent profiles (refer figures 6a-6d). Further analysis revealed that the orientation and length of the capping loop were highly variable across these structures and are the major determinant of the differences in profile. The capping loop in 3WWH was long and inclined towards the small binding site, unlike in other proteins, where the loop was short and inclined towards the large binding site. The resulting constriction of the small binding site in 3WWH makes it unviable for binding prositagliptin; the enzyme was engineered to widen the small binding site for the conversion to sitagliptin. The cavity shape profile of the engineered variant (PDBid-3WWJ) was mapped and the widened small binding pocket was verified. The surface of the cavity profile showed that both the small and large binding pockets were relatively open compared to the other proteins. Sequences that show a profile similar to that of 3WWJ can be selected as potential candidates for the conversion of prositagliptin. All the dimerized models have been mapped for their cavity profiles and are currently being analysed. Docking analysis was carried out in parallel to check the shape complementarity.
Example 3: Modeling of substrates in the active site
AutoDock 4.2 was used to dock the substrates into the active site of the models. The cofactor PLP was retained during the docking. The NZ atom of the Schiff base forming lysine was fixed as the grid centre. Docking results were validated by selecting the solutions that retain the catalytically relevant conformation.
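Fixing the grid centre on the lysine NZ atom requires reading that atom's coordinates from the model. The Python sketch below shows one way this could be done from fixed-column PDB ATOM records. It is an illustration under stated assumptions: the helper names, the chain identifier, the residue number 180 and the coordinates are all hypothetical, and no docking software is invoked.

```python
def atom_record(serial, name, res_name, chain, res_seq, x, y, z):
    """Format a minimal PDB ATOM record (columns 1-54 only) for the example."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def lysine_nz_center(pdb_lines, chain_id, res_seq):
    """Return (x, y, z) of the NZ atom of the given lysine from PDB ATOM records."""
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()   # columns 13-16
        res_name = line[17:20]            # columns 18-20
        chain = line[21]                  # column 22
        number = int(line[22:26])         # columns 23-26
        if (atom_name, res_name, chain, number) == ("NZ", "LYS", chain_id, res_seq):
            # x, y and z occupy columns 31-38, 39-46 and 47-54.
            return float(line[30:38]), float(line[38:46]), float(line[46:54])
    raise ValueError("lysine NZ atom not found")


# Invented records; the residue number of the catalytic lysine is hypothetical.
lines = [
    atom_record(99, " CE", "LYS", "A", 180, 11.100, -2.500, 44.900),
    atom_record(100, " NZ", "LYS", "A", 180, 12.345, -3.210, 45.678),
    atom_record(101, " N", "GLY", "A", 181, 15.000, -1.000, 40.000),
]
center = lysine_nz_center(lines, "A", 180)
print(center)
```

The returned triple could then be supplied as the grid centre when preparing the docking input.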
Example 4: Cloning, Expression and Purification of target enzymes
The gene (accession id: NZ_AUIR01000012) encoding the ω-transaminase (WP_051243513) from Thalassobaculum salexigens DSM 19539, codon optimized for expression in E. coli systems, was ordered. The gene was subcloned into the pET28a expression vector between the NdeI and XhoI restriction sites to introduce an N-terminal 6X His purification tag. The cloned gene was transformed into the E. coli BL21* expression strain. E. coli BL21* harbouring pET28a-Tsal-ω-TA and pET28a-Saur-ω-TA were grown at 37 °C in LB medium supplemented with kanamycin antibiotic (50 mg/ml). Overexpression was induced by the addition of 0.5 mM IPTG, followed by incubation at 18 °C for 12 hours. Overexpressing cells were harvested and resuspended in cold lysis buffer (50 mM Tris-HCl pH 7.5, 300 mM NaCl and 5 mM imidazole). The resuspended cells were disrupted by sonication, followed by centrifugation at 14000 rpm for 45 minutes to pellet the insoluble proteins and other cell debris. The supernatant fraction containing the soluble proteins, including the ω-TA, was incubated with Ni-NTA resin pre-equilibrated with lysis buffer for about an hour at 4 °C. The incubated resin was washed with buffer containing 20 mM imidazole and the bound protein was eluted using buffer containing 250 mM imidazole. The purity of the eluted protein was checked using SDS-PAGE. A portion of the eluted protein was desalted into 50 mM Tris-HCl, 100 mM NaCl and 10% glycerol to remove imidazole. The protein was concentrated to 12 mg/ml and stored at -80 °C until further use.
Example 5: Catalytic activity measurements
The purified test enzymes were tested for functionality using a protocol modified from the method described in Schatzle et al., 2009. The transfer of the amino group from α-methylbenzylamine to pyruvate, generating acetophenone and D-alanine, was monitored by the increase in absorption of acetophenone at 245 nm. The assay was carried out in a final volume of 1 ml containing 2.5 mM each of D,L-1-phenyl benzylamine and pyruvate, 0.25% DMSO and 50 mM HEPES, pH 8.0. The reaction was started by adding 6 nanomoles of enzyme pretreated with PLP. A parallel reaction without enzyme was carried out as a control. The production of acetophenone was monitored by measuring the absorbance at 245 nm over a 30-minute time scan at 1-minute intervals. The enzyme unit was defined as micromoles of acetophenone released per minute.
$$\text{Enzyme units/ml} = \frac{(\Delta A_{\text{test}} - \Delta A_{\text{blank}}) \times 100 \times \text{enzyme dilution factor}}{\text{slope of std. graph} \times 30}$$
where
ΔA = Absorbance at 30 min - Absorbance at 0 min
30 = Reaction incubation time (min)
100 = Conversion factor for the enzyme volume
The standard graph was plotted by recording the absorbance at 245 nm of known concentrations of acetophenone ranging from 5 µM to 50 µM. All the enzymes were also tested for the alpha transamination function.
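The calculation described in this example can be sketched as below. This is a minimal illustration under stated assumptions: the function names `fit_slope` and `enzyme_units_per_ml` and all absorbance readings are hypothetical, and the slope of the standard graph is taken here from an ordinary least-squares fit, which the specification does not prescribe.

```python
def fit_slope(concentrations, absorbances):
    """Least-squares slope of absorbance versus concentration."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(absorbances) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, absorbances))
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    return sxy / sxx


def enzyme_units_per_ml(delta_a_test, delta_a_blank, slope,
                        dilution_factor=1.0, minutes=30.0,
                        volume_factor=100.0):
    """Apply the enzyme unit formula from the text.

    minutes is the reaction incubation time and volume_factor is the
    conversion factor for the enzyme volume (30 and 100 in the text).
    """
    return ((delta_a_test - delta_a_blank) * volume_factor
            * dilution_factor) / (slope * minutes)


# Hypothetical standard graph: acetophenone from 5 to 50 (µM).
standards = [5, 10, 20, 30, 40, 50]
readings = [0.012 * c for c in standards]   # invented absorbances
slope = fit_slope(standards, readings)

# Hypothetical 30-minute absorbance changes for test and blank.
units = enzyme_units_per_ml(delta_a_test=0.40, delta_a_blank=0.04,
                            slope=slope)
print(round(units, 3))
```

Subtracting the blank before dividing by the slope removes any non-enzymatic drift in absorbance from the reported activity.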
Example 6: Crystallization, diffraction data collection and structure determination
The purified eluted protein in the presence of imidazole was used for crystallization. Crystallization conditions were screened using Hampton screens, set up with a Mosquito pipetting robot. For each screen condition, the droplet contained 300 nL each of protein and reservoir solution (1:1 ratio). Crystallization was set up in Greiner 96-well hanging drop plates. The protein concentration was varied between 3 mg/ml and 7 mg/ml. Crystals of Tsal_wTA were obtained in 0.2 M ammonium sulfate, 0.1 M Bis-Tris pH 5.5 and 25% PEG 3350 using the hanging drop vapour diffusion method after 24 hours of incubation at 20 °C. A grid was set around this condition, varying the concentration of ammonium sulfate and the pH. Diffraction quality crystals were obtained at 4 mg/ml protein in the grid conditions containing 0.05 to 0.1 M ammonium sulfate, 0.1 M Bis-Tris pH 5.5 and 25% PEG 3350. The crystals were soaked in cryo solution containing 0.05 to 0.1 M ammonium sulfate, 0.1 M Bis-Tris pH 5.5, 25% PEG 3350 and 20% ethylene glycol for about 10 minutes before freezing in the cryo stream. Cocrystallization with various amine acceptor and amine donor substrates was also carried out. D-alanine, glycine, α-methylbenzylamine and the unnatural amine sitagliptin were used as amine donors at 10 mM. Pyruvate and the unnatural amine acceptor prositagliptin, which is the prochiral carbonyl compound of sitagliptin, were used as amine acceptors at 10 mM. The substrates of each category, both separately and in combination, were incubated with 4 mg/ml of protein for about 2 hours at 4 °C and set for crystallization in the identified condition. The resulting crystals were soaked in the same cryo solution for about 10 minutes before freezing in the cryo stream.
Soaking experiments were also carried out by incubating the native Tsal_wTA crystals in a 20% ethylene glycol cryo solution containing 10 mM of substrates, either separately or in various combinations of amine donor and amine acceptor, for about 10 minutes before freezing in the cryo stream. The X-ray diffraction data were collected at the ID-23 beamline at ESRF, Grenoble, France.
Purified imidazole-free Saur_wTA protein was set for crystallization. Crystallization conditions were screened using Hampton screens, set up with a Mosquito pipetting robot. For each screen condition, the droplet contained 300 nL each of protein and reservoir solution (1:1 ratio). Crystallization was set up in Greiner 96-well hanging drop plates. The protein concentration was varied between 5 mg/ml and 10 mg/ml. Droplets containing 7 mg/ml protein gave diffraction quality crystals of Saur_wTA in two different crystallization conditions (0.06 M citric acid, 0.04 M bis-tris propane pH 4.1 and 16% PEG 3350; 8% v/v Tacsimate pH 4.0, 20% w/v PEG 3350) using the hanging drop vapour diffusion method after 24 hours of incubation at 20 °C. Data were collected at the home source using a Rigaku R-Axis IV diffractometer. iMosflm and XDS were used for data processing. Phases were obtained by molecular replacement using the PHASER module of the CCP4 suite. The search model was a low resolution structure of Tsal_wTA, solved from a dataset collected at the home source, whose phases were calculated using the BALBES online server. Refinement was done using Refmac5 and solvent was added using the ARP/wARP solvent module of the CCP4 suite. COOT was used for model building and inspection.
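The grid optimization around the initial Tsal_wTA hit, in which the ammonium sulfate concentration and the pH are varied while the buffer and precipitant are held fixed, can be laid out programmatically. The sketch below is illustrative only: the function name `build_grid` and the specific concentration and pH steps are assumptions, since the specification does not list the grid values that were actually screened.

```python
from itertools import product


def build_grid(salt_molar, ph_values,
               buffer="0.1 M Bis-Tris", precipitant="25% PEG 3350"):
    """Return one condition per (ammonium sulfate, pH) combination."""
    conditions = []
    for salt, ph in product(salt_molar, ph_values):
        conditions.append({
            "ammonium_sulfate_M": salt,
            "pH": ph,
            "buffer": buffer,
            "precipitant": precipitant,
        })
    return conditions


# Hypothetical grid steps bracketing the initial hit
# (0.2 M ammonium sulfate, pH 5.5).
grid = build_grid(salt_molar=[0.05, 0.1, 0.15, 0.2],
                  ph_values=[5.0, 5.5, 6.0, 6.5])

for condition in grid[:4]:
    print(condition)
```

Each dictionary corresponds to one well of the optimization plate, so the list can be written out directly as a plate map.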
Example 7: In-silico Docking studies
External aldimine complexes of PLP with the R isomer of α-methylbenzylamine, D-alanine and acetyl naphthalene were sketched and coordinates were generated using PRODRG. Docking studies were performed using AutoDock. Docking was performed on the cognate dimers of Tsal_wTA and Saur_wTA and on the generated mutant Saur_R99A. The docking results were analyzed in PyMOL. The determined structures were superimposed to compare the length and the electrostatic potential of the capping loop. The electrostatic potential of the loop was calculated using the Adaptive Poisson-Boltzmann Solver (APBS) plugin provided in Autodock.
Two case studies are described hereinafter to show the validity of the method (100).
Case study 1: Docking of L-glutamate in complex with PLP on the active site of a BCAT protein (5CE8)
The L-glutamate-PLP complex was sketched using the MOLINSPIRATION webserver and docked into the active site of protein 5CE8 after removing the bound cofactor PLP and other solvent molecules (Figure 7). The Schiff base forming Lys150 was taken as the grid centre for docking using AutoDock. The docking conformation was validated by verifying the PLP interactions.
The cavity shape profile of the protein 5CE8 was mapped for the dimer using Hollow. Superimposition of the docked solution with the cavity showed that the docked solution fits exactly within the mapped cavity. This indicates the shape complementarity between the substrate and the active site (Figure 8).
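The superimposition check described above can be approximated numerically. The sketch below is one possible way to quantify the fit and is not the procedure stated in the specification: it assumes the cavity is available as a list of filler-sphere centres, and that a ligand atom counts as inside the cavity when it lies within a chosen cutoff of any sphere centre. The function name `fraction_inside_cavity`, the 1.4 Å cutoff and all coordinates are hypothetical.

```python
import math


def fraction_inside_cavity(ligand_atoms, cavity_spheres, cutoff=1.4):
    """Fraction of ligand atoms within `cutoff` of any cavity sphere centre."""
    inside = 0
    for atom in ligand_atoms:
        if any(math.dist(atom, centre) <= cutoff for centre in cavity_spheres):
            inside += 1
    return inside / len(ligand_atoms)


# Hypothetical cavity: three filler spheres along the x axis (in Å).
cavity = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

# Hypothetical docked pose: two atoms in the cavity, one well outside.
pose = [(0.5, 0.0, 0.0), (1.5, 0.5, 0.0), (6.0, 0.0, 0.0)]

fraction = fraction_inside_cavity(pose, cavity)
print(round(fraction, 3))
```

A fraction close to 1 would correspond to a docked solution that sits entirely within the mapped cavity, as reported for the 5CE8 case.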
Case study 2: Docking of prositagliptin on the active site of the engineered ATA-117 (3WWJ)
Coordinates of prositagliptin were obtained from PDBeChem and docked into the active site of the engineered ATA-117 (PDBid-3WWJ) with the bound PLP retained. The docking was validated by verifying the catalytically relevant conformation, which requires placing the trifluorophenyl group in the small binding pocket and the heterocyclic moiety in the large binding pocket. The docked solution was superimposed on the Hollow output and found to fit into the active site, indicating shape complementarity.
Selection of Test sequences:
Based on the established signatures and the verified presence of the residues proposed by Savile et al., 2010 as mutations on the ATA-117 scaffold, an RATA protein from Arthrobacter sp. KNK1633, six test sequences were identified for experimental validation of their activity in converting prositagliptin. These are: Tetraspora japonica; Hyphomonas hirschina; Pseudo fijiensis; Cecrospora entrica; Thalassobaculum salexigens; Staphylococcus aureus
Experimental validation: Structure of Thalassobaculum salexigens ω-TA
In the crystal structure of T. sal ω-TA determined at 1.7 Å resolution, the asymmetric unit consists of a dimeric molecule (Figures 3a-3c). The structure was refined to an R-factor of 17% and an R-free of 20%. The cofactor pyridoxal 5’-phosphate (PLP) was modeled into the corresponding difference Fourier peak in the active site adjacent to Lys168 in both protomers. The continuous electron density between the difference Fourier peak and Lys168 indicates a covalent linkage between the NZ atom of the lysine and the C4 atom of PLP, demonstrating the formation of the internal aldimine. The dimer observed in this structure arises from crystal packing of 2 protomers and is not the biologically relevant cognate dimer. In a typical dimeric arrangement of aminotransferase fold IV proteins, the capping loop from the adjacent protomer, which controls the accessibility to substrates, forms the dimeric interface. In the obtained structure, the capping loop lies at the periphery, indicating that the observed dimer is not a biological arrangement. Moreover, as in other reported R-selective omega transaminase structures, the capping loop is highly flexible and mobile; this disorder is reflected in the poor electron density for residues 115 to 119, which were therefore left unbuilt. Likewise, residues 160 to 165 were not built due to poor electron density. The data collection, processing and refinement statistics are given in Table 1.

An attempt to cocrystallize T. sal ω-TA with D-alanine and pyruvate resulted in higher-order oligomeric structures; however, the substrates were not bound in the active site. The tetrameric structure crystallized in space group P 1 21 1 (Figures 3a-3c). The data set was truncated at 2.5 Å resolution. The asymmetric unit has 4 molecules. The structure was refined to an R-factor of 28% and an R-free of 32%. PLP was modeled into the corresponding difference Fourier peak in the active site adjacent to Lys168 in all four protomers. The continuous electron density between the difference Fourier peak and Lys168 indicates a covalent linkage between the NZ atom of the lysine and the C4 atom of PLP, demonstrating the formation of the internal aldimine. In the tetrameric solution, one dimeric pair is assembled as the biologically relevant cognate pair. The 2 other molecules packed in the ASU each share a weak dimeric interface with one of the protomers of the cognate pair.

The dodecameric structure crystallized in space group I 1 2 1. The data set was truncated at 2.1 Å resolution. The asymmetric unit has 12 molecules. The structure was refined to an R-factor of 25% and an R-free of 30%. PLP was modeled into the corresponding difference Fourier peak in the active site adjacent to Lys168 in all twelve protomers, with the C4A atom of PLP covalently linked to the NZ atom of Lys168, demonstrating the formation of the internal aldimine. The determined structure contains both cognate and non-cognate pairs.
Experimental validation: Structure of Staphylococcus ω-TA
The crystal structure of S. aureus ω-TA was determined at 2.1 Å resolution. The protein crystallized in the P21 2 21 space group. The asymmetric unit contains one monomer. The structure was refined to an R-factor of 18% and an R-free of 23%. PLP was modeled into the corresponding difference Fourier peak in the active site adjacent to Lys146. The continuous electron density leading from the NZ atom of Lys146 to the difference Fourier peak indicates a covalent linkage between the NZ atom of the lysine and the C4A atom of PLP, demonstrating the typical internal aldimine of PLP dependent enzymes.
Capping loop analysis
The capping loop from the adjacent protomer, which controls the accessibility of the substrate to the active site, has high scope for engineering. The capping loop was also engineered to excavate the small binding pocket in ATA-117, enabling it to accommodate the bulky trifluorophenyl substituent. The loop in ATA-117 is relatively long and sterically hinders the small binding pocket. In the engineered variant, shortening the loop flattens it and widens the pocket. Thus the loop provides scope for enzyme engineering. T. sal ω-TA and S. au ω-TA have capping loops of intermediate length. Superimposition of the structures with ATA-117 (PDBid-3WWH) and the engineered variant (PDBid-3WWJ) indicated that the loops in these proteins are shorter than that of the wild type but longer than that of the engineered variant of ATA-117.
Advantages of the invention
1. The method (100) provides seed sequences that are most suitable for structure-based enzyme engineering.
2. The method (100) provides a target sequence that is a suitable starting point for modification using information from the crystal structure and enzyme assays.
3. The method (100) is widely applicable for the synthesis of other active pharmaceutical ingredients thereby providing a library of ω-TA enzymes for industrial applications.
The foregoing descriptions of specific embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the present invention and its practical application, to thereby enable others skilled in the art to best utilize the present invention and various embodiments with various modifications as are suited to the particular use contemplated. It is understood that various omission and substitutions of equivalents are contemplated as circumstance may suggest or render expedient, but such are intended to cover the application or implementation without departing from the scope of the present invention.
CLAIMS:We claim:
1. A method for an identification of seed sequences for enzyme engineering, the method comprising the steps of:
screening non-redundant aminotransferase structures from a standard database;
forming a parent dataset using the screened non-redundant aminotransferase structures;
forming a derived dataset by pruning the parent dataset on the basis of sequence similarity;
forming subfamily datasets by classifying the derived dataset;
modelling of substrates and evaluating chemical feasibility of the models;
mapping and validating cavity shape profiles against the structures in the derived dataset; and
docking of substrates and deriving shape complementarity with the cavity topology.
2. The method as claimed in claim 1, wherein the parent dataset includes a representative structural dataset of 25 proteins sharing a complete sequence coverage along with a sequence identity range of 30-90%.
3. The method as claimed in claim 1, wherein the derived dataset includes sequences that have sequence identity between 30-90%.
4. The method as claimed in claim 1, wherein the subfamily datasets are formed by classifying the derived dataset into four subclasses such as R-amino transferase, D-amino transferase, L-branched chain amino acid aminotransferase, and 4-amino-4-deoxychorismate lyase.
5. The method as claimed in claim 1, wherein the chemical feasibility evaluation is carried out to determine appropriate distances between the substrate and cofactor binding sites as well as to determine an arrangement of catalytic residues.
6. The method as claimed in claim 1, wherein the substrates such as natural and non-natural ligands are selected from 3,4-dimethoxyphenyl acetone, (2R)-4-oxo-4-[3-(trifluoromethyl)-5,6-dihydro[1,2,4]triazolo[4,3-a]pyrazin-7(8H)-yl]-1-(2,4,5-trifluorophenyl)butan-2-amine and similar compounds.
| # | Name | Date |
|---|---|---|
| 1 | 201821023778-PROVISIONAL SPECIFICATION [26-06-2018(online)].pdf | 2018-06-26 |
| 2 | 201821023778-FORM 1 [26-06-2018(online)].pdf | 2018-06-26 |
| 3 | 201821023778-FORM 3 [26-06-2019(online)].pdf | 2019-06-26 |
| 4 | 201821023778-ENDORSEMENT BY INVENTORS [26-06-2019(online)].pdf | 2019-06-26 |
| 5 | 201821023778-DRAWING [26-06-2019(online)].pdf | 2019-06-26 |
| 6 | 201821023778-COMPLETE SPECIFICATION [26-06-2019(online)].pdf | 2019-06-26 |
| 7 | Abstract1.jpg | 2019-08-16 |
| 8 | 201821023778-Proof of Right (MANDATORY) [30-09-2019(online)].pdf | 2019-09-30 |
| 9 | 201821023778-FORM-26 [30-09-2019(online)].pdf | 2019-09-30 |
| 10 | 201821023778-ORIGINAL UR 6(1A) FORM 1 & 26-031019.pdf | 2019-10-07 |
| 11 | 201821023778-FORM 18 [27-05-2022(online)].pdf | 2022-05-27 |