"System And Method For Recognizing Programming Practical Biomolecules"

Abstract: The present invention generally relates to methods of rapidly and efficiently searching biologically related data space. More specifically, the invention includes methods of identifying biomolecules with desired properties, or which are most suitable for acquiring such properties, from complex biomolecule libraries or sets of such libraries. The invention also provides methods of modeling sequence-activity relationships. As many of the methods are computer-implemented, the invention additionally provides digital systems and software for performing these methods.

Patent Information

Application #:
Filing Date: 05 March 2020
Publication Number: 37/2021
Publication Type: INA
Invention Field: BIOTECHNOLOGY
Status:
Email: ipr@optimisticip.com
Parent Application:

Applicants

MESBRO TECHNOLOGIES PRIVATE LIMITED
Flat no C/904, Geomatrix Dev, Plot no 29, Sector 25, Kamothe, Raigarh-410209, Maharashtra, India

Inventors

1. Mr. Bhaskar Vijay Ajgaonkar
Flat no C/904, Geomatrix Dev, Plot no 29, Sector 25, Kamothe, Raigarh-410209, Maharashtra, India

Specification

Claims: We Claim:
1. A method for identifying nucleotides for variation in nucleic acids encoding a protein variant library so as to affect a desired activity, said method comprising:
a. receiving data characterizing a training set of a protein variant library, wherein the data comprises activity and a nucleotide sequence for each protein variant in the training set;
b. from the data, developing a sequence-activity model for predicting activity as a function of independent variables;
c. using the sequence-activity model to rank positions in a reference nucleotide sequence, and nucleotide types at the ranked positions in the reference nucleotide sequence, in order of impact on the desired activity; and
d. using the ranking to identify one or more nucleotides, in the reference nucleotide sequence, that are to be varied or fixed so as to affect the desired activity, wherein the nucleotides to be varied comprise codons encoding particular amino acids.
2. The computer program method as claimed in claim 1, wherein the activity is a function of expression of nucleic acids.
3. The computer program method as claimed in claim 1, wherein it comprises expressing the new protein variant library from polynucleotides encoding members of the new protein variant library and wherein the polynucleotides are prepared by gene synthesis.
Description:
Technical Field of the Invention:
The invention relates to the fields of molecular biology, molecular evolution, bioinformatics, and digital systems. More specifically, the invention relates to methods of identifying biomolecule targets with desired properties and methods for computationally predicting the activity of a biomolecule. Systems, including digital systems, and system software for performing these methods are also provided. Methods of the present invention have utility in the optimization of proteins for industrial and therapeutic use.

Background of the Invention:
Protein design has long been known to be a difficult task if for no other reason than the combinatorial explosion of possible molecules that constitute searchable sequence space. The protein design problem was recently shown to belong to a class of problems known as NP-hard, indicating that there is no algorithm known that can solve such problems in polynomial time. Because of this complexity, many approximate methods have been used to design better proteins; chief among them is the method of directed evolution. Directed evolution of proteins is today dominated by various high throughput screening and recombination formats, often performed iteratively.

Sequence space can be described as a space where all possible protein neighbours can be obtained by a series of single point mutations (Smith, J. M. (1970) “Natural selection and the concept of a protein space,” Nature, 225:563-564). For example, a 100-residue long protein would be a 100-dimensional object with 20 possible values, i.e., the 20 naturally occurring amino acids, in each dimension. Each one of these proteins has a corresponding fitness on some complex landscape. Models of such “fitness landscapes” were first described by Wright (Wright, S. (1932) “The roles of mutation, inbreeding, crossbreeding and selection in evolution,” Proceedings of 6th International Conference on Genetics, 1:356-366) but have since been expanded on by others (Eigen, M. (1971) “Self-organization of matter and the evolution of biological macromolecules,” Naturwissenschaften, 58(10):465-523; Kauffman, S. et al. (1987) “Towards a general theory of adaptive walks on rugged landscapes,” J. Theor. Biol., 128(1):11-45; Kauffman, E. S., et al. (1989) “The NK model of rugged fitness landscapes and its application to maturation of the immune response,” J. Theor. Biol., 141(2):211-45; Schuster, P., et al. (1994) “Landscapes: complex optimization problems and biopolymer structures,” Comput. Chem., 18(3):295-324; Govindarajan, S. et al. (1997) “Evolution of model proteins on a foldability landscape,” Proteins, 29(4):461-6). The sequence space of proteins is immense and is impossible to explore exhaustively. Accordingly, new ways to efficiently search sequence space to identify functional proteins would be highly desirable.
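To make the scale of this space concrete, the following Python sketch counts the sequences in the 100-residue example and enumerates the single-point-mutation neighbours of a short sequence. The function names and the three-residue test sequence are illustrative assumptions and do not appear in the specification.

```python
# The 20 naturally occurring amino acids, as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_space_size(length: int, alphabet_size: int = len(AMINO_ACIDS)) -> int:
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def single_point_neighbours(sequence: str) -> list[str]:
    """All sequences that differ from `sequence` at exactly one position."""
    neighbours = []
    for i, original in enumerate(sequence):
        for residue in AMINO_ACIDS:
            if residue != original:
                neighbours.append(sequence[:i] + residue + sequence[i + 1:])
    return neighbours


if __name__ == "__main__":
    # 20**100 is a 131-digit number, roughly 1.27e130.
    print(sequence_space_size(100))
    # Each of the 3 positions can change to 19 other residues.
    print(len(single_point_neighbours("MKV")))
```

Even though each sequence has only 19 neighbours per position, the total space grows exponentially with length, which is why exhaustive exploration is not feasible.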

Object of the Invention
An object of the present invention is to provide methods, apparatus, and software for identifying amino acid residues for variation in a protein variant library.

Summary of the Invention
These residues are then varied in the sequences of protein variants in the library in order to affect a desired activity such as stability, catalytic activity, therapeutic activity, resistance to a pathogen or toxin, toxicity, etc. The method of this aspect may be described by the following sequence of operations: (a) receiving data characterizing a training set of a protein variant library; (b) from the data, developing a sequence activity model that predicts activity as a function of amino acid residue type and corresponding position in the sequence; and (c) using the sequence activity model to identify one or more amino acid residues at specific positions in the systematically varied sequences that are to be varied in order to impact the desired activity. In this method, the protein variants in the library may have systematically varied sequences. Further, the data provides activity and sequence information for each protein variant in the training set.

In some embodiments, the method also includes (d) using the sequence activity model to identify one or more amino acid residues that are to remain fixed (as opposed to being varied) in a new protein variant library.

The protein variant library may include proteins from various sources. In one example, the members include naturally occurring proteins such as those encoded by members of a single gene family. In another example, the members include proteins obtained by using a recombination-based diversity generation mechanism. Classical DNA shuffling (i.e., DNA fragmentation-mediated recombination) or synthetic DNA shuffling (i.e., synthetic oligonucleotide-mediated recombination) may be performed on nucleic acids encoding all or part of one or more naturally occurring parent proteins for this purpose. In still another example, the members are obtained by performing design of experiments (DOE) to identify the systematically varied sequences.

Generally, the sequence activity model may be of any form that does a good job of predicting activity from sequence information. In a preferred embodiment, the model is a regression model such as a partial least squares model or a principal component regression model. In another example, the model is a neural network.

Using the sequence activity model to identify residues for fixing or variation may involve any of many different possible analytical techniques. In some cases, a “reference sequence” is used to define the variations. Such a sequence may be one predicted by the model to have the highest value (or one of the highest values) of the desired activity. In another case, the reference sequence may be that of a member of the original protein variant library. From the reference sequence, the method may select subsequences for effecting the variations. In addition or alternatively, the sequence activity model ranks residue positions (or specific residues at certain positions) in order of impact on the desired activity.

One goal of the method may be to generate a new protein variant library. As part of this process, the method may identify sequences that are to be used for generating this new library. Such sequences include variations on the residues identified in (c) above or are precursors used to subsequently introduce such variations. The sequences may be modified by performing mutagenesis or a recombination-based diversity generation mechanism to generate the new library of protein variants. This may form part of a directed evolution procedure. The new library may also be used in developing a new sequence activity model.
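One way to picture the generation of a new library is as a combinatorial enumeration: residues identified for variation are allowed several values, and all other positions keep the residue of a reference sequence. The Python sketch below illustrates this under stated assumptions. The function `build_variant_library`, the reference sequence, and the allowed residues are hypothetical, and the enumeration is an in-silico stand-in for the mutagenesis or recombination procedures named above.

```python
from itertools import product


def build_variant_library(reference: str, choices: dict[int, str]) -> list[str]:
    """Enumerate sequences that vary `reference` at the chosen positions.

    `choices` maps a 0-based position to a string of allowed residues.
    Positions that are not listed stay fixed at the reference residue.
    """
    positions = sorted(choices)
    library = []
    for combination in product(*(choices[p] for p in positions)):
        sequence = list(reference)
        for position, residue in zip(positions, combination):
            sequence[position] = residue
        library.append("".join(sequence))
    return library


# Hypothetical reference sequence and variation plan.
library = build_variant_library("MKVLAQ", {1: "KR", 4: "AGS"})

if __name__ == "__main__":
    for member in library:
        print(member)
```

The library size is the product of the number of allowed residues at each varied position (here 2 x 3 = 6), so limiting variation to the highest-impact positions keeps the library small enough to screen.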

In some embodiments, the method involves selecting one or more members of the new protein variant library for production. One or more of these may then be synthesized and/or expressed in an expression system.

Another aspect of the invention pertains to methods for defining a library of biological molecules. Such methods may be characterized by the following sequence of operations: (a) receiving an original set of data points representing the activity and sequence of multiple biological molecules in a training set; (b) constructing a bootstrap set of data points selected, with replacement, from the original set of data points; (c) generating a model from the bootstrap set, which model comprises indicators of the relative importance of individual residues or other units in biological molecules represented by the data points in the bootstrap set; (d) repeating (b) and (c) multiple times to generate multiple values of each indicator from the models generated in (c); (e) for each indicator, determining (i) an average or mean value of the multiple values and (ii) a statistical indication of the distribution of the multiple values; (f) ranking the individual residues or other units on basis of their respective values of (i) and (ii) determined in (e); and (g) toggling particular ones of the individual residues or other units based on rankings produced in (f) to thereby define the library of biological molecules.
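Operations (a) through (g) can be sketched in Python as follows. This is a minimal illustration and not the specification's implementation. As a stand-in for a fitted regression model, the importance indicator used here is simply the mean activity of variants carrying a given residue minus the overall mean. The ranking score (absolute mean minus standard deviation) is likewise an assumed way of combining values (i) and (ii). All names are hypothetical.

```python
import random
import statistics


def residue_indicators(records):
    """Indicator per (position, residue): mean activity with it minus overall mean."""
    overall = statistics.fmean(activity for _, activity in records)
    groups = {}
    for sequence, activity in records:
        for key in enumerate(sequence):          # key is (position, residue)
            groups.setdefault(key, []).append(activity)
    return {key: statistics.fmean(vals) - overall for key, vals in groups.items()}


def bootstrap_ranking(records, n_rounds=200, seed=0):
    """Rank (position, residue) pairs using bootstrap estimates of their indicators."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_rounds):                                    # (d) repeat (b) and (c)
        resample = rng.choices(records, k=len(records))          # (b) draw with replacement
        for key, value in residue_indicators(resample).items():  # (c) build the model
            samples.setdefault(key, []).append(value)
    # (e) mean and spread of each indicator across the bootstrap models
    summary = {
        key: (statistics.fmean(vals), statistics.pstdev(vals))
        for key, vals in samples.items()
    }
    # (f) rank: large, consistently estimated effects come first
    return sorted(
        summary.items(),
        key=lambda item: abs(item[1][0]) - item[1][1],
        reverse=True,
    )


def residues_to_toggle(ranking, top_k):
    """(g) pick the top-ranked (position, residue) pairs to toggle in the library."""
    return [key for key, _ in ranking[:top_k]]


if __name__ == "__main__":
    # (a) hypothetical training data: (sequence, activity) pairs
    data = [
        (a + b + c, 1.0 + 1.0 * (a == "Q") + 0.1 * (b == "F"))
        for a in "AQ" for b in "GF" for c in "VL"
    ]
    print(residues_to_toggle(bootstrap_ranking(data), top_k=2))
```

Using the spread as well as the mean guards against residues whose apparent effect depends on which variants happened to be sampled.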

Yet another aspect of the invention pertains to apparatus and computer program products including machine-readable media on which are provided program instructions and/or arrangements of data for implementing the methods and software systems described above. Frequently, the program instructions are provided as code for performing certain method operations. Data, if employed to implement features of this invention, may be provided as data structures, database tables, data objects, or other appropriate arrangements of specified information. Any of the methods or systems of this invention may be represented, in whole or in part, as such program instructions and/or data provided on machine-readable media.

These and other features of the present invention will be described in more detail below in the detailed description of the invention and in conjunction with the following figures.

Brief Description of Drawings:
FIG. 1 is a flowchart of an embodiment of a method of the present invention.

Detailed Description of Invention:
FIG. 1 presents a flow chart showing various operations that may be performed in the order depicted or in some other order. As shown, a process begins at a block with receipt of data describing a training set comprising residue sequences for a protein variant library. In other words, the training set data is derived from a protein variant library. Typically, that data will include, for each protein in the library, a complete or partial residue sequence together with an activity value. In some cases, multiple types of activities (e.g., rate constant and thermal stability) are provided together in the training set.
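The training set described above might be held in memory as shown in the following Python sketch. The class name, field names, sequences, and activity values are all hypothetical. The specification states only that each entry pairs a complete or partial residue sequence with one or more activity values.

```python
from dataclasses import dataclass, field


@dataclass
class VariantRecord:
    """One training-set member: a residue sequence plus its measured activities."""
    sequence: str                                   # complete or partial residue sequence
    activities: dict[str, float] = field(default_factory=dict)


def activity_vector(training_set: list[VariantRecord], name: str) -> list[float]:
    """Values of one named activity, in training-set order."""
    return [record.activities[name] for record in training_set]


# Hypothetical data with two activity types per variant.
training_set = [
    VariantRecord("MKVLAQ", {"rate_constant": 1.8, "thermal_stability": 52.0}),
    VariantRecord("MRVLAQ", {"rate_constant": 2.4, "thermal_stability": 49.5}),
    VariantRecord("MKVLSQ", {"rate_constant": 1.1, "thermal_stability": 55.0}),
]

if __name__ == "__main__":
    print(activity_vector(training_set, "thermal_stability"))
```

Keeping the activities in a dictionary lets one training set carry several activity types, each of which can be modeled separately.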

In many embodiments, the individual members of the protein variant library represent a wide range of sequences and activities. This allows one to generate a sequence-activity model having applicability over a broad region of sequence space. Techniques for generating such diverse libraries include systematic variation of protein sequences and directed evolution techniques. Both of these are described in more detail elsewhere herein.

Activity data may be obtained by assays or screens appropriately designed to measure activity magnitudes. Such techniques are well known and are not central to this invention. The principles for designing appropriate assays or screens are widely understood. Techniques for obtaining protein sequences are also well known and are not central to this invention. The activity used with this invention may be protein stability (e.g., thermal stability). However, many important embodiments consider other activities such as catalytic activity, resistance to pathogens and/or toxins, therapeutic activity, toxicity, and the like.

After the training set data has been generated or acquired, the process uses it to generate a sequence-activity model that predicts activity as a function of sequence information. Such a model is an expression, algorithm or other tool that predicts the relative activity of a particular protein when provided with sequence information for that protein. In other words, protein sequence information is an input and activity prediction is an output. For many embodiments of this invention, the model can also rank the contribution of various residues to activity. Methods of generating such models (e.g., partial least squares regression (PLS), principal component regression (PCR), and multiple linear regression (MLR)) will be discussed below, along with the format of the independent variables (sequence information), the format of the dependent variable(s) (activity), and the form of the model itself (e.g., a linear first-order expression).
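As an illustration of a linear first-order model, the Python sketch below encodes each sequence as 0/1 indicator variables, one per (position, residue) pair, and fits one coefficient per indicator. It assumes multiple linear regression with a small ridge penalty, solved through the normal equations. The penalty is added here only to keep the equations solvable and is not taken from the specification, and PLS or PCR would be fitted differently. All function names are hypothetical.

```python
def one_hot_features(sequences):
    """Encode equal-length sequences as 0/1 indicators for each (position, residue)."""
    columns = sorted({(i, aa) for seq in sequences for i, aa in enumerate(seq)})
    index = {column: j for j, column in enumerate(columns)}
    rows = []
    for seq in sequences:
        row = [0.0] * len(columns)
        for i, aa in enumerate(seq):
            row[index[(i, aa)]] = 1.0
        rows.append(row)
    return columns, rows


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear_model(sequences, activities, ridge=1e-3):
    """Return (intercept, {(position, residue): coefficient})."""
    columns, rows = one_hot_features(sequences)
    intercept = sum(activities) / len(activities)
    y = [value - intercept for value in activities]
    p = len(columns)
    # Normal equations with a ridge term: (X^T X + ridge * I) w = X^T y
    xtx = [
        [sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(r[i] * y[k] for k, r in enumerate(rows)) for i in range(p)]
    return intercept, dict(zip(columns, solve(xtx, xty)))


def predict(model, sequence):
    """Predicted activity: intercept plus the coefficients of the residues present."""
    intercept, weights = model
    return intercept + sum(weights.get((i, aa), 0.0) for i, aa in enumerate(sequence))


if __name__ == "__main__":
    # Hypothetical training data in which residue effects are purely additive.
    seqs = [a + b + c for a in "AQ" for b in "GF" for c in "VL"]
    acts = [1.0 + 0.5 * (s[0] == "Q") + 0.2 * (s[1] == "F") for s in seqs]
    model = fit_linear_model(seqs, acts)
    print(predict(model, "QFV"))
```

Because the model is first order, each coefficient can be read as the estimated contribution of one residue at one position, which is what makes the later ranking step possible.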

A model generated at block is employed to identify multiple residue positions or specific residue values (e.g. glutamine at position) that are predicted to impact activity. See block. In addition to identifying such positions, it may “rank” the residue positions or residue values based on their contributions to activity. For example, the model may predict that glutamine at position has the most pronounced effect on activity, phenylalanine at position has the second most pronounced effect, and so on. In a specific approach described below, PLS or PCR regression coefficients are employed to rank the importance of specific residues.
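The ranking step can be sketched in Python as a sort on the absolute value of each regression coefficient. The coefficient values and position numbers below are hypothetical and do not come from the specification. They stand in for the output of a fitted PLS or PCR model.

```python
def rank_residues(coefficients: dict[tuple[int, str], float]) -> list[tuple[int, str]]:
    """Order (position, residue) pairs by absolute coefficient, largest first."""
    return sorted(coefficients, key=lambda key: abs(coefficients[key]), reverse=True)


# Hypothetical regression coefficients keyed by (position, one-letter residue).
coefficients = {
    (10, "Q"): 0.82,    # glutamine
    (166, "F"): -0.57,  # phenylalanine
    (42, "L"): 0.05,
    (75, "G"): 0.31,
}

ranking = rank_residues(coefficients)

if __name__ == "__main__":
    for position, residue in ranking:
        print(position, residue, coefficients[(position, residue)])
```

Sorting on the absolute value treats strongly detrimental residues as being as informative as strongly beneficial ones. The sign of the coefficient then indicates whether a residue is a candidate for fixing or for replacement.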

Documents

Application Documents

# Name Date
1 202021009596-STATEMENT OF UNDERTAKING (FORM 3) [05-03-2020(online)].pdf 2020-03-05
2 202021009596-POWER OF AUTHORITY [05-03-2020(online)].pdf 2020-03-05
3 202021009596-FORM FOR STARTUP [05-03-2020(online)].pdf 2020-03-05
4 202021009596-FORM FOR SMALL ENTITY(FORM-28) [05-03-2020(online)].pdf 2020-03-05
5 202021009596-FORM 1 [05-03-2020(online)].pdf 2020-03-05
6 202021009596-FIGURE OF ABSTRACT [05-03-2020(online)].jpg 2020-03-05
7 202021009596-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [05-03-2020(online)].pdf 2020-03-05
8 202021009596-EVIDENCE FOR REGISTRATION UNDER SSI [05-03-2020(online)].pdf 2020-03-05
9 202021009596-DRAWINGS [05-03-2020(online)].pdf 2020-03-05
10 202021009596-COMPLETE SPECIFICATION [05-03-2020(online)].pdf 2020-03-05
11 Abstract1.jpg 2020-03-07
12 202021009596-ORIGINAL UR 6(1A) FORM 26-120320.pdf 2020-03-14