Specification
This invention relates to leukemia disease genes and uses thereof.
TECHNICAL FIELD
This invention relates to leukemia disease genes and methods of using the same for diagnosis and treatment of leukemia.
BACKGROUND
Myelodysplastic syndromes (MDS) are a heterogeneous group of clonal disorders of bone marrow cell precursors characterized by variable clinical courses and outcomes. Approximately 30 percent of patients with MDS eventually progress to acute myelogenous leukemia (AML), and a clinical diagnostic assay especially suited to early identification of this subset of patients would help focus therapeutic options for these individuals.
Recent expression profiling studies have revealed differences in AC133+ hematopoietic stem cell fractions from patients with MDS versus AML (Miyazato et al. (2001) Blood 98:422-427). Similar results have been observed in transcriptional profiles of CD34+ cells purified from bone marrow of patients with MDS, which are radically altered from the transcriptional profiles of CD34+ cells from disease-free individuals (Hofmann et al. (2002) Blood 100:3553-3560). These studies, however, involved positive selection of specific cell subtypes, which is laborious and time-consuming.
SUMMARY OF THE INVENTION
The present invention features the use of peripheral blood samples containing peripheral blood mononuclear cells (PBMCs) for diagnosis or evaluation of the progression or treatment of AML and MDS. The present invention does not require positive selection of specific cell subtypes from the blood sample, thereby allowing rapid diagnosis and assessment of a leukemia. Accordingly, peripheral blood samples suitable for the present invention include, but are not limited to, whole blood samples or samples comprising un-fractionated PBMCs. In many cases, the peripheral blood samples employed comprise enriched un-fractionated PBMCs. By "enriched," it is meant that the percentage of PBMCs in a sample is higher than that in whole blood. In many cases, at least 75%, 80%, 85%, 90%, 95%, 99%, or 100% of the cells in an enriched sample are PBMCs. Enriched un-fractionated PBMCs can be prepared from whole blood by Ficoll gradient centrifugation or by using cell purification tubes (CPTs). Other conventional methods can also be used to prepare enriched un-fractionated PBMCs.
The invention provides genes whose expression profiles are indicative of the existence, status, progression or treatment of a leukemia. Leukemias that are amenable to the present invention include, but are not limited to, AML and MDS. For example, Table 4 recites genes differentially expressed in PBMCs from MDS patients versus PBMCs from disease-free subjects. Table 6 recites genes differentially expressed in PBMCs from AML patients versus PBMCs from MDS patients. Table 8 recites genes whose expression levels are useful for distinguishing humans with AML from humans with MDS, humans with AML from disease-free humans, and humans with MDS from disease-free humans. Acute lymphoblastic leukemia, chronic lymphocytic leukemia, chronic myelogenous leukemia, or hairy cell leukemia may also be analyzed according to the present invention.
Thus, in one aspect, the invention provides methods for diagnosis, or monitoring the occurrence, development, progression or treatment of leukemia (such as, for example, AML or MDS) in a subject using genes from Table 4 or Table 6. The methods include generating a gene expression profile from a peripheral blood sample from the subject and comparing the gene expression profile to one or more reference expression profiles (e.g. an expression profile representing a disease-free human, an expression profile representing a human with a leukemia, or an expression profile representing a human of borderline diagnosis). The gene expression profile and reference expression profiles include the expression patterns of one or more genes selected from Table 4 or 6 in PBMCs. In some embodiments, genes different from those recited in Table 2 are selected from Table 4 or 6, although genes recited in Table 2 can additionally be included. In some embodiments, the genes selected from Table 4 or 6 are those also recited in Table 8.
The difference or similarity between the gene expression profile and the one or more reference expression profiles is indicative of the presence, absence, occurrence, development, progression, or effectiveness of treatment of leukemia in the subject. The gene expression profile and the reference expression profiles can include the expression pattern of only one gene or of two or more (e.g. three or more, four or more, five or more, six or more, eight or more, ten or more, fifteen or more, twenty or more, forty or more, sixty or more, 100 or more, 200 or more, 300 or more, or 400 or more). In some embodiments, smaller numbers of genes (e.g. two, up to three, up to four, up to five, up to six, up to eight, up to ten, up to fifteen, up to twenty, up to forty, up to sixty, up to 100, or up to 200) are used.
The expression profile of the leukemia disease gene(s) in a subject of interest can be determined by measuring the RNA transcript level of each of the gene(s) in a peripheral blood sample of the subject. Methods suitable for this purpose include, but are not limited to, quantitative RT-PCR, nucleic acid arrays, Northern blot, in situ hybridization, slot-blotting, and nuclease protection assay. The expression profile of the leukemia disease gene(s) can also be determined by measuring the protein product level of each of the gene(s) in the peripheral blood sample of the subject. Methods suitable for this purpose include, but are not limited to, immunoassays (e.g., ELISA, RIA, FACS, or Western Blot), protein arrays, two-dimensional gel electrophoresis, and mass spectroscopy.
A typical reference expression profile employed in the present invention includes values or ranges that are suggestive of the expression pattern of the leukemia disease gene(s) in peripheral blood samples of disease-free humans or patients with known leukemias. In one example, a reference expression profile comprises the average expression levels of each of the leukemia disease gene(s) in peripheral blood samples of disease-free humans. In another example, a reference expression profile comprises the average expression levels of each of the leukemia disease gene(s) in peripheral blood samples of patients having a known leukemia. In still another embodiment, a reference expression profile comprises two or more individual expression profiles, each of which is the expression profile of the leukemia disease gene(s) in a peripheral blood sample of a different leukemia patient or disease-free human. In a further embodiment, a reference expression profile comprises ranges that reflect variations in the expression levels of each of the leukemia disease gene(s) in peripheral blood samples of disease-free humans or patients with known leukemias.
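By way of non-limiting illustration, a reference expression profile made up of per-gene averages and ranges may be sketched as follows. The function name build_reference_profile, the gene identifiers, and the numeric values are hypothetical and are not taken from this disclosure; Python is used only as a convenient notation.

```python
from statistics import mean

def build_reference_profile(samples):
    """Summarize per-gene expression across a set of reference samples.

    samples: list of dicts mapping gene name -> expression level.
    Returns a dict mapping gene name -> {"mean", "min", "max"}.
    """
    genes = set(samples[0])
    for sample in samples[1:]:
        genes &= set(sample)  # keep only genes measured in every sample
    profile = {}
    for gene in sorted(genes):
        levels = [sample[gene] for sample in samples]
        profile[gene] = {"mean": mean(levels), "min": min(levels), "max": max(levels)}
    return profile

# Hypothetical gene names and expression levels, for illustration only.
disease_free = [
    {"GENE_A": 10.0, "GENE_B": 200.0},
    {"GENE_A": 14.0, "GENE_B": 180.0},
    {"GENE_A": 12.0, "GENE_B": 220.0},
]
reference = build_reference_profile(disease_free)
```

The same function could be applied to samples from patients having a known leukemia to obtain a second reference profile.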
A reference expression profile employed in the present invention can be prepared using the same type of peripheral blood samples as the peripheral blood sample of the subject of interest and following the same preparation procedure and methodology. A reference expression profile can be predetermined or prerecorded. It can also be determined concurrently with or after the measurement of the expression profile of the subject of interest.
The comparison of the expression profile of a subject of interest to a reference expression profile can be performed manually or electronically. The difference or similarity between the expression profile of the subject of interest and the reference expression profile is indicative of the presence or absence, or progression or non-progression, of leukemia in the subject.
In some embodiments, the expression level of each of the leukemia disease genes employed in the comparison is correlated with a class distinction under a nearest-neighbor analysis or a significance analysis of microarrays. The class distinction represents an ideal expression pattern of the gene in un-fractionated PBMCs of disease-free humans and patients who have a specified leukemia (e.g., uniformly high in PBMCs of the disease-free humans and uniformly low in PBMCs of the leukemia patients, or vice versa). The disease status of a subject of interest (disease-free versus leukemia) can be predicted by comparing the expression profile of the leukemia disease genes in the subject of interest to a reference expression profile of the same genes using a k-nearest-neighbors or weighted voting algorithm. Based on the comparison, the subject from whom the sample was taken can be diagnosed with leukemia or diagnosed as disease-free; or an existing leukemia can be assessed for changes, such as those associated with progression or treatment.
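A nearest-neighbors comparison of the kind referred to above may, for example, be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance and a simple majority vote are choices made here, and the function name knn_predict, the gene identifiers, and all values are hypothetical.

```python
import math
from collections import Counter

def knn_predict(subject, references, k=3):
    """Return the majority label among the k reference profiles closest to the subject.

    subject: dict mapping gene name -> expression level.
    references: list of (label, profile dict) pairs.
    """
    def distance(a, b):
        shared = sorted(set(a) & set(b))  # compare only genes present in both profiles
        return math.sqrt(sum((a[g] - b[g]) ** 2 for g in shared))

    ranked = sorted(references, key=lambda item: distance(subject, item[1]))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical reference profiles and subject, for illustration only.
references = [
    ("disease-free", {"GENE_A": 1.0, "GENE_B": 1.2}),
    ("disease-free", {"GENE_A": 0.8, "GENE_B": 1.0}),
    ("disease-free", {"GENE_A": 1.1, "GENE_B": 0.9}),
    ("MDS", {"GENE_A": 5.0, "GENE_B": 5.1}),
    ("MDS", {"GENE_A": 4.8, "GENE_B": 5.3}),
    ("MDS", {"GENE_A": 5.2, "GENE_B": 4.9}),
]
subject = {"GENE_A": 4.5, "GENE_B": 5.2}
prediction = knn_predict(subject, references, k=3)
```

An odd value of k is used here so that a two-class vote cannot tie; other distance measures or tie-breaking rules could equally be substituted.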
The invention also provides a general method for diagnosing or monitoring the occurrence, development, progression or treatment of MDS. The method includes generating a gene expression profile from a peripheral blood sample of a subject and comparing the gene expression profile to one or more reference expression profiles (e.g. an expression profile representing a disease-free human, an expression profile representing a human with MDS, an expression profile representing a human with a non-MDS leukemia such as AML, or an expression profile representing a human of borderline diagnosis). The gene expression profile and the one or more reference expression profiles comprise the expression patterns of one or more MDS disease genes in PBMCs. The difference or similarity between the gene expression profile and the one or more reference expression profiles is indicative of the presence, absence, occurrence, development, progression, or effectiveness of treatment of MDS in the subject. The MDS disease genes can optionally include one or more genes selected from Tables 4, 6, or 8. The gene expression profile and the reference expression profiles can include the expression pattern of only one gene or of two or more (e.g. three or more, four or more, five or more, six or more, eight or more, ten or more, fifteen or more, twenty or more, forty or more, sixty or more, 100 or more, 200 or more, 300 or more, or 400 or more). In some embodiments, smaller numbers of genes (e.g. two, up to three, up to four, up to five, up to six, up to eight, up to ten, up to fifteen, up to twenty, up to forty, up to sixty, up to 100, or up to 200) are used. The comparison of the gene expression profile to the reference expression profiles can be done, for example, by a k-nearest neighbor analysis or a weighted voting algorithm. Based on the comparison, the subject from whom the sample was taken can be diagnosed with MDS or diagnosed as MDS-free or disease-free; or an existing MDS can be assessed for changes, such as those associated with progression or treatment.
The invention also provides a method for identifying an MDS patient who is likely to progress to acute myelogenous leukemia (AML) using one or more genes from Table 6. The method includes generating a gene expression profile from a peripheral blood sample from an MDS patient and comparing the gene expression profile to one or more reference expression profiles (e.g. an expression profile representing a human with AML, an expression profile representing a human with MDS known to progress to AML, or an expression profile representing a human with MDS known not to progress to AML). The gene expression profile and the one or more reference expression profiles include the expression patterns in PBMCs of one or more leukemia disease genes selected from Table 6. The difference or similarity between the gene expression profile and the one or more reference expression profiles is indicative that the MDS patient is likely to progress to AML. The leukemia disease genes selected from Table 6 are optionally different from those recited in Table 2, although genes from Table 2 could also be included. The leukemia disease genes selected from Table 6 are optionally among those also recited in Table 8.
In another aspect, the present invention features methods for evaluating the effectiveness of a treatment of leukemia in a patient of interest. These methods comprise comparing an expression profile of at least one leukemia disease gene in a peripheral blood sample of the patient of interest to a reference expression profile of the same gene(s), where the peripheral blood sample is isolated from the patient after initiation of the treatment, and each of the leukemia disease gene(s) employed is differentially expressed in un-fractionated PBMCs of patients who have the leukemia being evaluated, as compared to in un-fractionated PBMCs of disease-free humans. In one example, the leukemia being assessed is MDS, and the leukemia disease gene(s) employed includes one or more genes selected from Table 4. An elimination or reduction in the difference between the expression profile of the leukemia disease gene(s) in the patient of interest and the corresponding expression profile in disease-free humans during the course of the treatment is indicative of the effectiveness of the treatment for the patient of interest. As compared to conventional methods, the gene expression profiling-based methods may have improved sensitivity for the detection of disease progression or remission.
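The reduction in difference described above may, purely as an illustration, be tracked by computing the distance between the patient's profile and a disease-free reference profile at successive time points. The Euclidean distance, the function names, the gene identifiers, and the values below are assumptions made for this sketch and are not prescribed by this disclosure.

```python
import math

def distance_to_reference(profile, reference):
    """Euclidean distance between two profiles over the genes they share."""
    genes = sorted(set(profile) & set(reference))
    return math.sqrt(sum((profile[g] - reference[g]) ** 2 for g in genes))

def is_converging(timepoints, reference):
    """True if each successive profile is closer to the reference than the one before."""
    d = [distance_to_reference(p, reference) for p in timepoints]
    return all(later < earlier for earlier, later in zip(d, d[1:]))

# Hypothetical disease-free reference and three post-treatment samples.
disease_free_reference = {"GENE_A": 10.0, "GENE_B": 20.0}
visits = [
    {"GENE_A": 30.0, "GENE_B": 50.0},
    {"GENE_A": 20.0, "GENE_B": 35.0},
    {"GENE_A": 12.0, "GENE_B": 22.0},
]
responding = is_converging(visits, disease_free_reference)
```

A strictly decreasing distance is only one possible criterion; a threshold on the final distance, or a per-gene comparison against reference ranges, could be used instead.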
In still another aspect, the present invention features methods for evaluating the effectiveness of a treatment in preventing the progression of MDS to AML in a patient of interest. These methods comprise comparing an expression profile of at least one leukemia disease gene in a peripheral blood sample of the patient of interest to a reference expression profile of the same gene(s), where the peripheral blood sample is isolated from the patient after initiation of the treatment, and each of the leukemia disease gene(s) employed is differentially expressed in un-fractionated PBMCs of MDS patients as compared to in AML patients. Examples of leukemia disease genes suitable for this purpose include, but are not limited to, those depicted in Table 6. The expression profile of the leukemia disease gene(s) in the patient of interest during the course of the treatment is indicative of the effectiveness of the treatment in preventing the progression of MDS to AML in the patient.
The invention also provides arrays useful, for example, for diagnosing MDS or other leukemias. The arrays include a substrate having several addresses; distinct probes, such as distinct nucleic acid sequences or distinct antibody variable regions, are disposed on each address. In some embodiments, at least 15% (or at least 30% or at least 50%) of the addresses have probes that can specifically detect MDS disease genes in PBMCs; the MDS disease genes are optionally selected from Table 4. In other embodiments, at least 15% (or at least 30% or at least 50%) of the addresses have probes that can specifically detect genes selected from Tables 4 or 6; the selected genes are different from those recited in Table 2, although genes from Table 2 could also be included.
The invention also provides digitally-encoded expression profiles, as may be encoded in a computer-readable medium, useful, for example, as reference expression profiles to evaluate a gene expression profile from a peripheral blood sample. Each expression profile includes one or more digitally-encoded expression signals including a value representing the expression of a gene selected from Tables 4 or 6; the selected genes are different from those recited in Table 2, although digitally-encoded expression signals including values representing the expression of genes from Table 2 could additionally be included in the expression profile. The values in the digitally-encoded expression signals can represent, for example, the expression of the genes in a PBMC of a human with MDS or a human with AML. Each expression profile can include a single digitally-encoded expression signal or can include two, three, four, five, six, seven, eight, nine, or more digitally-encoded expression signals, such as at least ten, at least 20, at least 30, at least 40, at least 50, at least 100, or at least 200.
In another aspect, the invention provides kits useful for diagnosis of a leukemia. In one embodiment, the kit includes one or more probes that can specifically detect MDS disease genes (optionally selected from Table 4) in PBMCs. The probes are optionally polynucleotides that hybridize under stringent or nucleic acid array hybridization conditions to the RNA transcripts, or the complements thereof, of the MDS disease genes, or, optionally, are antibody variable domains that bind the products of the MDS disease genes. In another embodiment, the kit includes one or more probes that can specifically detect genes selected from Tables 4 and 6; the selected genes are different from those recited in Table 2, although probes for genes from Table 2 could additionally be included. Genes selected from Tables 4 and 6 can optionally be among those also recited in Table 8. The kits also include one or more controls, each representing a reference expression level of a gene detectable by the probes.
In another aspect, the invention features a method of making a decision, e.g. selecting a payment class, for a course of treatment for a leukemia such as AML or MDS. The method includes assigning an individual to a class based on a value that is a function of the expression of one or more genes in a peripheral blood sample from the individual, thereby making a decision regarding the individual. The genes include one or more genes from among those recited in Tables 4 and 6 but not recited in Table 2, although the expression of genes recited in Table 2 could also be considered. In some embodiments, the one or more genes are selected from those also recited in Table 8. The decision can include, for example, selecting a treatment, such as an AML treatment, MDS treatment, other leukemia treatment, or an absence of treatment, based on the assignment of the individual to the class. The decision also can include administering or declining to administer a treatment based on the assignment; issuing, transmitting or receiving a prescription; or authorizing, paying for, or causing a transfer of funds to pay for a treatment. "Treatment" as used herein, refers to any action to deal with a disease or condition, regardless of whether the action is intended as preventative, curative, or palliative, for example; or to address a cause or symptom of the disease or condition; or to improve a second treatment by, for example, improving its efficacy or addressing a side effect. The decision may be recorded, such as in a computer-readable medium.
The invention also features a method of providing information on which to make a decision about an individual. The method includes providing (e.g. by receiving) an evaluation of a subject, wherein the evaluation was made by a method described herein, such as by determining the level of expression of one or more genes in a peripheral blood sample of the subject, thereby providing a value. The genes include one or more genes from among those recited in Tables 4 and 6 but not recited in Table 2, although the expression of genes recited in Table 2 could also be considered. The method also includes providing a comparison of the value with a reference value, thereby providing information on which to make a decision about the subject. The method can also include making the decision or communicating the information to another party, such as by computer, compact disc, telephone, facsimile, or letter. The decision can include selecting a subject for payment or making or authorizing payment for a first course of action if the subject demonstrates a gene expression level, pattern or profile observed in a leukemia (e.g. AML or MDS) and a second course of action if the subject demonstrates a gene expression level, pattern or profile observed in a different leukemia (e.g. MDS or AML) or in leukemia-free humans. Payment can be from a first party to a second party. The first party can be a party other than the patient, such as a third party payor, an insurance company, employer, employer-sponsored health plan, HMO, or governmental entity. In some embodiments, the second party is selected from the subject, a healthcare provider, a treating physician, an HMO, a hospital, a governmental entity, or an entity that sells or supplies the drug.
In one aspect, the invention features a method of making a data record. The method includes entering the result of a method described herein into a record, e.g. a computer readable record. In some embodiments, the record is evaluated and/or transmitted to a third party payor, an insurance company, employer, employer sponsored health plan, HMO, or governmental entity, or a healthcare provider, a treating physician, an HMO, a hospital, a governmental entity, or an entity which sells or supplies the drug.
In one aspect, the disclosure features a method of providing data. The method includes providing data described herein, e.g., generated by a method described herein, to provide a record, e.g., a record described herein, for determining if a payment will be provided. In some embodiments, the data is provided by computer, compact disc, telephone, facsimile, email, or letter. In some embodiments, the data is provided by a first party to a second party. In some embodiments, the first party is selected from the subject, a healthcare provider, a treating physician, an HMO, a hospital, a governmental entity, or an entity which sells or supplies the drug. In some embodiments, the second party is a third party payor, an insurance company, employer, employer sponsored health plan, HMO, or governmental entity. In some embodiments, the first party is selected from the subject, a healthcare provider, a treating physician, an HMO, a hospital, an insurance company, or an entity which sells or supplies the drug and the second party is a governmental entity. In some embodiments, the first party is selected from the subject, a healthcare provider, a treating physician, an HMO, a hospital, an insurance company, or an entity which sells or supplies the drug and the second party is an insurance company.
In one aspect, the disclosure features a method of transmitting a record described herein. The method includes a first party transmitting the record to a second party, such as by computer, compact disc, telephone, facsimile, email, or letter. In some embodiments, the second party is selected from the subject, a healthcare provider, a treating physician, an HMO, a hospital, a governmental entity, or an entity which sells or supplies the drug. In some embodiments, the first party is an insurance company or government entity and the second party is selected from the subject, a healthcare provider, a treating physician, an HMO, a hospital, a governmental entity, or an entity which sells or supplies the drug. In some embodiments, the first party is a governmental entity or insurance company and the second party is selected from the subject, a healthcare provider, a treating physician, an HMO, a hospital, an insurance company, or an entity which sells or supplies the drug.
Other features, objects, and advantages of the present invention are apparent in the detailed description that follows. It should be understood, however, that the detailed description, while indicating preferred embodiments of the invention, does so by way of illustration only, not limitation. Various changes and modifications within the scope of the invention will become apparent to those skilled in the art from the detailed description.
DETAILED DESCRIPTION
The present invention features the use of whole blood samples or samples comprising un-fractionated PBMCs for diagnosing or monitoring the progression or treatment of AML and MDS. Genes that are differentially expressed in un-fractionated PBMCs of AML (or MDS) patients as compared to in disease-free humans can be identified. These genes can be used as surrogate markers for diagnosing or evaluating the treatment of AML (or MDS) in a subject of interest. Genes that are differentially expressed in un-fractionated PBMCs of AML patients as compared to in MDS patients can also be identified. These genes can be used to monitor the progression of MDS in a patient of interest. The present invention does not require positive selection of specific cell subtypes (e.g., CD34+ or AC133+), thereby allowing for rapid diagnosis and evaluation of AML and MDS. Other leukemias, such as acute lymphoblastic leukemia, chronic lymphocytic leukemia, chronic myelogenous leukemia, or hairy cell leukemia, can be similarly assessed according to the present invention.
Various aspects of the invention are described in further detail in the following subsections. The use of subsections is not meant to limit the invention. Each subsection may apply to any aspect of the invention. In this application, the use of "or" means "and/or" unless otherwise stated.
A. General Methods for Identifying Leukemia Disease Genes
This invention features the use of nucleic acid arrays for the identification of genes that are differentially expressed in un-fractionated PBMCs of leukemia patients as compared to in disease-free humans or in patients who have a different type of leukemia. Nucleic acid arrays allow for quantitative detection of expression profiles of a large number of genes at one time. Non-limiting examples of nucleic acid arrays suitable for this purpose include GeneChip® microarrays (Affymetrix, Santa Clara, CA), cDNA microarrays (Agilent Technologies, Palo Alto, CA), and bead arrays (U.S. Patent Nos. 6,288,220 and 6,391,562).
Polynucleotides to be hybridized to a nucleic acid array can be labeled with one or more labeling moieties to allow for detection of hybridized polynucleotide complexes. The labeling moieties can include compositions that are detectable by spectroscopic, photochemical, biochemical, bioelectronic, immunochemical, electrical, optical or chemical means. Exemplary labeling moieties include, but are not limited to, radioisotopes, chemiluminescent compounds, labeled binding proteins, heavy metal atoms, spectroscopic markers (such as fluorescent markers or dyes), magnetic labels, linked enzymes, mass spectrometry tags, spin labels, electron transfer donors and acceptors, and the like. Polynucleotides to be hybridized to a nucleic acid array can be cDNA, cRNA, or other types of nucleic acid molecules.
Hybridization reactions can be performed in absolute or differential hybridization formats. In the absolute hybridization format, polynucleotides derived from one sample, such as un-fractionated PBMCs from an AML or MDS patient or a disease-free human, are hybridized to the probes on a nucleic acid array. Signals detected after the formation of hybridization complexes correlate to the polynucleotide levels in the sample. In the differential hybridization format, polynucleotides derived from two biological samples, such as one from an AML or MDS patient and the other from a disease-free human, are labeled with different labeling moieties (e.g., Cy3 and Cy5, respectively). A mixture of these differently labeled polynucleotides is hybridized to a nucleic acid array. The nucleic acid array is then examined under conditions in which the emissions from the two different labels are individually detectable.
Signals gathered from nucleic acid arrays can be analyzed using commercially available software, such as software provided by Affymetrix or Agilent Technologies. Controls, such as for scan sensitivity, probe labeling or cDNA quantitation, can be included in the hybridization experiments. In many embodiments, signals from nucleic acid arrays are scaled or normalized before being further analyzed. The expression signals of a gene can be normalized to take into account variations in hybridization intensities when more than one array is used under similar test conditions. Signals for individual polynucleotide complex hybridization can also be normalized using the intensities derived from internal normalization controls contained on each array. In addition, genes with relatively consistent expression levels across the samples can be used to normalize the expression levels of other genes. In one embodiment, the expression levels are normalized across the samples such that the mean is zero and the standard deviation is one. In another embodiment, the expression signals from a nucleic acid array are subject to a variation filter which excludes genes showing minimal or insignificant variation across different classes of samples.
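The zero-mean, unit-standard-deviation normalization and a variation filter may, for example, be sketched as follows. The filter criteria (a minimum max/min ratio and a minimum max-min difference, with the thresholds shown) are assumptions chosen for illustration; this disclosure does not specify them. The function names, gene identifiers, and values are likewise hypothetical.

```python
from statistics import mean, pstdev

def standardize(levels):
    """Rescale one gene's levels across samples to mean 0 and standard deviation 1."""
    m, s = mean(levels), pstdev(levels)  # population standard deviation (an assumption)
    if s == 0:
        return [0.0 for _ in levels]  # constant gene: nothing to rescale
    return [(x - m) / s for x in levels]

def variation_filter(matrix, min_fold=3.0, min_delta=100.0):
    """Keep genes whose max/min ratio and max-min difference both meet the thresholds."""
    kept = {}
    for gene, levels in matrix.items():
        lo, hi = min(levels), max(levels)
        if lo > 0 and hi / lo >= min_fold and hi - lo >= min_delta:
            kept[gene] = levels
    return kept

# Hypothetical expression signals for two genes across three samples.
matrix = {
    "GENE_A": [100.0, 120.0, 110.0],  # nearly constant, filtered out
    "GENE_B": [50.0, 400.0, 90.0],    # varies widely, retained
}
filtered = variation_filter(matrix)
z_scores = standardize([1.0, 2.0, 3.0])
```

Whether to filter before or after normalization is a design choice; filtering on raw signals, as here, keeps the thresholds in the original measurement units.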
Expression profiles in un-fractionated PBMCs of leukemia patients are compared to the corresponding expression profiles in disease-free humans. Genes that are differentially expressed in un-fractionated PBMCs of leukemia patients as compared to in un-fractionated PBMCs of disease-free humans can be identified. These genes are hereinafter referred to as leukemia disease genes. By "differentially expressed," it is meant that the average expression level of a leukemia disease gene in un-fractionated PBMCs of leukemia patients is statistically significantly different from that in un-fractionated PBMCs of disease-free humans. In many instances, the p-value of a Student's t-test (e.g., two-tailed distribution, two-sample unequal variance) for the observed difference is no more than 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, or less. The average expression level of a leukemia disease gene in un-fractionated PBMCs of leukemia patients can be substantially higher or lower than that in disease-free PBMCs. For instance, the average expression level of a leukemia disease gene in PBMCs of leukemia patients can be at least 1, 2, 3, 4, 5, 10, 20, or more fold higher or lower than that in PBMCs of disease-free humans. Leukemia disease genes that are differentially expressed in patients who have different leukemias (e.g., AML versus MDS) can be similarly identified.
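The two-sample, unequal-variance comparison and the fold difference referred to above may be illustrated as follows. The sketch computes the unequal-variance (Welch) t statistic and its approximate degrees of freedom using only the Python standard library; the two-tailed p-value would then be read from a t distribution with those degrees of freedom, for example using a statistics package, and is not computed here. The function names and expression values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def fold_change(a, b):
    """Ratio of mean expression in group a to group b (positive levels assumed)."""
    return mean(a) / mean(b)

# Hypothetical expression levels of one gene in two groups of PBMC samples.
leukemia = [8.0, 10.0, 12.0]
disease_free = [2.0, 2.5, 3.0]
t, df = welch_t(leukemia, disease_free)
fold = fold_change(leukemia, disease_free)
```

In this example the gene is four-fold higher on average in the first group; whether the difference meets a chosen p-value cutoff depends on the t statistic and degrees of freedom returned.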
Leukemia disease genes can also be identified using supervised or unsupervised clustering algorithms. Non-limiting examples of supervised clustering algorithms include the nearest-neighbor analysis, support vector machines, the SAM (Significance Analysis of Microarrays) method, artificial neural networks, and SPLASH. Non-limiting examples of unsupervised clustering algorithms include self-organizing maps (SOMs), k-means, principal component analysis, and hierarchical clustering.
The nearest-neighbor analysis, also known as the neighborhood analysis, is described in Golub et al., (1999) Science 286:531-537; Slonim et al., (2000) Procs. of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan, April 8-11, pp. 263-272; and U.S. Patent No. 6,647,341, all of which are incorporated herein by reference. In the analysis, the expression profile of each gene is represented by an expression vector $g = (e_1, e_2, e_3, \ldots, e_n)$, where $e_i$ corresponds to the expression level of gene "g" in the ith sample. A class distinction can be represented by an idealized expression pattern $c = (c_1, c_2, c_3, \ldots, c_n)$, where $c_i = 1$ or $-1$, depending on whether the ith sample is isolated from class 0 or class 1. Class 0 includes subjects having a first disease status (e.g., disease-free), and class 1 includes subjects having a second disease status (e.g. AML or MDS). Other forms of class distinction can also be employed. Typically, a class distinction represents an idealized expression pattern, where the expression level of a gene is uniformly high for samples in one class and uniformly low for samples in the other class.
The correlation between gene "g" and the class distinction can be measured by a signal-to-noise score:
$$P(g,c) = \frac{\mu_1(g) - \mu_2(g)}{\sigma_1(g) + \sigma_2(g)}$$
where $\mu_1(g)$ and $\mu_2(g)$ represent the means of the log-transformed expression levels of gene "g" in class 0 and class 1, respectively, and $\sigma_1(g)$ and $\sigma_2(g)$ represent the standard deviations of the log-transformed expression levels of gene "g" in class 0 and class 1, respectively. A higher absolute value of a signal-to-noise score indicates that the gene is more highly expressed in one class than in the other. In one example, the samples used to derive the signal-to-noise scores comprise enriched or purified un-fractionated PBMCs and, therefore, the signal-to-noise score P(g,c) represents a correlation between the class distinction and the expression level of gene "g" in un-fractionated PBMCs. The correlation between gene "g" and the class distinction can also be measured by other methods, such as the Pearson correlation coefficient or the Euclidean distance, as appreciated by those skilled in the art.
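The signal-to-noise score P(g,c) may be computed, for example, as sketched below. The use of the natural logarithm and of the sample (n-1) standard deviation are assumptions made for this illustration, and the function name and expression values are hypothetical.

```python
import math
from statistics import mean, stdev

def signal_to_noise(class0, class1):
    """P(g,c): difference of class means divided by the sum of class standard deviations,
    both taken on log-transformed expression levels."""
    log0 = [math.log(x) for x in class0]
    log1 = [math.log(x) for x in class1]
    # Note: raises ZeroDivisionError if both classes have zero spread.
    return (mean(log0) - mean(log1)) / (stdev(log0) + stdev(log1))

# Hypothetical expression levels of one gene in class 0 and class 1 samples.
class0_levels = [100.0, 120.0, 110.0]
class1_levels = [20.0, 25.0, 22.0]
score = signal_to_noise(class0_levels, class1_levels)
```

Because the numerator and denominator scale by the same factor when the logarithm base changes, the score does not depend on the base chosen. A positive score indicates higher expression in class 0 and a negative score indicates higher expression in class 1.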
The significance of the correlation between gene expression profiles in un-fractionated PBMCs and a class distinction can be evaluated using a random permutation test. An unusually high density of genes within the neighborhoods of the class distinction, as compared to random patterns, suggests that many of these genes have expression patterns that are significantly correlated with the class distinction. The correlation between genes and a class distinction can be diagrammatically viewed through a neighborhood analysis plot, in which the y-axis represents the number of genes within various neighborhoods around the class distinction and the x-axis indicates the size of the neighborhood (i.e., P(g,c)).
Curves showing different significance levels for the number of genes within corresponding neighborhoods of randomly permuted class distinctions can also be included in the plot.
In many embodiments, the leukemia disease genes identified by the present invention are above the median significance level in the neighborhood analysis plot. This means that the correlation measure P(g,c) for each of these leukemia disease genes is such that the number of genes within the neighborhood of the class distinction having the size of P(g,c) is greater than the number of genes within the corresponding neighborhoods of randomly permuted class distinctions at the median significance level. The leukemia disease genes identified by the present invention can also be above the 40%, 30%, 20%, 10%, 5%, 2%, or 1% significance level. As used herein, x% significance level means that x% of random neighborhoods contain as many genes as the real neighborhood around the class distinction.
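The permutation test and significance levels described above can be sketched as follows. This is a simplified Python illustration, not the procedure of the cited references: the function names, the threshold, the number of permutations, and the synthetic expression matrix are all assumptions introduced for the example.

```python
import random
import statistics


def s2n(values, labels):
    """Signal-to-noise score; labels are 0 or 1, values are log-transformed."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b))


def neighborhood_size(expr, labels, threshold):
    """Number of genes whose |P(g,c)| is at least `threshold`."""
    return sum(1 for values in expr if abs(s2n(values, labels)) >= threshold)


def permuted_sizes(expr, labels, threshold, n_perm=100, seed=0):
    """Neighborhood sizes for randomly permuted labels, largest first."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        sizes.append(neighborhood_size(expr, shuffled, threshold))
    return sorted(sizes, reverse=True)


def significance_cutoff(sorted_sizes, percent):
    """Size reached or exceeded by about `percent`% of permuted neighborhoods."""
    index = max(0, int(len(sorted_sizes) * percent / 100) - 1)
    return sorted_sizes[index]


# Synthetic data (hypothetical): 20 genes, 12 samples, 6 per class.
labels = [0] * 6 + [1] * 6
expr = []
for i in range(20):
    row = [((7 * i + 2 * j) % 11 - 5) / 5.0 for j in range(12)]
    if i < 5:  # the first five genes are strongly elevated in class 1
        row = [v + 8.0 if labels[j] == 1 else v for j, v in enumerate(row)]
    expr.append(row)

real = neighborhood_size(expr, labels, threshold=1.0)
perm = permuted_sizes(expr, labels, threshold=1.0)
median_cutoff = significance_cutoff(perm, 50)
one_percent_cutoff = significance_cutoff(perm, 1)
```

A real neighborhood that is larger than `median_cutoff` lies above the median significance level in the sense used above; comparing `real` against `one_percent_cutoff` applies the stricter 1% level.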
The leukemia disease genes identified by the nearest-neighbor analysis can be used to construct class predictors. Each class predictor includes two or more leukemia disease genes, and can be used to assign a subject of interest to a disease status (e.g., AML, MDS, or disease-free). In one embodiment, a class predictor includes or consists of leukemia disease genes that are significantly correlated with a class distinction by the permutation test (e.g., genes above the 1%, 2%, 5%, 10%, 20%, 30%, 40%, or 50% significance level). In another embodiment, a class predictor includes or consists of leukemia disease genes that have top absolute values of P(g,c).
The SAM method can also be used to correlate disease statuses with gene expression profiles in un-fractionated PBMCs. The prediction analysis of microarrays (PAM) method can be used to identify class predictors that can best characterize a predefined disease or disease-free class and predict the class membership of new samples. See, for example, Tibshirani et al., (2002) Proc. Natl. Acad. Sci. U.S.A. 99:6567-6572.
The prediction accuracy of a class predictor of the present invention can be evaluated by k-fold cross validation, such as 10-fold cross validation, 4-fold cross validation, or leave-one-out cross validation. In a typical k-fold cross validation, the data is divided into k subsets of approximately equal size. The model is trained k times, each time leaving out one of the subsets from training and using the omitted subset as the test samples to calculate the prediction error. Where k equals the sample size, it becomes the leave-one-out cross validation.
Other methods can also be used to identify leukemia disease genes. These methods include, but are not limited to, quantitative RT-PCR, Northern Blot, in situ hybridization, protein arrays, immunoassays (e.g., ELISA, RIA or Western Blot), differential display, serial analysis of gene expression (SAGE), representation differential analysis (RDA), subtractive hybridization, GeneCalling® (CuraGen, New Haven, CT), and total gene expression analysis (TOGA). Genes thus identified are differentially expressed in un-fractionated PBMCs of one class of subjects relative to another class of subjects, each class of subjects having a different disease status (e.g., AML, MDS, or disease-free).
The above-described methods can also be used to identify genes whose expression profiles in un-fractionated PBMCs are predictive of different stages of leukemia progression, or different clinical responses of leukemia patients to a therapeutic treatment. For instance, gene expression profiles in PBMCs of MDS patients who eventually progress to AML can be compared to the corresponding gene expression profiles in MDS patients who do not progress to AML. Genes that are differentially expressed in these two classes of patients can be identified and used for the prediction of progression from MDS to AML. For another instance, leukemia patients can be grouped based on their responses to a therapeutic treatment. The global gene expression analysis is then used to identify genes that are differentially expressed in PBMCs of one group of patients versus another group. Genes thus identified are predictive of clinical outcome of a leukemia patient in response to the therapeutic treatment.
B. Identification of AML and MDS Disease Genes
HG-U133A Genechips® (Affymetrix, Inc.) were used to identify AML or MDS disease genes. Genes that were differentially expressed in un-fractionated PBMCs of AML (or MDS) patients as compared to in disease-free humans were identified. Genes that were differentially expressed in un-fractionated PBMCs of AML patients as compared to in MDS patients were also identified.
Table 1 lists qualifiers on HG-U133A Genechips® that showed elevated or decreased signals when hybridized to AML samples as compared to disease-free samples. Each qualifier in Table 1 corresponds to an AML disease gene which is differentially expressed in un-fractionated PBMCs of AML patients as compared to in disease-free humans. The hybridization signal at each qualifier represents the expression level of the corresponding gene in un-fractionated PBMCs.
Table 1 also illustrates the average hybridization signals at each qualifier for AML ("AML Average") or disease-free samples ("Disease-Free Average"). The standard deviations of these signals ("AML StDev" and "Disease-Free StDev," respectively) are also provided. In addition, the ratios between AML and disease-free hybridization signals ("AML/Disease-Free") and the p-values of Student's t-test (two-tailed distribution, two-sample unequal variance) for the observed differences are provided.
Table 1. Genes Differentially Expressed in AML vs. Disease-Free PBMCs
(Table 1 Removed)
Table 2. Annotation of Genes Differentially Expressed in AML vs. Disease-Free PBMCs
(Table 2 Removed)
Each qualifier on a HG-U133A Genechip® represents a set of oligonucleotide probes (PM or perfect match probe) that are stably attached to the respective regions on the Genechip®. The RNA transcript (or the complement thereof) of the gene identified by a qualifier can hybridize under nucleic acid array hybridization conditions to at least one oligonucleotide probe of the qualifier. Preferably, the RNA transcript (or the complement thereof) of the gene does not hybridize under nucleic acid array hybridization conditions to the mismatch (MM) probes of the qualifier. A mismatch probe is identical to the corresponding PM probe except for a single, homomeric substitution at or near the center of the mismatch probe. For instance, the MM probe for a 25-mer PM probe has a homomeric base change at the 13th position.
In one embodiment, the RNA transcript (or the complement thereof) of the gene identified by a qualifier can hybridize under nucleic acid array hybridization conditions to at least 50%, 60%, 70%, 80%, 90% or 100% of the PM probes of the qualifier, but not to the corresponding mismatch probes. The discrimination score (R) for each of these PM probes, as measured by the ratio of the hybridization intensity difference of the corresponding probe pair (i.e., PM - MM) over the overall hybridization intensity (i.e., PM + MM), can be no less than 0.015, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5 or greater. In another embodiment, the RNA transcript (or the complement thereof) of the gene, when hybridized to a HG-U133A Genechip® according to the manufacturer's instructions, produces a "present" call at the corresponding qualifier under the default settings (i.e., the threshold Tau is 0.015 and the significance level [...]
[...] 6 mM EDTA, pH 7.4, and 0.005% Triton X-100. In some cases, the wash buffer is replaced with a more stringent wash buffer. 1000 ml of the stringent wash buffer can be prepared by mixing 83.3 ml of 12x MES stock, 5.2 ml of 5 M NaCl, 1.0 ml of 10% Tween 20 and 910.5 ml of water.
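For preparing a volume other than 1000 ml, the stringent wash buffer recipe can be scaled proportionally. The following Python sketch shows the arithmetic; the component volumes are taken from the recipe above, while the function name and the rounding to two decimal places are assumptions made for the example.

```python
# Volumes (ml) per 1000 ml of stringent wash buffer, as stated in the text.
RECIPE_ML_PER_LITER = {
    "12x MES stock": 83.3,
    "5 M NaCl": 5.2,
    "10% Tween 20": 1.0,
    "water": 910.5,
}


def scale_recipe(total_ml):
    """Scale each component linearly to the requested total volume (ml)."""
    factor = total_ml / 1000.0
    return {name: round(ml * factor, 2)
            for name, ml in RECIPE_ML_PER_LITER.items()}


# Component volumes for 500 ml of stringent wash buffer.
half_liter = scale_recipe(500)
```

The four component volumes of the recipe add up to 1000 ml, so the scaled volumes add up to the requested total, apart from rounding.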
Example 4. Gene Expression Data Analysis
Data analysis and absent/present call determination are performed on raw fluorescent intensity values using Genechip® 3.2 software (Affymetrix). The "average difference" values for each transcript are normalized to "frequency" values using the scaled frequency normalization method, in which the average differences for 11 control cRNAs with known abundance spiked into each hybridization solution are used to generate a global calibration curve. See Hill et al. (2001) Genome Biol., 2(12):research0055.1-0055.13, the entire content of which is incorporated herein by reference. This calibration is then used to convert average difference values for all transcripts to frequency estimates, stated in parts per million (ppm) and ranging from 1:300,000 (approximately 3 ppm) to 1:1000 (1000 ppm).
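The idea of converting average difference values to frequencies through a calibration built from spiked controls can be illustrated with the following Python sketch. It assumes a straight-line calibration through the origin fitted by least squares, which is a simplification chosen for the example and is not asserted to be the method of Hill et al.; the function names and all spike-in values are hypothetical.

```python
def fit_calibration(spike_avg_diff, spike_ppm):
    """Least-squares slope through the origin: ppm ~= slope * average difference.

    Assumption: a linear calibration; the cited method may differ.
    """
    numerator = sum(a * f for a, f in zip(spike_avg_diff, spike_ppm))
    denominator = sum(a * a for a in spike_avg_diff)
    return numerator / denominator


def to_frequency(avg_diff, slope):
    """Convert an average difference value to a frequency estimate in ppm."""
    return slope * avg_diff


# Hypothetical known abundances (ppm) of 11 spiked control cRNAs and the
# average difference values measured for them on one array.
spike_ppm = [3, 5, 10, 20, 33, 50, 100, 200, 333, 500, 1000]
spike_avg_diff = [10.0 * f for f in spike_ppm]

slope = fit_calibration(spike_avg_diff, spike_ppm)
frequency = to_frequency(250.0, slope)  # frequency estimate for one transcript
```

Fitting one calibration per array places all arrays on a common frequency scale, which is what allows frequency values to be compared between samples.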
Genechip® 3.2 software uses algorithms to calculate the likelihood as to whether a gene is "absent" or "present" as well as a specific hybridization intensity value or "average difference" for each transcript represented on the array. The
algorithms used in these calculations are described in the Affymetrix Genechip® Analysis Suite User Guide.
Specific transcripts can be evaluated further if they meet the following criteria. First, genes that are designated "absent" by the Genechip® 3.2 software in all samples are excluded from the analysis. Second, in comparisons of transcript levels between arrays, a gene is required to be present in at least one of the arrays. Third, for comparisons of transcript levels between groups, a Student's t-test is applied to identify a subset of transcripts that have a significant difference (p < 0.05) in frequency values. In many cases, a fourth criterion, which requires that average fold changes in frequency values across the statistically significant subset of genes be 2-fold or greater, is also used.
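The selection criteria above can be sketched as a simple filter. In the following Python illustration, the `Transcript` record, its field names, and the example values are hypothetical; the p-value is assumed to have been computed separately by a t-test, and the fold change is interpreted as the ratio of the two group means, which is one possible reading of the fourth criterion.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Transcript:
    name: str
    calls: list        # "P" (present) or "A" (absent), one call per array
    group_a_ppm: list  # frequency values in the first group
    group_b_ppm: list  # frequency values in the second group
    p_value: float     # from a t-test computed elsewhere


def passes_filters(t, alpha=0.05, min_fold=2.0):
    """Apply the present-call, significance, and fold-change criteria."""
    if not any(call == "P" for call in t.calls):
        return False   # absent on every array
    if t.p_value >= alpha:
        return False   # difference not statistically significant
    mean_a, mean_b = mean(t.group_a_ppm), mean(t.group_b_ppm)
    low, high = min(mean_a, mean_b), max(mean_a, mean_b)
    if low <= 0:
        return False   # fold change undefined; excluded in this sketch
    return high / low >= min_fold


# Hypothetical transcripts (illustrative values only).
transcripts = [
    Transcript("gene_a", ["P", "A"], [10, 12], [30, 34], 0.01),
    Transcript("gene_b", ["A", "A"], [10, 12], [30, 34], 0.01),
    Transcript("gene_c", ["P", "P"], [10, 12], [30, 34], 0.20),
    Transcript("gene_d", ["P", "P"], [10, 12], [15, 18], 0.01),
]
kept = [t.name for t in transcripts if passes_filters(t)]
```

Only `gene_a` passes in this example: `gene_b` is absent on every array, `gene_c` is not significant, and `gene_d` changes by less than 2-fold.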
Unsupervised hierarchical clustering of genes and/or arrays on the basis of similarity of their expression profiles can be performed using the procedure described in Eisen et al. (1998) Proc. Nat. Acad. Sci. U.S.A. 95:14863-14868. Nearest-neighbor prediction analysis and supervised cluster analysis can be performed using metrics illustrated in Golub et al., supra. For hierarchical clustering and nearest-neighbor prediction analysis, data can be first log-transformed and then normalized to have a mean value of zero and a variance of one. A Student's t-test can be used to compare disease-free, AML and MDS PBMC expression profiles. A p value of no more than 0.05 (e.g., no more than 0.01, 0.001, or less) can be used to indicate statistical significance.
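The log-transformation and normalization to a mean of zero and a variance of one can be sketched as follows. In this Python illustration the function name, the natural logarithm, the use of the population standard deviation, and the frequency values are assumptions made for the example.

```python
import math
import statistics


def standardize(levels):
    """Log-transform positive expression levels, then scale to mean 0, variance 1."""
    logs = [math.log(x) for x in levels]
    mu = statistics.mean(logs)
    # Assumption: population standard deviation, so the variance is exactly one.
    sd = statistics.pstdev(logs)
    return [(v - mu) / sd for v in logs]


# Hypothetical frequency values (ppm) for one gene across five samples.
normalized = standardize([12.0, 30.0, 45.0, 8.0, 100.0])
```

After this step, genes with very different absolute abundances contribute on a comparable scale to distance-based methods such as clustering and nearest-neighbor prediction.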
A k-nearest-neighbor approach can be used to perform a neighborhood analysis of real and randomly permuted data using a correlation metric $P(g,c) = (\mu_1 - \mu_2) / (\sigma_1 + \sigma_2)$, where g is the expression vector of gene g, c is a class vector, $\mu_1$ and $\sigma_1$ define the mean expression level and standard deviation of gene g in class 1, respectively, and $\mu_2$ and $\sigma_2$ define the mean expression level and standard deviation of gene g in class 2, respectively. The measures of correlation for the most statistically significant genes observed in real class distinctions (AML versus disease-free, MDS versus disease-free, or AML versus MDS) can be compared to the most statistically significant measures of correlation observed in randomly permuted class distinctions. The top 1%, 5% and median distance measurements of 100 randomly permuted classes compared to the observed distance measurements for AML versus disease-free, MDS versus disease-free, or AML versus MDS can be plotted to show the statistical verification of the leukemia disease genes identified by this invention.
Example 5. Gene Classifiers for Prediction of Disease-Free versus MDS versus AML
A 24-qualifier signature (8 cDNAs representing 7 genes defining AML, 8 cDNAs representing 7 genes defining MDS, and 8 cDNAs representing 8 genes defining disease-free) was identified. This signature can accurately predict and classify PBMC samples of disease-free individuals, MDS patients, or AML patients. This signature also identifies rapid MDS progressors as "AML," with implications for early detection of AML progression in MDS patients.
The qualifiers in the 24-qualifier signature are listed in Table 8, below. For each qualifier, the signal-to-noise value associated with the qualifier is provided in the column labeled "Score." Each signal-to-noise value was greater than the value in the adjacent "Perm 1%" column, representing the signal-to-noise values observed for the top 1% of random permutations when the labels of the profiles were scrambled and then compared using identical class sizes. Thus, the actual signal-to-noise values for the qualifiers were superior to those in the top 1% of random permutations. The corresponding human genes are identified by name, by symbol, by chromosomal location ("Cyto Band"), by Unigene number, and by GenBank accession number. Human genes used to identify AML include human myb; human neuronal protein 3.1; human myeloperoxidase; human catalase; human CGI-49; human stem cell growth factor; and human serine peptidase inhibitor, Kazal type 2 (acrosin-trypsin inhibitor). Human genes used to identify MDS include human NEDD4L; human glutathione peroxidase 3; human X-linked Kx blood group; human synuclein, alpha; human chromosome 8 open reading frame 51/hypothetical protein MGC3113; human interferon, alpha-inducible protein 27; and human transglutaminase 3. Human genes used to identify PBMCs from disease-free individuals include human chromosome 21 open reading frame 7; human amyloid beta A4 precursor protein-binding family A member 2; human KIAA0449; human F-box only protein 21; human death effector filament-forming ced-4-like apoptosis
protein; human zinc finger protein 14; human vasoactive intestinal peptide receptor 1; and human KIAA0443.
Gene expression patterns in peripheral blood mononuclear cells were measured by oligonucleotide arrays for 45 disease-free subjects, 36 patients diagnosed with AML and 20 patients with initial diagnoses of MDS. Comparisons of these groups identified transcriptional differences that easily separated AML and MDS from healthy volunteers, and annotation revealed that many of the differences appeared due to proliferation of CD34+ blasts in the circulation of these patients. The possibility of discriminating between MDS and AML patients on the basis of transcriptional profiles in peripheral blood was next explored. Of the 20 patients with initial diagnoses of MDS, six of the patient samples were determined to come from either 1) MDS patients with conflicting diagnoses between the site pathologist and a central pathologist (n=3) or 2) MDS patients who rapidly progressed after blood sampling (< 3 months, n=3). A supervised approach on a training set of healthy, AML and non-progressor MDS samples was used to identify a gene classifier correlated with profiles in healthy individuals, stable MDS patients, and AML patients. An 8-gene classifier was optimally predictive, exhibiting an overall accuracy of 94% in the training set (62/66 subjects correctly assigned by leave-one-out cross validation). One of the four misclassified samples in the training set was from an MDS patient with a conflicting diagnosis who was "misclassified" upon cross validation as AML. When the 8-gene classifier was applied to the remaining samples in the test set, this classifier identified the remaining unambiguous samples in the test set with similar accuracy (87% overall accuracy of class assignment). This 8-gene predictor also assigned both samples from patients with conflicting diagnoses and all three samples from MDS patients with rapid times to disease progression as originating from patients with AML.
These preliminary results imply that AML-like transcriptional profiles of MDS-diagnosed patients can precede standard clinical evidence of AML progression (e.g., blast hyperproliferation). The results from these studies indicate that the expression pattern of select transcripts in peripheral blood can provide early indicators of AML progression in leukemic patients with blast percentages that are commonly associated with a diagnosis of MDS.
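The leave-one-out cross validation used in this example to estimate training-set accuracy can be illustrated with the following Python sketch. A k-nearest-neighbor majority vote is used here purely for illustration; the function names, the value of k, the Euclidean distance, and the two-feature synthetic samples are assumptions, and the sketch holds the features fixed rather than repeating gene selection within each fold.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Majority vote among the k training samples nearest to `query`."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        if knn_predict(training, features, k) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical normalized expression of two classifier genes per sample.
samples = [
    ((0.0, 0.0), "disease-free"), ((0.0, 1.0), "disease-free"),
    ((1.0, 0.0), "disease-free"), ((1.0, 1.0), "disease-free"),
    ((10.0, 0.0), "MDS"), ((10.0, 1.0), "MDS"),
    ((11.0, 0.0), "MDS"), ((11.0, 1.0), "MDS"),
    ((0.0, 10.0), "AML"), ((0.0, 11.0), "AML"),
    ((1.0, 10.0), "AML"), ((1.0, 11.0), "AML"),
]
accuracy = leave_one_out_accuracy(samples)
```

Because each held-out sample is predicted by a model that never saw it, the resulting accuracy is a less optimistic estimate than one obtained by re-classifying the training samples themselves.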
Table 8. 24-qualifier classifier for AML, MDS, and normal PBMCs
(Table Removed)
The foregoing description of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. Thus, it is noted that the scope of the invention is defined by the claims and their equivalents.
We claim:
1. A method for diagnosis, or monitoring the occurrence, development, progression
or treatment, of myelodysplastic syndromes (MDS), the method comprising the
steps of:
(1) generating a gene expression profile from a peripheral blood sample of a
subject; and
(2) comparing the gene expression profile to one or more reference
expression profiles,
wherein the gene expression profile and the one or more reference expression profiles comprise the expression patterns of one or more MDS disease genes in peripheral blood mononuclear cells (PBMCs), and wherein the difference or similarity between the gene expression profile and the one or more reference expression profiles is indicative of the presence, absence, occurrence, development, progression, or effectiveness of treatment of MDS in the subject.
2. The method of claim 1, wherein the peripheral blood sample comprises whole
blood or enriched un-fractionated PBMCs.
3. The method of any one of the preceding claims, wherein the one or more MDS
disease genes comprise one or more genes selected from Tables 4 or 6.
4. The method of any one of the preceding claims, wherein the one or more MDS
disease genes comprise ten or more genes selected from Tables 4 or 6.
5. The method of any one of the preceding claims, wherein the one or more MDS
disease genes comprise one or more MDS disease genes selected from Table 8.
6. The method of any one of the preceding claims, wherein the one or more
reference expression profiles comprise a reference expression profile representing a
disease-free human.
7. The method of any one of the preceding claims, wherein step (2) comprises
comparing the gene expression profile to the one or more reference expression
profiles by k-nearest neighbor analysis or a weighted voting algorithm.
8. The method of any one of the preceding claims, further comprising the step of
diagnosing or assessing MDS in the subject based on the comparison of step (2).
9. A method for diagnosis, or monitoring the occurrence, development, progression
or treatment, of leukemia in a subject, the method comprising the steps of:
(1) generating a gene expression profile from a peripheral blood sample from
the subject; and
(2) comparing the gene expression profile to one or more reference
expression profiles,
wherein the gene expression profile and the one or more reference expression profiles comprise the expression patterns of one or more leukemia disease genes selected from Table 4 or 6 in peripheral blood mononuclear cells (PBMCs), wherein the difference or similarity between the gene expression profile and the one or more reference expression profiles is indicative of the presence, absence, occurrence, development, progression, or effectiveness of treatment of leukemia in the subject, wherein the one or more leukemia disease genes are not recited in Table 2.
10. The method of claim 9, wherein the peripheral blood sample comprises whole
blood or enriched un-fractionated PBMCs.
11. The method of claim 9 or 10, wherein the one or more leukemia disease genes
comprise ten or more genes selected from Table 4 or 6.
12. The method of any one of claims 9-11, wherein the one or more leukemia
disease genes selected from Table 4 or 6 comprise one or more genes also recited in
Table 8.
13. The method of any one of claims 9-12, wherein the one or more reference
expression profiles comprise a reference expression profile representing a disease-
free human.
14. The method of any one of claims 9-13, wherein step (2) comprises comparing
the gene expression profile to the one or more reference expression profiles by a k-
nearest neighbor analysis or a weighted voting algorithm.
15. The method of any one of claims 9-14, wherein the leukemia is an acute
myelogenous leukemia.
16. The method of any one of claims 9-14, wherein the leukemia is a
myelodysplastic syndrome.
17. The method of any one of claims 9-16, further comprising the step of
diagnosing or assessing leukemia in the subject based on the comparison of step (2).
18. A method for identifying an MDS patient who is likely to progress to acute
myelogenous leukemia (AML), the method comprising the steps of:
(1) generating a gene expression profile from a peripheral blood sample from
an MDS patient; and
(2) comparing the gene expression profile to one or more reference
expression profiles, wherein the gene expression profile and the one or more
reference expression profiles comprise the expression patterns of one or more
leukemia disease genes selected from Table 6 in peripheral blood mononuclear cells
(PBMCs), wherein the difference or similarity between the gene expression profile
and the one or more reference expression profiles is indicative that the MDS patient
is likely to progress to AML.
19. The method of claim 18, wherein the one or more reference expression profiles
comprise a reference expression profile representing an AML patient.
20. The method of claim 18 or 19, wherein the peripheral blood sample comprises
whole blood or enriched un-fractionated PBMCs.
21. The method of any one of claims 18-20, wherein the one or more leukemia
disease genes selected from Table 6 are also recited in Table 8.
22. An array for use in a method for diagnosing a myelodysplastic syndrome
(MDS) comprising a substrate having a plurality of addresses, each address
comprising a distinct probe disposed thereon, wherein at least 15% of the plurality
of addresses have disposed thereon probes that can specifically detect MDS disease
genes in peripheral blood mononuclear cells.
23. The array of claim 22, wherein at least 30% of the plurality of addresses have
disposed thereon probes that can specifically detect MDS disease genes in peripheral
blood mononuclear cells.
24. The array of claim 22, wherein at least 50% of the plurality of addresses have
disposed thereon probes that can specifically detect MDS disease genes in peripheral
blood mononuclear cells.
25. The array of any one of claims 22-24, wherein the MDS disease genes are
selected from Table 4.
26. The array of any one of claims 22-25, wherein the probe is a nucleic acid probe.
27. The array of any one of claims 22-25, wherein the probe is an antibody probe.
28. An array for use in a method for diagnosis of leukemia comprising a substrate
having a plurality of addresses, each address comprising a distinct probe disposed
thereon, wherein at least 15% of the plurality of addresses have disposed thereon
probes that can specifically detect genes selected from Tables 4 or 6, wherein the
genes are not recited in Table 2.
29. The array of claim 28, wherein at least 30% of the plurality of addresses have
disposed thereon probes that can specifically detect genes selected from Tables 4 or
6, wherein the genes are not recited in Table 2.
30. The array of claim 28, wherein at least 50% of the plurality of addresses have
disposed thereon probes that can specifically detect genes selected from Tables 4 or
6, wherein the genes are not recited in Table 2.
31. The array of any one of claims 28-30, wherein the probe is a nucleic acid probe.
32. The array of any one of claims 28-30, wherein the probe is an antibody probe.
33. A computer-readable medium comprising a digitally-encoded expression profile
comprising a plurality of digitally-encoded expression signals, wherein each of the
plurality of digitally-encoded expression signals comprises a value representing the
expression of a gene selected from Tables 4 or 6, wherein the gene is not recited in
Table 2.
34. The computer-readable medium of claim 33, wherein the value represents the
expression of the gene in a peripheral blood mononuclear cell of a patient with a
myelodysplastic syndrome (MDS).
35. The computer-readable medium of claim 33, wherein the value represents the
expression of the gene in a peripheral blood mononuclear cell of a patient with acute
myelogenous leukemia (AML).
36. The computer-readable medium of claim 33, wherein the digitally-encoded
expression profile comprises at least ten digitally-encoded expression signals.
37. A kit for diagnosis of a myelodysplastic syndrome (MDS), the kit comprising:
a) one or more probes that can specifically detect MDS disease genes in peripheral
blood mononuclear cells; and b) one or more controls, each representing a reference
expression level of an MDS disease gene detectable by the one or more probes.
38. The kit of claim 37, wherein the MDS disease genes are selected from Table 4.
39. The kit of claim 38, wherein the MDS disease genes selected from Table 4 are
also recited in Table 8.
40. A kit for diagnosis of leukemia, the kit comprising: a) one or more probes that
can specifically detect genes selected from Tables 4 or 6, wherein the genes are not
recited in Table 2; and b) one or more controls, each representing a reference
expression level of a disease gene detectable by the one or more probes.
41. The kit of claim 40, wherein the genes selected from Tables 4 or 6 are also
recited in Table 8.
42. A method of making a decision regarding an individual, the method comprising
the step of:
assigning the individual to a class based on a value that is a function of the expression, in a peripheral blood sample from the individual, of one or more genes selected from Tables 4 or 6, wherein the genes are not recited in Table 2, thereby making a decision regarding the individual.
43. The method of claim 42, wherein the one or more genes selected from Tables 4
or 6 are also recited in Table 8.
44. The method of claim 42 or 43, wherein the decision is recorded.
45. The method of claim 44, wherein the decision is recorded in a computer-
readable medium.
46. The method of any one of claims 42-45, wherein the method further includes
selecting a leukemia treatment based on the assignment.
47. The method of any one of claims 42-46, wherein the method further includes
administering a leukemia treatment based on the assignment.
48. The method of any one of claims 42-47, wherein the method further includes
issuing, transmitting or receiving a prescription for a leukemia treatment based on
the assignment.
49. The method of any one of claims 42-48, wherein the method further includes
authorizing, paying for, or causing a transfer of funds to pay for a leukemia
treatment based on the assignment.
50. The invention substantially such as herein described.