Abstract: Transcriptome-based prediction of heterosis or hybrid vigour and other complex phenotypic traits. Analysis of transcript abundance in predictive gene sets, for predicting magnitude of heterosis or other complex traits in plants and animals. Transcriptome-based screening and selection of individuals with desired traits and/or good hybrid vigour.
This invention relates to methods of producing hybrid plants and
hybrid non-human animals having high levels of hybrid vigour or
heterosis and/or producing plants and non-human animals (e.g.
hybrid, inbred or recombinant plants) having other traits such as
desired flowering time, seed oil content and/or seed fatty acid
ratios, and plants and non-human animals produced by these
methods.
The invention relates to selection of suitable organisms,
preferably plants or non-human animals, for use in producing
hybrids and/or for use in breeding programmes, e.g. screening of
germplasm collections for plants that may be suitable for
inclusion in breeding programmes.
Many animal and plant species exhibit increased growth rates,
reach larger sizes and, in the cases of crops [1,2] and farm
animals [3, 4], have higher yields and productivity when bred as
hybrids, produced by crossing genetically dissimilar parents, a
phenomenon known as hybrid vigour or heterosis [5]. The term
heterosis can be applied to almost any aspect of biology in which
a hybrid can be described as outperforming its parents.
The degree of heterosis observed varies considerably between
different hybrids. The magnitude of heterosis can be described
relative to the mean value of the parents (Mid-Parent Heterosis,
MPH) or relative to the "better" of the parents (Best-Parent
Heterosis, BPH).
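The two measures can be illustrated with a minimal sketch. It
assumes, for illustration only, that heterosis is expressed as a
percentage excess of the hybrid over the reference value and that
the "better" parent is the one with the larger trait value. The
function names and the fresh-weight figures are hypothetical and
are not taken from the Examples.

```python
def mid_parent_heterosis(hybrid, parent_a, parent_b):
    """Percent by which the hybrid exceeds the mean of its two parents."""
    mid_parent = (parent_a + parent_b) / 2.0
    return 100.0 * (hybrid - mid_parent) / mid_parent


def best_parent_heterosis(hybrid, parent_a, parent_b):
    """Percent by which the hybrid exceeds the better parent.

    Assumption: "better" means the larger trait value.
    """
    best_parent = max(parent_a, parent_b)
    return 100.0 * (hybrid - best_parent) / best_parent


# Hypothetical fresh weights, in arbitrary units.
mph = mid_parent_heterosis(hybrid=15.0, parent_a=10.0, parent_b=8.0)
bph = best_parent_heterosis(hybrid=15.0, parent_a=10.0, parent_b=8.0)
```

For a trait in which a smaller value is desirable, the choice of
"better" parent in the second function would need to be reversed.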
Heterosis is of great importance in many agricultural crops and
in plant and animal breeding, where it is clearly desirable to
produce hybrids with high levels of heterosis. However, despite
extensive genetic analysis in this area, the molecular mechanisms
underlying heterosis remain poorly understood. Some progress has
been made towards understanding the heterosis observed in simple
traits controlled by single genes [6], but the mechanisms
controlling more complex forms of heterosis, such as the
vegetative vigour of hybrids, remain unknown [7, 8, 9].
Genetic analyses of heterosis have led to three, non-exclusive,
genetic mechanisms being hypothesised to explain heterosis:
the "dominance" model, in which heterotic interactions are
considered to be the cumulative effect of the phenotypic
expression of dispersed dominant alleles, whereby deleterious
alleles that are homozygous in the respective parents are
complemented in the hybrids [2, 10];
the "overdominance" model, in which heterotic interactions
are considered to be the result of heterozygous loci resulting in
a phenotypic expression in excess of either parent, so that the
heterozygosity per se produces heterosis [5, 11, 12];
the "epistatic" model, which includes other types of
specific interactions between combinations of alleles at separate
loci [13, 14].
Hypothetical models based on gene regulatory networks have been
proposed to explain these types of interaction [15].
Whilst the hypothesised models attempt to explain in genetic
terms at least a proportion of heterosis observed in hybrids,
they do not provide a practical indicator that would enable
breeders to predict quantitatively the level of heterosis for a
given hybrid or to know which hybrid crosses are likely to
perform well.
In allogamous crops, such as maize, heterotic groups have been
established that enable the selection of inbreds that will show
good heterosis when crossed, for example Iowa Stiff Stalk vs.
Non-Stiff Stalk lines [16]. Inter-group hybrids have greater
genetic distance and heterosis than hybrids produced by crossing
within an individual heterotic group [17] and it has been
proposed that the level of genetic diversity may be a predictor
of heterosis and yield [18]. However, this has not proven to be
a reliable approach for the prediction of heterosis in crops
[17]. Heterosis shows an inconsistent relationship with the
degree of relatedness of the two parents, with an absence of
correlation reported between heterosis and genetic distance in
Arabidopsis thaliana [7, 19] and other species [20, 21, 22].
Thus, in general the level of heterosis observed in a hybrid does
not depend solely upon the genetic distance between the two
parents from which the hybrid was produced, nor does this
variable, genetic distance, necessarily provide a good indicator
of likely heterosis of hybrids.
At the gene transcript level, expression of alleles in a hybrid
may represent the cumulative level of expression of the alleles
inherited from each parent, or expression may be non-additive.
Non-additive patterns of gene expression are believed to
contribute to hybrid effects and therefore several studies have
investigated non-additive gene expression in hybrids compared
with their parents. Characteristics of the transcriptome (the
contribution to the mRNA pool of each gene in the genome) have
been analysed in heterotic hybrids of crop plants, and extensive
differences in gene expression in the hybrids relative to the
parents have been reported [23, 24, 25, 26, 27]. Hybrid
transcriptomes were shown to be different from the transcriptomes
of the parents. Quantitative changes were seen in the
contribution to the mRNA pool of a subset of genes, when the
transcriptomes of the hybrids were compared with the
transcriptomes of their parents. These experiments were
conducted with the expectation that differences in the
transcriptomes of the hybrids, compared with their parents,
contribute to the basis of heterosis.
Using differential display, Sun et al. [24] identified differences
in gene expression, of approximately 965 genes, between wheat
seedling hybrids and their parents. The hybrids were generated
from two single direction crosses, and represented one heterotic
and one non-heterotic sample. Differences in gene expression were
found between the hybrids and the parents, with some evidence
provided of differences in response between the hybrids. In
later experiments, Sun et al. [28] used differential display
techniques to identify changes in transcriptional remodelling for
2800 genes, between nine parental and 20 wheat hybrids. They
found that around 30% of these genes showed some degree of
remodelling. Broad trends in gene expression were assessed by
random amplification. Gene expression differences were observed
between the hybrid and both parents, between the hybrid and one
parent only, and genes expressed only in the hybrid. The total
number of non-additively expressed genes was found to correlate
with some traits. The authors concluded that these differences
in gene expression must be involved in developing a heterotic
phenotype.
Guo et al. [29] reported allele-specific variation in transcript
abundance in hybrids. Transcript abundance of 15 genes was
analysed in maize hybrids, and transcript levels for the two
alleles of each gene were compared. In 11 genes, the two alleles
were found to be expressed unequally (bi-allelic expression), and
in 4 genes just one allele was expressed (mono-allelic
expression). Allele-specific differences in expression were
observed between genetically different hybrids. Additionally,
the two alleles in each hybrid were shown to respond differently
to abiotic stress. Allele-specific differences may indicate
different functions for the two parental alleles in hybrids, and
this functional diversity of the two parental alleles in the
hybrid was suggested to have an impact on heterosis.
Auger et al. [27] examined differences in transcript abundance
between hybrids relative to their inbred parents. Several genes
were found to be expressed at non-additive levels in the hybrids,
but relevance to heterosis was not demonstrated.
Vuylsteke et al. [30] measured variations in transcript abundance
between three inbred lines and two pairs of reciprocal F1 hybrids
of Arabidopsis. Non-additive levels of gene expression in the
hybrids were used to estimate the proportion of genes expressed
in a "dominance" fashion according to a genetic model of
heterosis.
Microarray technology has also been used to study differences in
transcript abundance across plant populations. For example,
Kliebenstein et al. [31] used microarrays to quantify gene
expression in seven Arabidopsis accessions, and found an average
of 2234 genes to be significantly differentially expressed
between any pair of accessions. The differences in gene
expression were found to be related to sequence diversity in the
accessions. Kirst et al. [32] examined transcript abundance in a
pseudobackcross population of eucalyptus in order to compare
transcript regulation in different genetic backgrounds of
eucalyptus, and concluded that the genetic control of transcript
levels was modulated by variation at different regulatory loci in
different genetic backgrounds. Paux et al. [33] also conducted
transcript profiling of eucalyptus genes, to examine gene
expression during tension wood formation.
Another mechanism that has been proposed to explain heterosis is
complementation of bottlenecks in metabolic systems [34]. It is
possible that several different mechanisms are involved in
heterosis, so that any one specific mechanism may only explain a
proportion of heterosis observed.
Heterosis has been the subject of intense genetic analysis for
almost a century, but no reliable and accurate basis for
determining, predicting or influencing the degree of heterosis in
a given hybrid has yet been identified. Thus, there has been a
long-felt need to identify some basis on which parents may be
selected in order to produce hybrids of increased vigour.
Attempts to produce hybrids with high levels of heterosis must
currently be undertaken on the basis of trial and error, by
experimentally crossing different parents and then waiting for
the progeny to grow until it can be seen which of the new hybrids
exhibit the most vigour. Breeding for new heterotic hybrids thus
necessarily results in the co-production of significant numbers
of under-performing hybrids with low hybrid vigour. The desired
hybrids may not be obtained, or may only represent a fraction of
the total number of hybrids produced overall. Additionally,
hybrids must normally reach a certain age before their level of
heterosis can be determined, which increases still further the
time, cost and resources that must be invested in a breeding
program, since it is necessary to continue to grow large numbers
of hybrids even though many, or perhaps all, will not have the
desired characteristics.
A method that could provide at least some measure of prediction
of the level of heterosis likely to be exhibited by a given
hybrid could result in significantly more effective breeding
programs.
There are comparable needs to determine a basis on which plants
or animals may be selected as parents for producing hybrids with
further desirable multigenic traits, and for predicting which
hybrid, inbred or recombinant plants or animals are likely to
exhibit desired traits.
The invention disclosed herein is based on the unexpected finding
that transcript abundance of certain genes is predictive of the
degree of heterosis in a hybrid. Transcriptome analysis may be
used to identify genes whose transcript abundance in hybrids
correlates with heterosis. The abundance of those gene
transcripts in a new hybrid can then be used to predict the
degree of heterosis of the new hybrid. Moreover, transcriptome
analysis may be used to identify genes whose transcript abundance
in plants or animals correlates with heterosis in hybrids
produced by crossing those plants or animals. Thus,
transcriptome data from parents can be used to predict the
magnitude of heterosis in hybrids which have yet to be produced.
We show herein that changes in transcript abundance in the
transcriptome represent the majority of the basis of heterosis.
Importantly, this means that predictions based on transcript
abundance are close to the observed magnitude of heterosis, i.e.
the invention allows quantitative prediction of the degree of
heterosis in a hybrid. Transcriptome characteristics alone may
thus be used to predict heterosis in hybrids and as a basis for
selection of parents.
Thus, remarkably, we have solved a problem that has been
unanswered for almost a century. By demonstrating that the basis
of heterosis resides primarily at the level of the regulation of
transcript abundance, we have provided a means of predicting
heterosis in hybrids and thus selecting which hybrids to
maintain. Furthermore, we were able to identify characteristics
of parental transcriptomes that could be used successfully as
markers to predict the magnitude of heterosis in untested
hybrids, and we have thus also provided a basis for identifying
parents which can be crossed to produce heterotic hybrids.
This invention differs from previous studies involving
transcriptome analysis of hybrids, since those earlier studies
did not identify any relationship between the transcriptomes of
hybrids and the degree of heterosis observed in those hybrids.
As discussed above, earlier studies showed that transcript levels
of some genes differ in hybrids compared with the parents from
which those hybrids were derived, and differences between hybrid
and parent transcriptome were suggested to contribute to
phenotypic differences including heterosis. However, the
previous investigators did not compare transcriptome remodelling
in a range of non-heterotic hybrids and heterotic hybrids, and
did not show whether transcriptome remodelling correlates with
heterosis.
We have recognised that most differences in the hybrid
transcriptome are due to hybrid formation, not heterosis. We
found that, in fact, transcriptome remodelling involving
transcript abundance fold-changes of 2 or more occurs to a
similar extent in all hybrids relative to their parents,
regardless of the degree of heterosis observed in the hybrids.
Accordingly, the overall degree of transcriptome remodelling in a
hybrid is not an indicator of the degree of heterosis in that
hybrid.
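The kind of tally described above can be sketched as follows.
This is an illustration under stated assumptions, not the
procedure used in the Examples: abundances are assumed to be
strictly positive, the fold-change is taken symmetrically (larger
value divided by smaller), and the gene names, values and function
names are hypothetical.

```python
def fold_change(a, b):
    """Symmetric fold-change between two strictly positive abundances."""
    return max(a, b) / min(a, b)


def remodelling_summary(hybrid, parent_a, parent_b, threshold=2.0):
    """Tally genes by how many parents they differ from by >= threshold-fold."""
    summary = {"both_parents": 0, "one_parent": 0, "neither": 0}
    for gene, h in hybrid.items():
        differs_a = fold_change(h, parent_a[gene]) >= threshold
        differs_b = fold_change(h, parent_b[gene]) >= threshold
        if differs_a and differs_b:
            summary["both_parents"] += 1
        elif differs_a or differs_b:
            summary["one_parent"] += 1
        else:
            summary["neither"] += 1
    return summary


# Hypothetical abundances (arbitrary units) for three made-up genes.
hybrid = {"geneA": 400.0, "geneB": 100.0, "geneC": 90.0}
parent_a = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0}
parent_b = {"geneA": 150.0, "geneB": 250.0, "geneC": 110.0}
summary = remodelling_summary(hybrid, parent_a, parent_b)
```

A tally of this kind measures the overall extent of remodelling
only; as stated above, that extent is not itself an indicator of
the degree of heterosis.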
Therefore, earlier studies involving limited numbers of hybrids
were not able to identify genes whose transcript abundance
correlated with heterosis. The vast majority of differences in
transcript abundance observed in earlier studies would have been
due only to hybrid formation itself, and would not show any
correlation with heterosis. Nor was any such correlation even
looked for in the prior art, since it was not recognised that a
correlation might exist.
However, despite showing that the overall degree of transcriptome
remodelling in a hybrid is not related to heterosis, we found
that transcriptome analysis can nevertheless be used to reveal
features of the hybrid transcriptome that are predictive of the
degree of heterosis in a hybrid. Through transcriptome analysis
of a wide range of hybrids we have unexpectedly shown that
transcript abundance of a proportion of genes correlates with
heterosis. As described herein, we studied 13 different
heterotic hybrids of Arabidopsis thaliana, and identified
features of the hybrid transcriptome that are characteristic of
heterotic interactions. We identified 70 genes whose transcript
abundance in the hybrid transcriptome correlated with the degree
of heterosis in the Arabidopsis hybrids. We then successfully
used the transcript abundance of that defined set of 70 genes to
quantitatively predict the magnitude of heterosis observed in 3
untested hybrid combinations. Transcript abundance of two
additional genes, At1g67500 and At5g45500, was also shown to have
a significant negative correlation with heterosis. Transcript
abundance of each of these genes successfully predicted heterosis
in further hybrids.
Further, we identified a larger set of genes whose transcript
abundance in the transcriptome of Arabidopsis inbred lines
correlated with the degree of heterosis in hybrid progeny
produced by crossing those lines. We successfully used the
transcript abundance of that set of genes to quantitatively
predict the magnitude of heterosis in 3 hybrids produced from
those lines. Transcript abundance of At3g11220 was found to be
negatively correlated with heterosis in a highly significant
manner and transcript abundance of this gene in the parental
transcriptome was found to be predictive of heterosis in hybrid
offspring.
Heterosis in hybrids of Arabidopsis thaliana may be predicted on
the basis of the transcript abundance of these identified
Arabidopsis genes. Moreover, since heterosis is a widely
observed phenomenon, and is not restricted to Arabidopsis or even
to plants, but is also observed in animals, it is to be expected
that many of the same genes whose transcript abundance correlates
with heterosis in Arabidopsis will also correlate with heterosis
in other organisms. Transcript abundance of orthologues of those
genes in other species may thus correlate with heterosis.
However, prediction of heterosis need not be based on genes
selected from the sets of genes disclosed herein, since one
aspect of the invention is use of transcriptome analysis to
identify the particular genes whose transcript abundance
correlates with heterosis in any population of hybrids that is of
interest. Once identified, those genes may then be used for
prediction of heterosis or other trait in the particular hybrids
of interest. Whilst the identified genes may include at least
some genes, or orthologues thereof, from the set of genes
identified in Arabidopsis, they need not do so.
The invention enables hybrids likely to exhibit high levels of
heterosis to be identified and selected, while hybrids likely to
exhibit lower degrees of heterosis may be discarded. Notably,
the invention may be used to predict the level of heterosis in a
hybrid at an early stage in the life of the hybrid, for example
in a seedling, before it would be possible to directly observe
differences between heterotic and non-heterotic hybrids. Thus,
the invention may be used in a hybrid whose degree of heterosis
is not yet determinable from its phenotype. The invention thus
provides significant benefits to a breeder, since it allows a
breeder to determine which particular hybrids in a potentially
vast array of different hybrids should be retained and grown.
For example, a breeder may use transcript abundance data from
seedlings to decide which plant hybrids to grow or test in
yield/performance trials.
Furthermore, we have shown that regulation of transcript
abundance underlies not only heterosis but also other traits.
These may include all genetically complex traits in hybrid,
inbred or recombinant plants and animals, e.g. flowering time or
seed composition in plants. Accordingly, the invention also
relates to determining features of plant or non-human animal
transcriptomes (e.g. transcriptomes of hybrids and/or inbred or
recombinant plants or animals) for prediction of other traits in
the plant or animal or offspring thereof. Where the invention
relates to traits other than heterosis, the plant or animal may
be a hybrid or alternatively it may be inbred or recombinant.
Examples of traits that may be predicted using the invention are
yield, flowering time, seed oil content and seed fatty acid
ratios in plants, especially plant hybrids, e.g. accessions of A.
thaliana. These and other traits may also be predicted in the
plant or non-human animal (e.g. hybrid, inbred or recombinant
plant or animal) before those traits are manifested in the
phenotype. Thus, for example, we demonstrate herein that the
invention allows seed oil content of inbred plants to be
accurately predicted by analysis of plants that have not yet
flowered. The invention thus confers significant advantages in
prediction and in reducing cost and workload, particularly for
traits manifested at a relatively late stage, since it is
not necessary to wait until a plant or animal reaches a
particular (often late) stage of development before being able to
know the magnitude or properties of the trait that will be
exhibited by a given plant or animal.
Other aspects of the invention allow prediction of traits in
plants or animals based on characteristics of their parents, and
thus traits of plants or animals may be predicted and selected
for even before those plants or animals are produced. As noted
above, the trait may be heterosis in a plant or animal hybrid.
Therefore, in accordance with the invention, features of plant or
animal transcriptomes may be identified that allow the degree of
heterosis of plants or animals produced by crossing those plants
or animals to be predicted. The invention can be used to predict
one or more traits, such as the degree of heterosis observed in
plants or animals produced by crossing different combinations of
parental germplasms. This is potentially as valuable as, or even
more valuable than, being able to predict heterosis and other
traits in plants and animals that have already been produced,
since it avoids producing under-performing plants or animals and
therefore allows significant savings in logistics, costs and
time. Particular plants or animals may thus be selected for
breeding, with a greater chance that their progeny will be
heterotic hybrids, or possess other traits, than if the
parents were selected at random. Thus, the methods of the
invention allow prediction in terms of the level of heterosis or
of other traits produced by any particular cross between
different parents, and allow particular parents to be selected
accordingly. For example in agricultural crop plant breeding the
invention reduces the need to make large numbers of different
crosses in order to obtain new heterotic hybrids, since the
invention can be used to identify in advance which particular
crosses will be most productive.
Remarkably, methods of the invention may be used to predict
traits based on transcript abundance in tissues in which the
trait is not exhibited or which have no apparent relevance to the
trait. For example, traits such as flowering time or seed
composition may be predicted in plants based on transcript
abundance data from non-flowering tissue, such as leaf tissue.
Thus, the invention allows generation of statistical correlations
between one or more traits and abundance of one or more gene
transcripts. There is no requirement for the tissue sampled for
transcriptome analysis to be the same as that used for trait
measurement. It may be preferable that the tissue sampled for
transcriptome analysis is, in evolutionary terms, of more ancient
origin - hence the transcriptome in leaves can be used to
predict more recently evolved characteristics of plants, such as
flowering time or seed composition.
Based on the extensive transcriptome remodelling in hybrids of
Arabidopsis thaliana disclosed herein, including some
combinations that are heterotic for vegetative biomass and some
combinations that are non-heterotic, it is evident that the
methods of the invention may be applied to advantage in crops of
economic importance.
Maize is currently bred as a hybrid crop, with its cultivation in
the UK being for silage from the whole plant. Biomass yield is
therefore paramount, and heterosis underpins this yield. In the
USA maize is primarily grown for corn production, for which
kernel weight represents the productive yield, and this yield is
also dependent on heterosis. The ability to efficiently select
for hybrid performance at an early stage of the hybrid parent
breeding process provided by the method of this invention greatly
accelerates the development of hybrid plant lines to increase
yields and introduce a range of "sustainability" traits from
exotic germplasm without loss of yield. Oilseed rape hybrids
hold much potential, but their exploitation is limited as
heterosis is often restricted to vegetative vigour, with little
improvement in seed dry weight yield. The ability to select for
specific performance traits at early stages of growth similarly
accelerates the development of more productive and sustainable
varieties. There is great potential for hybrid breeding of bread
wheat (already a hexaploid, so benefits from some "fixed"
heterosis) which, like oilseed rape, is supported by a breeding
community based in the UK. In addition, hybrid varieties are
important for a large number of vegetable species cultivated in
the UK (such as cabbages, onions, carrots, peppers, tomatoes,
melons), which are grown for enhancement of crop uniformity,
appearance and general quality. Use of the invention to define a
predictive marker for heterosis and other performance traits thus
has the potential to revolutionise both the breeding process and
the performance of crops for the farmer.
As demonstrated in the Examples, we identified relationships
between gene expression in glasshouse-grown seedlings of maize
inbreds and phenotypes (grain yield) in related plants at a later
developmental stage and after growth under different
environmental conditions.
In summary, the invention involves use of transcriptome analysis
of plants or animals, e.g. hybrids and/or inbred or recombinant
plants or animals, for:
(i) identifying genes involved in the manifestation of heterosis
and other traits; and/or
(ii) predicting and producing plants or animals of improved
heterosis and other traits by selecting plants or animals for
breeding, wherein the plants or animals exhibit enhanced
transcriptome characteristics with respect to a selected set of
genes relevant to the transcriptional regulatory networks present
in potential parental breeding partners; and/or
(iii) predicting a range of trait characteristics for plants and
animals based on transcriptome characteristics.
The invention also relates to plant and animal hybrids of
improved heterosis, and to hybrids, inbreds or recombinants with
improved traits as produced or predicted by the methods of the
invention.
The results disclosed herein provide evidence for a link between
heterosis and growth repression that is a consequence of stress
tolerance mechanisms. We identified a number of genes which are
highly predictive of heterosis, and which showed a significant
negative correlation between gene expression and heterotic
performance. As discussed in the Examples herein, these genes
may represent key genetic loci that are downregulated in
heterotic hybrids, leading to decreased expression of stress-
avoidance genes and thus allowing better hybrid performance under
favourable conditions. This raises the possibility that
heterosis, at least for vegetative biomass, is at least partly a
consequence of genetic interactions that lead to a reduction in
repression of growth, rather than direct promotion of growth.
However, whatever the molecular mechanism underlying heterosis,
we have established that certain genes and sets of genes
predictive of heterosis may be identified and successfully used
in accordance with the present invention for predicting
heterosis.
A hybrid is the offspring of two parents of differing genetic
composition. Thus, a hybrid is a cross between two differing
parental germplasms. The parents may be plants or animals. A
hybrid is typically produced by crossing a maternal parent with a
different paternal parent. In plants, the maternal parent is
usually, though not necessarily, impaired in male fertility and
the paternal parent is a male fertile pollen donor. Parents may
for example be inbred or recombinant.
An inbred plant or animal typically lacks heterozygosity. Inbred
plants may be produced by recurrent self-pollination. Inbred
animals may be produced by breeding between animals of closely
related pedigree.
Recombinant plants or animals are neither hybrid nor inbred.
Recombinants are themselves derived by the crossing of
genetically dissimilar progenitors and may contain extensive
heterozygosity and novel combinations of alleles. Most samples
in germplasm collections of plant breeding programmes are
recombinant.
The invention may be used with plants or animals. In some
embodiments the invention preferably relates to plants. For
example, the plants may be crop plants. The crop plants may be
cotton, sugar beet, cereal plants (e.g. maize, wheat, barley,
rice), oil-seed crops (e.g. soybeans, oilseed rape, sunflowers),
fruit or vegetable crop plants (e.g. cabbages, onions, carrots,
peppers, tomatoes, melons, legumes, leeks, brassicas e.g.
broccoli) or salad crop plants e.g. lettuce [35]. The invention
may be applied to hardwood timber trees or alder trees [36]. All
species grown as crops could benefit from the invention,
irrespective of whether they are currently cultivated extensively
as hybrids.
Other embodiments relate to non-human animals e.g. mammals, birds
and fish, including farm animals for example cattle, pigs, sheep,
birds or poultry (e.g. chickens), goats, and farmed fish e.g.
salmon, and other animals such as sports animals e.g. racehorses,
racing pigeons, greyhounds or camels. Heterosis has been
described in a variety of different animals including for example
pigs [37], sheep [38, 39], goats [39], alpaca [39], Japanese
quail [40] and salmon [41], and the invention may be applied to
these and to other animals.
The invention can most conveniently be used in relation to
organisms for which the genome sequence or extensive collections
of Expressed Sequence Tags are available and in which microarrays
are preferably also available and/or resources for transcriptome
analysis have been developed.
In one aspect, the invention is a method comprising:
analysing the transcriptomes of plants or animals in a
population of plants or animals;
measuring a trait of the plants or animals in the
population; and
identifying a correlation between transcript abundance of
one or more, preferably a set of, genes in the plant or animal
transcriptomes and the trait in the plants or animals.
Thus the invention provides a method of identifying an indicator
of a trait in a plant or animal.
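The three steps of this method can be sketched as follows. The
sketch assumes that transcript abundance and the trait have
already been measured for each individual in the population, and
it uses a Pearson correlation with a simple cut-off on its
absolute value as one possible way of identifying a correlation.
The cut-off of 0.8, the gene names, the data and the function
names are all hypothetical. A real analysis covering many genes
would normally test statistical significance and allow for
multiple testing rather than rely on a fixed cut-off.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant abundance or trait: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlated_genes(abundance, trait, min_abs_r=0.8):
    """Return {gene: r} for genes whose |r| with the trait is >= min_abs_r."""
    selected = {}
    for gene, values in abundance.items():
        r = pearson_r(values, trait)
        if abs(r) >= min_abs_r:
            selected[gene] = r
    return selected


# Hypothetical data: five individuals, three made-up genes.
trait = [10.0, 20.0, 30.0, 40.0, 50.0]
abundance = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the trait
    "geneB": [9.0, 7.0, 5.0, 3.0, 1.0],  # falls as the trait rises
    "geneC": [3.0, 1.0, 4.0, 1.0, 3.0],  # no clear relationship
}
selected = correlated_genes(abundance, trait)
```

Both positively and negatively correlated genes are retained,
since either kind may serve as an indicator of the trait.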
The population may comprise e.g. at least 5, 10, 20, 30, 40, 50
or 100 plants or animals. Use of a large population to obtain
trait measurements from many different plants or animals may
allow increased accuracy of trait predictions based on
correlations identified using the population.
The invention may thus be used to generate a model (e.g. a
regression, as described in detail elsewhere herein) for
predicting the trait based on transcript abundance of the one or
more genes e.g. a set of genes.
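One simple form such a model could take is sketched below. It
assumes an ordinary least-squares line fitted separately for each
gene, with the per-gene predictions averaged to give a single
predicted trait value for a new individual. The averaging step,
the gene names, the data and the function names are assumptions
made for illustration and are not a statement of the regression
described elsewhere herein.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept.

    Assumes xs are not all identical (otherwise the slope is undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_models(abundance, trait):
    """Fit one line per gene; returns {gene: (slope, intercept)}."""
    return {gene: fit_line(values, trait) for gene, values in abundance.items()}


def predict_trait(models, new_abundance):
    """Average the per-gene predictions for one new individual."""
    predictions = [
        slope * new_abundance[gene] + intercept
        for gene, (slope, intercept) in models.items()
    ]
    return sum(predictions) / len(predictions)


# Hypothetical training data: four individuals, two made-up genes.
trait = [10.0, 20.0, 30.0, 40.0]
abundance = {
    "geneA": [1.0, 2.0, 3.0, 4.0],  # trait = 10 * abundance
    "geneB": [8.0, 6.0, 4.0, 2.0],  # trait = 50 - 5 * abundance
}
models = fit_models(abundance, trait)
predicted = predict_trait(models, {"geneA": 2.5, "geneB": 5.0})
```

The same fitted models can be applied to any further individual
for which transcript abundance of the same genes is available.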
One or more traits may be determined or measured, and thus
correlations may be identified, and models may be generated, for
a plurality of traits.
The plant or animal may be a hybrid, or it may be inbred or
recombinant. In a preferred embodiment the plant or animal is a
hybrid. A preferred trait is heterosis.
Plants or animals in a population may or may not be related to
one another. The population may comprise plants or animals, e.g.
hybrids, having different maternal and/or paternal parents. In
some embodiments, all plants or animals, e.g. hybrids, in the
population have the same maternal parent, but may have different
paternal parents. In other embodiments, all plants or animals,
e.g. hybrids, in the population have the same paternal parent,
but may have different maternal parents. Parents may be inbred
or recombinant, as explained elsewhere herein.
Methods for determining heterosis, for transcriptome analysis and
for identifying statistical correlations are described in detail
elsewhere herein.
Determining or measuring heterosis or other trait can be
performed once the relevant phenotype is apparent e.g. once the
heterosis can be calculated, or once the trait can be measured.
Transcriptome analysis may be performed at a time when the degree
of heterosis or other trait of the plant or animal can be
determined. Transcriptome analysis may be performed after,
normally directly after, measurements are taken for determining
or measuring heterosis or other trait in the plant or animal.
This is suitable e.g. when measurements are taken for determining
heterosis for fresh weight in hybrids.
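As a minimal sketch, Mid-Parent Heterosis and Best-Parent Heterosis for a measured characteristic such as fresh weight might be calculated as below. Expressing the result as a percentage of the parental value is an assumption made here for illustration, and the function names and example weights are hypothetical.

```python
def mid_parent_heterosis(hybrid, parent_a, parent_b):
    """Percentage by which the hybrid exceeds the mean of its two parents."""
    mid_parent = (parent_a + parent_b) / 2
    return 100 * (hybrid - mid_parent) / mid_parent

def best_parent_heterosis(hybrid, parent_a, parent_b, higher_is_better=True):
    """Percentage by which the hybrid exceeds the 'better' of its two parents.

    higher_is_better selects which parent counts as 'better' for the trait.
    """
    best = max(parent_a, parent_b) if higher_is_better else min(parent_a, parent_b)
    return 100 * (hybrid - best) / best

# Hypothetical fresh weights (arbitrary units) for a hybrid and its parents.
mph = mid_parent_heterosis(15.0, 10.0, 12.0)
bph = best_parent_heterosis(15.0, 10.0, 12.0)
```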
However, we have demonstrated herein that it is possible to use
transcriptome analysis of plants at a relatively early
developmental stage, e.g. before flowering, to identify genes
whose transcript abundance correlates with traits that only occur
later in development, e.g. traits such as the time of flowering
and aspects of the composition of seeds produced by plants.
Accordingly, transcriptome analysis may be performed when the
degree of heterosis or other trait is not yet determinable from
the phenotype. This is suitable e.g. when measuring aspects of
performance other than fresh weight, such as yield, for
determining heterosis. For example, transcriptome analysis may
be performed when plants are in vegetative phase or when animals
are pre-adolescent, in order to predict heterosis for
characteristics that are evident later in development, or to
predict other traits that are evident later in development. For
example, heterosis for seed or crop yields, or traits such as
flowering time, seed or crop yields or seed composition, may be
predicted using transcriptome data from vegetative phase plants.
Correlations between traits and transcript abundance represent
models that may be used to predict traits in further plants or
animals by determining transcript abundance in those plants or
animals.
Thus, in another aspect, the invention is a method comprising:
determining transcript abundance of one or more, preferably
a set of, genes in a plant or animal, wherein the transcript
abundance of the one or more genes, or set of genes, in the
transcriptome of the plant or animal correlates with a trait in
the plant or animal; and
thereby predicting the trait in the plant or animal.
The analysis of transcript abundance is predictive of the trait
in a plant or animal of the same genotype as the plant or animal
in which transcript abundance was determined. Thus, in some
embodiments the method may be used for the purpose of predicting
a trait in the actual plant or animal whose transcript abundance
is determined, and in other embodiments the method may be used
for the purpose of predicting a trait in another plant or animal
that is genetically identical to the plant or animal whose
transcript abundance was sampled. For example the method may be
used for predicting a trait in a genetically identical plant or
animal that may be grown or produced subsequently, and indeed the
decision whether to grow or produce the plant or animal may be
informed by the trait prediction.
Methods of the invention may comprise determining transcript
abundance of one or more genes, preferably a set of genes, in a
plurality of plants or animals, and thus predicting one or more
traits in the plurality of plants or animals. Thus, the
invention may be used to predict a rank order for the trait in
those plants or animals, which allows selection of plants or
animals that are predicted to exhibit the highest or lowest trait
(e.g. longest or shortest time to flowering, highest seed oil
content, highest heterosis).
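For example, once a trait value has been predicted for each plant or animal, a rank order and a selection might be obtained as sketched below. The line names and predicted values are hypothetical.

```python
def rank_by_predicted_trait(predictions, highest_first=True):
    """Return identifiers ordered by predicted trait value."""
    return sorted(predictions, key=predictions.get, reverse=highest_first)

def select_top(predictions, n, highest_first=True):
    """Select the n individuals predicted to show the highest (or lowest) trait."""
    return rank_by_predicted_trait(predictions, highest_first)[:n]

# Hypothetical predicted seed oil content for three lines.
predicted = {"line_1": 0.42, "line_2": 0.57, "line_3": 0.31}
ranking = rank_by_predicted_trait(predicted)
```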
The plant or animal may be a hybrid, or it may be inbred or
recombinant. In a preferred embodiment the plant or animal is a
hybrid. A preferred trait is heterosis, and thus the method may
be for predicting the magnitude of heterosis in a hybrid.
A method of the invention may comprise:
determining transcript abundance of one or more, preferably
a set of, genes in a plant or animal, e.g. a hybrid, wherein
transcript abundance of the one or more genes, or set of genes,
correlates with a trait in a population of plants or animals,
e.g. a population of hybrids; and
thereby predicting the trait in the plant or animal.
Plants or animals in the population may or may not be related to
one another. The population typically comprises plants or
animals, e.g. hybrids, having different maternal and/or paternal
parents. In some embodiments, all plants or animals in the
population have the same maternal parent, but may have different
paternal parents. In other embodiments, all plants or animals in
the population have the same paternal parent, but may have
different maternal parents. Where plants or animals in the
population share a common maternal parent or a common paternal
parent, the plant or animal in which the trait is predicted may
share the same common maternal or paternal parent, respectively.
The method may comprise, as an earlier step, a method of
identifying an indicator of the trait in a plant or animal, as
described above.
The plant or animal in which the indicator of the trait is
identified may be the same genus and/or species as the plant or
animal in which transcript abundance is determined for prediction
of the trait. However, as discussed elsewhere herein,
predictions of traits in one species may be performed based on
correlations between transcript abundance and trait data obtained
in another genus and/or species.
Thus, the invention may be used to predict one or more traits in
a plant or animal, typically a previously untested plant or
animal. As noted above, the method is useful for predicting
heterosis or other trait in a plant or animal when heterosis or
other trait is not yet determinable from the phenotype of the
organism at the time, age or developmental stage at which the
transcriptome is sampled. In a preferred embodiment the method
comprises analysing the transcriptome of a plant prior to
flowering.
Suitable methods of determining transcript abundance and of
predicting heterosis or other traits based on transcript
abundance are described in more detail elsewhere herein.
Once genes whose levels of transcript abundance are involved in
heterosis or other traits have been identified for a given plant
or animal species, further aspects of the invention may involve
regulation of transcript abundance, regulation of expression of
one or more of those genes, or regulation of one or more proteins
encoded by those genes, in order to regulate, influence, increase
or decrease heterosis or another trait in a plant or animal
organism.
Thus, the invention may involve increasing or decreasing
heterosis or other trait in an organism, by upregulating one or
more genes or their encoded proteins, wherein transcript
abundance of the one or more genes correlates positively with
heterosis or other trait in the organism, or by downregulating
one or more genes or their encoded proteins in an organism,
wherein transcript abundance of the one or more genes correlates
negatively with heterosis or other trait in the organism. Thus,
heterosis and other desirable traits in the organism may be
increased using the invention. The invention also extends to
plants and animals in which traits are up- or down-regulated
using methods of the invention. The invention may comprise down-
regulating one or more genes involved in stress avoidance or
stress tolerance, wherein transcript abundance of the one or more
genes is negatively correlated with heterosis, e.g. heterosis for
biomass.
Examples of genes whose transcript abundance correlates
positively with heterosis, and examples of genes whose transcript
abundance correlates negatively with heterosis, are shown in
Table 1 and Table 19. Additionally, transcript abundance of
genes At1g67500 and At5g45500 correlates negatively with
heterosis. In a preferred embodiment the one or more genes are
selected from At1g67500 and At5g45500 and/or those shown in Table
1 and/or Table 19, or are orthologues of At1g67500 and/or
At5g45500 and/or of one or more genes shown in Table 1 and/or
Table 19.
The invention may involve increasing or decreasing a trait in an
organism, by upregulating one or more genes whose transcript
abundance correlates negatively with the trait in the organism,
or by downregulating one or more genes whose transcript abundance
correlates positively with the trait in the organism. Thus,
undesirable traits in organisms may be decreased using the
invention.
Examples of genes whose transcript abundance correlates with
particular traits are shown in Tables 3 to 17, Table 20 and Table
22. Preferred embodiments of the invention relate to one or more
of those traits, and preferably to one or more of the listed
genes for which transcript abundance is shown to correlate with
those traits, as discussed elsewhere herein. Thus, the one or
more genes may be selected from the genes shown in the relevant
tables, or may be orthologues of those genes. For example,
flowering time (e.g. as represented by leaf number at bolting)
may be delayed (time to flowering increased, e.g. leaf number at
bolting increased) by upregulating expression of one or more
genes in Table 3A or Table 4A. Flowering time may be accelerated
(time to flowering decreased, e.g. leaf number at bolting
decreased) by downregulating expression of one or more genes in
Table 3B or Table 4B.
A trait may be increased by upregulating a gene for which
transcript abundance correlates positively with the trait or by
downregulating a gene for which transcript abundance correlates
negatively with the trait. A trait may be decreased by
downregulating a gene for which transcript abundance correlates
positively with the trait or by upregulating a gene for which
transcript abundance correlates negatively with the trait.
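The relationship between the desired change in a trait, the sign of the correlation and the direction of regulation can be summarised in a short sketch. It assumes the reading that a trait is decreased either by downregulating a positively correlated gene or by upregulating a negatively correlated gene; the function name and string labels are hypothetical.

```python
def regulation_for(goal, correlation):
    """Choose the direction of gene regulation for a desired change in a trait.

    goal:        "increase" or "decrease" (desired change in the trait)
    correlation: signed correlation between transcript abundance and the trait
    """
    if goal not in ("increase", "decrease"):
        raise ValueError("goal must be 'increase' or 'decrease'")
    if correlation == 0:
        raise ValueError("gene is not correlated with the trait")
    positive = correlation > 0
    if goal == "increase":
        # Raise the trait: push abundance in the direction of the correlation.
        return "upregulate" if positive else "downregulate"
    # Lower the trait: push abundance against the direction of the correlation.
    return "downregulate" if positive else "upregulate"
```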
Upregulation of a gene involves increasing its level of
transcription or expression, and thus increasing the transcript
abundance of that gene. Upregulation of a gene may comprise
expressing the gene from a strong and/or constitutive promoter
such as 35S CaMV promoter. Upregulation may comprise increasing
expression of an endogenous gene. Alternatively, upregulation
may comprise expressing a heterologous gene in a plant or animal,
e.g. from a strong and/or constitutive promoter. Heterologous
genes may be introduced into plant or animal cells by any
suitable method, and methods of transformation are well known in
the art. A plant or animal cell may for example be transformed
or transfected with an expression vector comprising the gene
operably linked to a promoter e.g. a strong and/or constitutive
promoter, for expression in the cell. The vector may integrate
into the cell genome, or may remain extra-chromosomal.
By "promoter" is meant a sequence of nucleotides from which
transcription may be initiated of DNA operably linked downstream
(i.e. in the 3' direction on the sense strand of double-stranded
DNA).
"Operably linked" means joined as part of the same nucleic acid
molecule, suitably positioned and oriented for transcription to
be initiated from the promoter. DNA operably linked to a
promoter is under transcriptional initiation regulation of the
promoter.
Downregulation of a gene involves decreasing its level of
transcription or expression, and thus decreasing the transcript
abundance of that gene. Downregulation may be achieved for
example by antisense or RNAi, using RNA complementary to
messenger RNA (mRNA) transcribed from the gene.
Anti-sense oligonucleotides may be designed to hybridise to the
complementary sequence of nucleic acid, pre-mRNA or mature mRNA,
interfering with the production of polypeptide encoded by a given
DNA sequence (e.g. either native polypeptide or a mutant form
thereof), so that its expression is reduced or prevented
altogether. Anti-sense techniques may be used to target a coding
sequence, or a control sequence of a gene, e.g. in the 5' flanking
sequence, whereby the antisense oligonucleotides can interfere
with control sequences. Anti-sense oligonucleotides may be DNA
or RNA and may be of around 14-23 nucleotides, particularly
around 15-18 nucleotides, in length. The construction of
antisense sequences and their use is described in refs. [42] and
[43].
Small RNA molecules may be employed to regulate gene expression.
These include targeted degradation of mRNAs by small interfering
RNAs (siRNAs), post transcriptional gene silencing (PTGS),
developmentally regulated sequence-specific translational
repression of mRNA by micro-RNAs (miRNAs) and targeted
transcriptional gene silencing.
A role for the RNAi machinery and small RNAs in targeting of
heterochromatin complexes and epigenetic gene silencing at
specific chromosomal loci has also been demonstrated. Double-
stranded RNA (dsRNA)-dependent post transcriptional silencing,
also known as RNA interference (RNAi), is a phenomenon in which
dsRNA complexes can target specific genes of homology for
silencing in a short period of time. It acts as a signal to
promote degradation of mRNA with sequence identity. A 20-nt
siRNA is generally long enough to induce gene-specific silencing,
but short enough to evade host response. The decrease in
expression of targeted gene products can be extensive with 90%
silencing induced by a few molecules of siRNA.
In the art, these RNA sequences are termed "short or small
interfering RNAs" (siRNAs) or "microRNAs" (miRNAs) depending on
their origin. Both types of sequence may be used to down-
regulate gene expression by binding to complementary RNAs and
either triggering mRNA elimination (RNAi) or arresting mRNA
translation into protein. siRNA are derived by processing of
long double stranded RNAs and when found in nature are typically
of exogenous origin. Micro-interfering RNAs (miRNA) are
endogenously encoded small non-coding RNAs, derived by processing
of short hairpins. Both siRNA and miRNA can inhibit the
translation of mRNAs bearing partially complementary target
sequences without RNA cleavage and degrade mRNAs bearing fully
complementary sequences.
The siRNA ligands are typically double stranded and, in order to
optimise the effectiveness of RNA mediated down-regulation of the
function of a target gene, it is preferred that the length of the
siRNA molecule is chosen to ensure correct recognition of the
siRNA by the RISC complex that mediates the recognition by the
siRNA of the mRNA target and so that the siRNA is short enough to
reduce a host response.
miRNA ligands are typically single stranded and have regions that
are partially complementary enabling the ligands to form a
hairpin. miRNAs are RNA genes which are transcribed from DNA,
but are not translated into protein. A DNA sequence that codes
for a miRNA gene is longer than the miRNA. This DNA sequence
includes the miRNA sequence and an approximate reverse
complement. When this DNA sequence is transcribed into a single-
stranded RNA molecule, the miRNA sequence and its reverse-
complement base pair to form a partially double stranded RNA
segment. The design of microRNA sequences is discussed in ref.
[44].
Typically, the RNA ligands intended to mimic the effects of siRNA
or miRNA have between 10 and 40 ribonucleotides (or synthetic
analogues thereof), more preferably between 17 and 30
ribonucleotides, more preferably between 19 and 25
ribonucleotides and most preferably between 21 and 23
ribonucleotides. In some embodiments of the invention employing
double-stranded siRNA, the molecule may have symmetric 3'
overhangs, e.g. of one or two (ribo)nucleotides, typically a UU
or dTdT 3' overhang. Based on the disclosure provided herein,
the skilled person can readily design suitable siRNA and miRNA
sequences, for example using resources such as Ambion's siRNA
finder, see http://www.ambion.com/techlib/misc/siRNA_finder.html.
siRNA and miRNA sequences can be synthetically produced and added
exogenously to cause gene downregulation or produced using
expression systems (e.g. vectors). In a preferred embodiment the
siRNA is produced synthetically.
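A minimal sketch of deriving the two strands of a double-stranded siRNA with symmetric 3' overhangs from a chosen target site is given below. The 19 to 25 nucleotide bounds on the duplex region, the function names and the example target sequence are assumptions made for illustration; this is not the algorithm of any particular design tool and it performs no specificity or efficacy screening.

```python
_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement_rna(seq):
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def design_sirna(target_site, overhang="UU"):
    """Return (sense, antisense) strands, each written 5' to 3'.

    target_site: stretch of the target mRNA (DNA letters are converted to RNA)
    overhang:    3' overhang appended to both strands, e.g. "UU" or "dTdT"
    """
    site = target_site.upper().replace("T", "U")
    if set(site) - set("ACGU"):
        raise ValueError("target site must contain only A, C, G, U/T")
    if not 19 <= len(site) <= 25:
        raise ValueError("duplex region is assumed to be 19 to 25 nt long")
    sense = site + overhang                              # same sequence as the mRNA
    antisense = reverse_complement_rna(site) + overhang  # pairs with the mRNA
    return sense, antisense

# Hypothetical 19-nt target site.
sense, antisense = design_sirna("GCAUGCAUGCAUGCAUGCA")
```

With a 19-nucleotide duplex region and a two-nucleotide overhang, each strand is 21 nucleotides long, which falls within the most preferred range given above.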
Longer double stranded RNAs may be processed in the cell to
produce siRNAs (see for example ref. [45]). The longer dsRNA
molecule may have symmetric 3' or 5' overhangs, e.g. of one or
two (ribo)nucleotides, or may have blunt ends. The longer dsRNA
molecules may be 25 nucleotides or longer. Preferably, the longer
dsRNA molecules are between 25 and 30 nucleotides long. More
preferably, the longer dsRNA molecules are between 25 and 27
nucleotides long. Most preferably, the longer dsRNA molecules are
27 nucleotides in length. dsRNAs 30 nucleotides or more in length
may be expressed using the vector pDECAP [46].
Another alternative is the expression of a short hairpin RNA
molecule (shRNA) in the cell. shRNAs are more stable than
synthetic siRNAs. A shRNA consists of short inverted repeats
separated by a small loop sequence. One inverted repeat is
complementary to the gene target. In the cell the shRNA is
processed by DICER into a siRNA which degrades the target gene
mRNA and suppresses expression. In a preferred embodiment the
shRNA is produced endogenously (within a cell) by transcription
from a vector. shRNAs may be produced within a cell by
transfecting the cell with a vector encoding the shRNA sequence
under control of a RNA polymerase III promoter such as the human
H1 or 7SK promoter or a RNA polymerase II promoter.
Alternatively, the shRNA may be synthesised exogenously (in
vitro) by transcription from a vector. The shRNA may then be
introduced directly into the cell. Preferably, the shRNA molecule
comprises a partial sequence of the gene to be downregulated.
Preferably, the shRNA sequence is between 40 and 100 bases in
length, more preferably between 40 and 70 bases in length. The
stem of the hairpin is preferably between 19 and 30 base pairs in
length. The stem may contain G-U pairings to stabilise the
hairpin structure.
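The arrangement of inverted repeats around a loop, together with the preferred length ranges stated above, might be checked with a sketch such as the following. The loop sequence shown is only an illustrative choice, and the function names and example target site are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement_rna(seq):
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_shrna(target_site, loop="UUCAAGAGA"):
    """Assemble a hairpin: sense stem, loop, then the antisense stem.

    target_site: partial sequence of the gene to be downregulated
    loop:        small loop sequence (illustrative default, an assumption)
    """
    sense = target_site.upper().replace("T", "U")
    if set(sense) - set("ACGU"):
        raise ValueError("target site must contain only A, C, G, U/T")
    if not 19 <= len(sense) <= 30:
        raise ValueError("stem is preferably 19 to 30 base pairs long")
    antisense = reverse_complement_rna(sense)  # repeat complementary to the target
    hairpin = sense + loop + antisense
    if not 40 <= len(hairpin) <= 100:
        raise ValueError("shRNA is preferably 40 to 100 bases long")
    return hairpin

# Hypothetical 19-nt target site.
hairpin = build_shrna("GCAUGCAUGCAUGCAUGCA")
```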
siRNA molecules, longer dsRNA molecules or miRNA molecules may be
made recombinantly by transcription of a nucleic acid sequence,
preferably contained within a vector. Preferably, the siRNA
molecule, longer dsRNA molecule or miRNA molecule comprises a
partial sequence of the gene to be downregulated.
In one embodiment, the siRNA, longer dsRNA or miRNA is produced
endogenously (within a cell) by transcription from a vector. The
vector may be introduced into the cell in any of the ways known
in the art. Optionally, expression of the RNA sequence can be
regulated using a tissue specific promoter. In a further
embodiment, the siRNA, longer dsRNA or miRNA is produced
exogenously (in vitro) by transcription from a vector.
In one embodiment, the vector may comprise a nucleic acid
sequence according to the invention in both the sense and
antisense orientation, such that when expressed as RNA the sense
and antisense sections will associate to form a double stranded
RNA. In another embodiment, the sense and antisense sequences
are provided on different vectors.
Alternatively, siRNA molecules may be synthesized using standard
solid or solution phase synthesis techniques which are known in
the art. Linkages between nucleotides may be phosphodiester bonds
or alternatives, for example, linking groups of the formula
P(O)S (thioate); P(S)S (dithioate); P(O)NR'2; P(O)R'; P(O)OR6;
CO; or CONR'2, wherein R is H (or a salt) or alkyl (1-12C) and R6
is alkyl (1-9C), joined to adjacent nucleotides through -O- or
-S-.
Modified nucleotide bases can be used in addition to the
naturally occurring bases, and may confer advantageous properties
on siRNA molecules containing them.
For example, modified bases may increase the stability of the
siRNA molecule, thereby reducing the amount required for
silencing. The provision of modified bases may also provide siRNA
molecules which are more, or less, stable than unmodified siRNA.
The term 'modified nucleotide base' encompasses nucleotides with
a covalently modified base and/or sugar. For example, modified
nucleotides include nucleotides having sugars which are
covalently attached to low molecular weight organic groups other
than a hydroxyl group at the 3' position and other than a
phosphate group at the 5' position. Thus modified nucleotides may
also include 2'-substituted sugars such as 2'-O-methyl-; 2'-O-
alkyl; 2'-O-allyl; 2'-S-alkyl; 2'-S-allyl; 2'-fluoro-; 2'-halo
or 2'-azido-ribose; carbocyclic sugar analogues; α-anomeric
sugars; epimeric sugars such as arabinose, xyloses or lyxoses;
pyranose sugars; furanose sugars; and sedoheptulose.
Modified nucleotides are known in the art and include alkylated
purines and pyrimidines, acylated purines and pyrimidines, and
other heterocycles. These classes of pyrimidines and purines are
known in the art and include pseudoisocytosine, N4,N4-
ethanocytosine, 8-hydroxy-N6-methyladenine, 4-acetylcytosine, 5-
(carboxyhydroxylmethyl) uracil, 5-fluorouracil, 5-bromouracil, 5-
carboxymethylaminomethyl-2-thiouracil, 5-carboxymethylaminomethyl
uracil, dihydrouracil, inosine, N6-isopentyl-adenine, 1-
methyladenine, 1-methylpseudouracil, 1-methylguanine, 2,2-
dimethylguanine, 2-methyladenine, 2-methylguanine, 3-
methylcytosine, 5-methylcytosine, N6-methyladenine, 7-
methylguanine, 5-methylaminomethyl uracil, 5-methoxy amino
methyl-2-thiouracil, β-D-mannosylqueosine, 5-
methoxycarbonylmethyluracil, 5-methoxyuracil, 2-methylthio-N6-
isopentenyladenine, uracil-5-oxyacetic acid methyl ester,
pseudouracil, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil,
4-thiouracil, 5-methyluracil, N-uracil-5-oxyacetic acid
methylester, uracil 5-oxyacetic acid, queosine, 2-thiocytosine,
5-propyluracil, 5-propylcytosine, 5-ethyluracil, 5-ethylcytosine,
5-butyluracil, 5-pentyluracil, 5-pentylcytosine, and
2,6-diaminopurine, methylpseudouracil, 1-methylguanine, 1-
methylcytosine.
Methods relating to the use of RNAi to silence genes in C.
elegans, Drosophila, plants, and mammals are known in the art
[47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59].
Other approaches to specific down-regulation of genes are well
known, including the use of ribozymes designed to cleave specific
nucleic acid sequences. Ribozymes are nucleic acid molecules,
actually RNA, which specifically cleave single-stranded RNA, such
as mRNA, at defined sequences, and their specificity can be
engineered. Hammerhead ribozymes may be preferred because they
recognise base sequences of about 11-18 bases in length, and so
have greater specificity than ribozymes of the Tetrahymena type
which recognise sequences of about 4 bases in length, though the
latter type of ribozymes are useful in certain circumstances.
References on the use of ribozymes include refs. [60] and [61].
The plant or animal in which the gene is upregulated or
downregulated may be hybrid, recombinant or inbred. Thus, in
some embodiments the invention may involve over-expressing genes
correlated with one or more traits, in order to improve vigour or
other characteristics of the transformed derivatives of inbred
plants and animals.
In a further aspect, the invention is a method comprising:
analysing transcriptomes of parental plants or animals in a
population of parental plants or animals;
measuring heterosis or other trait in a population of
hybrids, wherein each hybrid in the population is a cross between
a first plant or animal and a plant or animal selected from the
population of parental plants or animals;
and
identifying a correlation between transcript abundance of
one or more genes, preferably a set of genes, in the population
of parental plants or animals and heterosis or other trait in the
population of hybrids.
Thus, the invention provides a method of identifying an indicator
of heterosis or other trait in a hybrid.
The plants or animals in the population whose transcriptomes are
analysed are thus parents of the hybrids. These parents may be
inbred or recombinant.
All hybrids in the population of hybrids used for developing each
predictive model are the result of crossing one common parent
with an array of different parents. Normally, all hybrids in the
population share one common parent, which may be either the
maternal parent or the paternal parent. Thus, the paternal
parent of all the hybrids in the population may be the "first
parent plant or animal", or the maternal parent of all the
hybrids in the population may be the "first parent plant or
animal". For plants, a first female parent is normally crossed
to a population of different male parents. For animals, a first
male parent may preferably be crossed with a population of
different females.
Suitable methods of determining or measuring heterosis in
hybrids, of transcriptome analysis and of identifying
correlations are discussed elsewhere herein.
Correlations between traits and transcript abundance represent
models that may be used to predict traits in further plants or
animals by determining transcript abundance in those plants or
animals. The invention may thus be used to generate a model
(e.g. a regression, as described in detail elsewhere herein) for
predicting the trait based on transcript abundance of the one or
more genes e.g. a set of genes.
Accordingly, in another aspect, the invention is a method of
predicting heterosis or other trait in a hybrid, wherein the
hybrid is a cross between a first plant or animal and a second
plant or animal; comprising
determining the transcript abundance of one or more
genes, preferably a set of genes, in the second plant or animal,
wherein the transcript abundance of those one or more genes, or
of the set of genes, in a population of parental plants or
animals correlates with heterosis or other trait in a population
of hybrids produced by crossing the first plant or animal with a
plant or animal from the population of parental plants and
animals; and
thereby predicting heterosis or other trait in the hybrid.
The invention may be used to predict one or more traits in hybrid
offspring of parental plants or animals, based on transcript
abundance in one of the parents. The parental plants or animals
may be inbred or recombinant. Plants or animals may be referred
to as "parents" or "parental plants or animals" even where they
have not yet been crossed to produce a hybrid, since the
invention may be used to predict traits in hybrids before those
hybrids are produced. This is a particular advantage of the
invention, in that methods of the invention may be used to
predict heterosis or other trait in a potential hybrid, without
needing to produce that hybrid in order to determine its
heterosis or traits.
A plurality of plants or animals may be tested by determining
transcript abundance using the method of the invention, each
plant or animal representing the second parent for crossing to
produce a hybrid, in order to identify a suitable plant or animal
to use for breeding to produce a hybrid with a desired trait. A
parent may then be selected for breeding based on the predicted
trait for a hybrid produced by crossing that parent. Thus, in
one example a germplasm collection, which may comprise a
population of recombinants, may be screened for plants that may
be suitable for inclusion in breeding programmes.
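Screening of candidate second parents against an existing model might, purely by way of illustration, proceed as sketched below. The model is assumed to be a linear combination of parental transcript abundances that has already been fitted for hybrids sharing the same first parent; the coefficients, gene names and line names are hypothetical.

```python
def predict_hybrid_trait(model, parent_abundance):
    """Predicted trait of the hybrid from transcript abundance in one parent.

    model: {"intercept": float, "coefficients": {gene: float}}, assumed to have
           been fitted on hybrids sharing the same first (common) parent
    """
    return model["intercept"] + sum(
        coef * parent_abundance[gene]
        for gene, coef in model["coefficients"].items()
    )

def screen_parents(model, candidates):
    """Predict the hybrid trait for every candidate parent and pick the best."""
    predictions = {name: predict_hybrid_trait(model, abundance)
                   for name, abundance in candidates.items()}
    best = max(predictions, key=predictions.get)
    return best, predictions

# Hypothetical model and germplasm collection.
model = {"intercept": 5.0, "coefficients": {"gene_x": 3.0, "gene_y": -1.0}}
candidates = {
    "line_A": {"gene_x": 2.0, "gene_y": 1.0},
    "line_B": {"gene_x": 1.0, "gene_y": 3.0},
    "line_C": {"gene_x": 4.0, "gene_y": 2.0},
}
best, predictions = screen_parents(model, candidates)
```

No hybrid needs to be produced at this stage: only the candidate parents are sampled, and the selected line can then be crossed with the first parent.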
Following prediction of the trait in the hybrid, the inbred or
recombinant plant or animal may be selected for breeding to
produce a hybrid, e.g. as discussed further below.
Alternatively, if the hybrid for which the trait is predicted has
already been produced, that hybrid may be selected e.g. for
further cultivation.
The method of predicting the trait may comprise, as an earlier
step, a method of identifying an indicator of the trait in a
hybrid, as described above.
When the method is used for predicting heterosis in hybrids based
upon parental transcriptome data, for example data from inbred
plants or animals, the one or more genes may comprise At3gll2200
and/or one or more of the genes shown in Table 2, or one or more
orthologues thereof.
When the method is used for predicting yield, e.g. grain yield,
in hybrids based on parental transcriptome data, for example data
from inbred plants or animals, e.g. maize, the one or more genes
may comprise one or more of the genes shown in Table 22, or one
or more orthologues thereof. For example, transcript abundance
of one or more genes, e.g. a set of genes, from Table 22 may be
determined in a maize plant and used for predicting yield in a
hybrid cross between that maize line and B73.
Genes with transcript abundance correlating with other traits are
shown in Tables 3 to 17 and Table 20, and transcript abundance of
one or more of those genes in parental plants or animals may be
used to predict those traits in hybrid offspring of those plants
or animals, in accordance with this aspect of the
invention. Alternatively, the invention may be used to identify
other genes with transcript abundance in parental plants or
animals correlating with those traits in their hybrid offspring.
By predicting heterosis and other traits in hybrids produced by
crossing parental germplasm, whether they be inbred or
recombinant, the invention allows selection of inbred or
recombinant plants and animals that can be crossed to produce
hybrids with high or improved levels of heterosis and desirable
or improved levels of other traits.
Inbred or recombinant plants and animals may thus be selected on
the basis of heterosis or other trait predicted in hybrids
produced by crossing those plants and animals.
Accordingly, one aspect of the invention is a method comprising:
determining transcript abundance of one or more genes,
preferably a set of genes, in parental plants or animals, wherein
the transcript abundance of the one or more genes in a population
of parental plants or animals correlates with heterosis or other
trait in hybrid crosses between a first parental plant or animal
and plants or animals from the population of parental plants or
animals;
selecting one of the parental plants or animals; and
producing a hybrid by crossing the selected plant or animal
and a different plant or animal, e.g. by crossing the selected
plant or animal and the first plant or animal.
Thus, one or more traits may be predicted for hybrid crosses
between the parental plants or animals, and then a parental plant
or animal predicted to produce a hybrid with a desired trait e.g.
late flowering, high heterosis, and/or high yield, and/or with a
reduced undesirable trait, may be selected. Methods for
predicting traits are discussed in more detail elsewhere herein.
Genes whose transcript abundance correlates with heterosis or
other trait in hybrids produced by crossing a first plant or
animal and other plants or animals are referred to elsewhere
herein, and may be At3gll2200 and/or one or more genes selected
from the genes in Table 2, or orthologues thereof. Genes with
transcript abundance correlating with other traits are shown in
Tables 3 to 17 and Table 20, as described elsewhere herein.
Hybrids produced by methods of the invention may be raised or
cultivated, e.g. to maturity or breeding age. The invention also
extends to hybrids produced using methods of the invention.
The invention may be applied to any trait of interest. For
example, traits to which the invention applies include, but are
not limited to, heterosis, flowering time or time to flowering,
seed oil content, seed fatty acid ratios, and yield. Examples of
genes whose transcript abundance correlates with certain traits
are shown in the appended Tables. For animals, preferred traits
are heterosis, yield and productivity. Traits such as yield may
be underpinned by heterosis, and the invention may relate to
modelling and/or predicting yield and other traits, and/or
modelling and/or predicting heterosis for yield and other traits,
based on transcript abundances of genes.
Genes in Tables shown herein are identified by AGI numbers,
Affymetrix Probe identifier numbers and/or GenBank database
accession numbers. AGI numbers can be used to identify the gene
from TAIR (The Arabidopsis Information Resource), available on-
line at http://www.arabidopsis.org/index.jsp, or findable by
searching for "TAIR" and/or "Arabidopsis information resource"
using an internet search engine. Affymetrix Probe identifier
numbers can be used to identify sequences from Netaffx, available
on-line at http://www.affymetrix.com/analysis/index.affx, or
findable by searching for "netaffx" and/or "Affymetrix" using an
internet search engine. It is now possible to convert between
the two identifier formats using the converter from the
University of Toronto, currently available at
http://bbc.botany.utoronto.ca/ntools/cgi-bin/ntools_agi_converter.cgi,
or findable by searching for "agi converter" using an internet
search engine. GenBank accession
numbers can be used to obtain the corresponding sequence from
GenBank, available at
http://www.ncbi.nlm.nih.gov/Genbank/index.html or findable using
any internet search engine.
A set of genes may comprise a set of genes selected from the
genes shown in a table herein.
In methods of the invention relating to heterosis, the one or
more genes may comprise one or more of the 70 genes listed in
Table 1 or one or more orthologues thereof, and/or may comprise
one or more of the genes listed in Table 19 or one or more
orthologues thereof.
In methods relating to traits other than heterosis, the trait may
for example be a trait referred to in Tables 3 to 17, Table 20 or
Table 22, and the one or more genes may comprise one or more of
the genes shown in the relevant tables, or one or more
orthologues thereof. Preferably, the genes in Tables 3 to 17, 20
and/or 22 are used for predicting or influencing (increasing or
decreasing) traits in inbred plants or animals. However, the
genes may also be used for predicting, increasing or decreasing
traits in recombinants and/or hybrids.
When the trait is flowering time, or time to flowering, in
plants, e.g. as represented by leaf number at bolting, the one or
more genes may comprise one or more genes shown in Table 3 or
Table 4, or orthologues thereof. Table 3 shows genes for which
transcript abundance was shown to correlate with flowering time
in vernalised plants, and Table 4 shows genes for which
transcript abundance was shown to correlate with flowering time
in unvernalised plants. These may be used for predicting
flowering time in vernalised or unvernalised plants,
respectively. However, as discussed elsewhere herein, transcript
abundance of genes which correlates with a trait in vernalised
plants may also correlate (normally according to a different
model or equation) with the trait in unvernalised plants. Thus,
transcript abundance of genes in either Table 3 or Table 4 may be
used to predict flowering time in either vernalised or
unvernalised plants, using the appropriate correlation for
vernalised or unvernalised plants respectively.
Whilst the transcript abundance data of the genes listed in many
of the Tables herein were used in our example for predicting
traits in vernalised plants, these data could also be used to
predict traits in unvernalised plants. Thus, a first correlation
may be identified between transcript abundance and the trait in
vernalised plants, and a second correlation may be identified
between transcript abundance and the trait in unvernalised
plants. The appropriate model may then be used to predict the
trait in vernalised or unvernalised plants respectively, based on
transcript abundance of one or more of those genes, or
orthologues thereof.
Oil content is a useful trait to measure in plants. This is one
of the measures used to determine seed quality, e.g. in oilseed
rape.
When the trait is oil content of seeds, e.g. as represented by %
dry weight, the one or more genes may comprise one or more genes
shown in Table 6, or orthologues thereof.
Seed quality may also be represented by the proportion,
percentage weight or ratio of certain fatty acids.
Normally, seed traits are predicted for vernalised plants, e.g.
oilseed rape in the UK is grown as a Winter crop and will
therefore be vernalised at the time of trait expression (seed
production in this example). However, predictions may be for
either vernalised or unvernalised plants.
When the trait is ratio of 18:2 / 18:1 fatty acids in seed oil,
the one or more genes may comprise one or more genes selected
from the genes shown in Table 7, or orthologues thereof.
When the trait is ratio of 18:3 / 18:1 fatty acids in seed oil,
the one or more genes may comprise one or more genes selected
from the genes shown in Table 8, or orthologues thereof.
When the trait is ratio of 18:3 / 18:2 fatty acids in seed oil,
the one or more genes may comprise one or more genes selected
from the genes shown in Table 9, or orthologues thereof.
When the trait is ratio of 20C + 22C / 16C + 18C fatty acids in
seed oil, the one or more genes may comprise one or more genes
selected from the genes shown in Table 10, or orthologues
thereof.
When the trait is ratio of polyunsaturated / monounsaturated +
saturated 18C fatty acids in seed oil, the one or more genes may
comprise one or more genes selected from the genes shown in Table
12, or orthologues thereof.
When the trait is % 16:0 fatty acid in seed oil, the one or more
genes may comprise one or more genes selected from the genes
shown in Table 14, or orthologues thereof.
When the trait is % 18:1 fatty acid in seed oil, the one or more
genes may comprise one or more genes selected from the genes
shown in Table 15, or orthologues thereof.
When the trait is % 18:2 fatty acid in seed oil, the one or more
genes may comprise one or more genes selected from the genes
shown in Table 16, or orthologues thereof.
When the trait is % 18:3 fatty acid in seed oil, the one or more
genes may comprise one or more genes selected from the genes
shown in Table 17, or orthologues thereof.
It may be desirable to predict responsiveness of a plant trait to
vernalisation, and this may be measured for example as the ratio
of a trait measurement in vernalised plants to the trait
measurement in unvernalised plants.
For example, responsiveness of flowering time to vernalisation
may be measured as the ratio of leaf number at bolting in
vernalised plants to leaf number at bolting in unvernalised
plants. Genes whose transcript abundance correlates with this
ratio are shown in Table 5. Thus, in embodiments of the
invention where the trait is responsiveness of plant flowering
time to vernalisation, the one or more genes may comprise one or
more genes shown in Table 5, or orthologues thereof.
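The responsiveness measure described above can be illustrated with a minimal sketch in Python. The function name and the leaf-number values are hypothetical and chosen only for illustration; the only thing taken from the text is the definition of the ratio (trait in vernalised plants divided by the same trait in unvernalised plants).

```python
def vernalisation_responsiveness(trait_vernalised, trait_unvernalised):
    """Ratio of a trait measured in vernalised plants to the same
    trait measured in unvernalised plants."""
    if trait_unvernalised == 0:
        raise ValueError("unvernalised trait value must be non-zero")
    return trait_vernalised / trait_unvernalised


# Hypothetical leaf numbers at bolting for a single accession.
ratio = vernalisation_responsiveness(12.0, 30.0)
print(f"responsiveness of flowering time to vernalisation: {ratio:.2f}")
```

The same function could be applied to a fatty acid ratio measured in vernalised and unvernalised plants, giving the "ratio of ratios" used for Tables 11 and 13.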
Responsiveness to vernalisation of the ratio of 20C + 22C / 16C +
18C fatty acids in seed oil may be measured as the ratio of
(ratio of 20C + 22C / 16C + 18C fatty acids in seed oil in
vernalised plants) to (ratio of 20C + 22C / 16C + 18C fatty acids
in seed oil in unvernalised plants). Genes whose transcript
abundance correlates with this ratio are shown in Table 11.
Thus, in embodiments of the invention where the trait is
responsiveness of this ratio to vernalisation, the one or more
genes may comprise one or more genes shown in Table 11, or
orthologues thereof.
Responsiveness to vernalisation of the ratio of polyunsaturated /
monounsaturated + saturated 18C fatty acids in seed oil may be
measured as the ratio of (ratio of polyunsaturated /
monounsaturated + saturated 18C fatty acids in seed oil in
vernalised plants) to (ratio of polyunsaturated / monounsaturated
+ saturated 18C fatty acids in seed oil in unvernalised plants).
Genes whose transcript abundance correlates with this ratio are
shown in Table 13. Thus, in embodiments of the invention where
the trait is responsiveness of this ratio to vernalisation, the
one or more genes may comprise one or more genes shown in Table
13, or orthologues thereof.
When the trait is yield, the one or more genes may comprise one
or more of the genes shown in Table 20 or Table 22, or
orthologues thereof.
Genes in Tables 1 to 17 are from Arabidopsis thaliana, and may be
used in embodiments of the invention relating to A. thaliana or
to another organism, such as for predicting or increasing
heterosis in a plant or animal (genes of Tables 1 and 2, or
orthologues thereof), or for predicting, increasing or decreasing
another trait in A. thaliana or other plant. Genes in Tables 19,
20 and 22 are from maize, and may be used in embodiments of the
invention relating to maize or to another organism, such as for
predicting or increasing heterosis in a plant or animal (genes of
Table 19 or orthologues thereof) or for predicting, increasing or
decreasing another trait in maize or other plant.
We have demonstrated that transcript abundance in plants of genes
shown in Tables 1, 3 to 17, 20 and 22 is predictive of the
described traits in those plants. In some embodiments of the
invention relating to use of parental transcriptome data for
prediction of traits in hybrids, transcript abundance in plants
of genes shown in Tables 1, 3 to 17, 20 and 22 or orthologues
thereof may be used to predict the described traits in hybrid
offspring of those plants.
Preferably, in embodiments of the invention relating to use of
parental transcriptome data for prediction of heterosis in
hybrids, transcript abundance in plants of At3g112200 and/or of
genes shown in Table 2, or orthologues thereof, is used to
predict the magnitude of heterosis in hybrid offspring of those
plants.
In embodiments of the invention relating to use of parental
transcriptome data for prediction of yield, e.g. grain yield, in
hybrids, transcript abundance in plants of one or more genes
shown in Table 22 is used to predict the yield in hybrid
offspring of those plants.
Heterosis or other trait is normally determined quantitatively.
As noted above, heterosis may be described relative to the mean
value of the parents (Mid-Parent Heterosis, MPH) or relative to
the "better" of the parents (Best-Parent Heterosis, BPH).
Heterosis may be determined on any suitable measurement, e.g.
size, fresh or dry weight at a given age, or growth rate over a
given time period, or in terms of some measure of yield or
quality. Heterosis may be determined using historical data from
the parental and/or hybrid lines.
Heterosis may be calculated based on size, for which size
measurements may for example be taken of the maximum length and
width of the plant or animal, or of a part of the plant or
animal, e.g. using electronic callipers. For plants, heterosis
may be calculated based on total aerial fresh weight of the
plants, which may be determined by cutting off all above soil
plant material, quickly removing any soil attached, and weighing.
In preferred embodiments, heterosis is heterosis for yield (e.g.
in plants or animals, yield of harvestable product), or heterosis
for fresh weight (e.g. fresh weight of aerial parts of a plant).
The magnitude of heterosis may thus be determined, and is
normally expressed as a % value. For example, mid parent
heterosis for fresh weight can be presented as a percentage
figure calculated as 100 x (weight of the hybrid - mean weight of
the parents) / mean weight of the parents. Best parent heterosis
for fresh weight can be presented as a percentage figure
calculated as 100 x (weight of the hybrid - weight of the
heaviest parent) / weight of the heaviest parent.
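The two calculations above can be written as a short Python sketch. The function names and the example weights are hypothetical; the sketch also assumes that the "better" parent is the one with the larger value, as is the case for fresh weight.

```python
def mid_parent_heterosis(hybrid, parent_a, parent_b):
    """Mid-Parent Heterosis (MPH) as a percentage of the parental mean."""
    mid_parent = (parent_a + parent_b) / 2.0
    return 100.0 * (hybrid - mid_parent) / mid_parent


def best_parent_heterosis(hybrid, parent_a, parent_b):
    """Best-Parent Heterosis (BPH) as a percentage of the better parent.

    Assumption: a larger value is 'better' (e.g. heavier fresh weight).
    """
    best_parent = max(parent_a, parent_b)
    return 100.0 * (hybrid - best_parent) / best_parent


# Hypothetical fresh weights (arbitrary units) of a hybrid and its parents.
mph = mid_parent_heterosis(15.0, 10.0, 12.0)
bph = best_parent_heterosis(15.0, 10.0, 12.0)
print(f"MPH = {mph:.1f} %, BPH = {bph:.1f} %")
```

For traits where a smaller value is preferred, the choice of the "better" parent would need to be reversed.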
For other traits, an appropriate measurement can be determined by
the skilled person. Some traits can be directly recorded as a
magnitude, e.g. seed oil content, weight of plant or animal, or
yield. Other traits would be determined with reference to
another indicator, e.g. flowering time may be represented by leaf
number at bolting. The skilled person is able to select an
appropriate way to quantify a particular trait, e.g. as a
magnitude, ratio, degree, volume, time or rate, and to measure
suitable factors representative of the relevant trait.
A transcript is messenger RNA transcribed from a gene. The
transcriptome is the contribution of each gene in the genome to
the mRNA pool. The transcriptome may be analysed and/or defined
with reference to a particular tissue, as discussed elsewhere
herein. Analysis of the transcriptome may thus be determination
of transcript abundance of one or more genes, or a set of genes.
Transcriptome analysis or determination of transcript abundance
is normally performed on tissue samples from the plants or
animals. Any part of the plant or animal containing RNA
transcripts may be used for transcriptome analysis. Where an
organism is a plant, the tissue is preferably from one or more,
preferably all, aerial parts of the plant, preferably when the
plant is in the vegetative phase before flowering occurs. In
some embodiments, transcriptome analysis may be performed on
seeds. Methods of the invention may involve taking tissue
samples from the plants or animals. In methods of predicting the
heterosis or other trait, the sampled organism may remain viable
after the tissue sample has been taken. Where prediction is to
be performed for genetically identical plants or animals, which
may be grown on a different occasion, tissues may include all
parts or all aerial parts or a whole seed (for plants) or the
whole embryo (for animals). Where prediction is to be performed
for the exact plant sampled, a subset of the leaves of the plant
may be sampled. However, there is no requirement for the
organism to remain viable, since sampling of one or more
individuals for transcriptome analysis that results in loss of
viability may be used for the prediction of heterosis or other
traits in hybrid, inbred or recombinant organisms of similar or
identical genetic composition grown on either the same or a
different occasion and under the same or different environmental
conditions.
Typically, transcriptome analysis is performed on RNA extracted
from the plant or animal. The invention may comprise extracting
RNA from a tissue sample of the hybrid or inbred plant or animal.
Any suitable methods of RNA extraction may be used, e.g. see the
protocol set out in the Examples.
Transcriptome analysis comprises determining the abundance of an
array of RNA transcripts in the transcriptome. Where
oligonucleotide chips are used for transcriptome analysis, the
numbers of genes potentially used for model development are the
numbers of probes on the GeneChips - ca. 23,000 for Arabidopsis
and ca. 18,000 for the present maize Chip. Thus, while in some
embodiments, the transcript abundance of each gene in the genome
is assessed, normally transcript abundance of a selected array of
genes in the genome is assessed.
Various techniques are available for transcriptome analysis, and
any suitable technique may be used in the invention. For
example, transcriptome analysis may be performed by bringing an
RNA sample into contact with an oligonucleotide array or
oligonucleotide chip, and detecting hybridisation of RNA
transcripts to oligonucleotides on the array or chip. The degree
of hybridisation to each oligonucleotide on the chip may be
detected. Suitable chips are available for various species, or
may be produced. For example, Affymetrix GeneChip array
hybridisation may be used, for example using protocols described
in the Affymetrix Expression Analysis Technical Manual II
(currently available at
http://www.affymetrix.com/support/technical/manuals.affx, or
findable using any internet search engine). For detailed
examples of transcriptome analysis, please see the Examples
below.
Transcript abundance of one or more genes, e.g. a set of genes,
may be determined, and any of the techniques above may be
employed. Alternatively, reverse transcriptase may be used to
synthesise double stranded DNA from the RNA transcript, and
quantitative polymerase chain reaction (PCR) may be used for
determining abundance of the transcript.
Transcript abundance of a set of genes may be determined. A set
of genes is a plurality of genes, e.g. at least 10, 20, 30, 40,
50, 60, 70, 80, 90 or 100 genes. The set may comprise genes
correlating positively with a trait and/or genes correlating
negatively with the trait. As noted below, preferably, the set
of genes is one for which transcript abundance of that set of
genes allows prediction of heterosis or other trait. The skilled
person may use methods of the invention to determine which genes
are most useful for predicting heterosis or other traits in
hybrids, and therefore to determine which genes can most usefully
be assessed for transcript abundance in accordance with the
invention. Additionally, examples of sets of genes for
prediction of heterosis and other traits are shown herein.
Preferably, analysis of transcript abundance is performed in the
same way for the plants or animals used to generate a model or
correlation with a trait ("model organisms") as for the plants or
animals in which the trait is predicted based on that model
("test organisms"). Preferably, the model and test organisms are raised
under identical conditions and transcriptome analysis is
performed on both the model and test organisms at the same age,
time of day and in the same environment, in order to maximise the
predictive value of the model based on transcriptome data from
the model organisms.
Accordingly, predicting a trait in a test plant or animal may
comprise determining transcript abundance of one or more genes in
the test plant or animal at a particular age, wherein transcript
abundance of the one or more genes in the transcriptome of model
plants or animals at that age correlates with the
trait. Thus, preferably transcript abundance in the organism
(i.e. plant or non-human animal) is determined when the organism
is at the same age as the organisms in the population on which
the correlation between transcript abundance and heterosis or
other trait was determined. Thus, predicting the degree of a
trait in an organism may comprise determining the abundance of
transcripts of one or more genes, preferably a set of genes, in
the organism at a selected age, and determining the transcript
abundance of one or more genes, preferably a set of genes,
wherein the transcript abundance of those one or more genes or
set of genes in the transcriptome of organisms at the said age
correlates with heterosis or other trait in the organism.
As noted elsewhere herein, the age at which transcript abundance
is determined may be earlier than the age at which the trait is
expressed, e.g. where the trait is flowering time the
transcriptome analysis may be performed when plants are in
vegetative phase.
Preferably, transcriptome analysis and determination of
transcript abundance are performed on plant or animal material
sampled at a particular time of day. For example, plant tissue
samples may be taken at the middle of the photoperiod (or as
close as practicable). Thus, when predicting a trait by
determining the transcript abundance of one or more genes (e.g.
set of genes) whose transcript abundance correlates with that
trait, the transcript abundance data for making the prediction
are preferably determined at the same time of day as the
transcript abundance data used to generate the correlation.
Some aspects of the invention relate to plants, such as cereals,
that require vernalisation before flowering. Vernalisation is a
period of exposure to cold, which promotes subsequent flowering.
Plants requiring vernalisation do not flower the same year when
sown in Spring, but continue to grow vegetatively. Such plants
("winter varieties") require vernalisation over Winter, and so
are planted in the Autumn to flower the following year. In the
present invention, plants may be vernalised or unvernalised.
Transcriptome data may be obtained from plants when vernalised or
unvernalised, and those data may be used to identify a
correlation between transcript abundance and a trait measured in
vernalised plants and/or a correlation between transcript
abundance and the trait measured in unvernalised plants. Thus,
surprisingly, we have shown that transcriptome data from
vernalised plants can be used to develop a model for predicting
traits in unvernalised plants, as well as being useful to develop
a model for predicting traits in vernalised plants.
In methods of the invention, comparisons and predictions are
preferably between plants or animals of the same genus and/or
species. Thus, methods of predicting heterosis or other trait in
a plant or animal may be based on correlations obtained in a
population of hybrids, inbreds or recombinants of that species of
plant or animal. However, as discussed elsewhere herein,
correlations obtained in one species may be applied to other
species, e.g. to other plants or other animals in general, or to
both plants and animals, especially where the other species
exhibit similar traits. Thus, the test organism in which the
trait is predicted need not be of the same species as the model
organisms in which the correlation for prediction of the trait
was developed.
Determination of transcript abundance for prediction of a trait
is normally performed on the same type of tissue as that in which
the correlation between the trait and transcript abundance was
determined. Thus, predicting the degree of heterosis in a hybrid
may comprise determining transcript abundance in tissue in or
from the hybrid, and determining the transcript abundance of one
or more genes, preferably a set of genes, wherein the transcript
abundance of those one or more genes in the transcriptome of the
said tissue in hybrids correlates with heterosis or other trait
in hybrids.
Data may be compiled, the data comprising:
(i) a value representing the magnitude of heterosis or other
trait in each plant or animal;
(ii) transcriptome analysis data in each plant or animal, wherein
the transcriptome analysis data represents the abundance of each
of an array of gene transcripts.
For determination of a correlation, data should be obtained from
a plurality of plants or animals. In methods of the invention it
is thus preferable that transcriptome analyses are performed and
traits are determined for at least three plants or animals, more
preferably at least five, e.g. at least ten. Use of more plants
or animals, e.g. in a population, can lead to more reliable
correlations and thus increase the quantitative accuracy of
predictions according to the invention.
Any suitable statistical analysis may be employed to identify a
correlation between transcript abundance of one or more genes in
the transcriptomes of the plants or animals and the magnitude of
heterosis or other trait. The correlation may be positive or
negative. For example, it may be found that some transcripts
have an abundance correlating positively with heterosis or other
trait, while other transcripts have an abundance correlating
negatively with heterosis or other trait.
Data from each plant or animal may be recorded in relation to
heterosis and/or multiple other traits. Accordingly, the
invention may be used to identify which genes have a transcript
abundance correlating with which traits in the organism. Thus, a
detailed profile may be compiled for the relationship between
transcript abundance and heterosis and other traits in the
population of organisms.
Typically, an analysis is performed using linear regression to
identify the relationship between transcript abundance and the
magnitude of heterosis (MPH and/or BPH) or other trait. An
F-value may then be calculated. The F-value is a standard
statistic for regression. It tests the overall significance of
the regression model. Specifically, it tests the null hypothesis
that all of the regression coefficients are equal to zero. The
value is the ratio of the regression mean square to the error
mean square, and ranges from zero upward. From this we get the
F Prob (the probability of obtaining an F-value at least this
large if the null hypothesis that there is no relationship were
true). A low value implies that at least some of the regression
parameters are not zero and that the regression equation does
have some validity in fitting the data, indicating that the
independent variables (gene expression levels) are not purely
random with respect to the dependent variable (trait value at
that point).
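As an illustration of the regression statistic described above, the following Python sketch fits a simple (single-gene) linear regression of a trait on transcript abundance and returns the slope, intercept, coefficient of determination and F-value. It is a minimal sketch only: the function name and data are hypothetical, it is not the GenStat program referred to in the Examples, and it does not compute the F probability, which would be read from the F distribution with 1 and n - 2 degrees of freedom using a statistics package.

```python
def linear_regression_f(abundance, trait):
    """Least-squares regression of trait on transcript abundance.

    Returns (slope, intercept, r_squared, f_value), where f_value is the
    regression mean square divided by the error mean square.
    """
    n = len(abundance)
    if n != len(trait) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(abundance) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in abundance)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(abundance, trait))
    syy = sum((y - mean_y) ** 2 for y in trait)
    if sxx == 0 or syy == 0:
        raise ValueError("abundance and trait values must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_regression = slope * sxy          # 1 degree of freedom
    ss_error = syy - ss_regression       # n - 2 degrees of freedom
    r_squared = ss_regression / syy
    if ss_error <= 0:
        f_value = float("inf")           # perfect fit
    else:
        f_value = ss_regression / (ss_error / (n - 2))
    return slope, intercept, r_squared, f_value


# Hypothetical data: transcript abundance of one gene and MPH (%) in five hybrids.
abundance = [1.0, 2.0, 3.0, 4.0, 5.0]
mph = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared, f_value = linear_regression_f(abundance, mph)
print(f"slope={slope:.2f} intercept={intercept:.2f} r2={r_squared:.3f} F={f_value:.0f}")
```

The sign of the slope indicates whether the transcript correlates positively or negatively with the trait.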
Preferably a correlation identified using the invention is a
statistically significant correlation. Significance levels may
be determined as F statistics from the regression Mean Square in
the analysis of variance tables of the linear regression
analysis. Statistical significance may be indicated for example
by an F probability < 0.05, or < 0.001.
Other potential relationships exist between gene expression and
plant phenotype, besides simple linear relationships. For
example, relationships may fall on a logistic curve. A computer
model (e.g. GenStat) may be used to fit the data to a logistic
curve.
Non-linear modelling covers those expression patterns that form
any part of a sigmoidal curve, from exponential-type patterns, to
threshold and plateau type patterns. Non-linear methods may also
cover many linear patterns, and thus may preferentially be used
in some embodiments of the invention.
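By way of illustration only, a logistic (sigmoidal) relationship between transcript abundance and a trait could be fitted as in the following Python sketch. The four-parameter form of the curve, the function names, and the coarse grid search over candidate rates and midpoints are all assumptions made for this example; the text above refers to GenStat for curve fitting and does not specify a fitting procedure. For each candidate rate and midpoint the lower asymptote and span are obtained by ordinary least squares, because the curve is linear in those two parameters.

```python
import math


def logistic(x, lower, span, rate, midpoint):
    """Four-parameter logistic curve (an assumed parameterisation)."""
    return lower + span / (1.0 + math.exp(-rate * (x - midpoint)))


def fit_logistic(x, y, rates, midpoints):
    """Grid search over rate and midpoint; lower and span by least squares.

    Returns (residual_ss, lower, span, rate, midpoint) for the best fit,
    or None if no candidate could be evaluated.
    """
    n = len(x)
    mean_y = sum(y) / n
    best = None
    for rate in rates:
        for midpoint in midpoints:
            s = [1.0 / (1.0 + math.exp(-rate * (xi - midpoint))) for xi in x]
            mean_s = sum(s) / n
            ss_s = sum((si - mean_s) ** 2 for si in s)
            if ss_s == 0.0:
                continue  # curve is flat over the data; cannot estimate span
            span = sum((si - mean_s) * (yi - mean_y)
                       for si, yi in zip(s, y)) / ss_s
            lower = mean_y - span * mean_s
            rss = sum((yi - (lower + span * si)) ** 2 for si, yi in zip(s, y))
            if best is None or rss < best[0]:
                best = (rss, lower, span, rate, midpoint)
    return best


# Hypothetical, noise-free data generated from a known curve.
x = [float(v) for v in range(11)]
y = [logistic(v, 5.0, 20.0, 1.0, 4.0) for v in x]
print(fit_logistic(x, y, rates=[0.5, 1.0, 2.0], midpoints=[2.0, 4.0, 6.0]))
```

A dedicated non-linear least-squares routine would normally replace the grid search; the grid is used here only to keep the sketch free of external dependencies.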
Normally a computer program is used to identify the correlation
or correlations. For example, as described in more detail in the
Examples below, linear regression analysis may be performed using
GenStat, e.g. Program 3 below is an example of a linear
regression programme to identify linear regressions between the
hybrid transcriptome and MPH.
More generally, each of the methods of the above aspects may be
implemented in whole or in part by a computer program which, when
executed by a computer, performs some or all of the method steps
involved. The computer program may be capable of performing more
than one of the methods of the above aspects.
Another aspect of the invention provides a computer program
product containing one or more such computer programs,
exemplified by a data carrier such as a compact disk, DVD, memory
storage device or other non-volatile storage medium onto which
the computer program(s) is/are recorded.
A further aspect of the invention is a computer system having a
processor and a display, wherein the processor is operably
configured to perform the whole or part of the method of one or
more of the above aspects, for example by means of a suitable
computer program, and to display one or more results of those
methods on the display. Typically the computer will be a general
purpose computer and the display will be a monitor. Other output
devices may be used instead of or in addition to the display
including, but not limited to, printers.
Preferably, a set of genes, e.g. less than 1000, 500, 250 or 100
genes, is identified for which transcript abundance correlates
with heterosis or other trait, wherein transcript abundance of
that set of genes allows prediction of heterosis or other trait.
A smaller set of genes that remains predictive of the trait may
then be identified by iterative testing of the precision of
predictions by progressively reducing the numbers of genes in the
models, preferentially retaining those with the best correlation
of transcript abundance with heterosis or the other trait, e.g.
genes with the most significant (e.g. p<0.001) correlations
between transcript abundance and traits. Thus, methods of the
invention may comprise identifying a correlation between a trait
and transcript abundance of a set of genes in transcriptomes, and
then identifying a smaller set or sub-set of genes from within
that set, wherein transcript abundance of the smaller set of
genes is predictive of the trait. Preferably the smaller set of
genes retains most of the predictive power of the set of genes.
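The iterative reduction described above might be sketched in Python as follows. This is one possible reading, not the procedure used in the Examples: the halving schedule, the 10 % tolerance, and the `prediction_error` callable (which stands for whatever measure of prediction precision is used, lower being better) are all assumptions introduced for illustration.

```python
def reduce_gene_set(ranked_genes, prediction_error, tolerance=0.10, min_genes=1):
    """Shrink a gene set while predictions remain acceptably precise.

    ranked_genes: gene identifiers, best-correlated first.
    prediction_error: hypothetical callable taking a list of genes and
        returning an error measure for predictions made with those genes.
    tolerance: allowed relative increase in error over the full set.
    """
    current = list(ranked_genes)
    baseline = prediction_error(current)
    while len(current) > min_genes:
        # Keep the best-correlated half of the current set.
        candidate = current[: max(min_genes, len(current) // 2)]
        if prediction_error(candidate) > baseline * (1.0 + tolerance):
            break  # precision has degraded too far; keep the previous set
        current = candidate
    return current


# Toy error function: predictions stay precise while g1 and g2 are retained.
def toy_error(genes):
    return 1.0 if {"g1", "g2"} <= set(genes) else 2.0


genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
print(reduce_gene_set(genes, toy_error))
```

In practice the error measure would be evaluated on plants or animals not used to build the model, so that the retained subset is tested for genuine predictive power.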
The magnitude of heterosis or other trait may be predicted from
transcript abundance of one or more genes, preferably of a set of
genes as noted above, based on a correlation of the transcript
abundance with heterosis or other trait (e.g. a linear regression
as described above).
Thus, the equation of the regression line (linear or non-linear)
for each of the gene transcripts showing a correlation
with magnitude of heterosis or other trait may be used to
calculate the expected magnitude of heterosis or other trait from
the transcript abundance of that gene. The aggregate of the
predicted contributions for each gene is then used to calculate
the trait value (e.g. as the sum of the contribution from each
gene transcript, normalised by the coefficient of determination,
r²).
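One way to read the aggregation step above is as an average of the per-gene predictions weighted by each gene's coefficient of determination. The Python sketch below implements that reading for linear per-gene models; the weighting scheme, the function name, the gene labels and all numerical values are assumptions made for illustration and are not taken from the Examples.

```python
def predict_trait(models, abundances):
    """Combine per-gene regression predictions into one trait value.

    models: gene -> (slope, intercept, r_squared) from the model organisms.
    abundances: gene -> transcript abundance in the test organism.
    Assumption: each gene's prediction is weighted by its r_squared and
    the weighted sum is divided by the total of the weights.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for gene, (slope, intercept, r_squared) in models.items():
        if gene not in abundances:
            continue  # no measurement for this gene in the test organism
        prediction = intercept + slope * abundances[gene]
        weighted_sum += r_squared * prediction
        weight_total += r_squared
    if weight_total == 0:
        raise ValueError("no usable gene models")
    return weighted_sum / weight_total


# Hypothetical models: one positively and one negatively correlated gene.
models = {"geneA": (2.0, 1.0, 0.8), "geneB": (-1.0, 20.0, 0.2)}
abundances = {"geneA": 5.0, "geneB": 8.0}
print(f"predicted trait value: {predict_trait(models, abundances):.2f}")
```

For non-linear per-gene models, the line computing `prediction` would be replaced by evaluation of the fitted curve for that gene.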
Drawings
Figure 1: Workflows for the analysis of expression data for the
investigation of heterosis. a) Standard protocols; b)
Recommended Prediction Protocol; c) Alternative 'Basic'
Prediction Protocol; d) Transcription Remodelling Protocol
List of Tables
Table 1: Genes in Arabidopsis thaliana hybrids, transcripts of
which correlate with magnitude of heterosis in the hybrids
Table 2: Genes in Arabidopsis thaliana inbred lines, transcripts
of which correlate with magnitude of heterosis in hybrids
produced by crossing those lines with Ler ms1. (A: positive
correlation; B: negative correlation)
Table 3: Genes in Arabidopsis thaliana inbred lines, showing
correlation in transcript abundance with leaf number at bolting
in vernalised plants (A: positive correlation; B: negative
correlation)
Table 4: Genes in Arabidopsis thaliana inbred lines showing
correlation in transcript abundance with leaf number at bolting
in unvernalised plants (A: positive correlation; B: negative
correlation)
Table 5: Genes in Arabidopsis thaliana inbred lines showing
correlation in transcript abundance with ratio of leaf number at
bolting (vernalised plants) / leaf number at bolting
(unvernalised plants). (A: positive correlation; B: negative
correlation)
Table 6: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and oil content of
seeds, % dry weight in vernalised plants (A: positive
correlation; B: negative correlation)
Table 7: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and ratio of 18:2 / 18:1
fatty acids in seed oil in vernalised plants (A: positive
correlation; B: negative correlation)
Table 8: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and ratio of 18:3 / 18:1
fatty acids in seed oil in vernalised plants (A: positive
correlation; B: negative correlation)
Table 9: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and ratio of 18:3 / 18:2
fatty acids in seed oil in vernalised plants (A: positive
correlation; B: negative correlation)
Table 10: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and ratio of 20C + 22C /
16C + 18C fatty acids in seed oil in vernalised plants (A:
positive correlation; B: negative correlation)
Table 11: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and ratio of (ratio of
20C + 22C / 16C + 18C fatty acids in seed oil (vernalised
plants)) / (ratio of 20C + 22C / 16C + 18C fatty acids in seed
oil (unvernalised plants)) (A: positive correlation; B: negative
correlation)
Table 12: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and ratio of
polyunsaturated / monounsaturated + saturated 18C fatty acids in
seed oil in vernalised plants (A: positive correlation; B:
negative correlation)
Table 13: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and ratio of (ratio of
polyunsaturated / monounsaturated + saturated 18C fatty acids in
seed oil (vernalised plants)) / (ratio of polyunsaturated /
monounsaturated + saturated 18C fatty acids in seed oil
(unvernalised plants)) (A: positive correlation; B: negative
correlation)
Table 14: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and % 16:0 fatty acid in
seed oil in vernalised plants (A: positive correlation; B:
negative correlation)
Table 15: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and % 18:1 fatty acid in
seed oil (vernalised plants) (A: positive correlation; B:
negative correlation)
Table 16: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and % 18:2 fatty acid in
seed oil (vernalised plants) (A: positive correlation; B:
negative correlation)
Table 17: Genes in Arabidopsis thaliana inbred lines showing
correlation between transcript abundance and % 18:3 fatty acid in
seed oil (vernalised plants) (A: positive correlation; B:
negative correlation)
Table 18: Prediction of complex traits in inbred lines
(accessions) using models based on accession transcriptome data
Table 19: Genes in maize for prediction of heterosis for plant
height. Data were obtained in plants at CLY location only (model
from 13 hybrids). Representative public ID shows GenBank
accession numbers. (A: positive correlation; B: negative
correlation)
Table 20: Genes in maize for prediction of average yield. Data
were obtained in plants across 2 sites, MO and L (model from 12
hybrids to predict 3). Representative public ID shows GenBank
accession numbers. (A: positive correlation; B: negative
correlation)
Table 21: Pedigree and seedling growth characteristics of maize
inbred lines used in Example 6a
Table 22: Maize genes for which transcript abundance in inbred
lines of the training dataset is correlated (P<0.00001) with plot
yield of hybrids with line B73. A negative value for the slope
indicates a negative correlation between abundance of the
transcript and yield, and a positive value indicates a positive
correlation.
Table 23: Maize plot yield data for Example 6a.
Examples
Example 1: Transcriptome remodelling in Arabidopsis hybrids
Our initial studies employed Arabidopsis thaliana. We conducted
all of our heterosis analyses in F1 hybrids between accessions of
A. thaliana, which can be considered inbred lines due to their
lack of heterozygosity. The genome sequence of A. thaliana is
available [62] and resources for transcriptome analysis in this
species are well developed [63]. A. thaliana also shows a wide
range of magnitude of hybrid vigour [7, 64, 65].
The null hypothesis is that all parental alleles contribute to
the transcriptome in an additive manner, i.e. if alleles differ
in their contribution to transcript abundance, the observed value
in the hybrid will be the mean of the parent values. There are
six patterns of transcript abundance in hybrids that depart from
this expected additive effect of contrasting parental alleles
[28]:
(i) transcript abundance in the hybrid is higher than either
parent;
(ii) transcript abundance in the hybrid is lower than either
parent;
(iii) transcript abundance in the hybrid is similar to the
maternal parent and both are higher than the paternal parent;
(iv) transcript abundance in the hybrid is similar to the
paternal parent and both are higher than the maternal parent;
(v) transcript abundance in the hybrid is similar to the maternal
parent and both are lower than the paternal parent;
(vi) transcript abundance in the hybrid is similar to the
paternal parent and both are lower than the maternal parent.
When using quantitative analytical methods, the terms "higher
than", "lower than" and "similar to" can be defined by specific
fold-difference criteria. Although differences in the
contributions to the transcriptome of divergent alleles in maize
hybrids have been reported as common [29, 66], the lack of absolute
quantitative analysis of transcript abundance in parental inbred
lines means that it is not possible to determine whether the
observed effects are due to allelic interaction in the hybrid or
simply the expected additive effects of alleles with differing
transcript abundance characteristics. We would not consider such
additive effects as components of transcriptome remodelling.
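The six patterns listed above can be sketched as a simple classifier. This is a minimal illustration, not the script used in the study: the function name is hypothetical, and the readings of "higher than" (at least `fold` times greater) and "similar to" (within `fold` in either direction) are assumptions. Abundance values are assumed to be positive.

```python
def classify_remodelling(hybrid, maternal, paternal, fold=2.0):
    """Return the pattern number (1-6) or None if no pattern applies.

    Assumption: 'higher than' means at least `fold` times greater and
    'similar to' means neither value is `fold` times the other.
    """
    def higher(a, b):
        return a >= fold * b

    def similar(a, b):
        return not higher(a, b) and not higher(b, a)

    if higher(hybrid, maternal) and higher(hybrid, paternal):
        return 1  # (i) hybrid higher than either parent
    if higher(maternal, hybrid) and higher(paternal, hybrid):
        return 2  # (ii) hybrid lower than either parent
    if similar(hybrid, maternal) and higher(hybrid, paternal) and higher(maternal, paternal):
        return 3  # (iii) hybrid ~ maternal, both higher than paternal
    if similar(hybrid, paternal) and higher(hybrid, maternal) and higher(paternal, maternal):
        return 4  # (iv) hybrid ~ paternal, both higher than maternal
    if similar(hybrid, maternal) and higher(paternal, hybrid) and higher(paternal, maternal):
        return 5  # (v) hybrid ~ maternal, both lower than paternal
    if similar(hybrid, paternal) and higher(maternal, hybrid) and higher(maternal, paternal):
        return 6  # (vi) hybrid ~ paternal, both lower than maternal
    return None  # e.g. additive (mid-parent) abundance
```

A hybrid value midway between two divergent parents, such as 75 against parents at 50 and 100, falls into none of the six classes at the 2-fold level and is therefore not counted as remodelled.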
We produced reciprocal hybrids between A. thaliana accessions
Kondara and Br-0, and between Landsberg er ms1 and Kondara, Mz-0,
Ag-0, Ct-1 and Gy-0, with Landsberg er ms1 as the maternal
parent. Hybrids and parents were grown under identical
environmental conditions and heterosis calculated for the fresh
weight of the aerial parts of the plants after 3 weeks growth
(see Materials and Methods). The heterosis observed for each
combination was recorded (BPH (%) and MPH (%)).
RNA was extracted from the same material and the transcriptome
was analysed using ATH1 GeneChips. Plants were grown in three
replicates on three successive occasions. RNA was pooled from
the three replicates for analysis of gene expression levels on
each occasion.
Transcript abundance values in A. thaliana hybrids were compared
over all experimental occasions and genes showing differences, at
defined fold-levels from 1.5 to 3.0, corresponding to the six
patterns indicative of transcriptome remodelling, were
identified. Genes with transcript abundance differing between
the parents by the same defined fold-level were also identified.
The number of genes that appeared consistently in each of these 8
categories across all 3 experimental occasions was counted. To
assess whether the number of genes classified into each category
differed from that expected by chance, permutation analysis
(bootstrapping) was used to calculate an expected value under the
null hypothesis of no remodelling.
The significance of the experimental results was assessed, for
each category independently, using Chi square tests. The results
of the analysis, summarised in Table 1 for 2-fold differences,
show that transcriptome remodelling occurred in all of the
hybrids analysed, with most individual observations showing
highly significant (p<0.001) divergence from the null hypothesis.
Similar analyses were conducted for 1.5- and 3-fold differences,
with extensive remodelling also being identified. Based on the
analysis of gene ontology information, there were no obvious
functional relationships of the remodelled genes in the hybrids.
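The comparison of an observed gene count with the count expected by chance can be illustrated with a one-degree-of-freedom Chi square test. The source does not state how the test was set up, so the two-cell layout (genes in the category versus all other genes), the function name and the example counts below are assumptions made for this sketch.

```python
import math

def chi_square_1df(observed, expected, total):
    """Chi square statistic and p-value for one category against the rest.

    observed: number of genes seen in the category
    expected: number expected under the null hypothesis
    total:    number of genes analysed (assumed layout: in / not in category)
    """
    obs = [observed, total - observed]
    exp = [expected, total - expected]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    # Survival function of the chi square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts for illustration only.
chi2, p_value = chi_square_1df(observed=120, expected=60, total=22000)
```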
Further analyses of selected genes from these categories were
conducted using additional GeneChip hybridisation experiments and
by quantitative RT-PCR, and confirmed the transcript abundance
patterns. GeneChip hybridization was also performed using
genomic DNA from accessions Kondara, Br-0 and Landsberg er ms1,
to assess the proportion of differences between parental
transcriptomes attributable to sequence polymorphisms that would
prevent accurate reporting of transcript abundance by the arrays.
We found that ca. 20% of the differences between parental
transcriptomes may be attributable to sequence variation.
However, this does not affect the remodelling analysis, as
additivity of allelic contributions to the mRNA pool in hybrids
where one parental allele failed to report accurately on the
array would result in intermediate signal strength, so would not
be assigned to any of the remodelled classes.
The relationship of transcriptome remodelling with hybrid vigour
was assessed by carrying out linear regression of the number of
genes remodelled in each hybrid combination, at the 1.5, 2 and 3-
fold levels, on the magnitude of heterosis observed. This
revealed a strong relationship between heterosis and the
transcriptome remodelling at the 1.5-fold level (r = +0.738,
coefficient of determination r2 = 0.544 for MPH; r = +0.736, r2 =
0.542 for BPH). The correlation was more modest between
heterosis and the transcriptome remodelling involving higher fold
level changes (r2 = 0.213 and 0.270 for MPH and BPH,
respectively, for 2-fold changes; r2 = 0.300 and 0.359 for MPH
and BPH, respectively, for 3-fold changes). There was extensive
remodelling, at all fold changes, even in the hybrid combinations
showing the least heterosis. Consequently, the majority of
remodelling events identified that result in transcript abundance
changes of 2-fold or greater, even in strongly heterotic hybrids,
are likely to be unrelated to heterosis. The most highly
enriched class in heterotic hybrids is those genes showing 1.5-
fold differential abundance, which is below the threshold usually
set in transcriptome analysis experiments.
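The correlation coefficient r and the coefficient of determination r2 quoted above are related by simple squaring (for example, 0.738 squared is approximately 0.544). A minimal sketch of the calculation follows; the function name and the input values are hypothetical and are not data from this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only.
heterosis_pct = [10.0, 25.0, 40.0, 55.0, 70.0]
genes_remodelled = [120, 150, 210, 200, 290]

r = pearson_r(heterosis_pct, genes_remodelled)
r_squared = r ** 2  # coefficient of determination
```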
Heterosis shows an inconsistent relationship with the degree of
relatedness of parental lines, with an absence of correlation
reported between heterosis and genetic distance in A. thaliana
[7]. We estimated the genetic distance between the accessions
used in the hybrid combinations we have analysed, and these are
shown in Table 1. To assess the relationship of transcriptome
remodelling with genetic distance, we regressed the number of
genes classified as having remodelled transcript abundance in
each hybrid combination against genetic distance. We found that
transcriptome remodelling is associated with genetic distance in
the higher-fold remodelling classes (r2 = 0.351 and 0.281 for 2
and 3-fold changes respectively), but not for 1.5-fold
remodelling (r2 = 0.030). We found no relationship between
heterosis and genetic distance, in accordance with previous
reports in A. thaliana (r2 = 0.024 and 0.005 for MPH and BPH,
respectively, against relative genetic distance). We conclude
that the formation of hybrids between divergent inbred lines
results in transcriptome remodelling, with the extent of
remodelling increasing with the degree of genetic divergence of
those lines. This result is consistent with the expected effects
of allelic variation on transcriptional regulatory networks. The
relationship between transcriptome remodelling and heterosis can
be interpreted as meaning that heterosis is likely to require
transcriptome remodelling to occur, but that much of this
involves low magnitude remodelling of the transcript abundance of
a large number of genes.
The results of the above experiments indicate that the
conventional approach to the analysis of the transcriptome in the
hybrid, i.e. studying one or very few hybrid combinations, is
unlikely to result in the identification of genes involved
specifically in heterosis.
Example 2: Transcript abundance in hybrid transcriptomes
We carried out an analysis using linear regression to identify
the relationship between transcript abundance in a range of
hybrids and the strength of heterosis (both MPH and BPH) shown by
those hybrids. Significance levels were determined as F
statistics from the regression Mean Square in the analysis of
variance tables of the linear regression analysis. For this, we
used the heterosis measurements and hybrid transcriptome data
from the combinations described above with Landsberg er ms1 as
the maternal parent, and from additional hybrids between
Landsberg er ms1, as the maternal parent, and Columbia, Wt-1,
Cvi-0, Sorbo, Br-0, Ts-5, Nok3 and Ga-0. Transcriptome data from
32 GeneChips, representing between 1 and 3 replicates from each
of these 13 hybrid combinations of accessions, were used in this
study. Nine genes were identified that showed highly significant
(F<0.001) regressions (all positive) of transcript abundance in
the hybrid on the magnitude of both MPH and BPH. Thirty-four
genes showed highly significant regressions (F<0.001; 22
positive, 12 negative) of transcript abundance in the hybrid on
MPH and significant regressions (F<0.05) on BPH. Twenty-seven
genes showed highly significant regressions (F<0.001; 23
positive, 4 negative) of transcript abundance in the hybrid on
magnitude of BPH and significant (F<0.05) regression on MPH. The
genes are shown in Table 1 below. Based on gene ontology
information, there are no obvious functional relationships
between these 70 genes and no excess representation of genes
involved in transcription.
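The per-gene regression described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually run: the function name is hypothetical, and the F statistic is computed as the regression Mean Square (1 degree of freedom) divided by the residual Mean Square (n - 2 degrees of freedom).

```python
def regress_abundance_on_heterosis(heterosis, abundance):
    """Least-squares fit: abundance = intercept + slope * heterosis.

    Returns (slope, intercept, r2, F).
    """
    n = len(heterosis)
    mean_x = sum(heterosis) / n
    mean_y = sum(abundance) / n
    sxx = sum((x - mean_x) ** 2 for x in heterosis)
    syy = sum((y - mean_y) ** 2 for y in abundance)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(heterosis, abundance))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_regression = slope * sxy            # 1 degree of freedom
    ss_residual = syy - ss_regression      # n - 2 degrees of freedom
    r2 = ss_regression / syy
    if ss_residual <= 0:
        f_stat = float("inf")              # perfect fit
    else:
        f_stat = ss_regression / (ss_residual / (n - 2))
    return slope, intercept, r2, f_stat
```

Converting F to a probability requires the F distribution with (1, n - 2) degrees of freedom, which a statistics package provides; a positive slope corresponds to a positive regression and a negative slope to a negative one.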
The ability to identify a set of genes that show highly
significant correlation of transcript abundance and magnitude of
heterosis across 13 hybrids indicates that transcriptome-level
events are predominant in the manifestation of heterosis. To
confirm that this is correct, and that the genes we have
identified are indicative of the transcript abundance
characteristics that are important in heterosis, we utilized
these discoveries to predict the strength of heterosis in new
hybrid combinations based on the transcript abundance of the 70
defined genes. We built a mathematical model using the equations
of the linear regression lines recalculated for each of the 70
genes against both MPH and BPH, to calculate the expected
heterosis as the sum of the contribution from each gene,
normalised by the coefficient of determination, r2. The model
operates as a Microsoft Excel spreadsheet, which is available as
supplementary materials on Science Online. The spreadsheet also
contained the normalised transcriptome data for the 70 genes from
each of the hybrids studied. The model was validated by
"predicting" the heterosis in the training set of 32 hybrids from
which transcriptome data were used for its construction. It
predicted heterosis across the full range of magnitude observed,
for both MPH and BPH, with a very high correlation between
predicted and observed values for individual samples (r2= 0.768
for MPH, r2 = 0.738 for BPH). Three new hybrid combinations were
produced, between the maternal parent Landsberg er ms1 and
accessions Shakdara, Kas-1 and Ll-0. These were grown, in a
"blind" experiment, under the same environmental conditions as
the training set for the model; heterosis for fresh weight was
measured and the transcriptomes were analysed. The transcript
abundance data for the 70 genes of the model were extracted for
each of the new hybrids and entered into the heterosis prediction
model. The results, as summarised below, confirmed that the
model produced excellent quantitative predictions of heterosis,
particularly MPH, confirming that transcriptome-level events
were, indeed, predominant in the manifestation of heterosis.
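One plausible reading of the model described above is sketched here. The exact arithmetic of the spreadsheet is not given in this text, so the following are assumptions: each gene has a regression line of the form abundance = intercept + slope x heterosis, each line is inverted to give a per-gene estimate of heterosis, and the estimates are combined as an average weighted by r2. All names and numbers are hypothetical.

```python
def predict_heterosis(gene_models, abundances):
    """Combine per-gene heterosis estimates, weighted by r2.

    gene_models: list of (slope, intercept, r2) tuples, one per gene,
                 for the assumed line abundance = intercept + slope * heterosis
    abundances:  transcript abundance of each gene in the new hybrid
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (slope, intercept, r2), abundance in zip(gene_models, abundances):
        estimate = (abundance - intercept) / slope  # invert the regression line
        weighted_sum += r2 * estimate
        weight_total += r2
    return weighted_sum / weight_total

# Hypothetical two-gene model for illustration only.
models = [(2.0, 10.0, 0.8), (-1.0, 50.0, 0.4)]
predicted = predict_heterosis(models, [70.0, 14.0])
```

Weighting by r2 gives genes with tighter regressions more influence on the prediction than genes with looser ones.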
Prediction of heterosis using a model based on hybrid
transcriptome data
Mid parent heterosis for fresh weight is presented as a
percentage figure calculated as (weight of the hybrid - mean
weight of the parents) / mean weight of the parents.
Best parent heterosis for fresh weight is presented as a
percentage figure calculated as (weight of the hybrid - weight of
the heaviest parent) / weight of the heaviest parent.
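The two definitions above translate directly into code. The function names and the example weights are hypothetical; the formulas are those stated in the text, multiplied by 100 to give percentages.

```python
def mid_parent_heterosis(hybrid, parent_a, parent_b):
    """MPH (%): hybrid weight relative to the mean weight of the parents."""
    mid_parent = (parent_a + parent_b) / 2.0
    return 100.0 * (hybrid - mid_parent) / mid_parent

def best_parent_heterosis(hybrid, parent_a, parent_b):
    """BPH (%): hybrid weight relative to the heavier parent."""
    best_parent = max(parent_a, parent_b)
    return 100.0 * (hybrid - best_parent) / best_parent

# Hypothetical fresh weights (g) for illustration only.
mph = mid_parent_heterosis(1.5, 1.0, 1.2)
bph = best_parent_heterosis(1.5, 1.0, 1.2)
```

For a hybrid that is heavier than both parents, MPH is never smaller than BPH, because the mid-parent value cannot exceed the best-parent value.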
Example 2a: Highly significant and specific correlation between
heterosis and transcript abundance of At1g67500 and At5g45500 in
hybrids
In a further experiment to identify specific genes that show
transcript abundance (gene expression) patterns in hybrids
correlated with heterosis, we conducted an additional analysis
based upon linear regression. For this we used a "training"
dataset consisting of hybrid combinations between Landsberg er
ms1 and Ct-1, Cvi-0, Ga-0, Gy-0, Kondara, Mz-0, Nok-3, Ts-5,
Wt-5, Br-0, Col-0 and Sorbo. For each individual gene represented
on the array, the transcript abundance in hybrids was regressed
on the magnitude of heterosis exhibited by those hybrids.
Twenty-one genes showed highly significant (p<0.001) correlation,
but this is no more than is expected by chance, as data for almost
23,000 genes were analysed. However, the exceptionally high
significance for the two genes showing the greatest correlation
(r2 = 0.457, P = 6.0 x 10^-6 for gene At1g67500; r2 = 0.453, P =
6.9 x 10^-6 for gene At5g45500) is highly unlikely to have occurred by
chance. In both cases the correlation was negative, i.e.
expression is lower in more strongly heterotic hybrids.
We tested whether the expression characteristics of these genes
could be used for the prediction of heterosis. This was
conducted by removing one hybrid from the dataset, formulating
the regression line and using this relationship to predict the
expected heterosis corresponding to the gene expression measured
for the hybrid that had been removed. The analysis was repeated
by the removal and prediction of heterosis in each of the 12
hybrids in turn. Three untested hybrids were developed
(Landsberg er ms1 crossed with Ll-0, Kas-1 and Shakdara) as a
"test" dataset, grown and assessed for heterosis as for the lines
of the training dataset, and their transcriptomes analysed using
ATH1 GeneChips. Using formulae derived by regression using all
12 hybrids in the training dataset, the expression data for genes
At1g67500 and At5g45500 in the hybrids of the test dataset were
used to predict the heterosis in these test hybrids. Both showed
very high correlation between predicted and measured heterosis.
Overall, predictions of heterosis based on the expression of
At1g67500 are better correlated with measured heterosis (r2 =
0.708) than those based on the expression of At5g45500 (r2 =
0.594).
However, removal of one anomalous prediction in the training
dataset (that of the heterosis shown by the hybrid Landsberg er
ms1 x Nok-3) improves the latter to r2 = 0.773. Nevertheless,
the predictions of heterosis in all three hybrids of the test
dataset based on the expression of At5g45500, in particular, are
remarkably accurate.
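The remove-one-and-predict procedure described above is a leave-one-out cross-validation. A minimal sketch follows, assuming that expression is regressed on heterosis and the fitted line is then inverted to predict heterosis from the expression of the held-out hybrid; the function names and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_one_out_predictions(heterosis, expression):
    """Predict heterosis for each hybrid from a line fitted to all the others."""
    heterosis = list(heterosis)
    expression = list(expression)
    predictions = []
    for i in range(len(heterosis)):
        train_h = heterosis[:i] + heterosis[i + 1:]
        train_e = expression[:i] + expression[i + 1:]
        slope, intercept = fit_line(train_h, train_e)
        # Invert expression = intercept + slope * heterosis for the held-out hybrid.
        predictions.append((expression[i] - intercept) / slope)
    return predictions

# Hypothetical, exactly linear data (expression = 100 - 0.5 * heterosis).
measured = [10.0, 20.0, 30.0, 40.0]
expression = [95.0, 90.0, 85.0, 80.0]
predicted = leave_one_out_predictions(measured, expression)
```

Because each prediction is made from a line that never saw the hybrid being predicted, the agreement between predicted and measured values is a fairer test than refitting to the full training set.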
Hybrids that show greater heterosis tend to be heavier than
hybrids that show little heterosis. As expected, we identified
such a correlation between the magnitude of heterosis we measured
and weight for the 15 hybrids of our training and test datasets
(r2 = 0.492). In order to assess whether the expression of genes
At1g67500 and At5g45500 specifically predicts heterosis, we
assessed the possibility of correlation between gene expression
and the weight of the plants in which expression is being
measured. For this, we used the plant weight and gene expression
data from the 12 parental lines in the training dataset. We
found the expression of At1g67500 to show weak negative
correlation with the weight of the plants (r2 = 0.321), but there
was no correlation for At5g45500 (r2 < 0.001). We conclude that
the transcript abundance of At5g45500 is indicative specifically
of heterosis, but that of At1g67500 is likely to be influenced
also by the weight of hybrid plants. This conclusion is
consistent with the errors in prediction of heterosis in the test
dataset using the expression of At1g67500: the prediction of
heterosis in the hybrid Landsberg er ms1 x Kas-1 (which is
unusually heavy for the heterosis it shows) is overestimated,
whereas the prediction of heterosis in the hybrid Landsberg er
ms1 x Ll-0 (which is unusually light for the heterosis it shows)
is underestimated.
Gene At5g45500 is annotated as encoding "unknown protein", so its
functions in the process of heterosis cannot be deduced based
upon homology. The function of gene At1g67500 is known: it
encodes the catalytic subunit of DNA polymerase zeta and the
locus has been named AtREV3 due to the homology of the
corresponding protein with that of yeast REV3 [67]. REV3 is
important in resistance to UV-B and other stresses that result in
DNA damage as its function is in translesion synthesis, which is
required to repair forms of damage to DNA that block
replication. Studies have shown no differential expression for
At1g67500 in response to UV-B or other stresses [68]. However,
the expression of At5g45500 is increased in aerial parts that
were subjected to UV-B, genotoxic and osmotic stresses [68].
Thus both of the genes with expression correlated with heterosis
in hybrid plants have potential roles in stress resistance. As
the expressions of both are negatively correlated with heterosis,
one hypothesis is that greater expression of these genes might be
related to increased resilience to specific stresses, but this
has a repressive effect on growth under favourable conditions.
This resembles the situation where biomass and seed yield
penalties were found to be associated with R-gene-mediated
pathogen resistance to Pseudomonas syringae [69]. Heterosis, at
least for vegetative biomass, may therefore be the consequence of
genetic interactions that lead to a reduction in repression of
growth, rather than direct promotion of growth.
Example 3: Transcript abundance in transcriptomes of inbred
lines
We carried out separate analyses using linear regression to
identify the relationship between transcript abundance in the
parental lines and the strength of MPH shown by their respective
hybrids with Landsberg er ms1. Significance levels were
determined as F statistics from the regression Mean Square in the
analysis of variance tables of the linear regression analysis.
In total, 272 genes were identified that showed highly
significant (F<0.001) regressions of transcript abundance in the
parent on the magnitude of MPH. See Table 2 below. Based on
gene ontology information, there are no obvious functional
relationships between these genes and no excess representation of
genes involved in transcription.
The invention permits use of transcriptome characteristics of
inbred lines as "markers" to predict the magnitude of heterosis
in new hybrid combinations.
We built mathematical models, using the equations of the linear
regression lines for each of the genes, to calculate the expected
heterosis. These models operate as programmes within the Genstat
statistical analysis package [70]. The results, as summarised in
the table below, confirmed that the model successfully predicted
the heterosis observed in the untested combinations using
transcriptome characteristics of the inbred parents as markers.
Prediction of heterosis using a model based on parental
transcriptome data
Example 3a: Highly significant correlation between heterosis and
transcript abundance of At3g11220 in inbred parents
We conducted an additional analysis based upon linear regression
to identify genes that show expression patterns in inbred parents
correlated with heterosis shown by the hybrids. For each
individual gene represented on the array, transcript abundance in
paternal parent lines was regressed on the magnitude of heterosis
exhibited by the corresponding hybrids with accession Landsberg
er ms1 in the training dataset.
The expression of one gene, At3g11220, showed an exceptionally
high correlation (r2 = 0.649; P = 2.7 x 10^-8). The correlation
was negative, i.e. expression is lower in parental lines that
produce more strongly heterotic hybrids. We assessed the utility
of using the expression of this gene in parental lines to predict
the heterosis that would be shown by the corresponding hybrids
with accession Landsberg er ms1. This was conducted for both
training and test datasets, as for the predictions based on the
expression of At1g67500 and At5g45500 in hybrids. The heterosis
predicted was well correlated with the measured heterosis (r2 =
0.719) and the predicted values for two of the three hybrids in
the test dataset were very accurate. However, heterosis was
substantially overestimated for the hybrid Landsberg er ms1 x
Kas-1, despite there being no correlation between the expression
of At3g11220 in parental accessions and the weight of those
accessions (r2 < 0.001).
Gene At3g11220 is annotated as encoding "unknown protein", so its
function in the process of heterosis cannot be deduced based upon
homology.
Example 4: Transcriptome analysis for prediction of other traits
We used the methodology as described for the prediction of
heterosis using parental transcriptome data to develop models for
the prediction of additional traits in accessions. The
transcriptome data set used for the construction of the models
was that obtained for 11 accessions: Br-0, Kondara, Mz-0, Ag-0,
Ct-1, Gy-0, Columbia, Wt-1, Cvi-0, Ts-5 and Nok3, as previously
described. Trait data had previously been obtained from these,
and accessions Ga-0 and Sorbo. Transcriptome data from
accessions Ga-0 and Sorbo were used for trait prediction in these
accessions. The genes incorporated into the models relating to
the 15 measured traits are listed in Tables 3 to 17.
The predicted trait values for Ga-0 and Sorbo were compared with
measured trait values for these accessions, to assess the
performance of the models.
As the models developed for the prediction of additional traits
were developed using only 11 accessions, we expected them to
contain some false components. These would tend to shift trait
predictions towards the average value of the trait across the set
of accessions used for the construction of the models.
Therefore, our criterion for success of each model was whether or
not it ranked the accessions Ga-0 and Sorbo correctly. The
results, as summarised in Table 18, show that the models were
able to successfully predict flowering time, seed oil content and
seed fatty acid ratios. As expected, the values produced by the
models were between the measured value for the trait in the
respective accessions and the average value of the trait across
all accessions. Only the models to predict the absolute seed
content of a subset of specific fatty acids were unsuccessful.
This lack of success in the experiment we conducted may have been
due to the relative lack of precision of the data for these
traits and/or insufficient numbers of genes with transcript
abundance correlated with the trait to overcome the effects of
false components in the models developed using the data sets
available at the time. We believe that models based on more
extensive data sets would be able to successfully predict these
traits.
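The success criterion used above, correct ranking of the two test accessions, and the expected shift of predictions towards the panel average can both be expressed as simple checks. The function names and all trait values below are hypothetical and are given only to illustrate the logic.

```python
def ranked_correctly(pred_a, pred_b, measured_a, measured_b):
    """True if the model orders two accessions the same way as the measurements."""
    return (pred_a - pred_b) * (measured_a - measured_b) > 0

def lies_between_measured_and_mean(predicted, measured, panel_mean):
    """True if a prediction falls between the measured value and the panel mean."""
    low, high = sorted((measured, panel_mean))
    return low <= predicted <= high

# Hypothetical trait values for two test accessions, for illustration only.
ok = ranked_correctly(pred_a=30.0, pred_b=42.0, measured_a=25.0, measured_b=50.0)
shrunk = lies_between_measured_and_mean(predicted=30.0, measured=25.0, panel_mean=37.0)
```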
The ability to use transcriptome data from an early stage of
plant growth under specific environmental conditions (i.e. aerial
parts of vegetative-phase plants after 3 weeks growth in a
controlled environment room under 8 hour photoperiod) to predict
characteristics that appear later in the development of plants
grown in different environmental conditions (flowering time,
details of seed composition and vernalisation responses of plants
grown in a glasshouse under 16 hour photoperiod) is remarkable.
We interpret this as evidence of extensive interconnection and
multiplicity of gene function, regulated, as for heterosis,
largely at the level of transcript abundance. The results
presented here indicate that our methodology will allow the use
of specific characteristics of the transcriptomes of organisms,
including both plants and animals, early in their life cycle as
"markers" to predict many complex traits later in their life
cycle, and to increase our understanding of the underlying
biological processes.
Example 5: METHODS AND MATERIALS
Accessions used
The accessions used for the studies underlying this disclosure
were obtained from the Nottingham Arabidopsis Stock Centre
(NASC): Kondara, Cvi-0, Sorbo, Ag-0, Br-0, Col-0, Ct-1, Ga-0, Gy-
0, Mz-0, Nok-3, Ts-5, Wt-5 (catalogue numbers N916, N902, N931,
N936, N994, N1092, N1094, N1180, N1216, N1382, N1404, N1558 and
N1612, respectively). A male sterile mutant of Landsberg erecta
(Ler ms1) was also obtained from NASC (catalogue number N75).
Growth conditions
Seeds of parental accessions and hybrids were sown into pots
containing A. thaliana soil mix (as described in O'Neill et al
[71]) and Intercept (Intercept 5GR). The pot was then watered,
and sealed to retain moisture, before being placed at 4°C for 6
weeks to partially normalize flowering time. At the end of this
time period the pot was placed in a controlled environment room
(heated at 22°C and lit for 8 hours per day). Gradually the seal
was removed in order to acclimatise the plants to the reduced air
moisture. When the first true leaves appeared the plants were
transplanted to individual pots, which were again sealed and
returned to the controlled environment rooms. Again the seal was
gradually removed over the next few days. The positions of A.
thaliana plants in controlled environment rooms were determined
using a complete randomised block design, with the trays of
plants being regularly rotated and moved in order to reduce
environmental effects.
The production of hybrid seeds
Hybrids were produced by crossing accessions Kondara and Br-0 by
selecting a raceme of the maternal plant, removing all branches
and siliques, leaving only the inflorescence. All immature and
open buds were removed, along with the apical meristem, leaving
5-6 mature closed buds. From these buds the sepals, petals, and
stamens were removed leaving only a complete pistil. For crosses
involving Ler ms1 as the maternal parent, only enough tissue was
removed, from unopened buds, to allow access to the stigma. Buds
of all plants were then pollinated by removing a stamen from the
pollen donor plant, and rubbing the anther against the stigma.
This was repeated until the stigma was well coated with pollen
when viewed under the microscope. The pollinated buds were then
protected from additional pollination by being enclosed in a
'bubble' of Clingfilm, which was removed after 2-3 days.
Trait measurements
The total aerial fresh weight of the plants was determined by
cutting off all above soil plant material, quickly removing any
soil attached, and weighing on electronic scales (Ohaus Corp. New
Jersey. USA). The plant material was then frozen in liquid
nitrogen. All plant harvesting and weight measurements were taken
as close as practicable to the middle of the photoperiod. Where
trait data were combined for replicate sets of plants grown at
different times, the data were weighted to correct for differences
in absolute growth rates between the replicates caused by
environmental effects. The mean weight for each of the 14 parent
accessions and 13 hybrids was calculated for each of the three
growth replicates. These were then normalised to the first
replicate mean, to take account of any between-occasion variation
in the growth conditions. This was done by dividing each
replicate mean by the first replicate mean and then multiplying
by itself (for example [a/b]*b) in order to obtain the adjusted
mean.
RNA extraction and hybridisation
200mg of plant tissue was ground to a fine powder using liquid
nitrogen in a baked, pre-cooled mortar and, using a chilled
spatula, transferred to a labelled, chilled 1.5ml tube. To these
tubes 1ml of TRI Reagent (Sigma-Aldrich, Saint Louis USA) was
added, then shaken to suspend the tissue. After a 5 minute
incubation at room temperature 0.2ml of chloroform was added, and
thoroughly mixed with the TRI Reagent by inverting the tubes for
around 15 seconds, followed by 2-3 minutes incubation at room
temperature. The tubes were centrifuged at 12000rpm for 15
minutes and the upper aqueous phase transferred to a clean,
labelled tube. 0.5ml of isopropanol was then added to the tubes,
which were inverted repeatedly for 30 seconds to precipitate the
RNA, followed by a 10 minute incubation at room temperature. The
tubes were then centrifuged at 12000rpm for 10 minutes at
4°C, revealing a white pellet on the side of the tube. The
supernatant was poured off of the pellet, and the lip of the tube
gently blotted with tissue paper. 1ml 75% ethanol was added and
the tubes shaken to detach the pellet from the side of the tube,
followed by centrifugation at 7500rpm for 5 minutes. Again the
supernatant was poured off of the pellet, which was quickly spun
down again and any remaining liquid removed using a pipette. The
pellet was then dried in a laminar flow hood, before 50μl DEPC
treated water (Severn Biotech Ltd. Kidderminster, UK) was added
to dissolve the pellet.
Sample concentrations were determined using an Eppendorf
BioPhotometer (Eppendorf UK Limited. Cambridge. UK), and RNA
quality was determined by running out 1μl on a 1% agarose gel for
1 hour. RNA from replicated plants was then pooled according to
concentration in order to ensure an equal contribution of each
replicate.
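Pooling by concentration amounts to taking, from each replicate, the volume that contains the same mass of RNA. A minimal sketch of that arithmetic follows; the function name, the target mass per replicate and the example concentrations are assumptions, as the text does not give them.

```python
def pooling_volumes(concentrations_ug_per_ul, mass_per_replicate_ug):
    """Volume (ul) to take from each replicate so each contributes equal RNA mass."""
    return [mass_per_replicate_ug / c for c in concentrations_ug_per_ul]

# Hypothetical concentrations (ug/ul) and target mass (ug), for illustration only.
volumes = pooling_volumes([1.0, 2.0, 0.5], mass_per_replicate_ug=2.0)
```

A more dilute replicate therefore contributes a proportionally larger volume to the pool.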
The pooled samples were then cleaned using Qiagen RNeasy columns
(Qiagen Sciences. Maryland. USA) following the protocol on page
79 of the RNeasy Mini Handbook (06/2001), before again
determining the concentrations using an Eppendorf BioPhotometer,
and running out 1μl on a 1% agarose gel.
Affymetrix GeneChip array hybridisation was carried out at the
John Innes Genome Lab (http://www.jicgenomelab.co.uk). All
protocols described can be found in the Affymetrix Expression
Analysis Technical Manual II (Affymetrix Manual II
http://www.affymetrix.com/support/technical/manuals.affx.)
Following clean up, RNA samples, with a minimum concentration of
1 μg μl-1, were assessed by running 1μl of each RNA sample on
Agilent RNA6000nano LabChips® (Agilent Technology 2100
Bioanalyzer Version A.01.20 SI211). First strand cDNA synthesis
was performed according to the Affymetrix Manual II, using 10 μg
of total RNA. Second strand cDNA synthesis was performed
according to the Affymetrix Manual II with the following minor
modifications: cDNA termini were not blunt ended and the
reaction was not terminated using EDTA. Instead, double-stranded
cDNA products were immediately purified following the "Cleanup of
Double-Stranded cDNA" protocol (Affymetrix Manual II). cDNA was
resuspended in 22μ1 of RNase free water.
cRNA production was performed according to the Affymetrix Manual
II with the following modifications: 11 μl of cDNA was used as a
template to produce biotinylated cRNA using half the recommended
volumes of the ENZO BioArray High Yield RNA Transcript Labelling
Kit. Labelled cRNAs were purified following the "Cleanup and
Quantification of Biotin-Labelled cRNA" protocol (Affymetrix
Manual II). cRNA quality was assessed on Agilent RNA6000nano
LabChips® (Agilent Technology 2100 Bioanalyzer Version A.01.20
SI211). 20 μg of cRNA was fragmented according to the Affymetrix
Manual II.
High-density oligonucleotide arrays (either Arabidopsis ATH1
arrays, or AT Genome1 arrays, Affymetrix, Santa Clara, CA) were
used for gene expression detection. Hybridisation overnight at
45°C and 60RPM (Hybridisation Oven 640), washing and staining
(GeneChip® Fluidics Station 450, using the EukGEws2_450 Antibody
amplification protocol) and scanning (GeneArray® 2500) was
carried out according to the Affymetrix Manual II.
Microarray suite 5.0 (Affymetrix) was used for image analysis and
to determine probe signal levels. The average intensity of all
probe sets was used for normalization and scaled to 100 in the
absolute analysis for each probe array. Data from MAS 5.0 was
analysed in GeneSpring® software version 5.1 (Silicon Genetics,
Redwood City, CA).
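The scaling step described above can be sketched as follows. This is an illustration only, assuming a plain arithmetic mean of all probe-set signals on one array, as the text describes; the internal MAS 5.0 implementation may differ in detail (for example in how extreme values are treated), and the function name and signal values are hypothetical.

```python
def scale_to_target(signals, target=100.0):
    """Multiply every probe-set signal on one array by a common factor so
    that the mean signal of the array equals `target`."""
    mean_signal = sum(signals) / len(signals)
    if mean_signal <= 0:
        raise ValueError("mean signal must be positive")
    factor = target / mean_signal
    return [s * factor for s in signals]


# Hypothetical raw signals for three probe sets on one array.
scaled = scale_to_target([50.0, 150.0, 400.0])
print(scaled)
```

Because each array is scaled independently to the same target, overall brightness differences between arrays are removed before any gene-by-gene comparison.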
Identification of genes with non-additive transcript abundance in
hybrids
Analysis of the normalised transcript abundance data was
performed using GenStat [70]. This was undertaken using a script
of directives programmed in the GenStat command language (see
below), and used to identify the set of defined patterns of
transcript abundance. Briefly, each hybrid transcript abundance
data set was compared to its appropriate parental data sets, for
each gene, for each of the particular expression patterns of
interest. Those genes showing a particular pattern in each data
set were given a test value. Once completed, all of these values
were added together and only those genes with a combined test
value equal to a given critical value (equivalent to the value
if all data sets displayed that pattern) were counted. Once this
had been completed for the experimental data, the results were
checked by hand against the source data.
Program 1 below is an example of the pattern recognition
programme. This example identifies patterns in the KoBr hybrid
and its parents, for three replicates of each at the two-fold
threshold criteria.
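The test-value logic described in this section can be sketched in Python as follows. This is not the GenStat Program 1 referred to above; it is a minimal illustration under stated assumptions: a single pattern (hybrid at least two-fold higher than both parents), a test value of 1 per replicate, and replicates compared in matched order. All names and data are hypothetical.

```python
def genes_matching_pattern(hybrid_reps, parent_a_reps, parent_b_reps, fold=2.0):
    """Return indices of genes whose hybrid signal is at least `fold` times
    both parental signals in every replicate.

    Each argument is a list of replicates; each replicate is a list of
    per-gene signals in the same gene order.
    """
    n_genes = len(hybrid_reps[0])
    critical_value = len(hybrid_reps)  # reached only if all replicates agree
    matching = []
    for g in range(n_genes):
        test_value = 0
        for h, a, b in zip(hybrid_reps, parent_a_reps, parent_b_reps):
            if h[g] >= fold * a[g] and h[g] >= fold * b[g]:
                test_value += 1  # this replicate shows the pattern
        if test_value == critical_value:
            matching.append(g)
    return matching


# Hypothetical signals: 3 replicates, 3 genes each.
hybrid = [[10, 5, 8], [12, 5, 3], [11, 6, 9]]
parent_a = [[4, 5, 2], [5, 5, 2], [5, 6, 2]]
parent_b = [[3, 4, 4], [6, 4, 1], [2, 5, 4]]
matches = genes_matching_pattern(hybrid, parent_a, parent_b)
print(matches)
```

Requiring the summed test value to equal the critical value means that a gene is counted only when the pattern is reproduced in every replicate, which is the stringency the text describes.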
Permutation analysis to calculate expected values for non-
additive transcript abundance in hybrids
Due to the relatively limited replication within the experiment
and the large number of genes assayed on the GeneChips it is
expected that a proportion of the genes displaying defined
patterns will have occurred by chance. It is therefore essential
to use appropriate statistical analysis of the data to determine
the significance of the results. In order to determine this,
random permutation analysis (bootstrapping) was used to generate
expected values for random occurrences of defined abundance
patterns of the data. Pseudoreplicate data sets were generated by
randomly sampling the original data within individual arrays, and
using a rotating 'seed number' in order to create random data
sets of the same size, and variance, as the original. The same
pattern recognition directives were then used for this random
data set as were used on the original data and the resulting
numbers of probes were recorded.
In order to get a statistically significant number of randomized
replicates, this randomization and analysis of the data was
repeated 250 times. The average numbers of probes identified for
each pattern were then used as the value that would be expected
to arise by random chance for that pattern. It was determined
that 250 cycles was a sufficiently large random data set, for
this experiment by comparing the expected random averages of the
defined patterns at 1.5 fold, at 50 cycles and at 250 cycles.
Comparisons between higher numbers of cycles (500-1000 cycles)
exhibited very little difference between the means except that
the longer runs served to reduce the standard errors. A Wilcoxon
matched-pairs two-tailed test on the means of the two
repetition levels (50 cycles and 250 cycles) gave a P-value of
0.674, providing no evidence that the means are
statistically different from each other. Based on this it was
assumed that the average random values will not change
significantly with increased replication, and that 250 cycles is
a sufficiently large number of replicates to generate this mean
random value in this case.
Program 2 below is an example of the bootstrapping programme.
This example bootstraps the KoBr hybrid at the two-fold threshold
criteria, for 250 repetitions.
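The permutation procedure can be sketched in Python as follows. This is not the GenStat Program 2 referred to above. It assumes that "randomly sampling within individual arrays" is a shuffle of each array's values (which preserves the size and variance of each array), that the "rotating seed number" is a seed incremented on each cycle, and it uses a toy pattern function; all names are hypothetical.

```python
import random


def expected_by_permutation(arrays, count_fn, cycles=250, first_seed=1):
    """Average the number of genes flagged by `count_fn` over `cycles`
    pseudoreplicate data sets, each made by shuffling the values within
    every array independently."""
    total = 0
    for cycle in range(cycles):
        rng = random.Random(first_seed + cycle)  # a new seed on every cycle
        shuffled = []
        for array in arrays:
            values = list(array)
            rng.shuffle(values)
            shuffled.append(values)
        total += count_fn(shuffled)
    return total / cycles


def count_twofold_up(arrays, fold=2.0):
    """Toy pattern: hybrid (arrays[0]) at least `fold` times both parents
    (arrays[1] and arrays[2]) for the gene at the same index."""
    hybrid, parent_a, parent_b = arrays
    return sum(1 for h, a, b in zip(hybrid, parent_a, parent_b)
               if h >= fold * a and h >= fold * b)


# Hypothetical arrays of six genes each.
toy = [[8, 1, 9, 2, 7, 3], [2, 3, 1, 4, 2, 5], [1, 2, 3, 2, 4, 1]]
expected = expected_by_permutation(toy, count_twofold_up, cycles=250)
print(expected)
```

The value returned is the mean count expected by chance for the pattern, which is the quantity compared with the observed count in the significance test.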
Chi2 tests for significance of transcriptome remodelling
Fold changes in themselves are not statistical tests, and cannot
be used alone to designate a confidence level of the reported
differences in expression. The average numbers of probes
identified for each pattern after permutation analysis represent
the number expected to arise by random chance for that pattern.
Once this expected value has been determined it can be used in a
maximum likelihood Chi square test, under the null hypothesis of
no difference between observed and expected, in order to
determine whether the observed patterns differ significantly from
random chance. This was undertaken using the "Chi-Square goodness
of fit" option of GenStat, and testing the difference between the
mean number of genes observed fitting a given expression pattern,
and the mean number of genes expected to fit that same pattern
(as calculated above), with a single degree of freedom.
Significant relationships, fitting the alternative hypotheses of
significant differences between the two mean values, were
considered to be those exhibiting P values of 0.05 or less.
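A likelihood-ratio chi-square test with a single degree of freedom can be sketched as follows. The exact layout used by the GenStat option is not given in the text, so this sketch assumes two categories (genes fitting the pattern and genes not fitting it) out of the total number of genes assayed; the function name and the example counts are hypothetical.

```python
import math


def g_test_one_df(observed, expected, total):
    """Likelihood-ratio chi-square comparing the observed and expected
    numbers of genes fitting a pattern, out of `total` genes assayed.
    Two categories (fits / does not fit) give one degree of freedom."""
    obs = [observed, total - observed]
    exp = [expected, total - expected]
    g = 0.0
    for o, e in zip(obs, exp):
        if e <= 0:
            raise ValueError("expected counts must be positive")
        if o > 0:
            g += 2.0 * o * math.log(o / e)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Upper tail of chi-square with 1 df: P(X > g) = erfc(sqrt(g / 2))
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value


# Hypothetical counts: 120 genes observed, 60 expected by chance.
g_stat, p = g_test_one_df(observed=120, expected=60.0, total=22810)
print(g_stat, p, p <= 0.05)
```

The closed-form tail probability holds only for one degree of freedom, which matches the single degree of freedom stated in the text.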
Normalisation of transcriptome remodelling
Transcriptome remodelling was calculated, normalised for the
divergence of the transcriptomes of the parental accessions,
using the equation:
NT = RT / (Rp / Rpm)
Where NT = normalised level of transcriptome remodelling of a
cross
RT = total number of genes summed across all 6 classes indicative
of remodelling for the specific hybrid, at the appropriate fold-
level
Rp = total number of genes with transcript abundance differing
between the parental accessions of the specific hybrid, at the
appropriate fold-level.
Rpm= Mean number of genes with transcript abundance differing
between the parental accessions across all combinations analysed,
at the appropriate fold-level.
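The normalisation can be written as a short function. This is a sketch only; the argument names mirror the symbols defined above and the example counts are hypothetical.

```python
def normalised_remodelling(r_t, r_p, r_pm):
    """NT = RT / (Rp / Rpm).

    r_t  : genes in the remodelling classes for this hybrid
    r_p  : genes differing between this hybrid's two parents
    r_pm : mean of r_p across all parental combinations analysed
    """
    if r_p <= 0 or r_pm <= 0:
        raise ValueError("r_p and r_pm must be positive")
    return r_t / (r_p / r_pm)


# Hypothetical counts: parents twice as divergent as the average pair.
nt = normalised_remodelling(r_t=300, r_p=2000, r_pm=1000)
print(nt)
```

A hybrid whose parents are more divergent than average (Rp > Rpm) has its remodelling count scaled down, and one with unusually similar parents has it scaled up.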
Estimation of Relative Genetic Distance
In order to develop a measure of the Relative Genetic Distance
(RGD) between accession Ler and the 13 accessions crossed with it
to produce hybrids the following method was used. A set of 216
loci were selected that were polymorphic for the 14 main
accessions studied in this thesis. These were downloaded from the
web site of the NSF 2010 project DEB-0115062
(http://walnut.usc.edu/2010/). Loci were selected to cover the
genome by defining 500 kb intervals throughout the genome,
starting at base pair 1 on each chromosome, and selecting the
polymorphic locus with the lowest base pair coordinate that has a
complete set of sequence data for all 14 accessions, if any, in
each interval. The number of polymorphisms across these 216 loci
between each accession and Ler was determined and normalised
relative to the polymorphism rate observed between Ler and
Columbia (with 45 polymorphisms, the most similar to Ler) to give
the RGD.
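The locus selection and the normalisation can be sketched as follows. The data structures and function names are assumptions made for illustration; only the 500 kb interval size and the count of 45 polymorphisms for Columbia come from the text, and the count for the second accession is hypothetical.

```python
def select_loci(loci, interval_bp=500_000):
    """loci: iterable of (chromosome, position_bp, complete) tuples, where
    `complete` is True if sequence data exist for all accessions.
    Returns, for each interval on each chromosome, the complete locus with
    the lowest coordinate; intervals with no complete locus are skipped."""
    chosen = {}
    for chrom, pos, complete in loci:
        if not complete:
            continue
        key = (chrom, (pos - 1) // interval_bp)  # first interval starts at bp 1
        if key not in chosen or pos < chosen[key][1]:
            chosen[key] = (chrom, pos)
    return [chosen[k] for k in sorted(chosen)]


def relative_genetic_distance(polymorphisms, reference="Columbia"):
    """Divide each accession's polymorphism count against Ler by the count
    for the reference accession."""
    ref = polymorphisms[reference]
    return {acc: n / ref for acc, n in polymorphisms.items()}


loci = [(1, 600_000, True), (1, 10, False), (1, 400, True),
        (1, 200, True), (1, 500_001, True), (2, 5, True)]
selected = select_loci(loci)
rgd = relative_genetic_distance({"Columbia": 45, "AccessionX": 90})
print(selected, rgd)
```

With this normalisation Columbia has an RGD of 1 by construction, and every other accession is expressed as a multiple of the Ler to Columbia distance.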
Regression analysis to identify genes with transcript abundance
in hybrid lines correlated with the strength of heterosis
In order to identify genes showing a significant linear
relationship between strength of heterosis and transcript
abundance in hybrid lines, regression analysis was undertaken
using a script of directives programmed in the GenStat command
language. This programme conducted a linear regression, for the
transcript abundance of each probe, against the phenotypic value
for 32 GeneChips. There were three replicate GeneChips for each
of the hybrids LaAg, LaCt, LaCv, LaGy, LaKo, and LaMz, and two
replicates each for LaBr, LaCo, LaGa, LaNo, LaSo, LaTs, and LaWt,
each representing the pooled RNA of three individual hybrid
plants. The results of these regressions were presented as F-
values. Once this had been completed for the experimental data,
significant results were checked by hand against the source data.
Program 3 below is an example of the linear regression programme.
This example identifies linear regressions between the hybrid
transcriptome and MPH.
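A per-probe regression producing F-values can be sketched in Python as follows. This is not the GenStat Program 3 referred to above; it is a minimal illustration using ordinary least squares, and the function names and data are hypothetical. For a regression with a single explanatory variable the F-value is the same whichever of the two variables is treated as the response.

```python
def regression_f(x, y):
    """Return (F, r_squared) for the simple linear regression of y on x.
    F has 1 and n - 2 degrees of freedom."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # no variation, so no relationship to test
    r_squared = (sxy * sxy) / (sxx * syy)
    if r_squared >= 1.0:
        return float("inf"), 1.0  # perfect fit
    return r_squared * (n - 2) / (1.0 - r_squared), r_squared


def f_values_for_probes(phenotype, probes):
    """probes: list of per-probe signal lists, one value per GeneChip."""
    return [regression_f(phenotype, signals)[0] for signals in probes]


# Hypothetical phenotype values and two probes across five chips.
phenotype = [1, 2, 3, 4, 5]
f_values = f_values_for_probes(phenotype, [[1, 3, 2, 5, 4], [2, 2, 2, 2, 2]])
print(f_values)
```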
Once this had been completed for the transcription data,
permutation analysis was used to determine how often a particular
regression line would arise by random chance. The data was
randomised within individual arrays, using a rotating 'seed
number' and the regression analyses were repeated for this random
data, using the same directives used for the original data. In
order to get a statistically significant number of random
replicates, this randomisation and analysis of the data was
repeated 1000 times. Following this, the 1000 regression values
for each gene were ranked according to the probability of a
relationship between the phenotypic values and random expression
values, and the F values of the first, tenth and fiftieth values
(corresponding to the 0.1%, 1% and 5% significance values) were
recorded. The probabilities of the actual and randomised samples
were then compared and only those genes where the probability of
occurring randomly is less than in the actual data at one of the
three significance values were counted as showing a significant
relationship.
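The ranking step can be sketched as follows. This is an illustration, not the GenStat Programs 4 and 5 referred to below. It assumes that ranking by the smallest probability is equivalent to ranking by the largest F-value (true when every regression has the same degrees of freedom) and that a gene is counted when its actual F-value exceeds the recorded threshold; the names are hypothetical.

```python
def permutation_thresholds(random_f_values):
    """Given the F-values from 1000 randomised regressions for one gene,
    return the 1st, 10th and 50th largest, taken as the 0.1%, 1% and 5%
    significance thresholds."""
    if len(random_f_values) != 1000:
        raise ValueError("expected 1000 randomised F-values")
    ranked = sorted(random_f_values, reverse=True)
    return {"0.1%": ranked[0], "1%": ranked[9], "5%": ranked[49]}


def significance_level(actual_f, thresholds):
    """Return the most stringent level the actual F-value exceeds, or
    None if it exceeds none of them."""
    for level in ("0.1%", "1%", "5%"):
        if actual_f > thresholds[level]:
            return level
    return None


# Hypothetical randomised F-values: 0.0, 1.0, ..., 999.0.
thresholds = permutation_thresholds([float(i) for i in range(1000)])
print(thresholds, significance_level(995.0, thresholds))
```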
Program 4 below is an example of the linear regression
bootstrapping programme. This example randomises linear
regressions between the hybrid transcriptome and MPH. Due to the
size of the outputs, the files are saved into intermediary files
that can be read by the computer but not opened visually.
Program 5 below is an example of the programme written to extract
the significant values out of the bootstrapping intermediary data
files, into a file that can be manipulated in Excel. Again this
example handles linear regression data between the hybrid
transcriptome and MPH.
Regression analysis to identify genes with transcript abundance
in parental lines correlated with the strength of heterosis
In order to identify genes showing a significant linear
relationship between strength of heterosis and transcript
abundance in parental lines, regression analysis was undertaken
as described for the identification of genes with transcript
abundance in hybrids correlated with the strength of heterosis.
Example 6: A transcriptomic approach to modelling and prediction
of hybrid vigour and other complex traits in maize
Modelling and prediction of heterosis in maize
The experimental design uses a series of 15 different hybrid
maize lines, all with line B73 as the maternal parent. The
hybrids and parental lines were grown in replicated trials at
three locations (two in North Carolina and one in Missouri) in
2005, and data were collected for heterosis and a range of other
traits, as listed below. All 31 lines (15 hybrids and 16
parents) were grown for 3 weeks and aerial tissues cut, weighed
and frozen in liquid nitrogen. RNA was prepared and Affymetrix
maize GeneChips were used to analyse the transcriptome in 2
replicates of each. The methods successfully developed in
Arabidopsis, as described above, were used to (i) identify genes
with transcript abundance correlated with the magnitude of
heterosis, (ii) develop predictive models using the transcriptome
data from 12 or 13 hybrids and the corresponding parents and
(iii) test the ability of the models to "predict" the performance
of additional hybrids, based only upon their transcriptome
characteristics.
Genes whose transcript abundance was shown to correlate with
heterosis in maize are shown in Table 19. Heterosis was
calculated for plant height, for plants at CLY location (Clayton,
North Carolina) only (model from 13 hybrids).
These data were used to develop a model for prediction of
heterosis in two further hybrids. All of the genes used in
producing the calibration line were used in the
prediction, both for the model development and the further "test"
plants.
Prediction of heterosis for plant height, CLY location only
(model from 13 hybrids to predict 2):
The same procedures can be used, to develop predictive models for
each of the additional traits for which complete data sets are
available. For maize, the data from 14 inbred lines (used as
parents of the hybrids described above) can be used to develop
models for prediction of traits in further inbred lines.
The following traits may be measured in maize: yield; grain
moisture; plant height; flowering time; ear height; ear length;
ear diameter; cob diameter; seed length; seed width; 50 kernel
weight; 50 kernel volume.
Genes with transcript abundance correlating with yield, measured
as harvestable product, are shown in Table 20. Average yield was
calculated for 12 plants across 2 sites, MO and L.
These genes were used to develop a model for prediction of yield
in three further hybrids. All of the genes used in producing the
calibration line were used in the prediction, both for
the model development and the further "test" plants.
Rank order of yield was successfully predicted in these hybrids,
and the magnitude was accurate for 2 out of the 3 hybrids, shown
below. With improved trait data, accurate predictions would be
expected for all hybrids.
Prediction of average yield across 2 sites, MO and L (model from
12 hybrids to predict 3)
Example 6a: Prediction of plot yield in maize hybrids using
parental transcriptome data
We used linear regression to identify genes for which expression
levels in a training dataset of 20 genetically diverse inbred
lines (B97, CML52, CML69, CML228, CML247, CML277, CML322, CML333,
IL14H, Ki11, Ky21, M37W, Mo17, Mo18W, NC350, NC358, Oh43, P39,
Tx303, Tzi8) were correlated with the plot yield of the
corresponding hybrids with line B73. Pedigrees and phylogenetic
grouping [72] of the maize lines used in our studies are summarised
in Table 21.
Using a stringent cut-off for significance (P < 0.00001),
correlations (0.288 < r² < 0.648) were identified for 186 genes.
These are listed in Table 22. In the majority of cases (129),
gene expression in the inbred lines was negatively correlated
with yield of the hybrids. We were able to discount the
possibility that these correlations were artefacts of differing
proportions of cell types in different sizes of plants, which may
have arisen if the sizes of the inbred seedlings were indicative
of the performance of the corresponding hybrids, as we found no
correlation between plot yield and either the weight (r² = 0.039)
or the height (r² = 0.001) of the sampled seedlings of the
corresponding parental lines.
To assess whether gene expression characteristics may be used
successfully for the prediction of yield, each hybrid in turn was
removed from the training dataset and models developed based upon
a regression conducted with the remaining lines. This was
conducted as for A. thaliana, except that the mean of the
predictions for all of the genes with highly significant
correlation (P < 0.00001) was used as the overall prediction of
heterosis for the excluded line. The numbers of genes exceeding
this significance threshold varied from 84 (with P39 excluded) to
262 (with NC350 excluded). Gene expression data for a test
dataset of four additional inbred lines (CML103, Hp301, Ki3,
OH7B) was then used to predict the heterosis that would be shown
by the corresponding hybrids with B73, by averaging the
predictions from each of the 186 genes identified by regression
analysis using the complete training dataset. The results showed
that the predicted plot yield is strongly correlated with the
measured plot yield (r² = 0.707), demonstrating that gene
expression characteristics can, indeed, be used for the
prediction of heterosis, as quantified by yield. Although the
relationship was non-linear, with reduced ability to
quantitatively predict yields at the higher end of the range
studied, the method was able to correctly resolve the two highest
yielding hybrids in the test dataset from the two lowest yielding
hybrids. The poor yield performance of hybrids including the
popcorn (Hp301) and the two sweet corns (IL14H and P39) was
correctly predicted, but the exceptionally high yield of the
hybrid NC350 x B73 was not predicted. We conclude that maternal
effects are minor, as the analysis was based on a mixture of
crosses with B73 as the maternal parent (15 hybrids) and as the
paternal parent (9 hybrids).
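The leave-one-out procedure and the averaging of per-gene predictions can be sketched as follows. This is a simplified illustration: it assumes that the set of predictive genes is already chosen, whereas in the procedure described above the genes were re-selected at P < 0.00001 each time a line was excluded. All names and data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("expression values do not vary")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_trait(train_expr, train_trait, test_expr):
    """train_expr: {gene: [expression in each training line]}
    test_expr:  {gene: expression in the line to be predicted}
    Returns the mean of the per-gene predictions."""
    predictions = []
    for gene, values in train_expr.items():
        slope, intercept = fit_line(values, train_trait)
        predictions.append(intercept + slope * test_expr[gene])
    return sum(predictions) / len(predictions)


def leave_one_out(expr, trait):
    """Predict each line in turn from a model built on the others."""
    results = []
    for i in range(len(trait)):
        train_expr = {g: v[:i] + v[i + 1:] for g, v in expr.items()}
        train_trait = trait[:i] + trait[i + 1:]
        test_expr = {g: v[i] for g, v in expr.items()}
        results.append(predict_trait(train_expr, train_trait, test_expr))
    return results


# Hypothetical data: one positively and one negatively correlated gene.
expr = {"gene_a": [1.0, 2.0, 3.0, 4.0], "gene_b": [8.0, 6.0, 4.0, 2.0]}
trait = [10.0, 20.0, 30.0, 40.0]
loo = leave_one_out(expr, trait)
print(loo)
```

Averaging over many genes means that no single regression line determines the prediction, which is the design described for the maize models.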
Growth and trait analysis of maize plants
Plants used for transcriptome analysis were grown from seeds for
2 weeks. Maize seeds were first imbibed in distilled water for 2
days in glasshouse conditions to break dormancy, before transfer
to peat and sand P7 pots. They were grown in long day glasshouse
conditions (16 hours photoperiod) at 22°C. Aerial parts above
the coleoptiles were excised, weighed and frozen in liquid
nitrogen. All plant harvesting and weight measurements were
taken as close as practicable to the middle of the photoperiod.
Plants for yield trials were grown in field conditions in
Clayton, NC in 2005. Forty plants of each hybrid were grown in
duplicate 0.0007 hectare plots. Yield was calculated as pounds
of grain harvested per plot, corrected to 15% moisture, as shown
in Table 23.
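The moisture correction can be illustrated as follows. The text does not give the formula used, so this sketch assumes the conventional dry-matter conversion, in which the harvested weight is rescaled so that the same dry matter is expressed at 15% moisture; the function name and example values are hypothetical.

```python
def yield_at_standard_moisture(harvest_weight_lb, moisture_pct,
                               standard_pct=15.0):
    """Rescale a harvested grain weight to the weight the same dry matter
    would have at `standard_pct` moisture (assumed conversion)."""
    if not 0 <= moisture_pct < 100:
        raise ValueError("moisture must be between 0 and 100 percent")
    return harvest_weight_lb * (100.0 - moisture_pct) / (100.0 - standard_pct)


# Hypothetical plot: 100 lb of grain harvested at 20% moisture.
corrected = yield_at_standard_moisture(100.0, 20.0)
print(round(corrected, 2))
```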
Example 7: A transcriptomic approach to modelling and prediction
of hybrid vigour and other complex traits in oilseed rape
Modelling and prediction of heterosis in oilseed rape
The experimental design uses a series of 14 different hybrid
oilseed rape restorer lines, all with line MSL 007 C (which is a
male sterile winter line and has been used for commercial hybrid
production) as the maternal parent. The hybrids and parental
lines were grown in Hohenlieth and Hovedissen in Germany and
Wuhan in China in 2004/5, and data for heterosis and a range of
other traits, as listed below, were collected. All 29 lines (14
hybrids and 15 parents) are grown for 3 weeks and aerial tissues
cut, weighed and frozen in liquid nitrogen. RNA is prepared and
Affymetrix Brassica GeneChips are used to analyse the
transcriptome in 3 replicates of each. The methods successfully
developed in Arabidopsis are used to (i) identify genes with
transcript abundance correlated with the magnitude of heterosis,
(ii) develop predictive models using the transcriptome data
from 12 hybrids and the corresponding parents and (iii)
demonstrate the ability of the models to "predict" the
performance of the 2 additional hybrids, based only upon their
transcriptome characteristics.
Traits measured in oilseed rape: seed yield; seed weight; seed
oil content; seed protein content; seed glucosinolates;
establishment; winter hardiness; spring development; flowering
time; plant height; standing ability.
Modelling and prediction of additional traits
Upon completion of heterosis modelling, the same procedures are
used to develop predictive models for each of the additional
traits for which complete data sets are available. For oilseed
rape, the data from 12 inbred lines (used as parents of the
hybrids described above) are used to develop models, which are
used to "predict" the traits in 2 further inbred lines. The
performance of the models is validated.
Example 8: Further data modelling techniques
Improvement of the models
The models developed in Arabidopsis utilize linear regression
approaches. However, non-linear approaches may enable the
identification of more comprehensive gene sets and, hence, more
precise models. Non-linear approaches are therefore incorporated
into the model development protocols. Additional opportunities
for refinement include weighting of the contribution of
individual genes and data transformations.
Development of reduced representation models
Although approaches based on the use of GeneChips or microarrays
may continue to be the preferred analytical platform for
commercialization, there are other methods available for the
quantitative determination of transcript abundance. Quantitative
PCR methods can be reliable and are amenable to some automation.
However, when such approaches are to be used, it is desirable to
identify a subset of genes (ideally under 10) that retain most of
the predictive power of the sets of genes used to date in the
models (70 for prediction of heterosis based on hybrid
transcriptomes, typically >150 for prediction of heterosis or
other traits based on inbred transcriptomes). Therefore, a
limited set of genes is identified by iterative testing of the
precision of predictions by progressively reducing the numbers of
genes in the models, preferentially retaining those with the best
correlation of transcript abundance with the trait.
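The iterative reduction can be sketched as follows. This is one possible reading of the procedure, assuming that genes are ranked by r² against the trait in the training lines and that precision is measured as the mean absolute error on held-out lines; the function names and data are hypothetical.

```python
def _sums(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


def r_squared(x, y):
    _, _, sxx, syy, sxy = _sums(x, y)
    return 0.0 if sxx == 0 or syy == 0 else (sxy * sxy) / (sxx * syy)


def fit_line(x, y):
    mx, my, sxx, _, sxy = _sums(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx


def reduction_curve(train_expr, train_trait, test_expr, test_trait, sizes):
    """For each model size k, keep the k genes best correlated with the
    trait and return {k: mean absolute prediction error on test lines}."""
    ranked = sorted(train_expr,
                    key=lambda g: r_squared(train_expr[g], train_trait),
                    reverse=True)
    curve = {}
    for k in sizes:
        lines = {g: fit_line(train_expr[g], train_trait) for g in ranked[:k]}
        errors = []
        for j, actual in enumerate(test_trait):
            preds = [b + m * test_expr[g][j] for g, (m, b) in lines.items()]
            errors.append(abs(sum(preds) / len(preds) - actual))
        curve[k] = sum(errors) / len(errors)
    return curve


# Hypothetical data: one well-correlated and one noisier gene.
train_expr = {"good": [1.0, 2.0, 3.0, 4.0], "noisy": [1.0, 3.0, 2.0, 4.0]}
curve = reduction_curve(train_expr, [1.0, 2.0, 3.0, 4.0],
                        {"good": [5.0], "noisy": [5.0]}, [5.0], sizes=[2, 1])
print(curve)
```

Plotting the error against the number of genes retained shows how small the gene set can become before predictive power is lost.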
Example 9: Standard Operating Instruction for the Analysis of
Gene Expression Data
This section provides detailed guidance for development and use
of predictive models using the program GenStat [70].
List of programmes
The following GenStat programmes may be used in accordance with
the invention and are suitable for analysing any Affymetrix based
expression data.
GenStat Programme 1 ~ Basic Regression Programme ~ Method 4
GenStat Programme 2 ~ Basic Prediction Regression Programme ~
Method 5
GenStat Programme 3 ~ Prediction Extraction Programme ~ Method 5
GenStat Programme 4 ~ Basic Best Predictor Programme ~ Method 7
GenStat Programme 5 ~ Basic Linear Regression Bootstrapping
Programme ~ Method 9
GenStat Programme 6 ~ Basic Linear Regression Bootstrapping Data
Extraction Programme ~ Method 9
GenStat Programme 7 ~ Basic Transcriptome Remodelling Programme
~ Method 10
GenStat Programme 8 ~ Dominance Pattern Programme ~ Method 11
GenStat Programme 9 ~ Dominance Permutation Programme ~ Method 11
GenStat Programme 10 ~ Transcriptome Remodelling Bootstrap
Programme ~ Method 12
Introduction
These standard operating procedures are designed to enable the
undertaking of gene expression analysis studies, from RNA
extraction through to advanced prediction.
The procedures are divided into 4 workflows, depending on the
type of analyses you wish to undertake. See Figure 1.
Workflow a) follows the basic first steps, common to all analyses
(methods 1-3), to the stage of predicting traits based upon
transcription profiles.
Workflow b) follows the recommended analysis procedure (based on
the latest analysis developments). It culminates in the
prediction of traits based on a subset of best predictor genes.
Workflow c) follows an alternative analysis procedure, used to
generate the prediction reported in my thesis, and includes a
bootstrapping step.
Workflow d) describes methods for analysing the degree of
transcriptome remodelling between hybrids and their parent lines.
All of these workflows are designed to be 'worked through' and
contain step-by-step instruction on how to complete the analysis.
a) Standard Protocols
Method 1, Extract RNA
This stage results in the production of good quality total RNA at
a concentration of between 0.2 and 1 μg μl-1 for hybridisation to
Affymetrix GeneChips. These methods are the same for both
Arabidopsis and Maize chips; for other species, contact
Affymetrix for their recommended methods.
1.1 Trizol RNA extraction
200mg of plant tissue were ground to a fine powder using liquid
nitrogen in a baked pre-cooled mortar and, using a chilled
spatula, transferred to a labelled, chilled, capped tube. To these
tubes 1ml of TRI REAGENT (Sigma-Aldrich, Saint-Louis USA) was
added and shaken to suspend the tissue. After a 5 minute
incubation at room temperature 0.2ml of chloroform was added, and
thoroughly mixed with the TRI REAGENT by inverting the tubes for
around 15 seconds, followed by 2-3 minutes incubation at room
temperature. The tubes were centrifuged at 12000rpm for 15
minutes and the upper aqueous phase transferred to a clean,
labelled tube.
0.5ml of isopropanol was then added to the tubes, which were
inverted repeatedly for 30 seconds to precipitate the RNA,
followed by 10 minutes incubation at room temperature. The tubes
were then centrifuged at 12000rpm for 10 minutes at 4°C,
revealing a white pellet on the side of the tube. The supernatant
was poured off the pellet, and the lip of the tube gently blotted
with tissue paper. 1ml 75% ethanol was added and the tubes shaken
to detach the pellet from the side of the tube, followed by
centrifugation at 7500rpm for 5 minutes. Again the supernatant
was poured off the pellet, which was quickly spun down again and
any remaining liquid removed using a pipette. The pellet was then
dried in a laminar flow hood, before 50 μl of DEPC-treated water
(Severn Biotech Ltd., Kidderminster, UK) was added to dissolve the
pellet.
1.2 RNA Clean-up
RNA samples were cleaned up using RNeasy® mini columns (Qiagen
Ltd, Crawley, UK), according to the protocol given in the RNeasy®
Mini Handbook (3rd edition 06/2001 pages 79-81). Due to the
maximum binding capacity, no more than 100 μg of RNA could be
loaded on to each column. In order to obtain as high a
concentration as possible during the elution step, 40 μl was used
and the eluate run through the column twice. This was followed by
a second 40 μl volume of DEPC-treated water in order to remove any
remaining RNA, which could be used to increase the amount of
clean RNA available, should further concentration be required.
If the concentration of the clean RNA was less than μg μl-1 a
further precipitation and dissolution can be performed using an
Affymetrix recommended method which can be found in the
Affymetrix Expression Analysis Technical Manual II
(http://www.affymetrix.com/support/technical/manuals.affx).
5 μl of 3 M NaOAc, pH 5.2 (or one tenth of the volume of the RNA
sample) was added to the RNA sample requiring concentrating,
together with 250 μl of 100% ethanol (or two and a half volumes of
the RNA sample). These were mixed and incubated at -20°C for at
least 1 hour. The samples were centrifuged at 12000 rpm in a
micro-centrifuge (MSE, Montana, USA) for 20 minutes at 4°C, and
the supernatant poured off leaving a white pellet. This pellet
was washed twice with 80% ethanol (made up with DEPC-treated
water), and air-dried in a laminar flow hood. Finally the pellet
was re-suspended in DEPC-treated water to a volume appropriate
to the required concentration.
Method 2, RNA Hybridisation
2.1 Hybridisation to GeneChips
Affymetrix GeneChip array hybridisation was carried out at the
John Innes Genome Lab (http://www.jicgenomelab.co.uk). All
protocols described can be found in the Affymetrix Expression
Analysis Technical Manual II (Affymetrix Manual II
http://www.affymetrix.com/support/technical/manuals.affx.)
Following clean up, RNA samples, with a concentration of between
0.2 and 1 μg μl-1, were assessed by running 1 μl of each RNA sample
on Agilent RNA6000nano LabChips® (Agilent Technology 2100
Bioanalyzer Version A.01.20 SI211). First strand cDNA synthesis
was performed according to the Affymetrix Manual II, using 10 μg
of total RNA. Second strand cDNA synthesis was performed
according to the Affymetrix Manual II with the following minor
modifications:
cDNA termini were not blunt ended and the reaction was not
terminated using EDTA. Instead, double-stranded cDNA products were
immediately purified following the "Cleanup of Double-Stranded
cDNA" protocol (Affymetrix Manual II). cDNA was re-suspended in
22 μl of RNase free water.
cRNA production was performed according to the Affymetrix Manual
II with the following modifications:
11 μl of cDNA was used as a template to produce biotinylated cRNA
using half the recommended volumes of the ENZO BioArray High
Yield RNA Transcript Labelling Kit. Labelled cRNAs were purified
following the "Cleanup and Quantification of Biotin-Labelled
cRNA" protocol (Affymetrix Manual II). cRNA quality was assessed
on Agilent RNA6000nano LabChips® (Agilent Technology 2100
Bioanalyzer Version A.01.20 SI211). 20 μg of cRNA was fragmented
according to the Affymetrix Manual II.
High-density oligonucleotide arrays were used for gene expression
detection. Hybridisation overnight at 45°C and 60RPM
(Hybridisation Oven 640), washing and staining (GeneChip®
Fluidics Station 450, using the EukGEws2_450 Antibody
amplification protocol) and scanning (GeneArray® 2500) was
carried out according to the Affymetrix Manual II.
Microarray suite 5.0 (Affymetrix) was used for image analysis and
to determine probe signal levels. The average intensity of all
probe sets was used for normalization and scaled to 100 in the
absolute analysis for each probe array. Data from MAS 5.0 was
analysed in GeneSpring® software version 5.1 (Silicon Genetics,
Redwood City, CA).
Method 3, Data Loading
This section describes the methods used to load the expression
data into GeneSpring, how to normalise the data, and how to save
it in Excel for further analysis. These instructions are best
followed while carrying out the analysis. A GeneSpring course is
recommended if further analysis is required using this programme.
3.1 Loading Data into GeneSpring
Open GeneSpring, > File > Import data > select the first of the
data files you wish to load > click Open
Choose file format - Affy pivot table
(Create new genome - if you don't want to go into an existing
one)
Select genome - Arabidopsis, Maize, etc, or create a new genome
following instructions on screen
Import data: selected files - select any remaining files you want
to analyse
Import data: sample attributes - this is where you can enter the
MIAME info
Import data: create experiment - yes. Save new experiment - give
it a name, it will appear in the experiment folder in the
navigator toolbar.
3.2 New experiment checklist
These 4 factors should be completed in turn, to ensure that the
data is properly normalised. This will impact upon all of the
subsequent analyses. Generally the defaults or recommended orders
should be used.
Define Normalisations
Click on 'use recommended order' and check that the following is
included:
Data transformation: measurements less than 0.01 to 0.01
Per chip: 50th %
Per gene: normalise to median, cut off = 10 in raw signal
Define Parameters
Here we define the names of the expression data. Depending upon
the labelling of the expression files, changes may not be
required here. If changes are required:
Click on 'New custom' Type the name of each sample.
Delete other parameters to avoid confusion.
Save
Define Default interpretation
No changes needed for this experiment
Define Error model
No changes needed for this experiment
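The three normalisation settings listed under "Define Normalisations" can be approximated in Python as follows. This is a rough sketch of what such steps typically do, not a reproduction of GeneSpring's implementation: it assumes that the data transformation raises values below 0.01 to 0.01, that "Per chip: 50th %" divides each chip by its median, and that "Per gene" divides each gene by its median across chips. The raw-signal cut-off of 10 is not modelled, and all names are hypothetical.

```python
from statistics import median


def normalise(chips, floor=0.01):
    """chips: list of arrays, each a list of per-gene signals in the same
    gene order. Returns the normalised values in the same layout."""
    # Data transformation: raise very small or negative values to the floor.
    floored = [[max(v, floor) for v in chip] for chip in chips]
    # Per-chip step: divide every value by that chip's median.
    per_chip = []
    for chip in floored:
        chip_median = median(chip)
        per_chip.append([v / chip_median for v in chip])
    # Per-gene step: divide each gene by its median across chips.
    result = [list(chip) for chip in per_chip]
    for g in range(len(per_chip[0])):
        gene_median = median(chip[g] for chip in per_chip)
        for chip in result:
            chip[g] = chip[g] / gene_median
    return result


# Hypothetical signals: three chips, three genes.
normalised = normalise([[1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [0.0, 3.0, 6.0]])
print(normalised)
```

After both steps a value of 1 means that the gene is at its typical level on that chip, so departures from 1 can be compared across genes and chips.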
3.3 Transfer Data into Excel
Once the data is normalised it can be transferred into an Excel
spreadsheet.
To do this, click on the relevant data in the experiment tree (on
the far left of the main GeneSpring screen)
Click View > view as spreadsheet
select all > copy all > paste into Excel spreadsheet.
Save.
This forms the master Excel chart.
Method 4, Regression Analysis
These instructions describe the basic regression method. This
regression forms the basis of the subsequent prediction methods.
4.1 Creating a data file for use in GenStat
Open the master Excel
file (with normalised expression data from GeneSpring) > Copy the
relevant data columns (the data for those accessions that will
form the 'training data set' from which significant predictive
genes will be selected) into a new chart> add a column of ":" at
the far end > save chart as .txt file>close file
Open the text file in GenStat> Enclose any title names in speech
marks (""), this should have the effect of turning the titles
green> Find and replace (ctrl R) * with blanks> Replace all> Save
file again
4.2 Regression Programme
Open 'basic regression programme' (GenStat Programme 1 ~ Basic
Regression Programme) in GenStat
Check that the input data filename is correct, and is opening to
channel 2
Check that the output data file is going to the correct
destination and is opening to channel 3. These input and output
file names should be RED
Check that the phenotypic trait data are correct for the trait
under investigation. Use "\" to go on to new lines; these
backslashes will turn GREEN.
Check that the number of genes to be investigated is set to the
correct value (usually 22810 for Arabidopsis, or 17734 for
Maize).
If the R2, Slope, and Intercept are required remove the "" from
the appropriate analysis section, and from the print command,
both will turn BLACK from green.
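For readers without GenStat, the quantities the regression programme reports per gene (Slope, Intercept and R2) can be sketched in Python as below. This is an illustration only, not the GenStat programme itself. It assumes that the trait value is the explanatory variable and the expression value the response, which is inferred from section 5.3 where the trait bins are described as X-axis values; the function name is hypothetical and the P value calculation is omitted.

```python
def regress(trait, expression):
    """Least-squares line expression = intercept + slope * trait.

    Returns (slope, intercept, r_squared). Hypothetical helper.
    """
    n = len(trait)
    mean_x = sum(trait) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in trait)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(trait, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R2 is the squared correlation; treat a flat response as no fit.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared
```

Running this once per gene over the training accessions would give one line per gene, in the same order as the expression file.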
4.3 Running the Programme
To run the programme, ensure that both the programme window and
output windows are open (to tile horizontally Alt+Shift+F4).
Select the programme window and press Ctrl+W. This will set the
programme running; check that the GenStat server icon (histogram
symbol, in taskbar at bottom right-hand corner of the screen) has
changed colour to red.
To cancel the programme right click on the server icon and choose
interrupt
Once complete the GenStat icon will change colour back to green
4.4 Analysing the Output
To analyse the data, first open it in Excel, select "delimited">
next> tick the "Tab" and "Space"> Finish
Add a new row at the top of the sheet, and label the appropriate
columns "P value", "Df" and "R square", plus "Slope" and
"Intercept" if these were included in the analysis
Add a new column to the beginning and label it "ID"
Fill the remaining cells of the ID column with a series 1-22810
for Arabidopsis or 1-17734 for Maize (edit>fill>series>OK)
Delete the column "Df"
Select all of the data columns> Data> Sort> P value ascending
Select all of the rows where the P value is less than or equal
to 0.05. Colour these cells using the "paint" option, and record
the number in this list. These are the genes significant at the
5% level
Select all of the rows where the P value is less than or equal
to 0.01. Colour these cells an alternative colour using the
"paint" option, and record the number in this list. These are the
genes significant at the 1% level
Select all of the rows where the P value is less than or equal
to 0.001. Colour these cells a third colour using the "paint"
option, and record the number in this list. These are the genes
significant at the 0.1% level
These three values are the number of OBSERVED significant probes
in the data set
These observed significant probes can be used as 'prediction
probes' for the prediction of traits in other accessions, or
hybrid combinations.
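The sorting and colouring steps of section 4.4 amount to counting how many P values fall at or below each threshold. A minimal Python sketch of that count is given here; the function names are hypothetical, and gene IDs are assumed to be the 1-based positions in the original file order, as in the "ID" column.

```python
def count_significant(p_values, levels=(0.05, 0.01, 0.001)):
    """Number of OBSERVED significant probes at each level."""
    return {level: sum(1 for p in p_values if p <= level) for level in levels}

def prediction_probes(p_values, level=0.001):
    """IDs (1-based, original order) of probes significant at `level`."""
    return [gene_id for gene_id, p in enumerate(p_values, start=1) if p <= level]
```

Because the IDs are tied to the original order, the list returned by `prediction_probes` can be used to pull the matching rows from any expression file that keeps that order.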
Method 5, Prediction
These instructions describe the basic prediction method. All
subsequent prediction methods are a variation on this.
5.1 Producing the Prediction Calibration Lines
Using the list of identified prediction probes, create a specific
prediction sub-set gene list. This can be done by copying your ID
and P-value columns (sorted by ID to return the data to its
original order) into a new Excel sheet along with the expression
data of your training line accessions. You can then sort by P-
value and delete those genes that do not appear in the relevant
significance (usually 0.1%) list. Remember to sort by ID again to
return the file to its correct order, then delete the ID and
Sig0.1% columns you added. Save this file under a new file name
as a .txt file (for example trainingsetdata.txt).
Open the 'Basic Prediction Regression Programme' (GenStat
Programme 2)
Check that the input file is the one that you have just created
Check that the output file is named correctly (calibration output
file)
Check that the number of genes is correct (for example the 0.1%
significant genes)
Check that the bin values are appropriate for the trait data.
These values should cover the range of the data and a little way
either side.
Save the file and run the programme (Ctrl+W)
5.2 Making the Test Expression File
To make the predictions use the identified prediction probes, and
the expression data of the 'unknown lines' for which we are
making the prediction of heterosis. Using the list of identified
prediction probes, create a specific prediction sub-set gene
list, as was done when generating the file for the calibration
curves (section 5.1). This can be done by copying your ID and
P-value columns (sorted by ID to return the data to its original
order) into a new Excel sheet along with the expression data of
your training line accessions. You can then sort by P-value and
delete those genes that do not appear in the relevant
significance (usually 0.1%) list. Remember to sort by ID again to
return the file to its correct order, then delete the ID and
Sig0.1% columns you added. Save this file under a new file name
as an Excel spreadsheet.
In this file add two blank columns between each of the data
columns. In the first column, next to the first unknown line's
expression measurement, insert a number series from 1 to however
long the list of gene measurements is. In the next column, list
the identifier for those measurements (the best identifier would
be the parent name, for instance Kas, B73 etc.).
In the first column next to the second data list type the command
"=B2+0.01". Then copy this down the column. This will have the
effect of giving a number series that is 0.01 greater than its
equivalent for the first parent. In the next column, list the
identifier for those measurements again.
Repeat this process for any remaining parent data sets. Each
number series should always be 0.01 greater than its equivalent
in the previous series.
Starting with the second set of data columns, cut all of the
genes, number series and identifiers, and add them to the bottom
of the first set of data columns. Be sure to use Edit> Paste Special>
Values so as not to upset your commands. Repeat this for the
remaining columns. You should now have three long columns with
all of the data in.
Select all of the data. Click Data>Sort>Column B (or whichever is
the column with the number sequence in). After sorting, you
should have all of your parental data mixed together, with all of
the same genes next to each other (for example, with three
parents your number sequence should read 1, 1.01, 1.02, 2, 2.01,
2.02 etc. and the identifier column should read Kas, Sha, Ll-0,
Kas, Sha, Ll-0 etc. or equivalent). Save the file. This is your
identifier file.
Copy only the column with the expression data into a new work
book. Delete all headings and add a column of colons ":". Save
the file as a .txt file. This is your 'Tester' data file. Ensure
that you close this file, as GenStat will not recognise the file
if open in Excel.
Open this file in GenStat, press Ctrl+R and in the 'Find What' box
type *, leaving the 'Replace With' box blank. Click 'Replace All'
then save this file. This is your test expression file.
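The spreadsheet manipulation in section 5.2 (number series offset by 0.01 per line, stacking, then sorting) has the sole purpose of interleaving the Tester lines gene by gene. The following Python sketch shows the same interleaving; the function name and input layout are hypothetical, and it assumes every line has the same genes in the same order.

```python
def interleave(expression_by_line):
    """expression_by_line maps Tester line name -> list of expression values.

    Returns (identifiers, tester): identifiers is a list of
    (number, line name) pairs, tester is the single expression column.
    """
    rows = []
    for offset, (name, values) in enumerate(expression_by_line.items()):
        for gene_no, value in enumerate(values, start=1):
            rows.append((gene_no, offset, name, value))
    # Sorting by gene number, then by line offset, mirrors sorting the
    # spreadsheet on the series 1, 1.01, 1.02, 2, 2.01, ...
    rows.sort(key=lambda row: (row[0], row[1]))
    identifiers = [(round(g + 0.01 * o, 2), name) for g, o, name, _ in rows]
    tester = [value for _, _, _, value in rows]
    return identifiers, tester
```

For example, three lines named Kas, Sha and Ll-0 with two genes each give the identifier order Kas, Sha, Ll-0, Kas, Sha, Ll-0, matching the layout described above.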
5.3 Running the Prediction File
Open the 'Prediction Extraction Programme' (GenStat Programme 3).
Check the variate "mpadv"; these are the X-axis values for the
calibration lines. Ensure that these are the same as the bin
values entered earlier (section 5.1).
Check the first input file. This should be the expression data of
your Tester lines (section 5.2).
Check the second input file. This should be the output file from
your calibration line (calibration output file- section 5.1).
Check that the "ntimes" command is the number of test genes
multiplied by the number of parents, therefore the total number
of genes in your test expression file.
Check that the "calc Z=Z+3" command is correct for your number of
Tester lines; for example, for four Tester lines this should read
"calc Z=Z+4".
Check that your "if (estimate)" commands are appropriate for the
range of your trait data. This is for the 'capped' prediction.
These should be set at 2 'bin sizes' beyond and below the bin
range, if appropriate.
Run the programme (Ctrl+W). This programme prints to the output
window, which should be saved as an output (out) file.
Note that it is normal for there to be error messages; if all of
the previous steps have been followed, ignore these.
5.4 Analysing the Output
Open your saved output file in Excel. Choose Delimited > Next and
tick the Tab and Space buttons.
Delete the writing found in the file until you reach the first
data point. Usually the first 60 lines.
Name the columns "No." "Cap" "Raw"
Scroll to the bottom and delete all of the messages you see
there.
Select all and sort by "No" ascending.
Check that you have the correct number of rows remaining. This
should equal the ntimes value from the Prediction Extraction
Programme (the number of prediction genes you have generated,
multiplied by the number of Tester lines you are predicting for).
Scroll to the bottom and delete all of the non-relevant
information you see there (for example "regvr=regms/resms" "code
CA" etc)
Delete any remaining warning messages, to the left and right of
the 'useful data.'
Open the identifier xls file you generated earlier. Copy the
Number series and Identifier columns in to your output file.
Select all (Ctrl+A) and sort by Identifier, this should separate
the data by parent name.
Cut and paste all of the parents into neighbouring columns (so
that they are next to each other).
Scroll to the bottom of the list. Under the cap column, enter the
command "=AVERAGE(B2:B203)" (Note: this command is based on 202
predictive genes; you should adjust this command to cover the
number of predictions for your gene set).
Copy this command to the bottom of all of your lists. You should
now have two predictions for each of your Tester lines, the
CAPPED and RAW prediction values.
These predictions can be used individually, or they can be
averaged between replicates of the same accessions.
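The way a CAPPED and a RAW prediction arise for one Tester line can be sketched in Python as follows. This is an assumption-laden illustration rather than a transcription of the GenStat programmes: it assumes each predictive gene has a calibration line of the form expression = intercept + slope * trait, that the per-gene estimate is obtained by inverting that line, and that capping simply clamps each estimate to chosen lower and upper limits. All names are hypothetical.

```python
def predict_line(expression, calibration, cap_low, cap_high):
    """Return (capped_mean, raw_mean) for one Tester line.

    expression: expression value per predictive gene.
    calibration: (slope, intercept) per gene, in the same order.
    cap_low, cap_high: limits applied to each per-gene estimate.
    """
    raw = []
    capped = []
    for value, (slope, intercept) in zip(expression, calibration):
        estimate = (value - intercept) / slope  # invert the calibration line
        raw.append(estimate)
        capped.append(min(max(estimate, cap_low), cap_high))
    return sum(capped) / len(capped), sum(raw) / len(raw)
```

The capped mean is less sensitive to a few genes whose estimates fall far outside the range of the trait data, which is the stated reason for setting the limits in section 5.3.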
b) Recommended Prediction Protocol
Method 6, N-1 Model
These instructions describe the first steps of the recommended
prediction protocol. The N-1 model is a modification of the basic
regression method, and uses the same GenStat programme; however,
this regression is repeated for each accession in the training
set.
6.1 Running the N-1 model
To undertake the N-1 model, prepare an expression file containing
all of the accessions you wish to use in your training set.
Run a basic regression (GenStat Programme 1 - Basic Regression
Programme) using all but one of these accessions. If you have
multiple replicates of the same accession, ensure that all are
removed.
Using the genes identified from this experiment, undertake a
prediction as described in Method 5, using the removed accession
as the tester line. Record the ID list of the predictive genes
(section 4.4), and the results of the RAW prediction for each
gene (as listed in section 5.4) for each replicate.
Repeat this process for all of the accessions in the training set,
until you have predicted each accession against a training set
containing all of the other accessions. These data can be used to
assess the overall accuracy of these predictions by plotting the
ACTUAL trait values against the predicted, or they can be used
for the later 'Best Predictor' prediction method.
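The leave-one-accession-out structure of this model can be made explicit with a short Python sketch. It only produces the training and tester splits; the regression and prediction themselves would still be run as in Methods 4 and 5. The function name and the (accession, replicate) sample layout are hypothetical.

```python
def n_minus_1_splits(samples):
    """Yield (held_out, training, testers) for each accession in turn.

    samples: list of (accession, replicate) tuples. All replicates of
    the held-out accession are removed from the training set.
    """
    accessions = []
    for accession, _ in samples:
        if accession not in accessions:
            accessions.append(accession)
    for held_out in accessions:
        training = [s for s in samples if s[0] != held_out]
        testers = [s for s in samples if s[0] == held_out]
        yield held_out, training, testers
```

Each accession therefore appears exactly once as the tester, and is never part of the training set used to select the genes that predict it.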
Method 7, Best Predictor
This programme calculates which genes consistently predict well
over a wide range of accessions and phenotypes. You can also use
the output to investigate the frequency of genes appearing in the
predictive lists, and thereby identify many noise genes.
7.1 Creating the data file
To create the data file first open a new Excel spreadsheet. In
the first column, paste the list of predictive gene IDs (the
numbers assigned at the regressions stage) from the first of the
N-1 accessions (section 6.1). In the next column paste the list
of predictions for these genes for this accession, as generated
in the prediction stage for that accession in the N-1 model. In
the third column at each stage paste the accession name, repeated
next to each gene in the list. In the fourth column type the
replicate number for that accession, if there is only one
replicate type 1. In the fifth type the actual trait value for
that accession.
Open the 'Basic Best Predictor Programme' (GenStat Programme 4)
Check that the names of the accessions are correctly listed.
Check that the number of replicates is correct (note these should
be written [values='chip 1','chip 2'] and so on for however many
replicates there are).
Check that the input file name is correct.
Run the programme (Ctrl+W). This programme prints to the output
window, which should be saved as an output (out) file.
7.3 Generating a Best Predictor File
Open your saved output file in Excel. Choose Delimited > Next and
tick the Tab and Space buttons.
Delete the copy of the programme in the output (first 31 lines or
so) at the top of the file, and the programme information at the
bottom of the file (last 8 lines).
Only the first 4 columns (gene, number, Delta, and se_delta) are
at the top of the file. Scroll half way down the sheet; there are
3 further columns (a repeat of gene, Ratio, and se_ratio) copy
these columns next to the 4 columns at the top of the sheet.
Ensure that the column names are gene, number, Delta, and
se_delta, gene, Ratio, se_ratio; respectively.
Delete the second 'gene' column.
Save the file. This file is your Best Predictor file
The information in the Best Predictor file is:
Gene Gene is the gene ID list of the predictive genes (section
4.4).
Number The number of occasions that each gene occurs in the
predictive gene lists of the N-1 model. Using this we can quickly
understand the distribution of this gene between gene lists from
the N-1 model (section 6.1). This information can be used to
quickly identify 'noise genes' by their low frequency in gene
lists.
Delta The Absolute Difference (AD) is the mean of the differences
between actual trait values and the values predicted for each
line in the model. The closer the AD to 0 the closer the
predictions are, on average, to the actual value. This value
gives a good 'feel' for how close a prediction is to the actual,
in relation to the trait of interest. For example, an AD of 4
might seem good if the trait was height in cm, and seem a fair
tolerance for a prediction, however if the trait was plot yield
in Kg, this value might be rather large.
se_delta The standard error of the Absolute Difference (seAD).
This value gives a measure of the variability of the prediction,
the smaller this value is the smaller the variability of the AD.
An ideal predictive gene will have a small AD and seAD.
Ratio Ratio of the Difference (RD). This is the mean of the Ratio
between actual trait values and the values predicted for each
line in the model. This value is a more universal measure of AD,
as all values are normalised to 1 (1 being a perfect match
between prediction and actual), and the closer to 1 a gene is the
better the gene appears to be for prediction. In theory this
should allow the predictive ability of a gene to be assigned,
independently of the trait value. For example, a particular gene
might have an AD of -0.12 for yield weight, but an RD of 0.98.
Saying that the gene is on average a 98% accurate predictor is
perhaps an easier concept to understand.
se_ratio The standard error of the Ratio of the Difference
(seRD). This value gives a measure of the variability of the
ratio of the prediction, the smaller this value is the smaller
the variability of the RD. An ideal predictive gene will have an
RD close to 1 and a small seRD.
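The five columns of the Best Predictor file can be approximated in Python as below. Several details are assumptions rather than statements about the GenStat programme: the difference is taken as actual minus predicted (signed, since a negative AD appears in the example above), the ratio is taken as predicted divided by actual, "number" is counted as the number of distinct held-out accessions in which the gene appears, and the standard error is the sample standard deviation divided by the square root of the number of records. All names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(values):
    # Standard error of the mean; defined here as 0.0 for a single value.
    return stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0

def best_predictor_table(records):
    """records: iterable of (gene_id, accession, predicted, actual)."""
    by_gene = {}
    for gene, accession, predicted, actual in records:
        by_gene.setdefault(gene, []).append((accession, predicted, actual))
    table = {}
    for gene, rows in by_gene.items():
        diffs = [actual - predicted for _, predicted, actual in rows]
        ratios = [predicted / actual for _, predicted, actual in rows]
        table[gene] = {
            "number": len({accession for accession, _, _ in rows}),
            "delta": mean(diffs),
            "se_delta": standard_error(diffs),
            "ratio": mean(ratios),
            "se_ratio": standard_error(ratios),
        }
    return table
```

A gene that over-predicts for one accession and under-predicts for another can have a Delta near 0 and a Ratio near 1 while still being variable, which is why the two standard errors are reported alongside the means.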
Using these parameters it is possible to generate a more accurate
gene list for the prediction of heterosis. This is a trial and
error process at present; experimenting with different
combinations of parameters will identify the best combination of
genes for that trait. At present the most consistent combination
of parameters for a good analysis has been a gene frequency of
ALL MODELS (the predictive gene must appear in all N-1 models),
and a Ratio (or RD) of >0.98 and <1.02.
In order to select the gene combination with the parameters of gene
frequency of all models, and an RD of >0.98 and <1.02, firstly
sort (data> sort) the Best Predictor file by 'number' with the
data descending. Before pressing 'OK' use the 'THEN BY' function
to sort the data by Ratio ascending. Press OK.
This will bring all of the most consistent genes to the top of
the worksheet. Select all of the genes that display an RD of
between 0.98 and 1.02.
To test whether this is a good predictor list, calculate the
average prediction for each accession and replicate for this best
predictor gene list, and plot these predictions against the
actual values for that trait.
An R2 value between 0.5 and 1 suggests that the gene list contains
genes that are good markers for predictions of that trait.
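The selection rule (gene present in all models, Ratio between 0.98 and 1.02) and the R2 check can be sketched in Python. The table layout, with one dictionary per gene holding "number" and "ratio", is a hypothetical stand-in for the Best Predictor file, and R2 is computed here as the squared correlation between predicted and actual values.

```python
def select_best(table, n_models, low=0.98, high=1.02):
    """Gene IDs that occur in every model with low < ratio < high."""
    return sorted(
        gene for gene, row in table.items()
        if row["number"] == n_models and low < row["ratio"] < high
    )

def r_squared(predicted, actual):
    """Squared correlation between predicted and actual trait values."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    spp = sum((p - mean_p) ** 2 for p in predicted)
    saa = sum((a - mean_a) ** 2 for a in actual)
    spa = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    return (spa * spa) / (spp * saa)
```

The thresholds are keyword arguments so that other parameter combinations can be tried, in keeping with the trial and error nature of the process.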
Method 8, Best Predictor-Prediction
8.1 Best Predictor Prediction
This method is a variation on the standard predictive method
(method 5), and uses the same GenStat programmes.
The only variation of this programme is to use the best predictor
gene list in place of the 0.1% P-value list, for generating the
training and tester files.
c) Alternative "Basic" Prediction Protocol
Method 9, Bootstrapping
These instructions describe the first steps of the alternative
prediction protocol. These methods are an addition to the basic
regression method, and use the same GenStat programmes for the
early stages. This Bootstrapping follows on directly from the
basic regression (method 4), but prior to the prediction, and
acts as an alternative method for identifying significant
'marker' genes. It works by generating a 'customised T-table'
that is specific for the experiment in question.
9.1 Regression Bootstrapping
Open the 'Basic Linear Regression Bootstrapping Programme'
(GenStat Programme 5) in GenStat
Check that the input data filename is correct, and is opening to
channel 2. This input file will be the same expression data file
used for the initial regression (section 4.1)
Check that the output data files are going to the correct
destinations and are opening to channels 2,3,4, and 5
Check that the numbers of genes to be analysed are correct for
each output file (for Arabidopsis ATH-1 GeneChips this will be
three files with 6000 genes and one with 4810), and that the
print directives are pointing to the correct channels
To run the programme, ensure that both the programme window and
output windows are open. Select the programme window and press
Ctrl+W. This will set the programme running; check that the
GenStat server icon (bottom right-hand corner of the screen) has
changed colour to red.
To cancel the programme right click on the server icon and choose
interrupt.
Once complete the GenStat icon will change colour back to green.
This programme can take many days to run due to the large number
of calculations, and produces output files totalling up to 430 MB,
so plenty of disk space is required. Once generated, the data
from this programme needs to be extracted.
9.2 Data Extraction Programme
Open the 'Basic Linear Regression Bootstrapping Data Extraction
Programme' (GenStat Programme 6) in GenStat
Check that the input files are correct (the output files from the
bootstrapping programme)
Run the programme (Ctrl+W)
This programme prints to the Output window. Save this window as
an .out file.
9.3 Analysing the Output
To analyse the data, first open it in Excel, select "delimited">
next> tick the "Tab" and "Space"> Finish
Delete the first 32 rows, all of the gaps (after 6000, 12000, and
18000 probes), and all the text at the end of the data file. The
data should be the same length as the regression file (for
Arabidopsis 22810 lines long).
Add a new row, and label the columns "boot@5%" "boot@1%" and
"boot@0.1%"
Add a new column to the beginning and label it "ID"
Fill the remaining cells of the ID column with a series 1-22810
(edit>fill>series>OK)
Copy all of these columns into the same sheet as the Observed
significant probes data set, generated from the initial
regression (section 4.4) with a one column gap
Leaving another single column gap, label three further columns
"sig@5%" "sig@1%" and "sig@0.1%". In the first cell in the column
"sig@5%" type "=E2-$B2". Copy this to all of the cells in the
three new columns.
9.4 Calculating Significance
Select all of the data columns> Data> Sort> Sig@5% descending
Select all of the cells in this row where the value is positive.
Colour these cells using the "paint" option, and record the
number in this list. These are the genes significant at the 5%
level
Select all of the data columns> Data> Sort> Sig@1% descending
Select all of the cells in this row where the value is positive.
Colour these cells using the "paint" option, and record the
number in this list. These are the genes significant at the 1%
level
Select all of the data columns> Data> Sort> Sig@0.1% descending
Select all of the cells in this row where the value is positive.
Colour these cells using the "paint" option, and record the
number in this list. These are the genes significant at the 0.1%
level
These results indicate whether or not the OBSERVED values differ
significantly from random chance. These lists of significant
genes can be used as markers, for the prediction of this trait as
described in Method 5.
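The comparison made by the "sig" columns can be illustrated in Python. This sketch assumes, from the spreadsheet formula in section 9.3, that each "sig" value is the bootstrapped threshold for a gene minus its observed P value, so that a positive result marks a gene whose observed P value is smaller than its experiment-specific threshold. The function name and list layout are hypothetical.

```python
def bootstrap_significant(observed_p, boot_threshold):
    """IDs (1-based) of genes whose threshold minus observed P is positive.

    observed_p: observed regression P value per gene.
    boot_threshold: bootstrapped threshold per gene at one level
    (for example the boot@5% column), in the same gene order.
    """
    return [
        gene_id
        for gene_id, (p, boot) in enumerate(zip(observed_p, boot_threshold), start=1)
        if boot - p > 0
    ]
```

Calling the function once with each of the three threshold columns gives the three gene lists that section 9.4 obtains by sorting and colouring.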
d) Transcription Remodelling Protocol
These analyses are designed to investigate the degree of
difference in the transcriptome profiles between the hybrid and
parental lines. There are two methods, investigating the
transcriptome remodelling, and investigating the degree of
dominance.
Method 10, Transcriptome Remodelling Fold-Change Experiments
This analysis is designed to investigate the transcriptome
remodelling between hybrid and parental transcriptomes.
10.1 Create Data File
To create a data file for use in GenStat, open the master normalised
expression Excel file > Copy the relevant data columns (in the
order 3 hybrid files, 3 paternal files, 3 maternal files) into a
new chart > add a colon ":" at the very end of the last row > save
chart as .txt file > close file
Open the text file in GenStat> Enclose any title names in speech
marks (""), this should have the effect of turning the titles
green> Find and replace (Ctrl+R) * with blanks> Save file again
10.2 Fold Change Analysis Programme
Open the 'Basic Transcriptome Remodelling Programme' (GenStat
Programme 7) in GenStat
Check that the input data filename is correct, and is opening to
channel 2
Check that the output data file is going to the correct
destination and is opening to channel 3
Check that the ratios are set correctly for the ratio comparison
under investigation.
For example, for
"if ((elem(i;k).gt.0.5).and.(elem(i;k).lt.2))"
This is set for a 2-fold ratio
For 3 fold the values would be 0.33 and 3
For 1.5 fold the values would be 0.66 and 1.5
The values are entered 3 times in the programme
Check that the ratios are set correctly for the fold change
comparison under investigation. This is undertaken for all of the
sections and should be set simply to the relevant fold level
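The pairs of values quoted for each fold level are simply the reciprocal of the fold level and the fold level itself (the text rounds 1/3 to 0.33 and 1/1.5 to 0.66). A small Python sketch, with hypothetical function names, makes the relationship explicit and reproduces the strict comparisons (.gt. and .lt.) used in the example condition.

```python
def fold_bounds(fold):
    """Lower and upper ratio limits for a given fold level."""
    return 1.0 / fold, float(fold)

def within_fold(ratio, fold):
    """True when the ratio lies strictly between the two limits."""
    low, high = fold_bounds(fold)
    return low < ratio < high
```

For a 2-fold comparison this gives limits of 0.5 and 2, so a ratio of exactly 2 is not counted as within the fold range.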
To run the programme, ensure that both the programme window and
output windows are open. Select the programme window and press
Ctrl+W. This will set the programme running; check that the
GenStat server icon (bottom right-hand corner of the screen) has
changed colour to red.
To cancel the programme right click on the server icon and choose
interrupt
Once complete the GenStat icon will change colour back to green
10.3 Analysing the Output
To analyse the data, first open it in Excel, select "delimited">
next> tick the "Tab" and "Space"> Finish
Delete the first 266 rows in Excel, until you reach the column
headers. Then delete the bottom line beyond the data output
At the bottom of each column calculate the total number of
significant patterns in that list. This can be done by using the
directive "=SUM(C2:C22811)" in the first column and copying this
into the remaining columns, ensuring that the correct data is
selected.
The initial analysis is now complete. These values represent the
OBSERVED data in the further analysis, following bootstrapping to
generate the expected values.
Method 11, Transcriptome Remodelling Dominance Experiments
This analysis is designed to investigate dominance type
transcriptome remodelling between hybrid and parental
transcriptomes. Significance is calculated by comparing observed
values to the expected generated from random data. Note, this
programme is in its early stages, and is not easy to modify.
11.1 Create Data File
This experiment compares the expression profile of the
hybrid against the mean of its parents. To do this we must first
calculate these mean values.
Open a new Excel worksheet. Paste in the parent expression data
(both maternal and paternal) for the first replicate of the first
accession.
Calculate the mean value for each gene. This can be done by
typing the equation =AVERAGE(A2:B2) into the next cell along.
Copy this equation all the way down this column.
Open another worksheet and paste in the expression data of the
first hybrid, copy the newly generated mean parental expression
value and Edit>Paste Special >Values into the next column.
Repeat this for all of the replicates and accessions. Note that
this programme is designed to analyse 3 replicates of each
hybrid, a total of 6 columns per accession.
Once this is complete, save the file as .txt. Open the file in
GenStat> enclose the titles in "" which should change their
colour to green. Save the file again. This is the input file.
11.2 Running the Dominance Pattern Recognition Programme
Open the 'Dominance Pattern Programme' (GenStat Programme 8) in
GenStat
Check the accession names (first scalar command) are correct. If
you are investigating fewer than 8 accessions, you will need to
change the numbers of these identifiers throughout the programme.
Should you not wish to do this, running 'pseudo-data' in the
remaining columns will not affect the output and can be ignored
at the analysis stage.
Check the number of columns (second scalar command) is correct.
It should be 6x the number of accessions used (default is 48).
Check that the output file is correctly named and addressed.
Check that the input file is correct.
Check that the fold level is correct for the analysis you wish to
undertake. These values are recorded for 2 fold as
if (ratio.ge.0.5).and.(ratio.le.2) "calculates flags"
calc heqmp=1
elsif (ratio.gt.2)
calc hgtmp=1
elsif (ratio.lt.0.5)
calc hltmp=1
For other fold levels change the 0.5 and 2 values to the
appropriate value for that fold level.
For 3 fold the values would be 0.33 and 3
For 1.5 fold the values would be 0.66 and 1.5
Run the file by pressing Ctrl+W.
11.3 Analysing the Pattern Recognition Output
To analyse the output file, first open it in Excel, select
"delimited"> next> tick the "Tab" and "Space"> Finish
You will see a file filled with '1s' and '0s.' Scroll to the
bottom of this file. Underneath the first filled column write the
equation "=SUM(B1:B22810)" (ensuring that all of the data in that
column is filled). Copy this equation to all of the columns.
Each set of three 'sum values' represent the data output for a
single accession (3 replicates), in the order that the data was
loaded into the programme. These values represent
Column 1= The number of genes whose hybrid expression falls
within the fold level criterion of the mid-parent value, for ALL
3 replicates.
Column 2= The number of genes whose hybrid expression is greater
than that of the mid-parent value, by at least the fold level
criterion, for ALL 3 replicates.
Column 3= The number of genes whose hybrid expression is lower
than that of the mid-parent value, by at least the fold level
criterion, for ALL 3 replicates.
Record these values, as the OBSERVED for these data.
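The three OBSERVED counts can be reproduced in outline with the Python sketch below. It follows the flag names and the inclusive 2-fold limits quoted in section 11.2, and counts a gene only when all of its replicates receive the same flag. The function names and the input layout (one list of hybrid and mid-parent pairs per gene) are hypothetical, and the real programme may differ in detail.

```python
def classify(hybrid, mid_parent, fold=2.0):
    """Flag one hybrid measurement against its mid-parent value."""
    ratio = hybrid / mid_parent
    if 1.0 / fold <= ratio <= fold:
        return "heqmp"  # hybrid within the fold range of the mid-parent
    elif ratio > fold:
        return "hgtmp"  # hybrid greater than the mid-parent
    else:
        return "hltmp"  # hybrid lower than the mid-parent

def count_patterns(genes, fold=2.0):
    """genes: per gene, a list of (hybrid, mid_parent) pairs, one per replicate."""
    counts = {"heqmp": 0, "hgtmp": 0, "hltmp": 0}
    for replicates in genes:
        flags = {classify(h, m, fold) for h, m in replicates}
        if len(flags) == 1:  # the same flag in ALL replicates
            counts[flags.pop()] += 1
    return counts
```

Genes whose replicates disagree contribute to none of the three counts, so the three values need not sum to the total number of genes.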
11.4 Generating the EXPECTED value.
The expected data set is generated using the 'Dominance
Permutation Programme' (GenStat Programme 9)
Check the number of columns (second scalar command) is correct.
It should be 6x the number of accessions used (default is 48).
Check that the output file is correctly named and addressed.
Check that the input file is correct. This is the same input file
as generated previously.
Check that the fold level is correct for the analysis you wish to
undertake. These values are recorded for 2 fold as before (section
11.2)
Check the number in the permutation loop is correct for the
number of permutations you require. A minimum of 100 is
recommended (although 1000 is ideal).
Run the file by pressing Ctrl+W.
This programme may take a few days to run, depending upon how
many permutations are added.
11.5 Analysing the Pattern Recognition Permutation Output
To analyse the output file, first open it in Excel, select
"delimited"> next> tick the "Tab" and "Space"> Finish
You will see a file filled with numbers. Scroll to the bottom of
this file. Underneath the first filled column write the equation
"=SUM(B1:B123)" (ensuring that all of the data in that column is
filled). Copy this equation to all of the columns.
Each set of three 'sum values' represent the permuted data output
for a single accession (3 replicates), in the order that the data
was loaded into the programme. The three values represent the
'expected by random chance' versions of the values calculated in
section 11.3.
The calculated values at the bottom of the columns are the
EXPECTED values required for this analysis. As these data are
effectively random it is acceptable to combine these for
comparison, if time is limiting.
11.6 Analysing the Significance
The level of significance is calculated by chi square analysis,
using the observed and expected data generated previously, and 1
degree of freedom.
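A chi square comparison of one observed count with one expected count, on 1 degree of freedom, can be sketched in Python using only the standard library. The sketch assumes the usual Pearson statistic, (observed - expected)^2 / expected, which may not be exactly the statistic the GenStat menu computes; the function name is hypothetical. For 1 degree of freedom the upper-tail probability can be written with the complementary error function.

```python
from math import erfc, sqrt

def chi_square_1df(observed, expected):
    """Return (chi_square, p_value) for one observed/expected pair."""
    chi2 = (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, an observed count of 120 against an expected count of 100 gives a chi square of 4.0, which corresponds to a P value just below 0.05.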
Method 12, Transcriptome Remodelling Fold-Change Bootstrapping
This analysis is designed to assess the significance of fold
change experiments described in Method 10. Significance is
calculated by comparing observed values to expected generated
from random data
12.1 Fold Change Bootstrapping
Open 'Transcriptome Remodelling Bootstrap Programme' (GenStat
Programme 10) in GenStat
Check that the input data filename is correct, and is opening to
channel 2. This will be the same input file as created in section
10.1.
Check that the output data file is going to the correct
destination and is opening to channel 3
Check that the number of randomisations is set to the desired
value. As few as 50 randomisations are sufficient to give valid
estimates of random chance, however 1000 would be ideal, but this
can take many days to obtain.
Check that the ratios are set correctly for the ratio comparison
under investigation.
For example:
"if ((elem(i;k).gt.0.5).and.(elem(i;k).lt.2))"
This is set for a 2-fold ratio
For 3 fold the values would be 0.33 and 3
For 1.5 fold the values would be 0.66 and 1.5
To run the programme, ensure that both the programme window and
output windows are open. Select the programme window and press
Ctrl+W. This will set the programme running; check that the
GenStat server icon (bottom right-hand corner of the screen) has
changed colour to red.
To cancel the programme right click on the server icon and choose
interrupt
Once complete the GenStat icon will change colour back to green
12.2 Analysing the Output
To analyse the data, first open it in Excel, select "delimited">
next> tick the "Tab" and "Space"> Finish
Delete the first 281 rows in Excel, until you reach the first row
of data. Then delete the bottom line beyond the data output
Select the whole sheet and go to data>sort>sort by "Column B".
This will remove the empty rows from the data.
At the bottom of each column calculate the mean number of
significant patterns in that list. This can be done by using the
directive "=AVERAGE(B2:B22811)" in the first column and copying
this into the remaining columns, ensuring that the correct data
is selected.
This will give the EXPECTED mean value, expected by random chance
in the data
12.3 Calculating Significance
Calculating the significance of the observed patterns requires
the use of a maximum likelihood chi square test
Firstly open GenStat> Stats> Statistical Tests> Chi-Square
Goodness of Fit
Click on "Observed data create table"> Spreadsheet
Name the table OBS> Change rows and columns to 1> OK and ignore
the error message
In the new table cell, type the first OBSERVED value
(the column sum).
Click on "expected frequencies create table"> Spreadsheet
Name the table EXP> leave rows and columns as 1> OK and ignore
the error message
In the new table cell, type the first EXPECTED value
(the column mean).
On the Chi-Square window put 1 into the degrees of freedom box
and click Run
Record the Chi-Square and P value that appears in the Output
window.
Type the next OBSERVED value into the OBS box and click onto the
output window
Type the next EXPECTED value into the EXP box and click onto the
output window
On the Chi-Square window click Run, and record the new Chi-Square
and P value that appears in the Output window
This should then be undertaken for all of the remaining OBSERVED
and EXPECTED values.
These results indicate whether or not the OBSERVED values differ
significantly from random chance.
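As a rough cross-check of the values recorded from GenStat, a single OBSERVED/EXPECTED pair can be compared with 1 degree of freedom in a few lines of Python. This sketch uses the Pearson form of the chi-square statistic, which is an assumption: the maximum likelihood form reported by GenStat may give a somewhat different value. The function names are hypothetical.

```python
import math


def pearson_chi_square(observed, expected):
    """Pearson chi-square contribution for one observed/expected pair."""
    if expected <= 0:
        raise ValueError("expected value must be positive")
    return (observed - expected) ** 2 / expected


def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square value with 1 degree of freedom.

    For 1 degree of freedom this equals erfc(sqrt(chi_square / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))
```

For example, an observed count of 30 against an expected mean of 20 gives a statistic of 5.0, and `p_value_1df(5.0)` is below 0.05, so that observed value would be taken to differ from random chance at the 5% level.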
Troubleshooting
This section describes some of the most common problems that can
occur while running these programmes. Many of these
problems/solutions apply to most of the programmes and as a
result this section has not been divided up along programme
lines. This list is not exhaustive, but should cover the majority
of problems encountered. It should be noted that the 'fault
codes' given are only for illustration; often many fault codes
can result from the same root problem.
One common method of solving general problems is to ensure that
all of the input files are closed prior to running the programme.
This is achieved by typing (to close channel 2) "close ch=2" and
then running this directive. By repeating this for channels 3-5,
you can ensure that all of the channels are closed before running
your programme, thus avoiding conflicts.
Fault 16, code VA 11, statement 4 in for loop Command: fit
[print=*]mpadv invalid or incompatible type(s) Structure mpadv is
not of the required type.
Remove comma from the end of the variate list.
Fault 29, code VA 11, statement 4 in for loop Command: fit
[print=*]mpadv Invalid or incompatible type(s) Structure mpadv is
not of the required type
Problem with the trait-data identifier. Possibly a different or
missing identifier following the trait data variates (X-axis
data)
Failure to run problems
- Too many values
Fault #, code VA 5, statement 2 in for loop Command: read
[ch=2;print=*;serial=n]exp Too many values
1) Ensure that the width parameter is set to a large enough
value (400 is standard)
2) Ensure that if titles are included in the data file, that they
are 'greened out' and not being read as data
3) Ensure that the "Unit" number (at the beginning of the
programme) and the number of trait "variate"s are the same
Fault 13, code VA 6, statement 4 in for loop Command: fit
[print=*]mpadv Too few values (including null subset from
RESTRICT) Structure mpadv has 31 values, whereas it should have
38
Ensure that the "Unit" number (at the beginning of the programme)
and the number of trait "variate"s are the same
Warning 6, code VA 6, statement 2 in for loop Command: read
[ch=2;print=*;serial=n]exp Too few values (including null subset
from RESTRICT)
Ensure that the "ntimes=" number and the number of probes in the
data file are the same
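Several of the faults above come down to a mismatch between the numbers typed into the programme and the shape of the data file, so it can help to check the file before running. The Python sketch below is an illustration under stated assumptions: the data file is taken to hold one probe per row with whitespace-separated values and any title rows already removed, and `check_dimensions`, `n_units` and `n_probes` are hypothetical names standing for the check itself, the "Unit" number and the "ntimes=" number.

```python
def check_dimensions(lines, n_units, n_probes):
    """Return a list of mismatch descriptions (empty when the file is consistent).

    Assumptions: one probe per non-empty line, values separated by
    whitespace, no title rows left in the input.
    """
    problems = []
    data = [line.split() for line in lines if line.strip()]
    if len(data) != n_probes:
        problems.append(
            f"expected {n_probes} probe rows, found {len(data)}"
        )
    for number, fields in enumerate(data, start=1):
        if len(fields) != n_units:
            problems.append(
                f"row {number}: expected {n_units} values, found {len(fields)}"
            )
    return problems
```

An empty list suggests that the counts agree; any entries point to the rows that would be likely to trigger "Too many values" or "Too few values".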
File Opening Failure
Fault #, code IO 25, statement 2 in for loop Command: read
[ch=2;print=*;serial=n]exp Channel for input or output has not
been opened, or has been terminated Input File on Channel 2
1) Input file name is incorrect
2) Input file address is incorrect
Fault 32, code IO 25, statement 12 in for loop Command: print
[ch=3;iprint=*;clprint=*;rlprint=*]bin Channel for input or
output has not been opened, or has been terminated Output File on
Channel 3
Output file address is incorrect.
Very slow running of bootstrapping
Check that the programme is not conflicting with anti-virus
software. The problem results from anti-virus software scanning
the output file each time the programme makes a write-to-disk
operation, and should be resolved by the computing department. It
can often be fixed easily by modifying the scanning settings.
If All Else Fails
Check that the file C:\Temp\Genstat is not filled. This can
result from too many temp (.tmp) files being generated as a
result of bootstrapping programmes. Deleting these files may
improve the running of the programme.
Finally, VSN (the GenStat providers) can be contacted at
'support@vsn-intl.com'
Data Analysis problems
Missing or very high F-probabilities
Ensure that the data has not 'shifted' at very low
F-probabilities. At the regression stage (section 4.4), before
creating the ID column, add an extra column to the beginning of
the file. Insert the ID column and sort by DF; if the data has
shifted, this should become apparent here.
Table 22, continued.
stop
References
1 R. H. Moll, W. S. Salhuana, H. F. Robinson, Crop Sci 2, 197
(1962) .
2 J. H. Xiao, J. M. Li, L. P. Yuan, S. D. Tanksley, Genetics 140,
745 (1995)
3 M. A. Kosba, Beitr Trop Landwirtsch Veterinarmed 16, 187 (1978)
4 K. E. Gregory, L. V. Cundiff, R. M. Koch, J. Anim Sci. 70, 2366
(1992)
5 G. H. Shull, Am Breed Assoc 4, 296 (1908)
6 D. E. Comings, J. P. MacMurray, Molecular Genetics and
Metabolism 71, 19 (2000)
7 Meyer,R.C., et al. 2004 Plant Physiol. 134: 1813-1823
8 Piepho, Hans-Peter (2005) Genetics 171:359-364
9 Stuber,C.W., et al. (1992) Genetics 132:823-839
10 C. B. Davenport, Science 28, 454 (1908)
11 E. M. East, Reports of the Connecticut agricultural experiment
station for years 1907-1908 419 (1908).
12 J. B. Hollick, V. L. Chandler, Genetics 150, 891 (1998)
13 D. A. Fasoula, V. A. Fasoula, Plant Breeding Reviews 14, 89
(1997)
14 J. P. Hua et al., Proceedings of the National Academy of
Sciences of the United States of America 100, 2574 (2003)
15 S. W. Omholt, E. Plahte, L. Oyehaug, K. F. Xiang, Genetics
155, 969 (2000)
16 Duvick,D.N. (1999). Genetic diversity and heterosis. In:
Coors,C.G. and Pandey,S. (Eds.) Genetics and exploitation of
heterosis in crops. American Society of Agronomy, Madison 293-304
17 Melchinger,A.E. 1999 Genetic diversity and heterosis. In:
Coors,C.G. and Pandey,S. (Eds.) Genetics and exploitation of
heterosis in crops. American Society of Agronomy, Madison 99-118.
18 Moll,R.H., et al. 1965. Genetics 52 139-144.
19 Stokes,D., et al. Euphytica in press. 2007
20 Melchinger,A.E., et al. (1990) TAG Theoretical and Applied
Genetics (Historical Archive) 80:488-496
21 Xiao,J., et al. (1996) TAG Theoretical and Applied Genetics
92: 637-643
22 Fabrizius,M.A., et al. (1998). Crop Science 38:1108-1112.
23 L. Z. Xiong, G. P. Yang, C. G. Xu, Q. F. Zhang, M. A. S.
Maroof, Molecular Breeding 4, 129 (1998)
24 Q. X. Sun, Z. F. Ni, Z. Y. Liu, Euphytica 106, 117 (1999)
25 Z. Ni, Q. Sun, Z. Liu, L. Wu, X. Wang, Molecular and General
Genetics 263, 934 (2000)
26 L. M. Wu, Z. F. Ni, F. R. Meng, Z. Lin, Q. X. Sun, Molecular
Genetics and Genomics 270, 281 (2003)
27 Auger et al. Genetics 169:389-397 2005
28 Sun,Q.X., et al. 2004 Plant Science 166, 651-657
29 M. Guo et al., Plant Cell 16, 1707 (2004)
30 Vuylsteke et al. Genetics 171:1267-1275 2005
31 Kliebenstein et al. Genetics 172:1179-1189 Feb 2006
32 Kirst et al. Genetics 169:2295-2303 2005
33 Paux et al. New Phytologist 167:89-100 2005
34 H. Kacser, J. A. Burns, Genetics 97, 639 (1981)
35 Langton, Smith & Edmondson 1990 Euphytica 49(1):15-23
36 L. Mejnartowicz Silvae Genetica 48, 2 (1999) Pg 100-103
37 Cassady, J.P., Young, L.D., and Leymaster,K.A. (2002) J. Anim
Sci. 80, 2286-2302
38 Gama,L.T., et al. (1991). J. Anim Sci. 69, 2727-2743
39 Bradford GE, Burfening PJ, Cartwright TC. J. Anim Sci 1989
Nov;67(11):3058-67
40 Marks HL. Poult Sci 1995 Nov;74(11):1730-44
41 S. Einum and I. A. Fleming (1997) 50 (3) Journal of Fish
Biology 634 -651
42 Peyman and Ulman, Chemical Reviews, 90:543-584, (1990)
43 Crooke, Ann. Rev. Pharmacol. Toxicol., 32:329-376, (1992)
44 John et al, PLoS Biology, 11(2), 1862-1879, 2004
45 Myers (2003) Nature Biotechnology 21:324-328
46 Shinagawa et al., Genes and Dev., 17, 1340-5, 2003
47 Fire A, et al., 1998 Nature 391:806-811
48 Fire, A. Trends Genet. 15, 358-363 (1999)
49 Sharp, P. A. RNA interference 2001. Genes Dev. 15, 485-490
(2001)
50 Hammond, S. M., et al., Nature Rev. Genet. 2, 110-119 (2001)
51 Tuschl, T. Chem. Biochem. 2, 239-245 (2001)
52 Hamilton, A. et al., Science 286, 950-952 (1999)
53 Hammond, S. M., et al., Nature 404, 293-296 (2000)
54 Zamore, P. D., et al., Cell 101, 25-33 (2000)
55 Bernstein, E., et al., Nature 409, 363-366 (2001)
56 Elbashir, S. M., et al., Genes Dev. 15, 188-200 (2001)
57 WO0129058
58 WO9932619
59 Elbashir S M, et al., 2001 Nature 411:494-498
60 Marschall, et al. Cellular and Molecular Neurobiology, 1994.
14(5): 523
61 Hasselhoff, Nature 334: 585 (1988) and Cech, J. Amer. Med.
Assn., 260: 3030 (1988)
62 AGI, Nature 408, 796 (2000).
63 T. Zhu, X. Wang, Plant Physiol. 124, 1472 (2000)
64 R. Meyer, O. Törjék, C. Müssig, M. Lück, T. Altmann, paper
presented at the Signals, Sensing and Plant Primary Metabolism
2nd Symposium, Potsdam, Germany (2003)
65 S. Barth, A. K. Busimi, H. F. Utz, A. E. Melchinger, Heredity
91, 36 (2003)
66 M. Guo, M. A. Rupe, O. N. Danilevskaya, X. F. Yang, Z. H. Hu,
Plant Journal 36, 30 (2003)
67 Sakamoto,A., et al. 2003 Plant Cell 15 2042-2057.
68 Schmid,M., et al. Nature Genetics 37 501-506 2005.
69 Tian,D., et al. Nature 423 74-77 2003
70 GenStat for Windows. Seventh Edition(7.1.0.198). 2005. Oxford,
Lawes Agricultural Trust. Ref Type: Computer Program
71 C. M. O'Neill, I. Bancroft, The Plant Journal 23, 233 (2000)
72 Liu,K., et al. (2003). Genetics 165 2117-2128.
WE CLAIM :
1. A method of predicting the magnitude of a trait in a plant
or animal; comprising
determining transcript abundances of a gene or a set of
genes in the plant or animal, wherein transcript abundances of
the gene or set of genes in the plant or animal transcriptome
correlate with the trait; and
thereby predicting the trait in the plant or animal.
2. A method according to claim 1, comprising earlier steps of
analysing the transcriptome of a population of plants or
animals;
measuring the trait in plants or animals in the population;
and
identifying a correlation between transcript abundances of a
gene or set of genes in the plant or animal transcriptomes and
the trait in the plants or animals.
3. A method according to claim 1 or claim 2, wherein the plant
or animal is a hybrid.
4. A method according to claim 3, wherein the trait is
heterosis.
5. A method according to claim 4, wherein the heterosis is
heterosis for yield.
6. A method according to claim 1 or claim 2, wherein the plant
or animal is inbred or recombinant.
7. A method according to claim 4 or claim 5, wherein the method
is for predicting the magnitude of heterosis and the gene or set
of genes comprises At1g67500 or At5g45500 or orthologues thereof
and/or a gene or set of genes selected from the genes shown in
Table 1 or Table 19, or orthologues thereof.
8. A method according to any of the preceding claims,
comprising determining transcript abundance of a gene or set of
genes in the plant or animal wherein the trait is not yet
determinable from the phenotype of the plant or animal.
9. A method according to any of the preceding claims, wherein
the method is for predicting a trait in a plant and wherein the
plant is a crop plant.
10. A method according to claim 9, wherein the crop plant is
maize.
11. A method comprising increasing the magnitude of heterosis in
a plant hybrid, by:
(i) upregulating expression in the hybrid of a gene or set of
genes whose transcript abundance in hybrids correlates positively
with the magnitude of heterosis, wherein the gene or set of genes
comprises a gene or set of genes selected from the positively
correlating genes shown in Table 1 and/or Table 19A, or
orthologues thereof; and/or
(ii) downregulating expression in the hybrid of a gene or set of
genes whose transcript abundance in hybrids correlates negatively
with the magnitude of heterosis, wherein the gene or set of genes
comprises a gene or set of genes selected from At1g67500,
At5g45500 and/or the negatively correlating genes shown in Table
1 and/or Table 19B, or orthologues thereof.
12. A method of increasing a trait in a plant, by:
(i) upregulating expression in the plant of a gene or set of
genes whose transcript abundance in plants correlates positively
with the trait, wherein:
the trait is flowering time and wherein the gene or set of
genes comprises a gene or set of genes selected from the genes
listed in Table 3A or Table 4A, or orthologues thereof;
the trait is seed oil content and wherein the gene or set of
genes comprises a gene or set of genes selected from the genes
listed in Table 6A, or orthologues thereof;
the trait is ratio of 18:2 / 18:1 fatty acids in seed oil,
wherein the gene or set of genes comprises a gene or set of genes
selected from the genes listed in Table 7A, or orthologues
thereof;
the trait is ratio of 18:3 / 18:1 fatty acids in seed oil,
wherein the gene or set of genes comprises a gene or set of genes
selected from the genes shown in Table 8A, or orthologues
thereof;
the trait is ratio of 18:3 / 18:2 fatty acids in seed oil,
wherein the gene or set of genes comprises a gene or set of genes
selected from the genes shown in Table 9A, or orthologues
thereof;
the trait is ratio of 20C + 22C / 16C + 18C fatty acids in
seed oil, wherein the gene or set of genes comprises a gene or
set of genes selected from the genes shown in Table 10A, or
orthologues thereof;
the trait is ratio of polyunsaturated / monounsaturated +
saturated 18C fatty acids in seed oil, wherein the gene or set of
genes comprises a gene or set of genes selected from the genes
shown in Table 12A, or orthologues thereof;
the trait is % 16:0 fatty acid in seed oil, wherein the gene
or set of genes comprises a gene or set of genes selected from
the genes shown in Table 14A, or orthologues thereof;
the trait is % 18:1 fatty acid in seed oil, wherein the gene
or set of genes comprises a gene or set of genes selected from
the genes shown in Table 15A, or orthologues thereof;
the trait is % 18:2 fatty acid in seed oil, wherein the gene
or set of genes comprises a gene or set of genes selected from
the genes shown in Table 16A, or orthologues thereof;
the trait is % 18:3 fatty acid in seed oil, wherein the gene
or set of genes comprises a gene or set of genes selected from
the genes shown in Table 17A, or orthologues thereof; or
the trait is yield, and wherein the gene or set of genes
comprises a gene or set of genes selected from the genes shown in
Table 20A, or orthologues thereof;
or
(ii) downregulating expression in the plant of a gene or set of
genes whose transcript abundance in plants correlates negatively
with the trait, wherein:
the trait is flowering time and wherein the gene or set of
genes comprises a gene or set of genes selected from the genes
listed in Table 3B or Table 4B, or orthologues thereof;
the trait is seed oil content and wherein the gene or set of
genes comprises a gene or set of genes selected from the genes
listed in Table 6B, or orthologues thereof;
the trait is ratio of 18:2 / 18:1 fatty acids in seed oil,
wherein the gene or set of genes comprises a gene or set of genes
selected from the genes listed in Table 7B, or orthologues
thereof;
the trait is ratio of 18:3 / 18:1 fatty acids in seed oil,
wherein the gene or set of genes comprises a gene or set of genes
selected from the genes shown in Table 8B, or orthologues thereof;
the trait is ratio of 18:3 / 18:2 fatty acids in seed oil,
wherein the gene or set of genes comprises a gene or set of genes
selected from the genes shown in Table 9B, or orthologues
thereof;
the trait is ratio of 20C + 22C / 16C + 18C fatty acids in
seed oil, wherein the gene or set of genes comprises a gene or
set of genes selected from the genes shown in Table 10B, or
orthologues thereof;
the trait is ratio of polyunsaturated / monounsaturated +
saturated 18C fatty acids in seed oil, wherein the gene or set of
genes comprises a gene or set of genes selected from the genes
shown in Table 12B, or orthologues thereof;
the trait is % 16:0 fatty acid in seed oil, wherein the gene
or set of genes comprises a gene or set of genes selected from
the genes shown in Table 14B, or orthologues thereof;
the trait is % 18:1 fatty acid in seed oil, wherein the gene
or set of genes comprises a gene or set of genes selected from
the genes shown in Table 15B, or orthologues thereof;
the trait is % 18:2 fatty acid in seed oil, wherein the gene
or set of genes comprises a gene or set of genes selected from
the genes shown in Table 16B, or orthologues thereof;
the trait is % 18:3 fatty acid in seed oil, wherein the gene
or set of genes comprises a gene or set of genes selected from
the genes shown in Table 17B, or orthologues thereof; or
the trait is yield, and wherein the gene or set of genes
comprises a gene or set of genes selected from the genes shown in
Table 20B, or orthologues thereof.
13. A method of predicting a trait in a hybrid, wherein the
hybrid is a cross between a first plant or animal and a second
plant or animal; comprising
determining the transcript abundance of a gene or set of
genes in the second plant or animal, wherein transcript abundance
of the gene or the genes in the set of genes correlates with the
trait in a population of hybrids produced by crossing the first
plant or animal with different plants or animals; and
thereby predicting the trait in the hybrid.
14. A method according to claim 13, comprising earlier steps of:
analysing transcriptomes of plants or animals in a
population of plants or animals;
determining a trait in a population of hybrids, wherein each
hybrid in the population is a cross between a first plant or
animal and a plant or animal selected from the population of
plants or animals;
and
identifying a correlation between transcript abundance of a
gene or set of genes in the population of plants or animals and
the trait in the population of hybrids.
15. A method according to claim 13 or claim 14, wherein the
hybrid is a maize hybrid cross between a first maize plant and a
second maize plant.
16. A method comprising:
determining the transcript abundance of a gene or set of
genes in plants or animals, wherein the transcript abundances of
the gene or the genes in the set of genes in plants or animals
correlate with a trait in hybrid crosses between a first plant or
animal and other plants or animals;
selecting one of the plants or animals on the basis of said
correlation; and
selecting a hybrid that has already been produced or
producing a hybrid cross between the selected plant or animal and
the said first plant or animal.
17. A method according to claim 16, wherein the plants are maize
and wherein a maize hybrid cross is produced.
18. A method comprising:
analysing the transcriptomes of hybrids in a population of
hybrids;
determining heterosis or other trait of hybrids in the
population; and
identifying a correlation between transcript abundance of a
gene or set of genes in the hybrid transcriptomes and heterosis
or other trait in the hybrids.
19. A method for determining hybrids to be grown or tested in
yield or performance trials which comprises determining
transcript abundance from vegetative phase plants or pre-
adolescent animals.
20. A method according to claim 19, wherein the hybrids are
maize hybrids.
21. A method which comprises analyzing the transcriptome of
hybrids or inbred or recombinant plants or animals, said method
comprising:
(i) identifying genes involved in the manifestation of heterosis
and other traits in hybrids; and, optionally,
(ii) predicting and producing hybrid plants or animals of
improved heterosis and other traits by selecting plants or
animals for breeding, wherein the plants or animals exhibit
enhanced transcriptome characteristics with respect to a selected
set of genes relevant to the transcriptional regulatory networks
present in potential parental breeding partners; and, optionally,
(iii) predicting a range of trait characteristics for plants and
animals based on transcriptome characteristics.
22. A method according to claim 21, wherein the hybrids or
inbred or recombinant plants are maize.
23. A subset of genes that retain most of the predictive
power of a large set of genes the transcript abundance of
which correlates well with a particular characteristic in a
hybrid.
24. The subset according to claim 23 which comprises
between 10 and 70 genes for prediction of heterosis based on
hybrid transcriptomes.
25. A method for identifying a limited set of genes which
comprises iterative testing of the precision of predictions by
progressively reducing the numbers of genes in a trait predictive
model, and preferentially retaining those with the best
correlation of transcript abundance with the trait.