Abstract: Apparatuses (including devices and systems) and methods for determining if a patient will respond to a variety of cancer drugs.
WO 2016/139534 PCT/IB2016/000316
APPARATUSES AND METHODS FOR DETERMINING A PATIENT'S RESPONSE TO MULTIPLE
CANCER DRUGS
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This patent application claims priority to Indian patent application numbers 1001/CHE/2015, filed March 2, 2015, and 201641000542, filed January 7, 2016. Each of these patent applications is herein incorporated by reference in its entirety.
INCORPORATION BY REFERENCE
[0002] All publications and patent applications mentioned in this specification are herein incorporated by reference in their entirety to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.
FIELD
[0003] Described herein are apparatuses (including devices and systems) and methods for determining if a patient will respond to a variety of cancer drugs.
BACKGROUND
[0004] Although there are a number of cancer therapies known and in development to treat various forms of cancer, it is difficult or impossible to predict which cancer therapies (including cancer drugs) will be effective in treating a particular cancer type of a particular patient. In the last decade there has been a rapid increase in evidence showing the effects of one or a few markers for the treatment of various cancers. Unfortunately, this information has been exceedingly difficult to generalize between patients, particularly to patients not possessing the specific value for the markers examined.
[0005] Further, there is a confusing, and sometimes conflicting, variety of markers and marker values relevant to patient drug response. Many markers are DNA based. Deoxyribonucleic acids (DNA) are the building blocks of the genome. The human genome has about 3 billion base pairs organized into 22 pairs of autosomes (1 through 22) and a pair of sex-determining chromosomes, X and Y. There are 4 DNA bases: Adenine, Guanine, Thymine, and Cytosine. A series of DNA bases organized in a specific fashion forms a gene. Each gene is associated with one or more traits of the organism, such as eye color, height, etc. Only about 2% of the human genome encodes the roughly 23,000 genes. The remaining 98% of the genome is not well understood and is currently considered "junk DNA". The variation in the genome sequence between any two individuals is expected to be less than 0.01%. This small variation at various positions across the human genome is believed to account for all the visible differences among individuals and to play a role in health, disease and aging. Some genes are more critical to normal human functioning than others. Currently about 4800 genes are known to have clinical relevance. Some DNA base positions are so critical that even a single DNA base modification or substitution (called a single nucleotide variation or SNV) can cause disease with one or more manifestations. For example, the CFTR gene is well established to be associated with cystic fibrosis, with many known mutations. Many mutations in the BRCA1 and BRCA2 genes are also well established to be associated with breast cancer. Such SNVs are called mutations. Variations can also span more than one base position, such as multi-base insertions or deletions (indels) and translocations of large regions of the genome. Such changes are called structural variations. These also include copy number variations or CNVs, which occur when the
number of copies of a region of the human genome deviates from the normal number of 2 (for a diploid genome). Any of these variations occurring in clinically sensitive regions of the genome can cause diseases of varying severity depending on the function of the genes involved.
[0006] Genetic markers, whose values may be identified by sequencing, such as next-generation sequencing, have been proposed as helping identify therapy response in cancer patients. Next-Generation Sequencing (NGS) refers to the recent advances in sequencing deoxyribonucleic acids (DNA) in a massively parallel manner, faster and cheaper than earlier methods such as Sanger sequencing and without loss of accuracy. While sequencing the first complete human genome took about 13 years and cost about 3 billion USD, at a rate of $1 per base in 2000, many laboratories around the world can now apply NGS to sequence human genomes routinely for about $1000 (or $0.000003 per base sequenced) per genome. NGS technology is agnostic of the origin of the DNA (i.e., source organism). There are a number of vendors of NGS technology, with Illumina leading the market with several variant products including X10, HiSeq, MiSeq and NextSeq. Life Technologies' Ion Torrent platform is the second largest player. Various smaller players are attempting to break into this market.
[0007] Since the genome of every individual in this world is unique, there is no such thing as a "normal genome" or "standard genome". However, in order to serve as a reference, free and open reference genome databases are made available to the scientific research community (academic and commercial) by the National Institutes of Health (NIH) in the USA. The human genome assembly builds (the stable hg19 build and the latest hg20 build) and annotations are constantly upgraded, and are also enhanced by external groups adding their own annotations to the genome builds released by NIH.
[0008] What is needed are rapid and accurate methods and apparatuses for determining which therapies may be effective (and/or ineffective) in treating a particular patient.
[0009] Described herein are methods and apparatuses that may help a caregiver (e.g., physician, nurse, etc.) treat a cancer patient by predicting which drugs may be effective in treating that patient. As will be described in greater detail below, these methods and apparatuses may improve traditional clinical genomics, and may combine it with other marker information to provide powerful and accurate predictions for patient therapy.
[00010] Clinical genomics is the application of NGS and other genomic technologies for clinical utility, such as diagnosis of a disease at the molecular (DNA, RNA, or protein) level. Most currently published clinical genomics methods involve one or more of the following steps: identifying genes or regions of the genome of interest to a particular disease or group of diseases (target region selection); designing a method to capture only the target DNA regions of interest (target capture); amplifying the captured target DNA by polymerase chain reaction (PCR); preparing libraries for NGS; performing NGS and generating millions of short reads of the same length (around 75 to 150 bases each); and aligning the short reads coming out of NGS to the human reference genome hg19, allowing for a reasonable number of mismatches to account for SNVs and small indels. Many open source or commercial algorithms and tools are available to perform this alignment step. From the aligned regions of the genome, all SNVs, indels and CNVs are called by comparing against the reference genome databases. Many algorithms and tools are available to perform this variant calling step. Annotation and interpretation of the variations called in the previous step may also be performed. This integrates information from literature-curated databases of variant-disease associations for previously published variants. For discovery of novel (previously unknown) clinically relevant variants, various tools are used to predict the possible functional or clinical impact of variants. From these steps, it would be beneficial to generate clear, concise and precise clinical reports that are readily interpretable by physicians to make clinical decisions. An example of a workflow for clinical genomics based diagnostics is illustrated in FIGS. 1A-1B. Additional descriptions and examples are provided below.
SUMMARY OF THE DISCLOSURE
[00011] The methods and apparatuses described herein may determine the response of a patient having cancer to one or more cancer drugs and/or therapies. In general, these methods and apparatuses may assist a physician in treating a patient. For example, described herein are marker processing apparatuses that can receive a wide variety of patient-specific marker values, including marker values that do not match known marker values (e.g., variants, polymorphisms, etc.), and can determine how equivalent such patient-specific marker values are to known marker values. The apparatus (e.g., device, system, etc.) may then apply this equivalence determination to estimate a patient's response to a variety of different drugs. Methods of operating the marker processing apparatuses are also described. The marker processing apparatuses described herein are capable of receiving and transforming a number of different types of markers, including genetic sequence information (polynucleotide sequences), immunohistochemical information, fluorescent in-situ hybridization information, etc. Each of these markers may have multiple possible values, only some of which are known. Examples of values may include, for polynucleotide markers: polymorphisms, mutations, truncations, reading frame shifts, deletions, repetitions, etc., or any other status of a gene or portion of a gene. Examples of values for fluorescence markers may include: fluorescence patterns, fluorescence intensities, combinations of immunohistochemical markers, tissue-specificity, etc. The known values for the marker(s) may be linked to clinical or literature information regarding the efficacy of one or more drugs or therapies.
[00012] In addition, described herein are apparatuses that may assist in interpreting marker information received from a patient, which may be separate from or integrated into the marker processing apparatuses. In particular, described herein are graphical processing units for graphically processing marker information.
[00013] Also described herein are methods and kits for determining if a patient has one or more markers, including genetic screens/panels and methods of performing them. Also described are immunohistochemical methods and markers.
[00014] The methods and apparatuses described herein may generally be used to concurrently examine a large number of cancer drugs and therapies. In some variations the user (physician, technician, etc.) may input into the marker processing apparatus the type of cancer or type of tissue being examined by the markers. In general, the marker processing apparatus may operate to analyze a sub-set of cancer therapies (e.g., drugs, treatments, dosing regimens, clinical trials, etc.) quickly in an initial, streamlined phase, before analyzing a larger set of cancer therapies. The methods or apparatus may be adapted to provide a rapid response on this sub-set of cancer therapies by first receiving the patient-specific marker values (including receiving all patient-specific marker values for the larger set of markers), and determining a level of equivalence compared to a sub-set of markers predetermined to have an effect on the sub-set of cancer therapies (such as a set of "standard of care" drugs specific to a type of cancer). The level of equivalence may be quantitative (e.g., between 0 and 100%) or qualitative, e.g., high equivalence (which may include perfect matching), medium equivalence, low equivalence, and no equivalence, etc.
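As a minimal sketch of how a quantitative level of equivalence might be reported qualitatively, the following Python maps a percentage onto the four levels named above. The cut-off values and the names `Equivalence`, `HIGH_CUTOFF`, `MEDIUM_CUTOFF` and `to_qualitative` are assumptions made for illustration; the text does not specify any thresholds.

```python
from enum import Enum


class Equivalence(Enum):
    NONE = "no equivalence"
    LOW = "low equivalence"
    MEDIUM = "medium equivalence"
    HIGH = "high equivalence"


# Hypothetical cut-offs (percent); not taken from the source text.
HIGH_CUTOFF = 90.0
MEDIUM_CUTOFF = 50.0


def to_qualitative(score_percent: float) -> Equivalence:
    """Map a quantitative equivalence (0 to 100%) to a qualitative level."""
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score_percent == 0.0:
        return Equivalence.NONE
    if score_percent >= HIGH_CUTOFF:
        return Equivalence.HIGH  # includes perfect matching at 100%
    if score_percent >= MEDIUM_CUTOFF:
        return Equivalence.MEDIUM
    return Equivalence.LOW
```

Keeping the cut-offs as named constants would let them be tuned per cancer type without touching the mapping logic.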
[00015] For example, these apparatuses and methods may determine a response of a cancer patient with a specific type of cancer to a plurality of cancer drugs specific to the cancer type of the patient, referred to as the standard of care drugs (for that cancer type, a class of cancer types, or all cancers) and may be limited to this collection of drugs. In some variations this "standard of care" output, indicating a predicted response from the standard of care drugs, may be rapidly determined using the marker processing apparatus, e.g., within 1-14 days of receiving a patient sample, and/or the patient-specific marker values, despite testing for all of the patient-specific
marker values in the superset, before continuing to determine responses or predicted responses from other therapies
outside of the standard of care.
[00016] Any of the methods described herein may optionally include testing a sample from the patient for a plurality of markers to identify patient-specific marker values. The marker values tested may include, but are not limited to, genomic events such as single nucleotide variants (SNVs), insertions and deletions (InDels), and copy number variants (CNVs). They may also include specific structural variants (SVs) such as translocations, microsatellite instability (MSI) and protein expression levels. The markers may be measured using appropriate technologies including Next Generation Sequencing (NGS), Immunohistochemistry (IHC), Polymerase Chain Reaction (PCR) and fluorescent in-situ hybridization (FISH). In one example, in reference to colon cancer, the markers tested to determine the response to a set of standard of care drugs for colon cancer may comprise markers for at least 14 genes, including APC, BRAF, DPYD, EGFR, MET, KRAS, NRAS, PIK3CA, PTEN, SMAD4, UGT1A1, TS, TOP1 and ERCC1, and microsatellite instability (MSI). In this exemplary embodiment, 11 genes (each gene may be covered by one or more markers) may be examined to determine patient-specific marker values for markers spanning these genes using an NGS panel (APC, BRAF, DPYD, EGFR, MET, KRAS, NRAS, PIK3CA, PTEN, SMAD4, UGT1A1), markers for protein expression levels for TS, TOP1 and ERCC1 may be measured via IHC, and MSI may be measured via PCR. Thus, in this example the markers may have known reference values, which may be part of a larger library of reference marker values that are linked (in the library) to a predetermined therapeutic effect. As described herein, the methods may be used to determine, from the entire set of patient-specific marker values, a level of equivalence of the patient-specific marker values for these markers to these reference marker values. A marker processing apparatus may be used to determine a level of equivalence for patient-specific marker values for the standard of care markers, and the level(s) of equivalence to reference marker values may be used to significantly more accurately weigh possible effects (e.g., predicted outcomes) of one or more of these standard of care drugs on that specific patient. In general, the methods and apparatuses described herein provide for much more than matching of patient-specific marker values (patient values) to known reference values for one or more markers; instead, they provide a nuanced level of equivalence (e.g., high, medium, low, none) that improves their predictive ability well beyond what is currently available.
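The colon cancer example above assigns each marker to a measurement technology. A minimal sketch of that assignment as a lookup table is shown below; the gene and assay names come from the text, while the names `COLON_SOC_PANEL`, `assay_for` and `genes_in_panel` are hypothetical and chosen only for illustration.

```python
# Markers of the exemplary colon cancer standard-of-care panel, grouped by the
# technology used to measure them (as listed in the text).
COLON_SOC_PANEL = {
    "NGS": ["APC", "BRAF", "DPYD", "EGFR", "MET", "KRAS", "NRAS",
            "PIK3CA", "PTEN", "SMAD4", "UGT1A1"],
    "IHC": ["TS", "TOP1", "ERCC1"],  # protein expression levels
    "PCR": ["MSI"],                  # microsatellite instability (not a gene)
}


def assay_for(marker: str) -> str:
    """Return the technology used to measure a given marker."""
    for assay, markers in COLON_SOC_PANEL.items():
        if marker in markers:
            return assay
    raise KeyError(f"{marker} is not part of the colon cancer panel")


def genes_in_panel() -> list:
    """All genes covered by the panel (NGS and IHC markers; MSI excluded)."""
    return COLON_SOC_PANEL["NGS"] + COLON_SOC_PANEL["IHC"]
```

The 11 NGS genes and 3 IHC genes together give the 14 genes mentioned in the example.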
[00017] In another example, an NGS panel may be designed to include genomic regions from a set of 152 genes that modulate the response to chemotherapies and targeted therapies, or modulate the metabolism of some of these drugs, or impact prognosis and disease progression. Further, in this exemplary embodiment, the regions included in the panel may be optimized to cover all genic regions for tumor suppressor genes, all regions that cover variants from one version of the Catalogue Of Somatic Mutations In Cancer (COSMIC) database for oncogenes, regions that cover known translocations, regions that optimize copy number detection and regions that contain germline mutations known to impact disease metabolism or prognosis. For each marker tested using IHC, the antibody may be designed and the protocol standardized to optimize the detection of protein expression from the patient sample.
[00018] Any appropriate patient sample may be used. For example, a sample drawn from the patient may be fresh tissue extracted from the area of the cancer via a biopsy, and/or stored in the form of formalin-fixed, paraffin-embedded (FFPE) tissue. On one hand, FFPE tissues lend themselves to long-term storage of the sample. On the other hand, the quality of DNA in FFPE tissues is highly variable and poses significant challenges to various aspects of the methods of the invention. In particular, the steps of the lab protocol of the NGS panel are standardized to handle FFPE tissues with a wide range of quality. In one exemplary embodiment, the protocol includes in-solution pulldown using SureSelect XT2 RNA baits and sequencing using an Illumina sequencing instrument such as MiSeq,
NextSeq or HiSeq. As described herein, the lab protocol may be optimized to handle samples that contain at least
20% of tumor content and at least 200ng of input tumor DNA.
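The sample requirements stated above (at least 20% tumor content and at least 200 ng of input tumor DNA) could be checked before a sample enters the lab protocol. The following is only a sketch; the function and constant names are hypothetical.

```python
MIN_TUMOR_FRACTION = 0.20  # at least 20% tumor content (from the text)
MIN_INPUT_DNA_NG = 200.0   # at least 200 ng of input tumor DNA (from the text)


def sample_meets_input_requirements(tumor_fraction: float, dna_ng: float) -> bool:
    """Return True if the sample satisfies both stated minimums."""
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    if dna_ng < 0.0:
        raise ValueError("dna_ng must be non-negative")
    return tumor_fraction >= MIN_TUMOR_FRACTION and dna_ng >= MIN_INPUT_DNA_NG
```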
[00019] The raw data generated from the lab for the NGS panel may be in the form of sequence reads, 2x151 bp in length in one exemplary embodiment, and are transformed to patient-specific marker values by performing alignment, SNP detection, copy number calling, translocation detection, and Quality Checks (QC), which may be executed as part of a patient-specific marker value generation procedure. In one exemplary embodiment, these algorithms are executed by a computer processor. In another embodiment, these steps may be executed by a computer processor in tandem with a graphics processor, resulting in overall speed improvements of up to 10x.
[00020] The patient-specific marker value generation methods may include pre-alignment filtering steps, such as trimming low quality bases and using contaminant databases to filter out noise. An alignment algorithm may use a Burrows-Wheeler Transform (BWT) based index search followed by a Dynamic Programming (DP) method to find an optimal match for each read. It may include steps for seeding matches from the read, aggregating matches to identify candidate regions, performing a banded DP, mate rescue, split read alignment, and local realignment. The SNV detection may use an iterative binomial caller and include a framework to specify complex Boolean filters based on properties of reads near the location of the SNP being called. The copy number calling method (steps) may involve creating GC-corrected normalized coverages for the normal profile, in one embodiment by computing an iterative average for each region. Copy number calls made at the exon level may be summarized to the gene level. The translocation detection may use split read alignment to identify known translocations. The patient-specific marker value generation steps may also include QC checks to determine sample contamination, sample degradation and sample mix-up using the SNV calls. The patient-specific marker value generation steps may also include checks on the number of novel SNP calls and the number of copy number calls to QC the algorithm execution, etc. The QC results may be used to pass or reject the patient-specific marker values generated by the pipeline for the patient sample.
[00021] For markers measured by IHC, the scoring criteria and cut-offs used may be standardized separately for each IHC marker based on published literature and vendor-provided catalogues.
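The SNV detection step described in paragraph [00020] refers to an iterative binomial caller without giving its details. As a simplified, single-pass sketch of the underlying idea (not the iterative caller itself), the Python below calls a variant at a position when the number of reads supporting the alternate base is unlikely to arise from sequencing error alone under a binomial model. The error rate, the significance threshold and the function names are assumptions made for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed directly.

    Adequate for modest read depths; very deep coverage would call for a
    numerically robust statistics library instead.
    """
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def call_snv(depth: int, alt_reads: int,
             error_rate: float = 0.01, alpha: float = 1e-6) -> bool:
    """Call a variant if alt_reads is unlikely under sequencing error alone.

    error_rate and alpha are hypothetical defaults, not values from the text.
    """
    if depth <= 0 or not 0 <= alt_reads <= depth:
        raise ValueError("require depth > 0 and 0 <= alt_reads <= depth")
    return binomial_tail(depth, alt_reads, error_rate) < alpha
```

An iterative variant of this idea might re-estimate the error rate from positions not called in the previous pass and repeat the test until the calls stabilize; that refinement is omitted here.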
[00022] Any of the methods described herein may include entering the patient-specific marker values into the marker processing apparatus. The marker processing apparatus typically includes a processor (e.g., computer processor), one or more inputs (controls, e.g., keyboards, dials, buttons, touchscreens, etc.) and one or more outputs (e.g., screens, connections to screens, etc.). The marker processing apparatus may be software, hardware or firmware. The marker processing apparatus typically includes or is able to read from a database of reference markers, including reference marker values that are linked to functional therapeutic outcome. For example, the marker processing apparatus may include a manually curated database of reference markers and their functional significance. Each reference marker may be annotated, e.g., using peer-reviewed journal literature, as having a Loss of Function, a Gain of Function or an Unknown Functional impact on the gene that it falls in.
[00023] A marker processing apparatus may also include standard of care logic or rules, called the SOC rules, that may relate collections of reference markers to the expected response of the SOC drugs for a specific tissue. The SOC rules may be manually curated using information from disparate peer-reviewed literature. Each rule may be recorded as an expression that quantifies the response of a specific drug in the presence of a marker M in the patient's sample: Response (drug D | marker M causing functional significance F) = R based on literature evidence L.
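The rule expression above has five parts (D, M, F, R and L). One way such a rule might be recorded is sketched below; the class name `SOCRule`, its field names, and the placeholder drug, marker and evidence strings are hypothetical and do not represent curated clinical content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SOCRule:
    """Response(drug D | marker M causing functional significance F) = R, evidence L."""
    drug: str                     # D: a specific drug or a drug class
    marker: str                   # M: a variant or variant class
    functional_significance: str  # F: "LOF", "GOF" or "UNKNOWN"
    response: str                 # R: "sensitive" or "resistant"
    evidence: str                 # L: literature reference supporting the rule


def matching_rules(rules, marker, functional_significance):
    """Return the rules that apply to a marker with a given significance."""
    return [r for r in rules
            if r.marker == marker
            and r.functional_significance == functional_significance]


# Placeholder entries for illustration only.
EXAMPLE_RULES = [
    SOCRule("drug_A", "GENE1:any_variant", "LOF", "resistant", "placeholder citation 1"),
    SOCRule("drug_B", "GENE2:variant_X", "GOF", "sensitive", "placeholder citation 2"),
]
```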
[00024] The marker can be as specific as a single variant, for example, the V600E variant in the BRAF gene, or as general as any Loss of Function variant in a gene (e.g., any variant in the SMAD4 gene that causes a LOF), or have some intermediate level of generality (e.g., any variant in exons 23, 24 or 25 of the ALK gene that causes a GOF). In addition, it can also be the negative of a variant or variant class (e.g., any variant in the EGFR gene that does NOT cause a GOF) or a Boolean combination of a plurality of variant classes for the same gene or different genes with specific or generic exceptions (e.g., any variant in the EGFR gene except T790M that causes a GOF, or any variant in XXX gene that is NOT an in-frame insertion in exon 20 and causes a GOF). The drug reference in the SOC rule may include specific drugs (e.g., everolimus) or a class of drugs (e.g., mTOR inhibitors). The functional significance may include loss of function (LOF), gain of function (GOF) or unknown function. The response value for a drug could be sensitive or resistant.
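The marker classes described above, from a single variant to Boolean combinations with exceptions, can be sketched as composable predicates. This is an illustrative sketch only: the `Variant` fields and the helper names (`in_gene`, `named`, `in_exons`, `causes`, `all_of`, `negate`) are assumptions, not part of the described apparatus.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Variant:
    gene: str
    name: str          # e.g. "V600E"
    exon: int
    significance: str  # "LOF", "GOF" or "UNKNOWN"


Predicate = Callable[[Variant], bool]


def in_gene(gene: str) -> Predicate:
    return lambda v: v.gene == gene


def named(name: str) -> Predicate:
    return lambda v: v.name == name


def in_exons(*exons: int) -> Predicate:
    return lambda v: v.exon in exons


def causes(significance: str) -> Predicate:
    return lambda v: v.significance == significance


def all_of(*preds: Predicate) -> Predicate:
    return lambda v: all(p(v) for p in preds)


def negate(pred: Predicate) -> Predicate:
    return lambda v: not pred(v)


# Marker classes taken from the examples in the text.
braf_v600e = all_of(in_gene("BRAF"), named("V600E"))
smad4_any_lof = all_of(in_gene("SMAD4"), causes("LOF"))
alk_exons_gof = all_of(in_gene("ALK"), in_exons(23, 24, 25), causes("GOF"))
egfr_not_gof = all_of(in_gene("EGFR"), negate(causes("GOF")))
egfr_gof_except_t790m = all_of(in_gene("EGFR"), negate(named("T790M")), causes("GOF"))
```

Because every marker class is just a function from a variant to a Boolean, a specific rule and a generic rule can be evaluated against the same patient-specific marker value in the same way.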
[00025] As mentioned, after entering the patient-specific marker values into the marker processing apparatus, the marker processing apparatus may be used to determine an equivalence level for the subset of patient-specific marker values relative to the reference SOC marker values based on equivalence rules. For each patient-specific marker measured with a specific value, the marker processing apparatus may evaluate its equivalence with any of the reference SOC marker values and may output an equivalence level, such as with high, medium or low confidence. The apparatus may determine how to apply the equivalence rules, dependent on the type of the patient-specific marker value, and may annotate the functional significance of the patient-specific marker value based on the functional significance of the reference marker value. In one exemplary embodiment of an equivalence rule, a patient-specific marker value that indicates a premature truncation in a Tumor Suppressor gene is equivalent to a reference marker with high confidence if it is an exact match, or if the reference marker is a premature stop, a frameshift variant or an Exonic Splicing Silencer (ESS) resulting in a premature stop downstream of the patient-specific marker value, and is annotated with a LOF functional significance if such a reference marker is found.
[00026] In this aspect of the invention, the marker processing apparatus may then execute the SOC rules based on the functional significance of the patient-specific marker values derived using equivalence to the reference marker values. The marker processing apparatus may execute only the most specific rule for each drug, or may execute all possible rules based on the patient-specific marker values. Once all the rules have been fired, the apparatus may provide an overall recommendation for each drug. Where the responses predicted by individual SOC rules for a drug are all consistent, the overall recommendation for the drug may be the same as the individual response prediction. Where the different SOC rules predict contradicting responses, the apparatus may resolve the apparent conflict. In some variations, the apparatus prepares the conflict (e.g., the marker, the relevant patient-specific marker values and/or their levels of equivalence to reference marker values) and presents the conflict for manual intervention to resolve the conflict based on the literature evidence associated with each individual SOC rule. Alternatively or additionally, the apparatus may resolve the conflict automatically by applying the weights from the levels of equivalence and/or by applying a SOC rule. The resolved outcome may be recorded in the system as a new specific SOC rule. If the conflict cannot be resolved, the drug response may be called Inconclusive.
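A minimal sketch of the automatic part of this conflict handling is given below, assuming each fired rule contributes its predicted response together with the equivalence level of the patient-specific marker value that triggered it. The numeric weights and the names `EQUIVALENCE_WEIGHT` and `overall_recommendation` are hypothetical; the text states only that weights from the levels of equivalence may be applied.

```python
# Hypothetical weights per equivalence level; the source gives no numbers.
EQUIVALENCE_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.25}


def overall_recommendation(fired):
    """Combine per-rule predictions for one drug.

    fired: list of (response, equivalence_level) pairs, one per fired SOC rule,
           where response is "sensitive" or "resistant".
    """
    if not fired:
        return "no rule fired"
    totals = {}
    for response, level in fired:
        totals[response] = totals.get(response, 0.0) + EQUIVALENCE_WEIGHT[level]
    if len(totals) == 1:
        # All fired rules agree, so the overall call is the individual call.
        return next(iter(totals))
    ranked = sorted(totals.values(), reverse=True)
    if ranked[0] > ranked[1]:
        # Conflict resolved automatically in favor of the heavier side.
        return max(totals, key=totals.get)
    return "inconclusive"  # tie: left for manual review
```

In this sketch a tie is reported as inconclusive rather than guessed, which mirrors the fallback of presenting the conflict for manual intervention.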
[00027] The marker processing apparatus may thus be used to predict the estimated drug response given the unique patient-specific marker values for the standard of care set of drugs for a patient sample. The estimated drug response for each drug may vary between Enhanced Response for drugs that are found sensitive and Poor Response for drugs that are found resistant. Conflicting SOC rules may be resolved to predict the response of the drug as either Limited Response or Inconclusive. The SOC drugs for which none of the SOC rules were fired may have a Standard Response for the patient, and may be reported thus. The marker processing apparatus may further be used to generate the patient's report automatically with minimal manual intervention, needed only when a conflict arises.
The report may include the estimated response of each of the standard of care drugs for the patient sample, along with the details of the patient-specific marker values that were measured and used to estimate the responses.
[00028] In some variations, the response of a cancer patient with a specific type of cancer to a plurality of cancer drugs not specific to the cancer type of the patient, referred to as off-label drugs, and to a plurality of clinical trials for the specific type of cancer is determined using the marker processing apparatus with the object of determining additional therapies that may be applicable to the patient.
[00029] For example, the marker processing apparatus may further comprise a collection of cancer pathways that provide, for each gene in the panel above, a list of targetable upstream or downstream genes from the pathways. The apparatus may also comprise a plurality of cancer drugs from different tissues and clinical trials annotated with the target gene and additional conditions of applicability.
[00030] The marker processing apparatus may evaluate the equivalence of the patient-specific marker values with the reference marker values to predict the functional significance of the patient-specific marker values as causing Loss of Function or Gain of Function. The patient-specific marker values that are not found equivalent to the curated markers with known functional significance may be further categorized using bioinformatics predictions and database lookup.
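As one concrete illustration of such an equivalence evaluation, the premature truncation rule given earlier for Tumor Suppressor genes can be sketched as follows. The sketch assumes that each truncating variant is described by the protein position at which translation stops and that "downstream" means a larger position; the class, field and function names, and the gene name in the usage note, are hypothetical.

```python
from dataclasses import dataclass

# Reference marker kinds that lead to a premature stop, per the rule in the text.
TRUNCATING_KINDS = {"premature_stop", "frameshift", "ess"}


@dataclass(frozen=True)
class TruncatingVariant:
    gene: str
    kind: str           # "premature_stop", "frameshift" or "ess"
    stop_position: int  # protein position where translation stops (assumption)


def truncation_equivalence(patient, references, tumor_suppressors):
    """Return (equivalence_level, functional_significance) for a patient variant.

    High equivalence, annotated LOF, if a reference marker in the same tumor
    suppressor gene is an exact match or truncates the protein downstream of
    the patient's truncation; otherwise ("none", None).
    """
    if patient.gene not in tumor_suppressors:
        return ("none", None)
    for ref in references:
        if ref.gene != patient.gene:
            continue
        exact_match = ref == patient
        stops_downstream = (ref.kind in TRUNCATING_KINDS
                            and ref.stop_position > patient.stop_position)
        if exact_match or stops_downstream:
            return ("high", "LOF")
    return ("none", None)
```

For example, with a reference stop at position 300 in a hypothetical tumor suppressor gene "TSG1", a patient truncation at position 120 would be reported as ("high", "LOF"), since the patient's protein is cut short even earlier than the known loss-of-function reference.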
[00031] One step may involve using the marker processing apparatus to prioritize the rest of the variants and shortlist patient-specific marker values that are expected to have a damaging functional significance on targetable genes in cancer pathways. This step may involve using the tools provided by the marker processing apparatus including, but not limited to: a tool that enables quick assessment of the cleanliness of the alignment in the neighborhood of the variant location; a tool that displays the copy number distribution by superimposing the copy number calls of the sample on the distribution of calls from each of the panel's regions over the samples in the system; and a tool to present all the relevant information about the marker in a concise visualization.
[00032] Another step may involve using an automatically created list of drugs approved in other tissues and clinical trials for the specific tissue to identify additional therapies that may be applicable to the patient, given the patient-specific marker values. Literature evidence to support the applicability of the drug or trial to the patient may be recorded in the marker processing apparatus for reuse in subsequent patient samples.
[00033] The marker processing apparatus may further generate the patient's report automatically. In addition to the Standard of Care drugs, the report may include additional off-label drugs and clinical trials applicable to the patient sample, along with the details of the patient-specific marker values that were measured and used to estimate the responses.
[00034] For example, described herein are methods of estimating a patient's response to a plurality of cancer drugs using a marker processing apparatus, the method comprising: entering a plurality of patient-specific marker values into the marker processing apparatus, wherein the plurality of patient-specific marker values were identified by testing a patient sample for a plurality of markers to identify the patient-specific marker values; using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values in a library of reference marker values for markers associated with the plurality of cancer drugs, wherein the equivalence level includes: high equivalence, low equivalence or no equivalence; and outputting, from the marker processing apparatus, an estimated drug response to a plurality of cancer drugs based on the determined equivalence level for the at least some of the patient-specific marker values.
[00035] In any of these methods, the method may further comprise using the marker processing apparatus to
compare at least some of the patient-specific marker values to the library of reference marker values for markers associated with the plurality of cancer drugs.
[00036] Any of these methods may include using the marker processing apparatus to determine the estimated drug response to each of the cancer drugs in the plurality of cancer drugs by, for a marker associated with each drug in the marker processing apparatus, scaling a reference drug response associated with a reference marker value by the equivalence level for the patient-specific marker values associated with that marker.
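The scaling step described above can be sketched numerically as follows, assuming the reference drug response is encoded as a signed score (positive for sensitive, negative for resistant) and each equivalence level is assigned a factor between 0 and 1. Both the encoding and the factor values are assumptions made for illustration; the text does not specify them.

```python
# Hypothetical scaling factors per equivalence level.
EQUIVALENCE_SCALE = {"high": 1.0, "medium": 0.5, "low": 0.25, "none": 0.0}


def estimated_response(reference_response: float, equivalence_level: str) -> float:
    """Scale a reference drug response by the patient's equivalence level.

    reference_response: signed score, e.g. +1.0 sensitive, -1.0 resistant
    (an assumed encoding).
    """
    return reference_response * EQUIVALENCE_SCALE[equivalence_level]
```

Under these assumptions a perfectly matching marker carries the full reference response, while a marker with no equivalence contributes nothing to the estimate for that drug.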
[00037] Using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values may comprise using the marker processing apparatus to compare a subset of the patient-specific marker values to the library of reference marker values for markers associated with the plurality of cancer drugs.
[00038] For example, using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values may comprise using the marker processing apparatus to compare a subset of the patient-specific marker values for markers correlated to a type of cancer.
[00039] Any of these methods may also include testing a sample from the patient for the plurality of markers to identify the patient-specific marker values. As described above, any appropriate testing (NGS, immunohistochemistry, FISH, etc.) may be used. For example, the plurality of markers may include one or more of: next generation sequencing (NGS) markers, immunohistochemical (IHC) markers, or fluorescent in-situ hybridization (FISH) markers. The methods and apparatuses may also be configured to include selecting, in the marker processing apparatus, a standard of care set of drugs based on a cancer type, and wherein using the marker processing apparatus to compare the patient-specific marker values to the library of reference marker values comprises comparing the patient-specific marker values to the library of reference marker values associated with the selected standard of care set of drugs in the marker processing apparatus.
[00040] Any of these methods may also include selecting, in the marker processing apparatus, a standard of care set of drugs based on a cancer type, and wherein using the marker processing apparatus to compare the patient-specific marker values to the library of reference marker values comprises comparing the patient-specific marker values to the library of reference marker values associated with the selected standard of care set of drugs in the marker processing apparatus, and further wherein using the marker processing apparatus to determine the equivalence level for the patient-specific markers relative to reference markers in the library of reference marker values comprises determining the equivalence level for the patient-specific markers relative to the reference markers based on a set of equivalence rules for the selected standard of care set of drugs in the marker processing apparatus.
[00041] In general, using the marker processing apparatus to determine an equivalence level may comprise applying a set of equivalence rules when comparing the patient-specific markers against the reference markers in the library of reference marker values to set the equivalence level to a level selected from the set of: high equivalence, medium equivalence, low equivalence, or no equivalence.
[00042] Outputting may include presenting the output of the marker processing apparatus to a user. For example, outputting the estimated drug responses may comprise indicating, for each cancer drug in the plurality of cancer drugs, one or more of: enhanced response, standard response, and poor response.
[00043] As mentioned above, any of the methods described herein may include manually curating the reference marker values and the rules from published literature sources.
[00044] A method of predicting a patient's response to a plurality of cancer drugs using a marker processing
apparatus may include: testing a sample from the patient for a plurality of markers to identify patient-specific
marker values; entering the patient-specific marker values into the marker processing apparatus; selecting, in the
marker processing apparatus, a standard of care set of drugs based on a cancer type; using the marker processing
apparatus to compare a subset of the patient-specific marker values to a library of reference marker values for
markers associated with the selected standard of care set of drugs, and determine an equivalence level for the subset
of patient-specific marker values relative to reference marker values based on a set of equivalence rules; and
outputting from the marker processing apparatus, an estimated drug response for the standard of care set of drugs
based on the determined equivalence level.
[00045] In any of these methods, after quickly performing a standard-of-care check for a subset of standard-of-care drugs as described, a full screen of all of the markers (in some variations excluding the standard of care markers already examined) may be performed against the rest of the reference library marker values to determine which, if any, therapies may be used to treat the patient based on the patient-specific marker values.
[00046] For example, any of these methods may include, after outputting the estimated drug response: using the marker processing apparatus to compare all of the rest of the patient-specific marker values that were not part of the subset of patient-specific marker values to the library of reference marker values and determine an equivalence level for all of the rest of the patient-specific marker values relative to reference marker values in the library of reference marker values; and outputting an estimated drug response for a second plurality of drugs based on the determined equivalence level for all of the rest of the patient-specific marker values.
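The two-stage screen described above (a standard-of-care subset first, then all remaining markers) might be sketched as a simple partition of the patient-specific marker values. This is a minimal sketch: the SOC_MARKERS table, its contents, and the function name split_for_two_stage_screen are hypothetical and serve only to show the partitioning step.

```python
# Illustrative only: which markers belong to a standard-of-care (SOC) panel
# for a given cancer type is an assumption made for this sketch.
SOC_MARKERS = {"lung cancer": {"EGFR", "ALK"}}


def split_for_two_stage_screen(patient_values, cancer_type):
    """Partition patient marker values into the SOC subset and the remainder.

    patient_values: dict mapping marker name -> patient-specific marker value.
    Returns (soc_subset, remainder); the remainder is screened in stage two.
    """
    soc_markers = SOC_MARKERS.get(cancer_type, set())
    soc_subset = {m: v for m, v in patient_values.items() if m in soc_markers}
    remainder = {m: v for m, v in patient_values.items() if m not in soc_markers}
    return soc_subset, remainder


# Usage with placeholder marker values.
patient = {"EGFR": "variant_1", "TP53": "variant_2", "ALK": "variant_3"}
soc_subset, remainder = split_for_two_stage_screen(patient, "lung cancer")
```

Keeping the two sets disjoint reflects the variation in which the standard of care markers already examined are excluded from the full screen.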
[00047] The full analysis may be performed after the standard of care analysis. For example, the standard of care screen may be performed within 1-14 days (e.g., within less than 14 days, less than 13 days, less than 12 days, less than 11 days, less than 10 days, less than 9 days, less than 8 days, less than 7 days, less than 6 days, less than 5 days, less than 4 days, less than 3 days, etc.) from either receiving the patient sample (and therefore determining the patient-specific marker values) or from receiving the patient-specific marker values. Following this, the method may include looking at the remaining markers as described above, and providing output within another three weeks (e.g., within less than 3 weeks, less than 2.5 weeks, less than two weeks, less than 13 days, less than 12 days, less than 11 days, less than 10 days, less than 9 days, less than 8 days, less than 7 days, less than 6 days, less than 5 days, less than 4 days, less than 3 days, etc.) following the SOC results/output. For example, outputting the estimated drug response based on the determined equivalence level for the standard of care set of drugs may be performed within a first time period, and outputting the estimated drug response for the second plurality of drugs may be performed within a second time period following the first time period by at least 4 days.
[00048] In general, in the methods described herein, selecting the standard of care set of drugs based on the cancer type may comprise selecting a standard of care set of drugs for one of: breast cancer, lung cancer, colon cancer, and melanoma. Outputting the estimated drug response may comprise indicating one or more of: enhanced response, standard response, and poor response.
[00049] As mentioned, any of these methods may include resolving conflicts between two or more patient-specific marker values associated with one or more markers for a same drug in the standard of care set of drugs. The step of using the marker processing apparatus to determine an equivalence level may include applying the set of equivalence rules to determine if the patient-specific marker value is identical to the reference marker value, functionally similar to the reference marker value, or structurally similar to the reference marker value.
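One way such equivalence rules might be encoded is sketched below. The mapping of "identical" to high, "functionally similar" to medium, and "structurally similar" to low equivalence is an assumption made for illustration, as are the MarkerValue fields and the function name equivalence_level; actual rules would be curated per marker and per drug.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkerValue:
    """Hypothetical, simplified representation of a marker value."""
    gene: str
    variant: str   # e.g., a protein-level change
    effect: str    # "gain of function", "loss of function", or "unknown"
    domain: str    # protein region in which the variant falls


def equivalence_level(patient, reference):
    """Apply three ordered rules; the first rule that matches sets the level."""
    if patient.gene != reference.gene:
        return "none"
    if patient.variant == reference.variant:
        return "high"      # identical to the reference marker value
    if patient.effect != "unknown" and patient.effect == reference.effect:
        return "medium"    # functionally similar (assumed mapping)
    if patient.domain == reference.domain:
        return "low"       # structurally similar (assumed mapping)
    return "none"


# Usage with placeholder gene and variant names.
reference = MarkerValue("GENE_A", "p.A10T", "gain of function", "kinase")
same = MarkerValue("GENE_A", "p.A10T", "gain of function", "kinase")
functional = MarkerValue("GENE_A", "p.A12V", "gain of function", "other")
structural = MarkerValue("GENE_A", "p.G15S", "unknown", "kinase")
unrelated = MarkerValue("GENE_B", "p.A10T", "gain of function", "kinase")
```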
[00050] In general, any of these methods may include modifying the equivalence rules (e.g., learning). For example, any of these methods may include modifying the set of equivalence rules based on a determined equivalence level.
[00051] The methods may include using the marker processing apparatus to determine the estimated drug response to each of the standard of care cancer drugs in the standard of care set of cancer drugs by scaling a reference drug response associated with a reference marker value by the equivalence level determined for a patient-specific marker value related to that reference marker value.
[00052] In any of these methods, the plurality of markers may include one or more of: next generation sequencing (NGS) markers, immunohistochemical (IHC) markers, or fluorescent in-situ hybridization (FISH) markers.
[00053] As mentioned, any of the marker processing apparatuses described herein may include a database of reference markers, SOC rules (e.g., encoded based on literature values), and definitions of equivalence rules. In general, these apparatuses are configured to estimate equivalence (determine equivalence levels) to predict the function of patient-specific marker values.
[00054] As mentioned, the plurality of markers may include next generation sequencing (NGS) markers, immunohistochemical (IHC) markers, or fluorescent in-situ hybridization (FISH) markers, or any other marker.
[00055] In any of these methods, using the marker processing apparatus to determine an equivalence level may include applying a set of equivalence rules when comparing the patient-specific markers against the reference markers in the library of reference marker values to set the equivalence level to a level selected from the set of: high equivalence, medium equivalence, low equivalence, or no equivalence.
[00056] Any of the methods described herein may be configured to examine all and/or a subset of reference marker values against patient-specific markers. For example, a method of estimating a patient's response to a plurality of cancer drugs or therapies using a marker processing apparatus may include: entering a plurality of patient-specific marker values into the marker processing apparatus, wherein the plurality of patient-specific marker values were identified by testing a patient sample for a plurality of markers to identify the patient-specific marker values; using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values in a library of reference marker values, wherein the equivalence level includes: high equivalence, low equivalence or no equivalence; using the marker processing apparatus to determine patient-specific marker values that are likely benign from those having no equivalence to a reference marker value; using the marker processing apparatus to automatically prioritize patient-specific marker values; identifying drugs or therapies related to markers corresponding to patient-specific marker values having equivalence to a reference marker value; and outputting an estimated response to the identified drug or therapy based on the determined equivalence level of the patient-specific marker values.
[00057] Using the marker processing apparatus to determine an equivalence level may comprise comparing the at least some of the patient-specific marker values to the reference marker values in the library and applying equivalence rules to determine the equivalence levels.
[00058] Any of these methods may include associating patient-specific marker values that are equivalent to a reference marker value with high equivalence or low equivalence with a marker characteristic from the reference marker value.
[00059] Any of these methods may include associating patient-specific marker values that are equivalent to a
reference marker value with high equivalence or low equivalence with a marker characteristic from the reference
marker value and comprising one of: gain of function or loss of function.
[00060] These methods may include associating patient-specific marker values that are equivalent to a
reference marker value with high equivalence or low equivalence with a drug or therapy effect that is already
associated with the reference marker value.
[00061] These methods may include, after using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values in a library of reference marker values, using the marker processing apparatus to find a match from a Catalogue Of Somatic Mutations In Cancer (COSMIC) database of variants for patient-specific marker values having no equivalence to a reference marker value. The methods may further include associating patient-specific marker values having no equivalence to a reference marker value but with a match to the COSMIC database with the matched COSMIC variant. After using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values in a library of reference marker values and after using the marker processing apparatus to find matches from the COSMIC database of variants, the method may use the marker processing apparatus to apply a structural rule to patient-specific marker values having no equivalence to a reference marker value and no match to the COSMIC database of variants, to identify a structural defect in a marker referenced by the patient-specific marker value.
[00062] Any of these methods may also include, after using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values in a library of reference marker values, using the marker processing apparatus to apply a structural rule to patient-specific marker values having no equivalence to a reference marker value to identify a structural defect in a marker referenced by the patient-specific marker value.
[00063] For example, any of these methods may also include, after using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values in a library of reference marker values, using the marker processing apparatus to apply structural rules to patient-specific marker values having no equivalence to a reference marker value to identify a structural defect in a marker referenced by the patient-specific marker value, wherein the structural defect comprises one of: low minimum allele frequency, protein shortening, and modification of a highly conserved region. The methods may also include associating patient-specific marker values having no equivalence to a reference marker value but with a structural defect, with the structural defect.
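The structural rules named above might, purely as a sketch, be expressed as threshold checks on a few per-variant annotations. The Variant fields, the thresholds RARE_ALLELE_FREQUENCY and CONSERVED_SCORE, and the function name structural_defects are all assumptions for illustration; the description does not give numeric cut-offs.

```python
from dataclasses import dataclass

# Hypothetical thresholds (assumed; the description gives no numeric values).
RARE_ALLELE_FREQUENCY = 0.01
CONSERVED_SCORE = 0.9


@dataclass
class Variant:
    """Hypothetical per-variant annotations used by the structural rules."""
    allele_frequency: float    # population allele frequency, 0..1
    shortens_protein: bool     # e.g., premature stop or frameshift
    conservation_score: float  # 0..1, higher means more conserved position


def structural_defects(variant):
    """Return the list of structural defects flagged for a variant."""
    defects = []
    if variant.allele_frequency < RARE_ALLELE_FREQUENCY:
        defects.append("low allele frequency")
    if variant.shortens_protein:
        defects.append("protein shortening")
    if variant.conservation_score >= CONSERVED_SCORE:
        defects.append("modification of highly conserved region")
    return defects


# Usage: a rare truncating variant versus a common, unremarkable one.
rare_truncating = Variant(0.001, True, 0.2)
common_neutral = Variant(0.30, False, 0.1)
```

An empty list here would correspond to a variant for which the structural rules do not indicate a defect.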
[00064] Also described herein are methods in which using the marker processing apparatus to determine patient-specific marker values that are likely benign from those having no equivalence to a reference marker value comprises identifying patient-specific marker values as likely benign where the patient-specific marker values have no equivalence to a reference marker value, do not have a match with a Catalogue Of Somatic Mutations In Cancer (COSMIC) database of variants, and are not indicated by a set of structural rules as having a structural defect.
[00065] In any of these methods, using the marker processing apparatus to automatically prioritize patient-specific marker values may comprise prioritizing patient-specific marker values having equivalence to a reference marker value before patient-specific marker values that have a match from a Catalogue Of Somatic Mutations In Cancer (COSMIC) database of variants, and before patient-specific marker values having a structural defect.
[00066] The step of identifying drugs or therapies related to markers corresponding to patient-specific marker
values having equivalence to the reference marker values may include identifying drugs or therapies when a marker
corresponding to one of the patient-specific marker values is a gene, and identifying a drug or therapy associated
with that gene or a gene downstream of the gene.
[00067] Any of these methods may include identifying drugs or therapies related to markers corresponding to the patient-specific marker values that match to the COSMIC database or have a structural defect in the marker.
[00068] In general, any of these methods may perform a pathway analysis to identify targetable genes based on
markers corresponding to patient-specific marker values having equivalence to the reference marker values.
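A pathway analysis of this kind might be sketched as a breadth-first walk over a directed graph of gene relationships, collecting any gene (the affected gene itself or one downstream of it) that has an associated drug. The graph, the gene and drug names, and the function name targetable_genes below are placeholders, not data from this description.

```python
from collections import deque


def targetable_genes(start_gene, downstream, drug_targets):
    """Return genes reachable from start_gene (inclusive) that have a drug.

    downstream:   dict mapping gene -> iterable of genes directly downstream.
    drug_targets: dict mapping gene -> list of drugs or therapies targeting it.
    Genes are returned in breadth-first order from the starting gene.
    """
    seen = {start_gene}
    queue = deque([start_gene])
    hits = []
    while queue:
        gene = queue.popleft()
        if gene in drug_targets:
            hits.append(gene)
        for nxt in downstream.get(gene, ()):
            if nxt not in seen:   # guard against cycles in the pathway graph
                seen.add(nxt)
                queue.append(nxt)
    return hits


# Usage with a made-up pathway: GENE_A -> GENE_B -> GENE_C, and GENE_C -> GENE_A.
pathway = {"GENE_A": ["GENE_B"], "GENE_B": ["GENE_C"], "GENE_C": ["GENE_A"]}
drugs = {"GENE_C": ["drug_x"]}
hits = targetable_genes("GENE_A", pathway, drugs)
```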
[00069] For example, described herein are methods of estimating a patient's response to a plurality of cancer drugs or therapies using a marker processing apparatus, the method comprising: entering a plurality of patient-specific marker values into the marker processing apparatus, wherein the plurality of patient-specific marker values were identified by testing a patient sample for a plurality of markers to identify the patient-specific marker values; using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values in a library of reference marker values, wherein the equivalence level includes: high equivalence, low equivalence or no equivalence, by comparing the patient-specific marker values to the library of reference marker values and applying equivalence rules to determine the equivalence level; using the marker processing apparatus to find a match from a Catalogue Of Somatic Mutations In Cancer (COSMIC) database of variants for patient-specific marker values having no equivalence to a reference marker value; using the marker processing apparatus to apply structural rules to patient-specific marker values having no equivalence to a reference marker value and no match with the COSMIC database of variants, to identify a structural defect in the marker; using the marker processing apparatus to determine patient-specific marker values that are likely benign because they have no equivalence to the reference marker, do not match with variants in the COSMIC database of markers, and do not have a structural defect in the marker; using the marker processing apparatus to automatically prioritize patient-specific marker values based first on patient-specific marker values having equivalence to a reference marker value, then on patient-specific marker values that match with the COSMIC database of variants, then based on any structural defect in the marker; identifying drugs or therapies related to the markers corresponding to patient-specific marker values; and outputting an estimated response to the identified drug or therapy.
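The cascade in the preceding paragraph (equivalence first, then a COSMIC match, then structural rules, with everything left over treated as likely benign) might be sketched as follows. The category labels, the PRIORITY ranks, and the function names classify and prioritize are assumptions for this sketch, and the COSMIC lookup and structural checks are represented only by boolean inputs rather than by real database queries.

```python
# Lower rank means higher priority (ordering follows the paragraph above).
PRIORITY = {
    "equivalent": 0,
    "cosmic_match": 1,
    "structural_defect": 2,
    "likely_benign": 3,
}


def classify(equivalence_level, in_cosmic, has_structural_defect):
    """Assign one category per patient-specific marker value."""
    if equivalence_level != "none":
        return "equivalent"
    if in_cosmic:
        return "cosmic_match"
    if has_structural_defect:
        return "structural_defect"
    return "likely_benign"


def prioritize(findings):
    """Sort (marker_name, category) pairs by category priority.

    sorted() is stable, so findings in the same category keep input order.
    """
    return sorted(findings, key=lambda finding: PRIORITY[finding[1]])


# Usage with placeholder marker names.
findings = [
    ("m1", classify("none", False, False)),
    ("m2", classify("none", False, True)),
    ("m3", classify("none", True, True)),
    ("m4", classify("low", False, False)),
]
ranked = prioritize(findings)
```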
[00070] Also described herein are methods for estimating a patient's response to a plurality of cancer drugs using a marker processing apparatus, the method comprising: (optionally) testing a patient sample for a plurality of markers to determine a plurality of patient-specific marker values; entering the patient-specific marker values into the marker processing apparatus; using the marker processing apparatus to determine an equivalence level for at least some of the patient-specific marker values relative to reference marker values in a library of reference marker values; outputting, from the marker processing apparatus, an estimated drug response for the plurality of cancer drugs based on the determined equivalence level for the at least some of the patient-specific marker values (which may optionally be in the form of a report); and modifying the marker processing apparatus based on the equivalence level of the patient-specific marker values.
[00071] As mentioned above, any of the methods described herein may include using the marker processing apparatus to determine an equivalence level by using the marker processing apparatus to perform one or more (or all) of: applying a set of equivalence rules when comparing the patient-specific markers against the reference
markers in the library of reference marker values; setting the equivalence level to a level selected from the set of: high equivalence, medium equivalence, low equivalence, or no equivalence; and estimating the functional significance of the patient-specific marker value based on the functional significance of its equivalent reference marker value.
[00072] Any of these methods may include selecting, in the marker processing apparatus, a first plurality of drugs based on cancer type, and using the marker processing apparatus to estimate the patient's response to the selected first plurality of cancer drugs based on the equivalence level of the at least some of the patient-specific marker values to the set of reference marker values with a known association to the response of the selected first plurality of cancer drugs.
[00073] In general, the method may include using the marker processing apparatus to determine an equivalence level for all of the patient-specific marker values relative to reference marker values in the library of reference marker values and determining, using the marker processing apparatus, a second plurality of cancer drugs estimated to have a favorable patient response based on the determined equivalence level for the patient-specific marker values.
[00074] The method may also include selecting the first plurality of drugs from amongst a set of approved drugs for the cancer type of the patient and recommended drugs for early stage treatment of the cancer type of the patient.
[00075] In a method of estimating a patient's response to a plurality of cancer drugs, selecting the first plurality of drugs based on the cancer type may comprise selecting the first plurality of drugs for one of: breast cancer, non-small cell lung cancer, colon cancer, melanoma, ovarian cancer and brain cancer.
[00076] In general, the marker processing apparatus may comprise a library of reference marker values, the library comprising an indicator of a functional significance of the reference marker values and an association of the reference marker values to a response of one or more cancer drugs of the selected first plurality of cancer drugs. For example, the library may include a matrix, listing, or any other appropriate data structure including reference marker values, and each reference marker value (note that individual markers may have multiple reference marker values in the database) may be linked to a cancer drug and a cancer drug response associated with the reference marker value. The associated cancer drug and response to that cancer drug (and multiple cancer drugs and responses may be linked to the same reference marker value) may be based on clinical data (including published results). The links between drugs, drug response and reference marker values may be two-way, meaning that drugs and/or drug responses may also connect back to particular reference marker values.
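A library with two-way links of the kind just described might be sketched with two indexes kept in step, one keyed by reference marker value and one keyed by drug. The class name ReferenceLibrary, its methods, and the example entries are hypothetical; a production library would also carry functional significance and literature provenance.

```python
from collections import defaultdict


class ReferenceLibrary:
    """Minimal sketch of a reference library with two-way links."""

    def __init__(self):
        self._by_marker = defaultdict(list)  # marker value -> [(drug, response)]
        self._by_drug = defaultdict(list)    # drug -> [(marker value, response)]

    def link(self, marker_value, drug, response):
        """Record one association in both directions."""
        self._by_marker[marker_value].append((drug, response))
        self._by_drug[drug].append((marker_value, response))

    def drugs_for(self, marker_value):
        """Drugs and responses linked to a reference marker value."""
        return list(self._by_marker.get(marker_value, []))

    def markers_for(self, drug):
        """Reference marker values and responses linked to a drug."""
        return list(self._by_drug.get(drug, []))


# Usage with placeholder names: one marker value linked to two drugs.
library = ReferenceLibrary()
library.link("GENE_A:p.A10T", "drug_x", "enhanced response")
library.link("GENE_A:p.A10T", "drug_y", "poor response")
library.link("GENE_B:p.G15S", "drug_x", "poor response")
```

Using `.get` in the lookups avoids creating empty entries in the indexes when an unknown key is queried.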
[00077] Thus, any of these methods may include using the library of reference marker values in the marker processing apparatus to estimate a response for each of the selected first plurality of cancer drugs based on the equivalence values of patient-specific marker values that are equivalent to reference marker values which refer to a drug from the first plurality of cancer drugs.
[00078] As mentioned, any of these methods and apparatuses may be configured to resolve conflicts when patient-specific marker values have an equivalence level (non-zero, or high, medium, or low equivalence) for multiple reference marker values. Either a single patient-specific marker value may match with multiple reference marker values (less likely) or different patient-specific marker values may match to multiple reference marker values. These conflicts may be resolved automatically or may be manually resolved with automatic assistance. Conflicts may be resolved using the equivalence level (e.g., weighting potential drug outcomes more heavily where there are higher equivalence values). The apparatus may 'learn' the outcome of the resolved conflicts (particularly when manually or partially manually resolved) and may therefore automatically resolve these conflicts based on the learned
resolution. Learning (e.g., progressively enhancing the components of the marker processing apparatus by modifying the marker processing apparatus) may include forming a new rule that the apparatus applies.
[00079] For example, in any of these methods, using the library of reference marker values in the marker processing apparatus to estimate a response may comprise using the library of reference marker values to estimate a single response for each of the selected first plurality of cancer drugs, by resolving conflicts in estimated responses based on the patient-specific marker values that are equivalent to reference marker values which refer to the same drug from the first plurality of cancer drugs.
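Conflict resolution weighted by equivalence level might be sketched as a weighted vote per drug, with ties deferred to manual review. The weights in LEVEL_WEIGHT, the tie-handling label, and the function name resolve_conflict are assumptions for illustration; the description only states that higher equivalence may be weighted more heavily.

```python
# Hypothetical vote weights per equivalence level (assumed).
LEVEL_WEIGHT = {"high": 3, "medium": 2, "low": 1}


def resolve_conflict(calls):
    """Reduce several per-marker calls for one drug to a single response.

    calls: list of (response_label, equivalence_level) pairs for the same drug.
    Returns the label with the greatest total weight, "needs manual review"
    on a tie, or None when no call has a non-zero equivalence level.
    """
    totals = {}
    for label, level in calls:
        weight = LEVEL_WEIGHT.get(level, 0)
        if weight == 0:
            continue  # "none" (or unknown) levels do not vote
        totals[label] = totals.get(label, 0) + weight
    if not totals:
        return None
    best = max(totals.values())
    winners = [label for label, total in totals.items() if total == best]
    return winners[0] if len(winners) == 1 else "needs manual review"


# Usage: one high-equivalence call outweighs two low-equivalence calls.
single = resolve_conflict(
    [("enhanced response", "high"), ("poor response", "low"), ("poor response", "low")]
)
```

A manually resolved tie could then be stored as a new rule, which is one way the learning described above might be realized.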
[00080] As mentioned, in general, the estimated response for each of the drugs (e.g., each of the first plurality of cancer drugs) may comprise one of: enhanced response, poor response or an intermediate response.
[00081] In general, the step of outputting from the marker processing apparatus the report on the first plurality of cancer drugs (e.g., the first set or the standard-of-care set) may be done quickly, e.g., outputting the report within 4 hours or less. A fuller report may follow within a few days or weeks, where the full report may predict a response from a larger set of potential therapies (e.g., drugs). For example, any of these methods may include determining the second plurality of cancer drugs estimated to have a favorable response from amongst a second set of drugs. The second set of drugs may include off-label drugs approved or recommended for cancer types other than the cancer type of the patient, and drugs in clinical trials. Any of these methods may also include using the marker processing apparatus to prioritize the patient-specific marker values and identify those that may be targetable by drugs with a favorable response for the patient's cancer and to identify targetable genes from cancer pathways based on those identified as targetable. For example, the second plurality of cancer drugs may be chosen from a database of clinical and pre-clinical studies published in peer reviewed literature.
[00082] Any of these methods may include prioritizing the patient-specific marker values. The step of using the marker processing apparatus to prioritize the patient-specific marker values may generally include automatically prioritizing by giving a higher priority to patient-specific marker values having a functional significance, giving a moderate priority to patient-specific marker values that have a match from a known public database of somatic variants, and giving a lower priority to patient-specific marker values having a damaging prediction based on a standard criterion selected from the group of: protein shortening criteria, conservation criteria, and allele frequency criteria.
[00083] As mentioned, any appropriate marker (and marker values) may be used. The plurality of markers may include markers for genomic events including single nucleotide variants (SNVs), insertions and deletions (InDels), copy number variants (CNVs), structural variants (SVs) including translocations, microsatellite instability (MSI) and protein expression levels.
[00084] In any of these methods a patient sample may be tested. Testing the patient sample may include testing a sample from fresh tissue or from formalin-fixed paraffin-embedded (FFPE) blocks, using: Next Generation Sequencing (NGS), Immunohistochemistry (IHC), Polymerase Chain Reaction (PCR) and fluorescent in-situ hybridization (FISH) to determine the plurality of marker values. Testing a patient sample for a plurality of markers to determine a plurality of patient-specific marker values may include using an automated patient-specific marker value generation pipeline.
[00085] The patient-specific marker value generation pipeline may include: aligning, SNP detection, copy number calling, translocation detection, and Quality Checks (QC). For example, the patient-specific marker value generation pipeline may include: aligning, SNP detection, copy number calling, translocation detection, and
Quality Checks (QC), using a Central Processing Unit (CPU) and a Graphical Processing Unit (GPU) to augment
the execution on the CPU.
[00086] As mentioned, any of these methods may include modifying the marker processing apparatus (e.g., learning) by, e.g., progressively enhancing the components of the marker processing apparatus. Modifying the marker processing apparatus may include enriching the library of reference marker values by patient-specific marker values found equivalent to reference marker values. Modifying the marker processing apparatus may include enriching the association of the library of reference marker values to the response of drugs when conflicts are encountered between patient-specific marker values to a same marker. Modifying the marker processing apparatus may include recording the patient-specific marker values and drug responses reported from the marker processing apparatus for future reference.
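The enrichment and recording steps above might be sketched as follows, with the library held as a plain dictionary from marker value to its linked (drug, response) pairs. The function name learn, the rule that any non-"none" equivalence triggers enrichment, and the choice to copy the reference associations unchanged are assumptions for illustration only.

```python
def learn(library, history, patient_value, reference_value, level):
    """Record a result and, if appropriate, enrich the reference library.

    library: dict mapping marker value -> list of (drug, response) pairs.
    history: list that accumulates every comparison for future reference.
    Returns True when a new entry was added to the library.
    """
    history.append((patient_value, reference_value, level))
    if level == "none" or reference_value not in library:
        return False
    if patient_value in library:
        return False  # already known; do not overwrite curated entries
    # Assumed policy: the new entry inherits the reference associations.
    library[patient_value] = list(library[reference_value])
    return True


# Usage with placeholder values.
library = {"GENE_A:p.A10T": [("drug_x", "enhanced response")]}
history = []
added = learn(library, history, "GENE_A:p.A12V", "GENE_A:p.A10T", "medium")
skipped = learn(library, history, "GENE_A:p.Q99R", "GENE_A:p.A10T", "none")
```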
[00087] In general, a marker processing apparatus may be configured to perform any of the methods described herein. For example, also described herein are marker processing apparatuses to estimate a patient's response to a plurality of cancer drugs, the apparatus comprising: an input to receive patient-specific marker values; a control to allow a user to select a standard of care set of drugs based on a cancer type; one or more computer readable media storing one or more sets of computer readable instructions; a processor in communication with the one or more computer readable media and configured to execute the computer readable instructions to: compare at least some of the patient-specific marker values to a library of reference marker values associated with a selected standard of care set of drugs and determine an equivalence level for the at least some of the patient-specific markers relative to reference markers in the library of reference marker values based on a set of equivalence rules for the selected standard of care set of drugs in the marker processing apparatus; and output a predicted drug response based on the determined equivalence level.
[00088] The processor may be further configured to: compare a remainder of the patient-specific marker values, which were not part of the at least some of the patient-specific marker values already compared to the library of reference markers, to the library of reference marker values and determine an equivalence level for all of the rest of the patient-specific marker values relative to reference marker values in the library of reference marker values.
[00089] In any of these apparatuses, the processor may be configured to output an estimated drug response for a second plurality of drugs based on the determined equivalence level for all of the rest of the patient-specific marker values.
[00090] The processor may be configured to output the estimated drug response based on the determined equivalence level for the standard of care set of drugs within a first time period and to output the estimated drug response for the second plurality of drugs within a second time period following the first time period by at least 4 days.
[00091] In any of these apparatuses, the control may be configured to allow the user to select the standard of care set of drugs based on the cancer type by selecting from one of: breast cancer, lung cancer, colon cancer, and melanoma.
[00092] The output may be configured to indicate one or more of: enhanced response, standard response, and poor response (e.g., for each therapy, e.g., drug, presented). The processor may be further configured to resolve conflicts between two or more patient-specific marker values associated with a same drug in the standard of care set of drugs.
[00093] In general, the processor may be configured to determine the equivalence level by applying the set of equivalence rules to determine if the patient-specific marker value is identical to the reference marker value,
functionally similar to the reference marker value, or structurally similar to the reference marker value. The processor may be configured to modify the set of equivalence rules based on a determined equivalence level. The processor may be configured to determine the estimated drug response to each of the standard of care cancer drugs in the standard of care set of cancer drugs by scaling a reference drug response associated with a reference marker value by the equivalence level determined for a patient-specific marker value related to that reference marker value.
[00094] In any of these apparatuses, the processor may be configured to apply the set of equivalence rules when comparing the patient-specific markers against the reference markers in the library of reference marker values to set the equivalence level to a level selected from the set of: high equivalence, medium equivalence, low equivalence, or no equivalence.
[00095] Also described herein is a non-transitory computer-readable storage medium storing a set of instructions capable of being executed by a computer processor, that when executed by the computer processor cause the computer to: receive a plurality of patient-specific marker values; receive a user selection of a standard of care set of drugs specific to a cancer type; compare the patient-specific marker values to a library of reference marker values associated with the selected standard of care set of drugs; determine an equivalence level for the patient-specific markers relative to reference markers in the library of reference marker values based on a set of equivalence rules for the selected standard of care set of drugs in the marker processing apparatus; and output an estimated drug response based on the determined equivalence level.
[00096] The set of instructions, when executed by the computer processor, may further cause the computer to: compare a remainder of the patient-specific marker values, which were not part of the at least some of the patient-specific marker values already compared to the library of reference markers, to the library of reference marker values and determine an equivalence level for all of the rest of the patient-specific marker values relative to reference marker values in the library of reference marker values.
[00097] The set of instructions, when executed by the computer processor, may further cause the computer to: output an estimated drug response for a second plurality of drugs based on the determined equivalence level for all of the rest of the patient-specific marker values. The set of instructions, when executed by the computer processor, may further cause the computer to: output the estimated drug response based on the determined equivalence level for the standard of care set of drugs within a first time period and to output the estimated drug response for the second plurality of drugs within a second time period following the first time period by at least 4 days.
[00098] The set of instructions, when executed by the computer processor, may further cause the computer to: allow the user to select the standard of care set of drugs based on the cancer type by selecting from one of: breast cancer, lung cancer, colon cancer, and melanoma.
[00099] The set of instructions, when executed by the computer processor, may further cause the computer to: output the estimated drug response based on the determined equivalence level by indicating one or more of: enhanced response to the drug, standard response to the drug, and poor response to the drug.
[000100] The set of instructions, when executed by the computer processor, may further cause the computer to: resolve conflicts between two or more patient-specific marker values associated with a same drug in the standard of care set of drugs.
[000101] The set of instructions, when executed by the computer processor, may further cause the computer to: determine the equivalence level by applying the set of equivalence rules to determine if the patient-specific marker value is identical to the reference marker value, functionally similar to the reference marker value, or structurally
similar to the reference marker value. The set of instructions, when executed by the computer processor, may further cause the computer to: modify the set of equivalence rules based on a determined equivalence level.
[000102] The set of instructions, when executed by the computer processor, may further cause the computer to: determine the estimated drug response to each of the standard of care cancer drugs in the standard of care set of cancer drugs by scaling a reference drug response associated with a reference marker value by the equivalence level determined for a patient-specific marker value related to that reference marker value.
[000103] The set of instructions, when executed by the computer processor, may further cause the computer to: apply the set of equivalence rules when comparing the patient-specific markers against the reference markers in the library of reference marker values to set the equivalence level to a level selected from the set of: high equivalence, medium equivalence, low equivalence, or no equivalence.
[000104] In general, also described herein are methods and apparatuses (e.g., tools) for assisting in analyzing patient-specific markers, including analyzing patient-specific markers relative to a library of reference marker values. For example, described herein are methods of aligning a number of reads from a DNA sample to a reference genome. Such methods may include: (a) using a computing processing unit (CPU) to precompute a Burrows-Wheeler search index from the reference genome; (b) using the CPU to use the index to find all locations in the sample containing a match where a read approximately or exactly matches the reference genome; (c) using a graphics processing unit (GPU) to refine each match by introducing a limited number of gaps on the read to generate an alignment; and (d) using the CPU to rank all alignments for the read to find a best alignment which is written to output.
[000105] A method of aligning a number of reads from a DNA sample to a reference genome may include repeating steps b, c, and d for each read.
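Steps (a) and (b) can be illustrated with a toy Burrows-Wheeler index and an exact-match backward search. This is a minimal sketch for short strings only: the function names build_index and find_exact are assumptions for this example, the suffix array is built naively, and approximate matching, gap refinement on a GPU and ranking are not shown.

```python
def build_index(reference):
    """Precompute a simple Burrows-Wheeler index: suffix array, counts, occurrences."""
    text = reference + "$"  # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' for the first suffix
    # first[c] = number of characters in text strictly smaller than c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def find_exact(read, index):
    """Return sorted 0-based reference positions where the read matches exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(read):  # backward search, last base first
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

For example, after `index = build_index("ACGTACGT")`, the call `find_exact("ACG", index)` returns the two positions 0 and 4. The index is built once and reused for every read, which mirrors the precompute-then-search split described above.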
[000106] In general, a Smith-Waterman dynamic programming algorithm may be used to return a best alignment between a pair of input strings. A single backtracking matrix may be stored in local memory. The GPU-based constant memory may be used to store static variables that do not change during execution of the algorithm. The memory operations may be expressed as primitive mathematical operations. If-else statements may be re-expressed as primitive bitwise operations. The input sequences may be sorted in increasing order of length before executing the algorithm.
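The following Python sketch illustrates two of these ideas on the CPU: a Smith-Waterman score computation whose per-cell maximum is written with bitwise operations instead of if-else, and sorting of inputs by length before scoring. The scoring values (match 2, mismatch -1, gap -1) and the function names are assumptions for this example. Only the best local score is returned; the backtracking matrix and the GPU memory layout are not modeled.

```python
def bitwise_max(a, b):
    """Branch-free max for integers whose difference fits in 63 bits."""
    diff = a - b
    # (diff >> 63) is -1 (all ones) when diff < 0 and 0 otherwise
    return a - (diff & (diff >> 63))


def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of s and t with a linear gap penalty."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            same = int(s[i - 1] == t[j - 1])  # 0 or 1
            sub = mismatch + same * (match - mismatch)  # arithmetic select
            h = bitwise_max(0, prev[j - 1] + sub)
            h = bitwise_max(h, prev[j] + gap)
            h = bitwise_max(h, cur[j - 1] + gap)
            cur[j] = h
            best = bitwise_max(best, h)
        prev = cur
    return best


def score_pairs(pairs):
    """Score (read, reference) pairs after sorting by increasing read length."""
    ordered = sorted(pairs, key=lambda p: len(p[0]))
    return [(s, t, smith_waterman_score(s, t)) for s, t in ordered]
```

On a GPU, sorting by length tends to place reads of similar size in the same batch, so threads in a group finish at about the same time; in this sketch the sort only fixes the processing order.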
[000107] As described in detail herein, the algorithm may have a processing speed of about 22 GCUPs. The GPU and CPU may comprise independent sets of instructions. The throughput may be 0.0025 Gbps/pt or greater.
[000108] For example, a system for aligning a number of reads from a DNA sample to a reference genome may include: a computing processing unit (CPU) configured to execute instructions for precomputing a Burrows-Wheeler search index from the reference genome and using the Burrows-Wheeler search index to find all reads that approximately or exactly match a portion of the reference genome; a graphics processing unit (GPU) configured to execute instructions for refining at least some reads by introducing a limited number of gaps on the at least some reads to generate an alignment, wherein the CPU is further configured to execute instructions for ranking all alignments for the at least some reads to find a best alignment which is written to output.
[000109] The CPU and GPU may generally include instructions for repeating the using, refining, and ranking for each read. The GPU may be configured to execute instructions for using a Smith-Waterman dynamic programming algorithm to return a best alignment between a pair of input strings. The GPU may be configured to execute instructions for storing a single backtracking matrix in local memory.
[000110] In some variations, the GPU-based constant memory may be used to store static variables that do not
change during execution of the algorithm. The GPU may be configured to execute instructions for expressing
memory operations as primitive mathematical operations. The GPU may be configured to execute instructions for
re-expressing if-else statements as primitive bitwise operations. As mentioned, the GPU and CPU may comprise
independent sets of instructions.
BRIEF DESCRIPTION OF THE DRAWINGS
[000111] The novel features of the invention are set forth with particularity in the claims that follow. A better
understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and
the accompanying drawings of which:
[000112] FIG. 1A is an example of a workflow for clinical genomics.
[000113] FIG. 1B is another example of a workflow ("Strandomics workflow").
[000114] FIG. 2 shows an example of a relationship between true positive and false positive rate for various aligners.
[000115] FIGS. 3A and 3B show an example of distributions of mapping qualities assigned to incorrectly
mapped reads.
[000116] FIGS. 4A and 4B also show an example of distributions of mapping qualities assigned to incorrectly
mapped reads.
[000117] FIGS. 5A and 5B show an example of a histogram of supporting reads for substitutions and InDels and
complex variants.
[000118] FIGS. 6A and 6B show examples of distributions of scores assigned to overlapping variants.
[000119] FIGS. 7A and 7B show example distributions of strand bias.
[000120] FIGS. 8A and 8B show example distributions of supporting read percentage for variants.
[000121] FIGS. 9A and 9B show example distributions of variant scores for variants.
[000122] FIGS. 10A and 10B show example distributions of strand bias.
[000123] FIGS. 11A and 11B show example distributions of supporting read percentages.
[000124] FIGS. 12A-12I illustrate an example of a SoC Report for a patient with a clinical indication of
colorectal carcinoma.
[000125] FIGS. 13A-13Q illustrate another example of a SoC Report for a patient with a clinical indication of
lung adenocarcinoma.
[000126] FIGS. 14A-14C show an example of a portion of a report for a patient with a clinical indication of
breast carcinoma.
[000127] FIGS. 15A-15C depict another example of a portion of a report for a patient with a clinical indication of colorectal carcinoma.
[000128] FIGS. 16A-16B illustrate another example of a portion of a report for a patient with a clinical
indication of lung adenocarcinoma.
[000129] FIG. 17 shows another example of a portion of a report for a patient with a clinical indication of
colorectal carcinoma.
[000130] FIGS. 18A-18B show another example of a portion of a report for a patient with a clinical indication of
lung adenocarcinoma.
[000131] FIGS. 19A-19B illustrate an example of a test requisition form.
[000132] FIG. 20 shows an example of a GPUSeq Aligner workflow.
[000133] FIG. 21 illustrates an example of a matrix construction used in the GPUSeq Aligner.
[000134] FIG. 22 shows an example of a GPUSeq workflow.
[000135] FIG. 23 (shown as FIG. 20, above) shows a schematic representation of load distribution between CPU and GPU.
[000136] FIG. 24 depicts an example of a VariantCaller workflow.
[000137] FIGS. 25A and 25B show an example of a performance comparison of GPUSeq and StrandNGS DNA-Seq.
[000138] FIG. 26 is an example of a workflow for a "SmartLab" as described herein.
[000139] FIG. 27 is a graph illustrating the break-up of the 152 genes on the exemplary NGS panel.
[000140] FIG. 28 illustrates intronic regions added to the panel design for better copy number detection.
[000141] FIG. 29 illustrates an entire intronic region added for sensitive detection of marker breakpoints.
[000142] FIG. 30 illustrates target regions overlapping individual metabolism related marker.
[000143] FIG. 31 shows target regions for oncogenes designed to overlap known markers from the COSMIC database.
[000144] FIG. 32 illustrates a protocol for the NGS panel using SureSelect XT2 reagent kit.
[000145] FIG. 33 illustrates steps in the patient-specific marker value generation pipeline.
[000146] FIG. 34 illustrates a read distribution at a genomic location used by SNP detection algorithm.
[000147] FIG. 35 illustrates copy number visualization focused on the genes of the panel.
[000148] FIG. 36 illustrates different types of split alignments and the possible underlying SVs.
[000149] FIG. 37 illustrates post alignment steps, including quality control steps as discussed herein.
[000150] FIG. 38 shows a normal distribution of supporting reads percentage for a sample.
[000151] FIG. 39 illustrates a distribution of supporting reads percentage for a contaminated sample.
[000152] FIG. 40 shows a normal plot of GC vs corrected normalized average coverage.
[000153] FIG. 41 is a plot of GC vs coverage for a sample with sequencing abnormalities.
[000154] FIG. 42 illustrates an Elastic Genome Browser view of a translocation involving the genes RET and CCDC6.
[000155] FIG. 43 is a schematic block diagram of the marker processing apparatus.
[000156] FIG. 44 illustrates equivalence rules depicting a patient specific marker value being annotated with GOF.
[000157] FIG. 45 illustrates one example of a Variant Support View.
[000158] FIG. 46 shows one example of a Variant Card that succinctly depicts all relevant information about the variants.
[000159] FIG. 47 shows a therapy matrix that provides a shortlist of off-label drugs and clinical trials that may
be applicable to the patient sample.
DETAILED DESCRIPTION
[000160] Described herein in general are methods, systems and treatments including (or related to) patient risk assessment, diagnosis, therapy recommendation, and clinical trials. For example, described herein are workflows including workflows for generic clinical genomics as described above, in reference to FIG. 1. There are minor variations to this workflow for each of the following 3 application areas: (1) Germline Cancer; (2) Somatic Cancer; and (3) Inherited diseases.
[000161] Also described herein are systems and methods that may be used as part of a biopharma or government sponsored clinical trial. For example, described herein are clinical exome panels (including whole exome sequencing approaches) followed by a pipeline (described herein) for NGS data analysis and interpretation that may help discover novel biomarkers to aid in the new drug clinical trials process. Once biomarkers are established for a particular drug to stratify patients into groups such as responders, non-responders, responders with adverse side effects, and non-responders with adverse side effects, companion diagnostic panel(s) can then be designed using the process described herein ("Panel Design"). The current invention provides methods and apparatus with the object of determining the response of a cancer patient with a specific type of cancer to a plurality of cancer drugs using a marker processing apparatus.
Part I: Clinical Applications
[000162] In one aspect of this invention, the response of a cancer patient with a specific type of cancer to a plurality of cancer drugs specific to the cancer type of the patient, referred to as the standard of care drugs and limited to this collection of drugs, is determined using the marker processing apparatus.
[000163] The methods in this aspect of the invention comprise the step of testing a sample from the patient for a plurality of markers to identify patient specific marker values. The nature of marker values tested may include, but is not limited to, genomic events such as single nucleotide variants (SNVs), Insertions and Deletions (InDels), and copy number variants (CNVs). It may also include specific structural variants (SVs) such as translocations, microsatellite instability (MSI) and protein expression levels. Each of the markers may be measured using technologies including Next Generation Sequencing (NGS), Immunohistochemistry (IHC), Polymerase Chain Reaction (PCR) and fluorescence in situ hybridization (FISH). In one example, limited to colon cancer, the markers tested to determine the response to the standard of care drugs comprise those from at least 14 genes including APC, BRAF, DPYD, EGFR, MET, KRAS, NRAS, PIK3CA, PTEN, SMAD4, UGT1A1, TS, TOP1 and ERCC1 and microsatellite instability (MSI). In this exemplary embodiment, 11 genes may be measured using an NGS panel (APC, BRAF, DPYD, EGFR, MET, KRAS, NRAS, PIK3CA, PTEN, SMAD4, UGT1A1), protein expression levels for TS, TOP1 and ERCC1 may be measured via IHC, and MSI may be measured via PCR.
[000164] In one exemplary embodiment of this aspect of the invention, the NGS panel may be designed to include genomic regions from a set of 442 genes (e.g., 152) that modulate the response to chemotherapies and targeted therapies, or modulate the metabolism of some of these drugs, or impact prognosis and disease progression. Further, in this exemplary embodiment, the regions included in the panel may be optimized to cover all genic regions for tumor suppressor genes, all regions that cover variants from one version of the COSMIC database for oncogenes, regions that cover known translocations, regions that optimize copy number detection and regions that contain germline mutations known to impact disease metabolism or prognosis. For each marker tested using IHC, the antibody may be designed and the protocol standardized to optimize the detection of protein expression from the patient sample.
[000165] The sample drawn from the patient may include in one form fresh tissue extracted from the area of the cancer via a biopsy, and in another form, sample that has been stored in the form of Formalin-fixed, paraffin-embedded (FFPE) tissue. On one hand, FFPE tissues lend themselves to long term storage of the sample. On the other hand, the quality of DNA in FFPE tissues is highly variable and poses significant challenges to various aspects of the methods of the invention. In particular, the steps of the lab protocol of the NGS panel are standardized to handle FFPE tissues with a wide range of quality. In one exemplary embodiment, the protocol includes pull-down in-solution using SureSelect XT2 RNA baits and sequencing using Illumina's sequencing instrument such as MiSeq, NextSeq or HiSeq. In this exemplary embodiment, the lab protocol is optimized to handle samples that contain at least 20% of tumor content and at least 200 ng of input tumor DNA.
[000166] The raw data generated from the lab for the NGS panel is in the form of sequence reads, 2x151 bp in length in one exemplary embodiment, and is transformed to patient-specific marker values using algorithms including but not limited to alignment, SNP detection, copy number calling, translocation detection, and Quality Checks (QC) executed as part of a patient-specific marker value generation pipeline. In one exemplary embodiment, these algorithms are executed by a computer processor. In another embodiment, these algorithms are executed by a computer processor in tandem with a graphics processor resulting in overall speed improvements of up to 10x.
[000167] The patient-specific marker value generation pipeline includes pre-alignment filtering steps and steps such as trimming low quality bases and using contaminant databases to filter out noise. The alignment algorithm may use a Burrows-Wheeler Transform (BWT) based index search followed by a Dynamic Programming (DP) algorithm to find an optimal match for each read. It may include steps for seeding matches from the read, aggregating matches to identify candidate regions, performing a banded DP, mate rescue, split read alignment, and local realignment. The SNV detection algorithm may use an iterative binomial caller and include a framework to specify complex Boolean filters based on properties of reads near the location of the SNP being called. The copy number calling algorithm involves creating GC-corrected normalized coverages for the normal profile, in one embodiment by computing an iterative average for each region. Copy number calls made at the exon level may be summarized to the gene level. The translocation detection may use split read alignment to identify known translocations. The patient-specific marker value generation pipeline may also include QC checks to determine sample contamination, sample degradation and sample mix-up using the SNV calls. It may also include checks on the number of novel SNP calls and the number of copy number calls to QC the algorithm execution. The QC results may be used to pass or reject the patient-specific marker values generated by the pipeline for the patient sample.
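To make the idea of a binomial caller concrete, the following Python sketch tests whether the number of variant-supporting reads at a site is unlikely under sequencing error alone. It is a single-pass illustration, not the iterative caller referred to above; the error rate, the significance threshold and the function names are assumptions for this example.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def call_snv(variant_reads, depth, error_rate=0.01, alpha=1e-6):
    """Call a variant when the supporting reads are unlikely under error alone.

    error_rate and alpha are hypothetical defaults chosen for illustration.
    """
    if depth == 0:
        return False
    return binomial_tail(variant_reads, depth, error_rate) < alpha
```

For example, 20 supporting reads out of 100 at a 1% error rate gives a vanishingly small tail probability and is called, whereas a single supporting read out of 100 is not. Boolean read-level filters of the kind mentioned above would be applied around such a test.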
[000168] For markers measured by IHC, the scoring criteria and cut-offs used may be standardized separately for each IHC marker based on published literature and vendor provided catalogues.
[000169] In this aspect of the invention, the method next comprises entering the patient specific marker values into the marker processing apparatus. The marker processing apparatus comprises a manually curated database of reference markers and their functional significance. Each reference marker may be annotated using peer reviewed journal literature as having a Loss of Function, a Gain of Function or an Unknown Functional impact on the gene that it falls in.
[000170] The marker processing apparatus further comprises rules, called the SOC rules, that relate collections of reference markers to the expected response of the SOC drugs for a specific tissue. The SOC rules are manually curated using information from disparate peer reviewed literature. Each rule may be recorded as an expression that quantifies the response of a specific drug in the presence of a marker M in the patient's sample:
Response (drug D | marker M causing functional significance F) = R based on literature evidence L
[000171] The marker can be as specific as a single variant (for example, the V600E variant in the BRAF gene), or more general, for example, any Loss of Function (LOF) variant in a gene (e.g., any variant in SMAD4 gene that causes a LOF) or with some intermediate level of generality (e.g., any variant in exons 23, 24 or 25 of ALK gene that causes a GOF). In addition, it can also be the negative of a variant or variant class (e.g., any variant in EGFR gene that does NOT cause a GOF) or a Boolean combination of a plurality of variant classes for the same gene or different genes with specific or generic exceptions (e.g., any variant in EGFR gene except T790M that causes a GOF or any variant in XXX gene that is NOT an in-frame insertion in exon 20 and causes a GOF). The drug reference in the SOC rule may include specific drugs (e.g., Everolimus) or a class of drugs (e.g., mTOR inhibitors). The functional significance may include loss of function (LOF), gain of function (GOF) or unknown function. The response value for a drug could be sensitive or resistant.
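One way to picture an SOC rule of the form Response (drug D | marker M causing functional significance F) = R based on literature evidence L is as a record whose marker is a Boolean predicate over variants. The Python sketch below is an assumption-laden illustration: the class and helper names (Variant, SocRule, specific, any_with), the drug name "drug_A", the response value and the evidence string are all placeholders and carry no clinical meaning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Variant:
    gene: str
    change: str         # e.g. "V600E"
    significance: str   # "LOF", "GOF" or "Unknown"


@dataclass(frozen=True)
class SocRule:
    drug: str                            # a drug or a drug class (D)
    matches: Callable[[Variant], bool]   # marker M with significance F, as a predicate
    response: str                        # R: "sensitive" or "resistant"
    evidence: str                        # L: literature reference


def specific(gene, change):
    """Marker that is one exact variant."""
    return lambda v: v.gene == gene and v.change == change


def any_with(gene, significance, except_changes=()):
    """Marker class: any variant in a gene with a given significance, minus exceptions."""
    return lambda v: (
        v.gene == gene
        and v.significance == significance
        and v.change not in except_changes
    )


# Placeholder rule: drug name, response and evidence are hypothetical.
rule = SocRule(
    drug="drug_A",
    matches=any_with("EGFR", "GOF", except_changes=("T790M",)),
    response="sensitive",
    evidence="placeholder citation",
)
```

Negations and Boolean combinations across genes can be built the same way by composing predicates, for example `lambda v: not p(v)` or `lambda v: p(v) and q(v)`.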
[000172] Upon entering the patient specific marker values into the marker processing apparatus, the marker processing apparatus may be used to determine an equivalence level for the subset of patient specific marker values relative to the reference SOC marker values based on a set of equivalence rules. For each patient specific marker measured with a specific value, the marker processing apparatus may evaluate its equivalence with any of the reference SOC marker values and may output an equivalence level with high, medium or low confidence. The equivalence rules may be dependent on the type of the patient specific marker value and may be used to annotate the functional significance of the patient specific marker value based on the functional significance of the reference marker value. In one exemplary embodiment of an equivalence rule, a patient specific marker value that indicates a premature truncation in a Tumor Suppressor gene is equivalent to a reference marker with high confidence if it is an exact match or if the reference marker is a premature stop, or a frameshift variant or an Exonic Splicing Silencer (ESS) resulting in a premature stop downstream of the patient specific marker value, and is annotated with a LOF functional significance if such a reference marker is found.
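The premature-truncation rule can be sketched as a small Python function. The reading taken here, that a reference truncation counts when its position is at or after the patient's truncation in the same gene, is an interpretation of the paragraph above; the Marker fields, the kind labels and the function name are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for reference marker types that lead to a premature stop.
TRUNCATING = {"premature_stop", "frameshift", "ess_premature_stop"}


@dataclass(frozen=True)
class Marker:
    gene: str
    protein_pos: int   # position of the (resulting) premature stop
    kind: str


def truncation_equivalence(patient, references, tumor_suppressors) -> Optional[str]:
    """Return "LOF" when a high-confidence equivalent reference marker is found."""
    if patient.gene not in tumor_suppressors or patient.kind != "premature_stop":
        return None
    for ref in references:
        if ref.gene != patient.gene:
            continue
        exact = ref == patient
        downstream = ref.kind in TRUNCATING and ref.protein_pos >= patient.protein_pos
        if exact or downstream:
            return "LOF"
    return None
```

The intuition behind this reading is that a patient truncation occurring earlier in the protein removes at least as much sequence as a curated downstream truncation already known to cause loss of function.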
[000173] In this aspect of the invention, the marker processing apparatus may then execute the SOC rules based on the functional significance of the patient-specific marker values derived using equivalence to the reference marker values. The marker processing apparatus may execute only the most specific rule for each drug, and may execute all possible rules based on the patient-specific marker values. Once all the rules have been fired, the apparatus may provide an overall recommendation for each drug. Where the responses predicted by individual SOC rules for a drug are all consistent, the overall recommendation for the drug may be the same as the individual response prediction. Where the different SOC rules predict contradicting responses, the apparatus may allow manual intervention to resolve the conflict based on the literature evidence associated with each individual SOC rule. The resolved outcome may be recorded in the system as a new specific SOC rule. If the conflict cannot be resolved, the drug response may be called Inconclusive.
[000174] In this aspect of the invention, the marker processing apparatus may thus be used to predict the estimated drug response given the unique patient-specific marker values for the standard of care set of drugs for a patient sample. The estimated drug response for each drug may vary between Enhanced Response for drugs that are found sensitive and Poor Response for drugs that are found resistant. Conflicting SOC rules may be resolved to predict the response of the drug as either Limited Response or Inconclusive. The SOC drugs for which none of the SOC rules were fired may have a Standard Response for the patient, and may be reported thus. The marker processing apparatus may further be used to generate the patient's report automatically with minimal manual intervention, needed only when a conflict arises. The report may include the estimated response of each of the standard of care drugs for the patient sample, along with the details of the patient-specific marker values that were measured and used to estimate the responses.
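The consolidation of fired SOC rules into one reported response per drug can be sketched as follows. This Python example assumes each fired rule contributes the string "sensitive" or "resistant"; the function names are hypothetical, and a conflict is returned as Inconclusive to stand in for the manual review step, which could instead yield Limited Response.

```python
def overall_response(fired_responses):
    """Combine the responses of all SOC rules fired for one drug."""
    distinct = set(fired_responses)
    if not distinct:
        return "Standard Response"    # no rule fired for this drug
    if distinct == {"sensitive"}:
        return "Enhanced Response"
    if distinct == {"resistant"}:
        return "Poor Response"
    return "Inconclusive"             # conflicting rules, pending manual resolution


def build_report(soc_drugs, fired):
    """Map every SOC drug to its overall response; `fired` maps drug -> responses."""
    return {drug: overall_response(fired.get(drug, [])) for drug in soc_drugs}
```

For instance, `build_report(["drug_A", "drug_B"], {"drug_A": ["sensitive", "sensitive"]})` reports an Enhanced Response for the placeholder drug_A and a Standard Response for drug_B.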
[000175] In another aspect of this invention, the response of a cancer patient with a specific type of cancer to a
plurality of cancer drugs not specific to the cancer type of the patient, referred to as off-label drugs, and to a
plurality of clinical trials for the specific type of cancer is determined using the marker processing apparatus with
the object of determining additional therapies that may be applicable to the patient.
[000176] In this aspect of the invention, the marker processing apparatus may further comprise a collection of
cancer pathways that provide for each gene in the panel above, a list of targetable upstream or downstream genes
from the pathways. The apparatus may also comprise a plurality of cancer drugs from different tissues and clinical
trials annotated with the target gene and additional conditions of applicability.
[000177] In this aspect of the invention, the marker processing apparatus performs an equivalence of the patient-specific marker values with the reference marker values to predict the functional significance of the patient-specific marker values as causing Loss of Function or Gain of Function. The patient-specific marker values that are not found equivalent to the curated markers with known functional significance may be further categorized using bioinformatics predictions, and database lookup.
[000178] In this aspect of the invention, one step involves using the marker processing apparatus to prioritize the
rest of the variants and shortlist patient-specific marker values that are expected to have a damaging functional significance on targetable genes in cancer pathways. This step may involve using the tools provided by the marker processing apparatus including, but not limited to, a tool that enables quick assessment of the cleanliness of the alignment in the neighborhood of the variant location; a tool that displays the copy number distribution by superimposing the copy number calls of the sample against the distribution of calls from each of the panel's regions over the samples in the system; and a tool to present all the relevant information about the marker in a concise visualization.
[000179] In this aspect of the invention, another step may involve using an automatically created list of drugs approved in other tissues and clinical trials for the specific tissue to identify additional therapies that may be applicable to the patient, given the patient-specific marker values. Literature evidence to support the applicability of the drug or trial to the patient may be recorded in the marker processing apparatus for reuse in subsequent patient samples.
[000180] The marker processing apparatus may further generate the patient's report automatically. In addition to the Standard of Care drugs, the report may include additional off-label drugs and clinical trials applicable to the patient sample, along with the details of the patient-specific marker values that were measured and used to estimate the responses.
[000181] In some embodiments, the marker processing apparatus comprises a processor configured to run marker processing algorithms as described herein. For example, the marker processing apparatus can comprise a general purpose computer configured to run marker processing software.
Panel Design
[000182] In general, described herein are multi-gene panels. One variation of a multi-gene panel (referred to herein as "StrandAdvantage") covers (a) all oncogenes and tumor suppressor genes, (b) genes that can help guide the choice of therapy options, (c) predictors of response or resistance to targeted and chemotherapy drugs, (d) markers for prognosis and (e) those involved in drug metabolism.
[000183] Multi-gene testing can provide many advantages as compared to single gene testing approaches. For example, multi-gene testing can allow short turnaround time to start therapy, including first-line options; testing of many genes concurrently (simultaneous identification of a driver mutation, as well as a secondary mutation in the
gene); identification of resistance to targeted therapy for the driver gene, thus helping predict the contraindications of a targeted drug; targeting tumors of different origins with the same driver mutations, using a single drug; testing a tumor for mutations and genes other than the ones typically associated with that tumor for more information; determining with very high sensitivity extremely subtle mutations that are unique to the tumor; and capturing tumor heterogeneity with high sensitivity, often missed with other molecular testing approaches.
[000184] Sequencing the entire genic content of these (e.g., 161) genes would result in a very large amount of sequencing and increase the cost of the test. The StrandAdvantage test selectively sequences just 0.5 MB of the target genes to keep sequencing costs and long term data storage costs to a minimum. It may be, however, difficult to determine which sub-regions to sequence, and how to guide analysis of these regions. The regions have been broadly chosen as follows:
[000185] All important COSMIC mutations involved in solid tumors are covered by the target regions. Briefly, all Refseq, UCSC and Ensembl transcripts were combined for each gene. All transcripts were merged to construct a "super transcript" for each gene. The DNA coordinates for all known transcripts were merged to construct "super transcript DNA coordinate" for each gene. All known COSMIC mutations (SNV, INDELs up to 35 bp) that overlap with the super transcript coordinates were considered as targets.
[000186] In addition to COSMIC, coding germline mutations from ClinVar and HGMD were included for the genes involved in drug metabolism.
[000187] For genes whose copy number status is important, the design ensures that the target regions are spread across the gene to maximize detection accuracy of whole gene copy number alteration events. Briefly, probes were equally interspaced along the entire length of the gene, covering all exons, wherever possible. If the distance between two adjacent exons was too large (>xkb), an intronic probe was introduced to ensure adequate representation.
[000188] For genes involved in translocations, the most common breakpoints (based on frequency of occurrence as per COSMIC) have been identified and these breakpoint regions have been densely covered with probes to enable high-resolution breakpoint detection.
[000189] To optimize the panel size and number of probes, overlapping and adjacent targets within 120 bp were merged.
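A minimal sketch of this merging step follows. It assumes that "within 120 bp" means a gap of at most 120 bp between the end of one target and the start of the next; the function name and coordinates are hypothetical.

```python
from typing import List, Tuple


def merge_nearby(targets: List[Tuple[int, int]], max_gap: int = 120) -> List[Tuple[int, int]]:
    """Merge targets that overlap or lie within max_gap bp of each other."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(targets):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous target: grow it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical target coordinates on one chromosome.
raw_targets = [(1000, 1050), (1100, 1160), (1400, 1450), (1430, 1500)]
panel_targets = merge_nearby(raw_targets)
```

With these inputs the first two targets (50 bp apart) collapse into one region, the overlapping last two collapse into another, and the 240 bp gap between the two regions is preserved.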
[000190] Probe design was set to balance specificity and coverage and set to 1X tiling.
[000191] The bioinformatics pipelines that support the test may ensure that the above objectives are achievable.
[000192] Design Confirmations: (1) HapMap and cell line DNA from 20 samples were used to analyze coverage of all intended target regions. No target base was found to be covered less than 1X. This proves that all probes in the panel are functional and that all intended target regions are covered. (2) We confirmed that all off-target regions with significant coverage had high similarity (were pseudogenes or members of the gene family originally targeted).
[000193] Analytical validation: The panel is being confirmed on cell lines hosting various types of DNA alterations. For INDELs, copy number changes and translocations, cell lines have been identified and validation is ongoing. The limit of detection for each type of event is being evaluated. For copy number changes, higher fold changes in purer tumor samples are easier to identify, so we have designed a series of experiments testing a range of amplifications from 2-15 or more copies and over a range of tumor content.
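Why higher fold changes in purer samples are easier to identify can be seen from a standard mixing calculation, sketched below. It assumes a diploid normal background and a tumor in which every cell carries the same copy number; the function name and the chosen grid of values are illustrative only, not the validation design itself.

```python
import math


def expected_copy_ratio(tumor_copies: float, tumor_fraction: float,
                        normal_copies: float = 2.0) -> float:
    """Expected copy ratio of a tumor/normal mixture relative to a diploid baseline."""
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * normal_copies
    return mixed / normal_copies


# Illustrative grid: copy numbers spanning the 2-15 range, three tumor contents.
grid = {
    (copies, purity): expected_copy_ratio(copies, purity)
    for copies in (2, 4, 8, 15)
    for purity in (0.2, 0.5, 0.8)
}

for (copies, purity), ratio in sorted(grid.items()):
    print(f"copies={copies:>2} tumor_content={purity:.1f} "
          f"ratio={ratio:.2f} log2={math.log2(ratio):+.2f}")
```

A 4-copy gain at 20% tumor content yields an expected ratio of only 1.2, whereas the same gain in a pure tumor yields 2.0, which is why both copy number and tumor content need to be varied when estimating the limit of detection.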
[000199] The panel tests described herein may generally be performed as follows: (1) a clinician provides a patient tumor specimen in a collection kit; (2) the tissue is tested by a lab (as described herein), which sequences the DNA from, e.g., a formalin-fixed, paraffin-embedded tumor sample, by sequencing numerous sub-regions of the genes of interest, as described herein; (3) data is analyzed as described herein, including use of an apparatus (e.g., "StrandOmics") including analytic software; (4) consultation with clinicians and Ph.D.s by telephone.
[000200] These assays, such as the Strand-48 and Strand-161 gene panels described above, may not necessarily be used as diagnostic tests; instead, they may be performed after the clinical and pathology diagnosis is performed. The clinical utility of these panels may be to stratify patients (such as, in the case of Strand-161, solid tumor cancer patients) in a meaningful way, e.g., according to their tumor mutation profiles, and to inform their physician about relevant treatments to guide rational treatment, including informing them about possible drug interactions and contraindications. For example, the Strand-161 assay may be used to inform the patient's oncologist about clinically valid cancer therapies that are available in the market or in clinical trials. The methods and apparatuses described herein enable the practice of personalized medicine, e.g., by an oncologist, by providing them with clinical decision support.
[000201] In any of these methods and apparatuses, the process may advantageously be structured as a stratified assay, in which a quick result looking at a sub-set of the genes (and/or specific polymorphisms of these genes) is first provided to the physician, followed by a more detailed report a few days thereafter. Thus, although many (e.g., hundreds) of genes including thousands of polymorphisms (including duplications, breaks, deletions, etc.) may be identified and analyzed, the assay may be configured to quickly (e.g., within a day, a few days, etc.) give initial results on a sub-set of relevant genes, followed up within a short time by the complete (or another sub-set) of results for the panel.
[000202] For example, clinicians may receive (e.g., within 5 days, within 6 days, within 7 days, within 8 days, within 9 days, within 10 days) a case report on their patient's subset of NCCN "standard of care" genes tested. The subset may consist of 8 standard of care genes (or 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, etc. standard of care genes). The number and identity of the genes forming the standard of care subset may be determined based on those genes identified by scientific/medical consensus, e.g., from the published literature. In 15 days or less (e.g., combining the StrandOmics described herein with our expert interpreters) the full case report may be provided to clinicians. In practice, the methods and apparatuses described herein provide results more than 40% faster than current competitors and at a much lower cost. The 8 standard-of-care genes may include: KRAS, BRAF, NRAS, PIK3CA, EGFR, AKT1, KIT, and PDGFRA.
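The two-stage reporting described above can be sketched as a simple partition of called variants by gene. The gene list is the 8-gene example given above; the record layout, function name and variant labels are hypothetical placeholders, not the StrandOmics data model.

```python
from typing import Dict, List, Tuple

# The 8 standard-of-care genes named in the example above.
STANDARD_OF_CARE_GENES = frozenset(
    {"KRAS", "BRAF", "NRAS", "PIK3CA", "EGFR", "AKT1", "KIT", "PDGFRA"}
)

Variant = Dict[str, str]  # hypothetical record: {"gene": ..., "label": ...}


def split_report(variants: List[Variant]) -> Tuple[List[Variant], List[Variant]]:
    """Return (quick_report, remaining) so the first stage can be issued early."""
    quick = [v for v in variants if v["gene"] in STANDARD_OF_CARE_GENES]
    remaining = [v for v in variants if v["gene"] not in STANDARD_OF_CARE_GENES]
    return quick, remaining


# Placeholder calls; GENE_X stands for any panel gene outside the subset.
calls = [
    {"gene": "KRAS", "label": "variant_1"},
    {"gene": "GENE_X", "label": "variant_2"},
    {"gene": "EGFR", "label": "variant_3"},
]
quick_report, full_report_extra = split_report(calls)
```

The quick report would then carry only the standard-of-care findings, with the remaining calls following in the complete case report.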
[000203] The assays described herein may be next generation sequencing ("NGS") based tests, and may detect any SNVs and indels that overlap with the panel region. Thus, the number of SNVs and indels may vary, although there may be a minimum number (e.g., "X" SNVs and indels); novel variants detected may be analyzed bioinformatically.
[000204] All the oncogenes and tumor suppressors in the panels may be analyzed for CNVs. Oncogene amplifications and tumor suppressor losses may be measured and reported (e.g., on a gene sheet providing an idea of the number of oncogenes and tumor suppressors). Translocations that cause fusion proteins may also be detected and analyzed. The most common fusion proteins involve ALK, ROS and RET. The panels described herein may be designed such that a fusion involving these and any other gene will be detected with high confidence (e.g., by placing probes in the entire introns of ALK/ROS/RET where the breakpoint could occur). For example, there are 10
well-known translocations involving these 3 partners and the assay should be able to capture them with high
confidence. Any other translocation involving these 3 genes also has a good chance of being captured.
Lab Protocols
[000205] One example of a lab protocol is based upon Agilent Technologies' SureSelect XT2 Enrichment protocol. It has been optimized to handle hybridization-based selection of targets from FFPE tumor samples.
Although the method and systems described herein for producing the library and sequences described herein may be
particularly useful for the downstream processes and apparatuses also described, any appropriate methods and
apparatuses, including currently known systems and methods, may be used for much of the downstream methods
and apparatuses (including alignment, analysis, treatment recommendations, and the like). For example, alternative technologies being optimized for this include library preparation kits from New England Biotechnologies and Roche
Nimblegen, along with Lockdown DNA probes from IDT. In addition, library preparation kits from new vendors
may be useful if cost effective and of equivalent performance. The choice of platform and the protocol
standardization steps are designed towards automation compatibility.
[000206] Described herein is a method of sequencing that is optimized as described herein, which has numerous advantages over prior art methods and apparatuses. Briefly, the protocol is as follows:
[000207] 1. DNA is extracted using a standard FFPE DNA extraction kit.
[000208] 2. Since FFPE DNA is highly fragmented and of inferior quality compared to fresh tissue, it is assessed using a real-time PCR assay.
[000209] 3. 200 ng of Qubit (Life Technologies) quantified d.s. DNA is sheared using a Covaris for 260-280 seconds to ensure >70% of the original DNA lies between 150 and 200 bp peak.*
[000210] 4. All standard steps in the preparation of the sample library are followed except that no size selection is performed. PCR for 8 and 14 cycles is performed for fresh tissue/cell line and FFPE DNA, respectively, to get an adequate DNA library from individual samples.
[000211] 5. The pre-capture library for each sample is individually assessed using a TapeStation (Agilent Technologies). If greater than 10% of the total library is made up of 120 bp fragments, which could arise from adaptor dimers as well, a mild size selection using 1.4:1 AMPure beads to DNA is used to remove these smaller fragments.*
[000212] 6. 6 to 8 individual libraries* are quantified and pooled in equal concentration to make up a total amount of 1500 ng DNA and hybridized to probes for 24 hours.
[000213] 7. The hybridization protocol is followed as per the SS-XT2 standard protocol, except for avoiding any size selection after hybridization. Following cleanup, 12 (FFPE) PCR cycles* are performed to get the final pooled library.
[000214] 8. The final library is cleaned using 1.5:1 AMPure beads to DNA.*
[000215] 9. As previously described, the library is checked on the TapeStation and, if the 120 bp fragment is greater than 10% of the pooled library, it is cleaned up again using AMPure beads.
[000216] 10. 8 pM library is loaded on a v3 sequencing kit (2X150 PE sequencing)* to obtain 8-9 GB total output, ~1200 cluster density and >90% Q30 reads. This allows for ~1-1.2 GB (more than five million reads) sequencing output per sample.
[000217] (*Indicates steps that may be particularly important.)
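The arithmetic behind three of the steps above (the 10% short-fragment check, equal pooling to 1500 ng, and the per-sample output) can be sketched as follows. Several points are assumptions made only for illustration: the fragment trace is a list of (size in bp, signal) pairs, "120 bp fragments" is taken as a window of 120 ± 15 bp, "pooled in equal concentration" is read as equal mass per library, and "GB" is read as gigabases. All names and numbers in the example data are hypothetical.

```python
from typing import Dict, List, Tuple


def short_fragment_fraction(trace: List[Tuple[int, float]],
                            peak_bp: int = 120, half_width_bp: int = 15) -> float:
    """Fraction of total signal within peak_bp +/- half_width_bp (assumed window)."""
    total = sum(signal for _, signal in trace)
    if total <= 0:
        raise ValueError("trace has no signal")
    near = sum(signal for size, signal in trace if abs(size - peak_bp) <= half_width_bp)
    return near / total


def needs_bead_cleanup(trace: List[Tuple[int, float]], max_fraction: float = 0.10) -> bool:
    """Apply the 'greater than 10% of the library' rule from the protocol."""
    return short_fragment_fraction(trace) > max_fraction


def pooling_volumes(conc_ng_per_ul: Dict[str, float],
                    total_ng: float = 1500.0) -> Dict[str, float]:
    """Volume (ul) of each library so that all contribute equal mass to total_ng."""
    if not conc_ng_per_ul:
        raise ValueError("no libraries to pool")
    mass_each = total_ng / len(conc_ng_per_ul)
    return {name: mass_each / conc for name, conc in conc_ng_per_ul.items()}


def per_sample_output(total_output_gb: float, n_libraries: int,
                      read_length_bp: int = 150) -> Tuple[float, float]:
    """(gigabases per sample, reads per sample) for an evenly shared run."""
    gb_each = total_output_gb / n_libraries
    return gb_each, gb_each * 1e9 / read_length_bp


# Hypothetical traces: one clean library and one with an adaptor-dimer peak.
clean_trace = [(120, 8.0), (200, 40.0), (250, 35.0), (300, 17.0)]
dimer_trace = [(120, 15.0), (200, 50.0), (250, 35.0)]

# Hypothetical Qubit concentrations (ng/ul) for eight pre-capture libraries.
libraries = {"lib1": 50.0, "lib2": 25.0, "lib3": 50.0, "lib4": 100.0,
             "lib5": 50.0, "lib6": 25.0, "lib7": 50.0, "lib8": 75.0}
volumes = pooling_volumes(libraries)
gb_each, reads_each = per_sample_output(8.0, len(libraries))
```

Under these assumptions, eight pooled libraries sharing an 8 gigabase run receive about 1 gigabase each, which at 150 bp per read is consistent with the "more than five million reads" figure in step 10.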
[000218] Analysis consists of multiple steps: alignment, filtering, somatic SNP detection, copy number variation detection, and detection of translocations. All these steps may be carried out using Strand NGS and are described in greater detail below.
[000219] Described herein is a method of sequencing that is optimized for starting with 50-100 ng of input DNA as described herein, which has numerous advantages over prior art methods and apparatuses. Briefly, the protocol is as follows:
[000220] (1) DNA is extracted using a standard FFPE DNA extraction kit.
[000221] (2) Since FFPE DNA is highly fragmented and of inferior quality compared to fresh tissue, it is assessed using a real-time PCR assay.
[000222] (3) 50-100 ng of Qubit (Life Technologies) quantified d.s. DNA is sheared using a Covaris for 350-500 seconds. A mild size selection is performed to remove the smaller fragments that adversely affect the size of library preparation.*
[000223] (4) All subsequent standard steps in the preparation of the library are followed except that no size selection is performed. The optimized adaptor concentration for library generation is 6 × 10^4 pM (lower than the recommended protocol). Adaptor ligation efficiency is enhanced by the use of 3% PEG in the ligation reaction for FFPE samples.
[000224] (5) PCR for 10-11 cycles is performed for fresh tissue/cell line and control FFPE DNA for generating adequate pre-capture libraries. For clinical FFPE DNA, 13-cycle PCR is performed to get an adequate DNA library from individual samples.
[000225] (6) The pre-capture library for each sample is individually assessed using a TapeStation (Agilent Technologies). If greater than 10% of the total library is made up of 120 bp fragments, which could arise from adaptor dimers as well, a mild size selection using 1:1.8 AMPure beads to DNA is used to remove these smaller fragments.* The library is eluted in 50 µl of buffer.
[000226] (7) 1-4 individual libraries* are quantified and pooled in equal concentration to make up a total amount of 1500 ng DNA and hybridized to probes for 24 hours.
[000227] (8) The hybridization protocol is followed as per the SS-XT2 standard protocol. For individual capture (single sample), the probes used are SS-XT. Following cleanup, 11-12 (FFPE) PCR cycles* are performed to get the final pooled library. The final library is cleaned using 1:1.8 AMPure beads to DNA.* Optimized elution volume for library: 50 µl post ligation and final.
[000228] (9) As previously described, the library is checked on the TapeStation and, if the 120 bp fragment is greater than 10% of the pooled library, it is cleaned up again using AMPure beads. Optimized sequencing plexing: single or pooled capture is 1-4 libraries.
[000229] (10) 1.3 pM library is loaded on a v3 sequencing kit (2X151 PE sequencing)* to obtain 8-9 GB total output, ~1200 cluster density and >90% Q30 reads. This allows for ~1-1.2 GB (more than five million reads) sequencing output per sample.
(*Indicates steps that may be particularly important/critical.)
[000230] Alignment
[000231] Reads generated by the sequencer are mapped to the reference genome using a combination of a fast exact-match lookup algorithm, BWT, to narrow the potential match locations and an accurate dynamic programming algorithm for the final alignment. The following parameters may be used during alignment. Bases with poor quality ( 90% are ignored. Reads are screened against a contaminant database before alignment. The alignment is set up to allow incorporation of indels whose length is up to 40% of the read length.
[000232] Reads which differ substantially from the reference genome pose special problems and are handled as follows: reads with medium indels (10-50 bp): the Strand NGS aligner uses a special score function that does not penalize the largest indel in a read. This allows reads with a single medium indel to be aligned accurately without any loss in specificity; reads spanning breakpoints caused by large indels, copy number changes or structural variants (SVs): the Strand NGS aligner has a split-read alignment module. Reads which do not align in the first phase are passed through the split-read aligner; this helps align reads which have at least 30 bp on either side of the breakpoint. The breakpoints discovered in this manner are used to improve the alignments of reads near the breakpoint, and also to rescue some unaligned reads. This comprehensive approach allows for high-resolution detection of large indels and SVs; reads spanning small near-tandem duplications: an innovative duplication detection algorithm detects duplicates within reads, and checks whether the alignment for the read can be improved after the duplicated stretch is removed. The duplicated stretch is replaced in the match record as a putative insertion.
Example: Benchmark study of a DNA Read Aligner
[000233] Background: Next generation sequencing technology has led to the generation of millions of short reads at an affordable cost. Aligning these short reads to a reference genome is a crucial task for many downstream applications. However, due to the large size of such data, the process of alignment is computationally challenging and requires sophisticated algorithms which are both fast and accurate. In this work, we briefly discuss the
Strand NGS (formerly Avadis NGS) alignment algorithm but, more importantly, present the benchmarking results on
several simulated data sets and a real whole-genome data set to compare it with other standard state-of-the-art
algorithms.
[000234] Results: Multiple aligners (Strand NGS, BWA, BWA-Mem, Bowtie2 and Novoalign3) are compared for accuracy and computational efficiency using 4 simulated data sets from the GCAT website and real Illumina HiSeq 2500 whole-genome paired-end data from the 1000 Genomes CEU female sample, NA12878. Strand NGS and Novoalign3 showed comparable accuracy in terms of both % correctly mapped reads and receiver operating characteristic (ROC) curves. They also seem to outperform the other algorithms, especially on data sets with longer InDels. For reads potentially originating from complex genomic locations like repeat regions (and therefore assigned low mapping quality), the Strand NGS aligner, with careful and intelligent filtering of false positives based on mapping qualities, produces a higher true positive rate compared to Novoalign3. As for the comparison based on computational efficiency, other than minor differences, practically all the included algorithms showed comparable performance.
[000235] Conclusions: Alignment of millions of short reads to a large reference genome with many complex regions is still a hard problem, and almost all current algorithms adopt some form of strategy to trade off accuracy
and computational efficiency. The benchmarking results presented in this study suggest that Strand NGS is a
powerful approach for short read alignment and either compares well with or even outperforms other state-of-the-art
algorithms.
[000236] Introduction: There has been an unprecedented growth in sequencing data due to the rapid advancements in next-generation sequencing (NGS) technology. For a whole-genome sample, the number of
reads can vary from a few million long reads (>=400bp) generated by instruments like PacBio and 454, to 2-3 billion
short reads (>=75bp) generated by instruments like Illumina/Solexa and SOLiD. Further, there has been a
significant cost reduction in DNA sequencing, attracting more and more researchers to use NGS technology in their
own labs. However, unless we have proper bioinformatics tools to process and analyze this large amount of
sequencing data, data by itself is not of much use.
[000237] Problem Statement and Challenges: One of the first steps in the analysis of NGS data, other than of course checking the quality and filtering out bad quality reads, is alignment or mapping of the generated sequencing
reads to the respective reference genome. An accurate and efficient alignment of reads to a reference genome is
crucial for many downstream applications, for example variant calling, structural variant detection including copy
number changes, detecting protein-DNA binding sites using ChIP Sequencing, comparing expression of
genes/transcripts across different biological conditions, understanding methylation patterns in DNA, determining species composition using a metagenomics workflow, etc. However, alignment is a challenging problem due to the
following reasons:
[000238] a) A reference genome is typically long (~billions of base pairs) and has complex regions like repetitive elements,
etc.
[000239] b) Reads are short in length (typically 50-150 bp), presenting issues with accuracy, and large in number, presenting issues with efficiency.
[000240] c) Reads have sequencing errors and must be mapped to unique positions in the reference genome.
[000241] d) The subject genome (for example tumor sample) may inherently be different from the reference
genome because of acquired alterations over time, making the alignment difficult.
[000242] In order to address the above challenges, numerous approaches (MAQ, BWA, BWA-Mem, BWA-SW, Bowtie, Bowtie2, SOAP2, Novoalign, Novoalign3, mrFAST, mrsFAST, SHRIMP, etc.) have been developed in the
past. There have also been some benchmarking studies discussing and comparing results from different subsets of
these approaches.
[000243] Evaluation Approach: Since the alignment task can very quickly become a bottleneck in the analysis
pipeline due to the ever-increasing volume of sequencing data, most of the above approaches adopt a trade-off between accuracy and speed. The choice of the number of mismatches or gaps allowed, and neglecting the quality of the data and
known SNP information during alignment, are some of the ways to trade off accuracy in favor of speed or
efficiency. Due to this trade-off, it is important to assess the performance of different algorithms on both metrics,
i.e., accuracy of read alignment and computational efficiency. However, since the accuracy of the alignment process
directly impacts the results of many downstream applications, accuracy is a more important metric than efficiency as long as the algorithm runs in a reasonable amount of time given the computational resources at hand.
[000244] To measure performance in terms of accuracy, we use simulated data sets for benchmarking and use
the following four evaluation criteria:
[000245] (1) Fraction (or %) of correctly mapped, incorrectly mapped and unmapped reads when alignment was done for: (a) all reads; (b) only reads with SNPs; (c) only reads with InDels; (d) only reads with both SNPs and InDels. (2) Trade-off between True Positive Rate (TPR) and False Positive Rate (FPR). In this case, TPR and FPR correspond to correctly mapped and incorrectly mapped reads, respectively. (3) Mapping quality distribution of incorrectly mapped reads. (4) Fraction (or %) of correctly mapped, incorrectly mapped and unmapped reads for the special case of reads that are assigned low mapping qualities (representing ambiguous reads possibly originating from repeat regions in the genome).
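The first of these criteria can be sketched on simulated reads whose true origin is known. The record layout, the position tolerance, and the choice to count a read with both a SNP and an InDel in each of the SNP and InDel subsets are assumptions made for illustration; they do not reproduce the GCAT scoring rules.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

Read = Dict[str, object]  # hypothetical record: true, aligned, has_snp, has_indel


def classify(true_pos: int, aligned_pos: Optional[int], tolerance: int = 0) -> str:
    """Label one read as correct, incorrect or unmapped."""
    if aligned_pos is None:
        return "unmapped"
    return "correct" if abs(aligned_pos - true_pos) <= tolerance else "incorrect"


def accuracy_summary(reads: List[Read],
                     keep: Callable[[Read], bool] = lambda r: True) -> Dict[str, float]:
    """Percentage of correct / incorrect / unmapped reads within a subset."""
    subset = [r for r in reads if keep(r)]
    if not subset:
        return {"correct": 0.0, "incorrect": 0.0, "unmapped": 0.0}
    counts = Counter(classify(r["true"], r["aligned"]) for r in subset)
    return {label: 100.0 * counts[label] / len(subset)
            for label in ("correct", "incorrect", "unmapped")}


# Hypothetical simulated reads with known origin.
reads = [
    {"true": 100, "aligned": 100, "has_snp": False, "has_indel": False},
    {"true": 200, "aligned": 200, "has_snp": True, "has_indel": False},
    {"true": 300, "aligned": 950, "has_snp": True, "has_indel": True},
    {"true": 400, "aligned": None, "has_snp": False, "has_indel": True},
]

all_reads = accuracy_summary(reads)
snp_reads = accuracy_summary(reads, lambda r: r["has_snp"])
indel_reads = accuracy_summary(reads, lambda r: r["has_indel"])
both = accuracy_summary(reads, lambda r: r["has_snp"] and r["has_indel"])
```

The same function can be reused for the fourth criterion by passing a predicate that keeps only reads with low mapping quality.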
[000246] To measure performance in terms of computational efficiency, we use a real data set (a whole-genome sequencing run for a widely used sample, NA12878), and the total run-time of the chosen algorithms is used as an evaluation criterion. The total time taken for alignment includes the following times: a) time taken for the Burrows-Wheeler Transform (BWT) search to find the initial exact seeds; b) time taken for Dynamic Programming (DP) around the seeds found; c) time taken for post-processing to produce final alignment results.
[000247] Also, it is important to note that the total time reported should not include the time taken in the following tasks: a) time taken to build the BWT index on the reference genome (since this is a one-time task); b) time taken to import the sample and collect/calculate QC metrics.
[000248] Algorithms Compared for Benchmarking: To make the alignment process more efficient, most alignment algorithms start by building an index on the reference genome. This index is then used to find the genomic locations for each read. There are some algorithms that build an index on the reads, but most of the popular recent algorithms build the index on the reference genome. This is because the same index, once built on a reference genome, can be used repeatedly for aligning different read sets. There are two main techniques used for building the index on the reference genome: hash tables and the Burrows-Wheeler transform (BWT). In this study, we only consider the following algorithms that build an index (either BWT- or hash table-based) on the reference genome:
[000249] Strand NGS: It is a BWT-based aligner which, in the first step, uses the Ferragina and Manzini algorithm to find the seeds of each read quickly in the reference. Then, for every seed match found for the read, a Smith-Waterman type dynamic programming (DP) algorithm is applied to determine if the read matches around the area anchored by the seed. This match must satisfy the specified mismatches and gaps threshold.
[000250] BWA: It is a BWT-based aligner, which uses the Ferragina and Manzini matching algorithm to find seeds. Then a backtracking algorithm is used that searches for matches between a substring of the reference genome and the query within a certain defined distance.
[000251] BWA-Mem: It is conceptually based on BWA but is reported to handle long reads and is considered to be more accurate and faster.
[000252] Bowtie2: It is a BWT-based aligner and, just like BWA, uses the Ferragina and Manzini matching algorithm to find seeds. Bowtie2 is designed to handle long reads and supports additional features that are not supported by Bowtie.
[000253] Novoalign3: It is a hash table-based tool, which first builds the index on the reference using hash tables and then uses the Needleman-Wunsch algorithm with affine gap penalties to find the optimal alignment.
[000254] Other than Novoalign3, which uses a hash table-based index, the others use a BWT-based index. Despite the conceptual similarities among the BWT index-based algorithms, there are still differences in their seeding strategies and optimizations. For example, Bowtie2 searches for seeds of length 16 with a gap of 10, then prioritizes the seed locations and performs DP around them. BWA, on the other hand, does the inexact matching in the BWT index itself. In Strand NGS, we keep extending the seed as long as the seed is present in the reference. Then we jump by one-third of the seed and start searching for the next seed. In the end, we take the 4 longest seeds and find the seed locations in the reference. After removing the duplicate seed locations, DP is performed at each seed location.
[000255] Data Sets Used: The GCAT (Genome Comparison and Analytic Testing) platform, which is developed by Arpeggi Inc. and hosted on the bioplanet website, is heavily used in this study. GCAT provides a solid testing platform for comparing the performance of multiple genomic analysis tools across a standard set of metrics. For alignment tests, the GCAT website hosts different simulated data sets covering read lengths of 100, 150, 250, and 400; single-end and paired-end library protocols; and both short and long InDels. Assuming a typical use-case scenario where Illumina paired-end data is used, we selected four data sets from GCAT in this study for benchmarking purposes. The details of these data sets are shown in Table 4. The base quality for each base in all four simulated data sets is 17, producing an average read quality of 17.
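The Strand NGS seeding strategy described in paragraph [000254] (extend each seed while it is still present in the reference, jump ahead by one-third of the seed, keep the 4 longest seeds, then remove duplicate locations) can be sketched as follows. This is only an illustration: plain substring search stands in for the BWT index lookup, duplicate locations are collapsed by the read start position each seed implies (an assumption), and the function names and sequences are hypothetical.

```python
from typing import List, Tuple


def find_seeds(read: str, reference: str, top_n: int = 4) -> List[Tuple[int, str]]:
    """Return up to top_n (offset_in_read, seed) pairs, longest seeds first."""
    seeds: List[Tuple[int, str]] = []
    i = 0
    while i < len(read):
        length = 0
        # Extend the seed for as long as it still occurs somewhere in the reference.
        # A real aligner would query a BWT/FM-index here instead of using "in".
        while i + length < len(read) and read[i:i + length + 1] in reference:
            length += 1
        if length == 0:
            i += 1  # this base matches nowhere; move on
            continue
        seeds.append((i, read[i:i + length]))
        i += max(1, length // 3)  # jump by one-third of the seed
    seeds.sort(key=lambda s: len(s[1]), reverse=True)
    return seeds[:top_n]


def candidate_locations(read: str, reference: str, top_n: int = 4) -> List[int]:
    """Distinct reference positions where the read would start, one per seed hit."""
    starts = set()
    for offset, seed in find_seeds(read, reference, top_n):
        pos = reference.find(seed)
        while pos != -1:
            starts.add(pos - offset)  # duplicates collapse in the set
            pos = reference.find(seed, pos + 1)
    return sorted(starts)


# Hypothetical reference and two reads taken from position 5.
reference = "ACGTTGCAAGGCTTACCGATAGC"
exact_read = reference[5:15]
# Same read with an uncalled base ("N") at offset 5.
noisy_read = exact_read[:5] + "N" + exact_read[6:]

exact_hits = candidate_locations(exact_read, reference)
noisy_hits = candidate_locations(noisy_read, reference)
```

Each candidate location would then be handed to the dynamic programming step, which decides whether the read actually aligns there within the allowed mismatches and gaps.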
[000256] For computational performance benchmarking, we used the data from whole-genome sequencing of the sample NA12878 on an Illumina HiSeq 2500 using a paired-end library. This data is ~103 GB and comprises 1,165,216,818 (~1.16 billion) paired-end reads of length 150 bp.
Table 4: Description of GCAT data sets
[000257] There are some limitations of using simulated data sets for accuracy benchmarking. First, it is difficult to mimic error patterns in the generated reads that match the true error model of the sequencer; and second, it is not guaranteed that the true genomic location of a generated read with simulated sequencing errors is the one from which the read was generated. A read with simulated sequencing errors and/or SNPs/InDels can very well align at a different location with better accuracy. However, barring a few instances like these, the ease of defining the truth still makes this approach viable and useful for benchmarking studies. Therefore, despite some shortcomings, most of the high-level comparisons that can be derived about different algorithms using simulated data sets still hold, thereby providing a useful resource for users to understand the pros and cons of multiple alignment algorithms.
[000258] Accuracy Benchmarking Results: In an effort to keep this article short, we only provide detailed results and discussion on one of the GCAT data sets for accuracy benchmarking. Since reads with large InDels pose a greater challenge in alignment, we only cover the benchmarking results on data set D4 in more detail. For comparison of alignment algorithms on other data sets (D1, D2, and D3), we have reviewed our results. Most of the conclusions made in this study using the results on data set D4 hold good for the other data sets; however, the differences between the algorithms for data sets with short InDels (D1 and D3) are less pronounced.
[000259] Alignment Accuracy on Data Set D4: As per the evaluation criteria mentioned above, Table 5 shows the fraction of correctly mapped, incorrectly mapped and unmapped reads for four different algorithms in various scenarios. All the algorithms were run with default parameters.
Table 5: Alignment accuracy of the four algorithms (Strand NGS, BWA, Bowtie2, Novoalign3) on simulated data set D4
[000260] Since the reads with SNPs or InDels, which typically make read alignment difficult, constitute a small percentage of the data, all the algorithms, particularly Strand NGS, BWA, and Novoalign3, have comparable alignment accuracy when all reads are used. Bowtie2 has a slightly worse alignment accuracy compared to the others on both metrics, i.e., % correctly mapped and % incorrectly mapped. However, for all practical purposes, alignment accuracy for all four algorithms can be considered to be in the same ballpark when results are summarized across all reads.
[000261] On the other hand, if we focus on the alignment accuracy considering only reads with SNPs, only reads with InDels, or only reads with both SNPs and InDels, there are clearer differences in the accuracy of the different algorithms. For example, comparing the accuracy of different algorithms on the 45,782 reads that have both SNPs and InDels, Strand NGS and Novoalign3 seem to be performing equally well but relatively better than BWA and Bowtie2 in terms of both metrics, producing a higher % of correctly mapped reads (high TPR) and a lower % of incorrectly mapped reads (low FPR).
[000262] Relationship between True Positive and False Positive Rate: Typically, there is a tradeoff between TPR and FPR, and the relationship between them is an important and commonly used metric to compare different algorithms. To assess the relationship between TPR and FPR, we restrict the analysis to the reads with both SNPs and InDels (45,782 reads in this case), since these reads are the most difficult to align and therefore represent a good set to compare the pros and cons of different algorithms. As per the description on the GCAT website, to generate the plot showing the relationship between true positive and false positive rate, reads are first sorted based on their normalized mapping qualities from high to low, and then correct and incorrect percentages are calculated. Incorrect read % is plotted in log scale on the y-axis and correct read % is plotted on the x-axis.
[000263] The further the line goes to the right before turning upwards, the better the accuracy. This is because typically the percentage of incorrectly mapped reads is low for reads with high mapping quality, with this percentage increasing with the decrease in the mapping quality.
[000264] FIG. 2 shows the relationship between true positive and false positive rate when only reads with both SNPs and InDels are considered. As can be seen from FIG. 2, BWA and Bowtie2 show a steep upward curve compared to Strand NGS and Novoalign3, indicating a larger percentage of incorrectly mapped reads even for reads that are assigned higher mapping quality by the respective algorithms. On the other hand, the curve for Strand NGS and Novoalign3 goes up more gradually, indicating only a mild increase in FPR as TPR increases.
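The construction of such a curve (sort reads by mapping quality from high to low, then accumulate correct and incorrect percentages) can be sketched as below. The input layout, the use of the total read count as the denominator, and the emission of one point per read rather than per quality threshold are assumptions; this is not the GCAT implementation.

```python
from typing import List, Tuple


def tpr_fpr_curve(reads: List[Tuple[int, bool]]) -> List[Tuple[float, float]]:
    """Cumulative (correct %, incorrect %) after sorting by mapping quality, high to low.

    Each read is a (mapping_quality, is_correct) pair.
    """
    ordered = sorted(reads, key=lambda r: r[0], reverse=True)
    total = len(ordered)
    points: List[Tuple[float, float]] = []
    correct = incorrect = 0
    for _, is_correct in ordered:
        if is_correct:
            correct += 1
        else:
            incorrect += 1
        points.append((100.0 * correct / total, 100.0 * incorrect / total))
    return points


# Hypothetical reads: (normalized mapping quality, mapped correctly?).
reads = [(60, True), (60, True), (40, True), (40, False), (10, False)]
curve = tpr_fpr_curve(reads)
```

Plotting the second coordinate on a log-scaled y-axis against the first on the x-axis gives the kind of curve discussed above: an aligner whose high-quality reads are almost all correct stays flat for longer before turning upwards.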
[000265] Distribution of Mapping Quality of Incorrectly Mapped Reads: Regardless of the choice of the alignment algorithm, due to the short length of the reads, sequencing errors and complex genomic locations including repeat regions, there will be some percentage of incorrectly mapped reads. These incorrectly mapped reads are likely to produce false positives in downstream analyses like variant calling. However, since most variant calling algorithms take into account the read/base quality as well as the mapping quality assigned by the alignment algorithm, it is possible to avoid some false positives if low mapping qualities are assigned to the incorrectly mapped reads. Therefore, one of the important criteria for comparing alignment algorithms is the distribution of mapping qualities they assign to incorrectly mapped reads. As mentioned, if an algorithm incorrectly maps a read but assigns a low mapping quality to it, the read can be filtered out in the variant calling step, avoiding false positive calls originating from it. On the other hand, if incorrectly mapped reads are assigned high mapping quality, there is no way to filter them out, leading to more false positives.
[000266] To assess what mapping qualities are assigned to incorrectly mapped reads by different algorithms, GCAT provides a plot showing the mapping quality distribution for incorrectly mapped reads. Figure 3A shows this distribution for all reads, and Figure 3B shows the same distribution for only reads with InDels. For better visualization, we also show similar distributions in Figures 4A and 4B, comparing only Strand NGS and Novoalign3, since both of these algorithms are found to have comparable but higher alignment accuracy as compared to BWA and Bowtie2 (refer to Table 5).
[000267] There are two important points to observe from FIGS. 3A and 3B, which apply to all the algorithms. First, every algorithm aligns a fraction of reads incorrectly with a mapping quality of 100. There is most likely no ambiguity in aligning these reads due to multiple locations being equally likely, but the reads are still mapped incorrectly due to other issues. Second, a higher fraction of incorrectly mapped reads lies toward the left side of the plot, indicating that they are assigned lower mapping quality. This is desirable because these incorrectly mapped reads can later be filtered out using an appropriate mapping quality threshold before proceeding with the variant calling step. If we ignore the incorrectly mapped reads with mapping quality 100 and focus on the rest, which are likely incorrectly aligned due to ambiguous locations, we observe that Strand NGS in particular assigns lower mapping quality to these reads than the other three algorithms. The maximum mapping quality assigned by Strand NGS to incorrectly mapped reads (ignoring reads with mapping quality 100) is around 50, whereas the rest of the algorithms, including Novoalign3, assign much higher mapping quality to these reads, thereby making it difficult to filter them out in the variant calling step. This is more clearly visualized in FIGS. 4A and 4B, where only Strand NGS and Novoalign3 are compared.
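The two quantities discussed above, the mapping quality distribution of incorrectly mapped reads and the maximum mapping quality once the reads at 100 are set aside, could be computed along the following lines. This is a sketch under assumptions: reads are given as hypothetical `(mapq, is_correct)` pairs, the function names are invented for illustration, and the `ceiling` of 100 reflects the normalized scale used in the plots rather than any aligner's native range.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def incorrect_mapq_fractions(reads: List[Tuple[int, bool]]) -> Dict[int, float]:
    """Fraction of incorrectly mapped reads at each mapping quality value."""
    counts = Counter(mapq for mapq, is_correct in reads if not is_correct)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mapq: n / total for mapq, n in sorted(counts.items())}


def max_incorrect_mapq(reads: List[Tuple[int, bool]], ceiling: int = 100) -> Optional[int]:
    """Highest mapping quality among incorrectly mapped reads below the ceiling.

    Reads at the ceiling are ignored, mirroring the text's exclusion of
    incorrectly mapped reads with mapping quality 100.
    """
    below = [mapq for mapq, is_correct in reads if not is_correct and mapq < ceiling]
    return max(below) if below else None
```

A lower value from `max_incorrect_mapq` means a lower filtering threshold suffices to remove the ambiguous, incorrectly mapped reads before variant calling.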
[000268] Alignment Accuracy for Reads with Low Mapping Quality: Reads that originate from complex genomic locations, such as repeat regions, are difficult to align. Typically, these reads are classified as ambiguous by most algorithms because multiple locations in the genome are found to be equally likely for their alignment. Depending on the algorithm or the choice of parameter settings, either all the genomic locations are reported or one random location is chosen and reported for these ambiguous reads. In either case, most algorithms reduce the mapping quality of these reads to reflect the ambiguity in their alignment. In this section, to assess the accuracy of the algorithms on these ambiguous reads, we compare only Strand NGS and Novoalign3, for the following reasons: 1) In terms of % correctly and % incorrectly mapped reads (which is used as the primary metric of accuracy), Strand NGS and Novoalign3 have similar accuracy, and both are found to be better than BWA and Bowtie2. Therefore, we believe that this additional analysis is valuable for comparing Strand NGS and Novoalign3 only. 2) Each algorithm has its own way of assigning mapping qualities. In addition, the mapping qualities assigned by different algorithms fall in different ranges. This is the reason GCAT