Abstract: METHOD FOR DETECTING CANCER THROUGH CELL-FREE DNA METHYLATION PATTERNS IN SUBJECT AND DIAGNOSTIC DEVICE THEREFOR
ABSTRACT
The present disclosure provides at least one method for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject. The method comprises: obtaining a plurality of methylation sequencing (methSeq) reads from cfDNA of said subject; analysing said methSeq reads using a bioinformatics pipeline to obtain a genomic alignment and a methylation status for each methSeq read; filtering a first type of methSeq reads from a second type of methSeq reads depending upon their conversion extents; deriving at least one score using said first type of methSeq reads; and applying a predictive model to said at least one score to determine presence of said at least one cancer and a tissue of origin of said at least one cancer in said subject. Also disclosed is a diagnostic device for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject.
FIG. 1 for the Abstract
Description: FIELD OF THE INVENTION
The present disclosure relates to methods for detecting cancers through cell-free DNA (cfDNA) methylation patterns in subjects. The present disclosure also relates to diagnostic devices for detecting cancers through cell-free DNA (cfDNA) methylation patterns in subjects.
BACKGROUND OF THE INVENTION
Globally, cancer is a leading cause of death, responsible for nearly 9.7 million deaths in 2022 (out of 20 million reported cases; i.e., nearly 50%). The most commonly reported fatal cancers, by numbers, are lung cancer (1.8 million reported deaths out of 2.5 million reported cases), breast cancer (0.67 million reported deaths out of 2.3 million reported cases), colorectal cancer (0.9 million reported deaths out of 1.9 million reported cases), stomach cancer (0.66 million reported deaths out of 1 million reported cases), prostate cancer (1.5 million reported cases), and liver cancer (0.76 million reported deaths) [900-world-fact-sheet.pdf (who.int)]. Among the above-mentioned cancers, the five-year (2019 till July 2024) survival rates for localized, regional and distant disease, respectively, are as follows: Lung Cancer: 65%, 37%, 9%; Colorectal Cancer: 91%, 72%, 13%; Breast Cancer: 99%, 86%, 31%; Prostate Cancer: >99%, >99%, 34%; Liver Cancer: 37%, 14%, 4%; Stomach Cancer: 75%, 35%, 7%. Thus, it is evident that survival increases significantly across different cancer types if the cancer is diagnosed at an early stage [https://www.cancer.org/cancer/types.html].
Next-generation sequencing technologies have emerged as powerful tools in recent years, enabling the discovery, at the onset of illness, of numerous biomarkers associated with diseases such as cancer. Typically, DNA methylation profiling using methylation sequencing is used for detecting, diagnosing and monitoring cancer in subjects. Mutation-based evaluation of cfDNA has been used in patients with cancer to recommend therapies and to monitor cancer relapse. However, a multi-cancer approach using the mutation-based evaluation can be expensive, since mutations can occur in different genomic regions in different cancers, and sequencing a large number of genomic regions with high sensitivity increases the cost of the mutation-based evaluation test. Methylation signatures, on the other hand, are very broad, so ultra-deep sequencing is not necessary. However, methylation changes can occur due to a variety of diseases, lifestyle habits and natural ageing processes. Hence, determining differentially methylated regions in a disease group as compared to a control group, while handling the confounding factors, is not easy with conventional methods. Moreover, further research is warranted to discover cancer-specific methylation patterns in the target population with the appropriate control groups, to ensure high sensitivity for early-stage cancers while maintaining exceptional accuracy. Therefore, in the light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks.
SUMMARY OF THE INVENTION
A primary objective of the present disclosure is to provide a method and a diagnostic device that provide efficient detection of cancer(s) through cell-free DNA (cfDNA) methylation patterns in a subject by accounting for only fully (or largely) converted methylation sequencing (methSeq) reads from a cfDNA sample. Herein, the distribution of said methSeq reads is not biased relative to the distribution of all fragments, thereby providing highly effective detection of cancer(s) in subjects. Another objective of the present disclosure is to provide a method and a diagnostic device for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject. An aim of the present disclosure is to provide a solution that overcomes, at least partially, the problems encountered in the prior art.
One such problem of the existing prior art is that of a very stringent conversion efficiency criterion. After the cfDNA, obtained from a sample, is sequenced, a check is made to determine the conversion efficiency of unmethylated cytosines to uracils. Only if the conversion efficiency is high enough, say 99% or above, is the sample used for training a classifier algorithm or for making a prediction. The application of such a constraint leads to high failure rates and, therefore, high costs. On the other hand, in the absence of such a constraint, a poorly converted unmethylated DNA fragment could appear as methylated, leading to sample misclassification. The present disclosure describes a novel approach to tackle said issue. The disclosed method and diagnostic device, described in detail in the following sections, allow reads to be rescued from samples having a conversion efficiency much lower than 99%. The disclosed method and diagnostic device thus save the time and expenditure of resequencing the samples that failed the stringent (say, 99% or above) conversion efficiency requirements.
In a first aspect, an embodiment of the present disclosure provides a method for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject, said method comprising:
obtaining a plurality of methylation sequencing (methSeq) reads from cfDNA of said subject;
analysing said methSeq reads using a bioinformatics pipeline to obtain a genomic alignment and a methylation status for each methSeq read;
filtering a first type of methSeq reads from a second type of methSeq reads depending upon their conversion extents;
deriving at least one score using said first type of methSeq reads; and
applying a predictive model to said at least one score to determine presence of said at least one cancer and a tissue of origin of said at least one cancer in said subject.
The aforementioned method of the present disclosure leverages a novel blood-based assay for non-invasive detection of diverse cancers. The assay uses cell-free DNA (cfDNA) methylation sequencing, a technique that explores methylation patterns within cfDNA to identify potential biomarkers. The test incorporates a methylation score derived from the sequencing data, followed by a machine learning based prediction model trained on a retrospective cohort encompassing diagnosed cancer cases and control subjects recruited across various centres in India. The methylation patterns of the two cohorts are contrasted to identify the distinctive methylation patterns. To ensure robustness, conservative constraints are employed within the machine learning algorithms, aligning them with established biological principles. Additionally, consistency with methylation signatures observed in other ethnicities (Caucasian and Han Chinese) is evaluated, and a substantial control group with habitual tobacco and alcohol use, which are factors known to influence DNA methylation, is incorporated in the study. Finally, the test's resilience to random signal fluctuations is verified, and cross-validation yields promising results, demonstrating a sensitivity of, for example, 79.3% for Stage I cancer, 78.4% for Stage II, 78.4% for Stage III, and 86.8% for Stage IV at a specificity of approximately 96.9% in an independent validation set. In conclusion, the blood test offers a promising approach for cancer detection, particularly in advanced stages.
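By way of a non-limiting illustration, the relationship between a target specificity and the resulting stage-wise sensitivity may be sketched as follows. The sketch assumes that each sample has already been reduced to a single numeric score in which higher values indicate cancer; the function names and the example scores are hypothetical and are not data from the present disclosure.

```python
import math

def threshold_for_specificity(control_scores, target_specificity):
    """Smallest cut-off such that at least the target fraction of
    control samples score at or below it (i.e., are called negative)."""
    ordered = sorted(control_scores)
    # Small epsilon guards against floating-point overshoot before ceil.
    needed = math.ceil(target_specificity * len(ordered) - 1e-9)
    needed = max(1, min(needed, len(ordered)))
    return ordered[needed - 1]

def sensitivity(case_scores, threshold):
    """Fraction of cancer samples whose score exceeds the cut-off."""
    positives = sum(1 for score in case_scores if score > threshold)
    return positives / len(case_scores)

# Hypothetical scores, for illustration only.
controls = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.31, 0.35, 0.40, 0.90]
stage_one_cases = [0.25, 0.45, 0.60, 0.80, 0.95]

cutoff = threshold_for_specificity(controls, 0.90)
stage_one_sensitivity = sensitivity(stage_one_cases, cutoff)
print(cutoff, stage_one_sensitivity)
```

In such a sketch, the cut-off is fixed using control samples only, and the same cut-off is then applied to the samples of each stage to obtain the stage-wise sensitivity.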
In a second aspect, an embodiment of the present disclosure provides a diagnostic device for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject, said diagnostic device comprising a reader unit for reading and analysing results from said cell-free DNA (cfDNA) methylation patterns.
The aforementioned diagnostic device achieves all the advantages and technical effects of the aforementioned method of the present disclosure for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject. The aforementioned diagnostic device is novel and efficient in detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject by reading the converted methylation sequencing (methSeq) reads from a cfDNA sample at different thresholds, namely 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 95%.
In a third aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage media having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor comprising processing hardware to execute a method of the aforementioned first aspect.
Embodiments of the present disclosure substantially eliminate or at least partially address the aforementioned problems in the prior art, and enable detection of at least one cancer through cfDNA methylation patterns using reads rescued from samples having a conversion efficiency lower than the stringent conventional requirement, thereby minimizing the time and expenditure of resequencing such samples.
Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative embodiments construed in conjunction with the appended claims that follow.
It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific instrumentalities disclosed herein. Moreover, those skilled in the art will understand that the drawings are not to scale.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 illustrates a flowchart depicting steps of a method for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a schematic illustration of a process flow for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject, in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a diagnostic device for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject, in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a graphical representation depicting cross-validation ROC, showing a very high true positive rate at low false positive rate for both training and testing dataset;
FIGs. 5A, 5B and 5C illustrate graphical representations depicting sensitivity analysis at targets of 91%, 95% and 98% cross-validation specificity, respectively, for a general cancer model and a female cancer model, wherein the whiskers show the 95% confidence interval of scores across multiple models, which is relatively tight across controls, benign samples, and different stages; and
FIG. 6 illustrates different conversion and non-conversion protocols for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practising the present disclosure are also possible.
Referring to FIG. 1, illustrated is a flowchart 100 depicting steps of a method for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject. At step 102, a plurality of methylation sequencing (methSeq) reads is obtained from cfDNA of said subject. At step 104, said methSeq reads are analysed using a bioinformatics pipeline to obtain a genomic alignment and a methylation status for each methSeq read. At step 106, a first type of methSeq reads is filtered from a second type of methSeq reads depending upon their conversion extents. At step 108, at least one score is derived using said first type of methSeq reads. At step 110, a predictive model is applied to said at least one score to determine presence of said at least one cancer and a tissue of origin of said at least one cancer in said subject.
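By way of a non-limiting illustration, the filtering of step 106 may be sketched as follows. The sketch rests on assumptions that are not stated in the present disclosure: the conversion extent of a read is estimated from cytosines in a non-CpG context of the reference (assumed here to be largely unmethylated), only forward-strand (C to T) reads are handled, and the reads are already aligned without gaps. The function names and the threshold value are hypothetical.

```python
def conversion_extent(ref_seq, read_seq):
    """Fraction of reference non-CpG cytosines that appear as T in the read.

    Returns None when the read covers no informative non-CpG cytosine.
    """
    converted = 0
    total = 0
    for i, (ref_base, read_base) in enumerate(zip(ref_seq, read_seq)):
        if ref_base != "C":
            continue
        if i + 1 >= len(ref_seq):
            continue  # sequence context unknown at the last base
        if ref_seq[i + 1] == "G":
            continue  # CpG site: informs methylation, not conversion
        if read_base in ("C", "T"):
            total += 1
            if read_base == "T":
                converted += 1
    return converted / total if total else None

def split_reads(aligned_reads, min_extent):
    """Separate reads at or above min_extent (first type) from the rest."""
    first_type, second_type = [], []
    for ref_seq, read_seq in aligned_reads:
        extent = conversion_extent(ref_seq, read_seq)
        if extent is not None and extent >= min_extent:
            first_type.append((ref_seq, read_seq))
        else:
            second_type.append((ref_seq, read_seq))
    return first_type, second_type

# Hypothetical aligned (reference, read) pairs, for illustration only.
reference = "ACCGTCATCA"
well_converted = (reference, "ATCGTTATTA")    # all non-CpG C read as T
poorly_converted = (reference, "ACCGTCATTA")  # one of three non-CpG C read as T

first, second = split_reads([well_converted, poorly_converted], min_extent=0.9)
print(len(first), len(second))
```

In such a sketch, only the first type of reads would be carried forward to the score derivation of step 108, so that an unconverted cytosine at a CpG site is less likely to be mistaken for a methylated cytosine.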
Throughout the present disclosure, the term "cell-free DNA" (cfDNA) refers to DNA fragments (typically, 160-200 base pairs long) floating freely in a blood sample of a subject. The cfDNA is not contained within blood cells, and its source could be various cells in the body. If there is an underlying cancer present in the body, then a few of the cfDNA fragments come from the cancer cells. Notably, such cfDNA fragments may carry methylation patterns distinct from those of fragments coming from normal cells. Herein, the subject is a human. Alternatively, the subject may be an animal.
Throughout the present disclosure, the term "methylation" refers to an epigenetic modification wherein an extra methyl or hydroxymethyl group is added to a nucleic acid base, such as a DNA base, particularly a cytosine nucleotide, typically at CpG dinucleotides (or CpG sites). Typically, a CpG site has a C followed by a G in the genomic sequence. Notably, there are approximately 28 million CpG sites in the human genome; some of these are methylated and some are not. However, it may be appreciated that in some tissues or conditions, methylation is observed even at non-CpG sites, and is thus referred to as "non-CpG methylation". This pattern of methylation (or methylation pattern) is known to differ from tissue to tissue (e.g., heart cells may have a different pattern from lung cells) and in cancer (cancer cells may have different patterns from normal cells). Notably, methylation influences gene expression by regulating the accessibility of the DNA to transcriptional machinery, without altering the underlying DNA (namely, cfDNA) sequence. Therefore, cfDNA methylation patterns serve as potential diagnostic and/or therapeutic biomarkers for distinguishing a subject having a disease or condition from a healthy control, or for tracking the status of said disease in a subject over a period of time.
In an embodiment, said cancer is selected from at least one of: colorectal cancer, lung cancer, breast cancer, prostate cancer, stomach cancer, liver cancer. Notably, colorectal cancer, lung cancer, breast cancer, prostate cancer, stomach cancer, liver cancer refer to malignancies arising in the colon or rectum, tissues of the lungs, breast tissues, prostate gland, stomach lining, and liver cells, respectively. Notably, the aforementioned cancers may vary in origin, pathophysiology, and progression, but all involve uncontrolled cellular growth within the respective organs or tissues. Beneficially, the method of the present disclosure may find application in detection, diagnosis, and/or treatment of the aforementioned cancer types. It may be appreciated that early and accurate detection of these cancers is critical due to their high prevalence and significant morbidity and mortality associated with late-stage diagnoses. Effective detection allows for timely intervention, which can improve patient outcomes by enabling earlier therapeutic strategies, reducing disease progression, and increasing the likelihood of remission or survival. Given the distinct molecular and genetic markers (referred to as "biomarkers", hereafter) associated with each cancer type, the method provides a means of identifying such biomarkers, which can assist in the stratification of patients, tailoring of personalized treatment regimens, and monitoring of treatment efficacy.
Referring to step 102, the method comprises obtaining a plurality of methylation sequencing (methSeq) reads from cfDNA of said subject. The term "methylation sequencing" (methSeq) as used herein refers to any sequencing technique designed to analyse DNA (namely, cfDNA) methylation patterns across a genome. Methylation sequencing (methSeq) enables the detection and quantification of these methylation marks, allowing for the identification of epigenetic changes that may be associated with diseases, such as cancer, neurological disorders, or other pathological conditions. Various methods of methSeq include, but are not limited to, enzymatic methyl sequencing (EM-seq), whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS), and other high-throughput sequencing approaches aimed at mapping methylated cytosines within the genome. Beneficially, the methSeq techniques provide insights into the regulation of gene activity, cellular differentiation, and the effects of environmental factors on the epigenome.
The term "methSeq read" as used herein refers to individual sequences of DNA, namely cfDNA, that are generated as a raw output of the methylation sequencing. Typically, the methSeq reads vary in length, such as 50 bp, 100 bp, or longer. The methSeq reads contain information about the methylation status of cytosine bases, particularly at CpG sites, and are used to determine the methylation patterns across the genome. Typically, unmethylated cytosine bases are converted to uracils (and read as thymines during sequencing), while methylated cytosine bases remain as cytosine bases. The methSeq reads show these conversions and allow for differentiation between methylated and unmethylated cytosine bases. Moreover, presence of cytosine bases or thymine bases at specific CpG sites indicates whether those sites were methylated or unmethylated in the original DNA. Beneficially, by analysing a large number of methSeq reads, the overall methylation pattern at those loci across a sample can be determined.
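By way of a non-limiting illustration, the reading of methylation status from converted methSeq reads may be sketched as follows. The sketch assumes gap-free, forward-strand alignments in which each read is compared position by position with the reference; the function names and the example sequences are hypothetical.

```python
def cpg_methylation_calls(ref_seq, read_seq):
    """Map each covered reference CpG position to a methylation call.

    True means the read shows C (methylated); False means the read
    shows T (unmethylated). Other bases are treated as uninformative.
    """
    calls = {}
    for i in range(len(ref_seq) - 1):
        is_cpg = ref_seq[i] == "C" and ref_seq[i + 1] == "G"
        if is_cpg and i < len(read_seq):
            if read_seq[i] == "C":
                calls[i] = True
            elif read_seq[i] == "T":
                calls[i] = False
    return calls

def locus_methylation_fraction(ref_seq, reads, position):
    """Fraction of informative reads that are methylated at one CpG."""
    calls = [cpg_methylation_calls(ref_seq, read).get(position) for read in reads]
    informative = [call for call in calls if call is not None]
    return sum(informative) / len(informative) if informative else None

# Hypothetical reference with CpG sites at positions 1 and 5.
reference = "ACGTTCGA"
reads = ["ACGTTTGA", "ATGTTTGA", "ACGTTCGA"]

first_read_calls = cpg_methylation_calls(reference, reads[0])
fraction_at_1 = locus_methylation_fraction(reference, reads, 1)
fraction_at_5 = locus_methylation_fraction(reference, reads, 5)
print(first_read_calls, fraction_at_1, fraction_at_5)
```

Aggregating such per-read calls over a large number of reads yields the overall methylation level at each locus of the sample.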
Notably, the methSeq library generation may be achieved by any one of: a conversion protocol or a non-conversion protocol. Primarily, the conversion protocols include chemical conversion of DNA (namely, bisulfite treatment or BS-seq) and enzymatic conversion of DNA (namely, EM-seq), which serve as the gold standard for differentiating between methylated and unmethylated cytosines. Conversion protocols preserve methylated Cs as Cs and convert unmethylated Cs to uracils, which show up as Ts in the sequencing data. This is because most sequencers can only handle A, C, G, T bases. Notably, the non-conversion protocols rely on alternative methods to assess DNA methylation. In this regard, the non-conversion protocols leverage some newer instruments which can sequence methylated Cs directly, almost as if a fifth base is being sequenced, thus eliminating the need for conversion. In this regard, components such as a conversion filter, a coarse conversion estimator, scoring models, pre-trained models, and perturbation may be relevant for assessing DNA methylation. Examples of non-conversion protocols include, but are not limited to, nanopore-based sequencing and affinity-based methods. Nanopore-based sequencing leverages platforms that can directly detect methylation patterns in native DNA by measuring changes in electrical current as DNA passes through a nanopore, allowing methylated bases to be distinguished from unmethylated bases. Affinity-based methods, such as Methylated DNA Immunoprecipitation Sequencing (MeDIP-seq), enrich methylated DNA regions using antibodies that bind to methylated cytosines, and subject the enriched methylated DNA regions to sequencing. Other affinity-based methods, such as Methyl-CpG Binding Domain Sequencing (MBD-seq), use methyl-CpG-binding domain proteins to capture methylated DNA fragments for sequencing. Beneficially, the non-conversion protocols preserve the integrity of the DNA, reducing degradation and loss during preparation. Moreover, such methods avoid biases introduced during conversion protocols and PCR. Additionally, such non-conversion protocols are generally less labour-intensive than conversion protocols.
In an embodiment, said methSeq reads are obtained by:
extracting plasma from the blood drawn from said subject;
extracting cell-free DNA (cfDNA) from said plasma;
pre-processing said cfDNA for end-repairing, A-tailing and adapter ligation;
subjecting said pre-processed cfDNA to at least one first enzyme selected from: Tet methylcytosine dioxygenase 2 (TET2) and T4 Phage β-glucosyltransferase (T4-BGT), for protecting methylcytosine bases (5mC) and hydroxymethyl cytosine bases (5hmC) from deamination to obtain a protected methSeq library;
subjecting said protected methSeq library to at least one second enzyme selected from: Apolipoprotein B mRNA editing enzyme catalytic subunit 3A (APOBEC3A), for converting unprotected cytosine bases to uracil bases to obtain converted single-stranded DNA;
performing PCR of said converted strands to obtain a converted methSeq library; and
performing sequencing of said converted methSeq library to obtain methSeq reads.
In this regard, methSeq reads are obtained by using enzymes to differentiate between methylated and unmethylated cytosines. Herein, the subjects range from treatment-naïve patients (including: a) patients without any previous onco-treatment or surgery, b) patients with suspected or confirmed benign, precancerous or malignant neoplasms and/or lesions, and c) patients eligible for surgical resection) to healthy controls (including: a) individuals without any symptoms of benign, precancerous or malignant neoplasms and/or lesions, and without any history of cancer or precancerous or benign lesions, b) individuals with the habit of chewing and/or smoking tobacco or of moderate alcohol consumption, and c) individuals without any of these habits). From the aforementioned subjects, whole blood samples are collected, for example, via venipuncture using a sterile needle, into tubes which prevent blood clotting and stabilize nucleic acids (including cfDNA) during storage and transport. The collected blood sample is subjected to centrifugation for plasma extraction from the whole blood sample. Notably, the centrifugation step separates the blood into three distinct layers, namely, plasma (top), buffy coat (middle) and red blood cells (bottom), based on the density of its components. The plasma layer contains cfDNA, proteins, hormones, and other components. The plasma layer may be subjected to a further centrifugation step to remove any residual cells or cellular debris therein, ensuring the plasma is clean and suitable for cfDNA extraction. Subsequently, the plasma is processed to extract cfDNA therefrom. Optionally, cfDNA is extracted using DNA-binding kits comprising silica membrane or magnetic beads, chaotropic salt solution, carrier RNA, wash buffers, elution buffer or water, etc.
It may be appreciated that the pre-processing step of said cfDNA for end-repairing, A-tailing and adapter ligation is essential for constructing a sequencing-ready DNA library, such as for next-generation sequencing (NGS). It may be appreciated that the cfDNA is naturally fragmented due to biological processes like apoptosis and necrosis, and such cfDNA fragments often have uneven or "sticky" ends (i.e., overhangs). End-repairing is a process that ensures the DNA fragments have blunt ends by making the 5' and 3' ends of the cfDNA uniform, enabling efficient ligation of sequencing adapters in later steps. Typically, a mixture of enzymes, such as T4 DNA polymerase, T4 polynucleotide kinase and exonuclease, is used to prepare cfDNA fragments for further modifications such as A-tailing and adapter ligation. A-tailing typically promotes ligation of the cfDNA fragments to sequencing adapters. The sequencing adapters are short, double-stranded DNA sequences that are necessary for sequencing. In this regard, a DNA polymerase enzyme is typically used to add an adenine (A) nucleotide to the 3' end of the blunt-ended cfDNA fragment, which is subsequently ligated, in the presence of a T4 DNA ligase enzyme, to a complementary thymine (T) overhang at a 3' end of a sequencing adapter. Typically, some sequencing adapters (namely, indexed adapters) contain a unique sequence or barcode that allows multiple samples to be sequenced in a single run by distinguishing between reads from different samples. Beneficially, the pre-processed cfDNA is ready for downstream processing and sequencing, with optional amplification steps for increasing the quantity of cfDNA fragments before the sequencing step.
The pre-processed cfDNA is subsequently subjected to enzymatic methyl sequencing (EM-seq) to generate EM-seq libraries. Herein, the terms "first enzyme" and "second enzyme" refer to two different classes of enzymes adapted for protecting bases from deamination and for converting a base into another, usually more stable, base, respectively. In this regard, the enzymes TET2 and T4-BGT are used for protecting methylcytosine bases (5mC) and hydroxymethyl cytosine bases (5hmC) from deamination to obtain a protected methSeq library. In this regard, typically, TET2 converts 5mC and intermediate 5hmC into 5-carboxylcytosine (5caC), thereby allowing both methylated and hydroxymethylated cytosines to be distinguished from unmodified or unmethylated cytosines. T4-BGT catalyses the conversion of TET2-converted as well as naturally occurring 5hmC to 5-(β-glucosyloxymethyl)cytosine. Additionally, treatment with APOBEC3A converts unmodified cytosines (C) into uracil (U), which is later read as thymine (T), while modified or methylated cytosines remain cytosines during sequencing, such as NGS. In other words, in the reads from EM-seq, namely EM-methSeq or methSeq reads, the cytosines represent methylated cytosines, and the thymines represent unmethylated cytosines. Beneficially, EM-seq libraries represent the methylome with minimal DNA fragmentation or biases, higher quality DNA and better yields, even with less input DNA (down to 10 ng) or degraded DNA samples. Moreover, EM-seq provides single-base resolution analysis. Optionally, the resulting methSeq library is sequenced on a standard high-throughput short-read sequencer, such as a next-generation sequencing (NGS) sequencer.
It may be appreciated that, alternatively, the EM-seq process to generate EM-seq libraries may employ other enzymes categorized as the first enzyme and the second enzyme, and is not just limited to TET2 and T4-BGT, and APOBEC3A, respectively.
In an embodiment, said methSeq reads are obtained by:
extracting plasma from the blood drawn from said subject;
extracting cell-free DNA (cfDNA) from said plasma;
pre-processing said cfDNA for end-repairing, A-tailing and adapter ligation;
subjecting said pre-processed cfDNA to sodium bisulfite for converting unmethylated cytosine bases to uracil bases to obtain a converted methSeq library; and
performing sequencing of said converted methSeq library to obtain methSeq reads.
In this regard, methSeq reads are obtained by using chemical treatment. As mentioned above in the previous embodiment, the cfDNA is extracted from the plasma of the whole blood sample and pre-processed. The pre-processed cfDNA is subjected to sodium bisulfite treatment to convert unmethylated cytosines into uracil (which is later read as thymine in sequencing, such as in bisulfite sequencing (BS-seq) approaches including WGBS and RRBS), while methylated cytosines remain unchanged. Typically, the sodium bisulfite reagent deaminates unmethylated cytosines in DNA. Notably, BS-seq provides comprehensive genome-wide DNA methylation analysis.
It may be appreciated that the sequence of steps of the BS-seq may vary in other variations of this embodiment. For example, a person skilled in the art may perform sodium bisulfite-based conversion of unmethylated cytosine bases to uracil bases of the cfDNA to obtain a converted methSeq library prior to performing end-repairing, A-tailing and adapter ligation on said converted cfDNA. It may be appreciated that by combining bisulfite treatment with NGS, as in WGBS, DNA methylation can be profiled or mapped at single-base resolution, namely, at single cytosine bases, across the entire genome, including promoters, enhancers, and intergenic regions. Alternatively, RRBS focuses on CpG-rich regions while still providing single-base resolution.
Optionally, the method comprises employing metrics derived from the cfDNA, namely lab metrics, during the upstream processing of the sample, i.e., the wet lab stages as mentioned above. Notably, the assessment is based on the methSeq reads, supplemented with such other metrics derived from the cfDNA during the upstream processing of the sample. Some metrics regarding cfDNA may be the quantity of cfDNA, the average size of the cfDNA, the percentage of cfDNA molecules below a size threshold, the quantity of cfDNA molecules below (or above) a size threshold, etc. Beneficially, the metrics derived from the cfDNA are directly used along with the scores by the machine learning model. Optionally, the metrics regarding cfDNA may have 20%, 40%, 60%, 80%, 90% thresholds that are used to derive different individual scores based on a biological nature of the cfDNA.
Optionally, the method comprises using the gender of the subject from whom the sample is obtained to decide which cancers to assess in a given sample. In this regard, it may be appreciated that assessments of gender-specific cancers, such as breast cancer, ovarian cancer, and cervical cancer, are done only in samples from female subjects, thus eliminating the need for, and/or the bias arising from, testing for these cancer types in male patients.
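By way of a non-limiting illustration, size-based lab metrics of the kind listed above may be computed as sketched below. The sketch assumes that a list of cfDNA fragment lengths (in base pairs) is available, for example from a fragment analyser; the function name, the dictionary keys, the size threshold and the example lengths are hypothetical.

```python
from statistics import mean

def cfdna_size_metrics(fragment_lengths_bp, size_threshold_bp):
    """Summarise fragment sizes relative to a chosen size threshold."""
    count = len(fragment_lengths_bp)
    below = sum(1 for length in fragment_lengths_bp if length < size_threshold_bp)
    return {
        "fragment_count": count,
        "mean_size_bp": mean(fragment_lengths_bp),
        "count_below_threshold": below,
        "percent_below_threshold": 100.0 * below / count,
    }

# Hypothetical fragment lengths, for illustration only.
lengths = [140, 150, 165, 167, 170, 180, 200, 320]
metrics = cfdna_size_metrics(lengths, size_threshold_bp=160)
print(metrics)
```

Such metrics may then be appended to the methylation-derived scores as additional inputs to the machine learning model.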
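By way of a non-limiting illustration, the selection of cancers to assess for a given sample may be sketched as follows. The set of female-only cancers is taken from the examples given above; the function name, the panel contents and the string labels are hypothetical.

```python
# Cancers assessed only in samples from female subjects (examples from the text).
FEMALE_ONLY_CANCERS = {"breast", "ovarian", "cervical"}

def cancers_to_assess(panel, sample_gender):
    """Return the cancers from the panel that apply to the sample."""
    if sample_gender == "female":
        return list(panel)
    return [cancer for cancer in panel if cancer not in FEMALE_ONLY_CANCERS]

# Hypothetical panel, for illustration only.
panel = ["colorectal", "lung", "breast", "ovarian", "liver"]
male_panel = cancers_to_assess(panel, "male")
female_panel = cancers_to_assess(panel, "female")
print(male_panel, female_panel)
```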
In one embodiment, the method comprises amplifying the converted methSeq library by subjecting it to a plurality of polymerase chain reaction (PCR) cycles prior to sequencing, wherein the number of PCR cycles is determined by the quantity of input cell-free DNA (cfDNA). In certain embodiments, no amplification or a plurality of PCR cycles may be performed, depending on the input cfDNA, prior to sequencing and subsequent analysis. It is understood that a higher input of cfDNA requires fewer PCR cycles, while a lower input necessitates additional cycles to yield a sufficient quantity of amplified cfDNA for downstream processes, such as sequencing. For instance, where the input cfDNA exceeds 100 ng, approximately 4–5 PCR cycles may be performed, whereas when the input cfDNA is less than 10 ng, approximately 11–18 PCR cycles may be necessary. In an additional embodiment, for input cfDNA exceeding 400 ng, approximately 2–4 PCR cycles may be sufficient.
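By way of a non-limiting illustration, the dependence of the number of PCR cycles on the quantity of input cfDNA may be sketched as follows. The sketch encodes only the example ranges stated above and deliberately returns no recommendation for input quantities for which no range is stated (from 10 ng up to 100 ng); the function name is hypothetical.

```python
def suggested_pcr_cycles(input_cfdna_ng):
    """Return an approximate (min_cycles, max_cycles) range, or None.

    Only the example ranges given in the text are encoded; inputs from
    10 ng up to 100 ng have no stated range, so None is returned.
    """
    if input_cfdna_ng > 400:
        return (2, 4)
    if input_cfdna_ng > 100:
        return (4, 5)
    if input_cfdna_ng < 10:
        return (11, 18)
    return None

for quantity_ng in (500, 150, 50, 5):
    print(quantity_ng, suggested_pcr_cycles(quantity_ng))
```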
In one embodiment, following the generation of the methSeq library, quantitative real-time PCR (qPCR) may be employed to assess the efficiency of conversion of unmethylated cytosine to thymine. In this context, the PCR method utilizes primers designed to amplify specific regions of the human genome. Fewer PCR cycles may be required if the conversion is incomplete, as many regions would retain sequence similarity to the original primer binding sites, allowing for amplification within the initial PCR cycles. Conversely, if conversion is 100% efficient, all unmethylated cytosine bases are fully converted to uracil and subsequently to thymine, including those within regions complementary to the primers. This results in compromised primer binding, necessitating a greater number of PCR cycles for amplification. Accordingly, methSeq libraries that require a plurality of PCR cycles, for example, at least 21 cycles, for sufficient amplification are indicative of complete conversion and are selected for further processing.
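By way of a non-limiting illustration, the selection of methSeq libraries based on the number of qPCR cycles may be sketched as follows. The default of 21 cycles is the example value given above; the function name, the library names and the cycle values are hypothetical.

```python
def select_fully_converted(amplification_cycles, min_cycles=21):
    """Return names of libraries that needed at least min_cycles cycles.

    A larger cycle count is taken here as a sign that the primer
    binding sites were converted, i.e., that conversion was complete.
    """
    return sorted(
        name for name, cycles in amplification_cycles.items() if cycles >= min_cycles
    )

# Hypothetical cycle counts, for illustration only.
libraries = {"lib_A": 24.5, "lib_B": 17.0, "lib_C": 21.0}
selected = select_fully_converted(libraries)
print(selected)
```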
In an embodiment, the method further comprises:
incorporating a unique sample barcode or index during said adapter ligation step;
pooling said converted methSeq library with converted methSeq libraries from additional subjects to create a pooled methSeq library;
performing a target capture of said pooled library using enrichment probes;
sequencing said pooled methSeq library to generate pooled methSeq reads; and
demultiplexing said pooled methSeq reads to isolate said methSeq reads corresponding to said subject.
In this regard, the unique sample barcode or index is incorporated in the cfDNA fragment during PCR amplification. The unique sample barcode or index enables distinguishing between different samples during pooling. After incorporating unique sample barcodes, multiple converted methSeq libraries from different subjects, namely additional subjects, are pooled together (or combined) into a single pooled library and subjected to sequencing. Notably, for each sequencing read, the sequencer records the sample barcode or index sequence. Optionally, two different barcodes are used. The first is the sample barcode or index, which is different for each sample and is used for demultiplexing, as used herein. The second is a unique molecular barcode, also known as a unique molecular identifier (UMI), which is used to deal with PCR duplicate sequences. The UMI allows bioinformatic tools to identify and collapse PCR duplicate sequences, ensuring that each cfDNA fragment is counted only once, thus improving the quantitative accuracy of the sequencing data. Moreover, incorporating the unique molecular barcodes reduces errors by comparing sequences with the same unique molecular barcode, allowing detection and filtering of erroneous sequences, and correcting amplification biases that can occur during PCR.
Herein, the term "cfDNA sample" refers to the cfDNA extracted from a subject for which a corresponding methSeq library was prepared by following the aforementioned method. It may be appreciated that for pooling, multiple subjects are selected, from whom multiple cfDNA samples are extracted to prepare corresponding methSeq libraries, as mentioned above. Optionally, before pooling, the methSeq libraries are quantified and normalized to ensure each sample contributes equally to the final sequencing output. After obtaining pooled methSeq reads in a single file containing methSeq reads from all pooled samples, by sequencing of the pooled methSeq library, the obtained pooled methSeq reads are demultiplexed, or separated back to their corresponding original samples, based on the unique sample barcodes or indices. Notably, demultiplexing employs a mismatch tolerance to account for errors in sequencing (e.g., a misread sample barcode) without compromising the integrity of the cfDNA samples. Beneficially, separate unique sample barcodes, associated with separate converted methSeq libraries, allow sequencing reads from different samples to be distinguished from one another after sequencing. Moreover, pooling multiple converted methSeq libraries reduces the number of sequencing runs that would otherwise be required for separate converted methSeq libraries, thereby reducing the overall cost of the high-throughput sequencing. Furthermore, demultiplexing ensures accurate data attribution for further analysis of the cfDNA samples such as in cancer diagnostics or genetic testing.
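By way of a non-limiting sketch, demultiplexing with a mismatch tolerance may be illustrated as follows. The function names, the barcode sequences, and the default tolerance of one mismatch are assumptions made for illustration; in practice this step is typically performed by the sequencer vendor's software. A read is assigned to a subject only when exactly one sample barcode lies within the tolerance, and is otherwise left unassigned.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_seq: str, barcodes: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the subject whose barcode is the only one within the tolerance."""
    hits = [subject for subject, barcode in barcodes.items()
            if hamming(index_seq, barcode) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads: Sequence[Tuple[str, str]], barcodes: Dict[str, str],
                max_mismatches: int = 1) -> Dict[Optional[str], List[str]]:
    """Group (index_seq, read_id) pairs by subject; unassigned reads go under None."""
    grouped: Dict[Optional[str], List[str]] = {}
    for index_seq, read_id in reads:
        subject = assign_sample(index_seq, barcodes, max_mismatches)
        grouped.setdefault(subject, []).append(read_id)
    return grouped


if __name__ == "__main__":
    # Hypothetical sample barcodes and pooled reads (index sequence, read id).
    barcodes = {"subject_A": "ACGTAC", "subject_B": "TGCATG"}
    pooled = [("ACGTAC", "R1"), ("ACGTAA", "R2"), ("TGCATG", "R3"), ("GGGGGG", "R4")]
    print(demultiplex(pooled, barcodes))
```

In this example, read R2 carries a single misread base in its index and is still attributed to subject_A, whereas R4 matches no barcode within the tolerance and remains unassigned.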
In an embodiment, the method further comprises:
performing hybridization capture of target regions of said converted methSeq library to obtain a captured methSeq library; and
performing sequencing of said captured methSeq library to obtain methSeq reads.
In this regard, before performing sequencing of the converted methSeq library, specific target regions, such as those containing CpG sites, promoters, or other methylation-sensitive regions, are specifically enriched using the hybridization capture technique. Herein, the term "enrich" refers to selectively increasing representation of specific regions (such as CpG sites, promoter regions, enhancer regions, or other loci of interest) of the cfDNA fragments that contain methylated and unmethylated cytosines in the original methSeq library. Typically, probes (also called capture baits), that are short, single-stranded oligonucleotides complementary to the target regions, are designed and used for hybridization capture of target regions. Optionally, the probes are biotin-labeled for easy recovery of hybridized fragments of the captured methSeq library, using streptavidin-coated beads (which bind to biotin). The captured methSeq library is subjected to sequencing to obtain the methSeq reads, as mentioned above, to generate methylation-specific data for the target regions. Optionally, the captured methSeq library is subjected to amplification prior to sequencing thereof. Beneficially, hybridization capture of target regions is more cost-effective and manageable than sequencing the entire genome, by reducing the amount of sequencing data required. Moreover, hybridization capture of target regions increases accuracy of methylation patterns measurement in the selected (enriched) target regions.
In an embodiment, said method comprises estimating, using Alu sequences, said conversion extent of said unmethylated cytosine bases to uracil bases in said converted methSeq library, wherein if said unmethylated cytosine bases undergo an incomplete conversion to uracil bases, binding of said Alu sequences to a primer thereof is retained yielding a smaller PCR Ct value indicative of a coarse conversion efficiency. Typically, Alu sequences are a family of short interspersed nuclear elements (SINEs) that are highly repetitive and abundant in the human genome, constituting about 10-15% of the genome. These sequences are about 300 base pairs long and are found in many copies throughout the genome. Since Alu sequences represent a significant fraction of the genome, said Alu sequences play an important role in indicating the cytosine-to-uracil conversion efficiency of bisulfite treatment. Typically, Alu sequences are often hypomethylated in the genome, meaning most cytosines in these regions are unmethylated, therefore, if almost all cytosines in Alu sequences are converted to uracil, the BS-seq treatment is considered to be efficient on the cfDNA samples.
Notably, if conversion efficiency is 100%, then all the unmethylated cytosine bases are converted to uracil, including the regions complementary to the Alu primers. Primer binding is then compromised, and such regions would need many PCR cycles to be amplified. However, if the conversion is not complete (i.e., conversion efficiency is low), Alu sequence will have many regions that still resemble the original primer binding regions, and they would be amplified within the first few cycles of PCR. Thus, methSeq libraries that take at least 21 cycles for amplification may be considered to have undergone complete conversion, and may be taken ahead for hybridisation with target probes.
In an embodiment, said method comprises estimating, using Alu sequences, said conversion extent of said unprotected cytosine bases to uracil bases in said converted methSeq library, wherein if said unprotected cytosine bases undergo an incomplete conversion to uracil bases, binding of said Alu sequences to a primer thereof is retained yielding a smaller PCR Ct value indicative of a coarse conversion efficiency. Herein, if almost all cytosines in Alu sequences are converted to uracil, the Alu sequences indicate the cytosine-to-uracil conversion efficiency of the enzyme treatment, i.e., that the EM-seq treatment is efficient on the cfDNA samples.
Optionally, said methSeq reads are obtained by:
extracting plasma from the blood drawn from said subject;
extracting cell-free DNA (cfDNA) from said plasma; and
applying non-conversion-based protocol for obtaining methSeq reads.
In this regard, as mentioned above in the previous embodiments, the cfDNA is extracted from the plasma of the whole blood sample. The non-conversion-based protocol (or conversion-free protocol) eliminates the use of any chemical or enzymatic treatment of the cfDNA. Typically, the non-conversion-based protocol does not require the conversion of cytosines to uracils (or any other modifications) to differentiate between methylated and unmethylated cytosines. Optionally, the resulting cfDNA is sequenced on a standard high-throughput short-read sequencer (such as a next generation sequencing (NGS) sequencer). It may be appreciated that the non-conversion-based protocol comprises the steps of NGS library preparation as discussed above, along with optional steps of PCR amplification, pooling, and enrichment. The NGS library is subjected to 5-base sequencing and, optionally, demultiplexing if the NGS library was pooled as mentioned above. The 5-base sequencing reads are subjected to alignment (such as methylation-aware alignment) and filtering steps to obtain methSeq reads, as mentioned above.
Referring to step 104, the method comprises analysing said methylSeq reads using a bioinformatics pipeline to obtain a genomic alignment and a methylation status for each methylSeq read. The term "bioinformatics pipeline" as used herein refers to the key steps of processing the methSeq reads, aligning the methSeq reads to the reference genome, and accurately determining the methylation status of each methSeq read. Optionally, the sequential key steps of the bioinformatics pipeline include: identifying quality control parameters; adapter removal and trimming; alignment to a human reference genome; filtering reads based on various read and alignment parameters; determination of a methylation status for each CpG base in said methSeq read; and determination of a methylation status for each non-CpG base in said methSeq read.
Notably, the step of identifying quality control parameters involves checking the quality of the methSeq reads (e.g., using tools like FastQC) to assess sequencing quality scores, read length distribution, GC content, and presence of any adapters or low-quality regions. Herein, the sequencing quality scores, also known as Phred scores, represent the probability that a base is incorrectly called during sequencing, and are thereby indicative of reliable bases in the methSeq reads for accurate methylation status analysis. A higher Phred score indicates higher confidence in the base call. For example, a Phred score of 30 means a 1 in 1000 chance of incorrect base calling, and a Phred score of 20 means a 1 in 100 chance of incorrect base calling. Moreover, the read length distribution refers to the variation in the length of methSeq reads. Ideally, all methSeq reads should have a consistent length, as this indicates uniformity in sequencing. Furthermore, the GC content refers to the percentage of guanine (G) and cytosine (C) bases in the sequencing reads. Different genomic regions have varying GC content, and a balanced GC content is important for good sequencing performance. Furthermore, the adapters refer to short artificial sequences that are ligated to DNA fragments during library preparation for sequencing. Notably, adapters can interfere with proper alignment to the reference genome, as they are not part of the actual genomic sequence; therefore, they must be removed before alignment. The low-quality regions, namely regions having poor Phred scores, are, like the adapters, often trimmed to improve overall analysis.
As mentioned above, the step of adapter removal and low-quality base trimming from the methSeq reads (e.g., with tools like Trimmomatic or Cutadapt) is essential to ensure proper alignment of the cfDNA sample methSeq reads to the reference genome. Moreover, the step of alignment to a human reference genome uses aligners (e.g., Bismark or Bowtie2) that account for the converted cytosines (read as thymines in the methSeq reads). Furthermore, after alignment, the step of filtering reads based on various read and alignment parameters, such as alignment quality, base quality, or duplicate reads, ensures that only high-quality data is used for further analysis. Furthermore, the steps of determination of a methylation status for each CpG base and each non-CpG base in said methSeq read are essential to assess a disease condition and to monitor a progression thereof in the subject from whom the cfDNA sample was extracted.
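By way of illustration, the relationship between a Phred score Q and the probability P of an incorrect base call, namely P = 10^(-Q/10), may be sketched as follows. The function names are hypothetical, and the quality string is assumed to use the common Phred+33 ASCII encoding of FASTQ files.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_probability(quality_string: str, offset: int = 33) -> float:
    """Mean per-base error probability for a FASTQ quality string.

    Assumes Phred+33 encoding unless another offset is supplied.
    """
    if not quality_string:
        raise ValueError("quality_string must not be empty")
    probabilities = [phred_to_error_probability(ord(ch) - offset)
                     for ch in quality_string]
    return sum(probabilities) / len(probabilities)


if __name__ == "__main__":
    print(phred_to_error_probability(30))  # about 1 in 1000
    print(phred_to_error_probability(20))  # about 1 in 100
    print(mean_error_probability("5?"))    # '5' encodes Q20 and '?' encodes Q30
```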
In an embodiment, said bioinformatics pipeline employs a plurality of functions comprising: identifying quality control parameters; adapter removal and trimming; alignment to a human reference genome; filtering reads based on various read and alignment parameters, determination of a methylation status for each CpG base in said methSeq read and determination of a methylation status for each non-CpG base in said methSeq read.
In an embodiment, said bioinformatics pipeline employs a plurality of functions comprising: FASTQC or Fastp for identifying quality control parameters; BbDuk, Trimmomatic or Fastp for adapter removal and trimming; BWAmeth, Biscuit or Bismark for alignment to a human reference genome; Samblaster or Picardtools markduplicates and Samtools or Sambamba for filtering reads based on various read and alignment parameters, WGBStools for determination of a methylation status for each CpG base in said methSeq read and a custom algorithm for determination of a methylation status for each non-CpG base in said methSeq read.
In this regard, FASTQC is a quality control tool that assesses the quality of high-throughput sequencing data. BBDUK is a tool for filtering and trimming reads by quality, adapters, and contaminants. TRIMMOMATIC is a tool for trimming low-quality bases and adapter sequences from FASTQ files. FASTP is an all-in-one preprocessing tool for FASTQ files that performs quality control, filtering, and adapter trimming. SAMTOOLS SORT INDEX is a tool for sorting and indexing BAM files to optimize storage and access. PICARD TOOLS MARKDUPLICATES is a tool that identifies and marks duplicate reads in BAM files to improve variant calling accuracy. SAMBAMBA MARKDUP is a tool for marking duplicate reads in alignment files, similar to Picard's MarkDuplicates. BWA_METH is an aligner optimized for mapping bisulfite-treated sequencing reads or enzyme-treated sequencing reads to reference genomes. BISMARK is a tool for aligning bisulfite-treated sequencing reads or enzyme-treated sequencing reads and determining cytosine methylation status. BISCUIT is a bisulfite sequencing pipeline or enzyme sequencing pipeline for alignment, quality control, and methylation calling. Methyldackel is a tool for extracting methylation metrics from bisulfite-treated sequencing data or enzyme-treated sequencing data. WGBStools is a suite of tools for analyzing whole-genome bisulfite sequencing data or whole-genome enzyme-treated sequencing data, including quality control and methylation calling. Notably, the bisulfite-treatment and enzymatic conversion have similar effects on the sequencing reads. Therefore, the aforementioned tools, BWA_METH, BISMARK, BISCUIT, Methyldackel and WGBStools, are adapted to work with any conversion protocol generated data. BEDtools is a toolkit for intersecting, merging, and manipulating genomic interval data in BED format. BCFtools are a set of utilities for variant calling, filtering, and processing VCF and BCF files.
It may be appreciated that a person skilled in the art may be aware of other tools which perform similar functions as mentioned above. Therefore, the aforementioned tools for performing the aforementioned functions should not limit the scope of the invention.
Referring to step 106, the method comprises filtering a first type of methSeq reads from a second type of methSeq reads depending upon their conversion extents. Herein, the term "conversion extent" refers to the degree or proportion of unmethylated cytosines in a given methSeq read that have been successfully converted to uracil (which is read as thymine during sequencing) during the EM-Seq or BS-Seq treatment process, which is critical for accurately distinguishing between methylated and unmethylated cytosines in methylation sequencing (methSeq). Notably, a high conversion extent is essential for accurate methylation analysis, and a low conversion extent can lead to erroneous interpretations of methylation patterns, since an unconverted unmethylated cytosine is indistinguishable from a methylated cytosine. Herein, the term "first type of methSeq reads" refers to well-converted methSeq reads, namely those in which all or almost all unmethylated cytosines are successfully converted to uracils (and subsequently read as thymines during sequencing), and the term "second type of methSeq reads" refers to poorly converted methSeq reads, namely those in which a substantial proportion of unmethylated cytosines are not converted to uracils (and are thus still read as cytosines during sequencing). It may be appreciated that well-converted methSeq reads ensure accurate detection of methylation patterns and reflect a high conversion efficiency of the EM-seq or BS-seq treatment, whereas poorly converted methSeq reads reduce the reliability of the methylation data. As mentioned above, optionally, Alu sequences are used to evaluate the conversion efficiency of the EM-seq or BS-seq treatment. Besides the Alu sequences, the cytosine-to-thymine ratio (C:T), methylation calling tools, and/or global conversion rates may be used to distinguish between the first and second types of methSeq reads. For example, if more than 95% of non-CpG cytosines are converted, then the methSeq read is considered to be of the first type (well-converted).
In an embodiment, said method comprises:
analysing said methylation status for each base in said methSeq read to estimate said conversion extent of said methSeq read; and
identifying said first type of methSeq reads, based on said conversion extent being above a threshold.
In this regard, the methylation status, i.e., a measure that reflects whether a cytosine base in the genome is methylated or unmethylated, for each base in said methSeq read is compared to the reference genome, and the conversion extent of said methSeq read is estimated. Notably, a high rate of cytosine-to-thymine (C:T) conversion at cytosines that are expected to be unmethylated (for example, non-CpG cytosines) in a read is indicative of a high conversion extent, and vice versa. It may be appreciated that the threshold may be a plurality of thresholds based on the different conversion efficiencies associated with bases existing in different target regions such as CpG sites, promoters, enhancers, etc., as compared to the reference genome. For example, the plurality of thresholds (or threshold values) may be 90%, 92%, 95%, and 98%. Optionally, the threshold (or threshold value) may be above 90%, such as 95%.
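By way of a minimal sketch, the read-level filtering described above may be illustrated using non-CpG cytosines as the cytosines expected to be unmethylated. The sketch assumes an ungapped, forward-strand alignment in which the reference and read strings have equal length. The function names are hypothetical, and the treatment of reads without any informative non-CpG cytosine (placed in the second type here) is an assumption not specified in the present description.

```python
from typing import List, Optional, Sequence, Tuple


def conversion_extent(ref: str, read: str) -> Optional[float]:
    """Fraction of non-CpG reference cytosines read as T (converted).

    Returns None when the read covers no informative non-CpG cytosine.
    """
    if len(ref) != len(read):
        raise ValueError("ref and read must be aligned to equal length")
    ref, read = ref.upper(), read.upper()
    converted = unconverted = 0
    # The last base is skipped because its sequence context is unknown.
    for i in range(len(ref) - 1):
        if ref[i] == "C" and ref[i + 1] != "G":
            if read[i] == "T":
                converted += 1
            elif read[i] == "C":
                unconverted += 1
            # Any other base is treated as a sequencing mismatch and ignored.
    total = converted + unconverted
    return converted / total if total else None


def split_reads(alignments: Sequence[Tuple[str, str]],
                threshold: float = 0.95) -> Tuple[List[str], List[str]]:
    """Split (ref, read) pairs into first-type and second-type reads."""
    first_type: List[str] = []
    second_type: List[str] = []
    for ref, read in alignments:
        extent = conversion_extent(ref, read)
        if extent is not None and extent > threshold:
            first_type.append(read)
        else:
            second_type.append(read)  # assumption: uninformative reads land here
    return first_type, second_type


if __name__ == "__main__":
    reference = "ACCATCGCAT"  # one CpG (retained C) and three non-CpG cytosines
    pairs = [(reference, "ATTATCGTAT"), (reference, "ATCATCGTAT")]
    print(split_reads(pairs))
```

In this example, the first read has all three non-CpG cytosines converted and is retained as a first-type read, whereas the second read has only two of three converted and falls below the 95% threshold.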
Referring to step 108, the method comprises deriving at least one score using said first type of methSeq reads. The term "at least one score" as used herein refers to a conversion or methylation score or a metric that indicates the proportion of unmethylated cytosines in the methSeq read that have been successfully converted to uracils and subsequently to thymines after EM-seq or BS-seq treatment. The score is an indicator of how well the conversion process has worked and how reliable the sequencing data is for determining methylation status. Typically, the score is calculated as the ratio of cytosines converted to thymine versus the total number of cytosines that were expected to be converted (i.e., the unmethylated cytosines).
Score=(Number of cytosines converted to thymine)/(Total number of expected unmethylated cytosines)×100
For example, if 95 out of 100 unmethylated cytosines were successfully converted to thymine in a read, the score would be 95%, indicative of efficient conversion process by EM-Seq or BS-Seq treatment.
In this regard, the resulting methSeq reads are aligned (or matched) to the human genome reference sequence using standard methods, allowing for the fact that some of the Cs in the various CpGs may have been converted to T. The methSeq reads that align to the above-mentioned target regions, such as CpGs, are analyzed further, and a comparison of the methSeq reads with the aligned reference sequence is then used to identify which CpGs in the methSeq reads are unmethylated and which are methylated. For each of the target regions, a plurality of scores is derived by aggregating the above information in various ways.
Optionally, one such score, namely CpG-wise methylation, is calculated for each CpG in the target region. In this regard, the fraction of fragments overlapping that CpG that are methylated at that CpG is considered, and any summary statistic of this fraction across the CpGs in the region (e.g., the average) is calculated.
Optionally, another score is calculated for each methSeq read. In this regard, a fraction of CpGs in the methSeq read that are methylated and the distribution of this quantity across all methSeq reads aligned to the target region is considered, and any summary statistic for this distribution (e.g., the median) is calculated.
Optionally, yet another score is calculated from the length of each methSeq read. In this regard, for each target region, the fraction of the methSeq reads aligned to the target region that have lengths below a certain threshold is calculated.
Moreover, for each score identified above, a score trend is identified, using knowledge of biology, as to whether an increasing or a decreasing trend in the score is associated with cancer. In an example, in a collection of blood samples from cancer patients and from normal individuals, if a score value is higher in cancer patients than in the normal individuals, then the trend direction is considered positive.
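By way of a non-limiting sketch, the three optional region-level scores described above may be computed as follows for the methSeq reads aligned to one target region. The `Read` structure, the function names, the choice of mean and median as summary statistics, and the 150-base length threshold are assumptions made for illustration only.

```python
from statistics import mean, median
from typing import Dict, NamedTuple, Sequence


class Read(NamedTuple):
    length: int                 # read (fragment) length in bases
    cpg_calls: Dict[int, bool]  # CpG position -> True if called methylated


def cpg_wise_score(reads: Sequence[Read]) -> float:
    """Mean, over CpGs, of the fraction of covering reads methylated at that CpG."""
    methylated: Dict[int, int] = {}
    covered: Dict[int, int] = {}
    for read in reads:
        for position, is_methylated in read.cpg_calls.items():
            covered[position] = covered.get(position, 0) + 1
            methylated[position] = methylated.get(position, 0) + int(is_methylated)
    return mean(methylated[p] / covered[p] for p in covered)


def read_wise_score(reads: Sequence[Read]) -> float:
    """Median, over reads, of the fraction of CpGs in the read that are methylated."""
    fractions = [sum(r.cpg_calls.values()) / len(r.cpg_calls)
                 for r in reads if r.cpg_calls]
    return median(fractions)


def short_fragment_score(reads: Sequence[Read], max_length: int = 150) -> float:
    """Fraction of reads in the region whose length is below max_length."""
    return sum(r.length < max_length for r in reads) / len(reads)


if __name__ == "__main__":
    region_reads = [
        Read(120, {100: True, 110: True}),
        Read(170, {100: True, 110: False}),
        Read(140, {110: False}),
    ]
    print(cpg_wise_score(region_reads))
    print(read_wise_score(region_reads))
    print(short_fragment_score(region_reads))
```

An empty region raises an error in this sketch; a practical implementation would decide how to treat regions without coverage.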
In an embodiment, the method comprises using said first type of methSeq reads to compute various scores selected from any of:
at least one region-wise methylation score corresponding to each of at least one genomic region, wherein said at least one region-wise methylation score is one of: a CpG-wise methylation score, CHG-wise methylation score, CHH-wise methylation score, read-wise methylation score, and fragmentomic methylation score; and
a global methylation score selected from one of: global fragmentomic methylation score, global mitochondrial to nuclear ratio methylation score.
Herein, the term "region-wise methylation score" refers to a quantitative measure that reflects the level of DNA methylation within a specific genomic region. The region-wise methylation score is calculated based on the proportion of methylated cytosines (either CpG, CHG, or CHH sites) within that region, allowing for the assessment of methylation patterns over defined regions of interest, such as promoters, gene bodies, or CpG sites.
Typically, CpG-wise methylation score represents the methylation status specifically at CpG sites within the genomic region.
CpG-wise methylation Score=(Methylated CpGs)/(Total CpGs)×100
CHG-wise Methylation Score reflects the methylation level at cytosines followed by H (A, T, or C) and G (CHG sites) in the region. CHH-wise Methylation Score measures the methylation status at CHH sites (where H is A, T, or C) at non-CpG methylation sites. Read-wise Methylation Score reflects the methylation status across an entire sequencing read, considering all the methylated cytosines (CpG, CHG, and CHH) in that read, for understanding methylation at the read level rather than individual sites. Fragmentomic Methylation Score measures the methylation level within specific DNA fragments, such as cell-free DNA (cfDNA), to provide insight into fragment-based methylation patterns, such as those found in circulating cfDNA in liquid biopsies.
Moreover, the global methylation score provides a broader measure of the methylation status across the entire genome or across particular global contexts, such as cfDNA fragments or mitochondrial DNA. The global methylation scores are often used for assessing overall methylation levels, which can reflect genome-wide changes in methylation associated with conditions like cancer or aging. Typically, a global fragmentomic methylation score measures the overall methylation level across all DNA fragments, typically in cell-free DNA (cfDNA) samples to detect diseases such as cancer. A global mitochondrial to nuclear ratio (MNR) methylation score represents the ratio of methylation levels between mitochondrial DNA (mtDNA) and nuclear DNA (nDNA). Changes in the mtDNA to nDNA methylation ratio can indicate shifts in cellular processes, such as metabolic changes or oxidative stress, and have potential implications for understanding cancer biology and aging.
Optionally, the method comprises using an assay QC parameter score. It may be appreciated that the assay QC parameter is derived from assays performed on the cfDNA samples. Notably, the size of methSeq reads decreases with increasing cancer stage (namely, I, II, III, and IV), consistent with the hypothesis that widespread hypomethylation in cancers leads to reduced cfDNA fragment size. It may be appreciated that the assay QC parameter is used as an input to the predictive model, and not to score the conversion extent.
Referring to step 110, the method comprises applying a predictive model to said at least one score to determine presence of such cancer and a tissue of origin of such cancer in said subject. The term "predictive model" as used herein refers to a computational model, an algorithm, or a statistical model designed to predict biological or clinical outcomes (e.g., disease status, cancer type, or progression) based on methylation patterns observed across specific genomic regions or globally across the genome. In other words, the predictive model uses methylation scores as features to classify, predict, or assess outcomes with the aim of leveraging epigenetic data for diagnostic, prognostic, or therapeutic purposes. Herein, the predictive model is trained on methylation data (e.g., CpG-wise, CHG-wise, fragmentomic methylation scores, etc.) to predict a target outcome, such as disease diagnosis (e.g., distinguishing between cancerous and non-cancerous samples based on methylation patterns); prognosis (e.g., predicting the likelihood of disease progression or survival based on the global or region-wise methylation scores); and/or classification (e.g., identifying specific cancer types based on unique methylation patterns across certain regions or genome-wide). Optionally, the training data may include cell-free DNA (cfDNA) from liquid biopsies, tissue-specific DNA, or whole-genome methylation data. Optionally, the predictive model is a supervised learning-based model (like logistic regression, random forests, support vector machines (SVMs), or neural networks trained to classify samples or predict outcomes based on methylation scores) or an unsupervised learning-based model (like clustering algorithms or dimensionality reduction (e.g., PCA) used to explore patterns in the methylation data without specific labelled outcomes).
Optionally, the predictive model is subjected to cross-validation to ensure that the predictive model generalizes well to new, unseen methylation data. Beneficially, the predictive model enables non-invasive early detection of cancers through blood tests (liquid biopsy), thereby guiding early interventions, personalized treatments and better clinical outcomes.
Herein, the term "tissue of origin" refers to a specific type of tissue or organ in which cancer cells originate and proliferate before potentially spreading to other parts of the body. Beneficially, identifying the tissue of origin helps in diagnosing the type of cancer and in selecting a treatment plan, including surgery, chemotherapy, radiation therapy, and targeted therapies. For example, lung cancer, breast cancer, and colon cancer all originate from different tissues and have distinct characteristics, and thus different treatment plans. Optionally, the tissue of origin may be epithelial tissue, connective tissue, blood-forming tissues, neural tissue, and so forth.
In an embodiment, the method further comprises generating said predictive model using machine learning methods applied to at least one score generated from a leave-in training set of subject samples to:
determine presence or absence of said cancer and the tissue of origin of said cancer, and
assess for accuracy of said determination of presence or absence of said cancer and the tissue of origin of said cancer, on a leave-out set of subject samples.
It may be appreciated that the predictive model may either be newly generated or be selected from existing predictive models, to determine presence of such cancer and a tissue of origin of such cancer in said subject. In this regard, optionally, generating the predictive model may comprise the following steps:
receiving a training dataset comprising blood samples collected from cancer patients and normal individuals (controls) to be subjected to the above process;
performing a feature selection for identifying a subset of target regions and associated scores (among the at least one score computed above) that have significant differences between cancer patients and controls while respecting the score trends;
applying monotonicity constraints, by using gradient-boosted trees, to train the predictive model with the at least one score computed above and the added constraint respecting the score trends, to generate a predictive model that is more explainable and consistent with biology;
generating multiple predictive models from different subsets of the training dataset and assessing prediction performance (cancer stage-wise median and 95% confidence intervals) on the rest of the training dataset for each of these multiple predictive models; and
employing a test dataset comprising a completely independent collection of blood samples from a different set of cancer patients and normal individuals to assess performance of each of the above multiple predictive models (cancer stage-wise median and 95% confidence intervals).
Herein, the terms "leave-in training set" and "leave-out set" refer to different sets of subjects selected from cancer patients and controls. Notably, from amongst the total number of participants or subjects, the leave-in training set comprises 80% and the leave-out set comprises 20% of the total number of participants or subjects. Typically, the leave-in training set is employed as training datasets for generating the predictive model, while the leave-out set is the test dataset employed for validation of the generated predictive model. It may be appreciated that subjects in both the leave-in training set and the leave-out set undergo the above mentioned process to have corresponding scores associated therewith.
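By way of illustration only, the 80%/20% division of subjects into a leave-in training set and a leave-out set may be sketched as a seeded random split. The function name and the fixed seed are assumptions; the present description does not state whether the split is stratified (for example, by cancer status or stage), so this sketch performs a plain random split.

```python
import random
from typing import List, Sequence, Tuple


def split_leave_in_out(subject_ids: Sequence[str],
                       leave_out_fraction: float = 0.2,
                       seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split subjects into (leave_in, leave_out) lists."""
    if not 0 < leave_out_fraction < 1:
        raise ValueError("leave_out_fraction must be between 0 and 1")
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_out = round(len(ids) * leave_out_fraction)
    return ids[n_out:], ids[:n_out]


if __name__ == "__main__":
    subjects = [f"subject_{i:03d}" for i in range(100)]  # hypothetical identifiers
    leave_in, leave_out = split_leave_in_out(subjects)
    print(len(leave_in), len(leave_out))
```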
In an embodiment, said leave-in training set and leave-out set comprise a group of subjects selected from: cancer patients diagnosed at a stage of cancer, patients with benign tumor, and healthy controls including both smokers and non-smokers, as well as tobacco chewers and non-chewers, wherein said stage of cancer is selected from a Stage I cancer, a Stage II cancer, a Stage III cancer, and a Stage IV cancer. It may be appreciated that including such a diverse set of subjects in the training dataset and the test dataset improves the predictive model and the determination of methylation patterns, so as to detect presence or absence of cancer in a subject and a tissue of origin with improved robustness, clinical relevance and generalizability. Notably, early-stage cancers (stage I) may exhibit different methylation signatures compared to more advanced stages (stages II-IV), thus enabling early detection, progression monitoring and guiding of personalized treatment. Similarly, malignant cancers and non-malignant (benign) tumor conditions also exhibit different methylation patterns, and including both therefore helps in improving diagnostic specificity by reducing false positives when identifying methylation patterns unique to cancer. Moreover, subjects with both cancer and benign tumor/s help in gaining insight into tumor biology. Controls typically provide a reference for normal methylation patterns; however, including individuals with habits such as smoking, alcohol consumption, tobacco use, or other lifestyle factors allows assessment of the impact of such factors on methylation patterns.
In an embodiment, said leave-in training set is subjected to a plurality of cross-validation steps, wherein each cross-validation step comprises a feature selection module that selects regions of interest, wherein said regions of interest are selected based on the distribution of one or more of said scores being significantly different between:
cancer patients, and
patients with benign tumor/s, and healthy controls.
In this regard, as mentioned above, based on the methylation patterns in the cancer patients, and the set of patients with benign tumor and healthy controls, the scores of the two groups are different. The leave-in training set is subjected to a 4-fold cross-validation, which may be performed 20 times, to generate for example, 20x4=80 different models. Results from the 4-fold cross-validation may be combined to yield one set of predicted scores for each cfDNA sample in the leave-in training set. A threshold is determined for the set of predicted scores based on the desired specificity, and accordingly each cfDNA sample is labeled with a predicted label. This is repeated for all 20 sets of scores, and the median sensitivity and 95% confidence intervals over these 20 sets of scores are calculated for the leave-in training set. Subsequently, the 80 models may be applied on the leave-out set and the model-specific thresholds calculated on the leave-in training set as above are applied to obtain 80 sets of predictions for each cfDNA sample in the leave-out set. The median sensitivity and specificity as well as 95% confidence intervals over these 80 sets of predictions are then calculated.
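The thresholding step described above can be illustrated with a minimal Python sketch. It assumes that a higher predicted score indicates cancer and that the threshold is taken from the scores of non-cancer samples; the function names and the example scores are hypothetical and are not part of the disclosed pipeline.

```python
import math
from statistics import median

def threshold_at_specificity(control_scores, target_specificity):
    """Smallest cut-off such that at least the target fraction of
    non-cancer samples score at or below it (assumed convention)."""
    ordered = sorted(control_scores)
    k = math.ceil(target_specificity * len(ordered)) - 1
    k = max(0, min(k, len(ordered) - 1))
    return ordered[k]

def sensitivity(case_scores, threshold):
    """Fraction of cancer samples with a score strictly above the cut-off."""
    return sum(s > threshold for s in case_scores) / len(case_scores)

def specificity(control_scores, threshold):
    """Fraction of non-cancer samples with a score at or below the cut-off."""
    return sum(s <= threshold for s in control_scores) / len(control_scores)

# Hypothetical predicted scores from one repeat of the cross-validation.
controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
cases = [0.20, 0.50, 0.60, 0.95]
cutoff = threshold_at_specificity(controls, 0.85)

# Sensitivities from several repeats would then be summarised by their median.
repeat_sensitivities = [sensitivity(cases, cutoff), 0.70, 0.80]
summary = median(repeat_sensitivities)
```

In this sketch the same cut-off computed on the leave-in training set would be reused unchanged when scoring the leave-out set, mirroring the model-specific thresholds described above.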
Moreover, the 4-fold cross-validation involves feature selection on the leave-in training set followed by building of the predictive model. Optionally, feature selection is performed using the Mann-Whitney U test or the Kolmogorov-Smirnov (KS) test. Notably, a predefined number of the most significant features are used for model building.
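As an illustration of such a feature selection step, the following Python sketch ranks regions by a Mann-Whitney U statistic computed on one score per sample. It is a simplified sketch under stated assumptions: the normal approximation is used without a tie correction, the data layout (a dictionary from region name to a list of per-sample scores) is hypothetical, and the function names are illustrative only.

```python
import math

def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_z(x, y):
    """z-score of the Mann-Whitney U statistic (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u1 - mean_u) / sd_u

def select_top_regions(cancer, non_cancer, k):
    """Return the k regions whose score distributions differ most between groups."""
    strength = {r: abs(mann_whitney_z(cancer[r], non_cancer[r])) for r in cancer}
    return sorted(strength, key=strength.get, reverse=True)[:k]

# Hypothetical per-region scores for cancer and non-cancer samples.
cancer = {"region_a": [0.9, 0.8, 0.7], "region_b": [0.5, 0.4, 0.6]}
non_cancer = {"region_a": [0.1, 0.2, 0.3], "region_b": [0.45, 0.55, 0.5]}
selected = select_top_regions(cancer, non_cancer, k=1)
```

To avoid information leakage, such a selection would be run inside each cross-validation fold on the training portion only, consistent with the description above.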
In an embodiment, said leave-in training set is subjected to at least one perturbation step for synthetically expanding the leave-in training set. Notably, the leave-in training set is synthetically expanded by modifying sequences thereof. In this regard, small perturbations to the cfDNA sample sequences are introduced randomly, resulting in different combinations of methSeq reads in the leave-in training set.
In an embodiment, performance for detection of cancer on the leave-out set is a sensitivity of 89.6% for Stage I cancer, 89.2% for Stage II cancer, 94.6% for Stage III cancer, and 92.1% for Stage IV cancer, at a specificity of approximately 86.2%. It may be appreciated that the high sensitivity and specificity of detecting presence or absence of cancer on the leave-out set helps guide effective personalized treatment.
In an embodiment, the model has a sensitivity of 79.3% for Stage I cancer, 78.4% for Stage II cancer, 78.4% for Stage III cancer, and 86.8% for Stage IV cancer, at a specificity of approximately 96.9%.
In an embodiment, performance for detection of cancer at 95% confidence interval of sensitivity for
a general cancer model is in a range of 71%-77% for Stage I, 74%-76% for Stage II, 70%-73% for Stage III, and 78%-80% for Stage IV; and
a female cancer model is in a range of 75%-82% for Stage I, 75%-76% for Stage II, 79%-81% for Stage III and 84%-87% for Stage IV.
In this regard, the performance for detection of cancer at 95% confidence interval of sensitivity for the general cancer model may be in a range of 71, 72, 73, 74, 75 or 76% up to 72, 73, 74, 75, 76 or 77% for Stage I; 74, 74.5, 75 or 75.5% up to 74.5, 75, 75.5 or 76% for Stage II; 70, 71 or 72% up to 71, 72 or 73% for Stage III; and 78, 78.5, 79 or 79.5% up to 78.5, 79, 79.5 or 80% for Stage IV. Moreover, the performance for detection of cancer at 95% confidence interval of sensitivity for the female cancer model may be in a range of 75, 76, 77, 78, 79, 80 or 81% up to 76, 77, 78, 79, 80, 81 or 82% for Stage I; 75, 75.2, 75.4, 75.6 or 75.8% up to 75.2, 75.4, 75.6, 75.8 or 76% for Stage II; 79, 79.5, 80 or 80.5% up to 79.5, 80, 80.5 or 81% for Stage III; and 84, 85 or 86% up to 85, 86 or 87% for Stage IV.
In an embodiment, performance for detection of the tissue of origin on the leave-out set is 70-85%. Optionally, performance for detection of the tissue of origin on the leave-out set is 70, 72, 74, 76, 78, 80, 82 or 84% up to 72, 74, 76, 78, 80, 82, 84 or 85%. In an example, performance for detection of the tissue of origin on the leave-out set is 77.2%. Beneficially, high performance provides accurate determination of tissue of origin of cancer.
Referring to FIG. 2, illustrated is a schematic illustration of a process flow 200 for detecting a cancer through cell-free DNA (cfDNA) methylation patterns in a subject, in accordance with an embodiment of the present disclosure. As shown, the blood sample is collected and processed to extract cfDNA therefrom. Optionally, the extracted cfDNA may be directly subjected to assessment for determining cancer in the subject (not shown). However, in general and as a part of the steps of the method disclosed in the present disclosure, the extracted cfDNA is subjected to methyl sequencing protocols, such as a conversion-based sequencing protocol (selected from at least one of: the enzymatic methyl sequencing (EM-Seq) protocol, the bisulfite sequencing protocol) or a non-conversion-based sequencing protocol. The library generated by methyl sequencing (methSeq), namely the methSeq library, is used to generate an NGS library. The NGS library is generally subjected directly to a hybridization capture protocol for capturing molecules of interest. However, as shown by the darker filled boxes, which depict the novel steps of the disclosed method, the method comprises first analysing the NGS library for methylation conversion (referred to as Coarse Conversion Estimation in FIG. 2) and then subjecting it to the hybridization capture protocol. It may be appreciated that only the sequences that pass the conversion threshold are subjected to the hybridization capture protocol, while the sequences that fail to pass the conversion threshold are reverted to the methyl sequencing protocols. Subsequently, the captured molecules of interest are subjected to standard next generation sequencing (NGS) and further standard analysis (namely, primary and secondary analysis) thereof.
As shown, the darker filled boxes depict the novel steps of the disclosed method. Specifically, the method comprises applying a conversion filter, Meth score computation and pre-trained models to the sequences that have undergone the standard analysis, to detect cancer and the tissue of origin of such cancer in the subject.
Optionally, the method comprises determining the top two tissues of origin of cancer in the subject, once the presence of cancer is detected therein.
The present disclosure also relates to the diagnostic device as described above. Various embodiments and variants disclosed above, with respect to the aforementioned method, apply mutatis mutandis to the diagnostic device.
Referring to FIG. 3, illustrated is a diagnostic device 300 for detecting a cancer through cell-free DNA (cfDNA) methylation patterns in a subject, in accordance with an embodiment of the present disclosure. As shown, the diagnostic device 300 comprises a reader unit 302 for reading and analysing results from said cell-free DNA (cfDNA) methylation patterns. Moreover, the diagnostic device 300 further comprises a memory module 304 for storing data from said analysed results from said cell-free DNA (cfDNA) methylation patterns.
Herein, the term "diagnostic device" 300 refers to a specialized tool designed to identify cancer by analyzing epigenetic changes (specifically, methylation patterns) found in cfDNA, which circulates freely in the bloodstream. The diagnostic device 300 integrates multiple components to enable the collection, analysis, and storage of data from a subject's blood sample, offering a potentially non-invasive and rapid method for cancer detection.
Herein, the term "reader unit" 302 refers to a processor configured for reading and analyzing the methylation patterns of cfDNA extracted from the subject’s blood sample, to distinguish between normal and abnormal (cancer-related) methylation patterns. Optionally, the reader unit employs various bioinformatics algorithms to align the cfDNA sequence and determine the methylation status of CpG sites across the genome. Herein, the reader unit is a collection of large instruments for doing cfDNA extraction, library preparation and sequencing, coupled with all the bioinformatics programs.
The term "processor" refers to a computational arrangement that is operable to execute instructions, such as instructions related to processing of the cfDNA sample, analyses of methylation patterns associated with cancer, and so on. Examples of the processor include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) microprocessor, a reduced instruction set (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a Field Programmable Gate Array (FPGA), or any other type of processing circuit. Furthermore, the processor may refer to one or more individual servers, processing devices and various elements associated with a processing device that may be shared by other processing devices. Additionally, one or more individual processors, processing devices and elements are arranged in various architectures for responding to and processing the instructions that execute the aforementioned steps of the method of the present disclosure.
Herein, the term "memory module" 304 refers to a storage medium integrated into the diagnostic device or externally associated with the diagnostic device, to store data from the analyzed cfDNA methylation patterns. The memory module 304 is configured to save raw data, comprising the cfDNA sequence reads, methSeq reads, a plurality of algorithms, patient history, and so on, and processed results from the methylation analysis, comprising a plurality of scores, the methylation patterns, genomic alignment data, and so on. Moreover, the memory module 304 allows for retrieval of previous test results for longitudinal monitoring or comparison over time. Optionally, the reader unit 302 is communicably coupled with the memory module 304 through various means such as network connections, application programming interfaces (APIs), direct data access methods and the like.
Optionally, the diagnostic device 300 further comprises a point-of-care device 306, wherein the point-of-care device 306 comprises:
a device for collecting blood sample from the subject;
reagents required for extraction of plasma and cfDNA from the collected blood sample;
reagents required for methSeq read preparation and PCR amplification of the extracted cfDNA; and
at least one cfDNA methylation pattern panel for validating the expression profiles of the cfDNA samples against the expression profiles in the reference genome.
In this regard, the point-of-care device 306 further comprises: a syringe for obtaining the blood sample from the subject, and a storage unit for storing the obtained blood sample.
In another embodiment, the point-of-care device 306 further comprises instructions for performing necessary steps for determining levels and/or differential levels of the methylation patterns, for example, in the blood sample obtained from the subject.
Optionally, the reader unit 302 is communicably coupled to the panel 306, such that the reader unit 302 may be responsible for performing various operations (such as analysis, and the like) on the panel data. Moreover, the panel 306 is communicably coupled to the memory module 304, so that the panel data can be stored on the memory module 304.
The present disclosure also relates to the non-transitory computer-readable storage media as described above. Various embodiments and variants disclosed above, with respect to the aforementioned method and the aforementioned diagnostic device, apply mutatis mutandis to the non-transitory computer-readable storage media.
The non-transitory computer-readable storage media have computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor comprising processing hardware to execute the aforementioned method.
Referring to FIG. 4, illustrated is a graphical representation depicting cross-validation ROC, showing a very high true positive rate at low false positive rate for both training and testing dataset.
Referring to FIGs. 5A, 5B and 5C, illustrated are graphical representations depicting sensitivity analysis at targets of 91%, 95% and 98% cross-validation specificity, respectively, for a general cancer model and a female cancer model. The whiskers show the 95% confidence interval of scores across multiple models, which is relatively tight across controls, benign samples, and different stages.
Referring to FIG. 6, illustrated are different conversion and non-conversion protocols for detecting at least one cancer through cell-free DNA (cfDNA) methylation patterns in a subject. As shown, Panel A depicts the process flow of the steps of the method according to an embodiment of the present disclosure. Panel B depicts the creation of an NGS Library. Panel C depicts a conversion-based protocol for obtaining methSeq reads. As shown, the conversion-based protocol may be selected from bisulfite conversion (BS-Seq) and enzymatic conversion (EM-Seq). As shown in red outlines, according to an embodiment of the present disclosure, the disclosed method in a preferred implementation employs EM-Seq for generating methSeq reads. Panel D depicts a non-conversion-based protocol for obtaining methSeq reads. Herein, the green shaded elements (in the Panels A and C) are the novel steps or components of the present disclosure.
Modifications to embodiments of the invention described in the foregoing are possible without departing from the scope of the invention as defined by the accompanying claims. Expressions such as “including”, “comprising”, “incorporating”, “consisting of”, “have”, “is” used to describe and claim the present invention are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. Numerals included within parentheses in the accompanying claims are intended to assist understanding of the claims and should not be construed in any way to limit subject matter claimed by these claims.
EXPERIMENTAL PART
Study Design: Four different cohorts were recruited in this multi-centric study involving various hospital centres in India (CTRI No: CTRI/2022/05/042936), after appropriate Ethics Committee approvals at each site. The inclusion criteria for cohorts 1, 2 and 3 included treatment-naïve patients of age 18 or above a) without any previous onco-treatment or surgery, b) with suspected or confirmed benign, precancerous or malignant neoplasms and/or lesions, and c) eligible for surgical resection, as identified by the site Principal Investigator or Co-Investigators through routine clinical care processes/screening processes. Recurrent or relapsed cases were excluded. For cohort 4, subjects of age 50 or above without any symptoms of benign, precancerous or malignant neoplasms and/or lesions, and without any history of cancer or precancerous or benign lesions were recruited by the site Principal Investigator or Co-Investigators through routine clinical care processes/screening processes. Subjects with the habit of chewing and/or smoking tobacco, subjects with habitual moderate alcohol consumption, and subjects without any of these habits were recruited separately.
A total of 3406 subjects with benign, precancerous or malignant neoplasms and/or lesions with primaries in the Head and Neck, Breast, Stomach, Esophagus, Ovary, Cervix, Lung, Colorectal, Pancreas, Liver, and Gallbladder were recruited from 34 sites. The total cases recruited for each cohort, as of the date of writing, were: cohort 1: 12 cases, cohort 2: 1 case, and cohort 3: 3394 cases. A total of 1177 subjects were recruited from 3 sites for cohort 4, as of the date of writing. In all cohorts, cases of uncontrolled infection were excluded, as were pregnant or lactating women. Subjects who were coagulopathic and taking blood thinning products were included based on the site PI’s assessment of safety for a blood draw.
Sample Collection: Whole blood was collected in 2x10 mL plasma preparation tubes with anticoagulants (Streck Tubes) for cohorts 1, 2 and 3 (cancer cases) and for cohort 4 (control cases) and gently mixed. Matched FFPE blocks (treatment naive) from malignant and benign lesions were collected wherever available. The collected samples were shipped at room temperature to reach the Strand reference laboratory in Bangalore within 72 hours.
Plasma extraction: Streck tubes with the collected blood were centrifuged at 1900xg for 22 minutes at 4°C. The supernatant was transferred carefully to a 15 ml tube and centrifuged a second time at 3200xg for 20 minutes at 4°C. The resultant plasma was extracted manually and stored at -80°C in 4 ml aliquots. The separated buffy coat was preserved in RNALater and stored at -80°C.
cfDNA extraction: cfDNA was extracted from 4 ml of plasma using the Apostle MiniMax High Efficiency cfDNA isolation kit (Cat#: A17622-384) following manufacturer’s instructions. The extracted cfDNA was quantified using a Qubit High Sensitivity dsDNA kit and the size distribution was assessed on a TapeStation 4200 using a High Sensitivity DNA1000 tape. A typical cfDNA profile on the TapeStation shows a peak in the 170 bp range, and another significant peak in the 500-700 bp range.
Library Preparation: MethylSeq (methSeq) libraries were prepared from the extracted cfDNA using the NEB Enzymatic Methyl-Seq kit following manufacturer’s instructions. Briefly, 10-20 ng of sample cfDNA was end repaired, A-tailed, and ligated to adapters. TET2 and T4-BGT were used to protect 5mC and 5hmC from deamination. This was followed by a clean-up step with the purification beads provided by the manufacturer, and then by a safe stop point at -20°C that extended for at least 48 hours (this step seems critical in ensuring high conversion rates). The oxidised adapter-ligated DNA was denatured into single-stranded fragments using 0.1 N sodium hydroxide. APOBEC3A was used to convert unprotected cytosines to uracils. This was followed by PCR amplification, with the amount of input DNA determining the number of PCR cycles (10-15 ng input: 11 cycles; 15-20 ng input: 10 cycles). Each library was also barcoded with unique indices in this process. The libraries were quantified using the Qubit High Sensitivity dsDNA kit; the acceptable library yield with 10-11 cycles of PCR is 400 ng or more. Size evaluation was carried out by running the libraries on the TapeStation 4200 using the DNA1000 tape. A typical cfDNA library showed a significant peak around 350 bp, and another significant peak at around 500 bp.
A quantitative PCR method was used to estimate the efficiency of conversion of unmethylated cytosine to uracil. This method used primers that amplify selected regions of the highly repetitive Alu sequences in the human genome. The rationale behind this is that if conversion was 100% efficient, then all the unmethylated cytosine bases will be converted to thymine, including the regions complementary to the Alu primers. Primer binding is then compromised, and such regions would need many PCR cycles to be amplified. On the contrary, if conversion is not complete, many regions would still resemble the original primer binding regions, and they would be amplified within the first few cycles of PCR. Libraries that took at least 21 cycles for amplification were considered to have undergone complete conversion, and were taken ahead for hybridisation with target probes.
Target Capture: Hybridisation capture of the target regions was performed overnight with the Twist Human Methylome panel probes comprising 550,000 target regions with 123 million target bases covering 3.98 million CpGs, by pooling 187.5 ng of eight MethylSeq libraries with distinct indices. The captured regions were amplified using 7 cycles of PCR, generating a post-capture library. This library was quantified using the Qubit High Sensitivity dsDNA kit, and size estimation was carried out by running it on the TapeStation 4200 using the DNA1000 tape.
Sequencing: Libraries were sequenced at 2x150 bp reads either on the Illumina NovaSeq 6000 (S4 flowcells) or the Illumina NovaSeq X Plus (10B and 25B flowcells).
Demultiplexing: Demultiplexing was performed in Illumina BaseSpace platform to separate reads with distinct barcodes and create FASTQ files for individual samples. This was done either on the sequencing instrument itself, or in instances where that failed for technical reasons, off the instrument using bcl-convert. Once done, the amount of data generated and the number of reads generated were compared with what was expected according to the sample sheet and samples were characterized into 7 groups accordingly: PASS, PASS-MORE, PASS-EXCESS, SLIGHTLY-LESS, LESS, BORDERLINE.
Read Alignment and QC: The FASTQ files were then synchronized to Amazon S3 buckets in directories named as per the sequencer run name. The FrAnaTk pipeline comprising the various steps below was launched on these files in multiple EC2 instances, one per sample, with c6a.12xlarge as the configuration. The same Amazon Machine Image (AMI) was used for all the EC2 instances. Instances were automatically stopped when the respective run completed, and the output files were archived to the Amazon S3 bucket. Subsequently, necessary files were copied to the Hetzner cloud and used for downstream processing. The pipeline outputs comprised:
BAM and bed of QC pass fragments (pair of reads aligned, uniquely mapped to the genome with mapping quality >=30 form a QC pass fragment);
Fragment size distribution, end motifs as TSVs;
Methylation at per CpG across the target footprint in bedGraph format;
TSV files with QC metrics on reads sequenced, aligned, uniquely mapped, coverage of target regions etc.; and
Log files documenting time for alignment, fragment and methylation feature extraction as text files.
FrAnaTk used Snakemake, version 7.17.1 (https://snakemake.readthedocs.io/en/stable/) for workflow management, BbDuk, version 38.96 (https://github.com/BioInfoTools/BBMap) with specific parameters (k=19 mink=5 hdist=1 hdist2=0 ktrim=r qtrim=r minlength=36 trimq=14) to remove contaminating adapter and trim low quality regions, and then FastQC to generate distributions for per base sequence quality, per read mean quality, per base nucleotide content, per read GC content, per base N content, per read length, read duplication levels, per base adapter content, and per base k-mer content. This was followed by BWAMeth, version 0.2.5, (https://github.com/brentp/bwa-meth) to align reads resulting from APOBEC deamination of unmethylated cytosines (this was done using two versions of the reference genome, one with all the Cs intact and one obtained by converting all Cs in the reference to Ts). Samblaster, version 0.1.26, (https://github.com/GregoryFaust/samblaster) was used to mark and remove duplicate reads and Samtools, version 1.14, (https://github.com/samtools/samtools) with parameters (-f 3 -F 3852 -G 48 --incl-flags 48 that retains only reads that are paired and align in a proper pair, and excludes reads that are unaligned, or whose mates are unaligned, or their alignment is part of secondary dataset, or if they don’t pass qc criteria, or if they are PCR or optical duplicates, or if they are part of a supplementary alignment) was then used to convert the SAM files generated to compressed BAM files. These were processed by the methylation extractor MethylDackel, version 0.6.1, (https://github.com/dpryan79/MethylDackel) to generate bedGraph files describing the number of reads with Cs and the number of reads with Ts for every CpG, CHG and CHH. MethylDackel was run in two parts, first, the basic bedGraphs were generated using MethylDackel extract and then MethylDackel mergecontext was used to add the proper start and end positions for each CpG, CHG, or CHH. 
The BAM files were also used to generate fragment bed files using the Bedtools, version 2.30.0, (https://github.com/arq5x/bedtools2) commands bamToBed and bedpe, followed by post processing with GNU awk, where the method was sourced from a published fragmentomics database called finaledb (https://academic.oup.com/bioinformatics/article/37/16/2502/6015114). The data from all the above outputs were then used to generate various QC stats with a combination of commands from Samtools, R (version 4.1.12), Bedtools, Picard, (version 2.18.29, https://github.com/broadinstitute/picard) and Python. In addition, a customized version of Wgbs-tools (https://github.com/nloyfer/wgbs_tools) was used to generate a PATR file from each BAM file (WGBS tools generates CpG clustered PAT files by default), where a PATR file merges both mates of a read pair into a single fragment, drops fragments that don’t overlap any CpG sites, and lists out the remaining fragments line by line; each line represents a fragment and indicates for each overlapping CpG whether the fragment has a C or a T. While generating PATR files, reads were selected using the following filters: SAM flag filters (-f 3 -F 3852, i.e., include properly mapped reads, exclude unmapped reads, secondary alignments, PCR duplicates), and a mapping quality filter that includes only reads with mapping quality >= 30.
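The SAM flag filters quoted above (-f 3 -F 3852) can be unpacked with a short Python sketch. The bit meanings follow the SAM format specification; the helper names decode and passes_flag_filter are hypothetical and only illustrate how the two masks act on a read's FLAG value.

```python
# FLAG bits as defined in the SAM format specification.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode(mask):
    """List the names of all FLAG bits set in the mask."""
    return [name for bit, name in SAM_FLAGS.items() if mask & bit]

def passes_flag_filter(flag, require=3, exclude=3852):
    """Mimic samtools view -f/-F: all 'require' bits set, no 'exclude' bit set."""
    return (flag & require) == require and (flag & exclude) == 0

required_bits = decode(3)      # bits a retained read must carry
excluded_bits = decode(3852)   # bits that cause a read to be dropped
```

For example, a read with FLAG 99 (paired, proper pair, mate on reverse strand, first in pair) is retained, whereas the same read marked as a duplicate (FLAG 1123) is dropped.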
The conversion efficiency of each read was also calculated using a novel approach that is agnostic to read orientation. Only reads with conversion efficiency > 0.95 were selected for all of the calculations. The bedGraph and PATR files were then used to generate a number of scores for each sample. From the bedGraph file, the percentage of conversion (pConvNonCpG) was calculated by dividing the total number of reads with Ts, added up across all CHG and CHH sites, by the total number of reads added up across all CHG and CHH sites. In addition, another method for estimating the percentage of conversion (pConvCpGLambda) was used. The average coverage (cov.avg) was calculated using the hsmetrics output file, which lists the different coverage values and the number of good-quality bases corresponding to each coverage value.
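The pConvNonCpG calculation described above can be sketched in Python as follows. The sketch assumes that the per-site counts of reads showing C and reads showing T have already been parsed from the CHG and CHH bedGraph files into (C count, T count) pairs; the function name and the input layout are hypothetical.

```python
def p_conv_non_cpg(chg_sites, chh_sites):
    """Fraction of reads showing T, pooled over all CHG and CHH sites.

    Each site is a (reads_with_C, reads_with_T) pair.
    """
    converted = 0
    total = 0
    for n_c, n_t in list(chg_sites) + list(chh_sites):
        converted += n_t
        total += n_c + n_t
    return converted / total if total else float("nan")

# Hypothetical counts: two CHG sites and one CHH site.
chg = [(1, 99), (0, 50)]
chh = [(2, 48)]
p_conv = p_conv_non_cpg(chg, chh)  # 197 converted reads out of 200
```

Because cytosines outside CpG context are expected to be almost entirely unmethylated in human DNA, the pooled fraction of T calls at such sites serves as an estimate of how completely unmethylated cytosines were converted.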
Score generation: A number of different scores were generated from the bedgraph and pat/patr files. These scores were broadly classified into three categories: CpG-wise, readwise, and global. For the first two, the 550,000 regions in the panel manifest were considered and a score was computed for each such region, for each sample. For the last, only a single score was generated for the entire sample.
CpG-wise Methylation Scores: For each CpG, a methylation fraction was computed from the bedGraph files as the ratio of reads showing C at that specific CpG location and the total number of reads overlapping that CpG. Then various summary statistics were used to summarize the methylation fractions of the various CpGs in each region into region level scores. These included the mean (MCM) and the 10th percentile, 25th percentile, 50th percentile, 75th percentile and 90th percentile (q10CM, q25CM, q50CM, q75CM, q90CM, respectively).
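A minimal Python sketch of these region-level summaries is given below. It assumes the per-CpG counts of reads showing C and T for one region are already available, and it uses linear interpolation between order statistics for the percentiles; the interpolation rule is an assumption, since the text does not specify one, and the function names are hypothetical.

```python
from statistics import mean

def percentile(sorted_vals, q):
    """Percentile by linear interpolation between neighbouring order statistics."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def region_cpg_scores(cpg_counts):
    """Summarise per-CpG methylation fractions of one region.

    cpg_counts holds one (reads_with_C, reads_with_T) pair per CpG;
    CpGs without any coverage are skipped.
    """
    fracs = sorted(c / (c + t) for c, t in cpg_counts if c + t > 0)
    if not fracs:
        return {}
    scores = {"MCM": mean(fracs)}
    for q in (10, 25, 50, 75, 90):
        scores[f"q{q}CM"] = percentile(fracs, q)
    return scores

# Hypothetical region with three covered CpGs and one uncovered CpG.
scores = region_cpg_scores([(0, 10), (5, 5), (10, 0), (0, 0)])
```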
Fragment-wise Methylation Scores: Two distinct categories of scores were computed for each region from the PATR files. Herein, in the first score, the numerator (N) referred to reads with > 3 CpGs and at least 1 CpG in the region of interest (ROI), where >80% CpGs are methylated; and the denominator (D) referred to reads with at least 1 CpG present in the ROI. First, the fraction of fragments overlapping that region, with at least one CpG within the region, at least 3 CpGs on the whole, and with extreme methylation fractions (>=80% of all CpGs methylated, or less than 20% of all CpGs methylated, respectively) was computed. The highly methylated fragment fraction was referred to as "HMF" and the poorly methylated fragment fraction was referred to as "LMF". The denominator above comprises only reads with at least one CpG within the region and at least 3 CpGs on the whole, which are referred to as "useful fragments". Second, for every fragment overlapping that region, a block score was computed, as illustrated by the following example: if the CpG pattern of the fragment is CCTTTCCCTC, then the block score is (2^2 + 3^2 + 1^2) / (10^2) = 0.14 for the methylated runs (methylated fragment block score, MFB) or (3^2 + 1^2) / (10^2) = 0.10 for the unmethylated runs (unmethylated fragment block score, UFB).
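The fragment-wise scores can be sketched in Python using the CCTTTCCCTC example from the text. The sketch assumes each fragment is given as a string over C (methylated CpG) and T (unmethylated CpG), that the fragments passed in already have at least one CpG inside the region, and that the cut-offs are at least 3 CpGs, >= 80% methylated for HMF and < 20% methylated for LMF; the function names are hypothetical.

```python
from itertools import groupby

def block_scores(pattern):
    """Return (MFB, UFB): sums of squared run lengths of C and of T,
    each divided by the squared number of CpGs in the fragment."""
    n = len(pattern)
    runs = [(base, len(list(run))) for base, run in groupby(pattern)]
    mfb = sum(length ** 2 for base, length in runs if base == "C") / n ** 2
    ufb = sum(length ** 2 for base, length in runs if base == "T") / n ** 2
    return mfb, ufb

def hmf_lmf(patterns, min_cpgs=3, high=0.8, low=0.2):
    """Fractions of useful fragments that are highly / poorly methylated."""
    useful = [p for p in patterns if len(p) >= min_cpgs]
    if not useful:
        return float("nan"), float("nan")
    fracs = [p.count("C") / len(p) for p in useful]
    hmf = sum(f >= high for f in fracs) / len(useful)
    lmf = sum(f < low for f in fracs) / len(useful)
    return hmf, lmf

mfb, ufb = block_scores("CCTTTCCCTC")
hmf, lmf = hmf_lmf(["CCCC", "CCCCT", "TTTT", "CT", "CTCT"])
```

The block scores reward long uninterrupted runs: a fragment with the same number of methylated CpGs scattered singly would receive a lower MFB than one in which they are contiguous.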
Fragmentomic Scores: From the BAM file, reads were filtered using samtools (-f 3 -F 3852 i.e, include properly mapped reads, exclude unmapped reads, secondary alignments, PCR duplicates). Both mates of a read pair were merged into a single fragment and only fragments with mapping quality >= 30 were considered. For every region, the proportion of fragments overlapping that region (at least 50% of the fragment should overlap the region) and with length <= 120 bp was determined as the short fragmentomic fraction (SFF). Next, for every region and for every CpG in that region, the ratio of fragments starting at the C in the CpG to the total number of fragments overlapping that CpG was determined as the cleavage ratio and the average cleavage ratio across all CpGs in the region was determined as the cleavage fragmentomic fraction (ClFF). Finally, the fraction of fragments overlapping the region and aligned to a C at the 5’ end was determined as the C fragmentomic fraction (CFF).
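The short fragmentomic fraction (SFF) can be illustrated with the following Python sketch. It assumes fragments and regions are given as half-open (start, end) coordinate pairs on the same chromosome after the flag and mapping quality filters have been applied; the function name and the example coordinates are hypothetical.

```python
def short_fragment_fraction(fragments, region, max_len=120, min_overlap=0.5):
    """Fraction of fragments of length <= max_len among fragments that have
    at least min_overlap of their own length inside the region."""
    r_start, r_end = region
    n_total = 0
    n_short = 0
    for f_start, f_end in fragments:
        length = f_end - f_start
        overlap = min(f_end, r_end) - max(f_start, r_start)
        if length <= 0 or overlap < min_overlap * length:
            continue  # fragment does not sufficiently overlap the region
        n_total += 1
        if length <= max_len:
            n_short += 1
    return n_short / n_total if n_total else float("nan")

region = (1000, 1200)
fragments = [
    (950, 1060),   # 110 bp, 60 bp inside the region: counted, short
    (1100, 1270),  # 170 bp, 100 bp inside the region: counted, long
    (800, 1010),   # 210 bp, only 10 bp inside the region: ignored
    (1050, 1150),  # 100 bp, fully inside the region: counted, short
]
sff = short_fragment_fraction(fragments, region)
```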
Global Fragmentomic Scores: From the BAM files, the same filtering as for the above fragmentomic scores was applied. Then, for every region, two sets of 256 scores were computed, one for each of the 256 possible 4-mers (AAAA, AAAC, …, TTTT). The first set of scores, called endpoint scores (E-AAAA etc), were computed by taking the fraction of fragments overlapping the region and with the 5' end containing the given 4-mer in the reference sequence. The second set of scores, called breakpoint scores (B-AAAA etc), were computed by taking the fraction of fragments overlapping the region and with 5’ end aligning exactly in the middle of the given 4-mer in the reference sequence. Finally, a mitochondrial to nuclear ratio score (MNR) was computed as the ratio of fragments aligning to chrM vs chr1-22.
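The 4-mer fractions and the mitochondrial to nuclear ratio can be sketched in Python as below. The sketch assumes that the reference 4-mer at each fragment end has already been looked up and that per-chromosome fragment counts are available; motifs containing bases other than A, C, G and T are ignored, which is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def kmer_fractions(motifs, k=4):
    """Fraction of fragments carrying each of the 4**k possible k-mers."""
    valid = set("ACGT")
    counts = Counter(m for m in motifs if len(m) == k and set(m) <= valid)
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}

def mito_nuclear_ratio(fragments_per_chrom):
    """Ratio of fragments on chrM to fragments on chr1-22."""
    autosomes = {f"chr{i}" for i in range(1, 23)}
    nuclear = sum(n for c, n in fragments_per_chrom.items() if c in autosomes)
    if nuclear == 0:
        return float("nan")
    return fragments_per_chrom.get("chrM", 0) / nuclear

# Hypothetical 5' end motifs and per-chromosome fragment counts.
endpoint = kmer_fractions(["CCCA", "CCCA", "AAAA", "NNNN"])
mnr = mito_nuclear_ratio({"chr1": 900, "chr2": 100, "chrX": 50, "chrM": 5})
```

The same kmer_fractions helper would serve for both endpoint and breakpoint scores; only the way the reference 4-mer is positioned relative to the fragment end differs.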
The various scores outlined above are summarized again in Table 1 below.
Score Category | Score | Region-wise or Global
CpG-wise | MCM, q10CM, q25CM etc. | Region-wise
Fragment-wise | HMF, LMF, MFB, UFB | Region-wise
Fragmentomic | SFF, ClFF, CFF | Region-wise
Global Fragmentomic | 5’EP, 3’EP and 5’BP, 3’BP | Global
Mitochondrial to Nuclear Ratio | MNR | Global
Table 1
Public Data and Literature Analysis for Differentially Methylated Regions:
TCGA/EWAS/GEO: Data generated using the Illumina HumanMethylation 450K BeadChip, which covers approximately 485,547 CpG sites (C followed by G) across the entire human genome, was compiled from TCGA, EWAS and GEO. TCGA has cancer tissue and adjacent normal tissue data for a variety of cancers, all generated using a standardized process. Datasets in GEO are generated using a variety of treatments and normalization methods, making this data harder to use. EWAS compiles a subset of data from TCGA and GEO and processes them using a standardized pipeline. Further, it also provides additional data on healthy tissues and on blood. Data on 3966 tumor tissues and 333 adjacent normals from TCGA, 1414 normal tissues from EWAS, 600 blood samples from EWAS and 600 blood samples from GEO was compiled. The breakdown of these samples is shown in Table 2 below.
Cancer Type | TCGA Cancer | EWAS Cancer | TCGA Healthy | EWAS Healthy
BRCA | 793 | 939 | 97 | 301
CESC | 307 | 225 | 3 | 0
COAD | 312 | 229 | 38 | 239
ESCA | 185 | 296 | 16 | 126
HNSC | 528 | 471 | 50 | 0
LUAD | 473 | 339 | 32 | 301
LUSC | 370 | 323 | 42 | 301
OV | 10 | 0 | 0 | 0
READ | 98 | 0 | 7 | 0
STAD | 395 | 482 | 2 | 146
UCEC | 438 | 263 | 46 | 0
UCS | 57 | 117 | 0 | 0
Table 2
Regions that were differentially methylated in cancer tissues relative to both adjacent normal and blood were identified and mapped to the Twist Human Methylome panel manifest. The ancestry of these datasets was mixed but expected to be primarily Caucasian along with Hispanic and African contributions.
Singlera Analysis: 595 differentially methylated regions identified by (Chen et al. 2020) from an analysis of 200 primary tumors and 200 normal tissues in China were obtained from the supplementary material and mapped to the Twist Human Methylome panel manifest.
EpiPanGI Analysis: Based on analysis of gastric cancer types from TCGA, a broad pool of 67k regions was targeted to hybrid capture data from 46 controls and 256 cancer samples (40 CRC), which led to the selection of 10k priority marker regions, of which 5.6k represented CRC markers (ref).
Literature Analysis: 400 publications were curated for probes, genes, and specific CpGs reported as hyper- or hypo-methylated between a variety of cancers and normals. The initial strategy searched PubMed using specific keywords (methylation, cancer, cfDNA, early diagnosis), followed by screening of the search results, which led to curation of about 100 PMIDs. A few cancers, like colorectal cancer, lung cancer and breast cancer, were overrepresented in these, whereas other cancers highly prevalent in the Indian population, like HNSCC, cervical cancer, oesophageal cancer, gastric cancer, etc., were underrepresented. A revised search strategy with less stringent keyword searches (methylation + cancer-specific search for the 9 cancer types of interest) gave about 600 PMIDs across the 9 cancers of interest, leading to curation of an additional 290 PMIDs, with a total of 391 methylation PMIDs curated. Three approaches were common in these publications: sequencing/array based techniques (WGBS/microarrays), analysis of public datasets like TCGA and GEO, and predetermined targets selected through review of literature (PCR based). In addition, curation of HPV methylation markers for premalignant and early HPV associated cancers was undertaken, and it was observed that L1 methylation levels were associated with advanced lesions and cancer. A similar search strategy was followed for curation of fragmentomic markers. The main objective of the curation of fragmentomic markers was to understand the specific cleavage patterns and cleavage sites of the 3 main enzymes (DNASE1L3, DFFB, DNASE1). This yielded 23 PMIDs.
Panel Region Annotations: Methylated cytosines can be in CpG islands, in shores (0-2kb from islands), in shelves (2-4kb from islands), in the open sea, and in sites surrounding transcription start sites (−200 to −1500 bp from the 5′ untranslated region (UTR) for coding genes). These annotations were derived from Ensembl genome v105, UCSC, RefSeq, and ENCODE.
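As a non-limiting illustration, the island, shore, shelf and open sea categories may be assigned from the distance of a CpG to the nearest CpG island. The sketch below assumes that island coordinates are available as half-open (start, end) intervals on a single chromosome; the function name and the toy coordinates are hypothetical, while the 2kb and 4kb cut-offs follow the definitions given above.

```python
def annotate_cpg(pos, islands):
    """Classify a CpG position relative to the nearest CpG island.

    islands: list of (start, end) half-open intervals on the same chromosome.
    """
    nearest = None
    for start, end in islands:
        if start <= pos < end:
            return "island"
        # Distance in bases to the closest island boundary.
        dist = start - pos if pos < start else pos - (end - 1)
        nearest = dist if nearest is None else min(nearest, dist)
    if nearest is None:
        return "open_sea"  # no island annotated on this chromosome
    if nearest <= 2000:
        return "shore"     # 0-2kb from an island
    if nearest <= 4000:
        return "shelf"     # 2-4kb from an island
    return "open_sea"

islands = [(10000, 11000)]  # toy island coordinates
labels = [annotate_cpg(p, islands) for p in (10500, 9000, 7500, 20000)]
```

For the toy island above, `labels` evaluates to island, shore, shelf and open sea in that order.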
Model Building Features: Model building was performed using the various scores mentioned above. In addition, the Qubit yields and the Tapestation profiles were also considered.
Overall Scheme: A random selection of 20% of the samples in each category were kept aside as the leave-out set. The remaining 80% (the leave-in training set) was used for 4-fold cross-validation. Cross validation was performed 20 times on the leave-in to generate 20x4=80 different models. This procedure did not have any access to the leave out set. Results from all 4 folds were combined to yield one set of predicted scores for each sample in the leave-in training set. A threshold was determined for this set of predicted scores based on the desired specificity, and accordingly each sample was labeled with a predicted label. This was repeated for all 20 sets of scores, and the median sensitivity and 95% confidence intervals over these 20 sets of scores were calculated for the leave-in training set. Then the 80 models were applied on the leave-out set and the model-specific thresholds calculated on the leave-in training set as above were applied to obtain 80 sets of predictions for each sample in the leave-out set. The median sensitivity and specificity as well as 95% confidence intervals over these 80 sets of predictions were then calculated.
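The thresholding step of the above scheme may be illustrated as follows. The sketch assumes that each cross-validation repetition yields one predicted score per sample, that higher scores indicate cancer, and that the threshold is taken as the control score below or at which the desired fraction of controls falls. The function names and the toy scores are hypothetical and the exact thresholding rule used in the disclosure may differ.

```python
import math
from statistics import median

def threshold_for_specificity(control_scores, target_specificity):
    """Smallest control score t such that at least the target fraction of
    controls score <= t. Samples scoring above t are called cancer."""
    ranked = sorted(control_scores)
    # Number of controls that must be called negative (epsilon guards rounding).
    k = math.ceil(target_specificity * len(ranked) - 1e-9)
    k = max(1, min(k, len(ranked)))
    return ranked[k - 1]

def sensitivity(cancer_scores, threshold):
    """Fraction of cancer samples scoring above the threshold."""
    return sum(s > threshold for s in cancer_scores) / len(cancer_scores)

# Toy predicted scores from two hypothetical cross-validation repetitions.
repetitions = [
    {"control": [0.1, 0.2, 0.3, 0.4, 0.9], "cancer": [0.5, 0.8, 0.95, 0.3]},
    {"control": [0.05, 0.1, 0.2, 0.6, 0.7], "cancer": [0.65, 0.9, 0.8, 0.75]},
]
sens = []
for rep in repetitions:
    t = threshold_for_specificity(rep["control"], 0.8)
    sens.append(sensitivity(rep["cancer"], t))
median_sensitivity = median(sens)
```

In the scheme described above the same per-repetition thresholds would then be carried over unchanged to the leave-out set.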
Feature Selection: Each cross-validation round involved feature selection on the training subset of the leave-in training set, followed by model building. Feature selection was performed using the Mann-Whitney U test or the Kolmogorov-Smirnov (KS) test. A certain number of the most significant features were used for model building.
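A simplified sketch of Mann-Whitney based feature ranking is given below. As an assumption made for brevity, features are ranked by the distance of U / (n1 x n2) from 0.5 rather than by a p-value; the two orderings agree only when every feature has the same group sizes and no ties. The function names and the toy region scores are hypothetical.

```python
def mann_whitney_u(x, y):
    """U statistic of sample x versus sample y, counting ties as 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

def rank_features(cancer, control, top_n):
    """Return the top_n feature names, ordered by separation between groups."""
    effects = {}
    for name in cancer:
        u = mann_whitney_u(cancer[name], control[name])
        auc = u / (len(cancer[name]) * len(control[name]))
        effects[name] = abs(auc - 0.5)  # 0 means no separation, 0.5 is perfect
    return sorted(effects, key=effects.get, reverse=True)[:top_n]

# Toy per-region scores for three cancer and three control samples.
cancer = {"regionA": [0.9, 0.8, 0.7], "regionB": [0.5, 0.2, 0.6], "regionC": [0.1, 0.2, 0.65]}
control = {"regionA": [0.1, 0.2, 0.3], "regionB": [0.4, 0.3, 0.55], "regionC": [0.6, 0.7, 0.8]}
selected = rank_features(cancer, control, top_n=2)
```

To avoid leakage, such a ranking would be computed on the training subset of each fold only, as stated above.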
Model Building: XGBoost, a non-parametric method that generates an ensemble of trees, was used for model building. XGBoost handles missing values and also allows for monotonic constraints that may be key for generalization and explainability. Application of these constraints assumes that increasing (or decreasing) methylation in a region contributes to increasing association with cancer presence, and disallows models that might for instance infer low and high methylation values as cancer and intermediate values as normal, or even worse, split the range of methylation values into several intervals alternately inferred as cancer and normal. Model building was conducted with default parameters and no hyperparameter tuning was performed.
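For illustration, monotonic constraints are supplied to XGBoost through its `monotone_constraints` parameter, which accepts a string such as "(1,-1,0)" with one entry per feature. The helper below merely assembles that string; the helper name, the feature names and the chosen directions are hypothetical and are not taken from the disclosure.

```python
def monotone_constraints(feature_names, directions):
    """Build the string expected by XGBoost's `monotone_constraints` parameter.

    directions maps a feature name to +1 (prediction may only increase with
    the feature), -1 (may only decrease) or 0 (unconstrained, the default).
    """
    values = [directions.get(name, 0) for name in feature_names]
    if any(v not in (-1, 0, 1) for v in values):
        raise ValueError("directions must be -1, 0 or +1")
    return "(" + ",".join(str(v) for v in values) + ")"

features = ["HMF_region1", "LMF_region2", "MNR"]  # hypothetical feature names
params = {
    "monotone_constraints": monotone_constraints(
        features, {"HMF_region1": 1, "LMF_region2": -1}
    ),
}
# With the xgboost package installed, this dictionary could be passed on as
# xgboost.XGBClassifier(**params); all other parameters stay at their defaults.
```

Here `params["monotone_constraints"]` is "(1,-1,0)", that is, the first feature may only raise the predicted cancer score, the second may only lower it, and the third is unconstrained.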
Conversion Analysis: The correlation between the pConvNonCpG and pConvCpG.Lambda measures was analyzed, as was their respective impact on the scores used for machine learning. To this end, the following background on pConvNonCpG calculation is useful.
Conversion Efficiency: Non-CpG C and G locations in the reference genome were used for determining the conversion efficiency at the individual read level. Only reads with verified, high conversion efficiency were selected for downstream analysis. Calculating this efficiency is nontrivial; the steps below explain how it is calculated at the individual read level.
In order to assess conversion of unmethylated cytosines, the following steps were followed:
1. Take a double-stranded fragment f that is sequenced.
2. f yields two reads, say r1 mapping to the reference and r2 mapping to the reverse complement of the reference.
3. Trace f’s PCR lineage back to an original double-stranded molecule X.
4. Let Xs be the strand of X that yields f.
5. Then f carries conversion information for cytosines in Xs but not in Xs’ (Xs’ being the other strand).
6. Two cases arise, assuming Xs is either fully converted or not converted at all:
  a. Xs is in the same direction as the reference:
    i. If the non-CpG Cs in the reference align with Cs in r1 and likewise with Gs in r2, then f is not converted.
    ii. If the non-CpG Cs in the reference align with Ts in r1 and likewise with As in r2, then f is converted.
  b. Xs is in the opposite direction to the reference:
    i. If the non-CpG Gs in the reference align with Gs in r1 and likewise with Cs in r2, then f is not converted.
    ii. If the non-CpG Gs in the reference align with As in r1 and likewise with Ts in r2, then f is converted.
7. To decide which of the above two cases applies for Xs:
  a. Try each of the two cases.
  b. If Xs is fully converted, then f will satisfy either 6aii or 6bii but not both.
  c. If Xs is not converted at all, f will satisfy both 6ai and 6bi.
  d. For 6a, the reference is required to overlap enough non-CpG Cs in r1 U r2 (say >=3).
  e. For 6b, the reference is required to overlap enough non-CpG Gs in r1 U r2 (say >=3).
  f. Drop fragments that do not satisfy both of the above overlap requirements, and then those that do not satisfy one of 6aii or 6bii (allowing up to 1 exception).
8. Note:
  a. The fraction of fragments f that remain is approximately equal to the conversion efficiency calculated locus-wise from the CHH/CHG bedgraph file.
  b. The distribution of such fragments along the genome is not biased relative to the distribution of all fragments.
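The fragment-level decision described in the steps above may be sketched as follows. As an assumption for this example, each read is represented as a list of (reference base, read base) pairs taken at non-CpG C or G positions of the reference, with the bases of r2 expressed as in cases 6a and 6b above. The data representation, the function names and the returned labels are hypothetical.

```python
CASES = {
    # case: (reference base, converted base in r1, converted base in r2)
    "6a": ("C", "T", "A"),
    "6b": ("G", "A", "T"),
}

def case_counts(r1_pairs, r2_pairs, case):
    """Return (informative sites, converted sites) for case 6a or 6b."""
    ref_base, conv_r1, conv_r2 = CASES[case]
    sites = converted = 0
    for pairs, conv in ((r1_pairs, conv_r1), (r2_pairs, conv_r2)):
        for ref, read in pairs:
            if ref == ref_base:
                sites += 1
                converted += read == conv  # True counts as 1
    return sites, converted

def classify_fragment(r1_pairs, r2_pairs, min_sites=3, max_exceptions=1):
    """Return 'insufficient_sites', 'converted' or 'not_converted'."""
    satisfied = {}
    for case in CASES:
        sites, converted = case_counts(r1_pairs, r2_pairs, case)
        if sites < min_sites:
            return "insufficient_sites"  # too few non-CpG Cs or Gs to judge
        satisfied[case] = (sites - converted) <= max_exceptions
    # A fully converted strand satisfies exactly one of 6aii and 6bii.
    return "converted" if satisfied["6a"] != satisfied["6b"] else "not_converted"

# Toy fragment whose strand Xs runs in the same direction as the reference.
r1 = [("C", "T"), ("C", "T"), ("G", "G"), ("G", "G")]
r2 = [("C", "A"), ("G", "C"), ("C", "A")]
status = classify_fragment(r1, r2)
```

Only fragments labelled as converted would be retained for downstream scoring under this sketch; the thresholds of 3 sites and 1 exception follow the values suggested in the steps above.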
Coverage Analysis: The number of fragments in each sample was downsampled and the cross-validation procedure above was repeated to assess comparative cross-validation performance at the original and downsampled coverage levels.
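One simple way to emulate lower coverage is sketched below, assuming uniform random sampling of fragments without replacement. The disclosure does not state the sampling method, so the function name, the sampling scheme and the fraction used are assumptions made for illustration.

```python
import random

def downsample(fragments, fraction, seed=0):
    """Keep a uniformly random subset containing the given fraction of fragments."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    keep = round(len(fragments) * fraction)
    return rng.sample(fragments, keep)

fragments = list(range(100))  # stand-ins for fragment records
subset = downsample(fragments, 0.25, seed=1)
```

The scores would then be recomputed from the reduced fragment set before repeating the cross-validation.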
Confounder Analysis: Model building was also performed using just covariates like age, gender, tobacco usage, and the presence of comorbidities, to gauge whether they were sufficiently predictive in themselves, which would indicate that the model was not learning a true cancer vs control difference.
Missing Value Analysis: Each non-zero score value in the leave-out set was set to 0 independently with a probability of 10% to obtain a new leave-out dataset, and the models from the cross-validation were applied unchanged on this new leave-out dataset to determine sensitivity and specificity.
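The perturbation used in the missing value analysis may be sketched as follows, assuming the scores are held as a list of rows (one row per sample). The function name, the seed and the toy values are hypothetical.

```python
import random

def zero_out(scores, probability=0.10, seed=0):
    """Return a copy of a samples x features matrix in which each non-zero
    value is independently replaced by 0 with the given probability."""
    rng = random.Random(seed)
    return [
        [0.0 if value != 0 and rng.random() < probability else value for value in row]
        for row in scores
    ]

leave_out = [[0.2, 0.0, 0.7], [0.5, 0.9, 0.0]]
perturbed = zero_out(leave_out, probability=0.10, seed=42)
```

Values that are already zero are left untouched, so only the non-zero entries are subject to the 10% dropout.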
Results:
Conversion of Unmethylated Cytosines: The distribution of pConvNonCpG vs pConvCpG.Lambda was used to find the samples with a reasonable conversion efficiency. Samples that satisfied pConvNonCpG>=99% were very likely to have pConvCpG.Lambda>98%, but those with pConvCpG.Lambda>99% could have pConvNonCpG as low as 94%. To address the concern that incomplete conversion yields fragments that give the false impression of being hypermethylated, the conversion status of all the fragments aligning to panel regions with low (<30%) methylation levels in blood was computed by randomly sampling 4000 such regions over 200 samples. 1.1% of the useful fragments aligning to these regions had >80% of their CpGs unconverted; of this 1.1%, only 19.87% had at least 1 non-CpG C unconverted, and only 16.02% had at least 2 non-CpG Cs unconverted. This indicates that a large majority of such fragments are truly methylated, and therefore the fragment-wise HMF scores have only a small contamination due to lack of conversion.
Participant Demographic and Metadata summaries: A subset of the samples from 1016 cancer patients and 376 controls was arbitrarily chosen for sequencing and analysis. After filtering for pConvNonCpG>=99% and cov.avg>=40, 327 controls and 818 cancers remained. These samples varied in terms of age, gender, BMI, tobacco chewing/smoking status, alcohol consumption status, and presence of comorbidities like hypertension and diabetes, as well as in the anatomical site distribution of the tumors.
Assay Measures: 17 of the 818 cancer samples and 13 of the 327 control samples had a cfDNA yield of less than 10ng as measured by Qubit, the minimum being 4.7ng and the maximum 5896ng. The stagewise means of the Qubit cfDNA yield and of the Tapestation cfDNA yield for fragments in the 100-220bp range both seemed to increase broadly with cancer stage, as would be expected under the hypothesis that later stage cancers shed more cfDNA. In contrast, and as expected, the mean cfDNA Avg Size for fragments in the 100-220bp range decreased with cancer stage, consistent with the hypothesis that widespread hypomethylation in cancers leads to reduced cfDNA fragment size.
Read QC: The mean number of reads per sample was ~217M. The stagewise mean number of reads varied from ~209M to ~221M. On average, 76.7% of the reads aligned with the target regions, allowing at least one base overlap. Further, on average, 56.8% of the reads passed the SAM flag and mapping quality filters and had at least one CpG overlap with the target regions. CpG-Wise, Fragmentomic, Global Fragmentomic, and Mitochondrial to Nuclear Ratio scores were derived from these reads. On average, 29.7% of the reads were useful, that is, they had at least one CpG overlap with the target regions and at least 3 CpGs overall; fragment-wise scores were derived from these "useful fragments". On average, 97.85% of the target bases were covered by 10 or more reads, and 92% were covered by 25 or more reads, considering only those reads that passed the SAM flag and mapping quality filters. The average coverage of target bases (cov.avg), considering only those reads that passed the SAM flag and mapping quality filters, varied from ~40 to ~169 for the various stages, the overall average being ~69 reads.
Score Distributions: The region-wise mean MCM score distribution on control samples was bimodal, concentrated at the two extreme ends, with a few regions showing intermediate methylation fractions. The sample-wise mean MCM score distribution indicated that coarse methylation levels remain the same across different stages. The distribution of the HMF scores in control samples in regions with mean MCM score < 0.3 (40.28% of the HMF score values are less than 0.02) and the distribution of the LMF scores in control samples in regions with mean MCM score > 0.7 (48.15% of the LMF score values are less than 0.02) both had a mode at 0, but the latter had another mode at ~2.5%, indicating that highly methylated regions also carried poorly methylated reads derived from a distribution with mode ~2.5%, whereas poorly methylated regions only carried well methylated reads derived from a distribution with mode 0%.
Leave-in & Leave-out Set Compositions: The leave-in training set comprised 261 controls + 532 cancers (Benign: 43, I: 102, II: 125, III: 133, IV: 129). The leave-out set comprised 65 controls + 156 cancers (Benign: 15, I: 29, II: 37, III: 37, IV: 38).
Model Performance on Leave-Out Set: The performance on the leave-out set of the 80 models built on the leave-in dataset was evaluated with 95% / 98% specificity as target. Different models were built to serve different purposes, namely a control vs individual cancer model, a control vs grouped cancer model and a tissue of origin model. The performance for the HMF feature is shown in Tables 3-5 below.
Control vs Individual Cancer | LeaveInPerturbation | 98% LeaveIn specificity
Leave-out sensitivity = 129 / 141 = 91.5% | Leave-out specificity = 56 / 65 = 86.2%
Type Control Benign I II III IV Total Total (No Benign)
control 56 / 65 0 / 0 0 / 0 0 / 0 0 / 0 0 / 0 56 / 65 56 / 65
Colorectal 0 / 0 1 / 2 3 / 5 5 / 6 8 / 9 5 / 6 22 / 28 21 / 26
Esophagus 0 / 0 2 / 2 2 / 2 5 / 5 3 / 3 4 / 4 16 / 16 14 / 14
Gall Bladder 0 / 0 1 / 1 1 / 1 2 / 2 3 / 3 3 / 3 10 / 10 9 / 9
Liver 0 / 0 1 / 1 1 / 1 1 / 1 2 / 2 2 / 2 7 / 7 6 / 6
Lung 0 / 0 1 / 1 3 / 3 3 / 3 3 / 3 4 / 4 14 / 14 13 / 13
Pancreas 0 / 0 1 / 1 2 / 2 3 / 4 2 / 2 3 / 3 11 / 12 10 / 11
Stomach 0 / 0 0 / 0 4 / 4 3 / 4 4 / 4 3 / 4 14 / 16 14 / 16
Breast 0 / 0 2 / 3 3 / 3 3 / 4 3 / 3 4 / 4 15 / 17 13 / 14
Cervix 0 / 0 1 / 2 4 / 4 4 / 4 3 / 4 3 / 3 15 / 17 14 / 15
Ovary 0 / 0 2 / 2 3 / 4 4 / 4 4 / 4 4 / 5 17 / 19 15 / 17
Total 56 / 65 12 / 15 26 / 29 33 / 37 35 / 37 35 / 38 141 / 156 129 / 141
Percent 86.15% 80.00% 89.65% 89.19% 94.59% 92.10% 90.38% 91.50%
Table 3
Control vs Group Cancer | LeaveInPerturbation | 95% LeaveIn specificity
Leave-out sensitivity = 114 / 141 = 80.9% | Leave-out specificity = 63 / 65 = 96.9%
Type Control Benign I II III IV Total Total (No Benign)
control 63 / 65 0 / 0 0 / 0 0 / 0 0 / 0 0 / 0 63 / 65 63 / 65
Colorectal 0 / 0 1 / 2 2 / 5 4 / 6 4 / 9 4 / 6 15 / 28 14 / 26
Esophagus 0 / 0 1 / 2 1 / 2 4 / 5 3 / 3 4 / 4 13 / 16 12 / 14
Gall Bladder 0 / 0 1 / 1 1 / 1 2 / 2 3 / 3 3 / 3 10 / 10 9 / 9
Liver 0 / 0 1 / 1 1 / 1 1 / 1 2 / 2 2 / 2 7 / 7 6 / 6
Lung 0 / 0 1 / 1 3 / 3 3 / 3 3 / 3 4 / 4 14 / 14 13 / 13
Pancreas 0 / 0 0 / 1 2 / 2 3 / 4 2 / 2 3 / 3 10 / 12 10 / 11
Stomach 0 / 0 0 / 0 4 / 4 2 / 4 4 / 4 2 / 4 12 / 16 12 / 16
Breast 0 / 0 2 / 3 3 / 3 4 / 4 2 / 3 4 / 4 15 / 17 13 / 14
Cervix 0 / 0 0 / 2 3 / 4 3 / 4 2 / 4 3 / 3 11 / 17 11 / 15
Ovary 0 / 0 1 / 2 3 / 4 3 / 4 4 / 4 4 / 5 15 / 19 14 / 17
Total 63 / 65 8 / 15 23 / 29 29 / 37 29 / 37 33 / 38 122 / 156 114 / 141
Percent 96.92% 53.33% 79.31% 78.38% 78.38% 86.84% 78.20% 80.85%
Table 4
Tissue of Origin (TOO) | LeaveInPerturbed
Top1 Hits = 66 / 114 = 57.9% | Top2 Hits = 88 / 114 = 77.2% | Top3Hits = 92 / 114 = 80.7% (excluded benigns from TOO evaluation)
Top 1 Type Cancer Positive TOO_I TOO_II TOO_III TOO_IV TOO_Total
Colorectal 14 / 26 (53.8 %) 1 / 2 (50.0 %) 2 / 4 (50.0 %) 4 / 4 (100.0 %) 3 / 4 (75.0 %) 10 / 14 (71.4 %)
Esophagus 12 / 14 (85.7 %) 1 / 1 (100.0 %) 1 / 4 (25.0 %) 1 / 3 (33.3 %) 1 / 4 (25.0 %) 4 / 12 (33.3 %)
GallBladder 9 / 9 (100.0 %) 0 / 1 (0.0 %) 1 / 2 (50.0 %) 3 / 3 (100.0 %) 3 / 3 (100.0 %) 7 / 9 (77.8 %)
Liver 6 / 6 (100.0 %) 1 / 1 (100.0 %) 0 / 1 (0.0 %) 1 / 2 (50.0 %) 1 / 2 (50.0 %) 3 / 6 (50.0 %)
Lung 13 / 13 (100.0 %) 1 / 3 (33.3 %) 1 / 3 (33.3 %) 2 / 3 (66.7 %) 1 / 4 (25.0 %) 5 / 13 (38.5 %)
Pancreas 10 / 11 (90.9 %) 1 / 2 (50.0 %) 3 / 3 (100.0 %) 1 / 2 (50.0 %) 0 / 3 (0.0 %) 5 / 10 (50.0 %)
Stomach 12 / 16 (75.0 %) 2 / 4 (50.0 %) 2 / 2 (100.0 %) 2 / 4 (50.0 %) 2 / 2 (100.0 %) 8 / 12 (66.7 %)
Breast 13 / 14 (92.9 %) 0 / 3 (0.0 %) 2 / 4 (50.0 %) 1 / 2 (50.0 %) 1 / 4 (25.0 %) 4 / 13 (30.8 %)
Cervix 11 / 15 (73.3 %) 3 / 3 (100.0 %) 2 / 3 (66.7 %) 2 / 2 (100.0 %) 2 / 3 (66.7 %) 9 / 11 (81.8 %)
Ovary 14 / 17 (82.4 %) 2 / 3 (66.7 %) 2 / 3 (66.7 %) 4 / 4 (100.0 %) 3 / 4 (75.0 %) 11 / 14 (78.6 %)
Total 114 / 141 (80.9 %) 12 / 23 (52.2 %) 16 / 29 (55.2 %) 21 / 29 (72.4 %) 17 / 33 (51.5 %) 66 / 114 (57.9 %)
Top 2 Type Cancer Positive TOO_I TOO_II TOO_III TOO_IV TOO_Total
Colorectal 14 / 26 (53.8 %) 1 / 2 (50.0 %) 3 / 4 (75.0 %) 4 / 4 (100.0 %) 4 / 4 (100.0 %) 12 / 14 (85.7 %)
Esophagus 12 / 14 (85.7 %) 1 / 1 (100.0 %) 3 / 4 (75.0 %) 2 / 3 (66.7 %) 2 / 4 (50.0 %) 8 / 12 (66.7 %)
GallBladder 9 / 9 (100.0 %) 0 / 1 (0.0 %) 1 / 2 (50.0 %) 3 / 3 (100.0 %) 3 / 3 (100.0 %) 7 / 9 (77.8 %)
Liver 6 / 6 (100.0 %) 1 / 1 (100.0 %) 1 / 1 (100.0 %) 1 / 2 (50.0 %) 1 / 2 (50.0 %) 4 / 6 (66.7 %)
Lung 13 / 13 (100.0 %) 3 / 3 (100.0 %) 3 / 3 (100.0 %) 3 / 3 (100.0 %) 3 / 4 (75.0 %) 12 / 13 (92.3 %)
Pancreas 10 / 11 (90.9 %) 2 / 2 (100.0 %) 3 / 3 (100.0 %) 2 / 2 (100.0 %) 1 / 3 (33.3 %) 8 / 10 (80.0 %)
Stomach 12 / 16 (75.0 %) 3 / 4 (75.0 %) 2 / 2 (100.0 %) 2 / 4 (50.0 %) 2 / 2 (100.0 %) 9 / 12 (75.0 %)
Breast 13 / 14 (92.9 %) 1 / 3 (33.3 %) 3 / 4 (75.0 %) 1 / 2 (50.0 %) 2 / 4 (50.0 %) 7 / 13 (53.8 %)
Cervix 11 / 15 (73.3 %) 3 / 3 (100.0 %) 2 / 3 (66.7 %) 2 / 2 (100.0 %) 2 / 3 (66.7 %) 9 / 11 (81.8 %)
Ovary 14 / 17 (82.4 %) 3 / 3 (100.0 %) 2 / 3 (66.7 %) 4 / 4 (100.0 %) 3 / 4 (75.0 %) 12 / 14 (85.7 %)
Total 114 / 141 (80.9 %) 18 / 23 (78.3 %) 23 / 29 (79.3 %) 24 / 29 (82.8 %) 23 / 33 (69.7 %) 88 / 114 (77.2 %)
Top 3 Type Cancer Positive TOO_I TOO_II TOO_III TOO_IV TOO_Total
Colorectal 14 / 26 (53.8 %) 1 / 2 (50.0 %) 3 / 4 (75.0 %) 4 / 4 (100.0 %) 4 / 4 (100.0 %) 12 / 14 (85.7 %)
Esophagus 12 / 14 (85.7 %) 1 / 1 (100.0 %) 3 / 4 (75.0 %) 3 / 3 (100.0 %) 3 / 4 (75.0 %) 10 / 12 (83.3 %)
GallBladder 9 / 9 (100.0 %) 0 / 1 (0.0 %) 1 / 2 (50.0 %) 3 / 3 (100.0 %) 3 / 3 (100.0 %) 7 / 9 (77.8 %)
Liver 6 / 6 (100.0 %) 1 / 1 (100.0 %) 1 / 1 (100.0 %) 1 / 2 (50.0 %) 2 / 2 (100.0 %) 5 / 6 (83.3 %)
Lung 13 / 13 (100.0 %) 3 / 3 (100.0 %) 3 / 3 (100.0 %) 3 / 3 (100.0 %) 3 / 4 (75.0 %) 12 / 13 (92.3 %)
Pancreas 10 / 11 (90.9 %) 2 / 2 (100.0 %) 3 / 3 (100.0 %) 2 / 2 (100.0 %) 1 / 3 (33.3 %) 8 / 10 (80.0 %)
Stomach 12 / 16 (75.0 %) 3 / 4 (75.0 %) 2 / 2 (100.0 %) 2 / 4 (50.0 %) 2 / 2 (100.0 %) 9 / 12 (75.0 %)
Breast 13 / 14 (92.9 %) 1 / 3 (33.3 %) 4 / 4 (100.0 %) 1 / 2 (50.0 %) 2 / 4 (50.0 %) 8 / 13 (61.5 %)
Cervix 11 / 15 (73.3 %) 3 / 3 (100.0 %) 2 / 3 (66.7 %) 2 / 2 (100.0 %) 2 / 3 (66.7 %) 9 / 11 (81.8 %)
Ovary 14 / 17 (82.4 %) 3 / 3 (100.0 %) 2 / 3 (66.7 %) 4 / 4 (100.0 %) 3 / 4 (75.0 %) 12 / 14 (85.7 %)
Total 114 / 141 (80.9 %) 18 / 23 (78.3 %) 24 / 29 (82.8 %) 25 / 29 (86.2 %) 25 / 33 (75.8 %) 92 / 114 (80.7 %)
Table 5
The control vs individual cancer model has a sensitivity of 89.6% for Stage I cancer, 89.2% for Stage II, 94.6% for Stage III, and 92.1% for Stage IV at a specificity of approximately 86.2% in the leave-out set.
The control vs grouped cancer model has a sensitivity of 79.3% for Stage I cancer, 78.4% for Stage II, 78.4% for Stage III, and 86.8% for Stage IV at a specificity of approximately 96.9% in the leave-out set.
The tissue of origin (TOO) model has a sensitivity of 57.9% when only the top 1 predicted cancer type is considered, 77.2% when the top 2 predicted cancer types are considered and 80.7% when the top 3 predicted cancer types are considered. The stagewise sensitivity of the top 1 TOO model is 52.2% for Stage I, 55.2% for Stage II, 72.4% for Stage III and 51.5% for Stage IV. The stagewise sensitivity of the top 2 TOO model is 78.3% for Stage I, 79.3% for Stage II, 82.8% for Stage III and 69.7% for Stage IV. The stagewise sensitivity of the top 3 TOO model is 78.3% for Stage I, 82.8% for Stage II, 86.2% for Stage III and 75.8% for Stage IV.
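The top 1, top 2 and top 3 figures above can be understood as top-k hit rates over the cancers that were detected. A minimal sketch is given below, assuming the TOO model outputs a ranked list of candidate tissues per sample; the function name and the toy predictions are hypothetical, while the ratio 66 / 114 is taken from Table 5.

```python
def top_k_hits(ranked_predictions, true_labels, k):
    """Number of samples whose true tissue is among the k highest-ranked predictions."""
    return sum(
        truth in ranked[:k] for ranked, truth in zip(ranked_predictions, true_labels)
    )

# Toy ranked predictions for four detected cancers.
ranked = [
    ["Lung", "Breast", "Liver"],
    ["Stomach", "Colorectal", "Esophagus"],
    ["Ovary", "Cervix", "Breast"],
    ["Liver", "Pancreas", "Gall Bladder"],
]
truth = ["Lung", "Colorectal", "Breast", "Stomach"]
hits = {k: top_k_hits(ranked, truth, k) for k in (1, 2, 3)}

# The reported figures follow the same ratio, e.g. 66 of 114 for top 1.
top1_reported = round(100 * 66 / 114, 1)
```

By construction the hit count can only stay equal or grow as k increases, which matches the progression from top 1 to top 3 in Table 5.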
The median sensitivity on the leave-out set was in the same ballpark as on the leave-in training set across all the different models, indicating that the models have generalized well.
CLAIMS
I/We claim:
1. A method for detecting a cancer through cell-free DNA (cfDNA) methylation patterns in a subject, said method comprising:
obtaining a plurality of methylation sequencing (methSeq) reads from cfDNA of said subject;
analysing said methSeq reads using a bioinformatics pipeline to obtain a genomic alignment and a methylation status for each methSeq read;
filtering a first type of methSeq reads from a second type of methSeq reads depending upon their conversion extents;
deriving at least one score using said first type of methSeq reads; and
applying a predictive model to said at least one score to determine presence of such cancer and a tissue of origin of such cancer in said subject.
2. The method as claimed in claim 1, wherein said methSeq reads are obtained by:
extracting plasma from the blood drawn from said subject;
extracting cell-free DNA (cfDNA) from said plasma;
pre-processing said cfDNA for end-repairing, A-tailing and adapter ligation;
subjecting said pre-processed cfDNA to at least one first enzyme selected from: Tet methylcytosine dioxygenase 2 (TET2) and T4 Phage β-glucosyltransferase (T4-BGT), for protecting methylcytosine bases (5mC) and hydroxymethyl cytosine bases (5hmC) from deamination to obtain a protected methSeq library;
subjecting said protected methSeq library to at least one second enzyme selected from: Apolipoprotein B mRNA editing enzyme subunit 3A (APOBEC3A), for converting unprotected cytosine bases to uracil bases to obtain a converted methSeq library; and
performing sequencing of said converted methSeq library to obtain methSeq reads.
3. The method as claimed in claim 1, wherein said methSeq reads are obtained by:
extracting plasma from the blood drawn from said subject;
extracting cell-free DNA (cfDNA) from said plasma;
pre-processing said cfDNA for end-repairing, A-tailing and adapter ligation;
subjecting said pre-processed cfDNA to sodium bisulfite for converting unmethylated cytosine bases to uracil bases to obtain a converted methSeq library; and
performing sequencing of said converted methSeq library to obtain methSeq reads.
4. The method as claimed in claim 1, wherein said methSeq reads are obtained by:
extracting plasma from the blood drawn from said subject;
extracting cell-free DNA (cfDNA) from said plasma; and
applying non-conversion-based protocol for obtaining methSeq reads.
5. The method as claimed in claim 2 or 3, further comprising, prior to the step of performing sequencing, amplifying said converted methSeq library using a plurality of polymerase chain reaction (PCR) cycles determined by an amount of the extracted cfDNA.
6. The method as claimed in claim 5, further comprising:
incorporating a unique sample barcode or index during said adapter ligation step;
pooling said converted methSeq library with converted methSeq libraries from additional subjects to create a pooled methSeq library;
performing a target capture of said pooled library using enrichment probes;
sequencing said pooled methSeq library to generate pooled methSeq reads; and
demultiplexing pooled methSeq reads to isolate methSeq reads corresponding to said subject.
7. The method as claimed in claim 6, further comprising:
performing hybridization capture of target regions of said converted methSeq library to obtain a captured methSeq library; and
performing sequencing of said captured methSeq library to obtain methSeq reads.
8. The method as claimed in any of the preceding claims 3 and 5-7, wherein said method comprises estimating, using Alu sequences, said conversion extent of said unmethylated cytosine bases to uracil bases in said converted methSeq library, wherein if said unmethylated cytosine bases undergo an incomplete conversion to uracil bases, binding of said Alu sequences to a primer thereof is retained yielding a smaller PCR Ct value indicative of a coarse conversion efficiency.
9. The method as claimed in any of the preceding claims 2 and 5-7, wherein said method comprises estimating, using Alu sequences, said conversion extent of said unprotected cytosine bases to uracil bases in said converted methSeq library, wherein if said unprotected cytosine bases undergo an incomplete conversion to uracil bases, binding of said Alu sequences to a primer thereof is retained yielding a smaller PCR Ct value indicative of a coarse conversion efficiency.
10. The method as claimed in any of the preceding claims 1-9, wherein said bioinformatics pipeline employs a plurality of functions comprising: identifying quality control parameters; adapter removal and trimming; alignment to a human reference genome; filtering reads based on various read and alignment parameters, determination of a methylation status for each CpG base in said methSeq read and determination of a methylation status for each non-CpG base in said methSeq read.
11. The method as claimed in any of the preceding claims 1-10, wherein said method comprises:
analysing said methylation status for each base in said methSeq read to estimate said conversion extent of said methSeq read; and
identifying said first type of methSeq reads, based on said conversion extent being above a threshold.
12. The method as claimed in any of the preceding claims 1-11, wherein the method comprises using said first type of methSeq reads to compute various scores selected from any of:
at least one region-wise methylation score corresponding to each of at least one genomic region, wherein said at least one region-wise methylation score is one of: a CpG-wise methylation score, CHG-wise methylation score, CHH-wise methylation score, read-wise methylation score, and fragmentomic methylation score; and
a global methylation score selected from one of: global fragmentomic methylation score, global mitochondrial to nuclear ratio methylation score.
13. The method as claimed in any of the preceding claims 1-12, further comprising generating said predictive model using machine learning methods applied to at least one score generated from a leave-in training set of subject samples to:
determine presence or absence of said cancer and the tissue of origin of said cancer, and
assess for accuracy of said determination of presence or absence of said cancer and the tissue of origin of said cancer, on a leave-out set of subject samples.
14. The method as claimed in claim 13, wherein said leave-in training set and leave-out set comprise a group of subjects selected from: cancer patients diagnosed at a stage of cancer, patients with benign tumor, and healthy controls including both smokers and non-smokers, as well as tobacco chewers and non-chewers, wherein said stage of cancer is selected from a Stage I cancer, a Stage II cancer, a Stage III cancer, and a Stage IV cancer.
15. The method as claimed in any of the preceding claims 13 and 14, wherein said leave-in training set is subjected to a plurality of cross-validation steps, wherein each cross-validation step comprises a feature selection module that selects regions of interest, wherein said regions of interest are selected based on the distribution of one or more of said scores being significantly different between:
cancer patients, and
patients with benign tumors, and healthy controls.
16. The method as claimed in any of the preceding claims 13 to 15, wherein said leave-in training set is subjected to at least one perturbation step for synthetically expanding the leave-in training set.
17. The method as claimed in any of the preceding claims 14 to 16, wherein performance for detection of cancer on the leave-out set is a sensitivity of 89.6% for Stage I cancer, 89.2% for Stage II cancer, 94.6% for Stage III cancer, and 92.1% for Stage IV cancer, at a specificity of approximately 86.2%.
18. The method as claimed in any of the preceding claims 14 to 16, wherein performance for detection of cancer at 95% confidence interval of sensitivity for
a general cancer model is in a range of 71%-77% for Stage I, 74%-76% for Stage II, 70%-73% for Stage III, and 78%-80% for Stage IV; and
a female cancer model is in a range of 75%-82% for Stage I, 75%-76% for Stage II, 79%-81% for Stage III, and 84%-87% for Stage IV.
19. The method as claimed in any of the preceding claims 1-18, wherein said cancer is selected from at least one of: colorectal cancer, esophagus cancer, gallbladder cancer, liver cancer, lung cancer, pancreatic cancer, stomach cancer, breast cancer, cervical cancer, ovarian cancer.
20. The method as claimed in any of the preceding claims 14-19, wherein performance for detection of the tissue of origin on the leave-out set is 70-85%.
21. A diagnostic device for detecting a cancer through cell-free DNA (cfDNA) methylation patterns in a subject, said diagnostic device comprising a reader unit for reading and analysing results from said cell-free DNA (cfDNA) methylation patterns.
22. The diagnostic device as claimed in claim 21, further comprising a memory module for storing data from said analysed results from said cell-free DNA (cfDNA) methylation patterns.
23. A non-transitory computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor comprising processing hardware to execute the method as claimed in any of claims 1-20.