Systems And Methods For Analyzing Sequence Data


Abstract: The present invention relates to systems, methods, software and computer-usable media comprising a clustering tool for biomolecule-related sequences. Biomolecule-related sequences can relate to proteins, peptides, nucleic acids, and the like, and can include functional and structural information such as secondary or tertiary structures, amino acid or nucleotide sequences, binding properties, genetic mutations and variants, sequence motifs, and so on. The clustering tool allows the user to find unique and biologically meaningful subtypes in biomolecule-related sequence datasets. In addition, the invention uses a recursive method to find finer structures in the datasets, resulting in identification of novel subtypes.


Patent Information

Application #

Filing Date

17 April 2019

Publication Number

43/2020

Publication Type

INA

Invention Field

BIOTECHNOLOGY

Status

lipika@lifeintelect.com

Parent Application

Applicants

Mazumdar Shaw Medical Foundation

Mazumdar Shaw Medical Foundation, A-Block, 8th Floor, Mazumdar Shaw Medical Centre, #258/A Narayana Health City, Bommasandra Bangalore Karnataka India 560099

Inventors

1. Dr. Nameeta Shah

Villa No. 166, NAMBIAR BELLEZEA, Muthanallur circle, Narayanaghatta village, Chandapura Dommasandra Road, Near Bangalore Karnataka India 560099

2. Pranali Sonpatki

Flat no.15, 3A/1 Nirmal Park HSG soceity, Padmavati Pune Maharashtra India 411043

Specification

FORM 2
THE PATENTS ACT, 1970 (39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
[See section 10 and rule 13]

1. TITLE OF THE INVENTION: SYSTEMS AND METHODS FOR ANALYZING SEQUENCE DATA

2. APPLICANT (A) NAME: MAZUMDAR SHAW MEDICAL FOUNDATION

(B) ADDRESS: MAZUMDAR SHAW MEDICAL FOUNDATION, A-BLOCK, 8TH FLOOR, MAZUMDAR SHAW MEDICAL CENTRE, #258/A, NARAYANA HEALTH CITY, BOMMASANDRA, BANGALORE, KARNATAKA, INDIA, 560099

3. NATIONALITY (C) INDIA

THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED

[001] PRIORITY PARAGRAPH
[002] This application claims priority to Indian provisional patent application No. 201941015490, filed on April 17, 2019, titled "SYSTEMS AND METHODS FOR ANALYZING SEQUENCE DATA", which is incorporated herein by reference.
[003] TECHNICAL FIELD OF THE INVENTION
[004] The present invention is in the technical field of analysis of genetic and transcriptomic sequences. More particularly, the invention relates to systems, methods, software and computer-usable media comprising a clustering tool for biomolecule-related sequences, wherein the clustering algorithm is Recursive Consensus Clustering (RCC), a tool for analysis and visualization of sequence data.
[005] BACKGROUND OF THE INVENTION
[006] The human genome contains a great deal of information about a person's health. Next-generation sequencing (NGS) technologies rapidly convert genome information from its natural, biological format into sequence files. These sequence data can be examined for disease-associated mutations and other features. However, as genome sequencing technologies become faster, cheaper, and more accurate, the output they produce can become difficult to analyze and interpret.
[007] Because genetic sequences contain heterozygosity, sequencing errors, somatic mutations, structural variants and repeated genetic elements, sequence reads can be structured in many ways. At times, this can lead to negligible or even misleading informational content, so correct interpretation of the data is important.
[008] Recent advances in the field of RNA and DNA sequencing and data analysis have produced a wealth of data that can be used to classify and study the transcriptomic subtypes/cell types in different biological systems.
[009] Transcriptome profiling is a popularly used technique to obtain information on the abundance of mRNA transcripts within a biological sample. The transcriptomic data can be generated using either GeneChips microarray or RNA sequencing.
[010] Clustering of transcriptomic data reduces the dimensionality of the data and allows a researcher to better analyse, visualize and interpret the data for biological insight.
[011] However, conventional sequence clustering tools do not perform satisfactorily in biological scenarios for various reasons, some of which are:
[012] Large number of samples: Due to recent advances in next-generation sequencing, transcriptomic data are abundant. It becomes increasingly difficult to analyze a large number of samples using standard clustering algorithms (e.g. PAM, mclust), as they fail to process such large datasets.
[013] High dimensionality: In transcriptomic data, the number of features/genes is often very high and usually exceeds the number of samples. Because of this, it is difficult to view or analyze the data in lower dimensions. For example, FIG 1 of Abrams Z. et al. (Thirty biologically interpretable clusters of transcription factors distinguish cancer type. BMC Genomics 2018 19:738) shows principal component plots based on the expression patterns of 486 transcription factors (features) in the TCGA cancer dataset (https://portal.gdc.cancer.gov/). As demonstrated in that figure, a few low dimensions are not enough to observe segregation between the subgroups of the data.
[014] Number of clusters: If the data is not well annotated, it is difficult to find the optimal number of clusters that truly represents the dataset. Even when the data is annotated, a lot of heterogeneity can be seen in the datasets, which raises the question of whether the data can be divided further.
[015] Sparsity of data: Sparsity becomes an issue with single-cell transcriptomic datasets. Standard clustering algorithms are not equipped to handle it, as only a fraction of features are non-zero in each sample, resulting in a sparse data matrix. This in turn affects the complexity and the measurement of similarity.
[016] Researchers face the following challenges in performing transcriptomic data analysis:
[017] Clustering of datasets with an unknown number of clusters: Algorithmically identifying the optimal number of clusters in a dataset is a difficult mathematical problem, especially for big datasets with a large number of clusters (ref. 1).
[018] Finding novel subtypes with user-friendly tools: Subtypes that are not known beforehand are likely to be missed when applying popular clustering methods like k-means, hierarchical clustering, pam, mclust, etc. (ref. 2). As an example, consider the large-scale TCGA pan-cancer dataset, which includes samples from multiple cancer types (breast, prostate, brain, etc.), with each cancer type having distinct molecular subtypes. Two major publications that analyzed this transcriptome data were largely able to find clusters corresponding to tissue-specific cancers, but not the clinically relevant subtypes within each cancer type (refs. 3, 4). The novel subtypes were discovered only after using a sophisticated integrated analysis of multi-omic data.
[019] So, there is a need to develop a user-friendly clustering algorithm that allows the user to find unique and biologically meaningful subtypes in genomic datasets.
[020] In addition, there is a need for a clustering tool that works well on datasets with a known as well as an unknown number of clusters, in which estimation of the number of clusters does not depend on the labelling of the data.
[021] Furthermore, there is a need in the art to develop a clustering tool that can explore the dataset in a hierarchical fashion, revealing finer structures so that unique subtypes are not missed.
[022] SUMMARY OF THE INVENTION
[023] According to an exemplary aspect, the present invention relates to systems, methods, software and computer-usable media comprising a clustering tool for biomolecule-related sequences. Biomolecule-related sequences can relate to proteins, peptides, nucleic acids, and the like, and can include functional and structural information such as secondary or tertiary structures, amino acid or nucleotide sequences, binding properties, genetic mutations and variants, sequence motifs, and the like. The clustering tool allows the user to find unique and biologically meaningful subtypes in biomolecule-related sequence datasets. In addition, the invention uses a recursive method to find finer structures in the datasets, resulting in identification of new unique subtypes.
[024] According to an exemplary aspect, the present invention discloses Recursive Consensus Clustering (RCC), a user-friendly algorithm which allows the user to find new, unique and biologically meaningful subtypes in transcriptomic datasets. RCC uses a recursive method to find finer structures in the datasets, resulting in identification of novel subtypes.
[025] In an embodiment, Recursive Consensus Clustering (RCC) is an unsupervised clustering algorithm for novel subtype discovery from both bulk and single-cell datasets. RCC is available as an R package (https://github.com/MSCTR/RecursiveConsensusClustering) and facilitates the generation of new biological insights through intuitive visualization of clustering results.
[026] RCC was developed to address the following problems in transcriptomic data analysis:
[027] Clustering of datasets with an unknown number of clusters: RCC works well on datasets with a known as well as an unknown number of clusters, as estimation of the number of clusters k does not depend on the labelling of the data.
[028] Finding novel subtypes: Owing to its recursive nature, RCC explores the dataset in a hierarchical fashion, revealing finer structures and hence novel subtypes which are generally missed.
[029] Biologist-intuitive: Most of the computational parameters required for clustering are automatically calculated in RCC. Unlike all other algorithms, it works well for both bulk and single-cell transcriptome data.
[030] RCC can be divided into five basic steps, followed by output of the results, namely:
[031] Data input: RCC takes quantitative transcriptomics data as input along with sample/cell information file.
[032] Feature selection and data scaling: For every recursive run RCC selects the top n% of variant genes in that particular subset of the original dataset which is used for further clustering. This feature of RCC helps in finding the finer hierarchical structure in a given dataset.
[033] Parallel processing of ConsensusClusterPlus (CCP): RCC uses the ConsensusClusterPlus package in R to perform k-means clustering on the dataset; this is processed in parallel eight times with 100 repeats each.
[034] Optimal k selection: Based on the Cumulative Distribution Function (CDF) plots produced, RCC finds the optimal k for each process, in turn giving eight optimal k values for every run. If none of the clusters meet the established criteria, RCC returns zero as the optimal k.
[035] Subdivision of data and recursive clustering: Once the optimal k for a given dataset is selected, each subset is further recursively clustered as long as an optimal k can be selected, there are enough samples, and a significant number of genes are differentially expressed.
[036] Output: RCC outputs cluster information at all levels in a csv format and also allows the user to view the clustered data in the form of tracking plot and cluster annotation plot.
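The feature selection step described above (top n% of variant genes, recomputed for each subset) can be illustrated with a minimal Python sketch. This is not the RCC R package code: the function name `top_variable_genes`, the nested-dictionary data layout and the use of population variance as the variability measure are all assumptions made for illustration only.

```python
from statistics import pvariance


def top_variable_genes(expr, samples, percent):
    """Return the top `percent`% most variable genes among `samples`.

    expr: dict mapping gene -> dict mapping sample -> expression value
          (hypothetical layout chosen for this sketch).
    """
    # Variance of each gene, computed only over the samples in this subset.
    variances = {
        gene: pvariance([values[s] for s in samples])
        for gene, values in expr.items()
    }
    # Keep at least one gene so that clustering always has a feature.
    n_keep = max(1, round(len(variances) * percent / 100))
    # Sort by decreasing variance; gene name breaks ties deterministically.
    ranked = sorted(variances, key=lambda g: (-variances[g], g))
    return ranked[:n_keep]


# Toy data: four genes measured in four samples (made-up values).
expr = {
    "geneA": {"s1": 0.0, "s2": 0.1, "s3": 9.0, "s4": 9.2},
    "geneB": {"s1": 1.0, "s2": 5.0, "s3": 3.0, "s4": 3.1},
    "geneC": {"s1": 2.0, "s2": 2.0, "s3": 2.1, "s4": 2.0},
    "geneD": {"s1": 4.0, "s2": 4.2, "s3": 4.1, "s4": 4.0},
}

whole = top_variable_genes(expr, ["s1", "s2", "s3", "s4"], 25)
subset = top_variable_genes(expr, ["s1", "s2"], 25)
print(whole, subset)
```

In this toy example the gene selected on the whole dataset differs from the gene selected within the subset {s1, s2}, which is the behaviour the step relies on: features are re-selected at every level of the hierarchy rather than fixed once for the entire dataset.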
[037] In various embodiments, nucleic acid sequence read data can be generated using different techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, polymerase-based systems, ion- or pH-based detection systems, electronic signature-based systems, and so on.
[038] In another aspect, a system for identifying potential clustering for sequencing reads is disclosed. The system processor receives nucleic acid sequence data from a sequencer and is configured to perform cluster analysis of the read sequences from the sequencer output with respect to a reference sample.
[039] According to yet another aspect, a computer-implemented method for determining possible clusters for sequencing reads is disclosed. A sample can be interrogated to produce a plurality of read sequences from the sample. Clustering is performed on the read sequences from the sequencer. A quality value for each cluster is determined. Each cluster with its associated quality value is outputted.
[040] According to the exemplary aspects, important uses of the invention include:
[041] Clustering of datasets with unknown number of clusters: Identifying algorithmically the optimal number of clusters in a dataset is a difficult problem. RCC works well on datasets with known as well as unknown number of clusters as estimation of the number of clusters k does not depend on the labelling of the data.
[042] Finding novel subtypes: Owing to its recursive nature, RCC explores the dataset in a hierarchical fashion, revealing finer structures and hence novel subtypes which are generally missed when applying standard clustering methods including k-means, hierarchical clustering, pam, mclust, etc. For example, consider a large-scale cancer dataset which includes samples from multiple cancer types (breast, prostate, brain, etc.), with each cancer type having distinct molecular subtypes. Using an out-of-the-box clustering algorithm, one is only able to find tissue-specific cancers and not the clinically relevant subtypes within each cancer type.
[043] Biologist-intuitive: Most of the computational parameters required for clustering are automatically calculated in RCC. Unlike all other algorithms, it works well for both bulk and single-cell transcriptome data.
[044] In summary, the present disclosure relates to systems, methods, software and computer-usable media comprising a clustering tool for biomolecule-related sequences, namely Recursive Consensus Clustering (RCC), a user-friendly clustering tool that allows the user to find new, unique and biologically meaningful subtypes in genetic and genomic datasets. RCC uses a recursive method to find finer structures in the datasets, resulting in identification of novel subtypes.
[045] BRIEF DESCRIPTION OF THE DRAWINGS
[046] Example embodiments of the present invention will be described with reference to the accompanying drawings briefly described below.
[047] FIG 1 illustrates the basic framework of RCC. A) Overview of the RCC workflow. B) Datasets used to test the RCC algorithm, where N represents the number of cells/samples in a dataset, k represents the number of clusters identified by the authors in the original publications, the fourth column states whether RCC found biologically relevant novel subtypes in the respective dataset, and the RCC, mclust and SC3 columns state the number of clusters found by the respective algorithms. C) Adjusted Rand Index (ARI) of multiple runs (1000 runs for the Ivy GAP (ref. 5), Biase (ref. 6) and Pollen (ref. 7) datasets, 100 runs for the Darmanis (refs. 8, 9) dataset and 10 runs for the Human tissue (ref. 10), TCGA pan-cancer (refs. 11, 12) and Neftel (ref. 13) datasets), where the ARI is calculated for each run against one randomly selected run. D) Consensus matrix of 1000 RCC runs for the Biase and Ivy GAP datasets showing the robustness of the algorithm, according to the aspects of present invention.
[048] FIG 2 illustrates the RCC results for TCGA pan-cancer and Neftel datasets. A) Tracking plot for TCGA pan-cancer dataset; the tracking plot is divided into three panels: the marker gene count panel i, the cluster panel j, and the annotation panel k. Panel i shows the number of marker genes found per cluster per level. Grey color indicates that the particular cluster is not further divided and hence no new markers are found which are distinct from the previous level. Panel j shows the number of clusters found per level. As shown in the figure, the data is divided into ten clusters at the first level. Then each of these ten clusters is further divided. Panel k shows the distribution of cancer types in different clusters. The Melanoma (SKCM) samples are highlighted in the red box. The SKCM samples are divided into four clusters. B) Survival profiles for SKCM samples based on their cluster assignment. All the four SKCM clusters show significantly distinct survival profiles with cluster three having the best prognosis and cluster four having the worst. C) Cluster specific markers of SKCM clusters showed that each of the clusters have distinct marker genes. D) The pie charts show the distribution of six meta modules namely: MES2like, MES1like, AClike, OPClike, NPC1like, NPC2like, G1S and G2M across all 28 tumors categorized into proneural, mesenchymal, classical and mixed TCGA subtypes as in the original publication. Single-cell level distribution of meta module signatures for malignant cell types across four patients each representative of one TCGA subtype. E) Cluster annotation plots for RCC, tSNE + k-means and SC3 with cellular states on the x-axis and cluster label representing the dominant cellular state on the y-axis. F) Markers for macrophage clusters and their enrichment across nine clusters. The annotation panel shows the SSGSEA score for tumor periphery (LE), hypoxia (PAN), microglia and macrophage gene sets, according to the aspects of present invention.
[049] FIG 3 illustrates the Feature selection in RCC. A) RCC clustering of Brain tissue data using the top 2% variable genes across 53 tissue types. B) RCC clustering of Brain tissue data using the top 2% variable genes in the brain tissue samples. The selected features changing based on the current clustering hierarchy instead of being constant based on the entire dataset results in discovery of the finer structures of the data, according to the aspects of present invention.
[050] FIG 4 illustrates the gene marker plots for Ivy GAP data. In RCC clustering, CT and PAN samples subdivide into two clusters each. The rows represent genes and columns represent samples. The annotation bar shows the demarcation between two clusters (Gene names available in Supplemental Data), according to the aspects of present invention.
[051] FIG 5 illustrates A) Consensus matrices for k = 2:6 and B) CDF plot for optimal k selection. The dot-dash line shows the maximum possible CDF value for each k, according to the aspects of present invention.
[052] FIG 6 illustrates the Cluster annotation plots of RCC cluster assignment for all the datasets. The X-axis represents annotations (histology, tissue type, cancer type, and cell type) and the Y-axis represents RCC clusters. The frequency of each attribute in each cluster is calculated in percent, where red indicates 100% and white indicates 0%, according to the aspects of present invention.
[053] FIG 7 illustrates the Cluster visualization of TCGA pan-cancer data. A) Tracking plot showing clustering information for TCGA pan-cancer data using the RCC cluster assignment along with the cluster distribution. Breast cancer samples (BRCA) are highlighted with black box and it can be seen they divide in four major clusters. B) tSNE + k-means plot using tSNE + k-means cluster assignment and data points colored by cancer type. BRCA samples are highlighted with black box. C) tSNE + k-means plot using tSNE + k-means cluster assignment and data points colored by tSNE + k-means cluster assignments. BRCA samples are highlighted with black box and it can be seen they divide into multiple clusters. The major clusters are highlighted in the legend. Using tSNE + k-means plots B and C, it is difficult to visualize or interpret the clustered data whereas RCC provides an easy and visually convenient alternative of tracking plot A for the same, according to the aspects of present invention.
[054] FIG 8 illustrates the SSGSEA of glioma samples in the TCGA pan-cancer dataset. There are 13 RCC clusters for gliomas out of which three clusters are GBM samples and the rest are lower grade glioma samples. The rows represent the cancer hallmark gene sets and the columns represent all the glioma samples, according to the aspects of present invention.
[055] FIG 9 illustrates the Best k selection for simulated datasets. A) Based on the selection criteria RCC selects k = 2,4,5 as best ks for all the runs. As k = 5 has the highest frequency across all the runs and is the highest k compared to other best ks, k = 5 is selected as the optimal k which is the true k. B) Based on the selection criteria RCC selects k = 2:5 as the best ks for all the runs. As k = 5 has the highest frequency across all the runs and is the highest k compared to other best ks, k = 5 is selected as the optimal k which is the true k, according to the aspects of present invention.
[056] FIG 10 illustrates the Tracking plots of TCGA pan-cancer data A) Tracking plot before the level cutoff function was applied. There are 138 RCC clusters B) Tracking plot after the level cutoff function with 118 clusters, according to the aspects of present invention.
[057] FIG 11 illustrates the Cluster annotation plot of brain samples from human tissue dataset. Except for mclust, all the algorithms used the entire human tissue data for clustering where optK of 53 was provided for tSNE + k-means and hclust algorithms. Mclust doesn’t work for the full dataset, hence we took the subset of brain samples and ran mclust separately on them. The brain samples were divided into 16 clusters by RCC, seven clusters by tSNE + k-means, two clusters by hclust and nine clusters by mclust. Based on the cluster annotation plots and ARI, RCC does a better job at dividing the brain samples at sub tissue levels, according to the aspects of present invention.
[058] FIG 12 illustrates the Gene marker plot for Lung samples from human tissue dataset. RCC divides the lung samples in eight clusters where cluster one shows enrichment of myeloid leukocyte activation and cell activation gene sets and patients with decreased pulmonary function, cluster two has gene sets with extracellular matrix component and subjects who suffered fast deaths, cluster three shows enrichment of myeloid leukocyte migration and cell migration gene sets, cluster four shows genes involved in immunoglobulin receptor binding and protein activation cascade, cluster five shows genes involved in positive regulation of protein secretion and plasminogen activation, cluster six shows up regulation of genes involved in cytokine activation and regulation of vasculature development, cluster seven shows up regulation of genes involved in Pulmonary embolism and complement activation and classical pathway and cluster eight has up regulation of genes which have cilium cellular component, according to the aspects of present invention.
[059] FIG 13 illustrates the Survival plots for KIRC, LGG, SARC, SKCM, and UCEC cancer types. RCC divided these cancers with significantly distinct survival profiles with p < 0.01 consistently across 10 runs, according to the aspects of present invention.
[060] FIG 14 illustrates the SSGSEA analysis of SARC clusters. Cluster one shows enrichment in glycolysis and DNA repair gene sets, cluster two shows enrichment in myogenesis geneset, cluster three shows enrichment of genes downregulated in KRAS pathway and genes involved in Hedgehog signaling whereas cluster four shows enrichment in complement pathway and apoptosis gene sets. The gene sets were downloaded from Molecular Signatures Database (MSigDB) (ref. 27), according to the aspects of present invention.
[061] FIG 15 illustrates the RCC clustering for the Pollen dataset. A) Consensus Cluster Matrix for 1000 runs of RCC showing consistent subgroups of BJ cell line highlighted by blue box. B) Gene marker plot of BJ cell-types where cluster two consisted of cells with expression of genes involved in RNA binding, structural constituent of ribosome and establishment of protein localization to endoplasmic reticulum, cluster three showed enrichment of genes involved in post-embryonic development and posttranscriptional regulation of gene expression, whereas cluster one had cells which did not show enrichment of genes upregulated in either of the clusters, according to the aspects of present invention.
[062] FIG 16 illustrates the gene marker plot for oligodendrocytes, endothelial and fetal quiescent cells in the Darmanis dataset. The clusters are formed irrespective of their origin (i.e. from the normal brain or tumor samples). A) The oligodendrocyte clusters show enrichment of kinase binding and mRNA splicing genes in cluster one, regulation of protein polymerization, myelin sheath, and ferric ion binding genes in cluster two and ganglioside GT1b binding, membrane raft polarization and myelination genes enrichment in cluster three. B) In Endothelial cells, cluster one shows genes involved in blood vessel and vasculature development processes, cluster two shows cell adhesion specific gene markers involved in blood vessel morphogenesis process and, cluster three markers showed to be playing a part in extracellular matrix formation. C) In Fetal Quiescent cell types cluster two shows enrichment of genes involved in axon guidance pathway and semaphorin receptor activity, cluster three shows enrichment of genes involved in kinase activity and neurotrophin TRKC receptor binding whereas cluster one had cells which did not show enrichment of genes upregulated in either of the two clusters, according to the aspects of present invention.
[063] FIG 17 illustrates the Pie Charts showing cellular states across 28 tumors for Neftel dataset. RCC clusters cells based on their cellular states (MES1like, MES2like, AClike, OPClike, NPC1like, NPC2like, G1S, G2M) across all 28 tumors which are not replicated well by tSNE + k-means and SC3, according to the aspects of present invention.
[064] FIG. 18 illustrates the UI model for RCC, according to the aspects of present invention.
[065] FIG. 19 is a block diagram illustrating the details of a digital processing system in which various aspects of the present invention are operative by execution of appropriate execution modules.
[066] In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.
[067] DETAILED DESCRIPTION OF THE INVENTION
[068] Embodiments of systems and methods for Recursive Consensus Clustering tool for analysis and visualization of sequence data are described herein.
[069] The section headings used herein are for organizational purposes only and are not to be construed as limiting the described subject matter in any way.
[070] In this detailed description of the various embodiments, for purposes of explanation, various details are described to provide a thorough understanding of the embodiments disclosed. One skilled in the art will appreciate, however, that these various embodiments may be practiced with or without these specific details. In other embodiments, structures and devices are shown in block diagram form. Furthermore, one skilled in the art can appreciate that the specific sequences in which methods are presented and performed are illustrative in nature and it is contemplated that the sequences can be different and still remain within the spirit and scope of the various embodiments disclosed herein.
[071] Definitions:
[072] The following terms are used as defined below throughout this application, unless otherwise indicated.
[073] The terms “tumour” or “cancer tissue” refer to an abnormal mass of tissue which results from uncontrolled cell division. A tumour or tumour tissue comprises “tumour cells”, which are neoplastic cells with anomalous growth properties and no useful bodily function. Tumours, tumour cells and tumour tissue can be benign or malignant.
[074] The phrase "differentially present" refers to differences in the quantity of the marker present in a sample taken from patients as compared to a control subject. A biomarker can be differentially present in terms of frequency, quantity or both.
[075] "Diagnostic" means identifying a pathologic condition.
[076] The terms "detection", "detecting" and the like, may be used in the context of detecting markers or biomarkers.
[077] A "test amount" of a marker refers to an amount of a marker present in a sample being tested. A test amount can be either in absolute amount (e.g., µg/ml) or a relative amount (e.g., relative intensity of signals).
[078] The terms "polypeptide," "peptide" and "protein" are used interchangeably herein to refer to a polymer of amino acid residues. "Polypeptide," "peptide" and "protein” can be modified, e.g., by the addition of carbohydrate residues to form glycoproteins.
[079] The terms "subject", "patient" or "individual" generally refer to a human or other mammal. "Sample" refers to polynucleotides, antibody fragments, polypeptides, peptides, genomic DNA, RNA, cDNA, a cell, a tissue, and derivatives thereof; a sample may comprise a bodily fluid, a soluble cell preparation, culture media, a chromosome, an organelle, or a membrane isolated or extracted from a cell.
[080] A “system” refers to a set of components, real or abstract, comprising a whole where each component interacts with or is related to at least one other component within the whole.
[081] A “biomolecule” is any molecule that is produced by biological organisms, including proteins, polysaccharides, lipids, nucleic acids and small molecules such as primary metabolites, secondary metabolites, and other natural products.
[082] EMBODIMENTS OF THE INVENTION:
[083] 1. WORKFLOW FOR THE RCC ALGORITHM
[084] FIG 1 explains the workflow of the RCC algorithm.
[085] Inventors have developed Recursive Consensus Clustering (RCC), a user-friendly R package that allows a researcher to find novel and biologically meaningful subtypes in transcriptomic datasets without requiring computational expertise. The recursive clustering of the dataset reveals finer structures in the data leading to the identification of novel subtypes. The RCC algorithm is described in five steps as shown in FIG 1A.
[086] Data input: RCC takes a quantitative transcriptomic data matrix as input. Annotation matrix file for samples/cells can be provided optionally.
[087] Feature selection and data scaling: For every recursive run RCC selects the top n% of variant genes in that particular subset of the original dataset which is used for further clustering. This feature of RCC helps in finding the finer hierarchical structure in a given dataset (FIG 3).
[088] Parallel processing of ConsensusClusterPlus: RCC uses the ConsensusClusterPlus package (ref. 14) (CCP) in R for k-means clustering of the dataset. This is repeated eight times in parallel with 100 repeats each.
[089] Optimal k selection: Based on the cumulative distribution function (CDF) plots indicative of cluster stability, produced by CCP, RCC finds the best ks (ref. 15) for each process, in turn giving multiple k values that satisfy the best k selection criteria (described in detail in methods section). The k with maximum frequency is selected as the optimal k for the clustering.
[090] Subdivision of data and recursive clustering: Once the k for a given dataset is selected, all k subsets are clustered recursively until a termination criterion is reached, i.e. optimal k = 0, or
[091] the sample size of the subset data is lower than the minimum number of samples required for clustering.
[092] Output: RCC outputs final cluster information as well as cluster information at all levels in a csv format. It also outputs the clustered data in the form of tracking plot and cluster annotation plot for visualization of results facilitating biological interpretation.
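The optimal k selection and the recursive subdivision described in the steps above can be outlined with a short Python sketch. It is a schematic under stated assumptions, not the RCC implementation: `pick_optimal_k`, `recursive_cluster`, `toy_propose_ks` and `toy_split` are hypothetical names, the clustering itself is replaced by toy stand-in functions, and breaking frequency ties in favour of the larger k is an assumption made here. The additional check on differentially expressed genes is omitted.

```python
from collections import Counter


def pick_optimal_k(best_ks_per_run):
    """Most frequent candidate k across runs; returns 0 if no run proposed one.

    Ties in frequency go to the larger k (an assumption of this sketch).
    """
    counts = Counter(k for ks in best_ks_per_run for k in ks)
    if not counts:
        return 0
    return max(counts, key=lambda k: (counts[k], k))


def recursive_cluster(samples, propose_ks, split, min_samples, path="1"):
    """Return {sample: cluster path}, splitting until a stop criterion is met."""
    # Termination criterion: too few samples to cluster further.
    if len(samples) < min_samples:
        return {s: path for s in samples}
    k = pick_optimal_k(propose_ks(samples))
    # Termination criterion: no acceptable k was found (optimal k = 0).
    if k == 0:
        return {s: path for s in samples}
    labels = {}
    for i, part in enumerate(split(samples, k), start=1):
        labels.update(
            recursive_cluster(part, propose_ks, split, min_samples, f"{path}.{i}")
        )
    return labels


def toy_propose_ks(samples):
    # Hypothetical stand-in for eight consensus clustering runs:
    # propose k = 2 only while the values are widely spread.
    spread = max(samples) - min(samples)
    return [[2]] * 8 if spread > 5 else [[]] * 8


def toy_split(samples, k):
    # Hypothetical stand-in for clustering: cut the sorted values at the
    # largest gap (k is always 2 in this toy example).
    ordered = sorted(samples)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return [ordered[:cut], ordered[cut:]]


result = recursive_cluster(
    [1, 2, 3, 20, 21, 40, 41, 42], toy_propose_ks, toy_split, min_samples=3
)
print(result)
```

The cluster paths ("1.1.2", "1.2", and so on) record the level at which each split happened, which mirrors the idea of reporting cluster information at all levels rather than only the final assignment.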
[093] Inventors have used k-means as the underlying clustering algorithm for RCC as inventors found it to work well for bulk transcriptomic datasets of different sizes. In principle, any clustering algorithm can be substituted for k-means. As the result obtained from the k-means algorithm is highly dependent on initial seed value, inventors run CCP eight times using a different random seed and different parameters for each run to get the most stable clusters (methods).
[094] To minimize user dependency and find appropriate values for all the parameters, inventors tested this algorithm using various bulk and single-cell datasets (FIG 1B, Biological Inferences in methods). All the datasets selected were annotated and used for further clustering. RCC works well on bulk as well as single-cell datasets. For most of the datasets, RCC was able to find biologically significant novel subtypes (FIG 1B, Biological Inferences in methods). To check the stability of RCC, inventors ran the algorithm 1000 times on Ivy GAP, Biase and Pollen datasets, 100 times on the Darmanis dataset and 10 times on Human tissue, TCGA pan-cancer, and Neftel datasets. Inventors calculated the Adjusted Rand Index (ARI), using the mclust16 package in R, of each run against one randomly selected run; the results were highly consistent, with ARI ranging from 0.5 to 1.0 (FIG 1C). The consistency of the results for Biase and Ivy GAP datasets can be seen in the consensus matrices (FIG 1D). The consensus matrices for the rest of the datasets are provided in Supplementary data. The major feature of RCC is the automatic selection of the number of clusters for a given dataset. So to benchmark RCC, inventors chose mclust16 (for bulk dataset) and SC317 (for single-cell dataset) packages, which also have the feature of selecting the optimal k (FIG 1C). Inventors have also compared our algorithm against tSNE + k-means18 and hierarchical clustering (hclust, base R package) by giving them the k based on known annotation (Tables S1, S2). Inventors were not able to run mclust on larger datasets (i.e. datasets having samples > 1000) on our system (system specifications in methods). Overall, inventors found good concordance between the clusters found by RCC and labels suggested by the original authors (Tables S2, S3). In all but one case, RCC found novel subtypes, i.e. further subdivision of known attributes at the time of data generation as provided in the original study (FIG 1B, Biological Inferences in methods). For example, in the Ivy GAP dataset5, the CT and PAN samples consistently further divide into two clusters (FIG 1D). Further review of these sub-clusters showed that they were dividing into classical/mesenchymal and proneural subtypes of glioblastoma (FIG 4). None of the other algorithms tested were able to identify these subtypes.

[095] TABLE S1: Algorithms used for benchmarking RCC

[096] TABLE S2: Adjusted Rand Index (ARI) and optimal k provided to or determined by RCC, hclust, mclust, tSNE + k-means and SC3

[097] TABLE S3: Cluster specificity for all the datasets across all the algorithms
[098] To illustrate the results of RCC, inventors show examples of 1. TCGA pan-cancer bulk tissue dataset11,12 and 2. Neftel single-cell transcriptomic dataset13. The TCGA pan-cancer dataset was divided up to four levels and the initial four rows of panel i indicate the number of markers found at each level for each cluster (FIG 2A). The second panel j is the RCC clustering level panel. It shows the cluster information for samples at each level starting with level one. The third panel k is the annotation panel. The annotation panel shows the distribution of samples across all the clusters. In the TCGA dataset, the data is shown to be divided into ten clusters at level one. These clusters at level one are then further divided recursively. RCC found a total of 124 clusters in this dataset with various subtypes in each cancer type. RCC was able to find significantly distinct survival profiles in 12 different cancer types across 10 runs. As an example, inventors show subtypes of melanoma (SKCM), highlighted in a red box (FIG 2A). RCC was able to find four subtypes in the SKCM cancer type with significantly distinct marker genes and survival profiles (FIG 2B and C). Similar molecular profiles were previously found and discussed19.
[099] For single-cell data, inventors show the results of applying RCC on the Neftel dataset, which includes single-cell RNA-sequencing of 28 glioblastoma tumors (pediatric and adult). The dataset has 7930 cells and 31 different known (three non-malignant and 28 patient-wise malignant) cell types. RCC was able to divide the non-malignant cells based on their individual cell types irrespective of their source tissue, whereas the malignant cells divided not only by the patient but also by malignant cell type. Neftel et al. showed that the malignant GBM cells are majorly found in four cellular states which are determined by six meta modules based on the following gene signatures: hypoxia independent (MES1like) and hypoxia dependent (MES2like) mesenchymal related gene sets, astrocytic (AClike) marker gene set, oligodendroglial (OPClike) lineage marker gene set, and stem and progenitor cell signatures (NPC1like and NPC2like), as well as two cell cycling modules, namely G1S and G2M. They were able to find these subtypes of malignant cells based on extensive data analysis using domain knowledge. Using RCC with minimal user input, inventors find similar subtypes in malignant cells (FIG 2D). FIG 2E shows the concordance between the single-cell module assignment vs. the cluster module assignment (detailed description in methods). RCC clusters are more homogeneous with respect to the cellular states of cells as compared to the other algorithms (FIG 2E). Inventors are also able to find subtypes in non-malignant cells not described in the original study (Biological inferences in methods). The macrophage clusters reflect different cell populations which are either microglia-like (brain resident) or macrophage-like (blood derived)20 and are in different tumor microenvironments (hypoxic vs. non-hypoxic) (FIG 2F).
[0100] In conclusion, RCC is a user-friendly data clustering algorithm that can be used on both bulk and single-cell transcriptomic datasets. RCC analysis facilitates novel subtype discovery in the transcriptomic data allowing the user to tease out unknown finer structure in large datasets through intuitive visualization of clustering results.
[0101] RCC algorithm: RCC algorithm has six major steps: 1) Data input 2) Feature selection and data scaling 3) Parallel processing of ConsensusClusterPlus 4) Optimal k selection 5) Recursive clustering and 6) Output. Each step is described in detail below.
[0102] Data input: RCC takes an expression matrix (X) and configuration file as input. The expression matrix is a csv file where the columns represent samples/cells and the rows represent expression values for genes/transcripts. The input matrix should be normalized and log-transformed after adding a pseudo count of 1. Along with the expression matrix, the user also has the option of uploading a sample information file which contains clinical information or attributes for the samples/cells to be clustered. The configurable parameters with default values are discussed at the end of this section.
[0103] Feature selection and data scaling: The current clustering algorithms either use all the available genes or top variant genes or top PCA/ScHPF21 components for clustering the datasets, which gives us only a global view/subgroups of a dataset. In RCC, initially all samples and top n% variant genes are selected, then normalized and clustered as described in steps 3-4. Based on the previous level of clustering, the feature set changes for subsequent recursive runs. Inventors tested multiple datasets with different top n% variant genes and found that the top 1% of variant genes work well for the single-cell transcriptomic data, whereas the top 3 - 5% genes work well for bulk tissue data. This is a user-defined parameter in RCC where the user can select top n% of variant genes/features to be selected for further clustering. For a bulk-transcriptome dataset, a minimum of 500 genes/features are required for further clustering. If the top n% variant genes/features are < 500, RCC by default takes the top 500 variant genes for further analysis. For single-cell transcriptomic data the number of genes detected per cell can be as low as 200 hence no minimum criteria is applied. After subsetting the top n% variant genes/features, data scaling is done using the following formula:
[0104] z = (X – µ) / s, where X is the expression measure of a gene in a cell/sample, µ is the mean expression measure of a gene across all the cells/samples and s is the standard deviation of expression measure of a gene across all the cells/samples.
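The feature selection and scaling step described above can be sketched as follows. This is a minimal illustration in plain Python, not the RCC implementation (RCC is an R package); the function names, the dictionary layout, and the use of the sample standard deviation are assumptions made for the example.

```python
import statistics

def select_top_variant_genes(expr, top_pct, min_genes=0):
    """Return the top `top_pct` percent most variant genes.

    expr: dict mapping gene name -> list of log-transformed values, one per sample.
    min_genes: floor on the number of genes kept (the text uses 500 for bulk
    data and no floor for single-cell data).
    """
    ranked = sorted(expr, key=lambda g: statistics.pvariance(expr[g]), reverse=True)
    n_top = max(int(len(ranked) * top_pct / 100), min_genes)
    n_top = max(1, min(n_top, len(ranked)))  # never more genes than are available
    return ranked[:n_top]

def zscore_genes(expr, genes):
    """Scale each selected gene as z = (X - mu) / s across samples."""
    scaled = {}
    for g in genes:
        values = expr[g]
        mu = statistics.mean(values)
        s = statistics.stdev(values)  # assumption: sample standard deviation
        # A constant gene has s == 0; it is mapped to zeros in this sketch.
        scaled[g] = [(x - mu) / s if s > 0 else 0.0 for x in values]
    return scaled
```

Because the selection is repeated on every subset, the gene list changes from one recursion level to the next, which is what lets later levels pick up finer structure.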
a. Parallel processing of ConsensusClusterPlus: The base algorithm used for RCC is ConsensusClusterPlus14 (CCP) which performs k-means clustering on the dataset. CCP is run in parallel eight times with a different seed each time for increased robustness. The output of CCP includes stability evidence for a given number of clusters (k) and their assignments. CCP is run each time with the following parameters:
b. Maximum number of clusters: The maximum number of clusters (maxK) for each run is calculated by taking the minimum of 10 and one tenth of the number of samples. maxK changes for each recursive run of RCC. Inventors limit the maxK to 10 because, based on our testing, the best k selection is not reliable for larger k.
c. Number of repeats for each run: Each of the eight parallel processes has a repeat count of 100.
d. Proportion of items and features to sample (pItem and pFeature): As discussed earlier, RCC runs CCP in parallel eight times to get stable clusters. To enable selection of the most robust clusters, the pItem and pFeature values are changed for every run. The selected pItem values range from 0.6-0.9 whereas the pFeature values used are 0.8 and 1. Different combinations of the mentioned values are used across the runs.
e. Clustering algorithm: RCC uses k-means algorithm with Euclidean distance. K-means is applied on the z-scored data matrix with the Hartigan and Wong algorithm. By default, the number of centers is set to 10 and the maximum number of iterations is set to 10^9 in the k-means function.
f. Seed: The k-means clustering algorithm requires an initial seed number to generate clusters. The initial seed value is crucial for the clustering of data as it affects the repeatability and reproducibility of the results. To find robust clusters RCC generates random seed values for all of the eight runs of CCP.
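The per-run settings listed above can be illustrated with a short sketch. The text gives only the pItem range (0.6-0.9) and the two pFeature values, so the specific pItem grid below, the floor division used for "one tenth of the number of samples", and the function names are assumptions chosen so that the combinations yield exactly eight runs.

```python
import itertools
import random

def max_k(n_samples, cap=10):
    """maxK = min(10, one tenth of the number of samples); floor is an assumption."""
    return min(cap, n_samples // 10)

def ccp_run_settings(n_samples, reps=100, rng=None):
    """Build one settings dict per parallel CCP run (eight in total)."""
    rng = rng or random.Random()
    p_items = [0.6, 0.7, 0.8, 0.9]  # assumption: evenly spaced grid within 0.6-0.9
    p_features = [0.8, 1.0]         # values stated in the text
    return [
        {
            "maxK": max_k(n_samples),
            "reps": reps,
            "pItem": p_item,
            "pFeature": p_feature,
            "seed": rng.randrange(1, 10**6),  # a fresh random seed per run
        }
        for p_item, p_feature in itertools.product(p_items, p_features)
    ]
```

Since maxK depends on the number of samples, it shrinks as the recursion descends into smaller subsets.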
[0105] Optimal k selection: Based on the cumulative distribution function (CDF) plots produced by CCP, RCC finds the best k15 (Simulated data in methods) for each process as described below, in turn giving eight k values. The k with maximum frequency is selected as the optimal k for the clustering. An example CDF plot of consensus matrices for k = 2:6 is shown in FIG 5. The consensus matrix of N samples for 100 runs of clustering is an N x N matrix with each cell containing the number of times the row and column samples cluster together. For a given consensus matrix M, CDF is defined over the range of 0 to 1.
[0106]

[0107] Consensus index c varies from 0 to 100. The perfect CDF plot for a given k (number of clusters), where for every run inventors get the same result, will have a CDF line with slope 0° starting at c = 0 and ending at c = 99. The perfect CDF value will be

[0108] FIG 5 shows an example of CDF plots for k = 2,3,4,5,6. Top panel shows the consensus matrices for each k and the bottom panel shows the CDF plot with slope and line length calculations. The dot-dash line shows the maximum CDF value for each k.
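To make the consensus matrix and its CDF concrete, the sketch below builds a count-based consensus matrix from repeated clusterings and evaluates an empirical CDF over the consensus index c = 0..100. It assumes the standard consensus clustering definition, in which CDF(c) is the fraction of sample pairs (i < j) whose consensus value is at most c; it also ignores item subsampling (pItem < 1), so every sample is taken to appear in every run. The function names are illustrative only.

```python
from itertools import combinations

def consensus_matrix(runs):
    """Count, for every sample pair, how many runs placed them in the same cluster.

    runs: list of label lists, one list per clustering run, all of equal length.
    """
    n = len(runs[0])
    m = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1
    return m

def consensus_cdf(m, n_runs):
    """Empirical CDF over consensus index c = 0..n_runs, using pairs with i < j."""
    pairs = [m[i][j] for i, j in combinations(range(len(m)), 2)]
    return [sum(v <= c for v in pairs) / len(pairs) for c in range(n_runs + 1)]
```

With 100 identical runs, every pair has a count of either 0 or 100, so the CDF is flat from c = 0 to c = 99 and jumps to 1 at c = 100, which matches the description of a perfect CDF line above.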
[0109] Line length and slope calculation
[0110] The clustering of real data typically does not result in a perfect CDF value, so inventors create an allowance of 0.5. Inventors take all CDF values in that range and fit a line through them. If the slope of the line is > minimum slope threshold, inventors trim the line until the slope falls below the threshold or the line length becomes smaller than the minimum line length.
[0111] Intra cluster stability calculation

[0112] Inter cluster overlap calculation

[0113] Differentially expressed genes
[0114] K-means clustering algorithms tend to find stable clusters even when the sample distribution is random but asymmetric, resulting in sub-classification without biological relevance22. This parameter helps in selecting only biologically meaningful clusters for a dataset. To find the clusters with biological significance, RCC calculates differentially expressed genes across all the clusters. The clustering is selected only if it has n% of genes up regulated with FDR < 0.01 in at least one of the clusters.
[0115] Best k selection
[0116] Inventors selected the best k based on the following criteria:
[0117] a. Line length is > minimum line length (default: 30)
[0118] b. Line slope is < minimum slope threshold (default: 10°)
[0119] c. Intra cluster stability > 0.8
[0120] d. Inter cluster overlap < 0.2
[0121] e. Differentially expressed genes > minimum % of genes in at least one of the clusters (default: 20% for large dataset, 10% for small dataset).
[0122] If multiple k values are selected then inventors break the tie using the below criteria:
[0123] f. Inventors assign weights to each k if the slope of k is ≤ 5° and if line length is ≥ 40. The k with maximum weight is selected as the best k (if multiple ks have equal weight all of them are selected as the best k).
[0124] In the example FIG 5, it can be seen that k = 2,3,4,5 satisfy the criteria a:d. All the ks also satisfied criteria e. Based on f, k = 3,4,5 is selected as the best k. Inventors get similar values from the remaining eight runs which in turn gives us an array of best ks. The optimal k is selected out of these best ks based on frequency distribution. The best k with the highest frequency is selected as the optimal k.
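The selection logic in this section can be sketched as a filter, a tie-break, and a frequency vote. This is an illustrative reading rather than the RCC code: the dictionary keys, the function names, and the one-point-per-condition weighting are assumptions, as is the reading of the tie-break thresholds as slope at most 5° and line length at least 40. The text does not say how a tie in frequency is resolved; the sketch simply keeps the first k encountered.

```python
from collections import Counter

def best_ks(stats, min_len=30, max_slope=10.0, min_de_pct=10.0):
    """Return the best k value(s) for one CCP run.

    stats: dict mapping k -> dict with keys 'line_length', 'slope',
    'intra', 'inter' and 'de_pct' (hypothetical names for the five measures).
    """
    passing = [
        k for k, s in stats.items()
        if s["line_length"] > min_len
        and s["slope"] < max_slope
        and s["intra"] > 0.8
        and s["inter"] < 0.2
        and s["de_pct"] > min_de_pct
    ]
    if len(passing) <= 1:
        return passing
    # Tie-break: one point each for a flatter slope and a longer line.
    weight = {
        k: int(stats[k]["slope"] <= 5) + int(stats[k]["line_length"] >= 40)
        for k in passing
    }
    top = max(weight.values())
    return sorted(k for k in passing if weight[k] == top)

def optimal_k(best_k_per_run):
    """Pick the most frequent best k across runs; 0 means no valid clustering."""
    counts = Counter(k for ks in best_k_per_run for k in ks)
    if not counts:
        return 0
    return max(counts, key=counts.get)
```

An optimal k of 0 is the signal that the recursion uses to stop subdividing a subset.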
[0125] Recursive clustering: Once the optimal k for a particular level is selected, each of the subdivided datasets goes on for further clustering. The samples/cells get recursively clustered until one of the following criteria is met:
i. Number of samples/cells in the dataset is less than the minimum number of samples required for clustering
ii. Optimal k is zero
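The recursion and its two termination criteria can be sketched as below. The `cluster_once` callback stands in for one full level of clustering (feature selection, the eight CCP runs and optimal k selection) and is hypothetical, as is the use of tuples to record the path of cluster numbers through the levels.

```python
def recursive_cluster(samples, cluster_once, min_samples=20, path=()):
    """Recursively subdivide samples and return {sample: path of cluster ids}.

    cluster_once(samples) must return (k, labels), where k is the optimal k
    (0 if none was found) and labels holds a cluster id in 1..k per sample.
    """
    # Criterion i: too few samples to attempt clustering.
    if len(samples) < min_samples:
        return {s: path for s in samples}
    k, labels = cluster_once(samples)
    # Criterion ii: no valid clustering was found at this level.
    if k == 0:
        return {s: path for s in samples}
    assignments = {}
    for c in range(1, k + 1):
        subset = [s for s, label in zip(samples, labels) if label == c]
        assignments.update(
            recursive_cluster(subset, cluster_once, min_samples, path + (c,))
        )
    return assignments
```

The length of each returned path is the level at which that sample's cluster stopped dividing, which corresponds to the per-level cluster information reported in the output.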
[0126] Output: Upon submission of an expression matrix and their respective annotations, RCC gives the following output:
[0127] Cluster information file (ClusterInfo.csv): The basic output of RCC is the cluster information file which has the cluster assignment of all the samples in the submitted datasets at every level along with the final cluster assignment.
[0128] Cluster annotation plot (atrribute_vs_algorithm.pdf): The cluster annotation plot allows the user to view the cluster assignment of each sample based on a particular attribute using the clusterAnnotation function. The Cluster annotation plot shows the distribution of all the attributes (e.g. cell types or tissue types) across all the clusters. The columns represent the attributes present in the dataset and the rows represent the clusters (FIG 6).
[0129] Tracking plot (trackingPlot.pdf): The trackingPlot function is a visualization tool that allows the user to view the clustered data in an easy and interpretable manner (FIG 7). Columns in the tracking plot correspond to the samples/cells. The first panel i is the marker panel which shows the number of marker genes identified for each cluster at a given level. The first row is markers for clusters found at level one, the second one for markers found at level two, and so on. The color-scale is from white to red with white indicating zero markers found and bright red indicating the highest number of markers found. Grey color indicates that the particular cluster is not further divided and hence no new markers are found which are distinct from the previous level. The second panel j is the RCC clustering level panel. It shows the cluster information for samples at each level starting with level one. The third panel k is the annotation panel. The annotation panel shows the distribution of samples across all the clusters.
Gene markers plot (Level_markers.pdf): The geneMarkers function in RCC calculates the specific genes which are significantly up regulated in the clusters at each level with FDR < 0.01 and log2 fold change values > 1. These genes are the specific markers for their representative clusters. RCC calculates and plots the markers across the levels making it easier for the user to visualize the data clustering. In the marker plots, the columns represent the samples/cells and the rows represent the genes.
[0130] SSGSEA analysis (ssgsea.pdf): To find the biological significance of the clusters, RCC allows the user to perform a single sample gene set enrichment analysis (SSGSEA)23 using the function ssgsea. This function allows the user to find the enrichment of particular gene sets in all the samples of the clusters. The user can input the gene sets of their importance in CSV format to perform the SSGSEA analysis. The SSGSEA heatmap shows the enrichment of all the gene sets across all the samples. In the plot, the columns represent the samples/cells and rows represent the gene sets. For example, FIG 8 shows the SSGSEA plot of glioma samples from the TCGA dataset which allows us to look into the enrichment of cancer hallmarks in each cluster.
[0131] Cluster attribute enrichment analysis (atrribute_vs_algorithmFE.csv): When a sample information file that contains clinical information or attributes is provided, the user can perform cluster attribute enrichment analysis using the function clusterAttr. This function implements Fisher's exact test to find a significant correlation between the attributes (categorical variable) and clusters. This helps in understanding the biological/clinical significance of the cluster.
[0132] Kaplan–Meier analysis (survival.pdf): RCC allows the user to perform survival analysis for all the clusters using the function clusterSurvival.
[0133] Output of RCC analysis described in this manuscript are available as Supplementary Data.
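As an illustration of the cluster attribute enrichment analysis listed among the outputs, the sketch below runs a two-sided Fisher's exact test on a 2 x 2 table of cluster membership against one attribute value. It is not the clusterAttr implementation; testing one attribute value against one cluster at a time, and the function names, are assumptions made for the example.

```python
from math import comb

def attribute_cluster_table(attrs, clusters, attr_value, cluster_id):
    """Build the 2 x 2 counts (a, b, c, d) for one attribute value and one cluster."""
    a = b = c = d = 0
    for attr, cluster in zip(attrs, clusters):
        in_attr, in_cluster = attr == attr_value, cluster == cluster_id
        if in_cluster and in_attr:
            a += 1
        elif in_cluster:
            b += 1
        elif in_attr:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]] with fixed margins."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

A small p-value for a given attribute value and cluster suggests that the cluster is enriched for, or depleted of, samples carrying that attribute.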
[0134] RCC configurable parameters: Along with the input matrices, RCC also requires an input configuration file. This file allows the user to adjust the clustering parameters or use the default ones based on their requirements. The configuration file is in the .csv format where the first column indicates the parameter name and second column indicates the parameter value. The configurable parameters are:
[0135] Input expression matrix.
[0136] Input annotation file: This is an optional parameter. If one does not have any annotation for the sample dataset this field can be left as NA.
[0137] Minimum slope threshold: The default value is 10°. The lower the threshold, the tighter the clustering. Recommended values are between 5-15°.
[0138] Minimum number of samples required for clustering: This option allows the user to decide the minimum sample size that is sufficient for further sub grouping. The default value is 20. A smaller number will likely result in a larger number of clusters.
[0139] Minimum line length: The default value is 30. Perfect clustering will have a line length of 99. The higher the value, the tighter the clusters. Recommended values are between 30-60.
[0140] Percentage of genes/features to be used for clustering: Inventors use the top 3 - 5% of variant genes in bulk data and the top 1% of variant protein-coding genes in single-cell data for clustering. For bulk data, the recommended values are between 2-5% and for single-cell data, it is 1-2%.
[0141] Minimum percentage of genes that are differentially expressed: The default value for large datasets (i.e. datasets having more than 1000 cells/samples) is 20% and for small datasets, it is 10%.
[0142] Type of dataset: RCC accepts both bulk as well as single-cell transcriptomic datasets. This parameter is used by RCC for determining feature selection criteria. If the data is bulk tissue, then a minimum of 500 or top n% genes/features (whichever is higher) are taken for clustering.
[0143] Output directory: Absolute path to the folder where the user wants to output the results.
Benchmarking: Inventors used the datasets mentioned in FIG 1B for testing the algorithm. All the bulk datasets were normalized using the DESeq24 package in R and single-cell data using Counts Per Million (CPM) and then log2 transformed before running RCC. To benchmark RCC, inventors used the algorithms given in Table S1 on the bulk as well as single-cell transcriptomic datasets.
[0144] Inventors used the default parameters for all algorithms and provided the number of clusters where needed. For the algorithms which do not have the optimal k selection feature, inventors provided the same number of clusters presented in the original publications. The optimal k and Adjusted Rand Index (ARI) for each dataset across all the algorithms are shown in Table S2. ARI is calculated using the mclust16 package in R. Identification of novel subtypes results in lower ARI values. Since RCC results have a significant number of subtypes for larger datasets, inventors see lower ARI values. To show that the novel subtypes are subsets of known attributes, inventors calculate a cluster specificity score for each clustering result as the percentage of clusters that are composed largely of samples having the same attribute, i.e. dominant attribute type proportion >= 90%. RCC did a better job of finding novel subtypes (described in the Biological Inferences section) as well as dividing the datasets accurately based on the given annotations in the original publications (FIG 1B and FIG 6, Table S3).
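The cluster specificity score defined above lends itself to a short worked example. The sketch below follows the stated definition (percentage of clusters whose dominant attribute accounts for at least 90% of their samples); the function name and argument layout are illustrative.

```python
from collections import Counter, defaultdict

def cluster_specificity(clusters, attributes, threshold=0.9):
    """Percentage of clusters whose dominant attribute proportion is >= threshold.

    clusters: cluster id per sample; attributes: known attribute per sample.
    """
    members = defaultdict(list)
    for cluster, attribute in zip(clusters, attributes):
        members[cluster].append(attribute)
    specific = sum(
        1
        for attrs in members.values()
        if Counter(attrs).most_common(1)[0][1] / len(attrs) >= threshold
    )
    return 100.0 * specific / len(members)
```

For example, with one cluster made entirely of attribute A and a second cluster split evenly between B and C, only the first cluster is specific, giving a score of 50%. Unlike ARI, this score is not penalized when a known attribute is split into several pure sub-clusters.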
[0145] Best k selection using simulated datasets: Inventors generated simulated datasets using the CIDR25 package in R to check if RCC was able to select the best k. Two simulated matrices were generated where mat1 had five subgroups with an equal number of samples in each, whereas mat2 had five subgroups with varying proportions of samples in them. The best k found by RCC is the same as the true k in the simulated dataset (FIG 9).
[0146] Cluster stability: To check for the stability of our algorithm, inventors ran RCC 1000 times on Ivy GAP, Biase and Pollen datasets, 100 times on the Darmanis dataset and 10 times on Human tissue, TCGA pan-cancer, and Neftel datasets. Inventors calculated ARI (FIG 1C) and plotted the heatmap of the Consensus Clustering Matrix to see the consistency of clustering in each dataset (FIG 1D).
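The run-to-run comparison used for the stability check relies on the Adjusted Rand Index. The text computes it with the mclust package in R; the pure Python function below is a stand-in that follows the standard pair-counting formula, shown only to make the measure concrete.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same samples.

    Requires at least two samples. Returns 1.0 for identical partitions
    (up to relabeling) and values near 0 for chance-level agreement.
    """
    n = len(labels_a)
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

In a stability check of this kind, each run's final cluster labels would be compared against those of one reference run, and the resulting values summarized per dataset.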
[0147] Level cutoff: The novel feature of RCC is its recursive nature. RCC keeps dividing the data as long as there is significant variance in it. Once RCC is run, the user can use the tracking plot generated by RCC to visualize the clustering of the dataset for sample attributes. The tracking plot provides an intuitive and convenient way to interpret the results and derive biological insights. If the user requires clustering only up to a particular level, because any further clustering might not be relevant, they can use the cutoff function to allow RCC to cluster the data points only up to that level. This feature allows the user to control the subdivision of their data where needed. FIG 10 shows the before and after cutoff clustering for the TCGA pan-cancer data. Initially, RCC finds a total of 138 clusters with division up to four levels. After applying the cutoff function at level three, RCC finds a total of 118 clusters.
[0148] EXAMPLE EMBODIMENTS OF THE INVENTION
[0149] The present invention is further elaborated with the help of the following examples. However, these examples should not be construed to limit the scope of the present invention.
[0150] The main objective of RCC is to find novel subtypes automatically with minimal user input that can help in generating novel biological insights. RCC was able to find novel subtypes in the majority of the datasets which show biological significance through biological inferences.
[0151] EXAMPLE 1: IVY GAP DATASET
[0152] The Ivy GAP5 dataset contains the RNA-Seq profiles of anatomic structures in glioblastoma, grade IV brain cancer. These anatomic structures include Leading Edge (LE), Cellular Tumor (CT), Microvascular Proliferation (MVP) and Pseudopalisading Cells around Necrosis (PAN) which are described in detail in the original paper. As shown in FIG 1D, RCC finds a subset of CT and PAN samples that have a distinct enrichment of proneural signature26. The gene markers for each cluster are shown in FIG 4.
[0153] EXAMPLE 2: HUMAN TISSUE DATA
Human tissue data contains RNA-Seq data across 53 different tissues, lymphoblastoid cell lines and transformed fibroblast cell lines. RCC is more sensitive in comparison to other algorithms in clustering the samples (FIG 6 and FIG 11, Table S3). Dey et al.10 describe dividing the dataset into 20 clusters, which does not cover the 53 tissue types of the dataset. Inventors performed Fisher's exact test using the clusterAttr function to test whether clinical parameters like age, gender and cause of death correlate significantly with the clusters formed, i.e. whether RCC can segregate the data based on these attributes. Out of 53 tissue subtypes, inventors found that seven tissue subtype clusters showed significant correlation with gender, nine showed significant correlation with age and 26 showed significant correlation with cause of death across 10 runs of RCC (Supplementary file 2). RCC could find multiple biologically relevant subtypes across 13 tissues which other algorithms were not able to find. As an example, for lung tissue RCC found eight clusters, whereas tSNE + k-means and hclust could find only two and one clusters respectively.
On further analysis inventors found that cluster one shows enrichment of myeloid leukocyte activation and cell activation gene sets and patients with decreased pulmonary function, cluster two has gene sets with extracellular matrix component and subjects who suffered fast deaths, cluster three shows enrichment of myeloid leukocyte migration and cell migration gene sets, cluster four shows genes involved in immunoglobulin receptor binding and protein activation cascade, cluster five shows up regulation of genes involved in positive regulation of protein secretion and plasminogen activation, cluster six shows up regulation of genes involved in cytokine activation and regulation of vasculature development, cluster seven shows up regulation of genes involved in pulmonary embolism, complement activation and classical pathway, and cluster eight has up regulation of genes which have cilium cellular component (FIG 12).
[0154] EXAMPLE 3: TCGA PAN-CANCER DATA
[0155] TCGA pan-cancer11,12 dataset includes RNA-Seq data of 32 different cancer types. RCC was able to find 124 clusters in this dataset resulting in novel cancer subtypes. Inventors performed survival analysis on subtypes for each cancer type to find if there were any significant survival differences among the subtypes. Chen et al.3 and Hoadley et al.4 have found ten and 28 clusters respectively to divide the TCGA pan-cancer data based on the multi-omic data analysis using the cluster of clusters approach. Inventors performed Kaplan-Meier analysis for RCC, tSNE + k-means, hclust and cluster assignments by Chen et al. and Hoadley et al. to evaluate which of the clustering assignments are clinically relevant. Inventors used patient survival data as a measure of clinical relevance. In our comparison, inventors found that using RCC clusters, there are 10 cancers with subtypes which show significant survival profiles with p-value < 0.01 across 10 runs; for tSNE + k-means clusters there are three, five for hclust, three for Hoadley et al. and none for Chen et al. As shown in Table S6, RCC did a better job of finding cancer subtypes with significant survival profiles. The survival profiles of the top five cancers (KIRC, LGG, SARC, SKCM and UCEC) are shown in FIG 13. An elaborate example of the same is SARC, which further divides into four clusters. All four SARC clusters show significantly distinct survival profiles with cluster four having the best prognosis and cluster three having the worst. SSGSEA analysis with the cancer hallmarks gene set27 of SARC clusters showed that each of the clusters has gene sets with distinct enrichment. Cluster one shows enrichment in glycolysis and DNA repair gene sets, cluster two shows enrichment in myogenesis gene set, cluster three shows enrichment of genes down regulated in KRAS pathway and genes involved in Hedgehog signaling, whereas cluster four shows enrichment in complement pathway and apoptosis gene sets (FIG 14).
[0156] EXAMPLE 4: POLLEN DATASET
[0157] Pollen7 dataset includes single-cell data of 11 cell types which can be broadly divided into four subtypes namely: blood cells, dermal/ epidermal cells, neural cells and pluripotent cells. All the cell types are grouped individually at level one itself. Further at level two it was observed that the BJ cell line which are pluripotent stem cells (hiPSCs) originally derived from neonatal male human foreskin fibroblasts are consistently getting subdivided into three clusters across multiple runs (FIG 15). On further analysis inventors found that cluster two consisted of cells with expression of genes involved in RNA binding, structural constituent of ribosome and establishment of protein localization to endoplasmic reticulum, cluster three showed enrichment of genes involved in post-embryonic development and posttranscriptional regulation of gene expression, whereas cluster one had cells which did not show enrichment of genes up regulated in either of the clusters (FIG 15).
[0158] EXAMPLE 5: DARMANIS DATASET
[0159] Darmanis et al.8,9 have generated two single-cell RNA-Seq datasets: One in 2015 using human adult cortical samples and another one in 2017 using four GBM patient samples. Inventors combined both the datasets which include 4055 cells and 15 known cell types. RCC found 43 clusters across these cell types. RCC divided the non-malignant cells based on their individual cell types irrespective of their source tissue whereas the malignant cells showed to be divided by the patient types as one would expect. RCC also divided the Oligodendrocytes (2015 and 2017 combined) into three subtypes. The oligodendrocyte clusters showed enrichment of kinase binding and mRNA splicing genes in cluster one, regulation of protein polymerization, myelin sheath and ferric ion binding genes in cluster two and ganglioside GT1b binding, membrane raft polarization and myelination genes enrichment in cluster three (FIG 16). Similarly RCC also divides Endothelial and Fetal Quiescent cells into three clusters each (FIG 16). In endothelial cells, cluster one shows genes involved in blood vessel and vasculature development processes, cluster two shows cell adhesion specific gene markers involved in blood vessel morphogenesis process and, cluster three markers showed to be playing a part in extracellular matrix formation. In Fetal Quiescent cell types cluster two shows enrichment of genes involved in axon guidance pathway and semaphorin receptor activity, cluster three shows enrichment of genes involved in kinase activity and neurotrophin TRKC receptor binding whereas cluster one had cells which did not show enrichment of genes up regulated in either of the two clusters.
[0160] EXAMPLE 6: NEFTEL DATASET
[0161] The Neftel13 dataset consists of single-cell RNA-Seq profiles from 28 adult and pediatric glioblastoma patients. After extensive analysis, the authors of the original paper identified six meta modules: mesenchymal-like hypoxia-independent (MES1like) and hypoxia-dependent (MES2like) modules, astrocyte-like (AClike), oligodendrocytic precursor-like cells (OPClike), and neural progenitor-like cells (NPC1like and NPC2like). They also assigned cells a score based on their G1S and G2M cell cycle state. They concluded that there are largely four cellular states of glioblastoma cells (MES, AC, OPC, NPC) and each patient has varying proportions of these states, proportions which are influenced by genetics and microenvironment. At gene expression level, the cells are more similar to each other based on their cellular state rather than based on their parent tumor. Results of RCC recapitulate these findings, as can be seen in the pie charts demonstrating cells from each tumor clustering based on their meta module score (FIG 2D and FIG 17). Based on the findings of the original paper, if inventors take as attribute the cellular state, which is defined by the highest meta module score for each cell, and label each cluster based on the highest average meta-module score for cells in that cluster, then inventors can see that RCC clusters cells with similar cellular state well (FIG 2E). tSNE + k-means as well as SC3 are not able to identify clusters of highly cell cycling genes (highlighted in FIG 2E). RCC is also able to identify subtypes of non-malignant cells like macrophages (FIG 2F).
[0162] All gene enrichment analyses were performed using ToppFun28.
[0163] System configurations: All the algorithms were run on a computer with an Intel® Xeon® CPU E5-2630 v4 processor running at 2.20 GHz × 40, with 251.8 GiB of RAM, running Linux version 16.04.
[0164] Data and code availability: Our methods are implemented as an R package RecursiveConsensusClustering, available on GitHub. All the input and output files for all the mentioned datasets are also provided for download on https://www.msctr.org/2019/05/30/recursive-consensus-clustering/
2. USER INTERFACE:
[0165] FIG 18 shows the user interface model for RCC.
[0166] Sample User Interfaces
[0167] FIG. 18 depicts sample user interfaces that enable users of RCC to cluster and analyse sequence data in one embodiment. The RCC tool is developed using the R statistical package. The tool can be used either as a standalone application or through a web interface, as seen in FIG 18.
[0168] 3. HARDWARE
[0169] Digital Processing System
[0170] FIG 19 is a block diagram illustrating the details of digital processing system 500 in which various aspects of the present invention are operative by execution of appropriate execution modules. Digital processing system 500 may correspond to one or more systems of FIG. 1 and FIG 2.
[0171] Digital processing system 500 may contain one or more processors (such as a central processing unit (CPU) 501), random access memory (RAM) 502, secondary memory 503, graphics controller 506, display unit 507, network interface 508, and input interface 509. All the components except display unit 507 may communicate with each other over communication path 505, which may contain several buses, as is well known in the relevant arts. The components of FIG. 19 are described below in further detail.
[0172] CPU 501 may execute instructions stored in RAM 502 to provide several features of the present invention. CPU 501 may contain multiple processing units, with each processing unit potentially being designed for a specific task. Alternatively, CPU 501 may contain only a single general-purpose processing unit. RAM 502 may receive instructions from secondary memory 503 using communication path 505.
[0173] Graphics controller 506 generates display signals (e.g., in RGB format) to display unit 507 based on data/instructions received from CPU 501. Display unit 507 contains a display screen to display the images defined by the display signals. Input interface 509 may correspond to a keyboard and a pointing device (e.g., touch-pad, mouse), which enable the various inputs to be provided.
[0174] Network interface 508 provides connectivity to a network (e.g., using Internet Protocol), and may be used to communicate with other connected systems. Network interface 508 may provide such connectivity over a wire (in the case of TCP/IP based communication) or wirelessly (in the case of WIFI, Bluetooth based communication).
[0175] Secondary memory 503 may contain hard drive 503a, flash memory 503b, and removable storage drive 503c. Secondary memory 503 may store the data and software instructions, which enable digital processing system 500 to provide several features in accordance with the present invention.
[0176] Some or all of the data and instructions may be provided on removable storage unit 504, and the data and instructions may be read and provided by removable storage drive 503c to CPU 501. Floppy drive, magnetic tape drive, CD-ROM drive, DVD Drive, Flash memory, removable memory chip (PCMCIA Card, EPROM) are examples of such removable storage drive 503c.
[0177] Removable storage unit 504 may be implemented using a storage format compatible with removable storage drive 503c such that removable storage drive 503c can read the data and instructions. Thus, removable storage unit 504 includes a computer readable storage medium having stored therein computer software (in the form of execution modules) and/or data.
[0178] However, the computer (or machine, in general) readable storage medium can be in other forms (e.g., non-removable, random access, etc.). These “computer program products” are means for providing execution modules to digital processing system 500. CPU 501 may retrieve the software instructions (forming the execution modules) and execute the instructions to provide various features of the present invention described above.
[0179] The digital processing system may correspond to each of the user system (local or remote) and the server noted above. The digital processing system may contain one or more processors (such as a central processing unit (CPU)), random access memory (RAM), secondary memory, a graphics controller (GPU), a primary display unit, network interfaces (e.g., WLAN), and input interfaces (not shown).
[0180] CPU executes instructions stored in RAM to provide several features of the present invention. CPU may contain multiple processing units, with each processing unit potentially being designed for a specific task. Alternatively, CPU may contain only a single general purpose processing unit. RAM may receive instructions from secondary/system memory.
[0181] Graphics controller (GPU) generates display signals (e.g., in RGB format) to primary display unit based on data/instructions received from CPU. Primary display unit contains a display screen (e.g. monitor, touchscreen) to display the images defined by the display signals. Input interfaces may correspond to a keyboard, a pointing device (e.g., touch-pad, mouse), a touchscreen, etc. which enable the various inputs to be provided. Network interface provides connectivity to a network (e.g., using Internet Protocol), and may be used to communicate with other connected systems. Network interface may provide such connectivity over a wire (in the case of TCP/IP based communication) or wirelessly (in the case of WIFI, Bluetooth based communication).
[0182] Secondary memory may contain hard drive (mass storage), flash memory, and removable storage drive. Secondary memory may store the data (e.g., the specific requests sent, the responses received, etc.) and executable modules, which enable the digital processing system to provide several features in accordance with the present invention.
[0183] Removable storage unit may be implemented using storage format compatible with removable storage drive such that removable storage drive can read the data and instructions. Thus, removable storage unit includes a computer readable storage medium having stored therein executable modules and/or data. However, the computer (or machine, in general) readable storage medium can be in other forms (e.g., non-removable, random access, etc.). CPU may retrieve the executable modules, and execute them to provide various features of the present invention described above.
[0184] As would be appreciated by a person skilled in the art, the invention provides the following advantages:
[0185] Optimal K selection: Finding the optimal number of clusters for a dataset is a crucial problem in clustering and dimensionality reduction. Current algorithms expect the user to provide this number prior to clustering, which is difficult for unannotated data. Very few techniques for automated selection of the optimal number of clusters perform to users' satisfaction. Most of these techniques depend on user input, while some, such as pam or mclust, determine the optimal number of clusters but are computationally intensive and fail on larger datasets. RCC uses an automated method for optimal K selection based on the CDF plots.
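As a purely illustrative aid, one CDF-based criterion is sketched below in Python. The specification does not set out RCC's exact rule at this point, so the sketch substitutes the proportion of ambiguous clustering (PAC) described by Şenbabaoğlu et al.15, computed as CDF(0.9) minus CDF(0.1) over the pairwise consensus values. The function names, the 0.2 cut-off, and the toy consensus matrices are assumptions. Returning zero when no k qualifies mirrors the behaviour in which RCC returns zero as the optimal k.

```python
def empirical_cdf(values, x):
    """Fraction of consensus values that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

def pac(consensus, lower=0.1, upper=0.9):
    """Proportion of ambiguous clustering: CDF(upper) - CDF(lower)."""
    n = len(consensus)
    # Use each sample pair once (upper triangle, diagonal excluded).
    pairs = [consensus[i][j] for i in range(n) for j in range(i + 1, n)]
    return empirical_cdf(pairs, upper) - empirical_cdf(pairs, lower)

def select_k(consensus_by_k, max_pac=0.2):
    """Return the k with the lowest PAC below max_pac, or 0 if none qualifies."""
    best_k, best_pac = 0, max_pac
    for k in sorted(consensus_by_k):
        score = pac(consensus_by_k[k])
        if score < best_pac:
            best_k, best_pac = k, score
    return best_k

# Hypothetical consensus matrices for four samples.
clean_k2 = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
noisy_k3 = [[1, .5, .5, 0], [.5, 1, 0, .5], [.5, 0, 1, .5], [0, .5, .5, 1]]
chosen = select_k({2: clean_k2, 3: noisy_k3})
```

A flat CDF between 0.1 and 0.9 means that almost every pair of samples is either always or never co-clustered, which is why the clean two-cluster matrix is preferred over the noisy three-cluster one in this toy case.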
[0186] Identification of hierarchical classification: The salient feature of RCC is recursive clustering. Standard clustering algorithms provide a global hierarchy of the dataset, whereas in real scenarios the data is most often hierarchical and can be divided further. RCC explores the local cluster hierarchy of a dataset, allowing it to find novel subgroups (e.g., RCC is able to identify not only cancer-specific clusters but also subgroups within each cancer type).
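The recursive idea can be sketched in a few lines of Python. This is a minimal illustration and not the RCC implementation: `gap_split` is a hypothetical toy stand-in for the consensus clustering step that cuts one-dimensional values at their widest gap, and the dotted labels such as "1.2" are an assumed naming scheme for the local hierarchy.

```python
def recursive_cluster(items, split, prefix=""):
    """Map each item to a hierarchical label; stop when split() finds no partition."""
    groups = split(items)
    if len(groups) < 2:
        # No valid split: every item keeps the label of the current node.
        return {item: prefix or "0" for item in items}
    labels = {}
    for index, group in enumerate(groups, start=1):
        child = f"{prefix}.{index}" if prefix else str(index)
        labels.update(recursive_cluster(group, split, child))
    return labels

def gap_split(values, ratio=0.5):
    """Toy splitter: cut sorted values at the widest gap if it dominates the range."""
    ordered = sorted(values)
    if len(ordered) < 3:
        return [ordered]
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    widest, at = max(gaps)
    if widest <= ratio * (ordered[-1] - ordered[0]):
        return [ordered]
    return [ordered[: at + 1], ordered[at + 1:]]

# Hypothetical data: two coarse groups, one of which hides two finer subgroups.
data = [1, 2, 3, 11, 12, 13, 100, 101, 102]
hierarchy = recursive_cluster(data, gap_split)
```

On this toy input a single global split separates only the values near 100 from the rest; the second, local pass on the remaining subset is what separates the 1 to 3 group from the 11 to 13 group.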
[0187] Biologically significant results: Standard clustering algorithms cluster the data based on a particular set of features. The crucial problem with this is that the algorithms sometimes keep grouping the data even when there is no clear variance among the data points. RCC addresses this limitation by applying the following filters:
f. Minimum inter and intra cluster stability: The minimum intra-cluster stability should be greater than 0.8 and the inter-cluster stability should be less than 0.2.
g. Minimum number of genes found differential: RCC checks that at least 10% (user tunable) of the features selected for clustering are differentially expressed, i.e. have FDR values less than 0.1 and log2 fold change values greater than one.
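The two filters above can be expressed as a short check. The following Python sketch uses the thresholds stated in the text, but the function name `passes_filters`, its argument layout, and the use of the absolute log2 fold change are assumptions made for illustration; it is not code from the RCC package.

```python
def passes_filters(intra_stability, inter_stability, fdr, log2fc,
                   min_intra=0.8, max_inter=0.2,
                   min_fraction=0.10, max_fdr=0.1, min_log2fc=1.0):
    """Return True when a candidate split passes both filters."""
    # Filter f: every cluster must be internally stable and well separated.
    stable = min(intra_stability) > min_intra and max(inter_stability) < max_inter
    # Filter g: count features that are differentially expressed.
    # Assumption: the fold-change threshold is applied to the absolute value.
    differential = sum(
        1 for q, fc in zip(fdr, log2fc) if q < max_fdr and abs(fc) > min_log2fc
    )
    return stable and differential >= min_fraction * len(fdr)

# Hypothetical values: 20 features, of which 3 are differentially expressed.
toy_fdr = [0.01, 0.02, 0.05] + [0.5] * 17
toy_lfc = [2.0, -1.5, 1.2] + [0.1] * 17
accepted = passes_filters([0.95, 0.90], [0.05], toy_fdr, toy_lfc)
rejected = passes_filters([0.95, 0.70], [0.05], toy_fdr, toy_lfc)
```

Here the first call is accepted because both clusters are stable and 3 of the 20 features (15%) are differential, while the second call is rejected because one cluster has an intra-cluster stability of only 0.70.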
[0188] Works on bulk as well as single cell transcriptomic datasets: Except for tSNE+km, each of the compared algorithms works well for either single-cell data or bulk tissue transcriptomic data, but not both. RCC has shown good results for clustering both single-cell and bulk tissue transcriptomic datasets.
[0189] According to aspects of the invention, the present invention has the following uses:
[0190] Clustering of datasets with unknown number of clusters: Identifying algorithmically the optimal number of clusters in a dataset is a difficult problem. RCC works well on datasets with known as well as unknown number of clusters as estimation of the number of clusters k does not depend on the labelling of the data.
[0191] Finding novel subtypes: Owing to its recursive nature, RCC explores the dataset in a hierarchical fashion, revealing finer structures and hence novel subtypes that are generally missed by standard clustering methods including kmeans, hierarchical clustering, pam, mclust, etc. For example, consider a large-scale cancer dataset that includes samples from multiple cancer types (breast, prostate, brain, etc.), with each cancer type having distinct molecular subtypes. Using an out-of-the-box clustering algorithm, one is able to find only tissue-specific cancers and not the clinically relevant subtypes within each cancer type.
[0192] Biologist intuitive: Most of the computational parameters required for clustering are calculated automatically in RCC. Unlike most of the other algorithms compared, it works well for both bulk and single-cell transcriptome data.
[0193] The best way to practice the present invention would be to use it as a stand-alone package or as part of a software package for translational bioinformatics, such as Oncomine Informatics (Thermo Fisher Scientific) or GeneSpring GX (Agilent Technologies). This tool can provide considerable value to diagnostic companies as well as bioinformatics service providers and can help them take full advantage of large-scale public transcriptome datasets.
[0194] According to aspects of the embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not depend on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As a person with ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as a limitation on the claims. In addition, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and a person skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the various embodiments.
[0195] The embodiments described herein can be practiced with different computer system configurations, including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like. The embodiments can also be practiced in distributed computing environments where steps are performed by remote processing devices that are linked over a network.
[0196] According to a non-limiting exemplary aspect of the present invention, it should be understood that the embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. These operations can be those requiring physical manipulation of physical quantities. These quantities can be in the form of electrical or magnetic signals capable of being stored, compared, transferred, combined, and otherwise manipulated. Further, the manipulations performed are often referred to in terms such as producing, identifying, determining, or comparing.
[0197] Any of the operations that form part of the embodiments described herein are useful machine operations. The embodiments described herein also relate to a device or an apparatus for performing these operations. The systems and methods described herein can be specially constructed for the required purposes, or they may be implemented on a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
[0198] Certain embodiments can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data, which can thereafter be read by a computer system. Examples of the computer readable medium include hard drives, network attached storage, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, read-only memory, random-access memory, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
[0199] Merely for illustration, only a representative number and type of graphs, charts, block diagrams and sub-block diagrams are shown. Many environments contain many more block and sub-block diagrams or systems and sub-systems, both in number and type, depending on the purpose for which the environment is designed.
[0200] According to a non-limiting exemplary aspect of the present invention, the method(s) can be used for the development of technologies that enable pathogen detection in point-of-care settings. These diagnostic technologies developed can then be utilized by hospitals/private clinics/dental doctors or the public as such to screen/diagnose different pathogens.
[0201] While specific embodiments of the invention have been shown and described in detail to illustrate the inventive principles, it will be understood that the invention may be embodied otherwise without departing from such principles.
[0202] Reference throughout this specification to “one embodiment”, “an embodiment”, or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment”, “in an embodiment” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
[0203] It should be understood that the figures and/or screen shots illustrated in the attachments highlighting the functionality and advantages of the present invention are presented for example purposes only. The present invention is sufficiently flexible and configurable, such that it may be utilized in ways other than that shown in the accompanying figures.
[000] REFERENCES:
1) Jain, A. K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 31, 651–666 (2010).
2) Ahmad, A. & Khan, S. S. Survey of State-of-the-Art Mixed Data Clustering Algorithms. IEEE Access 7, 31883–31902 (2019).
3) Chen, F. et al. Pan-Cancer Molecular Classes Transcending Tumor Lineage Across 32 Cancer Types, Multiple Data Platforms, and over 10,000 Cases. Clin. Cancer Res. 24, 2182–2193 (2018).
4) Hoadley, K. A. et al. Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell 173, 291–304.e6 (2018).
5) Puchalski, R. B. et al. An anatomic transcriptional atlas of human glioblastoma. Science 360, 660–663 (2018).
6) Biase, F. H., Cao, X. & Zhong, S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res. 24, 1787–96 (2014).
7) Pollen, A. A. et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–8 (2014).
8) Darmanis, S. et al. A survey of human brain transcriptome diversity at the single cell level. Proc. Natl. Acad. Sci. U. S. A. 112, 7285–90 (2015).
9) Darmanis, S. et al. Single-Cell RNA-Seq Analysis of Infiltrating Neoplastic Cells at the Migrating Front of Human Glioblastoma. Cell Rep. 21, 1399–1410 (2017).
10) Dey, K. K., Hsiao, C. J. & Stephens, M. Visualizing the structure of RNA-seq expression data using grade of membership models. PLOS Genet. 13, e1006599 (2017).
11) Cerami, E. et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discov. 2, 401–404 (2012).
12) Gao, J. et al. Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal. Sci. Signal. 6, pl1-pl1 (2013).
13) Neftel, C. et al. An Integrative Model of Cellular States, Plasticity, and Genetics for Glioblastoma. Cell 178, 835–849.e21 (2019).
14) Wilkerson, M. D. & Hayes, D. N. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics 26, 1572–1573 (2010).
15) Şenbabaoğlu, Y., Michailidis, G. & Li, J. Z. Critical limitations of consensus clustering in class discovery. Sci. Rep. 4, 6207 (2015).
16) Scrucca, L., Fop, M., Murphy, T. B. & Raftery, A. E. mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. R J. 8, 289–317 (2016).
17) Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
18) Krijthe, J. H. Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation. URL: https://github.com/jkrijthe/Rtsne (2015).
19) Cancer Genome Atlas Network, T. C. G. A. Genomic Classification of Cutaneous Melanoma. Cell 161, 1681–96 (2015).
20) Müller, S. et al. Single-cell profiling of human gliomas reveals macrophage ontogeny as a basis for regional differences in macrophage activation in the tumor microenvironment. Genome Biol. 18, 234 (2017).
21) Levitin, H. M. et al. De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Mol. Syst. Biol. 15, (2019).
22) Röttger, R. Clustering of Biological Datasets in the Era of Big Data. J. Integr. Bioinform. 13, 52–81 (2016).
23) Foroutan, M. et al. Single sample scoring of molecular phenotypes. BMC Bioinformatics 19, 404 (2018).
24) Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
25) Lin, P., Troup, M. & Ho, J. W. K. CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol. 18, 59 (2017).
26) Verhaak, R. G. W. et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17, 98–110 (2010).
27) Liberzon, A. et al. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 1, 417–425 (2015).
28) Chen, J., Bardes, E. E., Aronow, B. J. & Jegga, A. G. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37, (2009).

CLAIMS

I/ We Claim,

1) A system and method for genomic or nucleic acid sequence data analysis, the system comprising:
a) a processor coupled to a memory operable to cause the system to, present to a user a plurality of genomic tools;
b) obtaining a plurality of nucleic acid sequence reads, wherein at least one nucleic acid read comprises a single cell transcriptome data;
c) comparing said reads to a reference sequence construct, wherein said reference sequence construct is stored in computer memory, comprising at least two alternative sequences at a position in the reference sequence construct, one of which is a variation;
d) scoring sequence overlaps for each nucleic acid read against the reference sequence construct;
e) aligning each read to a location on the construct such that the score for each read is maximized;
f) a clustering tool for biomolecule-related sequence, wherein clustering tool allows the user to find unique and biologically meaningful subtypes in biomolecule-related sequence datasets;
g) a recursive method tool, wherein the recursive method tool allows the user to find finer structures in the datasets resulting in identification of novel subtypes; and
h) identifying novel subtypes in dataset in a hierarchical manner revealing finer structures.
2) A method for genomic or nucleic acid sequence data analysis, the method comprising acts of:
a) obtaining data, wherein a Recursive Consensus Clustering (RCC) takes quantitative transcriptomics data as input along with sample/cell information file;
b) selecting feature and data scaling, wherein for every recursive run the RCC selects the top n% of variant genes in that particular subset of the original dataset which is used for further clustering;
c) parallel processing of ConsensusClusterPlus, wherein the RCC uses ConsensusClusterPlus package in R which performs k-means clustering on the dataset which is parallelly processed eight times with 100 repeats each;
d) selecting optimal k, wherein cumulative distribution function (CDF) based plots are produced, wherein the RCC finds the optimal k for each process in turn giving eight optimal k values for every run, wherein if none of the clusters meet the already established criteria RCC returns zero as the optimal k;
e) sub-divisioning of data, wherein an optimal k for a given dataset is selected;
f) recursive clustering, wherein each subset is further recursively clustered till optimal k selection is possible; and
g) providing RCC outputs cluster information, wherein output is provided in a csv format, allowing the user to view the clustered data in the form of tracking plot and cluster annotation plot.
3) The system and method as claimed in claim 1 and claim 2, wherein the genomic or nucleic acid sequence read data can be generated using different techniques, platforms or technologies, including, capillary electrophoresis, microarrays, ligation-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, polymerase-based systems, ion- or pH-based detection systems, electronic signature-based systems, or any combination thereof.
4) The system and method as claimed in claim 1 and claim 2, wherein the system processor takes genomic or nucleic acid sequence data from sequencer, configuring a cluster analysis of the read sequences from the sequence output to a reference sample.
5) The system and method as claimed in claim 1 and claim 2, wherein a method determines possible cluster for sequencing reads, wherein the sample is interrogated to produce a plurality of read sequences from the sample, wherein the quality value for each cluster is determined.
6) The system and method as claimed in claim 1 and claim 2, wherein the genomic or nucleic acid sequences are related to proteins, peptides, nucleic acids, functional information sequence, structural information sequence, secondary or tertiary structures, amino acid or nucleotide sequences, biomolecule binding sequences, genetic mutations and variants, sequence motifs, or any combination thereof.
7) The system and method as claimed in claim 1, further comprising assembling the genomic or nucleic acid reads to each other based upon the alignment of the genomic or nucleic acid reads with respect to the reference sequence construct.
8) The method as claimed in claim 2, wherein the structural variation is at least 100 bp long, wherein the structural variation is selected from the group consisting of deletions, duplications, copy-number variations, insertions, inversions, translocations or any combination thereof.
9) The method as claimed in claim 2, wherein the mutation is selected from the group consisting of a deletion, a duplication, an inversion, an insertion, and a single nucleotide polymorphism, wherein the reference sequence construct comprises a genome of an organism, wherein the reference sequence comprises a chromosome of an organism.
10) The system and method as claimed in claim 1 and claim 2, wherein the user can request for additional resources comprises one selected from the list consisting of:
a) retrieving a data file not provided by the user and not included in the sequence data;
b) retrieving data from an URL;
c) retrieving a matrix of probabilities;
d) requesting additional computing power;
e) requesting additional computer processors;
f) requesting one or more virtual machines; and
g) requesting additional storage space.
11) The system and method as claimed in claim 2 and claim 10, wherein the method/RCC tool detects an inconsistency between the instructions and the sequence data, wherein the RCC tool detects an inconsistency between the sequence data and the executable.
12) The method as claimed in claim 11, wherein the RCC tool adds a flag to the instructions that sends a parameter to the executable, wherein the parameter controls how the executable analyzes the sequence data.
13) The system and method as claimed in claim 2 and claim 10, wherein the method recommends the modification to the user and allows the user to accept the recommendation, wherein the method/RCC tool causes the system to prompt the user for additional data.
14) The method as claimed in claim 13, wherein the RCC tool causes the system to prompt the user to accept the selected change, wherein the RCC tool causes the system to inform the user of the selected change.
15) The method as claimed in claim 2 and claim 14, wherein the RCC tool analyzes the sequence data and selects the change based on a feature of the sequence data.
16) The method as claimed in claim 2 and claim 10, wherein the method/RCC tool includes a series of statements that assign input data to specific sequence alignment programs based on qualities of the input data.
17) The method as claimed in claim 16, wherein the qualities are at least one of the following: file size, extension, file format, number of input files, and metadata.

Documents

Application Documents

#	Name	Date
1	201941015490-PROVISIONAL SPECIFICATION [17-04-2019(online)].pdf	2019-04-17
2	201941015490-POWER OF AUTHORITY [17-04-2019(online)].pdf	2019-04-17
3	201941015490-FORM 1 [17-04-2019(online)].pdf	2019-04-17
4	201941015490-DRAWINGS [17-04-2019(online)].pdf	2019-04-17
5	201941015490-Proof of Right (MANDATORY) [04-07-2019(online)].pdf	2019-07-04
6	201941015490-FORM-26 [04-07-2019(online)].pdf	2019-07-04
7	Correspondence by Agent _Form 1_GPA_08-07-2019.pdf	2019-07-08
8	201941015490-DRAWING [16-04-2020(online)].pdf	2020-04-16
9	201941015490-CORRESPONDENCE-OTHERS [16-04-2020(online)].pdf	2020-04-16
10	201941015490-COMPLETE SPECIFICATION [16-04-2020(online)].pdf	2020-04-16
11	201941015490-ENDORSEMENT BY INVENTORS [16-05-2020(online)].pdf	2020-05-16
12	201941015490-FORM 3 [31-08-2020(online)].pdf	2020-08-31
13	201941015490-FORM 3 [23-09-2020(online)].pdf	2020-09-23
14	201941015490-FORM 18 [07-03-2021(online)].pdf	2021-03-07
15	201941015490-FER.pdf	2025-06-30
16	201941015490-FORM 3 [03-07-2025(online)].pdf	2025-07-03

Search Strategy

1	201941015490_SearchStrategyNew_E_patentsearchstrategyE_25-06-2025.pdf