Abstract: AI-POWERED FRAMEWORK FOR DIABETES ONSET PREDICTION BASED ON MULTI-OMIC GENOMIC DATA FUSION The present invention relates to precision medicine, which increasingly relies on multi-modal biological data. Exploiting these data effectively takes considerable effort because of their volume and complexity. Using a range of biomedical data modalities, researchers and clinicians are actively developing artificial intelligence (AI) techniques for data-driven knowledge discovery and causal inference. These AI-based methods have shown encouraging results in a range of healthcare and biomedical applications. Meaningful individualized care for obesity can only be achieved through sophisticated analytical tools and multi-omics research methodologies. Multi-omics integration has revealed biomarkers such as SACS and TXNIP DNA methylation, OPRD1 and RHOT1 expression, and an SNP linked to ANO1, which offer new insights into the interactions between the biological pathways that cause type 2 diabetes (T2D). A machine learning method applied to multi-omics cross-sectional data from human pancreatic islets produced a promising T2D prediction accuracy that could have wide-ranging clinical diagnostic applications. FIG. 1
Description: AI-POWERED FRAMEWORK FOR DIABETES ONSET PREDICTION BASED ON MULTI-OMIC GENOMIC DATA FUSION
Technical Field
[0001] The embodiments herein generally relate to a method for an AI-powered framework for diabetes onset prediction based on multi-omic genomic data fusion.
Description of the Related Art
[0002] Patient-level information about genetic variants can be obtained from multi-omics data generated by methods such as next-generation sequencing (NGS). For instance, genome sequencing was used during the COVID-19 pandemic to identify subtypes of COVID-19 variants and to forecast patient prognosis. Pharmacogenomics, a new paradigm in drug development, allows clinicians to use genetic profiling to find specific mutations in cancer diagnosis and treatment and so select the most effective course of action. Similarly, smart wearable technology is an effective way to monitor individual health data for tailored diagnosis and optimal care. Type 2 diabetes mellitus (T2DM), the most common form of diabetes, is increasingly prevalent worldwide and places a heavy strain on healthcare systems.
[0003] Preventive measures are urgently needed because of the negative effects of type 2 diabetes (T2D) on individuals and the economy, which include reduced productivity, increased healthcare expenses, serious complications, and a shorter life expectancy. Because blood glucose levels are primarily regulated by insulin and glucagon, secreted by pancreatic beta and alpha cells respectively, the pancreas is the most important organ for understanding the pathophysiology of type 2 diabetes. Considerable work has therefore gone into identifying the molecular processes that impair the secretion of insulin and glucagon from the pancreatic islets of T2D patients.
[0004] Multi-omics data could fill the gaps in precision medicine's understanding of disease at the individual level. By examining these complex biological big data, researchers can find biomarkers for disease diagnosis and treatment outcomes as well as new relationships between biological entities. However, manual analysis is not feasible because of the size and complexity of high-throughput multi-omics data.
[0005] A predictive model is considered effective if it can reliably distinguish between people at high and low risk of developing the condition (discrimination), estimate an individual's risk so that predictions closely match observed outcomes (calibration), and work well across a variety of populations (generalizability). Calibration and discrimination can be evaluated by internal or external validation; external validation is frequently used because it offers a more thorough evaluation of the model's generalizability. We developed a predictive Partial Least Squares (PLS) regression model for Omics integration after investigating current machine learning techniques and choosing a model based on its complexity and suitability for our data.
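By way of non-limiting illustration, the PLS regression idea can be sketched as a one-component PLS1 fit in plain Python. This is a simplified sketch under stated assumptions, not the model of the embodiments: the function names `fit_pls1` and `predict_pls1`, the toy matrix, and the restriction to a single latent component are all hypothetical choices made for the example.

```python
from math import sqrt

def fit_pls1(X, y):
    """Fit a one-component PLS1 regression (simplified illustration only)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered features
    yc = [v - y_mean for v in y]                                # centered response
    # Weight vector: the feature-space direction with maximal covariance with y.
    # Assumes X and y are not orthogonal, so the norm below is non-zero.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Latent component scores, then the regression of y on those scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "q": q}

def predict_pls1(model, x):
    """Project a new sample onto the latent component and predict y."""
    t = sum((xj - mj) * wj for xj, mj, wj in zip(x, model["x_mean"], model["w"]))
    return model["y_mean"] + model["q"] * t

# Hypothetical toy data: two perfectly collinear features, y = 2 * feature.
X_toy = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y_toy = [0.0, 2.0, 4.0, 6.0]
pls_model = fit_pls1(X_toy, y_toy)
pls_prediction = predict_pls1(pls_model, [4.0, 4.0])
```

Because the latent component is a weighted combination of all features, collinear inputs such as the two toy columns do not cause the instability that ordinary least squares would show, which is one reason PLS is commonly chosen for high-dimensional omics data.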
[0006] AI-driven models have emerged as a promising tool for predicting T2DM: they analyze complex, multidimensional datasets to identify high-risk individuals, uncover risk factors and biomarkers associated with T2DM development, and guide personalized interventions for disease prevention. At the same time, electronic health record (EHR) data can serve as quantitative phenotype targets for evaluating treatment outcomes and can contribute to clinical biomarker discovery. The patient's multi-omics data, which contain molecular-level features, can help clinicians understand the disease and find new genetic and cellular biomarkers.
SUMMARY
[0001] In view of the foregoing, an embodiment herein provides a method for an AI-powered framework for diabetes onset prediction based on multi-omic genomic data fusion. In some embodiments, reverse-phase protein arrays, RNA or single nucleotide polymorphism (SNP) microarrays, DNA or RNA sequencing, and other techniques are used to gather the multi-omics data. Managing missing data is a crucial component of quality control. The reviewed studies differed in their target populations, prediction horizons, and data types. A second key area was risk stratification, which aims to identify individuals at high risk of T2DM and is an essential component of effective public health intervention. Feature pre-selection can be performed in a supervised fashion, i.e. using the phenotype of interest, or in an unsupervised (unbiased) fashion. The unsupervised technique usually selects the most variable features regardless of the sources of variation.
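By way of non-limiting illustration, unsupervised pre-selection of the most variable features may be sketched as follows. The function name `top_variable_features`, the feature names, and the toy values are hypothetical and are not taken from the embodiments.

```python
from statistics import pvariance

def top_variable_features(matrix, feature_names, k):
    """Return the names of the k features with the largest variance.

    matrix: list of samples, each a list of feature values (rows = samples).
    No phenotype label is used, so the selection is unsupervised.
    """
    columns = list(zip(*matrix))  # transpose: one tuple per feature
    variances = {name: pvariance(col) for name, col in zip(feature_names, columns)}
    ranked = sorted(feature_names, key=lambda name: variances[name], reverse=True)
    return ranked[:k]

# Hypothetical toy data: 4 samples x 3 features.
toy_samples = [
    [1.0, 5.0, 0.1],
    [1.1, 9.0, 0.1],
    [0.9, 1.0, 0.2],
    [1.0, 7.0, 0.1],
]
toy_names = ["gene_a", "gene_b", "gene_c"]
selected_features = top_variable_features(toy_samples, toy_names, k=2)
```

In practice the features would usually be put on a comparable scale before ranking, since raw variance depends on the measurement units of each omics platform.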
[0002] In some embodiments, the most popular method was to convert each omics dataset's initial large-scale features into gene-level features. Examples of the original features are DNA methylation levels at specific locations within genes, or RNA transcripts produced from genes. The mapped gene-level feature matrices are then used in the subsequent processing stages. For instance, TCGA DNA methylation data consist of CpG island methylation values, which can vary greatly within a particular gene. Eight studies also used self-reported type 2 diabetes as a diagnostic criterion. Some studies also used codes from the International Classification of Diseases, notably ICD-9 and ICD-10. In one study, measured blood glucose levels were used to validate clinical procedures. DIABLO converts each individual Omic dataset into latent components so as to maximize the sum of pairwise correlations between the latent components and a phenotype of interest. The outcome of DIABLO is a set of features that are associated both within and between the Omics datasets.
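By way of non-limiting illustration, mapping site-level methylation values to gene-level features may be sketched as a per-gene average. Averaging is an assumption made for the example (other summaries such as the median are equally possible), and the probe identifiers, the annotation mapping, and the function name `to_gene_level` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def to_gene_level(cpg_values, cpg_to_gene):
    """Collapse per-CpG methylation values into one value per gene.

    cpg_values: dict of probe id -> methylation value for one sample.
    cpg_to_gene: dict of probe id -> gene symbol (annotation, assumed given).
    """
    grouped = defaultdict(list)
    for cpg, value in cpg_values.items():
        gene = cpg_to_gene.get(cpg)
        if gene is not None:  # probes without annotation are skipped
            grouped[gene].append(value)
    return {gene: mean(values) for gene, values in grouped.items()}

# Hypothetical probes and values for a single sample.
toy_annotation = {"cg_1": "TXNIP", "cg_2": "TXNIP", "cg_3": "SACS"}
toy_methylation = {"cg_1": 0.25, "cg_2": 0.75, "cg_3": 0.9, "cg_4": 0.5}
gene_level = to_gene_level(toy_methylation, toy_annotation)
```

Here the two TXNIP probes collapse to a single gene-level value, and the unannotated probe `cg_4` is dropped.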
[0003] In some embodiments, principal component analysis (PCA) and autoencoders are the most frequently used feature extraction techniques in the reviewed studies. Filter-based feature selection techniques use statistical tests to estimate the significance or relevance of each feature and eliminate the least significant features from the dataset; examples from the reviewed studies include the Chi-squared test, the Wilcoxon rank-sum test, and minimum redundancy maximum relevance (mRMR). All 40 studies used EHR data, including sociodemographic characteristics, lifestyle risk factors, anthropometric measures, glycemic traits, blood lipids, and blood pressure factors. Other sources included multi-omics data, such as single nucleotide polymorphisms (SNPs), metabolomic data (metabolite levels in the blood), and microbiome data.
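By way of non-limiting illustration, a filter-based selection step using the Chi-squared test may be sketched for binary features and a binary label. Only the test statistic is computed (no p-value), and the feature names and toy vectors are hypothetical.

```python
def chi2_statistic(feature, label):
    """Pearson chi-squared statistic for two binary (0/1) vectors."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]  # observed[feature value][label value]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    row_totals = [sum(observed[0]), sum(observed[1])]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    stat = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def rank_by_chi2(features, label):
    """Order feature names from most to least associated with the label."""
    scores = {name: chi2_statistic(values, label) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: 8 subjects, label 1 = T2D, 0 = control.
toy_label = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the label
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly associated with the label
}
chi2_ranking = rank_by_chi2(toy_features, toy_label)
```

A filter of this kind scores each feature independently of the downstream model, which keeps it cheap enough to run over many thousands of omics features.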
[0004] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
BRIEF DESCRIPTION OF THE DRAWINGS
[0001] The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
[0002] FIG. 1 illustrates a method for an AI-powered framework for diabetes onset prediction based on multi-omic genomic data fusion according to an embodiment herein.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0001] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0002] FIG. 1 illustrates a method for an AI-powered framework for diabetes onset prediction based on multi-omic genomic data fusion according to an embodiment herein. In some embodiments, the initial features are scaled or normalized during early integration. The dataset comprising all omics modalities is then subjected to dimensionality reduction as a whole, rather than one modality at a time. For instance, in one study the pre-processed RNA sequencing features and gene-level copy number variation (CNV) features were concatenated into a single matrix. To find the most informative features for cancer survival, the authors applied feature extraction techniques such as PCA and an autoencoder to create a 100-feature embedding from that matrix, and then used Cox regression to select features from the embedding. Five studies used Naïve Bayes (NB) classifiers, and four studies each used k-nearest neighbors (KNN) and extreme gradient boosting (XGBoost). Four studies used ensemble learning, which combines the predictions of several models to increase overall prediction accuracy, with voting methods including weighted voting and soft voting. Three studies employed hidden Markov models (HMM), and two studies each used linear and quadratic discriminant analysis classifiers and gradient boosting machines (GBM). Neither of the splits produced a T2D prediction accuracy below 70% for PLS2, and both components predict T2D much better than the naïve baseline model in the DIABLO integrated multi-Omics study. The high accuracy of T2D versus control classification was verified by unsupervised hierarchical clustering on the highest-ranking features selected from the Omics by data integration.
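By way of non-limiting illustration, the early-integration step described above (scale each feature, then concatenate the modalities into a single matrix) may be sketched as follows. Z-score scaling is one possible normalization chosen for the example, and the function names and toy matrices are hypothetical; the subsequent dimensionality reduction is not shown.

```python
from statistics import mean, pstdev

def zscore_columns(matrix):
    """Scale every feature (column) to zero mean and unit variance."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sigma = mean(column), pstdev(column)
        # Constant features carry no information; map them to 0.0.
        scaled_columns.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]

def early_fusion(*modalities):
    """Concatenate scaled modalities sample by sample into one matrix.

    Every modality must list the same samples in the same order.
    """
    scaled = [zscore_columns(m) for m in modalities]
    return [[value for row in rows for value in row] for rows in zip(*scaled)]

# Hypothetical toy data: 3 samples, 2 RNA features and 1 CNV feature.
toy_rna = [[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]]
toy_cnv = [[2.0], [2.0], [4.0]]
fused_matrix = early_fusion(toy_rna, toy_cnv)
```

Scaling before concatenation prevents a modality with large raw values, such as read counts, from dominating the combined matrix when it is later passed to PCA or an autoencoder.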
[0003] In some embodiments, mixed- and hierarchical-integration techniques were employed in the survival analysis learning with multi-omics neural networks (SALMON) project. Hierarchical techniques integrate data using an external knowledge base, such as established regulatory links between features in different omics modalities. SALMON used external knowledge to extract features from each omics modality before merging them for further processing. The EHR risk factors and biomarkers used as inputs by the included studies fall into the following broad categories: sociodemographic and FHD factors; lifestyle; anthropometric measures of body size and composition; glycemic traits, which include measures of glucose control; blood lipid and blood pressure factors, which include measures of cholesterol, triglycerides, and blood pressure; inflammatory biomarkers, which include measures of inflammation such as the C-reactive protein-to-albumin ratio (CAR); other biomarkers, which include measures of liver function, such as liver enzyme levels, or measures of adiposity, such as circulating adiponectin levels; medications; and disease history. The main predictive indicators selected by the DIABLO model applied to individual Omics were the expression of 11 genes, methylation of 10 CpG sites, and two SNPs. Therefore, even though gene expression and DNA methylation played a major role in the integrative DIABLO model, a number of the features that DIABLO found (roughly 50%) can be regarded as novel, in that they would not have been identified as the most informative by predictive analysis of individual Omics.
[0004] In some embodiments, the encoding subnetwork of the hierarchical integration deep flexible neural forest (HI-DFN Forest) was built to accept input from every omics data modality. After the latent embeddings from the omics modalities were combined in a subsequent layer, deep flexible neural forest feedforward layers were used to classify patients into distinct cancer subtypes. Lin et al. chose the top 5000 features by modality-specific dimensionality reduction using the Chi-squared test. This method lets the model benefit from the complementary knowledge that the various data sources offer. Seven papers in this study used an early fusion approach, combining several kinds of data, such as multi-omics and EHR. Combining genetic, epigenetic, transcriptomic, and chromatin data from this cell line produced a well-defined tool for mechanistic investigations of the insulin-secreting cell. However, to identify the genes and genomic regions of significance for the disease, research such as ours, which shows the differences between the pancreatic islets of people with and without type 2 diabetes, is essential.
[0005] In some embodiments, the fundamental pre-processing stages are as follows: numerical data are converted to strings, text is converted to the UTF-8 standard encoding format, entities are normalized, sentences are segmented, stop words and punctuation are removed, and unnecessary spaces, lines, or characters are eliminated. Reusing structured EHR data presents additional difficulties, especially with respect to data heterogeneity, quality, and high dimensionality. Combining SNPs with clinical variables greatly increased the AUC of the deep neural network (DNN) models. Overall, the study's findings imply that integrating genetic information with traditional risk variables could enhance the accuracy of AI prediction models for the prognosis of type 2 diabetes, particularly when a larger range of genetic variations is included.
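By way of non-limiting illustration, the fundamental text pre-processing stages may be sketched with the Python standard library. The stop-word list is a tiny illustrative stand-in, lowercasing and the preservation of decimal points are assumptions added for the example, entity normalization is omitted, and the function name `preprocess_note` is hypothetical.

```python
import re
import unicodedata

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "was", "with", "to", "in"}

def preprocess_note(raw):
    """Return a list of cleaned sentences from one free-text or numeric field."""
    # Numerical data become strings; bytes are decoded as UTF-8.
    text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else str(raw)
    text = unicodedata.normalize("NFKC", text)
    # Naive sentence segmentation on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    cleaned = []
    for sentence in sentences:
        # Drop punctuation, but keep a mark that sits between two digits (e.g. 7.9).
        no_punct = re.sub(r"(?<!\d)[^\w\s]|[^\w\s](?!\d)", " ", sentence)
        # split() also collapses repeated spaces and stray line breaks.
        tokens = [t for t in no_punct.lower().split() if t not in STOP_WORDS]
        if tokens:
            cleaned.append(" ".join(tokens))
    return cleaned

note_sentences = preprocess_note("The patient was diagnosed with T2DM.  HbA1c is 7.9 %!")
numeric_field = preprocess_note(42)
```

The decimal-point rule matters for clinical text, because stripping every period would silently turn a laboratory value such as 7.9 into 79.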
[0006] These results demonstrate how adding genetic data to T2DM prediction models may be beneficial. The Lumi pipeline was used to process the DNA methylation array data, correcting for dye-bias from the use of two-color channels, applying background fluorescence subtraction, and correcting the technical differences between the Type I and Type II probe types through quantile normalization and BMIQ normalization.
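By way of non-limiting illustration, generic quantile normalization across samples may be sketched as follows. This is not the Lumi pipeline or BMIQ; it shows only the basic idea of forcing every sample onto the same value distribution, ignores ties, and uses hypothetical toy intensities.

```python
def quantile_normalize(samples):
    """Give every sample the same distribution of values.

    samples: list of equal-length lists, one list of probe values per sample.
    Each value is replaced by the mean, across samples, of the values that
    share its rank. Ties are broken by position (a simplification).
    """
    n_probes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_probes)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_probes), key=lambda i: s[i])  # probe indices by rank
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: 2 samples x 3 probes.
toy_arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized_arrays = quantile_normalize(toy_arrays)
```

After normalization, both toy samples contain exactly the same set of values, while the rank order of probes within each sample is preserved.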
[0007] Secondary use of EHR data is made more difficult by the lack of interoperability amongst institutions, especially when doing cross-institutional collaborative research. We can enable enhanced precision medicine research and simplify large-scale data research collaboration by using a common data model (CDM) to represent structured EHR data. The best prediction performance was obtained by combining clinical factors with the panel of chosen metabolite signatures, which boosted the AUC to 0.78. This combination fared better than the metabolite-only and clinical models, highlighting the importance of combining the two data sets to improve prediction accuracy. With an AUC of 0.77, a different investigation found 21 metabolites that were substantially different between patients with and without type 2 diabetes. Another crucial component of medication therapy procedures is the escalation of diabetes treatment, which is necessary as a patient's condition worsens. Several studies have attempted to determine the best escalation points to help physicians. To identify and predict T2D patients for whom metformin monotherapy is likely to fail and necessitate medication escalation, Murphree et al. used a number of AI techniques.
[0008] The deep neural network model demonstrated higher prediction accuracy and discriminative capability than logistic regression or other machine learning models for identifying high-risk patients. Furthermore, EHR data has been investigated to forecast mortality risk, risk of mild-to-severe transition, future oxygen needs, and mechanical ventilation need during the COVID-19 pandemic. The CNN model's image feature vector was fed into a multilayer perceptron for joint learning and prediction after being concatenated with the patient's clinical characteristics.
[0009] According to their findings, combining fundus images and clinical data significantly improved the model's performance, with an AUC of 0.85 compared with 0.82 for the fundus-only model and 0.76 for the clinical-only model. If AI-enabled digital health technologies (DHTs) are trained on data that reflect the health-care inequalities faced by groups based on socioeconomic position, gender, sexual orientation, race, ethnicity, or geography, they may amplify unconscious prejudice and discrimination. To reduce these risks, AI algorithms must be trained on equitable datasets that incorporate and accurately reflect the social, environmental, and economic factors that affect health. By evaluating sensitivity, specificity, and AUROC as objective functions simultaneously, the multi-objective model overcomes the drawbacks of choosing a model on the basis of a single objective. By combining outcome prediction and treatment category classification, Chen et al.'s approach uses multi-task deep learning to extract both outcome-predictive and treatment-specific latent representations from EHRs. The lack of calibration evaluations could introduce an unintentional bias that affects the fairness of the model's predictions. Furthermore, the algorithms' apparent lack of external evaluation calls their fairness into doubt.
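By way of non-limiting illustration, the three objectives named above (sensitivity, specificity, and AUROC) may be computed as follows. The 0.5 decision threshold, the function names, and the toy scores are assumptions made for the example; how the three values are combined into a multi-objective selection rule is not shown.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity and specificity at a fixed decision threshold."""
    tp = fn = tn = fp = 0
    for y, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, y_score):
    """AUROC as the probability that a case outscores a control (ties count 0.5)."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))

# Hypothetical toy data: 3 T2D cases followed by 3 controls.
toy_truth = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
toy_sensitivity, toy_specificity = sensitivity_specificity(toy_truth, toy_scores)
toy_auroc = auroc(toy_truth, toy_scores)
```

Sensitivity and specificity depend on the chosen threshold while AUROC does not, which is why optimizing only one of them can favour a model that looks weak on the others.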
[0010] The generated algorithm may be skewed towards the particular demographics contained in the training data if the training data are not diverse enough to represent the larger population. When applied to other demographic groupings, this would further restrict its fairness and applicability. Federated computing techniques have been further developed, wherein local learning is aggregated and distributed by dedicated parameter servers, while maintaining a central framework. Swarm learning is a decentralized machine learning technique that eliminates the need for a dedicated server, distributes parameters via a swarm network, and develops models autonomously using private data at each location.
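By way of non-limiting illustration, the aggregation performed by a dedicated parameter server may be sketched as a sample-size-weighted average of locally trained parameters (the widely used federated averaging rule). Weighting by site size is an assumption made for the example, and the function name and toy values are hypothetical; swarm learning would distribute the same kind of update without the central server.

```python
def federated_average(site_params, site_sizes):
    """Aggregate locally trained parameter vectors on a parameter server.

    site_params: one list of model parameters per site, all the same length.
    site_sizes: number of training samples at each site, used as weights.
    Only parameters leave each site; the private patient data stay local.
    """
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [
        sum(params[i] * size for params, size in zip(site_params, site_sizes)) / total
        for i in range(n_params)
    ]

# Hypothetical toy data: two hospitals with 100 and 300 patients.
toy_site_params = [[1.0, 2.0], [3.0, 6.0]]
toy_site_sizes = [100, 300]
global_params = federated_average(toy_site_params, toy_site_sizes)
```

The aggregated vector would then be sent back to every site as the starting point for the next round of local training.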
Claims:
I/We Claim:
1. A method for an AI-powered framework for diabetes onset prediction based on multi-omic genomic data fusion, wherein the method comprises:
allowing the random forest model to perform well even with missing or imbalanced data;
predicting diabetes using secondary data from reliable sources like the UK Biobank, TCGA, and NCBI. Random Forest, with its 99% accuracy, stands out as the most effective model, followed by Neural Networks and Gradient Boosting Machines, which also achieved impressive results;
examining the growing use of AI predictive models for diabetes, with a focus on studies utilizing genetic data as a primary source;
managing diabetes in traditional face-to-face medical practices, such as ineffective prevention systems, uneven distribution of medical resources, and improper self-management; and
utilizing state-of-the-art AI-based approaches for multi-omics data, electronic health records data, and cross-modality data integration.
Dated this, 27th March, 2025.
Signature:
Name: KALAIMATHI. J
| # | Name | Date |
|---|---|---|
| 1 | 202541034585-STATEMENT OF UNDERTAKING (FORM 3) [08-04-2025(online)].pdf | 2025-04-08 |
| 2 | 202541034585-REQUEST FOR EARLY PUBLICATION(FORM-9) [08-04-2025(online)].pdf | 2025-04-08 |
| 3 | 202541034585-PROOF OF RIGHT [08-04-2025(online)].pdf | 2025-04-08 |
| 4 | 202541034585-POWER OF AUTHORITY [08-04-2025(online)].pdf | 2025-04-08 |
| 5 | 202541034585-FORM-9 [08-04-2025(online)].pdf | 2025-04-08 |
| 6 | 202541034585-FORM 1 [08-04-2025(online)].pdf | 2025-04-08 |
| 7 | 202541034585-DRAWINGS [08-04-2025(online)].pdf | 2025-04-08 |
| 8 | 202541034585-DECLARATION OF INVENTORSHIP (FORM 5) [08-04-2025(online)].pdf | 2025-04-08 |
| 9 | 202541034585-COMPLETE SPECIFICATION [08-04-2025(online)].pdf | 2025-04-08 |