Abstract: Cardiovascular diseases (CVDs), which affect the heart and blood vessels, are now the leading cause of death globally, including in India, and have accounted for the bulk of the world's mortality over the past several decades. Consequently, a trustworthy, precise, and workable method is required for the early detection of these illnesses so that the right treatment can be given. A number of medical datasets have been automated through the application of machine learning algorithms and methodologies to explore massive and complex data sets. Utilising the patient's medical history in our model, we devised a method to ascertain the likelihood of a cardiac diagnosis. The proposed method detects this disease using the dataset obtained from the Cleveland repository, applying the Genetic Algorithm for feature extraction and feature-level fusion owing to its automatic and efficient learning. Moreover, Fast Track Gram Matrix-Principal Component Analysis (FTGM-PCA) is implemented for dimensionality reduction and fusion, resolving issues related to overfitting, reducing space and time complexity, and eliminating irrelevant data, thereby improving the performance of the classifier. Finally, effective classification is performed via a newly implemented technique called Informative Entropy-Based Random Forest (IEB-RF), chosen for its high accuracy and its flexibility in handling large data.
Description:
Field of Invention
Due to the prevalence of the disease and the high mortality rate associated with it, heart disease has become a significant public health problem in recent years. Therefore, predicting heart disease at an early stage, using certain simple physical indicators that can be obtained through routine physical examinations, has emerged as an important topic of study.
Background of the Invention
The discipline of machine learning encompasses many different subfields, and its scope and application are expanding every day. Machine learning includes several different classes of classifiers, including supervised, unsupervised, and ensemble learning, which are used to make predictions and determine the accuracy of a given dataset. Because this information is beneficial to a large number of individuals, it can be incorporated into the proposed Heart Disease Prediction System (HDPS). In contrast to other life-threatening ailments, cardiovascular disease has garnered a disproportionate degree of attention and focus within the medical domain. Eradicating heart disease presents formidable challenges, yet automated analysis holds the potential to offer insight into a patient's cardiac well-being, a critical aspect for effective subsequent treatment. The diagnostic process for heart conditions typically relies on a thorough physical examination of the patient, incorporating an evaluation of symptoms, manifestations, and outcomes. Various factors contribute to susceptibility to heart disease, encompassing habits such as smoking, cholesterol levels, familial history of coronary ailments, excess weight, high blood pressure, and insufficient physical activity. The term "cardiovascular disease" encompasses a broad spectrum of disorders that affect the heart's functionality. These include maladies involving blood vessels, such as coronary artery disease, disturbances in heart rhythm (known as arrhythmias), and congenital heart anomalies present from birth (termed congenital heart defects). The terms "heart disease" and "cardiovascular disease" are often used interchangeably. Typically, when the term "cardiovascular disease" is employed, it is indicative of disorders involving restricted or blocked blood vessels. Such conditions can lead to outcomes such as impaired breathing, chest pain (referred to as angina), or even stroke. Moreover, the term "heart disease" also encompasses a range of ailments that directly impact the heart itself, including issues relating to the heart's muscle, valves, or rhythm (JP7621609B2).
Cardiovascular disorders, which include a wide range of problems that may affect the heart and blood vessels, are quite prevalent in today's society. According to data provided by the WHO, cardiovascular diseases are responsible for the deaths of approximately 17.9 million individuals worldwide each year, making them the most prevalent cause of mortality, surpassing all other causes. Utilizing an individual's clinical history, our method has the potential to identify individuals who are prone to cardiovascular issues. This approach identifies those experiencing symptoms associated with heart disease, such as chest pain or high blood pressure, potentially leading to a diagnosis with fewer clinical trials and more efficient treatments, ensuring appropriate management. This study primarily focuses on three distinct data mining techniques, presented in the order described below: Sequential Regression, K-Nearest Neighbors (KNN), and Random Forest Classifier. Our invention has achieved an accuracy rate of 87.5%, a significant improvement over our previous framework, which relied on a single data mining method. Increasing the number of data mining methods used thus improved both the accuracy and the efficiency of the HDPS.
Machine learning is one of the concepts that is used in a significant way all over the globe. It is needed in the healthcare industry, where it helps medical professionals make diagnoses more quickly. In this invention, we work with the heart disease dataset and, using machine learning, predict whether an individual has cardiovascular illness or whether the person's condition is normal. This kind of prediction is known as forecasting illness prognosis using artificial intelligence. In the healthcare industry, where diagnosis is typically a time-consuming procedure, this prediction makes the process quicker and more efficient.
As per information provided by the World Health Organization, cardiovascular ailments stand among the most fatal conditions, accounting for a significant portion of global human mortality; heart diseases have been found to be among the most common causes of death around the world. It has also been shown that various kinds of cardiovascular disease are responsible for more than 24 percent of fatalities in India. Therefore, it is necessary to design a method of early diagnosis that may avert the deaths occurring as a result of cardiovascular disorders. Various methods exist for the detection of these heart conditions, including procedures such as angiography; however, such diagnostic approaches carry substantial cost and potential risks of unfavorable physiological reactions within the patient's body. Because of this, the broad use of these methods is inhibited in nations with large and impoverished populations (JP6559730B2).
Summary of the Invention
Heart disease is the leading cause of death, according to the analysis of the World Health Organisation (WHO). Reducing mortality rates is possible with early prediction of heart disease and appropriate treatment. To address overfitting problems, Fast Track Gram Matrix-Principal Component Analysis (FTGM-PCA) is employed for fusion and dimension reduction; it increases the classifier's performance, helps reduce space and time requirements, and removes redundant data. Grounded in linear algebra and basic matrix operations, principal component analysis is a tool for reducing the original data to a representation of much lower dimension. To enhance the performance of the classifier, this system incorporates a strategy we call the Informative Entropy-Based Random Forest (IEB-RF), which achieves excellent classification accuracy while remaining adaptable to large datasets.
Brief Description of Drawings
Figure 1: Proposed System Architecture for Predicting Heart Disease.
Figure 2: Classification using Informative Entropy-Based Random Forest.
Detailed Description of the Invention
The proposed system mainly involves three steps: pre-processing, feature selection and extraction, and prediction. Figure 1 shows the general procedure of the suggested methodology. In the first stage, the Cleveland heart disease dataset is loaded and pre-processed. The Cleveland dataset, part of the UCI machine learning repository, contains fourteen variables collected from three hundred and three (303) patients with heart disease. Through data cleaning, which involves removing unnecessary records, this stage yields a highly informative dataset. The following stage extracts features and fuses them together using the suggested genetic algorithm. The next step reduces dimensionality using the suggested FTGM-PCA. Minimising the number of dimensions in this way reduces calculation and training time as well as the space needed for storing data, and it also facilitates data visualisation. Principal Component Analysis (PCA) reduces the number of characteristics in order to prevent data overfitting. The output of this step is then used for the train-test split. To carry out the classification, the suggested IEB-RF classifier is employed; classification separates data pertaining to suspected patients from data pertaining to healthy people. Finally, the trained model is used to make the prediction, and its efficiency is evaluated based on the results of the performance exploration.
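For illustration, the following is a minimal sketch of this three-stage pipeline, assuming the Cleveland dataset is available as a CSV file named cleveland.csv with a label column named target (both names are placeholders); standard scikit-learn PCA and RandomForestClassifier stand in for the proposed FTGM-PCA and IEB-RF stages, which are detailed later.

```python
# Minimal sketch of the three-stage pipeline: pre-processing, dimensionality
# reduction, and classification. "cleveland.csv" and the "target" column are
# assumed placeholders; standard PCA and RandomForestClassifier approximate
# the proposed FTGM-PCA and IEB-RF stages.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

df = pd.read_csv("cleveland.csv").dropna()           # data cleaning: drop incomplete rows
X, y = df.drop(columns=["target"]), df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

pipeline = Pipeline([
    ("scale", StandardScaler()),                     # normalise features before PCA
    ("pca", PCA(n_components=0.95)),                 # keep components up to 95% CEV
    ("rf", RandomForestClassifier(criterion="entropy", n_estimators=100)),
])
pipeline.fit(X_train, y_train)
print("test accuracy:", pipeline.score(X_test, y_test))
```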
The genetic algorithm implements an adaptable heuristic exploration method conceived by John Holland and De Jong. Its construction is based on the theory of natural evolution proposed by Charles Darwin, incorporating fundamentals from genetics and the process of natural selection. Although John Holland and De Jong first developed this flexible algorithm, Google has also found it useful in its pursuit of technological growth. In the context of natural selection, the selection process favors the most robust individuals within a population. The well-being of offspring directly correlates with the physical condition of their parents, as hereditary traits are passed down and integrated into successive generations. This iterative process persists until a group of individuals demonstrating optimal health is identified. Optimization strategies of this kind are frequently employed to enhance the overall performance of machine learning systems. For the proposed study, a population size of 303 has been established, corresponding to the complete pool of candidates in the dataset. A fitness function is then constructed, which takes candidate solutions as input and generates an output aligned with the fitness evaluation of the problem.
During the selection phase, chromosomes with favorable attributes are chosen. This is succeeded by the crossover process, which involves gene reproduction. Finally, mutation is applied, introducing random changes to a chromosome to sustain genetic diversity within the population. In our exploration of heart disease prediction, we harnessed an optimization technique with the following parameter configurations: a mutation rate of 0.1, 100 parent candidates (n_parents), 30 features (n_feat), and 15 generations (n_gen). The optimized feature sets for each model were evaluated through confusion matrices. Addressing missing data, identifying and removing outliers, eliminating noise, and rectifying errors collectively constitute the data cleaning process. Data reduction pertains to strategies aimed at minimizing data volume within a dataset. This includes activities such as the selection of features, the extraction of distinguishing characteristics, and the discretization of continuous attributes, which is the process of transforming continuous attributes into a restricted set of nominal values.
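As an illustration of the selection, crossover, and mutation steps with the stated parameters (mutation rate 0.1, n_parents = 100, n_gen = 15), the following sketch performs genetic-algorithm feature selection over binary chromosomes. The fitness function, taken here as the cross-validated accuracy of a random forest on the selected feature subset, is an assumption and is not specified above.

```python
# Sketch of GA feature selection with the stated parameters. The fitness
# measure (cross-validated random-forest accuracy) is an assumed choice.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of a random forest on the selected features."""
    if not mask.any():                                  # empty subsets score zero
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def ga_select(X, y, n_parents=100, n_gen=15, mutation_rate=0.1):
    n_feat = X.shape[1]
    pop = rng.random((n_parents, n_feat)) < 0.5         # random binary chromosomes
    for _ in range(n_gen):
        scores = np.array([fitness(m, X, y) for m in pop])
        elite = pop[np.argsort(scores)[::-1][: n_parents // 2]]   # selection: keep fittest half
        children = []
        while len(elite) + len(children) < n_parents:   # crossover: single-point recombination
            a, b = elite[rng.integers(len(elite), size=2)]
            cut = int(rng.integers(1, n_feat))
            children.append(np.concatenate([a[:cut], b[cut:]]))
        pop = np.vstack([elite, np.array(children)])
        flips = rng.random(pop.shape) < mutation_rate   # mutation: random bit flips for diversity
        pop = np.where(flips, ~pop, pop)
    scores = np.array([fitness(m, X, y) for m in pop])
    return pop[np.argmax(scores)]                       # best feature mask found

# usage (X_arr: 2-D numpy feature array, y_arr: label vector):
# best_mask = ga_select(X_arr, y_arr)
```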
The process of selecting features includes identifying pertinent attributes while discarding redundant ones, so as to effectively characterize a problem without substantial information loss. Feature selection methods can be categorized into three primary classes: filters, wrappers, and embedded models. Filters evaluate attribute significance without utilizing learning mechanisms. Wrappers, by contrast, assess the usefulness of the chosen attributes as determined by a data mining model; this assessment is a fundamental prerequisite of wrapper-based feature selection. Embedded techniques, in turn, include feature selection as an integral component of the model-building process itself. Filter, wrapper, and embedded procedures may each be further categorised as univariate or multivariate. The first strategy, referred to as feature ranking, assigns a score to each individual characteristic depending on how relevant it is to the class attribute, as assessed by a performance measure; this ranking then decides the order in which the attributes appear in the final list. The second strategy, referred to as subset selection, evaluates different groupings of attributes using a predetermined performance measure in conjunction with a systematic search strategy. In contrast, feature extraction, also known as feature projection, generates fresh characteristics from the original ones via a functional mapping, which typically reduces the dimensionality of the feature space.
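As a small illustration of the filter-based, univariate feature-ranking strategy described above, the following scores each attribute against the class label and orders the attributes by relevance. Mutual information is used here as one standard relevance measure purely for illustration; the text does not prescribe a specific filter criterion, and the file and column names are the same assumed placeholders as in the earlier pipeline sketch.

```python
# Filter-style univariate feature ranking: score each attribute against the
# class label, then order attributes by relevance. Mutual information is one
# standard relevance measure, used here for illustration only.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("cleveland.csv").dropna()
X, y = df.drop(columns=["target"]), df["target"]

scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")
```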
In the proposed study, a Fast Track Gram Matrix-Principal Component Analysis (FTGM-PCA) was utilised. In machine learning, Gram matrices computed via kernel functions offer a quick way to reduce the dimensionality of the features. Principal Component Analysis (PCA) reduces the number of features in order to prevent the data from being overfit, and FTGM-PCA enhances this dimension reduction process to boost the classifier's efficiency. The retained components carry high output variance, which also leads to better visualisation. PCA removes correlated variables that do not contribute to decision making and helps ML algorithms perform better; high-dimensional data are transformed into low-dimensional data while preserving the variance in the low-dimensional representation. After the preceding procedures, the eigenvectors and eigenvalues are computed along with the Cumulative Explained Variance (CEV). As soon as the CEV surpasses or equals the threshold value, the projection matrix is computed. The procedure defined by PCA is as follows: the n-dimensional input is taken, and its correlation and Gram matrices are computed; a loop then computes the column means, from which the Full-PCA decomposition is obtained, and the loop terminates once its condition is met. The eigenvalues and eigenvectors are assigned the magnitudes from the obtained decomposition, and the CEV is calculated from them. While the CEV remains at or below the threshold, further components are accumulated into the projection matrix W; once the threshold is exceeded, the loop ends. The input is then transformed by means of W, and the reduced representation X̃, containing the subset of characteristics for dimension k, is returned.
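The following is an approximate sketch of the Gram-matrix PCA step with a CEV threshold. Computing the eigendecomposition of the Gram matrix K = XcXcᵀ is a standard shortcut when samples are fewer than features; the exact FTGM-PCA procedure is not fully specified above, so this is an approximation under those assumptions.

```python
# Sketch of Gram-matrix PCA with a cumulative-explained-variance (CEV)
# threshold. The Gram matrix K = Xc @ Xc.T (samples x samples) yields the same
# principal components as the covariance matrix; the exact FTGM-PCA procedure
# is not fully specified, so this is an approximation.
import numpy as np

def gram_pca(X, cev_threshold=0.95):
    Xc = X - X.mean(axis=0)                      # centre the data
    K = Xc @ Xc.T                                # Gram matrix (samples x samples)
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1]            # sort eigenpairs, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cev = np.cumsum(eigvals) / eigvals.sum()     # cumulative explained variance
    k = int(np.searchsorted(cev, cev_threshold)) + 1   # smallest k with CEV >= threshold
    # recover feature-space projection matrix W from the Gram eigenvectors
    W = Xc.T @ eigvecs[:, :k] / np.sqrt(np.maximum(eigvals[:k], 1e-12))
    return Xc @ W                                # project onto the k components
```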
The Informative Entropy-Based Random Forest (IEB-RF) is the proposed method for classification. This procedure uses information entropy, which is the average quantity of information conveyed by an event, accounting for all possible outcomes. Random forest (RF), for its part, yields effective predictions and is readily interpretable; it has great predictive accuracy and can handle large datasets with ease. The suggested IEB-RF therefore distinguishes affected patient data from normal person data. Figure 2 shows this sequential procedure. It starts with creating the tree's nodes. Partitioning the training data into subsets is the next step in building the split, and a subset of the variables is selected by this procedure. To determine the optimal split, the information-gain entropy and the Gini index are computed for each of the selected variables, and the resulting split is fed into the build process. Following this, the predicted value is returned and the prediction error is computed.
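The split-evaluation step described above can be sketched as follows: for a candidate binary split of the training labels, both the information-gain entropy and the weighted Gini index are computed. How IEB-RF combines the two criteria is not detailed above, so this sketch simply reports both scores.

```python
# Sketch of the IEB-RF split-evaluation step: compute both information gain
# (entropy-based) and the weighted Gini index for a candidate binary split.
# The rule for combining the two criteria is not specified in the text.
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_scores(y, left_mask):
    """Information gain and weighted Gini index for a binary split of y."""
    y_left, y_right = y[left_mask], y[~left_mask]
    w_l, w_r = len(y_left) / len(y), len(y_right) / len(y)
    info_gain = entropy(y) - (w_l * entropy(y_left) + w_r * entropy(y_right))
    gini_index = w_l * gini(y_left) + w_r * gini(y_right)
    return info_gain, gini_index
```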
Claims: The scope of the invention is defined by the following claims:
1. A hybrid approach for prediction of heart disease using Informative Entropy-Based Random Forest, comprising the steps of:
a) performing pre-processing on the input data by removing noisy and irrelevant data, and extracting the features from the given input;
b) selecting the best relevant features from the extracted features by reducing the dimensions of the data, and performing training with the best features;
c) performing the classification using the model trained with the best relevant features.
2. The approach according to claim 1, wherein a Genetic Algorithm is used for pre-processing along with fusion and feature extraction.
3. The approach according to claim 1, wherein a novel Fast Track Gram Matrix-Principal Component Analysis (FTGM-PCA) approach is developed for dimensionality reduction.
4. The approach according to claim 1, wherein a novel hybrid Informative Entropy-Based Random Forest (IEB-RF) approach is developed for classification.
| # | Name | Date |
|---|---|---|
| 1 | 202541060947-REQUEST FOR EARLY PUBLICATION(FORM-9) [26-06-2025(online)].pdf | 2025-06-26 |
| 2 | 202541060947-FORM-9 [26-06-2025(online)].pdf | 2025-06-26 |
| 3 | 202541060947-FORM FOR STARTUP [26-06-2025(online)].pdf | 2025-06-26 |
| 4 | 202541060947-FORM FOR SMALL ENTITY(FORM-28) [26-06-2025(online)].pdf | 2025-06-26 |
| 5 | 202541060947-FORM 1 [26-06-2025(online)].pdf | 2025-06-26 |
| 6 | 202541060947-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [26-06-2025(online)].pdf | 2025-06-26 |
| 7 | 202541060947-EVIDENCE FOR REGISTRATION UNDER SSI [26-06-2025(online)].pdf | 2025-06-26 |
| 8 | 202541060947-EDUCATIONAL INSTITUTION(S) [26-06-2025(online)].pdf | 2025-06-26 |
| 9 | 202541060947-DRAWINGS [26-06-2025(online)].pdf | 2025-06-26 |
| 10 | 202541060947-COMPLETE SPECIFICATION [26-06-2025(online)].pdf | 2025-06-26 |