Abstract: Conventional models for classifying heart diseases in subjects require either a large number of insight parameters or a significant amount of realistic data as reference, and generation of realistic patient data is often challenging. Conventional methods have mostly targeted applications such as understanding disease progression, therapy planning, and personalized model generation as a digital twin. In all these applications, the physiological signals, mostly ECG, are generated as a byproduct of the functionality manifestation and not by a dedicated synthetic generator tool. Embodiments of the present disclosure provide a system and a method that implement a physics-based model for synthesizing Photoplethysmogram (PPG) templates pertaining to healthy as well as coronary artery disease (CAD) conditions by varying/modeling pathophysiological parameters, followed by a Variational Autoencoder (VAE) for generating a simulated CAD dataset to train a binary classifier for detecting CAD or non-CAD subjects based on the generated synthetic PPG data.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
GENERATION OF SYNTHETIC DATA USING
PHOTOPLETHYSMOGRAM SIGNALS FOR CLASSIFICATION OF
HEART DISEASE IN SUBJECTS
Applicant
Tata Consultancy Services Limited, a company incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD [001] The disclosure herein generally relates to synthetic data generation and disease classification, and, more particularly, to generation of synthetic data using photoplethysmogram signals for classification of heart disease in subjects.
BACKGROUND [002] Artificial Intelligence (AI) and Machine Learning (ML) models have become very popular in biomedicine, particularly in cardiology. Supervised machine learning algorithms are widely used for classifying biomedical signals such as ECG and PPG to identify cardiac diseases in subjects. However, a machine learning algorithm requires a large volume of training data, and recording of disease specific data is often challenging. Hence, artificially generated realistic data are used, yet generation of realistic patient data is itself difficult: traditional models require a large number of insight parameters, whereas deep learning (DL) based generative modeling approaches, such as Generative Adversarial Networks (GANs), require a significant amount of realistic data as reference.
SUMMARY [003] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a method for generation of synthetic data using photoplethysmogram signals for classification of heart disease in subjects. The method comprises receiving, via one or more hardware processors, a plurality of photoplethysmogram (PPG) signals corresponding to a plurality of subjects having coronary artery disease (CAD); generating, via a physics-based model executed by the one or more hardware processors, a plurality of simulated CAD specific PPG signals based on the plurality of photoplethysmogram (PPG) signals; generating, via a variational autoencoder (VAE) executed by the one or more hardware processors, a synthetic CAD specific PPG dataset based on a distribution of the plurality of simulated CAD specific PPG signals; and generating a CAD PPG
training dataset based on the plurality of simulated CAD specific PPG signals and the plurality of photoplethysmogram (PPG) signals.
[004] In an embodiment, a Kullback–Leibler (KL) divergence loss of the VAE is minimized such that a distribution of an encoded output generated by an encoder comprised in the VAE is similar to a unit Gaussian distribution.
[005] In an embodiment, the synthetic CAD specific PPG dataset corresponds to a predefined size.
[006] In an embodiment, the step of generating a synthetic CAD specific PPG dataset comprises modeling physiological behavior of the plurality of subjects comprised in the plurality of simulated CAD specific PPG signals.
[007] In an embodiment, the method further comprises training, via the one or more hardware processors, a binary classifier using the CAD PPG training dataset and a plurality of non-CAD PPG signals to obtain a trained binary classifier; and applying the trained binary classifier on a test PPG signal corresponding to a subject to classify the subject as a CAD subject or a non-CAD subject.
[008] In another aspect, there is provided a system for generation of synthetic data using photoplethysmogram signals for classification of heart disease in subjects. The system comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a plurality of photoplethysmogram (PPG) signals corresponding to a plurality of subjects having coronary artery disease (CAD); generate, via a physics-based model, a plurality of simulated CAD specific PPG signals based on the plurality of photoplethysmogram (PPG) signals; generate, via a variational autoencoder (VAE) executed by the one or more hardware processors, a synthetic CAD specific PPG dataset based on a distribution of the plurality of simulated CAD specific PPG signals; and generate a CAD PPG training dataset based on the plurality of simulated CAD specific PPG signals and the plurality of photoplethysmogram (PPG) signals.
[009] In an embodiment, a Kullback–Leibler (KL) divergence loss of the VAE is minimized such that a distribution of an encoded output generated by an encoder comprised in the VAE is similar to a unit Gaussian distribution.
[010] In an embodiment, the synthetic CAD specific PPG dataset corresponds to a predefined size.
[011] In an embodiment, the synthetic CAD specific PPG dataset is generated by modeling physiological behavior of the plurality of subjects comprised in the plurality of simulated CAD specific PPG signals.
[012] In an embodiment, the one or more hardware processors are further configured by the instructions to train a binary classifier using the CAD PPG training dataset and a plurality of non-CAD PPG signals to obtain a trained binary classifier; and apply the trained binary classifier on a test PPG signal corresponding to a subject to classify the subject as a CAD subject or a non-CAD subject.
[013] In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause a method for generation of synthetic data using photoplethysmogram signals for classification of heart disease in subjects. The method comprises receiving, via the one or more hardware processors, a plurality of photoplethysmogram (PPG) signals corresponding to a plurality of subjects having coronary artery disease (CAD); generating, via a physics-based model executed by the one or more hardware processors, a plurality of simulated CAD specific PPG signals based on the plurality of photoplethysmogram (PPG) signals; generating, via a variational autoencoder (VAE) executed by the one or more hardware processors, a synthetic CAD specific PPG dataset based on a distribution of the plurality of simulated CAD specific PPG signals; and generating a CAD PPG training dataset based on the plurality of simulated CAD specific PPG signals and the plurality of photoplethysmogram (PPG) signals.
[014] In an embodiment, a Kullback–Leibler (KL) divergence loss of the VAE is minimized such that a distribution of an encoded output generated by an encoder comprised in the VAE is similar to a unit Gaussian distribution.
[015] In an embodiment, the synthetic CAD specific PPG dataset corresponds to a predefined size.
[016] In an embodiment, the step of generating a synthetic CAD specific PPG dataset comprises modeling physiological behavior of the plurality of subjects comprised in the plurality of simulated CAD specific PPG signals.
[017] In an embodiment, the method further comprises training, via the one or more hardware processors, a binary classifier using the CAD PPG training dataset and a plurality of non-CAD PPG signals to obtain a trained binary classifier; and applying the trained binary classifier on a test PPG signal corresponding to a subject to classify the subject as a CAD subject or a non-CAD subject.
[018] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[019] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[020] FIG. 1 illustrates an exemplary block diagram of a system for generation of synthetic data using photoplethysmogram (PPG) signals for classification of heart disease in subjects, in accordance with an embodiment of the present disclosure.
[021] FIG. 2 depicts an exemplary block diagram of the system illustrating generation of synthetic data using photoplethysmogram signals for classification of heart disease in subjects, in accordance with an embodiment of the present disclosure.
[022] FIG. 3 illustrates an exemplary flow diagram of a method for generation of synthetic data using photoplethysmogram signals for classification of heart disease in subjects, using the system of FIGS. 1-2, in accordance with an embodiment of the present disclosure.
[023] FIG. 4A depicts a graphical representation of simulated electrocardiogram (ECG) templates, in accordance with an embodiment of the present disclosure.
[024] FIG. 4B depicts a graphical representation of simulated PPG templates (signals) for healthy and diseased condition, in accordance with an embodiment of the present disclosure.
[025] FIG. 5 depicts an exemplary block diagram of the Variational Autoencoder (VAE) as implemented by the system of FIGS. 1-2, in accordance with an embodiment of the present disclosure.
[026] FIG. 6A depicts a graphical representation illustrating distribution of feature ‘self-similarity’ (as an example) for the two classes (coronary artery disease (CAD) and non-CAD), in accordance with an embodiment of the present disclosure.
[027] FIG. 6B depicts a graphical representation illustrating number of features belonging to their corresponding CAD distributions as R increases, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS [028] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[029] Synthetic data is artificially generated data used to mimic real-world data while preserving selected properties of the original data. Synthetic data has been applied mostly in the domains of data privacy, speech processing, and healthcare. Synthetic data generation has recently emerged as a substitution technique for handling the bulk-data requirement of training machine learning (ML) algorithms, particularly for healthcare applications. Use of synthetic time-series data in healthcare has seen a surge in research interest due to its application in ambulatory monitoring sensors. Mostly, the efficacy of synthetic data is evaluated through the improvement of an ML algorithm when surrogate data is introduced into its training set. This requirement is more pronounced in the cardiac research domain, where quality cardiac physiological signals, namely ECG, PCG, and PPG, are always in demand for assessing cardiac health and providing algorithms and smart solutions spanning early disease detection to disease monitoring.
[030] For generating synthetic time series data pertaining to the cardiac domain, the PPG signal has received less attention compared to ECG and PCG, although it can give a broader picture of the health condition, including, for example, information on the cardiac, respiratory, and autonomic systems. PPG is a noninvasive optical technique that monitors and detects changes in blood volume in peripheral blood vessels, where the change in blood volume typically correlates with blood pressure, oxygen saturation, cardiac output, etc. Specifically, changes in blood volume caused by pressure variation in flow can be detected by illuminating the skin with infrared light. This facilitates clinicians to screen, detect, and monitor with ease the presence of cardiovascular diseases, specifically those linked with arterial irregularities such as atherosclerosis. In atherosclerosis, arterial diameters are reduced due to plaque formation by fatty deposits, which restricts blood flow and can be assessed through abnormal illumination patterns in the PPG. Coronary artery disease (CAD) is a subclass of atherosclerosis in which the blockage is mainly in the coronary arteries. Using a similar principle, use of the PPG signal has been reported in numerous studies for early screening of CAD based on morphological and statistical feature variation of the PPG signals.
Over the years, several approaches have been reported for modeling PPG signals, most of them based on fitting multiple Gaussian waveforms. However, the Gaussian approach does not provide the efficiency appropriate for daily monitoring of cardiac health. The most significant approach to PPG synthesis is based on stochastic modeling, where patient-specific PPG signals are produced along with a set of parameters that allow regeneration of statistically equivalent PPG signals by utilizing shape parameterization and a non-stationary model of the PPG signal's time evolution. This technique, however, focuses on developing a statistical characterization specifically suitable for ambulatory measurements, where data points are either corrupted by artifacts or missing in sections. Pathophysiological linkage and clinical use of the synthetic PPG signal have not been evaluated.
[031] In recent years, Generative Adversarial Networks (GANs) have been used in synthetic medical image and time series generation. GANs and their variations have been shown to generate very realistic instances but present some issues that may limit their applicability to the medical domain. First, they suffer from the so-called mode-collapse problem, tending to generate with high probability only some of the modes of the underlying distribution they are meant to emulate. A second major issue of GAN-based approaches is their poor interpretability. Being based on deep generative networks, it is difficult to assess why certain samples are generated rather than others, and a clear interpretation of the resulting generative model is typically missing. Use of a GAN network, although promising in certain healthcare domains, therefore does not provide an employable solution for generating a cardiac time series signal like PPG. Hence, despite its necessity, synthetic data generation in healthcare applications has received criticism from the research community, mainly because synthetic data is believed to replicate only specific properties of the data and is prone to bias. Moreover, a GAN itself requires a substantial amount of data for training; for an application that specifically aims at replicating quality data to substitute a bulk data requirement, the GAN approach is not optimal.
[032] Physical or computational models of the cardiac system may also serve as synthetic data generators that add the data interpretability lacking in all the other conventional techniques. Cardiac computational models have been developed in the past for various applications such as understanding disease progression, therapy planning, and personalized model generation as a digital twin. In all these applications, the physiological signals, mostly ECG, are generated as a byproduct of the functionality manifestation and not by a dedicated synthetic generator tool.
[033] In the present disclosure, a system and method are described for generating synthetic PPG data using a hybrid approach that combines a physical/physics-based model of the cardiovascular system (also referred to as a cardiac model) with statistical feature space selection and random sampling using a Variational Autoencoder, to improve classification of CAD and non-CAD data. The method of the present disclosure is based on an established in-silico cardiac model (known from prior research work) embedded with electrophysiology and hemodynamic functionality to generate synthetic PPG data pertaining to a healthy as well as a diseased class, the diseased class being CAD in the present disclosure. The system and method aim to generate quality PPG time series data with interpretability and statistical variation, starting from a small quantity of measured data. As a means to generate synthetic data, specific templates of the PPG signal related to healthy and CAD conditions were generated through the model. These templates were varied through pathophysiological features in the cardiac model and clustered into CAD and non-CAD groups based on statistical feature distribution. Based on physical parameter variation, a substantial number of synthetic templates can be generated, which can serve as substitute data for training machine learning algorithms. As a validation of the efficacy of the generated data, the synthetic PPG data was used as training data for CAD classification, and classifier performance is reported against the baseline of classifier performance with measured PPG data versus that with synthetic data, along with various combinations of real and synthetic data mixed together.
[034] Referring now to the drawings, and more particularly to FIGS. 1 through 6B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[035] FIG. 1 illustrates an exemplary block diagram of a system 100 for generation of synthetic data using photoplethysmogram (PPG) signals for classification of heart disease in subjects, in accordance with an embodiment of the present disclosure. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104 may be one or more software processing modules and/or hardware processors. In an
embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud, and the like.
[036] The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
[037] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 can be stored in the memory 102, wherein the database 108 may comprise, but is not limited to, a plurality of photoplethysmogram (PPG) signals corresponding to a plurality of subjects having coronary artery disease (CAD), a plurality of simulated CAD specific PPG signals, a synthetic CAD specific PPG dataset, a CAD PPG training dataset, a plurality of non-CAD PPG signals, information related to classification of subjects as CAD subjects or non-CAD subjects, and the like. The memory 102 further stores a physics-based model (e.g., also referred to as a cardiac model), a Variational Autoencoder (VAE), a binary classifier, and the like. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at
each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
[038] FIG. 2, with reference to FIG. 1, depicts an exemplary block diagram of the system 100 illustrating generation of synthetic data using photoplethysmogram signals for classification of heart disease in subjects, in accordance with an embodiment of the present disclosure.
[039] FIG. 3, with reference to FIGS. 1-2, illustrates an exemplary flow diagram of a method for generation of synthetic data using photoplethysmogram signals for classification of heart disease in subjects, using the system 100 of FIG. 1, in accordance with an embodiment of the present disclosure. In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to the components of the system 100 as depicted in FIG. 1, and the flow diagram of FIG. 3. In an embodiment of the present disclosure, at step 202, the one or more hardware processors 104 receive a plurality of photoplethysmogram (PPG) signals corresponding to a plurality of subjects having coronary artery disease (CAD). At step 204, the one or more hardware processors 104 generate, via a physics-based model, a plurality of simulated CAD specific PPG signals based on the plurality of photoplethysmogram (PPG) signals. In an embodiment, the physics-based model is also referred to as a cardiac model or cardiovascular model, and the terms are used interchangeably herein (e.g., for details on the physics-based model/cardiac model/cardiovascular model, refer to patent application number 201921029536 titled 'METHOD AND SYSTEM FOR PRESSURE AUTOREGULATION BASED SYNTHESIZING OF PHOTOPLETHYSMOGRAM SIGNAL' filed on 22-Jul-2019). The above steps 202 and 204 may be better understood by way of the following description.
[040] The developed in-silico cardiac model is a reduced-order lumped model consisting of three interlinked functional blocks, namely Electrophysiology (EP), Hemodynamics, and a simplified Central Nervous System (CNS) control implemented as a baroreflex controller (not shown in FIGS.). The EP block is responsible for initiation of cardiac contraction and the pulsatile behavior of the heart chambers through ECG signal generation, and drives the hemodynamic block. The hemodynamic block regulates cardiac circulation, both pulmonary and systemic, capturing the flow, pressure, and volume dynamics during the cardiac cycle. It is also coupled with central nervous system modulation in terms of a baroreflex control, which regulates pressure autonomously through sympathetic and parasympathetic interaction of heart rate, contractility, and systemic vascular resistance (e.g., refer "Mazumder et al. - O. Mazumder, D. Roy, S. Bhattacharya, A. Sinha, and A. Pal, Synthetic PPG generation from hemodynamic model with baroreflex autoregulation: a digital twin of cardiovascular system, in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, pp. 5489-5492, 2019." and "Roy et al. - D. Roy, O. Mazumder, A. Sinha, and S. Khandelwal, Multimodal cardiovascular model for hemodynamic analysis: Simulation study on mitral valve disorders, PLOS ONE 16(3): e0247921, https://doi.org/10.1371/journal.pone.0247921, 2021.")
[041] The EP module solves the forward problem of calculating body surface potential from a known heart potential. The known heart potential is the cardiac trans-membrane potential (TMP) for each cell in the myocardium, mathematically approximated using different cardiac source models over the myocardium. In the present disclosure, the physics-based model (also referred to as the cardiac source model and used interchangeably hereinafter) is expressed as an equivalent double layer (EDL) of sources on the closed surface of the atria and ventricles, implemented in ECGsim (e.g., refer A. Oosterom, T. Oostendorp, ECGSIM: an interactive tool for studying the genesis of QRST waveforms, Heart, vol. 90(2), pp. 165-8, 2004.). The source matrix (S) at node 'n' at time instant 't' is defined as: S(t; d, r) = D(t; d) R(t; r), where D is the depolarization phase, R is the repolarization phase, d denotes the timing of local depolarization at node 'n', and r denotes the timing of local repolarization at node 'n'. The cardiac surface is divided into 1500 such nodes. Based on the EDL source description, the local source strength at position 'x' on the surface of the myocardium (Sv) can be mapped to the potential f generated at location 'y' on the body surface as:

    f(y, t) = (1/4pi) * Integral over Sv of A(y, x) Vm(x, t) dw(y, x)

where A(y, x) is the transfer function expressing the volume conductor model, considering geometry and conductivity in the chest cavity, Vm is the local transmembrane potential at the heart surface, and dw(y, x) is the solid angle subtended at y by the surface element dS(x) of the myocardial node Sv. This is solved numerically using the boundary element method (BEM), and the potentials at the discretized body surface comprising the lead positions are expressed as:

    f = B S

where B is a transfer matrix, incorporating the solid angles subtended by source elements as viewed from the nodes of the triangulated surface. The resulting matrix f generates the body surface potential, a subset of which is the standard 12-lead ECG and single-lead ECG configuration.
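By way of illustration only, once the transfer matrix B has been assembled by a BEM solver, the forward mapping f = BS reduces to a matrix product. A minimal sketch in Python is given below; the names and dimensions are illustrative assumptions and not part of the disclosure:

    import numpy as np

    def body_surface_potential(B, S):
        # B: (n_leads x n_nodes) BEM transfer matrix encoding geometry and
        #    conductivity of the chest cavity (assumed precomputed).
        # S: (n_nodes x n_samples) EDL source matrix, S = D * R per node.
        # Returns f: (n_leads x n_samples) body surface potentials; a
        # subset of rows corresponds to the standard 12-lead ECG.
        return B @ S

    # Illustrative dimensions: 1500 myocardial nodes, 64 torso leads.
    rng = np.random.default_rng(0)
    f = body_surface_potential(rng.standard_normal((64, 1500)),
                               rng.standard_normal((1500, 500)))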
[042] The Hemodynamic block consists of a four-chambered heart with lumped pulmonary and systemic circulations. The pressure variations across the cardiac chambers are modulated through time-varying compliance functions. In addition, heart valves are modeled to replicate the functionality of each cardiac phase, capturing the pressure difference across the cardiac chambers to ensure unidirectional blood flow through the heart and maintain the pressure-volume dynamics. The dynamic equations replicating the pressure dynamics of the model at the various chambers and the pulmonary and aortic arteries can be represented by state space equations, depicting the flow variation due to resistance to blood flow from the vessel along with the compliance property of the chambers, and can be expressed as in Roy et al. as follows:

    Cla dPla(t)/dt = (Ppa(t) - Pla(t))/Rp - Umi (Pla(t) - Plv(t))/Rmi    (1)
    Clv dPlv(t)/dt = Umi (Pla(t) - Plv(t))/Rmi - Uao (Plv(t) - Psa(t))/Rao    (2)
    Csa dPsa(t)/dt = Uao (Plv(t) - Psa(t))/Rao - (Psa(t) - Pra(t))/Rs    (3)
    Cra dPra(t)/dt = (Psa(t) - Pra(t))/Rs - Utr (Pra(t) - Prv(t))/Rtr    (4)
    Crv dPrv(t)/dt = Utr (Pra(t) - Prv(t))/Rtr - Upu (Prv(t) - Ppa(t))/Rpu    (5)
    Cpa dPpa(t)/dt = Upu (Prv(t) - Ppa(t))/Rpu - (Ppa(t) - Pla(t))/Rp    (6)

[043] Here, Pla(t), Plv(t), Psa(t), Pra(t), Prv(t), and Ppa(t) are the pressure variables in the left-atrium, left-ventricle, systemic arteries, right-atrium, right-ventricle, and pulmonic arteries, respectively; Rmi, Rao, Rtr, and Rpu are the valvular resistances across the mitral, aortic, tricuspid, and pulmonary valves, respectively; (Rp, Cpa) and (Rs, Csa) are the pairs of resistance and compliance across the pulmonary and systemic vessels, respectively; and Cla, Clv, Cra, and Crv are the time-varying chamber compliances. The symbols Umi, Uao, Utr, and Upu are the control functions to mimic the opening or closing of the respective cardiac valves.
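A minimal numerical sketch of integrating state equations (1)-(6) is given below (Python, forward Euler). The compliance values, valve controls, and time step are placeholders supplied by the caller; they are not the calibrated values of the model:

    def step_pressures(P, C, U, R, dt):
        # One forward-Euler step of equations (1)-(6).
        # P: pressures {la, lv, sa, ra, rv, pa}; C: compliances at this
        # instant (time-varying for the chambers); U: valve controls
        # {mi, ao, tr, pu} in {0, 1}; R: resistances {mi, ao, tr, pu, s, p}.
        q_mi = U["mi"] * (P["la"] - P["lv"]) / R["mi"]   # mitral flow
        q_ao = U["ao"] * (P["lv"] - P["sa"]) / R["ao"]   # aortic flow
        q_s = (P["sa"] - P["ra"]) / R["s"]               # systemic return
        q_tr = U["tr"] * (P["ra"] - P["rv"]) / R["tr"]   # tricuspid flow
        q_pu = U["pu"] * (P["rv"] - P["pa"]) / R["pu"]   # pulmonary valve
        q_p = (P["pa"] - P["la"]) / R["p"]               # pulmonary return
        dP = {"la": (q_p - q_mi) / C["la"], "lv": (q_mi - q_ao) / C["lv"],
              "sa": (q_ao - q_s) / C["sa"], "ra": (q_s - q_tr) / C["ra"],
              "rv": (q_tr - q_pu) / C["rv"], "pa": (q_pu - q_p) / C["pa"]}
        return {k: P[k] + dt * dP[k] for k in P}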
[044] Coupling of the EP and hemodynamic blocks is through a compliance function, which determines the time-varying compliance of the auricles and ventricles and brings about the pumping action of the heart. The single-lead simulated ECG can be decomposed into its characteristic constituents such as PQ (auricular depolarization), QRS (ventricular depolarization), ST duration (ventricular repolarization), and the R-R interval. In a generic ECG signal for one cardiac cycle, these events are marked by a specific set of PQRST peaks whose amplitudes and time-instances are represented as [(Pp, Tp), (Pq, Tq), (Pr, Tr), (Ps, Ts), (Pt, Tt)] respectively. These changes were encoded to modulate the compliance function and timing information to control the synchronized operation of the four heart chambers (e.g., refer Roy et al.).
[045] Referring to the steps of FIG. 3, at step 206 of the present disclosure, the one or more hardware processors generate, via a variational autoencoder (VAE), a synthetic CAD specific PPG dataset based on a distribution of the plurality of simulated CAD specific PPG signals. The synthetic CAD specific PPG dataset corresponds to a predefined size (e.g., as shown in FIG. 2). The above step 206 can be better understood by way of the following description.
[046] The modeled PPG time-series captures the flow variation at the aorta linked to the peripheral measurement site at the fingertip and encodes the systemic flow during ventricular contraction along with the back-flow pulse, referred to as the dicrotic notch. The blood circulation in the extremities occurs due to the pressure fluctuations in the left-ventricle (Plv(t)), right-atrium (Pra(t)), and systemic arteries (Psa(t)), defined by equations (2)-(4). More specifically, it is based on the fact that during systole, as there is a sudden pressure enhancement in Plv(t), blood flows from the left-ventricle to the systemic arteries via the systemic vessels. On the other hand, in the diastolic phase, blood flows towards the right-atrium to refill it. Based on these two scenarios, the blood volume across the extremities alters, thus creating the PPG signal. This physiological behavior is modeled to generate the synthetic PPG P(t), and the function is expressed as:

    P(t) = k1 [Plv(t - td1) - Psa(t - td1)] + k2 [Psa(t - td2) - Pra(t - td2)]    (7)

where the cardiac parameters k1, k2 are the gains associated with the systolic and diastolic phases, having time-delays of td1 and td2 respectively. These parameters are modeled from measured real PPG data. In other words, the step of generating a synthetic CAD specific PPG dataset comprises modeling the physiological behavior (e.g., cardiac parameters k1, k2) of the plurality of subjects comprised in the plurality of simulated CAD specific PPG signals.
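A minimal sketch of the synthesis function of equation (7) is shown below (Python). The pressure traces are assumed to come from the hemodynamic block, and the circular delay used here is an illustrative simplification:

    import numpy as np

    def synth_ppg(p_lv, p_sa, p_ra, k1, k2, td1, td2, fs):
        # Equation (7): a systolic term gained by k1 (flow from the left
        # ventricle to the systemic arteries, delayed by td1) plus a
        # diastolic term gained by k2 (flow toward the right atrium,
        # delayed by td2). Delays are converted to sample shifts at fs Hz.
        n1, n2 = int(round(td1 * fs)), int(round(td2 * fs))
        systolic = np.roll(p_lv - p_sa, n1)
        diastolic = np.roll(p_sa - p_ra, n2)
        return k1 * systolic + k2 * diastolic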
[047] The measured PPG data consists of PPG signals recorded from 145 subjects at a hospital in Kolkata, India, using a non-medical-grade commercial pulse oximeter (CMS 50D+) at a sampling rate of 60 Hz. All the recorded signals were annotated using angiogram reports into 90 CAD and 55 non-CAD PPG time-series. The data collection protocol was approved by the hospital ethics committee and all participating subjects provided written consent (approval reference number 109(i)/EC/PI/2016 dated 10-Dec-2016). These data act as the reference to generate synthetic data and also as the parent dataset based upon which synthetic PPG data were modeled, exploded, and later used for CAD classification.
[048] The synthesized PPG signal obtained from the model does not exactly match the true (measured) PPG signal and has certain morphological variations. For proper morphological matching, the heart rate (HR) or cardiac cycle duration (T) of the true PPG (measured data) was linked to the cardiac compliance functions, and then the cardiac parameters (e.g., refer equation (7)) were optimized with respect to the measured PPG. The steps involving PPG function optimization are described explicitly in D. Roy, O. Mazumder, K. Chakravarty, A. Sinha, A. Ghose, and A. Pal, Parameter Estimation of Hemodynamic Cardiovascular Model for Synthesis of Photoplethysmogram Signal, 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 918-922, 2020. The optimization procedure was implemented using a multi-dimensional Particle-Swarm-Optimization (PSO) to find θ̂ (the parameter set) by minimizing the following objective function:

    J(θ) = Σt [p(t) - p̃(t; θ)]²    (8)

where p(t) is the true (measured) PPG and p̃(t; θ) is the simulated PPG for parameter set θ.
[049] The PSO heuristic, having Z particles, computes the error for all the particles in each iteration and searches for the minimum value (global best) θ̂ = argminθ J(θ), where θ̂ is the vector that minimizes the difference between the true and simulated signals. Once the optimized values of θ̂ are identified, they are fed into the PPG-synthesis function to generate the optimized PPG signal p̃(t).
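A minimal, self-contained PSO sketch for minimizing equation (8) is shown below (Python); the swarm hyper-parameters and parameter bounds are illustrative assumptions, not the values used in the disclosure:

    import numpy as np

    def pso_fit(objective, bounds, n_particles=30, n_iter=100, seed=0):
        # bounds: (dim x 2) array of [low, high] per parameter, e.g., for
        # theta = (k1, k2, td1, td2) of equation (7).
        rng = np.random.default_rng(seed)
        lo, hi = bounds[:, 0], bounds[:, 1]
        x = rng.uniform(lo, hi, (n_particles, len(lo)))       # positions
        v = np.zeros_like(x)                                  # velocities
        pbest = x.copy()
        pbest_val = np.array([objective(p) for p in x])
        g = pbest[pbest_val.argmin()]                         # global best
        for _ in range(n_iter):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
            x = np.clip(x + v, lo, hi)
            val = np.array([objective(p) for p in x])
            better = val < pbest_val
            pbest[better], pbest_val[better] = x[better], val[better]
            g = pbest[pbest_val.argmin()]
        return g  # theta_hat = argmin J(theta)

    # Usage sketch, with simulate_ppg standing in for the model pipeline:
    # theta_hat = pso_fit(lambda th: np.sum((ppg_true -
    #     simulate_ppg(th)) ** 2), bounds=np.array([[0, 5], [0, 5],
    #     [0, 0.5], [0, 0.5]]))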
Healthy and diseased template generation:
[050] The synthetic PPG template generated as described above is a generic template that can be modified by introducing a pathological condition in the cardiac in-silico model. In the present disclosure, the system and method focus on generating a PPG template for CAD. CAD is caused by plaque formation in the coronary arteries, which reduces blood flow and deprives the surrounding tissues of oxygen supply. CAD is commonly associated with an ischemic effect, due to scar generation in the myocardial tissues in the area surrounding the plaque formation (e.g., refer "T. Gaziano, A. Bitton, S. Anand, S. Gessel and A. Murphy, Growing epidemic of coronary heart disease in low- and middle-income countries, Current Problems in Cardiology 35: 72-115, 2010."). In the cardiac model of the present disclosure, the effect of CAD was simulated through scar formation in the myocardial tissues as well as a reduction in systemic vessel diameter to replicate the effect of narrowing of the artery and the ischemic effect.
[051] The ischemic effect of CAD was simulated in the EP block as an occlusion in the left anterior descending artery (LAD), affecting the apical anterior and antero-septal areas of the heart. Oxygen deprivation resulting in scar formation was modeled through alteration of the ionic concentration at the cell level, which manifests itself in the form of the action potential or TMP on the cardiac surface (e.g., refer "R. M. Shaw, Y. Rudy, Electrophysiologic effects of acute myocardial ischemia: a theoretical study of altered cell excitability and action potential duration, Cardiovasc Res., vol. 35, pp. 256-272, 1997."). Specifically, the scar tissue shows a marked reduction of action potential amplitude corresponding to a reduction in strength of the affected area of the myocardium, decreased propagation velocity, and reduced repolarization time (e.g., refer "B. Rodriguez, N. Trayanova and D. Noble, Modeling Cardiac Ischemia, Ann N Y Acad Sci. 1080: pp. 395-414, 2006."). These effects were encoded in the model by varying the TMP generation function in the EP block, e.g., changing the repolarization time, maximum amplitude, depolarization time, etc. The dimensions of these scar tissues can also be varied to generate disease gradation and severity. In the present disclosure, the healthy condition refers to a perfect myocardium without any scar, whereas diseased refers to a scar tissue of 30 mm size with a velocity reduction of 50% in the affected area, along with a 20% reduction in TMP amplitude and repolarization time. The forward EP pipeline computes the ECG template for the diseased case under consideration, which is thereafter fed to the hemodynamic block to generate the hemodynamic parameters.
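The scar and conduction parameters quoted above can be collected into simple configuration objects so that disease gradation reduces to sweeping values; the field names below are illustrative, not part of the disclosure:

    # Healthy: perfect myocardium; CAD: 30 mm scar, 50% conduction velocity
    # reduction, 20% reduction in TMP amplitude and repolarization time.
    HEALTHY = {"scar_size_mm": 0, "velocity_reduction": 0.0,
               "tmp_amplitude_reduction": 0.0, "repol_time_reduction": 0.0}
    CAD = {"scar_size_mm": 30, "velocity_reduction": 0.5,
           "tmp_amplitude_reduction": 0.2, "repol_time_reduction": 0.2}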
[052] The effect of narrowing of the systemic vessels due to plaque formation was modeled in the hemodynamic block. Pathophysiologically, a decrease in vessel diameter results in an increase in systemic resistance (Rs) to blood flow. Healthy PPG, with normal flow and pressure profiles, was modeled with the no-scar condition and a systemic resistance value of 'R'. The CAD condition was expressed as a myocardium with scar tissue along with a systemic resistance value of '2R'. It is to be noted that the value of 'R' as well as the dimension of the scar tissue can be varied, giving rise to a distribution representing different subjects. A PPG template for each such distribution can be generated by the model. Simulated ECG and PPG templates for healthy and diseased (CAD) conditions as generated by the model are shown in FIGS. 4A and 4B. More specifically, FIG. 4A, with reference to FIGS. 1 through 3, depicts a graphical representation of simulated electrocardiogram (ECG) templates, in accordance with an embodiment of the present disclosure. In FIG. 4A, the generated ECG captures the pathophysiological changes observed due to the supply-demand insufficiency of oxygen resulting from narrowing of the coronary arteries. The diseased template has a marked increase in 'ST segment' amplitude, which is a characteristic of ischemic conditions (e.g., refer "M. Maclachlan, J. Sundnes, and G. Lines, Simulation of ST segment changes during subendocardial ischemia using a realistic 3-D cardiac geometry, IEEE Trans Biomed Eng, vol. 52, pp. 799-807, 2005."). Similarly, for the model-generated PPG signal, the diseased templates show volumetric flow reduction in both systolic and diastolic flow due to the reduction in arterial diameter and an increased resistance to flow (e.g., refer FIG. 4B). More specifically, FIG. 4B, with reference to FIGS. 1 through 4A, depicts a graphical representation of simulated PPG templates (signals) for healthy and diseased conditions, in accordance with an embodiment of the present disclosure.
[053] Referring to the steps of FIG. 3, at step 208 of the present disclosure, the one or more hardware processors 104 generate a CAD PPG training dataset based on the plurality of simulated CAD specific PPG signals and the plurality of photoplethysmogram (PPG) signals. The above step 208 may be better understood by way of the following description.
Data augmentation via Variational Autoencoder:
[054] From the measured parent dataset of 90 real CAD subjects, the physical or physics-based model is applied to generate a total of 1800 CAD specific PPG signals, and out of them, 70% of the data, amounting to 1260 PPG segments, were used to form the CAD population of the training set to design a classifier for identifying CAD and non-CAD recordings. The dataset after augmentation, however, introduces severe skewness in the training set, as the non-CAD population is much smaller than the CAD population. Randomly under-sampling the CAD class to balance the dataset may not represent the entire distribution of the CAD population of the large synthetic data created by the physical model. Hence, the system 100 implements the Variational Autoencoder (VAE) structure to simulate a desired number of feature vectors from the CAD population to train a binary classifier on a balanced dataset for detecting CAD and non-CAD subjects. The Variational Autoencoder (VAE) is an improved version of a conventional autoencoder, where the latent space is represented in a probabilistic manner (e.g., refer "Kingma, Diederik P., and Max Welling, An introduction to variational autoencoders, arXiv preprint arXiv:1906.02691, 2019."). A block diagram of a VAE is shown in FIG. 5. More specifically, FIG. 5, with reference to FIGS. 1 through 4B, depicts an exemplary block diagram of the Variational Autoencoder (VAE) as implemented by the system 100 of FIGS. 1-2, in accordance with an embodiment of the present disclosure. Instead of mapping the input vector x to a static latent vector, the encoder's job is to convert it into a Gaussian probability density function over the latent variable z, so that new samples can be drawn from the same for new data generation. The objective function of a VAE maximizes the marginal likelihood pθ(x), where

    log pθ(x) = DKL(qϕ(z|x) || pθ(z|x)) + L(θ, ϕ; x)    (9)

In theory, this is equivalent to maximizing the Evidence Lower Bound (ELBO) L, as shown in the below equations:

    L(θ, ϕ; x) = E over qϕ(z|x) of [log pθ(x, z) - log qϕ(z|x)]    (10)

    L(θ, ϕ; x) = E over qϕ(z|x) of [log pθ(x|z)] - DKL(qϕ(z|x) || pθ(z))    (11)

where qϕ(z|x) is the approximate posterior, pθ(z) is the prior distribution of the latent variable z, and pθ(x|z) is the likelihood of x given z. Here, the prior pθ(z) = N(z; 0, 1). The term DKL(· || ·) measures the Kullback-Leibler divergence (KL divergence) between two distributions. The first term in the loss function of equation (11) indicates the reconstruction of x from the posterior distribution qϕ(z|x) and the likelihood pθ(x|z). The second term ensures that the posterior distribution is similar to the prior Gaussian distribution. In other words, the Kullback-Leibler (KL) divergence loss of the VAE is minimized such that the distribution of the encoded output generated by the encoder comprised in the VAE is similar to a unit Gaussian distribution (e.g., refer "Diederik P. Kingma and Max Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114, 2013."). When the VAE is trained on a large number of training data, the encoded distribution is enforced to be similar to a unit Gaussian distribution via minimizing the KL divergence. Hence, new samples can be synthetically generated by drawing samples from the unit Gaussian distribution and applying them to the decoder block.
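For completeness, when the approximate posterior is a diagonal Gaussian qϕ(z|x) = N(z; μ, σ²) and the prior is the unit Gaussian N(z; 0, 1), the KL term of equation (11) has the well-known closed form

    DKL(N(μ, σ²) || N(0, 1)) = (1/2) Σj (μj² + σj² - ln σj² - 1)

which is the quantity driven towards zero during training so that the encoded distribution approaches the unit Gaussian.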
[055] When a binary classifier is trained, the classes should have equal representation in the training set; otherwise, the trained model might get biased towards the majority class. In a practical scenario, the prevalence of CAD is less than 10% in any population, independent of demography (e.g., refer "M. Khan, M. J. Hashim and S. Ali Hussain Lootah, Global Epidemiology of Ischemic Heart Disease: Results from the Global Burden of Disease Study, Cureus, vol. 12(7), 2020, e9349, doi: 10.7759/cureus.9349."). However, this statistic does not hold true for the CAD positivity rate among the total angiograms done in a hospital every day. The present disclosure performed the data collection drive in an urban hospital setup, where most of the observed subjects underwent coronary angiogram because they had some kind of cardiac issue. Hence, the prevalence of CAD in the dataset is found to be above 50%. After applying the physical/physics-based model on the training part of the CAD population, the 11-dimensional feature vector described below (under 'Features for classification') was extracted from the simulated data. Subsequently, the input feature vector obtained from the simulated CAD data is fed to the VAE network. The encoder and the decoder of the VAE architecture of FIG. 5 contain a single hidden layer each. The neurons were activated using the non-linear ReLU function. The objective function of the VAE represents the latent space in a probabilistic manner, which is enforced to be similar to a prior Gaussian distribution of zero mean and unit standard deviation. The latent vector is generated by picking samples from the latent space. The decoder maps the encoded vector back to the original feature space. Hence, when the VAE is trained, the latent space from the input becomes similar to the prior distribution. This helps to create new training instances, firstly by drawing random samples from the prior distribution and then converting them to the input space by applying them to the decoder.
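A minimal sketch of such a VAE over the 11-dimensional feature vector is given below (Python/PyTorch); the hidden and latent sizes are illustrative assumptions, since the disclosure only specifies a single hidden layer with ReLU activations:

    import torch
    import torch.nn as nn

    class FeatureVAE(nn.Module):
        # Encoder and decoder with a single hidden layer each, ReLU
        # activations, and a Gaussian latent space over 11 input features.
        def __init__(self, in_dim=11, hid=32, z_dim=4):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
            self.mu = nn.Linear(hid, z_dim)
            self.logvar = nn.Linear(hid, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, hid), nn.ReLU(),
                                     nn.Linear(hid, in_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            # Reparameterization trick: z = mu + sigma * epsilon.
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            return self.dec(z), mu, logvar

    def vae_loss(x, x_hat, mu, logvar):
        # Negative ELBO of equation (11): reconstruction + KL to N(0, 1).
        recon = ((x - x_hat) ** 2).sum(dim=1)
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)
        return (recon + kl).mean()

    # After training, new CAD feature vectors are drawn from the prior:
    # new_features = model.dec(torch.randn(n_new, 4))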
[056] Further, once the training dataset is generated, the one or more hardware processors 104 train a binary classifier (e.g., a Support Vector Machine (SVM)) using the CAD PPG training dataset and a plurality of non-CAD PPG signals to obtain a trained binary classifier. Once the trained binary classifier is obtained, it is applied on a test PPG signal corresponding to a subject to classify the subject as a CAD subject or a non-CAD subject. The above steps of training and applying the trained binary classifier for classification of the subject as the CAD subject or the non-CAD subject are better understood by way of the following description.
[057] Synthetic PPG templates generated for both the healthy and CAD distributions in the physical/physics-based model were exploded to generate multiple PPG templates, starting from the measured parent PPG dataset. Out of the 90 CAD and 55 non-CAD measured PPG time series, 70 CAD PPG and 35 non-CAD PPG data were used as training data; the rest were kept aside for testing. From the initial 70 PPG data, the physical model was used to modify the distribution with pathophysiological variation (varying scar size and systemic resistance), and an enhanced CAD PPG distribution of over 1000 templates was generated. The Variational Autoencoder (VAE) structure was used to simulate a desired number of feature vectors from the CAD population to train a binary classifier for detecting CAD and non-CAD subjects. The subsequent sections explain the data explosion mechanism from the PPG template and the implementation of the VAE, followed by classification.
Features for classification:
[058] The feature set for classification comprised eleven statistical features extracted from each of the simulated PPG template time series. These features contain important statistical information regarding inherent properties of a signal that may help in discriminating a healthy subject from a diseased one. Features were calculated both on the raw time series data and on the time series after de-trending and de-seasonalizing (TSA) for a precise and comprehensive calibration. The methodology of feature extraction follows the path discussed in the research work (e.g., "R. Banerjee, S. Bhattacharya, S. Alam, Time series and morphological feature extraction for classifying coronary artery disease from photoplethysmogram, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 950-954, 2018."). The features used were Trend (TSA), Seasonality (TSA), Serial Correlation (TSA), Nonlinearity (raw and TSA), Skewness (raw and TSA), Kurtosis (raw and TSA), Self-similarity (raw), Periodicity (raw), Average Maharaj distance (raw), number of direction changes (raw), and the like. A few of the simpler features are sketched in code below.
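For illustration only, some of these features can be computed as follows (Python); the full set, including trend, seasonality, self-similarity, and the average Maharaj distance, follows Banerjee et al. and is not reproduced here:

    import numpy as np
    from scipy.stats import skew, kurtosis

    def basic_stat_features(x):
        # x: one PPG segment (raw, or de-trended/de-seasonalized for TSA).
        return {
            "skewness": skew(x),
            "kurtosis": kurtosis(x),
            "serial_correlation": np.corrcoef(x[:-1], x[1:])[0, 1],
            "direction_changes":
                int(np.sum(np.diff(np.sign(np.diff(x))) != 0)),
        }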
[059] These eleven features, for both CAD and non-CAD data, were used to define a statistical distribution through a Gaussian kernel of the form

    K(x; h) proportional to exp(-x² / (2h²))

Mathematically, a kernel is a positive function K(x; h) which is controlled by the bandwidth parameter h. Given this kernel form, the density estimate at a point y within a group of points xi, i = 1, ..., p, is given by:

    ρK(y) = Σ over i = 1..p of K(y - xi; h)

The feature set (i-th feature) for the CAD and non-CAD groups is denoted as FiC and FiNC, respectively, where i = 1, 2, ..., 11. A Gaussian kernel is fitted to each of these features to obtain a non-parametric distribution. To discriminate CAD and non-CAD features, the similarity between FiC and FiNC needs to be calculated, because if the classes show similar features, the two classes cannot be discriminated. This necessitates a measure of dissimilarity between two distributions, and the present disclosure used the Matusita distance to measure the likeness, or lack of it, between FiC and FiNC (e.g., refer "S. Bhattacharya, O. Mazumder, D. Roy, A. Sinha and A. Ghose, Synthetic Data Generation Through Statistical Explosion: Improving Classification Accuracy of Coronary Artery Disease Using PPG, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1165-1169, 2020.").
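A compact sketch of this comparison is shown below (Python): each feature's group-wise distribution is estimated with a Gaussian kernel and the Matusita distance is evaluated on a common grid; the grid size and SciPy's default bandwidth are assumptions:

    import numpy as np
    from scipy.stats import gaussian_kde

    def matusita_distance(f_cad, f_ncad, grid_size=512):
        # Kernel density estimates of one feature for the two groups,
        # discretized to probability masses on a shared grid.
        grid = np.linspace(min(f_cad.min(), f_ncad.min()),
                           max(f_cad.max(), f_ncad.max()), grid_size)
        p = gaussian_kde(f_cad)(grid)
        q = gaussian_kde(f_ncad)(grid)
        p, q = p / p.sum(), q / q.sum()
        # Matusita distance: sqrt(sum((sqrt(p) - sqrt(q))^2)).
        return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))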
[060] For evaluating the CAD feature distribution variation with change in the physical model, it is important to check how many CAD (non-CAD) features out of these eleven tally with the simulated feature distribution of the CAD (non-CAD) data. From the base synthetic PPG template, the system 100 generated gradually increasing disease templates by changing the equivalent systemic resistance from 'R' for a healthy subject to '4R' for a CAD subject in steps of '0.25R'. For each of these templates, the eleven features were extracted and their Matusita distance was calculated from the corresponding distribution generated by the simulated features. FIG. 6A, with reference to FIGS. 1 through 5, depicts a graphical representation illustrating the distribution of the feature 'self-similarity' (as an example) for the two classes (coronary artery disease (CAD) and non-CAD), in accordance with an embodiment of the present disclosure. The black dot in the plot refers to the value of the 'self-similarity' feature in the generated signal; the plot shows that the point has a higher probability of belonging to the CAD set. FIG. 6B, with reference to FIGS. 1 through 6A, depicts a graphical representation illustrating the number of features belonging to their corresponding CAD distributions as R increases, in accordance with an embodiment of the present disclosure. From 3.5R onward, the number becomes more or less stable, with 8 out of 11 features tallying with the ground truth; hence the disease group variation is limited to 3.5R templates. The model thus generated PPG CAD templates varying the systemic resistance between 1.8R and 3.5R along with scar tissue, and the non-CAD group was simulated for 0.8R to 1.2R.
[061] The 11 features mentioned above were used to train a binary Support Vector Machine (SVM) classifier having a Radial Basis Function (RBF) kernel for detection of CAD and non-CAD subjects. 20 CAD subjects were randomly selected along with an equal number of non-CAD subjects and separated for test purposes only, and the remaining portion of the dataset was kept for training. Initially, the system 100 assumed a baseline situation by selecting a subset from the original training set where the class ratio is close to 1: 30 CAD patients and 35 non-CAD patients were selected to train the SVM classifier for CAD and non-CAD, and the classifier was evaluated on the test set. In the present disclosure, the disease detection performance is reported in terms of the sensitivity and specificity of detecting CAD.
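A minimal training sketch (Python/scikit-learn) is shown below; feature standardization is an added assumption not stated in the disclosure, and X_train/y_train stand for the 11-feature matrix and the CAD/non-CAD labels:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # RBF-kernel SVM over the eleven features; y in {0: non-CAD, 1: CAD}.
    # The hyper-parameters studied later (C = 64, varying Gamma) are
    # passed directly to SVC.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=64, gamma=1.0))
    # clf.fit(X_train, y_train)
    # Sensitivity and specificity follow from the confusion matrix on the
    # held-out 20 CAD + 20 non-CAD test subjects.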
Improvement on classification performance due to data augmentation:
[062] The classification performance of the baseline classifier is shown in Table 1. Although the baseline classifier can accurately detect all the non-CAD subjects in the test set, a large number of CAD patients are missed in the process. The discriminating features for identifying the CAD population are complex and can often be intermittent across patients. A small training set of 30 patients might not be sufficient to capture the distribution for CAD. Hence, the trained classifier fails to detect a large number of CAD patients. An obvious solution to this problem is to enrich the CAD population in the training set by adding more patient data, as shown in Table 2.

Table 1: Performance of the baseline classifier, trained on a balanced dataset.

Number of CAD and non-CAD subjects in training set    Sensitivity    Specificity
CAD = 30, non-CAD = 35                                0.65           1
Table 2: Improvement of the baseline classifier by adding more real patients.

Number of CAD and non-CAD subjects in training set    Sensitivity    Specificity
CAD = 40, non-CAD = 35                                0.75           1
CAD = 50, non-CAD = 35                                1              0.95
CAD = 60, non-CAD = 35                                1              0.95
CAD = 70, non-CAD = 35                                1              0.95
[063] As expected, the sensitivity of the classifier improves with an increasing number of patients. The hypothesis has been tested on a very small dataset; in a practical scenario, a dataset can be very large in volume. Increasing the real patient data is not always a practical solution due to the time and cost involved in the process, and this dictates the need for artificial data simulation. Synthetic PPG generated from the physics-based model (e.g., also referred to as the cardiac model) can be used to generate a pool of patient data, but incorporating all of them in the training set can introduce a bias in the model. Hence, a desired number of samples were selected as representatives from the large volume of data generated by the physiological model, these selected samples were added to the CAD population of the baseline training dataset, and the classifier was retrained. More specifically, the desired number of samples were randomly selected from the pool. The classification performance is reported in Table 3. Finally, the method of the present disclosure was applied, where the representative samples were drawn by the VAE model. As shown in Table 4, the method of the present disclosure outperforms the random selection based approach (Table 3) and yields a performance close to what is achieved using the real dataset (Table 2).
[064] Tables 3 and 4 summarize the performance improvement obtained by the data augmentation approach of the present disclosure on top of the accuracy reported by the baseline classifier in Tables 1 and 2. The classifiers reported in Tables 1 and 2 are trained on real CAD and non-CAD subjects, whereas in Tables 3 and 4 the training dataset is improved by simulating CAD data from a small subset of the real CAD population of 30 subjects via data augmentation. This is a practical scenario, as the data augmentation approach is primarily chosen due to a lack of sufficient real data in the training dataset.
Table 3: Improvement of the baseline classifier using a training dataset having real and simulated CAD patients using physical modeling and random sampling.

Number of CAD and non-CAD subjects in training set       Sensitivity    Specificity
CAD = 50 (real 30 and simulated 20), non-CAD = 35        0.5            1
CAD = 90 (real 30 and simulated 60), non-CAD = 35        0.7            0.85
CAD = 130 (real 30 and simulated 100), non-CAD = 35      0.8            0.8
Table 4: Improvement of the baseline classifier by using a dataset having real and simulated CAD patients based on the VAE from the physical model.

Number of CAD and non-CAD subjects in training set       Sensitivity    Specificity
CAD = 50 (real 30 and simulated 20), non-CAD = 35        0.7            1
CAD = 90 (real 30 and simulated 60), non-CAD = 35        1              0.9
CAD = 130 (real 30 and simulated 100), non-CAD = 35      1              0.8
[065] The present disclosure further investigated the feasibility of the method described herein to enhance the classification performance of the existing classifier shown in Table 2 via data augmentation. The present disclosure and its system and method used various combinations of real and simulated CAD data to train the classifier. It is to be noted that, unlike Table 4, here the pool of exploded data was created on the entire CAD population of 70 patients. Heatmaps (not shown in FIGS.) showed the effect of data augmentation using synthetic data; augmentation with synthetic data outperforms the accuracy reported by the baseline classifier (using all 70 real CAD patients for training in Table 2). This indicates the importance of data augmentation, obtained through a combination of real and simulated data, in improving the classification performance. As the amount of simulated data in the training data is increased, the sensitivity tends to increase, whereas the specificity reduces. However, there are combinations of real (30 or 40 in number) and simulated (20 or 30 in number) PPG data where the F1 score is 1. Thus, the present disclosure demonstrates that even though using the whole of the 70 real CAD data for training a model does not give an F1 score of 1, a combination of real and simulated data is able to. This implies that the physical cardiac model based simulated data truly augments the characteristics of the CAD data observed in the limited real PPG signals, thus improving the test accuracy.
Impact of hyper-parameter tuning:
[066] Next, the present disclosure investigated the effect of hyper-parameters on the SVM classifier performance for various combinations of real and simulated CAD PPG training data. There are two parameters that largely determine the classification performance of an SVM: 'C' and 'Gamma'. The parameter C adds a penalty for each misclassified instance in the dataset. For a small value of C, the penalty for misclassified instances is low, so a decision boundary with a large margin is chosen at the expense of a high misclassification rate. If a large value of C is chosen, the SVM tries to minimize the number of misclassified instances by adding a high penalty, which results in a decision boundary with a smaller margin. The second parameter, Gamma, is related to the inverse of the radius of influence of the samples selected by the model as support vectors. A large value of Gamma shrinks the radius of the area of influence of each support vector to include only the support vector itself, whereas a small value causes the model to be too constrained, so that it cannot capture the complexity of the data. Hence, in the present disclosure, the influence of the features for a particular Gamma and its effect on the classifier performance was studied. Here, 'C = 64' was kept fixed, which was found to give the best results for all the combinations of real and simulated CAD data, and Gamma was varied to study the classifier performance (not shown in FIGS.). It was observed that for the combination of real and simulated CAD data, the F1-score attains 1 for both low (< 0.75) and high (1.25-1.75) values of Gamma. This indicates that the classifier is able to capture the complexity of the training data when the combination of real and simulated data is used, and hence attains a better performance.
[067] The area under the curve (AUC) of the F1-score over different values of Gamma gives an indication of how well a classifier is performing: the higher the value, the better the performance. For different combinations of real and simulated training data, the normalized AUC values of the F1-score are shown in Table 5.
Table 5: Normalized AUC of F1-score obtained for the same test data where the SVM classifier is trained with different combinations of real (R) and simulated (S) CAD data and 35 non-CAD data.

| Training combination | 30R+10S | 30R+20S | 30R+30S | 30R+40S | 70R+0S |
|---|---|---|---|---|---|
| Normalized AUC of F1-score | 0.664 | 0.955 | 0.905 | 0.928 | 0.891 |
[068] It can be seen from Table 5 above that for the combination 30R+20S, where the training data comprises 30 real and 20 simulated data, the AUC is the highest (0.955), whereas it is 0.891 for the model trained with all 70 real CAD data. In fact, the AUCs for 30R+30S and 30R+40S are also higher than that of 70R+0S. Thus, it is noted by the present disclosure that the classifier performance improves when a combination of real and simulated data is used rather than only real CAD data, as demonstrated using the hyper-parameter (Gamma) of the classifier.
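The Gamma sweep and the normalized AUC of the F1-score discussed above may be sketched as follows; the Gamma grid and the trapezoidal normalization are illustrative assumptions, as the present disclosure does not fix either.

```python
# Illustrative sketch of the Gamma sweep: C is fixed at 64 and Gamma is
# varied; the normalized area under the F1-vs-Gamma curve serves as a
# robustness indicator. The Gamma grid is an assumption for illustration.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def f1_auc_over_gamma(X_train, y_train, X_test, y_test,
                      gammas=np.linspace(0.25, 2.0, 8), C=64.0):
    f1s = []
    for g in gammas:
        clf = SVC(C=C, gamma=g, kernel="rbf").fit(X_train, y_train)
        f1s.append(f1_score(y_test, clf.predict(X_test)))
    # Normalize the trapezoidal area by the Gamma range so that a
    # classifier attaining F1 = 1 for every Gamma scores exactly 1.
    return np.trapz(f1s, gammas) / (gammas[-1] - gammas[0])
```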
[069] In the present disclosure/application, the system and method have demonstrated the generation of synthetic PPG using an in silico cardiac model, followed by its efficacy in improving a machine learning algorithm for the classification of CAD. Due to the scarcity of patient data, such synthesis is of immense importance in the medical domain. Moreover, CAD specific markers are often not prominent in a PPG waveform of very short duration. Several clinical studies reveal that at least 2 minutes of PPG data is required to extract the relevant CAD specific features using HRV and pulse morphology analysis. Generating PPG signals of longer duration using an LSTM or CNN based VAE typically requires more computational power (e.g., refer “Wei, Qiong and Dunbrack Jr, Roland L, The role of balanced training and testing data sets for binary classifiers in bioinformatics, PloS one, vol:8(7), pp: 63-67, 2013.”). Additionally, the performance of any machine learning algorithm depends not only on the amount of data but also on the diversity of the training data and the divergence in the distribution of the test data as compared to the training dataset (e.g., refer “A. Tucker, Z. Wang, Y. Rotalinti et al., Generating high-fidelity synthetic patient data for assessing machine learning healthcare software, npj Digit. Med., vol:3(147), 2020. https://doi.org/10.1038/s41746-020-00353-9”). The usage of the cardiac/physics-based model followed by sampling using the VAE is aimed to address all such needs. Its primary application is to generate synthetic PPG data pertaining to healthy and diseased conditions while incorporating the patho-physiological conditions associated with CAD. Blind statistical approaches for data explosion, though they attempt to generate large and diverse datasets, do not take care of the underlying physiological principles. Similarly, standalone deep learning approaches for synthetic data generation using Generative Adversarial Networks, though they attempt to address the divergence between the test and training data (e.g., refer “Diederik P Kingma and Max Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114, 2013”), lack interpretability. In CAD, the partial or complete blockage of certain coronary arteries, depending on the severity, leads to myocardial ischemia. Moreover, pathological intimal thickening in the systemic vessels, observed in atherosclerosis, is also associated with CAD (e.g., refer “M. Khan, M. J. Hashim and S. Ali Hussain Lootah, Global Epidemiology of Ischemic Heart Disease: Results from the Global Burden of Disease Study, Cureus, vol 12(7), 2020, e9349. doi: 10.7759/cureus.9349”). Thus, in the present disclosure, the physical EP driven hemodynamic model synthesizes the CAD PPG waveforms by tuning certain parameters, which are analogous to arterial stiffness due to excess cholesterol deposition, peripheral compliance, and other physiological properties. Incorporating such properties, which are clinically known to cause CAD, not only helps in obtaining diversity in the training data but also makes the synthesized data clinically acceptable. The current model can be upgraded by incorporating more physiological parameters, an upgrade which cannot be performed by a traditional deep learning approach.
[070] The synthetic PPG data was used to augment the PPG data captured from real patients. Such real data are prone to various noises, whether due to the sensor or to its usage during data capture. Statistical information of such PPG signals was extracted to generate features which are robust enough to handle sensor noise (e.g., refer “R. Banerjee, A. Ghose, A. Dutta Choudhury, A. Sinha, and A. Pal, Noise cleaning and Gaussian modeling of smart phone photoplethysmogram to improve blood pressure estimation, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp: 967-971, 2015.”). Ranking of the important features was performed by analyzing the distribution of the features obtained from the CAD and non-CAD data. Such an analysis aids in obtaining better separation between the CAD and non-CAD features.
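The exact ranking statistic is not specified herein; the following sketch assumes a two-sample Kolmogorov–Smirnov statistic as one plausible measure of the distributional separation between CAD and non-CAD features, and the function name rank_features is hypothetical.

```python
# Hypothetical sketch of distribution-based feature ranking: each feature
# is scored by how far apart its CAD and non-CAD empirical distributions
# are. The KS statistic is an assumed choice of separation measure.
import numpy as np
from scipy.stats import ks_2samp

def rank_features(cad_feats, non_cad_feats):
    """cad_feats, non_cad_feats: (n_samples, n_features) arrays.
    Returns feature indices ordered from most to least separating,
    along with the per-feature KS separation scores."""
    scores = np.array([
        ks_2samp(cad_feats[:, j], non_cad_feats[:, j]).statistic
        for j in range(cad_feats.shape[1])
    ])
    return np.argsort(scores)[::-1], scores
```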
[071] Unlike the existing methods of using a VAE to generate time series data, the VAE described by the present disclosure and its system simulates new feature vectors characterizing the PPG waveforms generated by the physical cardiac model. Feature vectors were drawn from a latent space generated by the VAE through a set of non-linear operations by the encoder block. The latent space is learned by the VAE from the features of the synthetic PPG waveforms, obtained using the cardiac model by setting the biophysical parameters in the range of the healthy and CAD regimes. This hybrid approach of using a physical model and a data model enables optimizing the divergence in both while preserving the clinical interpretation.
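A minimal PyTorch sketch of such a feature-level VAE is given below, with the KL term pulling the encoded distribution toward a unit Gaussian (consistent with claim 2); the layer sizes, feature dimensionality, and latent dimensionality are illustrative assumptions, not parameters fixed by the present disclosure.

```python
# Minimal sketch of a feature-level VAE: the encoder maps a PPG feature
# vector to a latent Gaussian, the reparameterization trick draws a latent
# sample, and the decoder reconstructs a feature vector. New CAD-like
# feature vectors can be sampled by decoding draws from N(0, I).
# All layer/latent sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureVAE(nn.Module):
    def __init__(self, n_features=12, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.mu = nn.Linear(32, latent_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(32, latent_dim)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    # KL divergence between q(z|x) and the unit Gaussian prior N(0, I):
    # minimizing it pushes the encoded distribution toward N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```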
[072] The utility of the synthetic PPG is presented from two perspectives. The first is how effective the synthetic data is in improving the classifier accuracy when the real training data is scarce. The second is whether, even when there is enough real training data, the classifier accuracy can be improved by mixing real and synthetic data. The former is examined by artificially reducing the available training CAD data and augmenting it with synthetic data. It is demonstrated that physical cardiac model-based synthesis followed by VAE based sampling indeed improves the sensitivity of detecting CAD in the test dataset (Table 4 - no misses occurring when detecting CAD data) as compared to just using the real CAD data (Table 1) for training the ML classifier. This demonstrates the efficacy of augmenting a reduced dataset with synthetic data. The second perspective provides more insights into the underlying trade-off between choosing real CAD data and synthetic CAD data for training a classifier. It was observed that even when using all the 70 real CAD data for training, the specificity reached only 0.95 (Table 2), whereas by judiciously choosing 30/40 real CAD and 20/30 simulated CAD data, both the sensitivity and specificity reach 1.0 (not shown in FIGS.). Such results can only be explained by the fact that the divergence between the training and the test (unseen) data is reduced through biophysical simulation of the CAD conditions. Hence it can be argued that the synthetically generated CAD PPG data helps in better characterizing the test data in a way that was never observed with real training data alone. The system and method of the present disclosure also demonstrated that augmenting the available real CAD PPG data with the synthetic CAD PPG data not only improves the accuracy of the classifier but also increases the robustness of the machine learning (ML) algorithm, as demonstrated using the AUC of the F1 score (not shown in FIGS.). The PPG synthesis approach as described by the system and method of the present disclosure is also shown to be better compared to GAN models of data generation, specifically when the original dataset itself is small. This indicates that the model parameters representing the distribution of physiological properties like contractility of cardiac chambers, arterial stiffness, compliance, etc., indeed help in enriching the training data, improving the classifier performance on the test data by infusing the pathophysiological interpretation that is generally neglected in conventional synthetic data generation.
[073] Embodiments of the present disclosure provide a system and method for generating synthetic PPG data using a physical model of the cardiac system, replicating electrophysiology and hemodynamics functionality. PPG data pertaining to healthy and CAD conditions were generated with pathophysiological interpretability and exploded using statistical feature distribution and random sampling. Model generated synthetic PPG data was able to classify between the CAD and non-CAD groups with accuracy similar to the benchmark accuracy obtained by using real measured PPG data from patients. The synthetic data was also shown to increase the performance of the ML algorithm in terms of robustness and bias handling, when augmented with a small number of real data. The cardiac model (or physics-based model) as a PPG generator can thus serve as a useful tool to generate quality cardiac time-series data that can be used as training data to improve ML algorithm performance. The system 100, though used for simulating CAD, is a generic platform that can be used to generate PPG templates for other diseases like Arrhythmia, Myocardial Ischemia, Valvular diseases, or other heart diseases in subjects. Apart from its applicability in screening cardiac diseases through ML, the cardiac model can also be used to understand, analyze, and classify cardiovascular disease progression.
[074] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[075] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software
processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[076] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[077] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[078] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[079] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
We Claim:
1. A processor implemented method, comprising:
receiving, via one or more hardware processors, a plurality of photoplethysmogram (PPG) signals corresponding to a plurality of subjects having coronary artery disease (CAD) (202);
generating, via a physics-based model executed by the one or more hardware processors, a plurality of simulated CAD specific PPG signals based on the plurality of photoplethysmogram (PPG) signals (204);
generating, via a variational autoencoder (VAE) executed by the one or more hardware processors, a synthetic CAD specific PPG dataset based on a distribution of the plurality of simulated CAD specific PPG signals (206); and
generating, via the one or more hardware processors, a CAD PPG training dataset based on the plurality of simulated CAD specific PPG signals and the plurality of photoplethysmogram (PPG) signals (208).
2. The processor implemented method of claim 1, wherein a Kullback–Leibler (KL) divergence loss of the VAE is minimized such that a distribution of an encoded output generated by an encoder comprised in the VAE is similar to a unit Gaussian distribution.
3. The processor implemented method of claim 1, wherein the synthetic CAD specific PPG dataset corresponds to a predefined size.
4. The processor implemented method of claim 1, wherein the step of generating a synthetic CAD specific PPG dataset comprises modeling physiological behavior of the plurality of subjects comprised in the plurality of simulated CAD specific PPG signals.
5. The processor implemented method of claim 1, further comprising:
training, via the one or more hardware processors, a binary classifier using the CAD PPG training dataset and a plurality of non-CAD PPG signals to obtain a trained binary classifier; and
applying the trained binary classifier on a test PPG signal corresponding to a subject to classify the subject as a CAD subject or a non-CAD subject.
6. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a plurality of photoplethysmogram (PPG) signals corresponding to a plurality of subjects having coronary artery disease (CAD);
generate, via a physics-based model, a plurality of simulated CAD specific PPG signals based on the plurality of photoplethysmogram (PPG) signals;
generate, via a variational autoencoder (VAE), a synthetic CAD specific PPG dataset based on a distribution of the plurality of simulated CAD specific PPG signals; and
generate a CAD PPG training dataset based on the plurality of simulated CAD specific PPG signals and the plurality of photoplethysmogram (PPG) signals.
7. The system of claim 6, wherein a Kullback–Leibler (KL) divergence loss of the VAE is minimized such that a distribution of an encoded output generated by an encoder comprised in the VAE is similar to a unit Gaussian distribution.
8. The system of claim 6, wherein the synthetic CAD specific PPG dataset corresponds to a predefined size.
9. The system of claim 6, wherein the synthetic CAD specific PPG dataset is generated by modeling physiological behavior of the plurality of subjects comprised in the plurality of simulated CAD specific PPG signals.
10. The system of claim 6, wherein the one or more hardware processors are further configured by the instructions to:
train a binary classifier using the CAD PPG training dataset and a plurality of non-CAD PPG signals to obtain a trained binary classifier; and
apply the trained binary classifier on a test PPG signal corresponding to a subject to classify the subject as a CAD subject or a non-CAD subject.