Abstract: The present disclosure provides a system 102 and a method 200 for generating a synthetic dataset. The method 200 includes receiving 202 raw data from a plurality of sources, via an Application Programming Interface (API) and anonymizing 204 the raw data. Further, the method 200 includes generating 206 a plurality of synthetic datasets using an Artificial Intelligence (AI) model and determining 208 risk metrics. Further, the method 200 includes determining 210 that the risk metrics exceed a predefined threshold in real time and dynamically adjusting 212 parameters. Further, the method 200 includes selecting 214 a synthetic dataset. Therefore, the system 102 and the method 200 overcome the limitations of traditional synthetic data generation techniques by integrating real-time privacy monitoring and dynamic adjustments, ensuring that both the fidelity and privacy of the synthetic data are continuously optimized.
Description:
TECHNICAL FIELD
[001] The present disclosure generally relates to the field of data generation. In particular, the present disclosure provides a system and a method for generating a synthetic dataset using a multi-model ensemble to ensure regulatory compliance by incorporating privacy-preservation mechanisms, thereby minimizing the risk of personal data re-identification while maintaining high data fidelity.
BACKGROUND
[002] Modern Machine Learning (ML) and Artificial Intelligence (AI) solutions require large, diverse, and high-quality datasets for effective model training. However, enterprises often face significant challenges in accessing and utilizing such datasets. Privacy regulations and similar frameworks globally impose strict limitations on the sharing and use of Personally Identifiable Information (PII). Furthermore, domains such as healthcare and finance, which manage sensitive and regulated information, often face challenges such as data scarcity or inaccessibility, creating significant barriers to AI development.
[003] Existing synthetic data generation methods face critical limitations. Basic Generative Adversarial Network (GAN)-based generators, while capable of creating realistic data, often lack integrated privacy safeguards or mechanisms for regulatory compliance. Single-model approaches using GANs, transformers, or other singular techniques struggle to capture the diverse statistical characteristics required across various data modalities such as images, text, or tabular data. Furthermore, current solutions rarely incorporate automated privacy feedback mechanisms, relying instead on manual interventions or static thresholds. Such reliance often leads to either excessive information loss, reducing data utility, or insufficient protection, increasing the risk of re-identification. These challenges highlight the need for a comprehensive system that dynamically balances data fidelity, privacy, and compliance.
[004] Therefore, there is a need to address the drawbacks mentioned above and any other shortcomings, or at the very least, provide a valuable alternative to the existing methods and systems.
OBJECTS OF THE PRESENT DISCLOSURE
[005] A general object of the present disclosure is to provide an efficient and reliable system and method that obviate the above-mentioned limitations of existing systems and methods.
[006] An object of the present disclosure is to provide a system and a method for a multi-model ensemble that enables high-fidelity synthetic data generation capable of handling diverse data modalities, including structured data, unstructured data, images, and text, thereby ensuring robust adaptability across various applications and domains.
[007] Another object of the present disclosure is to provide a system and a method for implementing a privacy feedback mechanism that continually monitors re-identification risk metrics. The privacy feedback dynamically tunes the generative models to ensure compliance with data privacy regulations, thereby addressing privacy concerns in synthetic data generation.
[008] Yet another object of the present disclosure is to provide a system and a method that reduces reliance on real datasets to minimize data-sharing constraints and accelerates the adoption of Artificial Intelligence (AI)/Machine Learning (ML) techniques within regulated industries such as healthcare, finance, and other sensitive domains, overcoming traditional data accessibility barriers.
SUMMARY
[009] Aspects of the present disclosure relate to the field of data generation. In particular, the present disclosure provides a system and a method for generating a synthetic dataset using a multi-model ensemble to ensure regulatory compliance by incorporating privacy-preservation mechanisms, thereby minimizing the risk of personal data re-identification while maintaining high data fidelity.
[010] An aspect of the present disclosure relates to a method for generating a synthetic dataset. The method includes receiving, by one or more processors associated with a system, raw data from a plurality of sources, via an Application Programming Interface (API) and anonymizing, by the one or more processors, the raw data. Further, the method includes generating, by the one or more processors, a plurality of synthetic datasets based on the anonymized raw data using an Artificial Intelligence (AI) model configured within the one or more processors and determining, by the one or more processors, one or more risk metrics for each of the plurality of synthetic datasets. Further, the method includes determining, by the one or more processors, that the one or more risk metrics exceed a predefined threshold in real time. Further, in response to the determination that the one or more risk metrics exceed the predefined threshold, the method includes dynamically adjusting, by the one or more processors, one or more parameters associated with the AI model and selecting, by the one or more processors, at least one synthetic dataset from the plurality of synthetic datasets based on the adjustment.
[011] In an embodiment, for anonymizing, by the one or more processors, the raw data, the method may include determining, by the one or more processors, a domain of the raw data to anonymize the raw data.
[012] In an embodiment, for generating, by the one or more processors, the plurality of synthetic datasets, the method may include generating, by the one or more processors, at least one of: synthetic structured data of the plurality of synthetic datasets, and image-based data of the plurality of synthetic datasets based on the anonymized raw data using the AI model and simultaneously generating, by the one or more processors, at least one of: synthetic textual data of the plurality of synthetic datasets, and tokenized data of the plurality of synthetic datasets based on the anonymized raw data using the AI model. Further, the method may include simultaneously generating, by the one or more processors, high-dimensional synthetic data of the plurality of synthetic datasets based on the anonymized raw data using the AI model.
[013] In an embodiment, for determining, by the one or more processors, the one or more risk metrics, the method includes measuring, by the one or more processors, the one or more risk metrics in the plurality of synthetic datasets using at least one of: k-anonymity techniques, l-diversity techniques, and t-closeness techniques.
[014] In an embodiment, for determining, by the one or more processors, that the one or more risk metrics exceed the predefined threshold, the method may include assigning, by the one or more processors, a score corresponding to the one or more risk metrics and comparing, by the one or more processors, the score with the predefined threshold.
[015] In an embodiment, the one or more parameters comprise at least one of: noise injection levels, sampling rates, and weights of the AI model.
[016] In an embodiment, for selecting, by the one or more processors, the at least one synthetic dataset, the method may include regenerating, by the one or more processors, the plurality of synthetic datasets based on the adjustment of the one or more parameters and determining, by the one or more processors, fidelity and the one or more risk metrics of the regenerated plurality of synthetic datasets. Further, the method may include determining, by the one or more processors, that the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold. Further, in response to the determination that the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold, the method may include selecting, by the one or more processors, the at least one synthetic dataset.
[017] In an embodiment, for determining, by the one or more processors, the fidelity and the one or more risk metrics, the method includes measuring, by the one or more processors, the fidelity and the one or more risk metrics in the plurality of regenerated synthetic datasets using evaluation techniques.
[018] In an embodiment, for determining, by the one or more processors, that the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold, the method may include reassigning, by the one or more processors, a score corresponding to the fidelity and the one or more risk metrics and comparing, by the one or more processors, the reassigned score with the predefined threshold.
[019] Another aspect of the present disclosure relates to a system for generating a synthetic dataset. The system includes one or more processors and a memory. The memory is operatively coupled with the one or more processors, where the memory comprises one or more instructions which, when executed, cause the one or more processors to receive raw data from a plurality of sources, via an Application Programming Interface (API) and anonymize the raw data. Further, the one or more processors are configured to generate a plurality of synthetic datasets based on the anonymized raw data using an Artificial Intelligence (AI) model configured within the one or more processors and determine one or more risk metrics for each of the plurality of synthetic datasets. Further, the one or more processors are configured to determine whether the one or more risk metrics exceed a predefined threshold in real time. Further, in response to the determination that the one or more risk metrics exceed the predefined threshold, the one or more processors are configured to dynamically adjust one or more parameters associated with the AI model and select at least one synthetic dataset from the plurality of synthetic datasets based on the adjustment.
[020] Various objects, features, aspects, and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent components.
BRIEF DESCRIPTION OF THE DRAWINGS
[021] The accompanying drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
[022] FIG. 1 illustrates a block diagram of an example system for generating a synthetic dataset, in accordance with an embodiment of the present disclosure.
[023] FIG. 2 illustrates a flow chart of an example method for generating the synthetic dataset, in accordance with an embodiment of the present disclosure.
[024] FIG. 3 illustrates an exemplary computer system in which or with which embodiments of the present disclosure may be implemented.
DETAILED DESCRIPTION
[025] The following is a detailed description of embodiments of the disclosure depicted in the accompanying drawings. The embodiments are in such detail as to clearly communicate the disclosure. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure as defined by the appended claims.
[026] The present disclosure addresses a critical need for a holistic system that combines the strengths of multiple generative modules (e.g., Artificial Intelligence (AI) models), such as Generative Adversarial Network (GAN) models, transformer models, and diffusion models, to effectively handle diverse data types and domains. Additionally, there is a need for a real-time privacy feedback loop that dynamically adjusts model hyperparameters (e.g., one or more parameters) to reduce re-identification risk. Furthermore, the system must adhere to privacy and regulatory frameworks while maintaining high data fidelity, ensuring compliance without compromising the quality of the generated synthetic data.
[027] Additionally, the present disclosure addresses the challenge of generating realistic synthetic data while ensuring strict adherence to privacy and compliance standards. By integrating multiple generative models within a unified framework, the system may intelligently select the most suitable synthetic data for a specific domain (e.g., healthcare, finance, or text-based analytics) in real time. Further, the present disclosure may include a privacy feedback mechanism that ensures that if the risk of re-identification exceeds a defined threshold (e.g., a predefined threshold), the system may dynamically adjust parameters of the generative models, thereby minimizing privacy exposure without compromising data fidelity. Furthermore, key features and advantages of the present disclosure include a multi-model ensemble that leverages diverse AI paradigms such as GANs, transformers, and diffusion models to deliver robust and high-quality outputs. The present disclosure incorporates dynamic privacy control through metrics like differential privacy measures, ensuring continuous monitoring and mitigation of privacy leakage. Additionally, the present disclosure provides domain-specific customization by enabling tailored transformations or constraints to comply with industry regulations, such as those governing medical records. Further, a scalable and modular design may allow seamless integration with various enterprise workflows, supporting on-premise, cloud, or hybrid deployments.
[028] Embodiments explained herein relate to data generation. In particular, the present disclosure relates to a system and a method for generating a synthetic dataset using a multi-model ensemble of GANs, transformer-based architectures, and diffusion models to ensure regulatory compliance by incorporating privacy-preservation mechanisms, thereby minimizing the risk of personal data re-identification while maintaining high data fidelity. Various embodiments with respect to the present disclosure will be explained in detail with reference to FIGs. 1-3.
[029] FIG. 1 illustrates a block diagram 100 of an example system 102 for generating a synthetic dataset, in accordance with an embodiment of the present disclosure.
[030] Referring to FIG. 1, the system 102 may include one or more processors 104, a memory 106, and an interface(s) 108. The one or more processors 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that manipulate data based on operational instructions. Among other capabilities, the one or more processor(s) 104 may be configured to fetch and execute computer-readable instructions stored in the memory 106 of the system 102. The memory 106 may store one or more computer-readable instructions or routines, which may be fetched and executed. The memory 106 may include any non-transitory storage device including, for example, volatile memory such as Random-Access Memory (RAM), or non-volatile memory such as Erasable Programmable Read-Only Memory (EPROM), flash memory, and the like.
[031] The interface(s) 108 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 108 may facilitate communication of the system 102 with various devices coupled to it. The interface(s) 108 may also provide a communication pathway for one or more components of the system 102. Examples of such components include but are not limited to, processing engine(s) 110 and a database 112. The database 112 may include data that is either stored or generated as a result of functionalities implemented by any of the components of the processing engine(s) 110.
[032] In an embodiment, the processing engine(s) 110 may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) 110. In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) 110 may be processor executable instructions stored on a non-transitory machine-readable storage medium, and the hardware for the one or more processor(s) 104 may comprise a processing resource (for example, one or more processors) to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) 110. In such examples, the system 102 may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system 102 and the processing resource. In other examples, the processing engine(s) 110 may be implemented by electronic circuitry. The processing engine(s) 110 may include a data ingestion module 114, an anonymize module 116, a multi-module ensemble 118, a privacy monitoring module 120, a selection module 122, and other module(s) 124. The other module(s) 124 may implement functionalities that supplement applications/functions performed by the processing engine(s) 110. In an embodiment, the processing engine(s) 110 may be an AI engine that may be configured with an AI model.
[033] In an embodiment, the data ingestion module 114 may receive raw data (e.g., real data input) from a plurality of sources (e.g., databases, file systems, and the like) via an Application Programming Interface (API). Once the raw data is received, the anonymize module 116 may determine a domain of the raw data to anonymize the raw data. In exemplary embodiments, the anonymize module 116 may perform pre-processing and basic anonymization by removing direct identifiers such as names and IDs, converting the data into a standardized format suitable for further processing. Additionally, domain-specific rules can be applied as needed. For example, in healthcare data, the anonymize module 116 may remove or tokenize patient names, while in finance data, the anonymize module 116 may mask account numbers, ensuring that sensitive information is effectively anonymized while maintaining the utility of the dataset.
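The domain-aware anonymization described in this paragraph can be sketched as follows. This is a minimal Python illustration only; the domains, field names, and rule table are assumptions introduced for the example, not the claimed implementation of the anonymize module 116.

```python
import hashlib

# Illustrative rule table: which fields to drop, tokenize, or mask per domain.
# These field names and domains are assumptions, not part of the disclosure.
DOMAIN_RULES = {
    "healthcare": {"patient_name": "tokenize", "patient_id": "drop"},
    "finance": {"account_number": "mask", "customer_name": "drop"},
}

def tokenize(value: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:8]

def mask(value: str, visible: int = 4) -> str:
    """Mask all but the trailing characters of an identifier."""
    return "*" * (len(value) - visible) + value[-visible:]

def anonymize(record: dict, domain: str) -> dict:
    """Apply domain-specific anonymization rules to a single record."""
    out = {}
    for field, value in record.items():
        rule = DOMAIN_RULES.get(domain, {}).get(field)
        if rule == "drop":
            continue                      # remove the direct identifier entirely
        elif rule == "tokenize":
            out[field] = tokenize(value)  # keep linkability without identity
        elif rule == "mask":
            out[field] = mask(value)
        else:
            out[field] = value            # non-identifying fields pass through
    return out
```

For example, a finance record's account number is masked to its last four digits, while a healthcare record's patient identifier is removed outright and the patient name is replaced with a token.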
[034] In an embodiment, once the raw data is anonymized, the multi-module ensemble 118 may generate a plurality of synthetic datasets. For generating the plurality of synthetic datasets, the multi-module ensemble 118 may generate at least one of: synthetic structured data of the plurality of synthetic datasets, and image-based data of the plurality of synthetic datasets based on the anonymized raw data using a Generative Adversarial Network (GAN) model associated with the AI model. Further, the multi-module ensemble 118 may simultaneously generate at least one of: synthetic textual data of the plurality of synthetic datasets, and tokenized data of the plurality of synthetic datasets based on the anonymized raw data using a transformer model associated with the AI model. Further, the multi-module ensemble 118 may simultaneously generate high-dimensional synthetic data of the plurality of synthetic datasets based on the anonymized raw data using a diffusion model associated with the AI model.
[035] In exemplary embodiments, the multi-module ensemble 118 (e.g., a generative model) may include three distinct AI generators: the GAN model, the transformer model, and the diffusion model. The GAN model may use architectures like Style-based Generative Adversarial Network 2 (StyleGAN2) or Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) to produce realistic images or tabular data by learning from a latent vector distribution. Further, the transformer model, based on architectures like Generative Pre-trained Transformer (GPT) or Bidirectional Encoder Representations from Transformers (BERT), handles textual data and generates synthetic text or structured tokens that mimic real linguistic or semantic patterns. The diffusion model, utilizing architectures such as Denoising Diffusion Probabilistic Models (DDPM), is particularly effective in generating high-quality images or high-dimensional data with stable training dynamics.
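One way to picture how the three generators divide the work is as a routing table from data modality to generator, which the following Python sketch illustrates; the modality labels and the mapping itself are assumptions made for illustration, not part of the disclosure.

```python
# Hypothetical routing of data modalities to the three generators in the
# ensemble; the labels below are illustrative assumptions.
MODALITY_TO_GENERATOR = {
    "image": "gan",                   # e.g., StyleGAN2 or WGAN-GP
    "tabular": "gan",
    "text": "transformer",            # e.g., a GPT-style decoder
    "tokens": "transformer",
    "high_dimensional": "diffusion",  # e.g., a DDPM
}

def route(modality: str) -> str:
    """Return the generator responsible for a given data modality."""
    try:
        return MODALITY_TO_GENERATOR[modality]
    except KeyError:
        raise ValueError(f"unsupported modality: {modality}")
```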
[036] Once the plurality of synthetic datasets is generated, the privacy monitoring module 120 may determine one or more risk metrics for each of the plurality of synthetic datasets based on measuring the one or more risk metrics in the plurality of synthetic datasets using at least one of k-anonymity techniques, l-diversity techniques, and t-closeness techniques, and may determine whether the one or more risk metrics exceed a predefined threshold or not in real time. To determine whether the one or more risk metrics exceed the predefined threshold or not, the privacy monitoring module 120 may assign a score corresponding to the one or more risk metrics and compare the score with the predefined threshold. If the one or more risk metrics exceed the predefined threshold, the privacy monitoring module 120 may adjust one or more parameters associated with the AI model.
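A minimal sketch of the k-anonymity branch of this risk measurement is given below. The quasi-identifier fields and the reading of risk as 1/k are illustrative assumptions; l-diversity and t-closeness would be scored analogously.

```python
from collections import Counter

def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Smallest equivalence-class size over the quasi-identifier columns."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

def reidentification_risk(records: list[dict],
                          quasi_identifiers: list[str]) -> float:
    """Worst-case re-identification probability is 1/k under k-anonymity."""
    return 1.0 / k_anonymity(records, quasi_identifiers)

def exceeds_threshold(records, quasi_identifiers,
                      threshold: float = 0.05) -> bool:
    """Compare the measured risk against the predefined threshold."""
    return reidentification_risk(records, quasi_identifiers) > threshold
```

A dataset in which every quasi-identifier combination occurs at least twenty times yields a worst-case risk of at most 0.05, while a single unique combination drives the risk to 1.0 and trips the threshold.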
[037] In exemplary embodiments, the privacy monitoring module 120 is a real-time module that calculates a privacy risk metric, such as the risk of re-identification, and feeds this metric back to each generative model. The re-identification risk metrics may utilize differential privacy techniques to assess the likelihood of an identity of an individual being inferred from the synthetic dataset. Further, the privacy monitoring module 120 may implement scoring techniques that may incorporate privacy risk parameters such as k-anonymity, l-diversity, and t-closeness. If the calculated risk metrics exceed the predefined threshold (e.g., a 0.05 chance of re-identification), a feedback loop may trigger metric-based tuning. The privacy monitoring module 120 may transmit adjustment signals to the relevant generative model to adjust the one or more parameters and mitigate privacy risk, such as, but not limited to, increasing noise injection levels in the GAN model, altering sampling strategies/rates in the diffusion model, modifying weights of the AI model, and the like.
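The metric-based tuning triggered by the feedback loop might look like the following sketch, where the parameter names and the proportional adjustment rule are assumptions for illustration rather than the disclosed tuning policy.

```python
def adjust_parameters(params: dict, risk: float,
                      threshold: float = 0.05) -> dict:
    """Scale privacy-relevant generator parameters by the excess risk factor."""
    if risk <= threshold:
        return dict(params)              # within budget: no adjustment needed
    excess = risk / threshold            # how far the risk overshoots the limit
    adjusted = dict(params)
    # More noise injection in the GAN reduces memorization of rare records.
    adjusted["noise_level"] = params["noise_level"] * excess
    # A lower sampling rate in the diffusion model suppresses overfit detail.
    adjusted["sampling_rate"] = params["sampling_rate"] / excess
    return adjusted
```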
[038] Once the one or more parameters are adjusted, the multi-module ensemble 118 may regenerate the plurality of synthetic datasets based on the adjustment of the one or more parameters. Once the plurality of synthetic datasets is regenerated, the privacy monitoring module 120 may determine fidelity and the one or more risk metrics of the regenerated plurality of synthetic datasets based on measuring the fidelity and the one or more risk metrics in the regenerated plurality of synthetic datasets using evaluation techniques. Once the fidelity and the one or more risk metrics are determined, the privacy monitoring module 120 may determine whether the fidelity and the one or more risk metrics of at least one synthetic dataset exceed the predefined threshold or not. To make this determination, the privacy monitoring module 120 may reassign a score corresponding to the fidelity and the one or more risk metrics and compare the reassigned score with the predefined threshold. If the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold, the selection module 122 may select the at least one synthetic dataset.
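The regenerate-and-evaluate cycle in this paragraph can be sketched as a loop. Here generate() and evaluate() are simplified stand-ins for the multi-module ensemble 118 and the privacy monitoring module 120, and the monotonic relationship between noise, risk, and fidelity is an illustrative assumption.

```python
import random

def generate(noise_level: float, n: int = 100, seed: int = 0) -> list[float]:
    """Stand-in generator: sample spread grows with the injected noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0 + noise_level) for _ in range(n)]

def evaluate(noise_level: float) -> tuple[float, float]:
    """Stand-in metrics: risk falls and fidelity degrades as noise rises."""
    risk = max(0.0, 0.2 - noise_level)       # more noise -> lower risk
    fidelity = max(0.0, 1.0 - noise_level)   # more noise -> lower fidelity
    return fidelity, risk

def tune_until_compliant(noise_level: float = 0.0,
                         risk_threshold: float = 0.05,
                         step: float = 0.05):
    """Regenerate with more noise until risk no longer exceeds the threshold."""
    dataset = generate(noise_level)
    fidelity, risk = evaluate(noise_level)
    while risk > risk_threshold:
        noise_level += step                  # dynamic parameter adjustment
        dataset = generate(noise_level)      # regenerate the candidate dataset
        fidelity, risk = evaluate(noise_level)
    return dataset, noise_level, fidelity, risk
```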
[039] Once the at least one synthetic dataset is selected, the selection module 122 may determine a domain with respect to the at least one selected synthetic dataset and validate the at least one selected synthetic dataset based on the domain. Further, the selection module 122 may utilize the at least one selected synthetic dataset based on the validation of the at least one selected synthetic dataset. For utilizing the at least one selected synthetic dataset, the other module(s) 124 may store the at least one selected synthetic dataset in the database 112. In an embodiment, the other module(s) 124 may transfer the at least one selected synthetic dataset to downstream systems.
[040] In exemplary embodiments, the selection module 122 is an AI-based decision layer that compares outputs of each generative model in terms of two main criteria: fidelity and privacy risk (e.g., the one or more risk metrics). The selection module 122 conducts a performance evaluation that assesses the synthetic datasets against statistical alignment with the real data distribution, using evaluation techniques such as Fréchet Inception Distance (FID) and precision-recall for images, or the Bilingual Evaluation Understudy (BLEU) score for text, for fidelity, together with the privacy risk metric generated by the privacy monitoring module 120. Based on these evaluations, the selection logic selects the best-performing synthetic data stream. For example, if the GAN model produces minimal re-identification risk and meets a minimum fidelity threshold, the selection module 122 may select the GAN output. Alternatively, if textual data requires advanced linguistic structure, the transformer output might be chosen. The selection process may be followed by synthesis output, which includes final validation to ensure compliance with domain-specific regulations (e.g., for healthcare data). Finally, the synthetic dataset is securely stored (on-premise, cloud, or hybrid) or delivered to downstream consumers, such as AI/Machine Learning (ML) pipelines or analytics tools, for further use. In exemplary embodiments, the multi-module ensemble 118 may collectively produce candidate synthetic datasets (e.g., the plurality of synthetic datasets), with each AI generator configured to generate a distinct candidate synthetic dataset (e.g., the at least one synthetic dataset). The selection module 122 may compare these candidate datasets based on a combined performance score that evaluates fidelity, diversity, and privacy. This comparison may ensure that the most optimal synthetic dataset is chosen for the intended application.
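The combined-score comparison performed by the selection layer can be sketched as below; the equal weighting between fidelity and privacy, and the candidate scores, are assumptions for illustration.

```python
def combined_score(fidelity: float, risk: float, alpha: float = 0.5) -> float:
    """Blend fidelity (higher is better) with privacy, penalizing risk."""
    return alpha * fidelity + (1.0 - alpha) * (1.0 - risk)

def select_best(candidates: dict[str, tuple[float, float]],
                risk_threshold: float = 0.05) -> str:
    """Pick the highest-scoring generator whose risk meets the threshold."""
    eligible = {
        name: combined_score(fidelity, risk)
        for name, (fidelity, risk) in candidates.items()
        if risk <= risk_threshold
    }
    if not eligible:
        raise ValueError("no candidate meets the privacy threshold")
    return max(eligible, key=eligible.get)
```

With hypothetical scores, a transformer output with the best fidelity but a 0.10 re-identification risk is excluded, and the best-scoring compliant generator is selected instead.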
[041] In exemplary embodiments, the system 102 may process partially de-identified Electronic Health Records (EHR) as input. The output consists of high-fidelity synthetic records that preserve essential patterns such as comorbidities and medication timelines, while ensuring compliance with the privacy requirements. In exemplary embodiments, the system 102 may be configured with a privacy tuning mechanism that may be activated when the system 102 detects high overlap in rare disease patterns; in such cases, the system 102 may inject additional noise to reduce the re-identification risk.
[042] In exemplary embodiments, for a financial transactions scenario, the input consists of bank transaction logs, which include account identities (IDs), transaction amounts, and timestamps. The system 102 may generate synthetic transaction streams (e.g., the synthetic data) that maintain key characteristics such as seasonality, volume distributions, and transaction frequencies, making them suitable for fraud detection model training. If a cluster analysis of the system 102 identifies a potential unique signature, the privacy monitoring module 120 may transmit signals to a generative model (e.g., the WGAN-GP) associated with the system 102 to increase randomization and reduce privacy risks.
[043] In exemplary embodiments, the system 102 may process customer complaint logs or call centre transcripts as input. The output is synthetic textual data that captures critical features like linguistic structure, sentiment profiles, and topic distributions, enabling Natural Language Processing (NLP)-based sentiment analysis. The transformer model may adjust the data by replacing or masking rare phrases in the text, effectively reducing the risk of direct re-identification while maintaining data integrity. Further, the system 102 may apply a domain-specific transformation to the selected synthetic dataset to meet compliance requirements unique to healthcare, finance, or other regulated industries.
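The rare-phrase masking performed on synthetic text can be approximated at the token level as follows; token-level counting is an illustrative simplification of phrase-level masking, and the placeholder string is an assumption.

```python
from collections import Counter

def mask_rare_tokens(texts: list[str], min_count: int = 2,
                     placeholder: str = "[MASKED]") -> list[str]:
    """Replace tokens rarer than min_count with a placeholder, reducing
    direct re-identification risk while keeping common structure."""
    counts = Counter(token for text in texts for token in text.split())
    return [
        " ".join(token if counts[token] >= min_count else placeholder
                 for token in text.split())
        for text in texts
    ]
```

For instance, across a small corpus, a one-off proper noun is masked while frequent phrasing such as "the product" is preserved.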
[044] Therefore, the present disclosure includes training the GAN model, such as StyleGAN2 or WGAN-GP, using Wasserstein loss with a gradient penalty for stable image synthesis. The transformer model may be fine-tuned on domain-specific text corpora, like the GPT architecture, for generating synthetic textual data. The diffusion model is configured with DDPM to handle complex data distributions. Privacy monitoring is incorporated by using open-source differential privacy libraries, such as TensorFlow (TF) Privacy or Opacus for PyTorch, to dynamically assess re-identification risk. The system 102 may operate on a Graphics Processing Unit (GPU)-accelerated architecture to ensure the timely processing of large datasets. Additionally, the GPU-accelerated architecture integrates with industry compliance checks to support auditing and regulatory requirements.
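As an illustration of the kind of differential-privacy noise calibration such libraries perform, the following standalone Gaussian mechanism is a sketch under the classic analytic bound; it is not the API of TF Privacy or Opacus.

```python
import math
import random

def gaussian_sigma(epsilon: float, delta: float,
                   sensitivity: float = 1.0) -> float:
    """Noise scale for (epsilon, delta)-DP via the classic analytic bound:
    sigma >= sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def privatize(value: float, epsilon: float, delta: float,
              sensitivity: float = 1.0,
              rng: random.Random = None) -> float:
    """Release a scalar with Gaussian noise calibrated to the privacy budget."""
    rng = rng or random.Random()
    return value + rng.gauss(0.0, gaussian_sigma(epsilon, delta, sensitivity))
```

A tighter privacy budget (smaller epsilon) yields a larger noise scale, trading fidelity for lower re-identification risk, which mirrors the feedback behaviour described above.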
[045] The present disclosure has industrial applicability across several sectors. In healthcare, it can generate synthetic EHR data for AI model training without exposing sensitive patient information. In finance, the system 102 may provide realistic yet privacy-safe transaction records for fraud detection, credit scoring, and risk analysis models. In retail, synthetic consumer purchasing patterns can support market analysis without revealing personal details. In cybersecurity, synthetic network traffic logs can be created to test intrusion detection systems. Additionally, in general AI applications, the invention can enhance model performance across multiple domains by augmenting or replacing real datasets with synthetic versions.
[046] Additionally, the present disclosure comprehensively describes a system and method for generating high-fidelity synthetic data with privacy preservation. The present disclosure introduces a multi-technique ensemble approach integrated with a dynamic privacy feedback loop, offering a robust solution to the critical challenges in synthetic data generation. By ensuring that re-identification risk remains low while maintaining high data fidelity, the system 102 addresses a pressing industry need and has broad applicability across various regulated and non-regulated sectors.
[047] FIG. 2 illustrates a flow chart of an example method 200 for generating a synthetic dataset, in accordance with an embodiment of the present disclosure.
[048] Referring to FIG. 2, at 202, the method 200 may include receiving, by one or more processors (e.g., 104 as represented in FIG. 1) associated with a system (e.g., 102 as represented in FIG. 1), raw data from a plurality of sources, via an Application Programming Interface (API). At 204, the method 200 may include anonymizing, by the one or more processors 104, the raw data. At 206, the method 200 may include generating, by the one or more processors 104, a plurality of synthetic datasets based on the anonymized raw data using an AI model configured within the one or more processors 104. At 208, the method 200 may include determining, by the one or more processors 104, one or more risk metrics for each of the plurality of synthetic datasets. At 210, the method 200 may include determining, by the one or more processors 104, whether the one or more risk metrics exceed a predefined threshold in real time. At 212, in response to the determination that the one or more risk metrics exceed the predefined threshold, the method 200 may include dynamically adjusting, by the one or more processors 104, one or more parameters associated with the AI model. At 214, the method 200 may include selecting, by the one or more processors 104, at least one synthetic dataset from the plurality of synthetic datasets based on the adjustment.
[049] In an embodiment, for generating, by the one or more processors 104, the plurality of synthetic datasets, the method 200 may include generating, by the one or more processors, at least one of: synthetic structured data of the plurality of synthetic datasets, and image-based data of the plurality of synthetic datasets based on the anonymized raw data using the AI model. Further, the method 200 may include simultaneously generating, by the one or more processors 104, at least one of: synthetic textual data of the plurality of synthetic datasets, and tokenized data of the plurality of synthetic datasets based on the anonymized raw data using the AI model. Further, the method 200 may include simultaneously generating, by the one or more processors 104, high-dimensional synthetic data of the plurality of synthetic datasets based on the anonymized raw data using the AI model.
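The simultaneous generation of the modality-specific datasets described above can be sketched with concurrent workers. The three generator functions are hypothetical placeholders for the structured, textual, and high-dimensional branches of the AI model; only the parallel dispatch pattern is the point of the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def gen_structured(seed_rows):
    # Placeholder tabular generator.
    return [{"age": r["age"] + 1} for r in seed_rows]

def gen_textual(seed_rows):
    # Placeholder text generator.
    return [f"patient aged {r['age']}" for r in seed_rows]

def gen_high_dim(seed_rows):
    # Placeholder high-dimensional (embedding-like) generator.
    return [[float(r["age"])] * 8 for r in seed_rows]

def generate_all(seed_rows):
    # Launch the modality-specific generators simultaneously,
    # mirroring the parallel generation steps of the method 200.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "structured": pool.submit(gen_structured, seed_rows),
            "textual": pool.submit(gen_textual, seed_rows),
            "high_dim": pool.submit(gen_high_dim, seed_rows),
        }
        return {name: f.result() for name, f in futures.items()}

out = generate_all([{"age": 30}, {"age": 41}])
```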
[050] In an embodiment, for selecting, by the one or more processors 104, the at least one synthetic dataset, the method 200 may include regenerating, by the one or more processors 104, the plurality of synthetic datasets based on the adjustment of the one or more parameters and determining, by the one or more processors 104, fidelity and the one or more risk metrics of the regenerated plurality of synthetic datasets. Further, the method 200 may include determining, by the one or more processors 104, that the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold. Further, in response to the determination that the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold, the method 200 may include selecting, by the one or more processors 104, the at least one synthetic dataset.
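One of the risk metrics named in the claims, k-anonymity, can be computed with a short sketch: k is the size of the smallest group of rows sharing the same quasi-identifier values, so a dataset whose k falls below a policy threshold would trigger the parameter adjustment and regeneration described above. The table and quasi-identifier names are hypothetical.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    # k-anonymity level: size of the smallest group of rows that share
    # identical values across all quasi-identifier columns.
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

table = [
    {"zip": "560001", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "560001", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "560002", "age_band": "40-49", "diagnosis": "C"},
]
k = k_anonymity(table, ["zip", "age_band"])  # smallest group has one row
```

Here the third row is unique on (zip, age_band), so k is 1 and the dataset offers no k-anonymity protection; a compliant pipeline would regenerate until the smallest group meets the required size.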
[051] FIG. 3 illustrates an exemplary computer system 300 in which or with which embodiments of the present disclosure may be implemented.
[052] As shown in FIG. 3, the computer system 300 may include an external storage device 310, a bus 320, a main memory 330, a read only memory 340, a mass storage device 350, a communication port 360, and a processor 370. A person skilled in the art will appreciate that the computer system 300 may include more than one processor and more than one communication port. The processor 370 may include various modules associated with embodiments of the present disclosure.
[053] In an embodiment, the communication port 360 may be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. The communication port 360 may be chosen depending on a network, such as a Local Area Network (LAN), a Wide Area Network (WAN), or any network to which the computer system 300 connects.
[054] In an embodiment, the memory 330 may be a Random-Access Memory (RAM), or any other dynamic storage device commonly known in the art. The read-only memory 340 may be any static storage device(s), e.g., but not limited to, Programmable Read-Only Memory (PROM) chips for storing static information, e.g., start-up or Basic Input/Output System (BIOS) instructions for the processor 370.
[055] In an embodiment, the mass storage device 350 may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), one or more optical discs, Redundant Array of Independent Disks (RAID) storage, e.g., an array of disks (e.g., SATA arrays).
[056] In an embodiment, the bus 320 communicatively couples the processor(s) 370 with the other memory, storage, and communication blocks. The bus 320 may be, e.g., a Peripheral Component Interconnect (PCI)/PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives, and other subsystems, as well as other buses, such as a front-side bus (FSB), which connects the processor 370 to the computer system 300.
[057] Optionally, operator and administrative interfaces, e.g., a display, keyboard, joystick, and a cursor control device, may also be coupled to the bus 320 to support direct operator interaction with the computer system 300. Other operator and administrative interfaces may be provided through network connections connected through the communication port 360. Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system 300 limit the scope of the present disclosure.
[058] While the foregoing describes various embodiments of the disclosure, other and further embodiments of the present disclosure may be devised without departing from the basic scope thereof. The scope of the disclosure is determined by the claims that follow. The disclosure is not limited to the described embodiments, versions, or examples, which are included to enable a person having ordinary skill in the art to make and use the disclosure when combined with information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE PRESENT DISCLOSURE
[059] The present disclosure ensures high-fidelity data generation while actively monitoring for privacy risk through a novel multi-model ensemble, effectively balancing data fidelity and privacy compliance.
[060] The present disclosure features an adaptive learning mechanism where a dynamic feedback loop continuously recalibrates hyperparameters, optimizing the fidelity-privacy trade-off.
[061] The present disclosure enables domain versatility, allowing scalability across image-intensive domains, textual analysis, and structured data, making it suitable for various industries and applications.
[062] The present disclosure incorporates built-in metrics designed to address privacy regulations, enabling organizations to safely share or utilize synthetic data while ensuring adherence to regulatory requirements.
[063] The present disclosure reduces human intervention by automating model selection and parameter tuning, minimizing the need for specialized data scientists to manually refine the synthetic dataset.
Claims:
1. A method (200) for generating a synthetic dataset, comprising:
receiving (202), by one or more processors (104) associated with a system (102), raw data from a plurality of sources, via an Application Programming Interface (API);
anonymizing (204), by the one or more processors (104), the raw data;
generating (206), by the one or more processors (104), a plurality of synthetic datasets based on the anonymized raw data using an Artificial Intelligence (AI) model configured within the one or more processors (104);
determining (208), by the one or more processors (104), one or more risk metrics for each of the plurality of synthetic datasets;
determining (210), by the one or more processors (104), that the one or more risk metrics exceed a predefined threshold in real time;
in response to the determination that the one or more risk metrics exceed the predefined threshold, dynamically adjusting (212), by the one or more processors (104), one or more parameters associated with the AI model; and
selecting (214), by the one or more processors (104), at least one synthetic dataset from the plurality of synthetic datasets based on the adjustment.
2. The method (200) as claimed in claim 1, wherein anonymizing (204), by the one or more processors (104), the raw data comprises:
determining, by the one or more processors (104), a domain of the raw data to anonymize the raw data.
3. The method (200) as claimed in claim 1, wherein generating (206), by the one or more processors (104), the plurality of synthetic datasets comprises:
generating, by the one or more processors (104), at least one of: synthetic structured data of the plurality of synthetic datasets, and image-based data of the plurality of synthetic datasets based on the anonymized raw data using the AI model;
simultaneously generating, by the one or more processors (104), at least one of: synthetic textual data of the plurality of synthetic datasets, and tokenized data of the plurality of synthetic datasets based on the anonymized raw data using the AI model; and
simultaneously generating, by the one or more processors (104), high-dimensional synthetic data of the plurality of synthetic datasets based on the anonymized raw data using the AI model.
4. The method (200) as claimed in claim 1, wherein determining (208), by the one or more processors (104), the one or more risk metrics comprises:
measuring, by the one or more processors (104), the one or more risk metrics in the plurality of synthetic datasets using at least one of: k-anonymity techniques, l-diversity techniques, and t-closeness techniques.
5. The method (200) as claimed in claim 1, wherein determining (210), by the one or more processors (104), that the one or more risk metrics exceed the predefined threshold comprises:
assigning, by the one or more processors (104), a score corresponding to the one or more risk metrics; and
comparing, by the one or more processors (104), the score with the predefined threshold.
6. The method (200) as claimed in claim 1, wherein the one or more parameters comprise at least one of: noise injection levels, sampling rates, and weights of the AI model.
7. The method (200) as claimed in claim 1, wherein selecting (214), by the one or more processors (104), the at least one synthetic dataset comprises:
regenerating, by the one or more processors (104), the plurality of synthetic datasets based on the adjustment of the one or more parameters;
determining, by the one or more processors (104), fidelity and the one or more risk metrics of the regenerated plurality of synthetic datasets;
determining, by the one or more processors (104), that the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold; and
in response to the determination that the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold, selecting, by the one or more processors (104), the at least one synthetic dataset.
8. The method (200) as claimed in claim 7, wherein determining, by the one or more processors (104), the fidelity and the one or more risk metrics comprises:
measuring, by the one or more processors (104), the fidelity and the one or more risk metrics in the regenerated plurality of synthetic datasets using evaluation techniques.
9. The method (200) as claimed in claim 7, wherein determining, by the one or more processors (104), that the fidelity and the one or more risk metrics of the at least one synthetic dataset do not exceed the predefined threshold comprises:
reassigning, by the one or more processors (104), a score corresponding to the fidelity and the one or more risk metrics; and
comparing, by the one or more processors (104), the reassigned score with the predefined threshold.
10. A system (102) for generating a synthetic dataset, comprising:
one or more processors (104); and
a memory (106) operatively coupled with the one or more processors (104), wherein the memory (106) comprises one or more instructions which, when executed, cause the one or more processors (104) to:
receive raw data from a plurality of sources, via an Application Programming Interface (API);
anonymize the raw data;
generate a plurality of synthetic datasets based on the anonymized raw data using an Artificial Intelligence (AI) model configured within the one or more processors (104);
determine one or more risk metrics for each of the plurality of synthetic datasets;
determine whether the one or more risk metrics exceed a predefined threshold in real time;
in response to the determination that the one or more risk metrics exceed the predefined threshold, dynamically adjust one or more parameters associated with the AI model; and
select at least one synthetic dataset from the plurality of synthetic datasets based on the adjustment.