
A System And Method For Synthetic Data Generation

Abstract: The present invention relates to a system (102) for synthetic data generation (402) and data replication (404). The system (102) first collects sample data through the user interface (302). The core parameters and sample data are passed to the LLM (320) integration module. The LLM (320) is configured to understand the core parameters, sample data, and business logic provided by the user. Further, the LLM (320) generates specialized Python code for data generation (402). The code is validated and executed by the code executor (312) to generate the synthetic data. At the end, the system (102) generates pointers, which may be considered by the LLM (320) for data generation (402). The pointers provide, in natural language, the explanation, reasons, business logic, and constraints that the LLM (320) has considered while generating the synthetic data.


Patent Information

Application #
Filing Date: 29 May 2025
Publication Number: 25/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status
Parent Application

Applicants

TECH MAHINDRA LIMITED
Tech Mahindra Limited, Phase III, Rajiv Gandhi Infotech Park Hinjewadi, Pune - 411057, Maharashtra, India

Inventors

1. JHA, Saurabh
Tech Mahindra Limited, Oberoi Garden Estate, Chandivali, Off Saki Vihar Road, Andheri East, Mumbai - 400072, Maharashtra, India
2. TIWARI, Sushil
Tech Mahindra Limited, Phase 3, Hinjawadi Rajiv Gandhi Infotech Park, Hinjawadi, Pimpri-Chinchwad, Pune – 411057, Maharashtra, India
3. MAHARAJAN, Ezhilarasan
Tech Mahindra, KIADB Industrial Area, Plot No. 45- 47, First Main Rd, Phase II, Electronic City, Bengaluru - 560100, Karnataka, India
4. JAJOO, Sachin
Tech Mahindra Limited, 7th & 8th Floor, Capital Cyberscape, Sector-59, Golf Course Extension Road, Gurugram - 122102, Haryana, India
5. VISHNU, Bonthala Kanth
Tech Mahindra, Survey number 64,Plot number 35,36,#techmahindra, hitech city, Jubilee enclave, Madhapur, Hyderabad -500081, Telangana, India
6. KALYAN, Pavan C
Tech Mahindra, KIADB Industrial Area, Plot No. 45- 47, First Main Rd, Phase II, Electronic City, Bengaluru - 560100, Karnataka, India
7. KHAYALIA, Sonali
Tech Mahindra Limited, 7th & 8th Floor, Capital Cyberscape, Sector-59, Golf Course Extension Road, Gurugram - 122102, Haryana, India
8. VELAMAKURI, Himabindu
Tech Mahindra, KIADB Industrial Area, Plot No. 45- 47, First Main Rd, Phase II, Electronic City, Bengaluru - 560100, Karnataka, India

Specification

Description: FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of Invention:
A SYSTEM AND METHOD FOR SYNTHETIC DATA GENERATION

Applicant:
TECH MAHINDRA LIMITED
A company incorporated in India under the Companies Act, 1956
Having address:
Tech Mahindra Limited, Phase III, Rajiv Gandhi Infotech Park Hinjewadi,
Pune - 411057, Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.

CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[001] The present application does not claim priority from any patent application.
TECHNICAL FIELD
[002] The present invention relates generally to the field of synthetic data generation. The present invention leverages Large Language Models (LLMs) and Agentic AI in combination with AI-driven code generation, supported by structured execution environments, to enable automated data creation, replication, and augmentation while ensuring compliance with privacy regulations.
BACKGROUND OF THE INVENTION
[003] Generally, organizations face significant challenges when no data is available or when they are working with sensitive data, including regulatory compliance issues, data privacy concerns, and limitations in data availability for system stress testing, AI/ML model or dashboard development, and data sharing across marketplaces for monetization. Organizations face legal and security risks when sharing sensitive data with their own teams, partners, vendors, and customers (i.e., both internally and externally). Many industries struggle with limited, fragmented, or restricted data, making AI/ML model training and analytics challenging. Stringent data privacy laws (GDPR, HIPAA, CCPA, etc.) restrict the use of real-world data for AI, analytics, and product development. Traditional data collection, data generation, data replication, annotation, and labelling are time-consuming, expensive, and labour-intensive. Traditional AI models suffer from bias due to imbalanced, incomplete, or skewed training datasets, leading to inaccurate predictions, while AI/ML models require large, diverse, and high-quality datasets for training, validation, and stress testing.
[004] Therefore, to overcome the problems associated with traditional systems, there is a need for an advanced system to generate synthetic data that mimics real datasets without exposing confidential information.
OBJECTS OF THE INVENTION
[005] Primary objective of the present invention is to provide a system and a method for synthetic data generation by taking data from various sources, including databases, cloud storage, CSV files, natural language-based text, free flow text or data, and DDL or SQL scripts.
[006] Another objective of the present invention is to perform data replication by maintaining the distribution pattern of the input data.
[007] Still another objective of the present invention is to provide a Gen AI and Agentic AI-powered synthetic data generation system to generate millions of records while maintaining column constraints, inter-column dependencies, and data distribution patterns.
[008] Yet another objective of the present invention is to provide LLM models that facilitate data generation by generating Python code rather than the actual data, wherein the Python code is used to invoke various libraries to generate millions of records or replicate an existing dataset.
[009] A further objective of the present invention is to create synthetic data with or without sample data and replicate existing datasets for AI/ML training, data analytics, testing, sharing over data marketplace and compliance adherence.
SUMMARY OF THE INVENTION
[0010] Before the present system is described, it is to be understood that this application is not limited to the particular machine, device, or system, as there can be multiple possible embodiments that are not expressly illustrated in the present disclosures. It is also to be understood that the terminology used in the description is to describe the particular versions or embodiments only, and is not intended to limit the scope of the present application. This summary is provided to introduce aspects related to a system and method for synthetic data generation and the aspects are further elaborated below in the detailed description. This summary is not intended to identify essential features of the proposed subject matter nor is it intended for use in determining or limiting the scope of the proposed subject matter.
[0011] In an embodiment, the present invention provides a system for generating synthetic data, comprising a processing unit (202) and a storage device (206) coupled to the processing unit. The processing unit (202) enables the system to perform various operations, starting with receiving structured, semi-structured, or unstructured input data through a user interface. The received data is then stored in a database that is communicatively coupled to the processing unit, allowing further processing. The system pre-processes the stored data to perform data profiling, extracting schema information, relationship constraints, and statistical parameters essential for accurate synthetic data generation. Additionally, it detects personally identifiable information (PII) within the input and applies at least one anonymization technique to ensure privacy-preserving synthetic data.
[0012] Once the data is pre-processed, the Large Language Model (LLM) integration module analyses the privacy-preserving data along with the extracted schema information, relationship constraints, and statistical patterns. Based on this analysis, the LLM integration module generates executable scripts that encode the logic for synthetic data generation using Python code. These scripts are then executed via a code execution engine, which leverages at least one library designed to generate synthetic data while ensuring that the generated records maintain the statistical characteristics, dependencies, and relationships defined in the input data. The Agentic AI/LLM model does not generate the data itself; it only generates the Python scripts that are used to invoke different libraries to generate the data.
[0013] To validate the synthetic data, the system incorporates AI/ML models and techniques within the library, ensuring that the generated data aligns with predefined constraints and maintains consistency with real-world distributions. Finally, the system generates the synthetic data along with a natural language explanation, with the help of python code produced by the LLM integration module, which details the applied logic, constraints, and data generation considerations in a predefined format. This structured approach ensures transparency, accuracy, and compliance in synthetic data generation across various industries and applications.
[0014] In an embodiment, the present invention provides a system for synthetic data generation, comprising multiple modular layers to ensure structured and scalable data processing. The data ingestion layer serves as the primary interface, enabling user interaction through an interface that facilitates data submission via file uploads, database connections, user-defined scripts, or natural language instructions. Once data is submitted, the pre-processing layer processes the input by performing data profiling, extracting relationships, constraints, and statistical properties. The system includes an LLM integration module, which analyses the pre-processed data and generates specialized executable scripts tailored for synthetic data generation. These scripts are then validated and executed within a secure execution layer, leveraging predefined libraries to ensure data consistency and integrity. The generated synthetic data undergoes post-processing, where statistical validation is performed to compare the generated data against the input data, ensuring accuracy. Finally, the storage and output layer stores the validated synthetic data and provides access to the final dataset in predefined formats, including CSV, JSON, Parquet, SQL, and NoSQL.
[0015] In an embodiment, the present invention provides a system configured to accept various types of input data, including natural language descriptions defining table structures, column specifications, date ranges, industry-specific constraints, and logic without requiring sample data. Additionally, the system can process structured SQL queries defining table schemas with optional business logic, and DDL scripts specifying table schemas, column names, data types, primary key and foreign key relationships, and join conditions, with optional business logic. The system also accepts sample datasets comprising at least rows of records, configured to generate large volumes of synthetic data while maintaining logic and inter- and intra-column constraints; free-flow text defining data generation requirements; or a dataset configured for replication, ensuring that the generated synthetic data maintains the statistical and distribution patterns.
[0016] In yet another embodiment, the present invention provides a pre-processing mechanism by the processing unit (202) that identifies statistical patterns, inter-column dependencies, and data distribution characteristics of the input data. The system ensures that the executable scripts encode logic for synthetic data generation, preserving statistical patterns, primary key and foreign key relationships, and privacy constraints of the input data. Additionally, the executable scripts are configured to generate a user-specified number of synthetic records, including millions of records, when requested.
[0017] In still another embodiment, the present invention provides a system wherein the generated executable scripts for synthetic data generation are executed within a secure, isolated, and controlled virtual space. This environment prevents the executed code from affecting the main system or input data. Furthermore, the Large Language Model (LLM) integration module gathers insights from the input data to construct executable code, which, when combined with the library, produces synthetic data while preserving the structure, relationships, constraints, and statistical parameters. The system is also configured to extrapolate or replicate database data while maintaining its distribution patterns and statistical properties.
[0018] In an embodiment, the present invention provides a system configured to output synthetic data and natural language explanations in various formats, including CSV, JSON, Parquet, SQL, and NoSQL. The generated natural language explanation details the business logic, constraints, and considerations applied during synthetic data generation, ensuring clarity for users. Additionally, the system supports automated synthetic data generation, including scenarios where no input data is available, by leveraging only DDL/SQL scripts or free-flowing natural language text.
[0019] In yet another embodiment, the present invention provides a system configured to validate synthetic data through statistical similarity tests, ensuring that the generated data matches the distribution, dependencies, and relationships of the input data or user-defined schema. Moreover, the system incorporates a user interface that allows users to select columns for masking, faking, or synthetic generation while specifying logic or constraints to be applied during the data generation process.
[0020] In another aspect, the present invention provides a method for generating synthetic data, starting with receiving structured, semi-structured, or unstructured input data through a user interface. The data is then stored in a database communicatively coupled to a processing unit, where it undergoes pre-processing to extract schema information, relationship constraints, and statistical parameters. During this stage, personally identifiable information (PII) is detected, and anonymization techniques are applied to ensure privacy preservation. The system analyses the privacy-preserving data using an LLM integration module, which identifies patterns, inter-column dependencies, and distribution characteristics to maintain data fidelity. The LLM integration module generates executable scripts, encoding Python-based logic for synthetic data generation. These scripts are executed using a code execution engine, leveraging predefined libraries to produce synthetic data efficiently. The system then performs validation using AI/ML models, ensuring that the synthetic data adheres to business logic constraints and statistical integrity. Lastly, the LLM integration module generates a natural language explanation of the synthetic data creation process in a predefined format, providing transparency into the applied constraints and methodologies.
BRIEF DESCRIPTION OF DRAWING
[0021] The foregoing summary, as well as the following detailed description of embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the disclosure, there is shown in the present document example constructions of the disclosure, however, the disclosure is not limited to the specific methods and device disclosed in the document and the drawing. The detailed description is described with reference to the following accompanying figures.
[0022] Figure 1: illustrates a network implementation of a system, in accordance with an embodiment of the present subject matter.
[0023] Figure 2: illustrates the block diagram of the system for generating the synthetic data, in accordance with an embodiment of the present subject matter.
[0024] Figure 3: illustrates the architecture of the system for generating the synthetic data, in accordance with an embodiment of the present subject matter.
[0025] Figure 4: illustrates a flow diagram for synthetic data generation and data replication, in accordance with an embodiment of the present subject matter.
[0026] Figure 5: illustrates a flow chart of a method (500) for synthetic data generation, in accordance with an embodiment of the present subject matter.
[0027] The figures depict various embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures illustrated herein may be employed without departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION OF THE INVENTION
[0028] Some embodiments of this disclosure, illustrating all its features, will now be discussed in detail. The words "comprising", "having", and "including", and other forms thereof, are intended to be equivalent in meaning and be open-ended, in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that, as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Although any devices and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the exemplary devices and methods are now described. The disclosed embodiments are merely exemplary of the disclosure, which may be embodied in various forms.
[0029] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure is not intended to be limited to the embodiments illustrated, but is to be accorded the widest scope consistent with the principles and features described herein.
[0030] Following is a list of elements and reference numerals used to explain various embodiments of the present subject matter.
Reference Numeral Element Description
100 Network implementation of a system
102 System
106 Network
202 Processing Unit
206 Storage Device
208 Database
302 User interface
304 Pre-processing Layer
306 Record/ data/ DDL/SQL/Table/Natural Language input
308 Inter-column dependency
310 Code generator
312 Code executor
314 Post-processing Layer
316 LangChain
318 Prompt
320 Large Language Model (LLM)
322 Generative Artificial Intelligence (Gen AI) Model
324 Output
402 Data generation
404 Data replication
500 Method
[0031] The synthetic data generation system of present invention offers a promising solution by creating artificial datasets that replicate the statistical properties, patterns, and characteristics of real-world data without containing any personally identifiable information (PII) or sensitive details. By leveraging advanced techniques such as AI-driven methods, synthetic data can be tailored to meet specific requirements for diversity, volume, and quality. This enables organizations to conduct robust AI/ML model training, testing, and validation while ensuring compliance with stringent data privacy regulations. Moreover, synthetic data eliminates the risks associated with sharing sensitive information across teams, partners, or third-party vendors, facilitating secure collaboration and data monetization in marketplaces without compromising confidentiality.
[0032] Additionally, by providing high-quality synthetic datasets for analysis, the present invention enables organizations to enhance the robustness and generalizability of their AI/ML models and dashboards, ultimately driving innovation while maintaining data security and regulatory compliance.
[0033] The present invention relates to a synthetic data generation system and method designed to create structured, high-fidelity datasets without relying on real-world data samples. It leverages Large Language Models (LLMs) and web architectures to produce synthetic data that maintains statistical properties, relationships, and business logic. The system supports automated data creation, replication, and augmentation, ensuring compliance with privacy regulations while enabling large-scale data generation for AI/ML training, software testing, analytics, and regulatory compliance.
[0034] The invention provides multiple data input mechanisms, including natural language descriptions, SQL queries, and DDL scripts, allowing users to generate synthetic datasets without pre-existing records. It incorporates LLM-powered analysis to interpret user specifications and generate executable Python code for structured data creation. Furthermore, the system ensures privacy protection through differential privacy techniques, statistical pattern preservation, and referential integrity enforcement.
[0035] By combining AI-driven code generation with structured execution environments, this invention addresses challenges in data scarcity, bias reduction, and privacy-preserving data synthesis, offering an efficient, scalable, and compliant solution for enterprises seeking intelligent synthetic data generation methodologies.
[0036] Referring to Figure 1, a network implementation (100) of a system (102) hosted on a server is illustrated, with the capability to operate across various computing environments, including cloud-based setups. Multiple users or stakeholders access the system (102) via diverse user devices (104-1 to 104-4), such as processing units, IoT devices, gateways, portable computers, and workstations, all connected through a network (106). This network (106) can be wired, wireless, or hybrid, supporting protocols like HTTP, HTTPS, and TCP/IP while integrating routers, servers, and storage devices.
[0037] In accordance with an embodiment illustrated in Figure 2, a block diagram (200) of the system (102) for generating the synthetic data is illustrated. The system (102) includes at least one processing circuitry or processing unit (202), an interface (302), and a storage device (206). The processing unit (202), which may consist of microprocessors, microcontrollers, or digital signal processors, is responsible for executing computer-readable instructions stored in the storage device (206). The interface (302) facilitates interaction with users and communication with other computing devices, supporting multiple network types and protocols. The storage device (206) encompasses various computer-readable media, including volatile and non-volatile memory, and contains a plurality of modules, a code generator (310), a code executor (312), and a database (208). The plurality of modules comprises routines and programs that execute specific tasks, and the database (208) serves as a storage space for processed, received, and generated data, including data associated with the invention. The system (102) is built to support APIs and container technologies so that the system (102) is portable to any cloud platform. The system (102) comprises a plurality of modules, including a Generative Artificial Intelligence (Gen AI) model (322) and a Large Language Model (LLM) (320).
[0038] A user interface (UI) (302) is the first page of the system (102) for accessing the different modules of the system (102). The UI (302) may be accessed by the users to explore the capabilities of the system (102) as well as to interact with the system (102) through prompts (318). The synthetic data generation system (102) is configured to ingest data from various sources, including databases, cloud storage, CSV files, natural language-based text, free-flow text or data, and DDL or SQL scripts (306). Users may also define additional business logic. The system (102) leverages the large language model (LLM) (320) to analyze the input and business logic, derive insights, and construct executable code for generating synthetic data. The LLM (320) processes the provided data, whether sample data, DDL scripts, SQL queries, or free-flow data, to identify statistical patterns, column dependencies, business rules, and distribution characteristics. The system (102) ensures compliance with PK/FK constraints, inter-table relationships, and privacy-preserving requirements. The system (102) learns business logic, statistical correlations, and data structures from the provided sample data, and is configured to replicate real-world distributions.
[0039] Specifically, the system (102) detects personally identifiable information (PII) and applies anonymization or differential privacy techniques to ensure sensitive data is masked, faked, or replicated as required. Based on the analysis, the LLM (320) generates a script (e.g., Python code) that encodes the necessary logic for synthetic data generation. The script ensures that the generated data maintains the structural integrity, statistical properties, and privacy constraints required for accurate replication or anonymization and enables the system (102) to generate millions of records. The script is handed off to a library (a software component in the system (102)) from the LLM (320), which executes it. This library might directly use AI models to get the structure, relationships, constraints, and statistical parameters to produce the synthetic data. The library is configured to generate synthetic data that preserves the identified pattern of data or statistical and characteristic parameters of the sample data.
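By way of a non-limiting sketch of the kind of script described above (the field names, the seeded `random` generator, and the example constraint are illustrative assumptions; a production script emitted by the LLM (320) would typically delegate to providers from a library such as Faker), generated code might resemble:

```python
import random
import string
from datetime import date, timedelta

# Hypothetical generated script: produces synthetic "customer" records in
# which PII columns (name, email) are faked, and a simple inter-column
# constraint (signup_date <= last_login) is encoded directly in the code.

def fake_name(rng):
    # Stand-in for a Faker name provider, kept self-contained here.
    return rng.choice(string.ascii_uppercase) + "".join(
        rng.choices(string.ascii_lowercase, k=6))

def generate_customers(n, seed=0):
    rng = random.Random(seed)  # seeded for reproducible output
    rows = []
    for i in range(n):
        name = fake_name(rng)
        signup = date(2023, 1, 1) + timedelta(days=rng.randint(0, 300))
        last_login = signup + timedelta(days=rng.randint(0, 60))
        rows.append({
            "customer_id": 1000 + i,              # surrogate primary key
            "name": name,                         # faked PII
            "email": f"{name.lower()}@example.com",
            "signup_date": signup.isoformat(),
            "last_login": last_login.isoformat(),
        })
    return rows

rows = generate_customers(5)
```

The essential property illustrated is that the privacy handling and constraints live in the generated code itself, so the same script can be re-run at any scale without further LLM involvement.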
[0040] In an embodiment, the system (102) provides a unique capability to generate synthetic data without requiring real sample data. It offers flexible input methods like:
[0041] Natural Language Input: Users describe the desired table structure, including columns, data types, date ranges, industry, and constraints, in plain English. The system interprets this input and generates the specified number of records accordingly.
[0042] SQL Query Input: Users provide a SQL statement (such as CREATE TABLE) to define one or more tables. They specify the number of rows and any business logic, and the system processes this input to generate the corresponding synthetic data.
[0043] DDL Script Input: Users submit a Data Definition Language (DDL) script, typically used in data warehouses. This script outlines table definitions, column names, data types, and relationships, along with additional business rules. The system then generates synthetic data that adheres to all schema constraints and relationships.
[0044] This structured approach ensures that synthetic data generation is highly customizable and accurately reflects user-defined constraints.
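As an illustration of the DDL input mode above (the table and column names are hypothetical, and a real implementation would use a full SQL parser rather than this regular-expression sketch), extracting the schema a generator must honour might look like:

```python
import re

# Hypothetical DDL input of the kind a user might submit.
DDL = """
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer VARCHAR(40),
    order_date DATE
);
"""

def parse_ddl_columns(ddl):
    # Grab the parenthesised column list, then take the first two tokens of
    # each comma-separated entry as (column name, data type).
    body = re.search(r"\((.*)\)", ddl, re.S).group(1)
    columns = []
    for entry in body.split(","):
        parts = entry.split()
        if len(parts) >= 2:
            columns.append((parts[0], parts[1]))
    return columns

columns = parse_ddl_columns(DDL)
```

The resulting (name, type) pairs are the minimum a downstream generator needs to emit records that adhere to the schema.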
[0045] A key differentiator is that, regardless of the input method or whether sample data is provided, the system uses a Large Language Model (LLM) to generate a plain English explanation detailing the business logic, constraints, data structure, and reasoning behind the generated values, providing transparency and trust for business users, a feature not found in other solutions. If users do provide sample records, the system can analyze the data's structure, patterns, and correlations and generate a much larger synthetic dataset that maintains the same characteristics and relationships, with the option to add custom business logic in natural language.
[0046] Unlike other solutions that use the LLM to directly generate all synthetic data (which is inefficient for large datasets due to token and cost limitations), the system (102) uses the LLM only to understand user input, extract constraints, and generate Python code that encodes the logic for data generation. This code is then executed by other libraries (such as Faker or SDV) to actually generate the data, making the approach highly efficient, scalable, and cost-effective. As a result, the system (102) can efficiently create millions of records without exceeding LLM token or processing limits, supports complex business logic and multi-table generation with referential integrity and privacy constraints, and provides a transparent, user-friendly architecture that surpasses traditional AI/ML or LLM-only synthetic data solutions.
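The generate-code-then-execute split described above can be sketched as follows. The `GENERATED_CODE` string stands in for LLM output; a real deployment would add genuine sandboxing (subprocess isolation, resource and time limits, import restrictions) beyond the bare namespace isolation shown here:

```python
# Stand-in for code emitted by the LLM integration module; in the actual
# system this string would come back from a model call, be vetted, and only
# then be executed.
GENERATED_CODE = """
import random

def generate(n, seed=42):
    rng = random.Random(seed)
    return [{"id": i, "score": round(rng.uniform(0, 100), 2)}
            for i in range(n)]
"""

def execute_generated(code, n):
    namespace = {}
    # Compile and run the script in its own namespace so it cannot clobber
    # the caller's globals (note: this is not a security boundary).
    exec(compile(code, "<generated>", "exec"), namespace)
    return namespace["generate"](n)

records = execute_generated(GENERATED_CODE, 1000)
```

The point of the pattern is that the expensive LLM call happens once per schema, while the cheap generated function can be invoked for any record count.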
[0047] In an embodiment, the system may generate synthetic data with: a) a sample set of data as input; b) free-flow text as input; c) a DDL script as input; d) extrapolation of data in a database; e) production database (208) data replication, maintaining the distribution pattern of the original dataset; and f) a natural language based explanation for every bit of data it synthetically generates, for the users to understand the logic and constraints factored in while generating the data.
[0048] In an embodiment, the system (102) uses the database (208) in the backend to store various types of information. In an embodiment, the database (208) may also be referred to as backend services. All information, such as input provided by the user, including sample data, DDL scripts, SQL queries, free-flow data, and natural language queries, is saved in the database (208).
[0049] In another embodiment, Figure 3 illustrates the architecture of the system (102) for generating the synthetic data. The system (102) comprises a modular, AI-driven architecture. The architecture comprises a user interface (UI) (302) for the users to interact with the system (102). This UI (302) may be used by different users depending on their use case, and the system (102) provides this ability to all the users involved in synthetic data generation for any kind of use.
[0050] In an embodiment, the system (102) may employ a web application architecture with clear separation between a frontend, a backend, and processing components. The plurality of layers of the architecture is explained further in detail below. In an embodiment, the frontend layer of the system (102) may be an Angular-based web application, providing user interfaces (302) for all data generation modes, visualization of results, and data export functions.
[0051] In an embodiment, the Backend Application Programming Interface (API) Layer is a Flask-based RESTful API service. The API services may be configured to manage file upload processing, database (208) connection management, and result storage and retrieval from the database (208).
[0052] In an embodiment, the processing layer (314) of the system (102) provides the LLM Integration Module, using the LangChain (316) and LangGraph agentic framework (316) and LLMs (320) for automating and orchestrating the end-to-end flow, a Code Generation Engine, a Secure Code Execution Environment, and a Data Analysis and Validation Service.
[0053] In an embodiment, a storage layer of the system (102) is configured to provide temporary storage for uploaded files, and results database (208) for generated datasets.
[0054] The system's architecture comprises a data ingestion layer. The data ingestion layer accepts structured data (databases, CSVs), semi-structured data (JSON, XML), metadata (DDL scripts), or free-flow data as input. The data ingestion layer may further connect with cloud data warehouses and on-premise databases (208) to store the received data.
[0055] The system's architecture comprises a pre-processing layer (304) and a feature engineering layer. The pre-processing layer (304) is configured to perform data profiling to understand patterns, constraints, and distributions of the received data. The system (102) may anonymize the sensitive data to ensure compliance. The system (102) identifies inter-column dependencies (308) and applies constraints.
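A minimal sketch of the profiling step performed by the pre-processing layer (304); the column names and the type-inference heuristic are illustrative assumptions, not part of the specification:

```python
import statistics

# Hypothetical sample rows as they might arrive from a CSV upload; every
# value is a string until profiling infers its type.
sample = [
    {"age": "34", "city": "Pune"},
    {"age": "41", "city": "Mumbai"},
    {"age": "29", "city": "Pune"},
]

def profile_column(rows, col):
    """Infer a column's type and the basic statistics a generator needs."""
    values = [r[col] for r in rows]
    try:
        nums = [float(v) for v in values]
        return {"type": "numeric", "min": min(nums), "max": max(nums),
                "mean": statistics.mean(nums)}
    except ValueError:
        # Non-numeric: treat as categorical and record the value domain.
        return {"type": "categorical", "distinct": sorted(set(values))}

age_profile = profile_column(sample, "age")
city_profile = profile_column(sample, "city")
```

Profiles of this shape (type, range, value domain) are exactly the metadata the LLM integration module would receive in place of the raw, possibly sensitive, records.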
[0056] In an embodiment, the system's architecture comprises a Generative Artificial Intelligence (Gen AI) Model (322) Layer. The system (102) uses a plurality of Variational Auto-Encoders (VAEs), Generative Adversarial Networks (GANs), and Transformer-based models to learn from sample data. The system (102) may generate synthetic data while retaining relationships, correlations, and business logic of the sample data.
[0057] In an embodiment, the system's architecture comprises a Post-processing and Validation Layer. The system (102) runs statistical similarity tests (Kolmogorov-Smirnov, Chi-Square, Jensen-Shannon divergence). The validation layer ensures the synthetic data matches the distribution and dependencies of the real input data.
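The Kolmogorov-Smirnov and Jensen-Shannon checks named above can be sketched in pure Python. This is an illustrative sketch, not the claimed implementation; the column values below are invented stand-ins for a real and a synthetic numeric column.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical cumulative distribution functions."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] == x:   # advance past ties in both samples
            i += 1
        while j < nb and b[j] == x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

random.seed(42)
real = [random.gauss(50, 10) for _ in range(2000)]       # stand-in for a real column
synthetic = [random.gauss(50, 10) for _ in range(2000)]  # stand-in for generated data
distance = ks_statistic(real, synthetic)
```

A KS statistic near zero (and a JS divergence near zero over binned values) indicates the synthetic column tracks the real distribution; a production validation layer would typically rely on a statistics library such as `scipy.stats` for the same tests.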
[0058] In an embodiment, the system's architecture comprises an output (324) and deployment layer. The deployment layer may be configured to export synthetic data in CSV, JSON, Parquet, SQL, or NoSQL formats. The deployment layer provides APIs for real-time synthetic data generation.
[0059] Particularly, in an embodiment, the system (102) follows a streamlined, modular flow to ensure high-quality, scalable synthetic data generation regardless of the chosen mode. The process begins at the data ingestion layer, which acts as the system's primary interface (302). The users interact with an Angular-based frontend to submit their data, by uploading files, connecting to databases, or entering scripts or natural language instructions. Once the data is submitted, the Flask backend validates and pre-processes the input, performing tasks such as schema detection, column type inference, and initial profiling to extract relationships and constraints. The core parameters, sample data, or schema definitions are then passed to the LLM integration module, which acts as the system's "brain." For example, if a user uploads a CSV file or enters a DDL script, the LLM receives both the data and any user-provided business logic, then generates specialized Python code tailored to the data generation requirements, such as maintaining referential integrity or applying specific business rules. This code is validated and executed within a secure sandbox environment, leveraging libraries like Faker with custom providers, and is monitored for resource usage and errors to ensure safe, reliable operation. Once the synthetic data is generated, it undergoes post-processing for consistency and quality, including statistical validation against the original data or schema and the generation of preview samples for the user interface (302). Finally, results are stored and made available for download in various formats (CSV, JSON, Excel, SQL), visualization, or direct insertion into databases (208). For example, a user might connect to a database (208), select a table, specify that sensitive columns should be masked, and request a million synthetic records.
The system may automatically analyze the schema, generate the appropriate code, synthesize the data while preserving constraints and statistical properties, and provide both the synthetic dataset and a plain-English explanation of the applied logic and constraints. This structured, layered approach ensures that every step—from input to output—is transparent, auditable, and customizable to diverse enterprise needs.
[0060] In another embodiment, Figure 4 illustrates a flow diagram (400) for synthetic data generation (402) and data replication (404). The system (102) is configured to perform a set of steps to generate synthetic data. The system (102) first collects user inputs through a frontend interface (302). The input data is validated and pre-processed by a Flask backend. The core parameters and sample data are passed to the LLM integration module. The LLM (320) is configured to understand the core parameters, sample data, and business logic provided by the user. In an embodiment, the system (102) may define business logic such as the date range, the product type, the product names, the industry name, and the like. Further, the LLM (320) generates specialized Python code for data generation (402). The code is validated and executed in a secure sandbox environment. The system (102) finally explains, in natural language, how it has generated the synthetic data. The explanation generated by the LLM (320) is a natural language output (324) that shows the considerations taken by the system (102) to generate the synthetic data. The system (102) generates these explanations at the end, illustrating the business logic, constraints, correlations, data relationships, data ranges, and the like considered by the LLM (320) for data generation (402). The pointers provide the explanation, reasons, business logic, and constraints in natural language about the points the LLM (320) has considered for generating the synthetic data. As per one example, if the data has been generated for the auto industry, then the claim date is always after the warranty start date, the total cost is the combination of parts, labour, tax, etc., and primary keys and foreign keys are considered to maintain the relationships. Therefore, whatever business logic or constraints the LLM (320) may have used to generate the data, the system may show them as one of the outputs (324) in plain English to the user.
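The kind of script the LLM (320) might emit for the auto-industry example can be sketched as follows. The field names, value ranges, and the 18% tax rate are illustrative assumptions, not values from the specification; the sketch only shows how the example constraints (claim date after warranty start, total cost as the sum of components, unique primary keys) can be encoded in generated code.

```python
import random
from datetime import date, timedelta

random.seed(7)

def generate_claims(n):
    """Illustrative generator enforcing the example constraints:
    claim_date is always after warranty_start, and total_cost is
    the combination of parts, labour, and tax."""
    rows = []
    for i in range(n):
        warranty_start = date(2023, 1, 1) + timedelta(days=random.randint(0, 365))
        claim_date = warranty_start + timedelta(days=random.randint(1, 730))
        parts = round(random.uniform(100.0, 2000.0), 2)
        labour = round(random.uniform(50.0, 500.0), 2)
        tax = round(0.18 * (parts + labour), 2)  # assumed tax rate
        rows.append({
            "claim_id": i + 1,  # primary key kept unique
            "warranty_start": warranty_start,
            "claim_date": claim_date,
            "total_cost": round(parts + labour + tax, 2),
        })
    return rows

claims = generate_claims(100)
```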
[0061] The generated data is post-processed for consistency and quality. The generated results are stored and made available for download, visualization, or database (208) insertion.
[0062] In an embodiment, the LLM (320) may be used to analyse the inputted data information, input constraints, and user expectations. The LLM (320) generates Python code. Further, the Python code uses other libraries in order to generate the synthetic data.
[0063] The LLM (320) only provides the code construct; the sample values or the actual data are generated by the Python code. As per one example, if the LLM (320) is used to generate 1,000,000 records, then the number of tokens used by the LLM (320) is very small, because in the proposed system (102), the LLM (320) is used only to generate the Python code and not to generate the whole synthetic dataset. The system (102) generates the Python code with the LLM (320). The system (102) uses Python libraries to execute the Python code and subsequently generate the synthetic data. The system (102) thereby reduces the cost of consumption and the processing time.
[0064] In an embodiment, how the system (102) may generate synthetic data by using various types of input data is illustrated further in detail.
[0065] In an embodiment, the system (102) may generate synthetic data using files or tables. This functionality of the system (102) allows users to generate synthetic datasets based on sample data that maintains the characteristics of the original data but with completely generated values.
[0066] In an embodiment, the technical implementation of the system (102) is illustrated further. The Data Ingestion layer provides direct database (208) connection using JDBC/ODBC protocols. The data ingestion layer is configured to provide file upload options for CSV, Excel, JSON, and other common formats. Further, the data ingestion layer includes automatic schema detection and column type inference, and a sample selection mechanism (random or user-directed).
[0067] In an embodiment, the LLM (320) is configured to perform the relationship analysis. The LLM (320) analyses the user-provided primary key and foreign key specifications, performs optional automatic relationship detection for database (208) inputs, and applies constraint preservation mechanisms for maintaining referential integrity.
[0068] In an embodiment, the LLM (320)-based code generation is illustrated further. The system (102) constructs specialized prompts (318). The prompt (318) receives sample data containing N records. The LLM (320) may extract data schema information, subject area context, user-provided generation guidelines, and relationship constraints from the N records.
[0069] In an embodiment, the LLM (320) generates Python code that is used to execute the Faker library with custom providers. In an embodiment, the code may be Python code or the like. Further, the code generator (310) of the system (102) is configured to execute the code in the sandboxed execution environment by utilizing the Faker library. The system (102) monitors resource consumption limits, progress, cancellation capabilities, and error handling and retry mechanisms.
[0070] In an embodiment, the system (102) finally processes the result. The system (102) generates an output format (324). The output format (324) may be converted to CSV, Excel, JSON, or SQL inserts. The generated output (324) of the system (102) maintains consistency validation, statistical comparison with the sample data, and data preview generation for UI (302) display.
[0071] In an embodiment, the system (102) generates the synthetic data by inputting a SQL Script or DDL Script through the user interface (302). This approach allows users to generate data directly from database schema definitions without requiring sample data. This approach is particularly useful for new system development or when sample data is unavailable. The technical implementation of the system (102) using the SQL Script or DDL Script as the input is illustrated further.
[0072] The LLM (320) of the system (102) further performs SQL/DDL script parsing. Further, the system (102) performs schema extraction, data type mapping, and constraint identification (NOT NULL, unique, checks, etc.). The LLM (320) analyses the subject area integration, user-provided guidelines interpretation, and domain-specific rules application.
[0073] The LLM prompt (318) construction with the parsed schema and context is processed by the system (102). The system (102) further provides Python code generation with appropriate data generation (402) functions, inter-table relationship preservation logic, and optimization for large dataset generation. In an embodiment, the system (102) may have fast execution, delivery, and batch processing capabilities. The system (102) may provide multiple download options for downloading the generated synthetic data.
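The prompt (318) construction step might be sketched as below. The template wording, the helper name `build_generation_prompt`, and the example schema are assumptions added for illustration only.

```python
def build_generation_prompt(schema, subject_area, guidelines, n_records):
    """Fold the parsed schema and user context into one LLM instruction."""
    columns = "\n".join(
        f"- {col['name']} ({col['type']})"
        + (" NOT NULL" if col.get("not_null") else "")
        for col in schema["columns"]
    )
    return (
        f"Write Python code that generates {n_records} synthetic records "
        f"for the '{schema['table']}' table in the {subject_area} domain.\n"
        f"Columns:\n{columns}\n"
        f"Guidelines: {guidelines}\n"
        "Preserve primary key and foreign key relationships."
    )

schema = {
    "table": "claims",
    "columns": [
        {"name": "claim_id", "type": "INT", "not_null": True},
        {"name": "claim_date", "type": "DATE"},
    ],
}
prompt = build_generation_prompt(
    schema, "Auto Insurance", "claim_date must follow the warranty start date", 1000
)
```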
[0074] In an embodiment, the data replication process (404) of the system (102) using tables or files as an input is illustrated further. This functionality focuses on creating synthetic versions of existing datasets while maintaining statistical properties and optionally masking sensitive information. Technical Implementation of the data replication process (404) is illustrated further.
[0075] In an embodiment, the system (102) implements a statistical analysis engine. In an embodiment, the engine performs the five-number summary calculation (minimum, first quartile, median, third quartile, maximum), distribution analysis, pattern recognition, and outlier detection. This ensures that the statistical and distribution patterns of the original dataset replicated with the synthetic data are within the expected range and very close to the original properties, and hence helps in validating the final replicated synthetic data output (324).
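The five-number summary can be computed with the Python standard library; a minimal sketch, with an invented sample column:

```python
import statistics

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

summary = five_number_summary([2, 4, 4, 5, 7, 9, 11])
```

Comparing this summary for the original column against the same summary for the replicated column is one simple way to check that the synthetic values stay within the expected range.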
[0076] In an embodiment, the system (102) is configured to perform column classification of the inputted table. The system (102) is configured to perform sensitive data identification, data type categorization, pattern-based classification, and user-directed column selection through the interface.
[0077] In an embodiment, the system (102) is configured to perform synthesis approach selection on the inputted table and file. In an embodiment, the system (102) is configured to perform masking. The masking performs character-level substitution on the generated output format (324), thereby protecting sensitive data. In an embodiment, the system (102) is configured to perform faking by completely replacing the generated data with realistic but synthetic values. Further, the system (102) is configured to perform synthesizing on the output data, where the system (102) generates values with statistical properties identical to those of the original data.
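Two of these approaches can be sketched with a character-level mask and a distribution-matched synthesizer. The `keep=2` prefix length and the Gaussian model are illustrative assumptions; real column data could follow any distribution.

```python
import random
import statistics

def mask_value(value, keep=2):
    """Character-level substitution: keep a short prefix, mask the rest."""
    return value[:keep] + "*" * (len(value) - keep)

def synthesize_numeric(column, n, rng):
    """Draw n values from a Gaussian fitted to the column's mean and
    standard deviation, approximating 'identical statistical properties'."""
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(1)
masked = mask_value("9876543210")                       # "98********"
synthetic = synthesize_numeric([40, 50, 60, 55, 45], 1000, rng)
```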
[0078] In an embodiment, the LLM (320) Code Generation Process is illustrated further in detail. The system (102) supports statistical summary integration in prompts (318), Python code generation for specific column types, statistical preservation technique implementation, and relationship maintenance logic.
[0079] In an embodiment, the system (102) is configured to perform column merging and re-assembly. The system (102) supports original data preservation for non-selected columns, synthetic data integration for selected columns, row-level integrity maintenance, and final dataset composition and validation.
[0080] In an embodiment, the system (102) is configured to perform dynamic code generation using LLMs (320). The system (102) uses Large Language Models to dynamically generate executable Python code tailored to specific data generation (402) needs. This approach offers several advantages, which are listed below:
[0081] Adaptability: The system (102) may handle arbitrary data schemas and domains without predefined templates.
[0082] Domain Awareness: The LLM (320) may produce data appropriate to specific industries or use cases by including subject area information in prompts (318).
[0083] Complex Relationships: The generated code may implement sophisticated relationships between tables and fields that would be difficult to express in a configuration-based system.
[0084] Natural Variation: The LLM (320) introduces realistic variations in the generated data that mimic real-world data patterns.
[0085] Efficient Data Generation (402): The system (102) may generate millions of records using only a few tokens, significantly reducing token costs compared to traditional synthetic data generation systems.
[0086] Cost-Effective: The system (102) offers a competitive cost structure, with a total monthly cost lower than many other synthetic data generation platforms.
[0087] Scalable and Fast: The system (102) may support daily generation of large datasets while maintaining a low input and output token cost, making it ideal for high-volume data needs.
[0088] In some embodiments, the system (102) employs advanced techniques to ensure synthetic data preserves important statistical properties of original datasets:
[0089] Distribution Matching: The system (102) may ensure synthetic data follows the same statistical distributions as the original data.
[0090] Pattern Preservation: The system (102) may maintain temporal patterns, cycles, and trends present in time-series data.
[0091] Relationship Consistency: The system (102) may preserve correlations between fields within and across tables.
[0092] Realistic Anomalies: The system (102) introduces statistically appropriate outliers and exceptions.
[0093] In an embodiment, the ability of the system (102) to generate synthetic data across multiple related tables while maintaining referential integrity represents a significant advancement, realized through the capabilities listed below:
[0094] Constraint Propagation: Foreign key relationships are preserved by the system (102) during generation.
[0095] Circular Dependency Resolution: Sophisticated algorithms handle circular references between tables.
[0096] Cardinality Preservation: The system (102) maintains one-to-many and many-to-many relationships with appropriate distributions.
[0097] Hierarchical Data Support: The system (102) may handle parent-child relationships across multiple levels.
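The constraint propagation and cardinality points above can be illustrated by generating parent rows before child rows, so that every foreign key references an existing record. The table and column names are invented for the sketch.

```python
import random

random.seed(3)

# Parent table first: every customer_id that can legally appear as a
# foreign key exists before any child row is generated.
customers = [
    {"customer_id": cid, "segment": random.choice(["retail", "corporate"])}
    for cid in range(1, 101)
]

# Child table: each order draws its foreign key from the existing parents,
# preserving referential integrity and a one-to-many cardinality.
orders = [
    {
        "order_id": oid,
        "customer_id": random.choice(customers)["customer_id"],
        "amount": round(random.uniform(10.0, 500.0), 2),
    }
    for oid in range(1, 1001)
]
```

Generating tables in dependency order is the simplest form of constraint propagation; circular references require the more sophisticated resolution mentioned above.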
[0098] In an embodiment, the system (102) may be a next-generation AI-driven synthetic data generator that provides:
[0099] High-quality and reliable synthetic data with maintained constraints and inter-column dependencies (308).
[00100] Privacy-first data replication (404) that adheres to compliance regulations.
[00101] Scalable and efficient synthetic data generation for AI/ML training, testing, and analytics.
[00102] Seamless integration with enterprise data pipelines and cloud ecosystems.
[00103] In some embodiments, organizations may eliminate data scarcity, accelerate AI innovation, reduce compliance risks, and drive secure, cost-effective data-driven decision-making by using the proposed system (102).
[00104] In an embodiment, the features and functionalities of the system (102) are illustrated further in detail. The system (102) may be configured to generate synthetic data for given sample data input through the user interface (302). The system (102) may learn statistical patterns and correlations from the sample data. Further, the system (102) may generate new synthetic records that preserve real-world distributions.
[00105] In an embodiment, the system (102) may be configured to perform data replication (404) by maintaining the distribution pattern of the inputted data. The system (102) may be configured to replicate data structures, column dependencies, and relationships of the input data. The system (102) generates scalable synthetic datasets in such a way that the generated synthetic data match the statistical properties of the original data.
[00106] In an embodiment, the system (102) may be configured to generate synthetic data without inputting sample data. In an embodiment, the sample data may be DDL Scripts. The system (102) may use DDL scripts and metadata to create datasets without requiring real data. The system (102) ensures the synthetic data adheres to column constraints and business rules.
[00107] In an embodiment, the system (102) may be configured to remove Personally Identifiable Information (PII) while preserving the analytical value of the generated synthetic data. The system (102) supports differential privacy and anonymization techniques. The system (102) supports privacy-preserving data generation.
[00108] In an embodiment, the system (102) may be configured to perform bias reduction and balanced data generation (402). The system (102) may identify and remove dataset imbalances, ensuring unbiased AI/ML training data. The system (102) may support controlled diversity injection to enhance Gen AI model (322) fairness.
[00109] In an embodiment, the system (102) may provide scalability and high-performance synthetic data generation (402). The system (102) generates millions of records in minutes, supporting batch and real-time synthetic data creation. In an embodiment, the system (102) may work on cloud (AWS, GCP, Azure) or on-premise infrastructure. The system (102) may have a seamless integration with data pipelines and AI/ML frameworks. The system (102) may provide Application Programming Interfaces (APIs) and Software Development Kits (SDKs) for integration with BigQuery, Snowflake, Databricks, and AI/ML models. In an embodiment, the system (102) may support structured (CSV, Parquet, SQL) and semi-structured (JSON, XML) data formats.
[00110] In an embodiment, the system (102) may provide a provision for the user to write their own custom parameters, or natural language instructions, that may be used by the LLM (320) to generate the data.
[00111] In some embodiments, the system (102) reduces costs and time for data collection and labelling.
[00112] In some embodiments, the system (102) generates high-quality, diverse synthetic datasets that mirror real-world data, ensuring enterprises continuous and scalable access to data.
[00113] In some embodiments, the system (102) ensures data privacy and regulatory compliance.
[00114] In some embodiments, the system (102) creates synthetic data that retains statistical fidelity while eliminating personally identifiable information (PII), ensuring compliance without risk.
[00115] In some embodiments, the system (102) generates unbiased, representative datasets to enhance AI model fairness and accuracy.
[00116] In some embodiments, the system (102) automates data generation (402), cutting costs by up to 50% and accelerating AI model development.
[00117] In some embodiments, the system (102) provides on-demand, scalable synthetic datasets, improving AI model generalization and robustness.
[00118] In some embodiments, the system (102) enables safe collaboration by generating synthetic data that mimics real datasets without exposing confidential information.
[00119] In some embodiments, the system (102) removes data access barriers, ensures compliance, mitigates bias, reduces costs, accelerates AI development, and enables secure collaboration, empowering organizations to innovate faster, scale AI models, and drive data-driven transformation with confidence.
[00120] In an embodiment, the system (102) addresses key limitations of traditional synthetic data generators, i.e., data privacy, scalability, AI bias correction, real-time data generation (402), and metadata-driven synthesis.
[00121] In some embodiments, the system (102) may generate data without real-world samples (DDL-based synthesis).
[00122] Traditional tools require sample data to generate synthetic versions. However, the system (102) may generate structured datasets from metadata (DDL scripts). The system (102) is especially useful for enterprises that cannot share real data due to privacy restrictions.
[00123] In some embodiments, the system (102) enables AI/ML model training without requiring access to real-world sensitive data.
[00124] In some embodiments, the system (102) may maintain column constraints and inter-column relationships (even with a sample of 40 rows of input records).
[00125] Traditional synthetic data generators treat each column independently, leading to unrealistic datasets (e.g., salary columns not aligning with job titles). The system (102) enforces column relationships, dependencies, and business logic, ensuring data validity and usability. The system (102) also provides a natural language interface to add business logic.
[00126] In some embodiments, the system (102) produces high-quality, structured synthetic data that is realistic and analytics-ready.
[00127] Traditional synthetic data generators slow down with large datasets. However, the system (102) is optimized for enterprise-scale AI/ML applications, generating millions of high-fidelity synthetic records in minutes.
[00128] In some embodiments, the system (102) provides faster model training, lower compute costs, and enterprise-ready scalability.
[00129] In some embodiments, the system (102) provides multi-modal synthetic data generation (402) (tabular, scripts, free-flow text with table structure, natural-language-based business logic, time-series).
[00130] In an embodiment, organizations often struggle with limited access to high-quality, diverse, and labelled data due to privacy regulations, data-sharing restrictions, and insufficient real-world samples. The system (102) is designed to generate synthetic data that preserves statistical and structural properties while ensuring compliance with data privacy regulations such as GDPR and HIPAA. The system (102) addresses data scarcity and privacy challenges.
[00131] In an embodiment, AI and machine learning models require vast amounts of high-quality data for training. However, acquiring and annotating real-world data is time-consuming and expensive. The system (102) generates diverse, bias-free, and high-quality synthetic datasets, enabling faster and more effective AI model training, validation, and testing.
[00132] In an embodiment, businesses in sectors like banking, healthcare, telecom, and retail require large datasets for analytics, risk modelling, and customer insights. The system (102) is configured to simulate real-world scenarios, allowing organizations to develop and test new solutions without risking sensitive data exposure.
[00133] In an embodiment, AI models usually suffer from biases due to imbalanced datasets. The system (102) generates underrepresented or missing data categories, ensuring balanced datasets for unbiased model predictions and improved accuracy. The system (102) enables data augmentation for better AI model performance.
[00134] In an embodiment, enterprises and research institutions require secure ways to share data across teams and partners. The system (102) creates privacy-preserving synthetic datasets that enable cross-functional collaboration without compromising sensitive information.
[00135] In an embodiment, the system (102) provides data access, accelerates AI innovation, and ensures privacy-first data generation (402) using a Generative AI model (322). The system (102) empowers organizations with high-fidelity, diverse, and scalable synthetic data, driving better AI models, regulatory compliance, and cost-efficient data operations.
[00136] In an embodiment, the system (102) is a Gen AI-powered synthetic data generation system designed to generate millions of records while maintaining column constraints, inter-column dependencies (308), and data distribution patterns. The system (102) may create synthetic data with or without sample data and replicate existing datasets for AI/ML training, data analytics, testing, and compliance adherence.
[00137] In an embodiment, the Generative AI-powered synthetic data generation system (102) may revolutionize multiple industries and use cases. The proposed system (102) is used in various products, services, and applications. Some of the examples are:
[00138] AI Model Training and Testing- The system (102) may generate balanced, privacy-safe datasets to improve ML model accuracy.
[00139] Bias-Free AI Development- The system (102) creates synthetic data that corrects biases and enhances model fairness.
[00140] Fraud Detection Models– The system (102) generates synthetic transaction data to improve fraud detection AI.
[00141] Customer 360° Analytics– The system (102) may simulate banking customer behaviour without exposing real PII.
[00142] Regulatory Compliance (GDPR, CCPA, Basel III)– The system (102) may generate synthetic financial data for audits and stress testing.
[00143] Medical AI Training– The system (102) may generate privacy-preserving synthetic patient records for AI-powered diagnostics.
[00144] Synthetic Call Records and Network Logs– The system (102) may improve fraud detection and security monitoring.
[00145] Anomaly Detection in Networks– The system (102) may train AI on synthetic traffic patterns.
[00146] Autonomous Vehicle Training– The system (102) may create synthetic driving scenarios for self-driving AI.
[00147] Predictive Maintenance– The system (102) may simulate equipment failures for AI-based vehicle maintenance.
[00148] Synthetic IoT Sensor Data– The system (102) may train AI models for predictive maintenance and defect detection.
[00149] Supply Chain Optimization– The system (102) may generate synthetic inventory, logistics, and demand forecasting data.
[00150] Customer Segmentation and Personalization– The system (102) may generate synthetic shopper behaviour data.
[00151] Dynamic Pricing Algorithms– The system (102) may simulate market demand for AI-powered pricing optimization.
[00152] Technical effects achieved by the system (102):
[00153] In an embodiment, the system (102) increases server testing efficiency.
[00154] In an embodiment, the vast amount of synthetic data generated by the system (102) is used to train AI and Gen AI models (322) in different organizations as per their individual use.
[00155] In an embodiment, the synthetic data generated by the system (102) is used for data sharing, which provides the benefit of data monetization.
[00156] Figure 5 illustrates a flow chart of a method (500) for synthetic data generation, in accordance with an embodiment of the present subject matter. The order in which the method (500) is described is not intended to be construed as a limitation, and any number of the described method blocks may be combined in any order to implement the method (500) or alternate methods. Additionally, individual blocks may be deleted from the method (500) without departing from the spirit and scope of the subject matter described herein. Furthermore, the method (500) may be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method (500) may be considered to be implemented as described in the system (102) for the synthetic data generation.
[00157] At block 502, the system is receiving structured or unstructured input data through a user interface (302).
[00158] At block 504, the system is storing the input data in a database (208) communicatively coupled to a processing unit.
[00159] At block 506, the system is pre-processing the stored input data to extract schema information, relationship constraints, and statistical parameters.
[00160] At block 508, the system is detecting personally identifiable information (PII) within the input data.
[00161] At block 510, the system is applying at least one anonymization technique to ensure privacy-preserving synthetic data generation.
[00162] At block 512, the system is analyzing privacy-preserving synthetic data using a Large Language Model (LLM) (320) to identify patterns, inter-column dependencies, and distribution characteristics.
[00163] At block 514, the system is generating executable scripts using the LLM integration module, wherein the Python-based code encodes logic for synthetic data generation.
[00164] At block 516, the system is executing the generated Python scripts using a code execution engine, leveraging at least one library for synthetic data creation.
[00165] At block 518, the system is validating the synthetic data via AI/ML models within the library to ensure statistical integrity and adherence to business logic constraints.
[00166] At block 520, the system is generating a natural language explanation of the synthetic data creation process via the LLM integration module in a predefined format.
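The ordered blocks 502 through 520 can be sketched as a staged pipeline in which each stage reads and extends a shared state. The stage names and the toy stage bodies below are assumptions added purely for illustration of the control flow.

```python
def run_pipeline(input_rows, stages):
    """Run the method blocks in order; each stage's output feeds the next,
    mirroring how block 504's output feeds block 506, and so on."""
    state = {"data": input_rows, "log": []}
    for name, stage in stages:
        state = stage(state)
        state["log"].append(name)   # audit trail of executed blocks
    return state

stages = [
    ("receive",    lambda s: {**s}),                                    # block 502
    ("store",      lambda s: {**s, "stored": True}),                    # block 504
    ("profile",    lambda s: {**s, "schema": ["id", "amount"]}),        # block 506
    ("detect_pii", lambda s: {**s, "pii_columns": []}),                 # blocks 508-510
    ("generate",   lambda s: {**s, "synthetic": [{"id": i, "amount": 10.0 * i}
                                                 for i in range(1, 6)]}),  # 512-516
    ("validate",   lambda s: {**s, "valid": len(s["synthetic"]) == 5}),    # 518
    ("explain",    lambda s: {**s, "explanation": "amount scales with id"}),  # 520
]

result = run_pipeline([{"id": 1, "amount": 10.0}], stages)
```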
[00167] Equivalents
[00168] With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for the sake of clarity.
[00169] It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as "open" terms (e.g., the term "including" should be interpreted as "including but not limited to," the term "having" should be interpreted as "having at least," the term "includes" should be interpreted as "includes but is not limited to," etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present.
[00170] Although implementations for the system and method for the synthetic data generation have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features described. Rather, the specific features are disclosed as examples of implementation for the system and method for the synthetic data generation.
Claims:
1. A system for generating synthetic data, comprising:
a processing unit;
a storage device (206) coupled to the processing unit, wherein the processing unit causes the system to perform operations comprising:
receiving structured or unstructured input data through a user interface (302) coupled with the processing unit;
storing the input data in a database (208) communicatively coupled to the processing unit;
pre-processing the stored input data to perform data profiling, to extract schema information, relationship constraints, and statistical parameters;
detecting personally identifiable information (PII) in the input data, and applying at least one of anonymization to generate privacy-preserving synthetic data;
analyzing, via a Large Language Model (LLM) integration module, the privacy-preserving synthetic data and the extracted schema information, relationship constraints, and statistical parameters;
generating, via the LLM integration module, executable scripts based on the analyzed data, wherein the executable scripts comprise Python code encoding logic for synthetic data generation;
executing the executable scripts using a code execution engine, wherein the code execution engine leverages at least one library to generate synthetic data;
validating the synthetic data via AI/ML models used within the library; and
generating, via the LLM integration module, the synthetic data, and natural language explanation in a predefined format.
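The PII-detection and anonymization step recited in claim 1 can be illustrated with a minimal, purely illustrative sketch (the regex patterns and the `anonymize` helper below are hypothetical examples, not part of the claimed system, which may use any detection technique):

```python
import re

# Hypothetical PII patterns; a production system would use a far richer
# detection model than two regular expressions.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def anonymize(record: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        record = pattern.sub(f"<{label.upper()}>", record)
    return record
```

For example, `anonymize("Contact jane@example.com or 555-123-4567")` yields `"Contact <EMAIL> or <PHONE>"`, so downstream generation never sees the raw identifiers.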
2. The system as claimed in claim 1, wherein the input data comprises at least one of:
o a natural language input defining a table structure, column specifications, data ranges, industry-specific requirements, and logic without providing sample data;
o a Structured Query Language (SQL) query defining a table schema for one or more tables, with an optional business logic;
o a Data Definition Language (DDL) script specifying table schemas, column names, data types, primary key and foreign key relationships, and join conditions, with the optional business logic;
o a sample dataset comprising rows of records, configured to generate a large volume of synthetic data maintaining logic and inter- and intra-column constraints, or free-flow text defining data generation requirements; and
o a dataset configured for replication, to ensure that the generated synthetic data maintains the statistical and distribution patterns.

3. The system as claimed in claim 1, wherein the pre-processing by the processing unit (202) comprises identifying statistical patterns, inter-column dependencies (308), and data distribution characteristics of the input data.
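The data-profiling recited in claim 3 can be sketched as a small column profiler; this is an illustrative example only (the function name and the particular statistics chosen are assumptions, not claim limitations):

```python
from collections import Counter
from statistics import mean, stdev

def profile_columns(rows: list[dict]) -> dict:
    """Derive simple per-column statistics from tabular input data:
    distinct-value counts for every column, plus mean/stdev for
    numeric columns and the modal value for categorical ones."""
    profile = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        entry = {"distinct": len(set(values))}
        if all(isinstance(v, (int, float)) for v in values):
            entry["mean"] = mean(values)
            entry["stdev"] = stdev(values) if len(values) > 1 else 0.0
        else:
            entry["top"] = Counter(values).most_common(1)[0][0]
        profile[col] = entry
    return profile
```

Such a profile is one possible input to the code-generation step: it captures distributions and per-column characteristics that the generated script should reproduce.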

4. The system as claimed in claim 1, wherein the executable scripts encode logic for synthetic data generation that preserves the statistical patterns, inter-column dependencies (308), primary key and foreign key relationships, and privacy constraints of the input data, and
wherein the executable scripts are configured to generate a user-specified number of synthetic records, including millions of records, when requested.

5. The system as claimed in claim 1, wherein the executable scripts generated for synthetic data generation are executed within a secure, isolated and controlled virtual space configured to prevent the executed code from affecting the main system, or input data.
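The secure, isolated execution recited in claim 5 admits many implementations; a minimal sketch using a separate interpreter process, an isolated working directory, and a hard timeout is shown below (purely illustrative — a production sandbox would additionally restrict network and system access):

```python
import subprocess
import sys
import tempfile

def run_isolated(script: str, timeout: int = 30) -> str:
    """Execute a generated script in a separate interpreter process,
    inside a throwaway working directory and under a hard timeout,
    so the executed code cannot block the host or touch its files."""
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            [sys.executable, "-c", script],
            cwd=workdir,          # confine file-system side effects
            capture_output=True,
            text=True,
            timeout=timeout,      # prevent runaway generation code
        )
        if result.returncode != 0:
            raise RuntimeError(result.stderr)
        return result.stdout
```

Running the untrusted script out-of-process means a crash or hang in the generated code surfaces as an exception or timeout in the host, rather than corrupting the main system or the input data.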
6. The system as claimed in claim 1, wherein the Large Language Model (LLM) integration module gathers insights from the input data to construct executable code that, when combined with the library, produces synthetic data.

7. The system as claimed in claim 1, wherein the library, which executes the script, leverages AI models to preserve the structure, relationships, constraints, and statistical parameters of the input data.

8. The system as claimed in claim 1, wherein the system is configured to extrapolate data within the database (208) or replicate the database data while preserving the distribution patterns and statistical properties of the input data.

9. The system as claimed in claim 1, wherein the system is configured to output the synthetic data and the natural language explanation in a format selected from the group consisting of CSV, JSON, Parquet, SQL, and NoSQL.

10. The system as claimed in claim 1, wherein the natural language explanation generated by the LLM integration module details the business logic, constraints, and considerations used in generating the synthetic data, and is presented in user comprehensible language.

11. The system as claimed in claim 1, wherein the system is configured to support automated synthetic data generation, including scenarios with zero input data, by leveraging only DDL/SQL scripts or free-flowing natural language text.

12. The system as claimed in claim 1, wherein the system is configured to validate the synthetic data by performing statistical similarity tests to ensure the synthetic data matches the distribution, dependencies, and relationships of the input data or user-defined schema.
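The statistical similarity testing recited in claim 12 can be illustrated with a simple per-column check on the first two moments; the tolerance threshold and function names below are illustrative assumptions (a real validator might instead use distribution-level tests such as Kolmogorov–Smirnov):

```python
from statistics import mean, stdev

def statistically_similar(real: list[float], synth: list[float],
                          tolerance: float = 0.1) -> bool:
    """Accept the synthetic column only if its mean and standard
    deviation stay within a relative tolerance of the input column."""
    def close(a: float, b: float) -> bool:
        scale = max(abs(a), abs(b), 1e-9)  # guard against division by zero
        return abs(a - b) / scale <= tolerance
    return close(mean(real), mean(synth)) and close(stdev(real), stdev(synth))
```

A validation layer would apply such a check per column (and analogous checks for dependencies and key relationships) before releasing the synthetic dataset.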
13. The system as claimed in claim 1, wherein the system provides the user interface (302) that enables users to select columns for masking, faking, or synthetic generation, and to specify logic or constraints for the data generation process.

14. The system as claimed in claim 1, wherein the system comprises:
a data ingestion layer, configured to enable user interaction through a user interface (302), facilitating data submission via file uploads, database connections, user-defined scripts, or natural language instructions;
a pre-processing layer (302), configured to process the input data and perform data profiling to extract relationships, constraints, and statistical properties of the input data;
an LLM integration module, configured to analyze the pre-processed data and to generate specialized executable scripts tailored for synthetic data generation;
a secure execution layer, configured to validate and execute the scripts within predefined libraries;
a post-processing layer (314), configured to perform statistical validation of the generated synthetic data against the input data; and
a storage and output layer, configured to store the validated synthetic data and provide access to the validated data in predefined formats.

15. A method for generating synthetic data, comprising:
receiving structured or unstructured input data through a user interface (302);
storing the input data in a database (208) communicatively coupled to a processing unit;
pre-processing the stored input data to extract schema information, relationship constraints, and statistical parameters;
detecting personally identifiable information (PII) within the input data;
applying at least one anonymization technique to ensure privacy-preserving synthetic data generation;
analyzing privacy-preserving synthetic data using a Large Language Model (LLM) (320) to identify patterns, inter-column dependencies (308), and distribution characteristics;
generating executable scripts using the LLM integration module, wherein the Python-based code encodes logic for synthetic data generation;
executing the generated Python scripts using a code execution engine, leveraging at least one library for synthetic data creation;
validating the synthetic data via AI/ML models within the library to ensure statistical integrity and adherence to business logic constraints; and
generating a natural language explanation of the synthetic data creation process via the LLM integration module in a predefined format.
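The ordering of the method steps in claim 15 can be sketched as a single pipeline; every name in this sketch is a hypothetical placeholder (the LLM, execution engine, and validator are injected as callables), so it illustrates only the claimed sequencing, not any concrete implementation:

```python
def generate_synthetic_data(input_rows, llm_generate_script, execute, validate):
    """Hypothetical orchestration of the claimed method steps.
    `llm_generate_script`, `execute`, and `validate` stand in for the
    LLM integration module, the code execution engine, and the AI/ML
    validation step, respectively."""
    profile = {"row_count": len(input_rows)}      # pre-processing / profiling
    script = llm_generate_script(profile)         # LLM emits generation code
    synthetic = execute(script)                   # sandboxed execution
    if not validate(input_rows, synthetic):       # statistical validation
        raise ValueError("synthetic data failed validation")
    return synthetic
```

Structuring the pipeline around injected callables keeps each recited step independently replaceable, e.g. swapping the validation strategy without touching the generation path.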

Documents

Application Documents

# Name Date
1 202521052095-STATEMENT OF UNDERTAKING (FORM 3) [29-05-2025(online)].pdf 2025-05-29
2 202521052095-REQUEST FOR EXAMINATION (FORM-18) [29-05-2025(online)].pdf 2025-05-29
3 202521052095-REQUEST FOR EARLY PUBLICATION(FORM-9) [29-05-2025(online)].pdf 2025-05-29
4 202521052095-POWER OF AUTHORITY [29-05-2025(online)].pdf 2025-05-29
5 202521052095-FORM-9 [29-05-2025(online)].pdf 2025-05-29
6 202521052095-FORM 18 [29-05-2025(online)].pdf 2025-05-29
7 202521052095-FORM 1 [29-05-2025(online)].pdf 2025-05-29
8 202521052095-FIGURE OF ABSTRACT [29-05-2025(online)].pdf 2025-05-29
9 202521052095-DRAWINGS [29-05-2025(online)].pdf 2025-05-29
10 202521052095-DECLARATION OF INVENTORSHIP (FORM 5) [29-05-2025(online)].pdf 2025-05-29
11 202521052095-COMPLETE SPECIFICATION [29-05-2025(online)].pdf 2025-05-29
12 Abstract.jpg 2025-06-16