Abstract: CLUSTERING SYSTEM FOR REAL-WORLD DATASETS ABSTRACT A clustering system (100) for real-world datasets is disclosed. The clustering system (100) comprises a client device (102) comprising a first processor (104), a second processor (106) located on an application server (108), and a storage medium (114) comprising programming instructions executable by the second processor (106). The clustering system (100) is configured to receive, from the client device (102), a dataset with high dimensionality, heterogeneous distribution of data points, an absence of labelling, or a combination thereof; group the received dataset into clusters based on variations in data density of the dataset; apply grouping architectures to tag the clusters with a generalized feature; generate graphs of the tagged clusters using graph neural networks with hierarchical strategies; integrate self-supervised learning objectives in the generated graphs of the tagged clusters to improve clustering accuracy in unlabeled datasets; and evaluate a clustering quality and generate evaluation metrics of the dataset. Claims: 10, Figures: 2
Description:BACKGROUND
Field of Invention
[001] Embodiments of the present invention generally relate to a clustering system and particularly to a clustering system for real-world datasets.
Description of Related Art
[002] Real-world datasets often contain very high dimensionality, uneven distribution of data points, and a lack of sufficient labeling. These conditions create difficulty in identifying meaningful patterns and in forming reliable clusters. Traditional clustering methods cannot consistently handle variations in density, poor feature quality, and the absence of reliable labels, which results in unsatisfactory outcomes across practical applications.
[003] Current approaches include density-based clustering methods such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points to Identify the Clustering Structure (OPTICS), and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). Deep clustering models such as Deep Embedded Clustering (DEC) and Deep Convolutional Embedded Clustering (DCEC) attempt to improve feature representation. Graph-based clustering methods such as the Louvain method, the Infomap method, and spectral clustering are applied. Commercial implementations exist in software platforms such as Scikit-learn and H2O.ai, while deep learning frameworks such as TensorFlow and PyTorch provide support for deep clustering. Self-supervised methods and graph neural networks have been explored to address clustering challenges.
[004] However, density-based methods prove highly sensitive to parameter settings and often fail on clusters of varying density. Deep clustering suffers from poor interpretability and weak generalization to unseen datasets. Graph-based clustering encounters scalability limits in large or overlapping networks. Self-supervised learning lacks mature integration with clustering objectives and remains unstable in performance. Additionally, the evaluation of clustering quality across diverse datasets remains unreliable.
[005] There is thus a need for an improved and advanced clustering system for real-world datasets that can address the aforementioned limitations in a more efficient manner.
SUMMARY
[006] Embodiments in accordance with the present invention provide a clustering system for real-world datasets. The system comprises a client device comprising a first processor. The system further comprises a second processor located on an application server. The system further comprises a communication network adapted to establish a communicative link connecting the client device to the application server. The system further comprises a storage medium comprising programming instructions executable by the second processor. The second processor is configured to receive, from the client device, a dataset with high dimensionality, heterogeneous distribution of data points, an absence of labelling, or a combination thereof; group the received dataset into clusters based on variations in data density of the dataset; apply grouping architectures to tag the clusters with a generalized feature; generate graphs of the tagged clusters using graph neural networks with hierarchical strategies; integrate self-supervised learning objectives in the generated graphs of the tagged clusters to improve clustering accuracy in unlabeled datasets; and evaluate a clustering quality and generate evaluation metrics of the dataset.
[007] Embodiments in accordance with the present invention further provide a method for grouping real-world datasets using a clustering system. The method comprises the steps of receiving, from a client device, a dataset with high dimensionality, heterogeneous distribution of data points, an absence of labelling, or a combination thereof; grouping the received dataset into clusters based on variations in data density of the dataset; applying grouping architectures to tag the clusters with a generalized feature; generating graphs of the tagged clusters using graph neural networks with hierarchical strategies; integrating self-supervised learning objectives in the generated graphs of the tagged clusters to improve clustering accuracy in unlabeled datasets; and evaluating a clustering quality and generating evaluation metrics of the dataset.
[008] Embodiments of the present invention may provide a number of advantages depending on their particular configuration. First, embodiments of the present application may provide a clustering system for real-world datasets.
[009] Next, embodiments of the present application may provide a clustering system that introduces dynamic adjustment of parameters in density-based clustering methods.
[0010] Next, embodiments of the present application may provide a clustering system that allows effective management of clusters with non-uniform densities without extensive manual tuning.
[0011] Next, embodiments of the present application may provide a clustering system that enhances feature quality while also improving interpretability and generalization to new datasets.
[0012] Next, embodiments of the present application may provide a clustering system that ensures scalability and more accurate detection of communities in very large or overlapping networks.
[0013] Next, embodiments of the present application may provide a clustering system that establishes evaluation methods that are robust across different datasets.
[0014] Next, embodiments of the present application may provide a clustering system that ensures consistent and trustworthy assessment of clustering performance.
[0015] These and other advantages will be apparent from the present application of the embodiments described herein.
[0016] The preceding is a simplified summary to provide an understanding of some embodiments of the present invention. This summary is neither an extensive nor exhaustive overview of the present invention and its various embodiments. The summary presents selected concepts of the embodiments of the present invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the present invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The above and still further features and advantages of embodiments of the present invention will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings, and wherein:
[0018] FIG. 1 illustrates a block diagram of a clustering system for real-world datasets, according to an embodiment of the present invention; and
[0019] FIG. 2 depicts a flowchart of a method for grouping real-world datasets using a clustering system, according to an embodiment of the present invention.
[0020] The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words "include", "including", and "includes" mean including but not limited to. To facilitate understanding, like reference numerals have been used, as far as possible, to designate like elements common to the figures. Optional portions of the figures may be illustrated using dashed or dotted lines, unless the context of usage indicates otherwise.
DETAILED DESCRIPTION
[0021] The following description includes the preferred best mode of one embodiment of the present invention. It will be clear from this description of the invention that the invention is not limited to these illustrated embodiments but that the invention also includes a variety of modifications and embodiments thereto. Therefore, the present description should be seen as illustrative and not limiting. While the invention is susceptible to various modifications and alternative constructions, it should be understood, that there is no intention to limit the invention to the specific form disclosed, but, on the contrary, the invention is to cover all modifications, alternative constructions, and equivalents falling within the scope of the invention as defined in the claims.
[0022] In any embodiment described herein, the open-ended terms "comprising", "comprises", and the like (which are synonymous with "including", "having", and "characterized by") may be replaced by the respective partially closed phrases "consisting essentially of", "consists essentially of", and the like, or the respective closed phrases "consisting of", "consists of", and the like.
[0023] As used herein, the singular forms “a”, “an”, and “the” designate both the singular and the plural, unless expressly stated to designate the singular only.
[0024] FIG. 1 illustrates a block diagram of a clustering system 100 for real-world datasets, according to an embodiment of the present invention. The clustering system 100 may be adapted to process real-world datasets that may include high dimensionality, heterogeneous distribution of data points, and absence of labels. The clustering system 100 may be adapted to operate by receiving such datasets, applying dynamic clustering techniques to manage non-uniform densities, enhancing feature representation through deep models, and performing graph-based operations for scalability. The clustering system 100 may further be adapted to integrate self-supervised learning objectives into clustering tasks so that reliable grouping may be achieved without labeled data. The outputs generated by the clustering system 100 may include cluster assignments and evaluation metrics that may indicate stability, accuracy, and generalizability across different datasets.
[0025] In an embodiment of the present invention, the clustering system 100 may be configured as a unified framework that may combine dynamic density-based clustering, grouping architectures, graph neural network clustering, and self-supervised learning. The integration of these multiple approaches within a single system may allow handling of high dimensionality, non-uniform densities, scalability challenges, and unlabeled data simultaneously.
[0026] The clustering system 100 may dynamically adjust parameters locally to handle non-uniform densities. The clustering system 100 may combine autoencoders with unsupervised embeddings to enhance interpretability and adaptability. The clustering system 100 may employ hierarchical graph neural networks to process overlapping and large-scale networks. The clustering system 100 may align self-supervised learning directly with clustering objectives to improve accuracy on unlabeled data.
[0027] According to the embodiments of the present invention, the clustering system 100 may incorporate non-limiting hardware components to enhance processing speed and efficiency. The clustering system 100 may comprise a client device 102, a first processor 104, a second processor 106, an application server 108, a secure cloud database 110, a communication network 112, and a storage medium 114. In an embodiment of the present invention, the hardware components of the clustering system 100 may be integrated with computer-executable instructions for overcoming the challenges and the limitations of the existing systems.
[0028] In an embodiment of the present invention, the client device 102 may be an electronic device adapted to upload a dataset to the clustering system 100. The uploaded dataset may be of high dimensionality. The uploaded dataset may comprise a heterogeneous distribution of data points. The uploaded dataset may be in an absence of labelling. The client device 102 may be, but not limited to, a personal computer, a consumer device, and the like. Embodiments of the present invention are intended to include or otherwise cover any type of the client device 102, including known, related art, and/or later developed technologies. In an embodiment of the present invention, the personal computer may be, but not limited to, a desktop, a server, a laptop, and the like. Embodiments of the present invention are intended to include or otherwise cover any type of the personal computer, including known, related art, and/or later developed technologies.
[0029] Further, in an embodiment of the present invention, the consumer device may be, but not limited to, a tablet, a mobile phone, a notebook, a netbook, a smartphone, a wearable device, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the consumer device including known, related art, and/or later developed technologies.
[0030] In an embodiment of the present invention, the client device 102 may comprise and operatively communicate with the first processor 104. The operative communication may include, but is not limited to, receiving, transmitting, processing, synchronizing, querying, updating, encrypting, decrypting, storing, retrieving, validating, logging, monitoring, alerting, authenticating, authorizing, compressing, decompressing, streaming, and rendering data or commands between the system 100 and the client device 102.
[0031] In an embodiment of the present invention, the second processor 106 may be located on the application server 108. The second processor 106 may be configured to integrate multiple clustering techniques, including density-based clustering with dynamic parameter adjustment, grouping with autoencoders and embeddings, graph neural network clustering with hierarchical strategies, and clustering with self-supervised learning objectives.
[0032] The second processor 106 may be configured to receive the dataset with high dimensionality, the heterogeneous distribution of the data points, the absence of labelling, and so forth from the client device 102. The second processor 106 may be configured to handle datasets transmitted over the communication network 112 and may prepare the same for subsequent clustering operations. The second processor 106 may be configured to process datasets received from the client device 102 in multiple formats, including, but not limited to, comma-separated values, tabular records, graph structures, text corpora, or image datasets. The second processor 106 may further be configured to recognize structured, semi-structured, and unstructured data so that broad applicability across domains may be ensured.
[0033] In an embodiment of the present invention, the second processor 106 may be configured to validate the dataset received from the client device 102. The validation may include schema verification, dimensionality checks, and detection of anomalies such as missing entries, duplicate values, or corrupted records. The validation performed by the second processor 106 may ensure that only reliable data may progress to subsequent stages. The second processor 106 may be configured to secure the dataset received from the client device 102 by applying encryption protocols, including AES-256 or TLS-based communication. The second processor 106 may further be configured to anonymize sensitive data fields prior to processing so that data privacy may be preserved while maintaining usability of the dataset for clustering tasks.
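The validation pass described above (schema and dimensionality checks, detection of missing entries and duplicates) may be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation; the function name, report fields, and return structure are illustrative.

```python
import numpy as np

def validate_dataset(X, expected_dim=None):
    """Illustrative validation mirroring the checks described above:
    dimensionality verification, missing-entry detection, and
    duplicate-row detection.  Names and report layout are assumptions."""
    X = np.asarray(X, dtype=float)
    report = {
        # Dimensionality check against an expected schema width.
        "dim_ok": expected_dim is None or X.shape[1] == expected_dim,
        # Count of missing entries (encoded as NaN).
        "missing": int(np.isnan(X).sum()),
        # Count of exact duplicate rows.
        "duplicates": int(len(X) - len(np.unique(X, axis=0))),
    }
    report["valid"] = report["dim_ok"] and report["missing"] == 0
    return report
```

In practice the encryption and anonymization steps mentioned in the paragraph would wrap around such a check; they are omitted here.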
[0034] The second processor 106 may be configured to group the received dataset into clusters based on variations in data density of the dataset. The variations in data density of the dataset may comprise automatic adaptation of local thresholds for distance and minimum cluster size based on dataset characteristics. The second processor 106 may be configured to automatically tune parameters such as neighborhood radius and minimum number of points per cluster. By dynamically adjusting these parameters, the second processor 106 may reduce sensitivity to fixed threshold values and may ensure accurate identification of clusters across regions with differing densities. The second processor 106 may be configured to apply iterative refinement techniques for parameter selection. In this process, the second processor 106 may initially generate clustering outputs using baseline parameter values, then assess the quality of the clusters using stability indices, and subsequently refine the parameter values until optimal cluster separation may be achieved.
[0035] In an embodiment of the present invention, the second processor 106 may be configured to apply local adaptive thresholds, such that dense regions may be assigned smaller distance thresholds and sparse regions may be assigned larger thresholds. Such dynamic adjustment may enable the second processor 106 to detect both tightly packed and loosely distributed clusters in a single dataset without manual intervention.
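The local adaptive thresholding described above can be sketched as follows. This is a simplified illustration rather than the claimed method: each point's distance threshold is derived from its k-th nearest-neighbour distance, so dense regions receive tight thresholds and sparse regions looser ones; the function name and the `k` and `scale` parameters are illustrative assumptions, and noise handling is omitted.

```python
import numpy as np

def adaptive_density_cluster(X, k=5, scale=2.0):
    """Density clustering with a per-point distance threshold instead
    of one global eps, as a sketch of the adaptive scheme above."""
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local threshold: scaled distance to the k-th nearest neighbour
    # (column 0 of the sorted row is the zero self-distance).
    eps = scale * np.sort(d, axis=1)[:, k]
    # Two points are linked only if each lies within the other's threshold.
    linked = (d <= eps[:, None]) & (d <= eps[None, :])
    # Connected components of the link graph become clusters.
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = cluster
        while stack:
            j = stack.pop()
            for m in np.where(linked[j] & (labels == -1))[0]:
                labels[m] = cluster
                stack.append(m)
        cluster += 1
    return labels
```

The mutual-link condition keeps a tight cluster from bleeding into a neighbouring sparse region through one loosely thresholded point.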
[0036] The second processor 106 may be configured to apply grouping architectures to tag the clusters with a generalized feature. The grouping architectures may comprise a combination of autoencoders, unsupervised embedding models, and so forth to enhance interpretability and accuracy of the tag applied to the clusters. The grouping architectures may execute an autoencoder that may be programmed to compress high-dimensional input into a latent representation, and unsupervised embeddings may refine this latent representation into a clustering-friendly space. The combination may improve interpretability by allowing visualization of latent features, while also ensuring generalization to unseen datasets.
[0037] In an embodiment of the present invention, the second processor 106 may be configured to jointly optimize the reconstruction loss of the autoencoder and the clustering-oriented loss of the unsupervised embedding model. The joint optimization may ensure that the latent space may preserve input fidelity while also enhancing discriminative cluster separation. In an embodiment of the present invention, the second processor 106 may be configured to align the latent representation produced by the autoencoder with embedding objectives such as contrastive similarity, neighbourhood preservation, or graph-based embedding constraints. This alignment may improve interpretability by allowing the latent space to maintain meaningful relationships among data points.
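The joint optimization of reconstruction loss and clustering-oriented loss described above can be illustrated with a DEC-style objective (a well-known formulation; the disclosure does not specify one). The sketch computes a reconstruction term plus a KL divergence between Student-t soft assignments and a sharpened target distribution; the function name, the `alpha` and `lam` weights, and the choice of kernel are assumptions.

```python
import numpy as np

def joint_clustering_loss(x, x_hat, z, centroids, alpha=1.0, lam=0.1):
    """Sketch of a joint objective: autoencoder reconstruction plus a
    DEC-style clustering term on the latent codes `z`."""
    # Reconstruction fidelity term (preserves input information).
    recon = np.mean((x - x_hat) ** 2)

    # Soft assignment q_ij of latent code i to centroid j (Student-t kernel).
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1) / 2)
    q /= q.sum(1, keepdims=True)

    # Sharpened target distribution p emphasises confident assignments.
    f = q.sum(0)
    p = (q ** 2) / f
    p /= p.sum(1, keepdims=True)

    # KL(P || Q) pulls the latent space toward separable clusters.
    kl = np.sum(p * np.log(p / q))
    return recon + lam * kl
```

Minimizing the combined term keeps the latent space faithful to the input while sharpening cluster separation, which is the behaviour the paragraph describes.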
[0038] In an embodiment of the present invention, the second processor 106 may be configured to enhance generalization by applying transfer learning, such that pre-trained autoencoders or embedding models may be fine-tuned on new datasets. This approach may reduce the requirement for retraining from scratch and may accelerate adaptation to unseen data. In an embodiment of the present invention, the second processor 106 may be configured to provide visualization of the refined latent space using dimensionality reduction techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP). Such visualization may allow operators to observe separation, overlaps, and structural distribution of clusters. The refined latent space produced through autoencoders and embeddings may allow users to inspect cluster boundaries, overlaps, and data separation directly. This interpretability may support the practical usability of clustering results in real-world decision-making contexts.
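The reduced-dimensional inspection described above can be sketched as follows. The paragraph names t-SNE and UMAP; this dependency-free stand-in uses a plain PCA projection via SVD only to show the idea of projecting latent codes to two dimensions for inspection, and is not one of the named techniques.

```python
import numpy as np

def project_2d(z):
    """Project latent codes to 2-D for cluster inspection.

    A PCA projection (via SVD) is used here as a lightweight stand-in
    for the t-SNE/UMAP step described in the text."""
    zc = z - z.mean(0)                      # centre the latent codes
    _, _, vt = np.linalg.svd(zc, full_matrices=False)
    return zc @ vt[:2].T                    # top-2 principal components
```

The two output columns would typically be scatter-plotted so that operators can judge separation and overlap of the tagged clusters.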
[0039] The second processor 106 may be configured to generate graphs of the tagged clusters using graph neural networks with hierarchical strategies. The graphs generated through the graph neural networks with the hierarchical strategies may enhance scalability and community detection in large networks. The graphs of the tagged clusters may comprise hierarchical graph neural networks configured to process overlapping and large-scale networks. The graphs of the tagged clusters through the graph neural networks may represent the dataset as a graph, such that nodes may correspond to data points and edges may represent relationships or similarities among the data points.
[0040] In an embodiment of the present invention, the second processor 106 may be configured to apply graph neural networks that may aggregate neighborhood information for each node so that both local and global structural patterns may be preserved during clustering. By leveraging this aggregation, the second processor 106 may detect community structures in dense as well as sparse graph regions. In an embodiment of the present invention, the second processor 106 may be configured to employ hierarchical strategies such as recursive partitioning of the graph or multi-level pooling of nodes and edges. These hierarchical strategies may enable scalability of clustering across very large networks by reducing computational complexity while maintaining clustering accuracy.
[0041] In an embodiment of the present invention, the second processor 106 may be configured to handle overlapping communities by assigning soft cluster memberships to nodes. This approach may allow a single data point to belong to multiple communities, thereby reflecting real-world complex relationships within graph-structured datasets. In an embodiment of the present invention, the second processor 106 may be configured to evaluate the quality of the graphs of the tagged clusters by applying modularity scores, conductance values, or normalized cut measures. The use of such evaluation criteria may ensure that hierarchical graph neural networks produce clusters that are both scalable and meaningful.
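The graph construction and the modularity-based quality check described in the preceding paragraphs can be sketched as follows. This is an illustration, not the claimed implementation: the dataset is turned into a symmetric k-nearest-neighbour graph (nodes as data points, edges as similarities), and Newman modularity scores a hard partition of that graph. Function names and the `k` parameter are assumptions; the hierarchical pooling and soft memberships described above are omitted for brevity.

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetric k-nearest-neighbour adjacency over the data points."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # no self-edges
    A = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, :k]      # k nearest per node
    for i, js in enumerate(idx):
        A[i, js] = 1.0
    return np.maximum(A, A.T)               # symmetrise

def modularity(A, labels):
    """Newman modularity of a hard partition, one of the graph
    quality measures named in the text."""
    m = A.sum() / 2.0                       # number of undirected edges
    deg = A.sum(1)
    q = 0.0
    for c in set(labels):
        idx = np.where(labels == c)[0]
        e_in = A[np.ix_(idx, idx)].sum() / 2.0   # edges inside community
        d_in = deg[idx].sum()
        q += e_in / m - (d_in / (2 * m)) ** 2
    return q
```

Higher modularity indicates communities with many internal and few external edges; conductance and normalized cut, also named above, would score the same partition from the complementary boundary-size perspective.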
[0042] The second processor 106 may be configured to integrate self-supervised learning objectives in the generated graphs of the tagged clusters to improve clustering accuracy in unlabeled datasets. The self-supervised learning objectives may be optimized to align with clustering objectives for robust learning in the absence of labeled data. The self-supervised learning objectives may be directly optimized for clustering tasks. The alignment of self-supervised learning with clustering objectives may be among the first implementations in the field and may provide improved stability and accuracy for large unlabeled datasets. The self-supervised learning objectives may be designed to extract supervisory signals directly from the dataset structure without the requirement of ground-truth labels.
[0043] In an embodiment of the present invention, the second processor 106 may be configured to optimize pretext tasks such as contrastive similarity learning, masked feature prediction, or data augmentation consistency. The representations obtained from these pretext tasks may be aligned with clustering objectives so that the latent features may support more accurate cluster formation. In an embodiment of the present invention, the second processor 106 may be configured to jointly optimize clustering loss and self-supervised loss in a unified training pipeline. The combination may stabilize training in the absence of labeled data and may improve the ability of the system to generalize across heterogeneous datasets.
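The contrastive similarity pretext task named above can be illustrated with an NT-Xent-style loss (a standard contrastive formulation; the disclosure does not commit to a specific one). Rows of `z1` and `z2` are embeddings of two augmented views of the same batch, so row i of each is a positive pair and everything else is a negative; the function name and `tau` temperature are assumptions.

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """Minimal numpy sketch of a contrastive (NT-Xent) pretext loss
    on two augmented views of one batch."""
    z = np.vstack([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # cosine space
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                      # exclude self-pairs
    n = len(z1)
    # Positive index for each row: i pairs with i + n (and vice versa).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logits = sim - sim.max(1, keepdims=True)            # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

In the joint pipeline the paragraph describes, this term would be summed with a clustering loss so that the learned embeddings support cluster formation without labels.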
[0044] In an embodiment of the present invention, the second processor 106 may be configured to leverage graph-based self-supervised learning objectives, such that nodes may predict attributes of their neighborhoods. By aligning these objectives with clustering, the second processor 106 may capture both structural and feature-level information, leading to robust community detection in large-scale networks. In an embodiment of the present invention, the second processor 106 may be configured to enhance interpretability of clusters obtained through self-supervised objectives by projecting learned embeddings into a reduced-dimensional space. This visualization may allow operators to verify that the clusters formed reflect meaningful separation of unlabeled data.
[0045] The second processor 106 may be configured to evaluate a clustering quality and generate evaluation metrics of the dataset. The evaluation metrics may comprise stability indices to measure consistency of clusters, generalizability scores to measure performance on unseen datasets, and cross-dataset consistency checks. By incorporating such adaptable evaluation, the clustering system 100 may ensure trustworthy performance assessment across diverse real-world scenarios. The second processor 106 may be configured to integrate composite evaluation dashboards such that the stability indices, the generalizability scores, and the cross-dataset consistency metrics may be visualized together. Such evaluation dashboards may assist operators in interpreting overall clustering quality and may support decision-making in sensitive applications.
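One simple instance of the stability indices mentioned above is a pair-counting agreement (the Rand index) between two clustering runs of the same data; in practice the runs would come from resampled or perturbed copies of the dataset. This sketch is an illustration of the idea, not the disclosed metric, and the function name is an assumption.

```python
import numpy as np

def stability_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree about
    being grouped together or apart (Rand index), used here as a
    sketch of a cluster-stability measure."""
    la, lb = np.asarray(labels_a), np.asarray(labels_b)
    same_a = la[:, None] == la[None, :]     # co-assignment in run A
    same_b = lb[:, None] == lb[None, :]     # co-assignment in run B
    iu = np.triu_indices(len(la), k=1)      # each unordered pair once
    return float((same_a[iu] == same_b[iu]).mean())
```

Because the comparison is over pairs, the score is invariant to relabelling of clusters, which is why pair-counting measures are preferred over raw label matching for this purpose.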
[0046] The second processor 106 may be configured to operate on frameworks such as TensorFlow, PyTorch, and so forth. The second processor 106 may be configured to adjust a sensitivity parameter selection by iterative refinement of clustering outputs based on dataset feedback. The second processor 106 may be configured to improve interpretability of clustering outputs by projecting embeddings into reduced-dimensional spaces that may be visualized. Such visualization may allow detection of cluster separation, boundary clarity, and data overlap, thereby increasing the transparency and usability of the clustering system 100 in real-world applications.
[0047] In an embodiment of the present invention, the application server 108 may be hardware adapted to accommodate and install the second processor 106. The application server 108 may be, but not limited to, a motherboard, a wired board, a mainframe, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the application server 108, including known, related art, and/or later developed technologies.
[0048] In an embodiment of the present invention, the second processor 106 may be located on the application server 108. The second processor 106 may be configured to execute the computer-readable instructions to generate an output relating to the clustering system 100. The second processor 106 may be, but not limited to, a Programmable Logic Control (PLC) unit, a microprocessor, a development board, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the second processor 106 including known, related art, and/or later developed technologies.
[0049] In an embodiment of the present invention, the secure cloud database 110 may be adapted to store the dataset received from the client device 102. The secure cloud database 110 may be, for example, but not limited to, a distributed database, a personal database, an end-user database, a commercial database, a Structured Query Language (SQL) database, a NoSQL database, an operational database, a relational database, an object-oriented database, a graph database, a cloud server database, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the secure cloud database 110, including known, related art, and/or later developed technologies.
[0050] Further, the secure cloud database 110 may be a cloud server database, in an embodiment of the present invention. In an embodiment of the present invention, the cloud server may be remotely located. In an exemplary embodiment of the present invention, the cloud server may be a public cloud server. In another exemplary embodiment of the present invention, the cloud server may be a private cloud server. In yet another embodiment of the present invention, the cloud server may be a dedicated cloud server. The cloud server may be, but not limited to, a Microsoft Azure cloud server, an Amazon AWS cloud server, a Google Compute Engine (GCE) cloud server, an Amazon Elastic Compute Cloud (EC2) cloud server, and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the cloud server, including known, related art, and/or later developed technologies.
[0051] In an embodiment of the present invention, the communication network 112 may be adapted to establish a communicative link connecting the client device 102 to the application server 108. The communication network 112 may utilize one or more protocols, including LoRa for long-range low-power data exchange, Zigbee for mesh-based short-range connectivity, Bluetooth or Bluetooth Low Energy for local communication, Wi-Fi for high-bandwidth data transfer, and cellular standards such as 4G or 5G for wide-area coverage. The communication network 112 may be configured to dynamically select or switch among these protocols based on latency requirements, bandwidth availability, and energy constraints. The communication network 112 may further implement encryption schemes and secure authentication to safeguard transmitted data. In certain cases, adaptive routing or mesh topologies may be employed so that uninterrupted connectivity may be maintained between the client device 102 and the application server 108.
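The dynamic protocol selection described above can be sketched as a simple policy over the stated criteria (latency, bandwidth, energy). The thresholds and the preference order below are illustrative assumptions, not values from the disclosure.

```python
def select_protocol(latency_ms, bandwidth_mbps, battery_pct):
    """Illustrative selection policy for the communication network 112.

    All thresholds and the ordering are assumed for the sketch."""
    if battery_pct < 20:
        return "LoRa"       # lowest-power, long-range fallback
    if bandwidth_mbps >= 10 and latency_ms <= 50:
        return "Wi-Fi"      # high-bandwidth local transfer
    if latency_ms <= 100:
        return "5G"         # wide-area, low-latency link
    return "Zigbee"         # short-range mesh default
```

A deployed system would measure these inputs continuously and would also apply the encryption and authentication steps the paragraph mentions; those are omitted here.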
[0052] In an embodiment of the present invention, the storage medium 114 comprises programming instructions executable by the second processor 106. In an embodiment of the present invention, the storage medium 114 may store the computer programmable instructions in the form of programming modules. The storage medium 114 may be a non-transitory storage medium, in an embodiment of the present invention. The storage medium 114 may communicate with the second processor 106 and execute a computer-readable set of instructions present in the storage medium 114, in an embodiment of the present invention.
[0053] The storage medium 114 may be, but not limited to, a Random-Access Memory (RAM), a Static Random-access Memory (SRAM), a Dynamic Random-access Memory (DRAM), a Read Only Memory (ROM), an Erasable Programmable Read-only Memory (EPROM), an Electrically Erasable Programmable Read-only Memory (EEPROM), a NAND Flash, a Secure Digital (SD) memory, a cache memory, a Hard Disk Drive (HDD), a Solid-State Drive (SSD) and so forth. Embodiments of the present invention are intended to include or otherwise cover any type of the storage medium 114, including known, related art, and/or later developed technologies.
[0054] In one exemplary embodiment of the present invention, the clustering system 100 may be applied in the healthcare domain for patient risk stratification in chronic disease management. Healthcare datasets generally comprise high dimensionality, unevenly distributed records, and a frequent absence of reliable labeling, as electronic health records often include structured data such as demographics, laboratory results, and prescriptions, along with unstructured data such as physician notes. Conventional clustering methods are unable to consistently identify clinically meaningful patient cohorts from such heterogeneous data.
[0055] In such an embodiment of the present invention, the client device 102 may upload anonymized electronic health records of patients with diabetes and related comorbidities to the application server 108. The dataset may be validated, encrypted, and anonymized by the second processor 106 to ensure secure handling prior to clustering. The second processor 106 may then perform density-based clustering with dynamic adjustment of parameters so that tightly grouped patient profiles, such as those with frequent hospital visits, may be distinguished from more sparsely distributed patient profiles, such as those with irregular follow-ups. The grouping model combining autoencoders and unsupervised embeddings may compress the input records into a latent representation, wherein hidden factors such as obesity, hypertension, and renal impairment may be more clearly separable.
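The dynamic parameter adjustment described above can be sketched with a simple heuristic: derive the neighbourhood radius from the data's own k-nearest-neighbour distances instead of fixing it by hand, then group points by density. The function names, the k-distance quantile heuristic, and the connected-components grouping below are illustrative assumptions; they stand in for, and simplify, a full density-based clustering algorithm.

```python
import numpy as np

def adaptive_eps(points, k=4, quantile=0.5):
    """Estimate a neighbourhood radius from the k-th nearest-neighbour
    distances, so the threshold adapts to the dataset's own density."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    kth = np.sort(dist, axis=1)[:, k]   # distance to each point's k-th neighbour
    return float(np.quantile(kth, quantile))

def density_groups(points, eps):
    """Group points whose mutual distance is below eps via connected
    components: a simplified stand-in for full density-based clustering."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    adj = np.sqrt((diff ** 2).sum(-1)) <= eps
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = current
        stack = [i]
        while stack:                    # flood-fill one density-connected group
            j = stack.pop()
            for m in np.where(adj[j] & (labels < 0))[0]:
                labels[m] = current
                stack.append(m)
        current += 1
    return labels
```

On data containing one dense group (e.g. frequent hospital visits) and one sparse, distant group (irregular follow-ups), the two are assigned distinct labels.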
[0056] The patient records may further be represented in a graph structure, where each node corresponds to a patient and edges correspond to similarities in clinical history. Hierarchical graph neural networks applied to the graphs of the tagged clusters may detect overlapping communities, such as patients simultaneously at risk of cardiovascular complications and kidney failure. Self-supervised learning objectives, including masked feature prediction and contrastive similarity learning, may be aligned with clustering objectives to improve accuracy in the absence of explicit diagnostic labels. Evaluation metrics such as stability indices and cross-dataset consistency measures may then be employed to verify the quality of the clusters, while dimensionality reduction methods such as UMAP may allow clinicians to visualize patient subgroups. The clustering results may ultimately provide hospitals with risk-based cohorts of patients that may be utilized to optimize treatment strategies, allocate resources, and forecast disease progression with improved reliability compared to existing systems.
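The graph construction step above (patients as nodes, clinical similarity as edges) can be sketched as follows. The similarity threshold and the use of latent embeddings as node features are illustrative assumptions; the specification itself does not fix these values.

```python
import numpy as np

def similarity_graph(embeddings, threshold=0.8):
    """Build an undirected similarity graph: each row of `embeddings` is one
    patient's latent representation, and an edge joins two patients whose
    cosine similarity exceeds `threshold` (an assumed, illustrative value)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sim = unit @ unit.T                              # cosine similarity matrix
    adj = sim > threshold
    np.fill_diagonal(adj, False)                     # no self-loops
    return adj
```

The resulting boolean adjacency matrix is symmetric and can be handed to any downstream graph method, such as the hierarchical graph neural networks described above.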
[0057] In an exemplary embodiment of the present invention, the client device 102 may be a hospital workstation or physician’s tablet configured to interface with the electronic health record system and upload patient records to the clustering system 100. The first processor 104 may perform local pre-processing operations including formatting of laboratory values, tokenization of clinical notes, and encryption of patient identifiers. The dataset may then be transmitted to the application server 108 through the communication network 112, which may operate over a secure 5G or Wi-Fi channel with TLS encryption.
[0058] In an exemplary embodiment of the present invention, the second processor 106 located on the application server 108 may execute iterative refinement of density-based clustering parameters to adapt to population variations within the dataset. The second processor 106 may also train and apply autoencoders on historical medical data to generate compressed latent features, while unsupervised embeddings may further refine these features for clustering. The dataset may be transformed into a graph structure wherein cosine similarity of the latent features determines edge weights, and hierarchical pooling techniques may be applied within graph neural networks for scalable clustering across large patient populations. Self-supervised learning pretext tasks such as predicting missing laboratory results may be employed to enhance the embedding space without labeled outcomes, thereby aligning with the clustering task. The secure cloud database 110 may store anonymized patient embeddings, cluster assignments, and evaluation dashboards. The final outputs may include interpretable visualizations of patient groups, quantitative stability metrics, and generalizability scores, thereby enabling healthcare practitioners to obtain actionable insight into patient risk stratification.
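The pretext task of predicting missing laboratory results can be shown in miniature. As a hedged simplification, a linear least-squares model stands in below for the learned network described in the specification: one feature column is masked and predicted from the remaining columns, and a low reconstruction error indicates the masked feature is well explained by the others.

```python
import numpy as np

def masked_feature_score(X, masked_col):
    """Self-supervised pretext task in miniature: hide one feature column of
    X and predict it from the remaining columns via linear least squares
    (a stand-in for the network in the specification). Returns the mean
    squared reconstruction error; lower means the masked feature is well
    explained by the others, i.e. the representation is informative."""
    target = X[:, masked_col]
    rest = np.delete(X, masked_col, axis=1)
    design = np.hstack([rest, np.ones((len(X), 1))])  # add intercept column
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    pred = design @ coef
    return float(np.mean((pred - target) ** 2))
```

When one laboratory value is an exact linear function of the others, the score is essentially zero; noisy or independent features yield a higher score, which is the signal the pretext task exploits.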
[0059] In another exemplary embodiment, the clustering system 100 may be applied in the context of smart city management for analyzing data generated from Internet of Things devices. Modern urban environments deploy large-scale networks of IoT sensors that monitor traffic flow, environmental conditions, energy consumption, and public safety. The resulting datasets are extremely high-dimensional, heterogeneous, and unlabeled, with measurements arriving in real time from distributed sensors. In this embodiment, a client device 102 deployed within a city operations center may upload sensor data streams to the application server 108 through the communication network 112.
[0060] The second processor 106 may validate the data by detecting anomalies such as missing readings or corrupted sensor values. Dynamic density-based clustering may identify both tightly packed regions of abnormal traffic congestion and sparsely distributed patterns of energy wastage across neighborhoods. The system may then apply grouping architectures to compress multimodal sensor data, enabling interpretable latent representations of environmental or infrastructure states. The IoT devices and their data flows may be represented as a graph, wherein hierarchical graph neural networks applied to the graphs of the tagged clusters may detect overlapping communities, such as clusters of sensors consistently reporting hazardous air quality across multiple districts. Self-supervised learning tasks such as masked sensor prediction or temporal contrastive learning may enhance clustering accuracy in the absence of ground-truth incident labels. The secure cloud database 110 may store cluster assignments and provide operators with dashboards that visualize traffic clusters, environmental hotspots, and energy inefficiencies. The clustering results may support city planners in deploying targeted interventions such as rerouting traffic, increasing green cover in pollution clusters, or optimizing energy consumption in identified zones.
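The validation step at the start of the paragraph above (flagging missing or corrupted sensor values) can be sketched as follows. The sensor-type names and validity ranges are illustrative assumptions; in practice they would come from device datasheets.

```python
import math

# Illustrative validity ranges per sensor type (assumed values, not from
# the specification).
VALID_RANGE = {"temperature_c": (-40.0, 60.0), "pm2_5": (0.0, 1000.0)}

def validate_readings(readings):
    """Split raw IoT readings, given as (sensor_type, value) pairs, into
    accepted values and anomalies (missing or out-of-range), mirroring the
    validation performed before clustering."""
    accepted, anomalies = [], []
    for sensor_type, value in readings:
        low, high = VALID_RANGE[sensor_type]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            anomalies.append((sensor_type, value))   # missing reading
        elif not (low <= value <= high):
            anomalies.append((sensor_type, value))   # corrupted / out of range
        else:
            accepted.append((sensor_type, value))
    return accepted, anomalies
```

Only the accepted readings would proceed to the density-based clustering stage, while the anomalies could be logged for operator review.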
[0061] FIG. 2 depicts a flowchart of a method 200 for grouping the real-world datasets using the clustering system 100, according to an embodiment of the present invention.
[0062] At step 202, the clustering system 100 may receive the dataset with the high dimensionality, the heterogeneous distribution of the data points, and the absence of labelling from the client device 102.
[0063] At step 204, the clustering system 100 may group the received dataset into the clusters based on the variations in the data density of the dataset.
[0064] At step 206, the clustering system 100 may apply the grouping architectures to tag the clusters with the generalized feature.
[0065] At step 208, the clustering system 100 may generate the graphs of the tagged clusters using the graph neural networks with the hierarchical strategies.
[0066] At step 210, the clustering system 100 may integrate the self-supervised learning objectives in the generated graphs of the tagged clusters to improve clustering accuracy in the unlabeled datasets.
[0067] At step 212, the clustering system 100 may evaluate the clustering quality and generate the evaluation metrics of the dataset.
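As a minimal, hedged illustration of the stability indices named at step 212, the pair-counting Rand index below compares two clusterings of the same points: running the clustering on perturbed or resampled data and comparing assignments yields a stability score, with values near 1 indicating stable clusters. The function name and this choice of index are illustrative assumptions, not the specification's prescribed metric.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree (i.e. both
    place the pair in the same cluster, or both place it in different
    clusters). Invariant to relabelling, so it can compare independent runs."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)   # pair treated consistently by both runs
        total += 1
    return agree / total
```

Because the index counts pairwise agreements rather than matching label values, a clustering and its label permutation score a perfect 1.0, which is the desired behaviour for a stability measure.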
[0068] While the invention has been described in connection with what is presently considered to be the most practical and various embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims.
[0069] This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.
CLAIMS
I/We Claim:
1. A clustering system (100) for real-world datasets, the clustering system (100) comprising:
a client device (102) comprising a first processor (104);
a second processor (106) located on an application server (108);
a communication network (112) adapted to establish a communicative link connecting the client device (102) to the application server (108); and
a storage medium (114) comprising programming instructions executable by the second processor (106), characterized in that the second processor (106) is configured to:
receive, from the client device (102), a dataset with high dimensionality, heterogeneous distribution of data points, an absence of labelling, or a combination thereof;
group the received dataset into clusters based on variations in data density of the dataset;
apply grouping architectures to tag the clusters with a generalized feature;
generate graphs of the tagged clusters using graph neural networks with hierarchical strategies;
integrate self-supervised learning objectives in the generated graphs of the tagged clusters to improve clustering accuracy in unlabelled datasets; and
evaluate a clustering quality and generate evaluation metrics of the dataset.
2. The clustering system (100) as claimed in claim 1, wherein the variations in data density of the dataset comprise automatic adaptation of local thresholds for distance, minimum cluster size based on dataset characteristics, or a combination thereof.
3. The clustering system (100) as claimed in claim 1, wherein the grouping architectures comprise autoencoders, unsupervised embedding models, or a combination thereof to enhance interpretability and accuracy of the tag applied to the clusters.
4. The clustering system (100) as claimed in claim 1, wherein the graphs of the tagged clusters comprise hierarchical graph neural networks configured to process overlapping and large-scale networks.
5. The clustering system (100) as claimed in claim 1, wherein the self-supervised learning objectives are optimized to align with clustering objectives for robust learning in the absence of labeled data.
6. The clustering system (100) as claimed in claim 1, wherein the evaluation metrics comprise stability indices, generalizability scores, cross-dataset consistency measures, or a combination thereof.
7. The clustering system (100) as claimed in claim 1, wherein the second processor (106) is configured to operate on frameworks selected from TensorFlow, PyTorch, or a combination thereof.
8. The clustering system (100) as claimed in claim 1, wherein the second processor (106) is configured to adjust a sensitivity parameter selection by iterative refinement of clustering outputs based on dataset feedback.
9. The clustering system (100) as claimed in claim 1, wherein the second processor (106) is configured to enable improvement of clustering interpretability through visualization of embeddings in a reduced-dimensional space.
10. A method (200) for grouping real-world datasets using a clustering system (100), the method (200) is characterized by steps of:
receiving, from a client device (102), a dataset with high dimensionality, heterogeneous distribution of data points, an absence of labelling, or a combination thereof;
grouping the received dataset into clusters based on variations in data density of the dataset;
applying grouping architectures to tag the clusters with a generalized feature;
generating graphs of the tagged clusters using graph neural networks with hierarchical strategies;
integrating self-supervised learning objectives in the generated graphs of the tagged clusters to improve clustering accuracy in unlabelled datasets; and
evaluating a clustering quality and generating evaluation metrics of the dataset.
Date: October 06, 2025
Place: Noida
Nainsi Rastogi
Patent Agent (IN/PA-2372)
Agent for the Applicant