
Method For Classifying Non Iid Datasets In Federated Learning

Abstract: A method for classification of non-IID datasets in Federated Learning is disclosed. The method (100) includes receiving (102) local data at a plurality of client devices (202-1…n) for local training models, wherein each of the client devices includes a local training model, generating (104) a signature vector for each of the client devices by the local training models, and providing (106) the signature vectors for the plurality of client devices to a server device for storing the signature vectors. The method further includes aggregating (108) the local training models’ parameters received from each of the client devices to generate a global model, receiving (110) a dataset at the server device for the global model, generating (112) a response vector for the dataset provided to the server device, comparing (114) the response vector with the stored signature vectors, and assigning (116) the closest matching signature vector as the classification for the dataset.


Patent Information

Application #
Filing Date
04 April 2025
Publication Number
17/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

Amrita Vishwa Vidyapeetham
Amrita Vishwa Vidyapeetham, Amritapuri Campus, Amritapuri, Clappana PO, Kollam, Kerala - 690525, India.

Inventors

1. NAIR, Jyothisha J.
Shambhavi, Decent Junction PO, Kollam - 691577, Kerala, India.
2. G., Gopakumar
Puliyara, Manappally North PO, Karunagappally, Kollam - 690574, Kerala, India.

Specification

Description:FIELD OF INVENTION
[0001] The present disclosure relates to machine learning models, more particularly it relates to classifying datasets in Federated Learning models.

DESCRIPTION OF THE RELATED ART
[0002] In recent years, Federated Learning (FL) has emerged as a groundbreaking paradigm in machine learning and artificial intelligence, enabling the training of models across distributed data sources while preserving data privacy and security. FL allows multiple clients to collaboratively train machine learning models without the need to transmit raw data, offering a decentralized approach that safeguards sensitive information. Federated Learning operates by exchanging and aggregating intermediate results, such as model gradients or parameters, across clients. However, one of the major challenges in FL that significantly impacts its effectiveness is data heterogeneity, where participating clients possess diverse data distributions that diverge from the traditional machine learning assumption of independently and identically distributed (IID) data. In real-world FL scenarios, the assumption of IID data rarely holds. Clients often contribute datasets that vary in several aspects, a condition referred to as non-IID data. Improving accuracy on non-IID data is relevant because it can help to make federated learning more effective in real-world scenarios. If the model accuracy is poor on non-IID data, it may not be able to generalize well to new data or may not be effective in making accurate predictions in real-world settings.
[0003] Data heterogeneity in FL can arise from various factors. One of the primary factors is statistical distribution disparities, where clients may have datasets originating from different sources, such as diverse geographical locations, distinct user populations, or application-specific data. This results in variations in the statistical properties of their data, making it difficult to train a global model that accurately captures the underlying patterns. Another factor is data imbalance. In many FL scenarios, some clients may have a significantly larger or smaller volume of data compared to others, which leads to imbalanced contributions during the model training process. This imbalance can result in a global model that is biased toward clients with larger datasets, further reducing performance on underrepresented data segments.
[0004] Additionally, feature heterogeneity poses a challenge due to differences in data collection methods or variations in hardware devices. This variability can produce differences in feature representations, complicating the integration of diverse data formats into a unified model. Moreover, in FL settings, data is often non-IID, unlike traditional machine learning where the IID assumption holds. Non-IID data introduces complexities in the training process, such as slow convergence or suboptimal model performance. This divergence from the IID assumption results in global models that may not generalize well to unseen or diverse datasets. Finally, privacy constraints are a significant factor. Clients may apply different privacy-preserving techniques, such as differential privacy, which can further diversify the quality and utility of their contributed data. The variations in privacy policies and techniques add another layer of complexity to training an effective and unbiased global model. Addressing these forms of data heterogeneity is essential for achieving robust and accurate models in FL. Poor model accuracy on non-IID data can hinder generalization to unseen data, limiting the effectiveness of FL in real-world scenarios. Consequently, there is a pressing need to develop strategies that improve the performance of FL on non-IID data, ensuring that the global model can effectively capture and learn from heterogeneous client data distributions.
[0005] Various publications have attempted to address the aforementioned problems encountered with non-IID data in FL models. Chinese application CN117556919A discloses a personalized graph federated learning method and system based on signature clustering, and a storage medium. US patent 11711348B2 discloses a scalable and reliable communication mechanism between a plurality of requesters and a plurality of edge devices. Ozkara et al. (2019) discuss a generative framework for learning that unifies several known personalized FL algorithms and also suggests new ones in “A Generative Framework for Personalized Learning and Estimation: Theory, Algorithms, and Privacy”, while in “A Blockchain System for Clustered Federated Learning with Peer-to-Peer Knowledge Transfer” Wu et al. (2024) discuss a knowledge transfer method that uses other clients on a peer-to-peer blockchain network to optimize local training in FL.
[0006] Presently, there is a requirement for improving the accuracy and performance of Federated Learning (FL) models on non-IID data.

SUMMARY OF THE INVENTION
[0007] The present subject matter relates to a method for classification of non-IID datasets in Federated Learning (FL).
[0008] In one embodiment of the present subject matter, the method comprises receiving local data at a plurality of client devices for local training models, wherein each of the client devices includes a local training model, generating a signature vector for each of the client devices by the local training models, providing the signature vectors with the local training models’ parameters for the plurality of client devices to a server device, and storing the received signature vectors at the server device. The method further includes aggregating the local training models’ parameters received from each of the client devices to generate a global model, receiving test data at the server device, generating a response vector for the dataset provided to the server device, comparing the response vector with the stored signature vectors, and assigning the closest matching signature vector as the classification for the dataset.
[0009] In various embodiments, the signature vector is generated by the local training model with validation data provided to each of the client devices.
[0010] In various embodiments, a distance difference is computed for the received signature vectors.
[0011] In one embodiment, a system for classification of non-IID datasets in Federated Learning (FL) is disclosed. The system comprises a plurality of client devices, each with a local training model, wherein the local training model of each client device is configured to generate a signature vector based on local data received at that client device. The system further comprises a server device for receiving the signature vectors with the local training models’ parameters from the plurality of client devices and storing the received signature vectors. A global model is generated by aggregating the local training models’ parameters received from each of the client devices. The server device is further configured to receive a dataset for the global model, generate a response vector for the dataset provided to the server device, compare the response vector with the stored signature vectors, and assign the closest matching signature vector as the classification for the dataset.
[0012] This and other aspects are described herein.

BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The invention has other advantages and features, which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:
[0014] FIG. 1 illustrates the method for classifying non-IID datasets in Federated Learning models.
[0015] FIGs. 2A and 2B illustrate the system for classifying non-IID datasets in Federated Learning models.
[0016] FIG. 3 illustrates the convergence speed and accuracy of the method.
[0017] FIGs. 4A to 4D illustrate the usage of signature vectors and response vector during implementation of the method.
[0018] Referring to the figures, like numbers indicate like parts throughout the various views.

DETAILED DESCRIPTION OF THE EMBODIMENTS
[0019] While the invention has been disclosed with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the invention. In addition, many modifications may be made to adapt to a particular situation or material to the teachings of the invention without departing from its scope.
[0020] Throughout the specification and claims, the following terms take the meanings explicitly associated herein unless the context clearly dictates otherwise. The meaning of “a”, “an”, and “the” include plural references. The meaning of “in” includes “in” and “on.” Referring to the drawings, like numbers indicate like parts throughout the views. Additionally, a reference to the singular includes a reference to the plural unless otherwise stated or inconsistent with the disclosure herein.
[0021] The present subject matter describes a method for improving classification accuracy in Federated Learning (FL) with non-IID datasets using signature vectors generated by client devices during model training. These signature vectors are shared, alongside model parameters, with a global model present on a central server device, conveying each client device’s local data distribution without compromising data privacy. By incorporating the received signature vectors, the global model’s performance is improved considerably, particularly in non-IID environments.
[0022] A method 100 for classification of non-IID datasets in Federated Learning (FL) is illustrated in FIG. 1, in various embodiments of the subject matter. The method 100 operates in an environment with a plurality of client devices 202-1…n and a server device 208, wherein each of the plurality of client devices 202-1…n includes a local training model 204 and the server device 208 includes a global model 210. In step 102, local data is received at the plurality of client devices for processing information. The local training model 204 of each client device 202-1…n generates a signature vector 206 for the received local data in step 104. The signature vectors 206 from the plurality of client devices, along with the local training models’ parameters, are provided to the server device 208, and the received signature vectors 206 are stored at the server device 208 in step 106. In step 108, the local training models’ parameters received from each of the client devices are aggregated to generate the global model 210. Further, in step 110, test data is received at the server device for processing, and the received dataset is used by the global model to generate a response vector 212 in step 112. The response vector is compared with the stored signature vectors in step 114, and the closest matching signature vector is assigned for classification of the dataset received at the server device in step 116.
[0023] In various embodiments, the local training models 204 generate the signature vectors 206 along with a validation data that is provided to each of the plurality of the client devices 202-1…n. Further, the signature vectors 206 are provided to the global model 210 at the server device 208 along with the parameters of the local training models of each of the client devices 202-1…n.
[0024] In an embodiment, a system for classifying non-IID datasets in an FL environment is illustrated in FIGs. 2A and 2B. The FL environment includes a plurality of client devices 202-1…n and a server device 208, wherein each of the client devices 202-1…n includes a local training model 204 and the server device 208 includes a global model 210. Each of the local training models 204 generates a signature vector 206 for its client device 202-1…n based on the local data 214 and the validation data 218 provided to the client device. The signature vector is generated using the following equation.
Si = (1 / │Dival│) ∑x∈Dival f(W, x)
[0025] Where the signature vector Si at each client i is computed using its validation dataset Dival. Here, f(W, x) is the response of model W to the sample data x. The signature vector captures the essential characteristics of the client’s data distribution without compromising privacy by sharing raw data.
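By way of illustration only (not part of the disclosed specification), the per-class signature computation described above may be sketched in Python; the function name and array layout are assumptions, with the model’s raw logits supplied as a NumPy array:

```python
import numpy as np

def signature_vectors(model_logits, labels, num_classes):
    """Per-class signature vectors: the mean model response (logits)
    over the client's validation samples of each class, per the
    equation above. Assumes one output logit per class."""
    signatures = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            # Average the f(W, x) responses for all validation
            # samples x of class c.
            signatures[c] = model_logits[mask].mean(axis=0)
    return signatures
```

Each row of the returned array corresponds to one class’s signature vector, matching the per-class averaging described in paragraph [0062].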
[0026] After training the local model 204, each of the client devices sends the parameters of the local training model 204 and the corresponding signature vectors 206 to the server device 208. At the server device 208, a global model 210 is constructed using the following equation with the Federated Averaging algorithm, wherein │Di│ is the size of client i’s local dataset and │Di│ / ∑j│Dj│ is the factor assigned to client i’s contribution to the global model. The contribution of a client to the global model is proportional to the size of its local dataset; hence larger datasets contribute more significantly to the global model update, ensuring that the aggregation reflects the data distribution across all clients.
Wt+1 = ∑i=1..N (│Di│ / ∑j=1..N │Dj│) Wi
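A minimal sketch of this weighted aggregation, assuming each client’s parameters are flattened into a single NumPy array (the function name is illustrative, not from the disclosure):

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Federated Averaging: scale each client's parameters by
    |Di| / sum_j |Dj| so clients with larger local datasets
    contribute proportionally more to the global model."""
    total = float(sum(client_sizes))
    global_weights = np.zeros_like(np.asarray(client_weights[0], dtype=float))
    for w, n in zip(client_weights, client_sizes):
        global_weights += (n / total) * np.asarray(w, dtype=float)
    return global_weights
```

In a real deployment the same weighted sum would be applied layer by layer to each parameter tensor rather than to a single flat vector.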
[0027] A dataset 216 is provided to the server device 208 for processing, the global model 210 generates a response vector 212 for the dataset 216. The response vector is generated using the following equation.
R = f(Wt+1, xtest)
[0028] Wherein, the dataset sample xtest may be any digit (0 to 9) passed through the global model Wt+1.
[0029] The response vector 212 generated by the global model 210 is compared with each of the stored signature vectors 206 in the server device 208 to determine a match between the vectors. The signature vector 206 having the closest match with the response vector 212 is assigned for classification of the dataset 216. The comparison of the response vector 212 and the signature vector 206 is performed using the following equation.
class(xtest) = argmini ‖R − Si‖2
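The nearest-signature assignment can be sketched as follows (an illustrative implementation, assuming the stored signatures are stacked row-wise in a NumPy array):

```python
import numpy as np

def classify_by_signature(response_vector, signatures):
    """Assign the class whose stored signature vector is closest
    (Euclidean distance) to the global model's response vector."""
    distances = np.linalg.norm(signatures - response_vector, axis=1)
    return int(np.argmin(distances))
```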
[0030] The method for classification of non-IID datasets in Federated Learning has several advantages over the prior art. Traditional federated learning (FL) struggles with non-IID datasets, as each client has a different distribution of data. With signature vectors, this issue is addressed by incorporating client-specific information during testing, thereby increasing the accuracy of the global model. Further, this method enables model training without the need for clients to share their raw data with the central server. By only sharing the signature vectors and model parameters, privacy concerns are mitigated, making it a more privacy-preserving approach compared to traditional FL methods. As only the signature vectors and model parameters are communicated between clients and the server, the amount of data transmitted over the network is significantly reduced. This reduction in communication overhead leads to faster training and testing processes, making the method more scalable for large-scale deployments. Additionally, clients with specialized datasets can participate in the federated learning process without needing to conform to a standard dataset distribution. This flexibility allows for a wider range of participants, potentially leading to more diverse and representative global models. The method may be used in multiple sectors such as healthcare, finance, manufacturing, retail and e-commerce.
[0031] Although the detailed description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. It should be appreciated that the scope of the invention includes other embodiments not discussed herein. Various other modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the system and method of the present invention disclosed herein without departing from the spirit and scope of the invention as described here, and as delineated in the claims appended hereto.

EXAMPLES
[0032] EXAMPLE 1: Communication overhead analysis for aggregated signature vectors
[0033] In a typical FL setup, the communication overhead is dominated by the exchange of model parameters. If the global model has p parameters, and each parameter is represented using s bytes, the size of each model update sent by a client is p × s. For a system with N clients and T communication rounds, the total communication overhead without signature vectors is given by:
Cbaseline = 2 × N × T × p × s
[0034] The factor of 2 accounts for both upload (client to server) and download (server to client) of model parameters in each communication round. When incorporating signature vectors, an additional component is introduced into the communication cost. Let m be the number of elements in the signature vector and q the size of each element (in bytes). If the signature vector is generated at every communication round and transmitted along with the model parameters, the communication overhead per client becomes:
Cclient = 2 × T × (p × s) + T × (m × q)
[0035] This expression includes the size of the model parameters and the additional size of the signature vector. To evaluate the relative increase, the ratio of communication overhead with and without signature vectors can be expressed as:
Ratio = (2 × p × s + m × q) / (2 × p × s) = 1 + (m × q) / (2 × p × s)
[0036] This ratio quantifies the proportional increase in communication overhead. If m × q ≪ p × s, the additional overhead is minimal. If the signature vector size m × q is much smaller than the total size of the model parameters p × s, then the ratio can be approximated as:
Ratio ≈ 1
[0037] This implies that the overhead introduced by the signature vectors is negligible. The additional communication overhead due to the signature vectors can be quantified as a percentage of the baseline overhead:
Extra overhead (%) = ((m × q) / (2 × p × s)) × 100
[0038] If m × q ≪ p × s, the extra overhead will be very small. Thus, it can be concluded that the communication overhead introduced by signature vectors is minimal for most realistic settings, making it a feasible solution for enhancing FL performance in non-IID scenarios.
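The overhead analysis above can be checked numerically with a short sketch (the function and the example parameter sizes are illustrative assumptions, not figures from the disclosure):

```python
def overhead_ratio(p, s, m, q):
    """Per-round, per-client communication with signature vectors
    (upload of p*s parameter bytes plus m*q signature bytes,
    download of p*s) relative to the 2*p*s baseline without
    signature vectors."""
    return (2 * p * s + m * q) / (2 * p * s)

# Example: one million float32 parameters (s = 4 bytes) and a
# 10-element float32 signature vector -> the extra overhead is
# 0.0005% of the baseline, i.e. negligible.
extra_pct = (overhead_ratio(1_000_000, 4, 10, 4) - 1) * 100
```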
[0039] EXAMPLE 2: Security Analysis of Passing Signature Vectors
[0040] Passing signature vectors to the server introduces potential security and privacy concerns. While signature vectors do not contain raw data, they still carry information about the local data distribution, which could be exploited by a malicious server or eavesdropper. Some of the security considerations and possible mitigation strategies are outlined. Signature vectors are summaries of the model’s response to local validation data. An adversary with access to these vectors could perform model inversion attacks to gain insights into the local data distribution. For example, if a client’s data consists of medical records with a particular disease distribution, the signature vector could reveal sensitive patterns.
[0041] Let the signature vector for client i be Si ∈ ℝm, where m is the dimension of the signature vector. An attacker could attempt to minimize a reconstruction loss function L defined as:
L(x) = ‖Si − ϕ(Wi, x)‖2
[0042] where ϕ(Wi, x) represents the client’s model response to input x. Minimizing L over x allows the adversary to infer potential samples x that match the signature vector, revealing data characteristics. To protect against such attacks, consider applying differential privacy to the signature vectors. One approach is to add Gaussian noise to each element of Si:
Ŝi[j] = Si[j] + N(0, σ2), for j = 1, …, m
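The noise-addition step can be sketched as follows (an illustrative implementation; calibrating σ to a formal differential-privacy budget is outside the scope of this sketch):

```python
import numpy as np

def noisy_signature(signature, sigma, rng=None):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to each element of
    the signature vector before sending it to the server; larger
    sigma gives stronger privacy, smaller sigma preserves more
    classification utility."""
    rng = np.random.default_rng() if rng is None else rng
    return signature + rng.normal(0.0, sigma, size=signature.shape)
```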
[0043] Here, N (0, σ2) is Gaussian noise with mean 0 and variance σ2. The choice of σ should balance between privacy and utility, ensuring that the noisy signature vector retains enough information for accurate classification without leaking sensitive patterns. An adversarial client could intentionally modify its signature vector to mislead the global model, causing a form of model poisoning attack. To address this, the server can implement integrity checks using hash-based verification or cryptographic commitments on the signature vectors before aggregation. The server can verify the integrity of Si using a hash function h:
Accept Si only if h(Si) = hi, where hi is the hash value previously committed by client i.
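A minimal sketch of such an integrity check, assuming SHA-256 over the vector’s byte representation (the hash choice and function names are assumptions for illustration):

```python
import hashlib
import numpy as np

def signature_hash(signature):
    """SHA-256 digest of the signature vector's byte representation."""
    return hashlib.sha256(np.ascontiguousarray(signature).tobytes()).hexdigest()

def verify_signature(signature, committed_hash):
    """Server-side integrity check: accept the client's signature
    vector only if it matches the hash committed beforehand."""
    return signature_hash(signature) == committed_hash
```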
[0044] If the received signature vector does not match a pre-verified hash value, the server rejects the update, thereby maintaining robustness against malicious clients. The use of signature vectors can enhance classification accuracy in non-IID federated learning but introduces additional communication overhead and security risks. The trade-offs must be carefully analyzed based on application-specific requirements. Applying differential privacy and cryptographic techniques can help mitigate some of the security concerns, ensuring that the system remains both efficient and secure.
[0045] EXAMPLE 3: Comparison with traditional FL
[0046] Results indicate that incorporating the signature vector leads to significant performance improvement compared to not using it. TABLE 1 shows the accuracy observed on the MNIST non-IID dataset with and without the signature vector at different rounds. Convergence is also faster with signature vector incorporation as compared to traditional FL.

TABLE 1: Comparison of traditional FL and signature vector FL
                                   Round 1   Round 10   Round 100
Normal FedAverage                  14.14%    22.33%     54.62%
With Signature Vector FedAverage   65.85%    79.43%     82.33%

[0047] EXAMPLE 4: Convergence speed and accuracy with implementation of method
[0048] FIG. 3 compares the convergence speed and accuracy of a Federated Learning system on the MNIST dataset under two scenarios: using a normal federated averaging method (denoted as "MNIST Normal") and an enhanced method utilizing signature vectors (denoted as "MNIST Signature Vector"). The x-axis represents the number of training rounds, while the y-axis shows the classification accuracy. Key observations include:
[0049] Convergence Speed: The method with signature vectors achieves a faster convergence compared to the normal method. For instance, it reaches 50% accuracy in significantly fewer rounds than the normal method.
[0050] Final Accuracy: By the 100th round, the accuracy with signature vectors stabilizes at approximately 82%, whereas the normal method achieves only around 54%. This highlights the enhanced learning capability brought by signature vectors in handling non-IID data.
[0051] Steady Improvement: The accuracy curve for the signature vector method exhibits a steeper initial rise, demonstrating its ability to incorporate valuable insights from client data earlier in the training process.
[0052] Normal Federated Learning: The normal method shows slower and less effective convergence, likely due to challenges in aggregating non-IID client data effectively.
[0053] The use of signature vectors significantly improves both the learning efficiency and the final accuracy in a federated learning setup, especially under non-IID data conditions, making it a more robust choice for such scenarios.
[0054] EXAMPLE 5: Convergence speed and accuracy with CIFAR-10 dataset
[0055] TABLE 2 represents the results of convergence speed and accuracy of current method with CIFAR-10 dataset.
TABLE 2: Convergence speed and accuracy with CIFAR-10 dataset
                                   Round 1   Round 10   Round 100
Normal FedAverage                  9.97%     11.11%     14.55%
With Signature Vector FedAverage   17.31%    19.82%     20.84%

[0056] Convergence Speed: The signature vector method reaches 20% accuracy significantly earlier (Round 61) than the normal method (Round 96), indicating faster convergence.
[0057] Final Accuracy: The signature vector method achieves a final accuracy of 20.84%, outperforming the normal method, which stagnates at 14.55% after 100 rounds.
[0058] Initial Accuracy: The initial accuracy for the signature vector method (17.31%) is notably higher than the normal method (9.97%), demonstrating its early effectiveness.
[0059] EXAMPLE 6: Difference comparison for signature vectors during implementation of the method
[0060] This comparison includes the following 3 steps:
[0061] Step 1: For each class, compute the mean response vector (signature vector) from the local model's outputs on validation data.
[0062] Signature vectors are abstract representations and don’t have a direct, human-interpretable meaning. They encode how the model distinguishes between classes. Each value in the signature vector corresponds to the output of a specific node (logit) in the final layer of the neural network. The output values are unnormalized scores produced before applying the SoftMax function to calculate probabilities for each class. The signature vector is specific to each class and reflects the typical "response pattern" the model produces when it processes validation samples of that class as shown in FIG. 4A. For instance, the vector for Class 0 represents the average response pattern for samples labeled 0 in the validation dataset. The length of the vector corresponds to the number of output neurons in the model, which is equal to the number of classes (0–9 for MNIST). Each position in the vector corresponds to the average logit for that output neuron across all validation samples of the class. The magnitude and direction (sign and relative values) of these numbers encode the distinguishing features of each class as learned by the local model. Each class has a unique signature vector. For instance:
[0063] Class 0 has relatively strong positive values at indices 0, 3, and 7.
[0064] Class 7 has a relatively negative value at index 1 but positive values at indices 3 and 7.
[0065] These distinct patterns are what help the system differentiate between classes.
[0066] Step 2: The test sample is passed through the global model to generate the response vector, as shown in FIG. 4B.
[0067] Step 3: The test sample’s response vector is compared with the signature vectors to find the closest match; the closest match determines the predicted class, as shown in FIGs. 4C and 4D. The Euclidean distance between the response vector and the signature vector for each class is computed to obtain the closest match. Based on this computation, the label assigned to the test sample (Digit 7) is 7, the label of the class with the smallest distance.
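The three steps above can be illustrated with a small hypothetical sketch; all vector values below are invented for illustration and do not come from the disclosure:

```python
import numpy as np

# Hypothetical per-class signature vectors (10 logits, one per MNIST
# class) and a response vector for a test digit.
signatures = {c: np.eye(10)[c] * 6.0 for c in range(10)}
response = np.eye(10)[7] * 5.5   # global model strongly activates logit 7

# Step 3: Euclidean distance to every class signature; the smallest
# distance gives the predicted class.
distances = {c: float(np.linalg.norm(response - s)) for c, s in signatures.items()}
predicted = min(distances, key=distances.get)   # -> 7
```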

, Claims:1. A method (100) for classification of non-IID datasets in Federated Learning (FL), the method (100) comprising:
receiving (102) local data at a plurality of client devices (202-1…n) for local training models, wherein each of the client devices (202-1…n) includes a local training model (204);
generating (104) a signature vector (206) for each of the client devices (202-1…n) by the local training models (204);
providing (106) the signature vectors (206) with local training models’ parameters for the plurality of client devices (202-1…n) to a server device (208) and storing the received signature vectors at the server device (208);
aggregating (108) the local training models’ parameters received from each of the client devices (202-1…n) to generate a global model (210);
receiving (110) test data at the server device for the global model (210);
generating (112) a response vector (212) for the dataset provided to the server device (208);
comparing (114) the response vector (212) with the stored signature vectors (206); and
assigning (116) the closest matching signature vector as classification for the dataset.

2. The method (100) as claimed in claim 1, wherein the signature vector (206) is generated by the local training model (204) with validation data provided to each of the client devices (202-1…n).

3. The method (100) as claimed in claim 1, wherein a distance difference is computed for the received signature vectors (206).

4. A system (200) for classification of non-IID datasets in Federated Learning (FL), the system (200) comprises:
a plurality of client devices (202-1…n), each with a local training model (204), wherein the local training model (204) of each client device is configured to:
generate a signature vector (206) for each of the client devices (202-1…n) based on local data (214) received at each of the client devices; and
a server device (208) for receiving the signature vectors (206) from the plurality of client devices (202-1…n), the server device being configured to:
store the received signature vectors at the server device (208);
aggregate the local training models’ parameters received from each of the client devices (202-1…n) to generate a global model (210);
receive a dataset (216) at the server device (208) for the global model (210);
generate a response vector (212) for the dataset (216) provided to the server device (208);
compare the response vector (212) with the stored signature vectors (206); and
assign the closest matching signature vector (206) as classification for the dataset (216).

Documents

Application Documents

# Name Date
1 202541033468-STATEMENT OF UNDERTAKING (FORM 3) [04-04-2025(online)].pdf 2025-04-04
2 202541033468-REQUEST FOR EXAMINATION (FORM-18) [04-04-2025(online)].pdf 2025-04-04
3 202541033468-REQUEST FOR EARLY PUBLICATION(FORM-9) [04-04-2025(online)].pdf 2025-04-04
4 202541033468-FORM-9 [04-04-2025(online)].pdf 2025-04-04
5 202541033468-FORM FOR SMALL ENTITY(FORM-28) [04-04-2025(online)].pdf 2025-04-04
6 202541033468-FORM 18 [04-04-2025(online)].pdf 2025-04-04
7 202541033468-FORM 1 [04-04-2025(online)].pdf 2025-04-04
8 202541033468-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [04-04-2025(online)].pdf 2025-04-04
9 202541033468-EVIDENCE FOR REGISTRATION UNDER SSI [04-04-2025(online)].pdf 2025-04-04
10 202541033468-EDUCATIONAL INSTITUTION(S) [04-04-2025(online)].pdf 2025-04-04
11 202541033468-DRAWINGS [04-04-2025(online)].pdf 2025-04-04
12 202541033468-DECLARATION OF INVENTORSHIP (FORM 5) [04-04-2025(online)].pdf 2025-04-04
13 202541033468-COMPLETE SPECIFICATION [04-04-2025(online)].pdf 2025-04-04
14 202541033468-FORM-26 [03-07-2025(online)].pdf 2025-07-03
15 202541033468-Proof of Right [07-08-2025(online)].pdf 2025-08-07