System And Method For Data Anonymization Using Optimization Techniques
Abstract:
This disclosure relates generally to data anonymization using clustering techniques. In Typically, data anonymization using global recoding can overgeneralize the data. However, preservation of information while anonymization the data is of equal importance as obscuring the relevant information that can be used by the attackers. The disclosed method and system utilized attribute taxonomy tree for generalization to optimize the generalization of the records. The disclosed method uses clustering-based approach and after clustering, each cluster is solved independently using ILP model for K-Anonymization. The ILP model is solved by generalizing the value of the attributes. Sometimes, even after clustering the number of possible patterns is large, thus the disclosed method generates patterns on the fly during multiple iterations.
[To be published with FIGS. 4A-4B]
Get Free WhatsApp Updates!
Notices, Deadlines & Correspondence
Nirmal Building, 9th Floor,
Nariman Point
Mumbai
Maharashtra
India
Inventors
1. SAURABH, Saket
Tata Consultancy Services Limited
Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar,
Pune
Maharashtra 411013
India
2. RAMAMURTHY, Arun
Tata Consultancy Services Limited
Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar,
Pune
Maharashtra 411013
India
3. MONDAL, Sutapa
Tata Consultancy Services Limited
Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar,
Pune
Maharashtra 411013
India
4. GHAROTE, Mangesh Sharad
Tata Consultancy Services Limited
Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar,
Pune
Maharashtra 411013
India
5. LODHA, Sachin Premsukh
Tata Consultancy Services Limited
Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar,
Pune
Maharashtra 411013
India
Specification
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
SYSTEM AND METHOD FOR DATA ANONYMIZATION USING OPTIMIZATION TECHNIQUES
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD [001] The disclosure herein generally relates to data anonymization, and, more particularly, to system and method for data anonymization for privacy preserving data publication.
BACKGROUND
[002] Data has value , and its importance is increasing rapidly. As a result, large amount of data is being collected from almost every aspect of human beings ranging from watching habits to shopping habits, health care to social media, internet browsing to reading habits. According to a Seagate and IDC report, IoT devices will create over 90ZB data by 2025. This exponential growth in data can be both an opportunity and a threat.
[003] Since more data is available, it can be used in real time fraud detection, efficient traffic flow management between smart vehicles and prioritized traffic protocol for emergency response vehicles facial recognition for improved security at common places. These applications require collecting and sharing of personal data. If such data is shared without any processing, it can be misused by attackers. It can severely harm any person emotionally and economically. For example, in the Cambridge Analytica data scandal 2018, a firm exploited user personal data to influence people's behaviors. Similarly, in the case of Cambridge Analytica, FacebookTM data was mined and used without authorization to build a software that predicted and influenced voters during the 2016 US Presidential Election. There are many more examples like YahooTM, SonyTM, EquifaxTM, UberTM, and JP Morgan ChaseTM have been hit by cybersecurity attacks where the real names, email addresses, and telephone numbers of millions of users were compromised. To address this privacy challenge, data should be sanitized before sharing.
[004] There are mainly two type of solutions to sanitize data. The first one is data anonymization-based approach. The other approach is synthetic data generation. Implementation of synthetic data generation methods are non-truthful, i.e. perturbative, as noise is added to original data. In many use cases, truthfulness
is desired. There is growing interest for truthful data of hospital and medical data from governmental and industrial organizations.
[005] In data anonymization-based approach, data should transform in such a way that each record should match to at least other k-1 records. This approach hides one person (or a record) among k persons (or records). There are two ways to achieve K-Anonymity, namely global recoding and local recoding.
[006] In global recoding, K-anonymization is achieved by generalizing all records to same level of generalization. Although global recoding is time efficient, data utility loss is high. This loss in data utility is due to the unnecessarily high level of generalization. In contrast, in local recoding, K-Anonymization is achieved by generalizing different set of observations to different sets of generalization. An example of a global recoding and a local recoding are described further with reference to Tables, 1, 2 and 3. Herein, Table 2 is 2-anonymisation using Global recoding of Table 1, and Table 3 is 2-anonymisation using local recoding of Table 1. Local recoding preserves data utility at the expense of higher computational time.
Table 1: A Raw Table
No. 1 2 3 4 5 6 Gender Age Zip Problem
male Middle 4350 stress
male Middle 4350 obesity
male Middle 4350 obesity
female middle 4352 stress
female old 4354 stress
female old 4353 obesity
Table 2: A Two-Anonymity view by Global Recoding
No. 1 2 3 4 5 6 Gender Age Zip problem
* middle 435* stress
* middle 435* obesity
* middle 435* obesity
* middle 435* stress
* old 435* stress
* old 435* obesity
Table 3: A Two-Anonymity view by Local Recoding
No. 1 2 3 4 5 6 Gender Age Zip problem
male middle 4350 stress
male middle 4350 obesity
* middle 435* obesity
* middle 435* stress
female old 4353 stress
female old 4353 obesity
SUMMARY [007] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for data anonymization is provided. The method includes obtaining a dataset comprising a plurality of records for anonymization, via one or more hardware processors, the plurality of records comprising a plurality of attributes arranged in a taxonomy tree structure. Further, the method includes clustering the dataset into a plurality of clusters using an extended M-mode clustering technique, via the one or more hardware processors. Furthermore, the method includes, performing, For each cluster of the plurality of clusters, via the one or more hardware processors generating a set of patterns for an initial level of generalization of a set of records associated with the cluster, wherein each pattern of the plurality of patterns is representative of a distinct level of generalization and a distinct generalization loss, calculating a generalized information loss and a beta value for each pattern of the set of patterns, wherein the beta value for a pattern from amongst the set of patterns is indicative of a possibility of record to be anonymized by the pattern, and wherein the generalized information loss captures penalty incurred when generalizing an attribute from amongst the plurality of attributes, solving integer linear programming (ILP) model using the generalized information loss and the beta value to obtain a set of anonymized records by generated patterns and a set of suppressed records, determining whether the solution of ILP model is acceptable or not, wherein for a solution, the generalized information loss comprises a sum over the set of anonymized records and the set of
suppressed records, and wherein the acceptance of the solution is determined based on a percentage of reduction in the generalized information loss in a current iteration as compared to a previous iteration. Moreover, the method includes iteratively generating patterns with subsequent level of generalization of the set of records, calculating generalized information loss and solving the ILP model to obtain one or more solutions in one or more subsequent iterations until the solution in the one or more subsequent iteration is determined to be improved by a threshold percentage on determination that the solution is unacceptable for the set of anonymized records and the set of suppressed records.
[008] In another aspect, a system for data anonymization is provided. The method includes a memory storing instructions, one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to obtain a dataset comprising a plurality of records for anonymization, the plurality of records comprising a plurality of attributes arranged in a taxonomy tree structure. The one or more hardware processors are configured by the instructions to cluster the dataset into a plurality of clusters using an extended M-mode clustering technique. Further, the one or more hardware processors are configured by the instructions to perform, for each cluster of the plurality of clusters, generate a set of patterns for an initial level of generalization of a set of records associated with the cluster, wherein each pattern of the plurality of patterns is representative of a distinct level of generalization and a distinct generalization loss, calculate a generalized information loss and a beta value for each pattern of the set of patterns, wherein the beta value for a pattern from amongst the set of patterns is indicative of a possibility of record to be anonymized by the pattern, and wherein the generalized information loss captures penalty incurred when generalizing an attribute from amongst the plurality of attributes, solve integer linear programming (ILP) model using the generalized information loss and the beta value to obtain a set of anonymized records by generated patterns and a set of suppressed records, determine whether the solution of ILP model is acceptable or not, wherein for a solution, the generalized
information loss comprises a sum over the set of anonymized records and the set of suppressed records, and wherein the acceptance of the solution is determined based on a percentage of reduction in the generalized information loss in a current iteration as compared to a previous iteration and on determination that the solution is unacceptable for the set of anonymized records and the set of suppressed records, iteratively generate patterns with subsequent level of generalization of the set of records, calculate generalized information loss and solve the ILP model to obtain one or more solutions in one or more subsequent iterations until the solution in the one or more subsequent iteration is determined to be improved by a threshold percentage.
[009] In yet another aspect, a non-transitory computer readable medium for a method for data anonymization is provided. The method includes obtaining a dataset comprising a plurality of records for anonymization, via one or more hardware processors, the plurality of records comprising a plurality of attributes arranged in a taxonomy tree structure. Further, the method includes clustering the dataset into a plurality of clusters using an extended M-mode clustering technique, via the one or more hardware processors. Furthermore, the method includes, performing, For each cluster of the plurality of clusters, via the one or more hardware processors generating a set of patterns for an initial level of generalization of a set of records associated with the cluster, wherein each pattern of the plurality of patterns is representative of a distinct level of generalization and a distinct generalization loss, calculating a generalized information loss and a beta value for each pattern of the set of patterns, wherein the beta value for a pattern from amongst the set of patterns is indicative of a possibility of record to be anonymized by the pattern, and wherein the generalized information loss captures penalty incurred when generalizing an attribute from amongst the plurality of attributes, solving integer linear programming (ILP) model using the generalized information loss and the beta value to obtain a set of anonymized records by generated patterns and a set of suppressed records, determining whether the solution of ILP model is acceptable or not, wherein for a solution, the generalized information loss comprises a sum over the set of anonymized records and the set of suppressed records, and wherein
the acceptance of the solution is determined based on a percentage of reduction in the generalized information loss in a current iteration as compared to a previous iteration. Moreover, the method includes iteratively generating patterns with subsequent level of generalization of the set of records, calculating generalized information loss and solving the ILP model to obtain one or more solutions in one or more subsequent iterations until the solution in the one or more subsequent iteration is determined to be improved by a threshold percentage on determination that the solution is unacceptable for the set of anonymized records and the set of suppressed records.
[010] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[011] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[012] FIG. 1 illustrates an example representation of generalization loss computation for data anonymization according to some embodiments of the present disclosure.
[013] FIG. 2 illustrates an example of pattern generation through records and level combinations for data anonymization according to some embodiments of the present disclosure.
[014] FIG. 3 illustrates an exemplary network implementation of a data anonymization system according to some embodiments of the present disclosure.
[015] FIGS. 4A-4B is a flow chart illustrating a method for data anonymization in accordance with some embodiments of the present disclosure.
[016] FIG. 5 is a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
[017] FIG. 6 illustrates variation of loss in data utility for values of K in accordance with an example embodiment of present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS [018] Typically, data anonymization approaches can be categorized into three categories. The first category is random. In this category, randomization technique is used to perturb data and reconstructing statistics at an aggregate level. The second category is anonymization through cryptography. If there are two firms and each one has its own data and want to do computation on combined dataset without sharing to each other for preserving privacy.
[019] The third Category is K-Anonymity. In this approach, data is anonymized by hiding an individual among k persons (or records). In k-anonymization approach, there are also two famous approaches. The first one is global recording. This method generalizes the data at domain level. A global recoding method can overgeneralize anonymization by local recoding. A known technique used clustering based technique to anonymized data, an integer linear programing (ILP) program was proposed to anonymized data through local recoding. Another known technique also proposed mathematical model to solve K-anonymization but this work only consider suppression strategy. Yet another also proposed MILP model for this problem but they did not consider attributes taxonomy. In many cases, generalizing value by attribute taxonomy can make more sense than random bucketing. Suppose that in bucket one, there is undergraduate and post-graduate and in another bucket, there is post-graduate and 6th grade. Here former bucket preserves the information that the person attended higher school but any this type of interference is not possible in later bucket. So, considering taxonomy can preserve information while generalization.
[020] The method and system disclosed in various embodiments of the present disclosure provides technical solutions for aforementioned technical problems existing in art. For example, the disclosed embodiments consider attributes taxonomy trees for generalization. The disclosed method provides a clustering and mathematical modelling-based optimization approach which generates K-anonymized sanitized data and is capable of achieving K-anonymity in local recording. The method uses a modified form of M-mode clustering technique,
hereinafter referred to as extended M-mode clustering for achieving anonymization. The typical M-mode clustering algorithm uses binary dissimilarity (0 if two values are same, 1 of they are different) to measure the distance between two records in the dataset. However, the extended M-mode clustering disclosed herein a distance metric, known as WHD is used to compute distance between two records. The WHD based distance is appropriate for K-anonymization then the dissimilarity-based distance. In an embodiment, the disclosed method uses exteded m-mode clustering algorithm to cluster the records of the dataset and Integer Linear programmig (ILP) model to solve each cluster independently. Additionally, the method proposes to generate patterns in iterative manner which leads to decreasing the solution time of the ILP model.
[021] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[022] Referring now to the drawings, and more particularly to FIG. 1 through 6, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[023] Glossary of terms used in description of various embodiments
[024] Definition 1 (Quasi-identifier attribute set): A quasi-identifier (QID) set is a set of attributes which can be linked together to identify individual in the table of records (or dataset). For example, attribute set {Gender, Age, Zip code} in Table 1 is a quasi-identifier. Table 1 can be used to get the information of individual of row 5 by linking Gender, Age and zip code.
[025] Definition 2 (Equivalence class). An equivalence of a table is defined as set of records which have identical QID value for all QID attributes in the QID set. For example, record 1, 2, and 3 in Table 1 is one equivalence class.
[026] Definition 3 (K-Anonymity property). Any table is said to be satisfy K-Anonymity property if the size of each equivalence class in an anonymized table is greater than or equal to k. Table 1 does not satisfy 2-Anonymity as equivalence class {female, old, 4354} has only one record in the table.
[027] Definition 4 (K-Anonymization) It is a process to transform any table to a table which satisfy K-anonymity property. There are two ways for k-Anonymization, namely global recoding and local recoding. In global recoding, the generalization happens in domain. When an attribute value is generalized, every occurrence of the value is replaced by the same new generalized value. In local recoding, generalization happen at a cell level. For any value, many different levels of generalization value can co-exist in the anonymized table. As in global generalization, generalization happen at the domain level, so there is more data distortion.
[028] Definition 5 (Generalized Information Loss (GenILoss)) Data Utility is an important part of data anonymization. In the present embodiments, this metrics (GenILoss) is used to quantify data utility. This metrics capture the penalty incurred when generalizing a specific attribute, by quantifying the fraction of the domain values that have been generalized.
[029] Let Li and Ui are the lower limit and upper limit of attribute i respectively. A record j entry for attribute i is generalized by lower limit Lij and upper limit Uij. The overall information loss of an anonymized table T* is calculated as:
Where T is original table, n is the number of attributes and |T| is number of records.
[031] This metrics (GenILoss) is based on the fact that given a generalized value of attribute which have larger range, has less precision than specific value
which have smaller range. Here, lower the value, the better it is. Value ‘zero’ represent no transformation and the value of 1’ represent full suppression or maximum level of generalization of attribute. This above defined formula is defined for numerical attributes. This metric is also defined for categorical attributes. Here, each leaf node is mapped to integer and the above formula can be applied.
[032] Illustration For marital status refer FIG. 1, single is mapped to 1, separated mapped to 2 and so on till remarried to six. The GenILoss for the cell
value “not married” is Numerator represent that ‘not married’ is a
generalized form of [1,4]. For age, which is numerical attribute, the GenILoss for
cell value with the value
[033] Definition 6 (Generalization height) Each of the attributes can be generalized to different level according to its taxonomy tree. The level to which the value is generalized is referred to as generalization height.
[034] Definition 7 (Level): Sum of generalization height of all the attributes of a record is defined as level of generalization for that record.
[035] Definition 8 (Level Combination) A record can be generalized into many ways depend on the number attributes and their taxonomy trees height. For a given level, all the possible combination of generalization is level combination. Figure 2 is showing all level combinations for level 1.
[036] Definition 9 (Weighted Hierarchical Distance). Let h be the height of domain generalization of a given attribute and generalization heights are defined as 1, 2,.. h-1, h from most specific to most general. Let the weight of generalization from generalization from j-1 to j is wj-1,j, where 2 ≤ j ≤ h. When a cell is generalized from level p to q, where p
Documents
Application Documents
#
Name
Date
1
202121011625-STATEMENT OF UNDERTAKING (FORM 3) [18-03-2021(online)].pdf
2021-03-18
2
202121011625-REQUEST FOR EXAMINATION (FORM-18) [18-03-2021(online)].pdf
2021-03-18
3
202121011625-FORM 18 [18-03-2021(online)].pdf
2021-03-18
4
202121011625-FORM 1 [18-03-2021(online)].pdf
2021-03-18
5
202121011625-FIGURE OF ABSTRACT [18-03-2021(online)].jpg
2021-03-18
6
202121011625-DRAWINGS [18-03-2021(online)].pdf
2021-03-18
7
202121011625-DECLARATION OF INVENTORSHIP (FORM 5) [18-03-2021(online)].pdf