Privacy Preserving Data Mining Philosophical Issues
• Mary J. Cronin, "e-Privacy?," HOOVER DIGEST, No. 3, 2000. Adapted from the essay "Privacy and Electronic Commerce," by Mary J. Cronin, in the new Hoover Press book Public Policy and the Internet: Privacy, Taxes, and Contract, edited by Nicholas Imparato. [link]
[abstract]
[showhide] 数据挖掘研究院
Imagine going to a shopping mall in which researchers follow you from store to store, taking notes on every product you examine or buy. Would you shop in such a place? Chances are, you already do. Welcome to the Internet.
数据挖掘研究院
[bib]
[showhide]
''this is the bibtex entry ...'' 数据挖掘研究院
• Alan J. Broder, "Data Mining, the Internet, and Privacy," International WEBKDD’99 Workshop San Diego, CA, USA, August 15, 1999. [link]
[abstract]
[showhide]
This paper addresses the inherent technological conflict between the desire for privacy by World Wide Web users, and the need of Web content providers and advertisers to more fully collect and utilize data about users. As the other papers in this volume illustrate, the Data Mining community has now turned its collective attention towards the Web as a fertile venue for research and development. In doing so, Data Miners have found themselves at the nexus of this conflict. 数据挖掘实验室
We present the technical issues regarding privacy from two perspectives. First, from the perspective of the Web user who may be unaware of the degree to which identifying information can be inadvertently disclosed. And second, from the perspective of a Data Miner we consider the extent to which privacy enhancing technologies could substantially invalidate data mining results.
数据挖掘研究院
[bib]
[showhide]
@INCOLLECTION{Broder_99, AUTHOR = "Alan J. Broder", TITLE = "Data Mining, the Internet, and Privacy", BOOKTITLE = "Web Usage Analysis and User Profiling: International WEBKDD'99 Workshop San Diego, CA, USA, August 15, 1999 ", PUBLISHER = "Springer Berlin/Heidelberg", YEAR = "1999", editor = "", volume = "1836/2000", number = "", series = "", type = "", chapter = "p.56", pages = "", address = "", edition = "", month = "", note = "", abstract = "", keywords = "", file = F } 数据挖掘研究院
• EPIC: Electronic Privacy Information Center, a public interest research center in Washington, D.C. [link]
[abstract]
[showhide]
[bib]
[showhide]
数据挖掘研究院
Privacy Preserving Data Mining Survey/General Issues
• V.S. Verykios, E. Bertino, I.N. Fovino, L.P. Provenza, Y. Saygin, and Y. Theodoridis, “State-of-the-Art in Privacy Preserving Data Mining,” ACM SIGMOD Record, vol. 3, no. 1, pp. 50-57, Mar. 2004. [link] 数据挖掘研究院
[abstract]
[showhide]
数据挖掘研究院
We provide here an overview of the new and rapidly emerging research area of privacy preserving data mining. We also propose a classification hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work to the classification hierarchy. A brief evaluation is performed, and some initial conclusions are made.
数据挖掘研究院
[bib]
[showhide] 数据挖掘研究院
@ARTICLE{Verykios_04v, AUTHOR = "V. S. Verykios and E. Bertino and I. N. Fovino and L. P. Provenza and Y. Saygin and Y. Theodoridis", TITLE = "State-of-the-art in Privacy Preserving Data Mining", JOURNAL = "ACM SIGMOD Record", YEAR = "2004", volume = "3", number = "1", pages = "50--57", month = "March", note = "", abstract = "", keywords = "", } 数据挖掘研究院
• C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu, “Tools for Privacy Preserving Distributed Data Mining,” ACM SIGKDD Explorations, vol. 4, no. 2, 2003. [link] 数据挖掘研究院
[abstract]
[showhide] 数据挖掘实验室
Privacy preserving mining of distributed data has numerous applications. Each application poses di erent constraints: What is meant by privacy, what are the desired results, how is the data distributed, what are the constraints on collaboration and cooperative computing, etc. We suggest that the solution to this is a toolkit of components that can be combined for speci c privacy-preserving data mining applications. This paper presents some components of such a toolkit, and shows how they can be used to solve several privacy-preserving data mining problems. 数据挖掘研究院
[bib]
[showhide]
数据挖掘研究院
@article{Clifton:03, author = {C. Clifton and M. Kantarcioglu and J. Vaidya and X. Lin and M. Zhu}, title = "Tools for Privacy Preserving Distributed Data Mining", journal = {ACM SIGKDD Explorations}, volume = {4}, number = {2}, year = {2003}, } 数据挖掘实验室
• B. Pinkas, “Cryptographic Techniques for Privacy Preserving Data Mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12-19, 2002. [link] 数据挖掘研究院
[abstract]
[showhide]
数据挖掘研究院
Research in secure distributed computation, which was done as part of a larger body of research in the theory of cryptography, has achieved remarkable results. It was shown that non-trusting parties can jointly compute functions of their different inputs while ensuring that no party learns anything but the defined output of the function. These results were shown using generic constructions that can be applied to any function that has an efficient representation as a circuit. We describe these results, discuss their efficiency, and demonstrate their relevance to privacy preserving computation of data mining algorithms. We also show examples of secure computation of data mining algorithms that use these generic constructions. 数据挖掘研究院
[bib]
[showhide]
数据挖掘研究院
@ARTICLE{Pinkas_02, AUTHOR = "B. Pinkas", TITLE = "Cryptographic Techniques for Privacy Preserving Data Mining", JOURNAL = "SIGKDD Explorations", YEAR = "2002", volume = "4", number = "2", pages = "12--19 ", month = "", note = "", abstract = "", keywords = "", source = "", } 数据挖掘研究院
• Murat Kantarcio\&\#487;lu and Jiashun Jin and Chris Clifton, "When do data mining results violate privacy?," in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004. [link] 数据挖掘研究院
[abstract]
[showhide]
Privacy-preserving data mining has concentrated on obtaining valid results when the input data is private. An extreme example is Secure Multiparty Computation-based methods, where only the results are revealed. However, this still leaves a potential privacy breach: Do the results themselves violate privacy? This paper explores this issue, developing a framework under which this question can be addressed. Metrics are proposed, along with analysis that those metrics are consistent in the face of apparent problems.
[bib]
[showhide]
数据挖掘研究院
@inproceedings{1014126, author = {Murat Kantarcio\&\#487;lu and Jiashun Jin and Chris Clifton}, title = {When do data mining results violate privacy?}, booktitle = {KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining}, year = {2004}, isbn = {1-58113-888-1}, pages = {599--604}, location = {Seattle, WA, USA}, doi = {http://doi.acm.org/10.1145/1014052.1014126}, publisher = {ACM Press}, address = {New York, NY, USA}, }
• Shipra Agrawal, Vijay Krishnan, Jayant Haritsa, "On Addressing Efficiency Concerns in Privacy Preserving Data Mining," In Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA-2004). [link] 数据挖掘研究院
[abstract]
[showhide] 数据挖掘实验室
Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To encourage users to provide correct inputs, we recently proposed a data distortion scheme for association rule mining that simultaneously provides both privacy to the user and accuracy in the mining results. However, mining the distorted database can be orders of magnitude more time-consuming as compared to mining the original database. In this paper, we address this issue and demonstrate that by (a) generalizing the distortion process to perform symbol-specific distortion, (b) appropriately choosing the distortion parameters, and (c) applying a variety of optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved.
数据挖掘研究院
[bib]
[showhide] 数据挖掘实验室
''this is the bibtex entry ...'' 数据挖掘研究院
• Chris Clifton and Murat Kantarcioglu and Jaideep Vaidya, "Defining Privacy for Data Mining," in Proceedings of the National Science Foundation Workshop on Next Generation Data Mining, November 1-3, 2002, Baltimore, MD. [link]
数据挖掘研究院
[abstract]
[showhide]
Privacy preserving data mining – getting valid data mining results without learning the underlying data values – has been receiving attention in the research community and beyond. It is unclear what privacy preserving means. This paper provides a framework and metrics for discussing the meaning of privacy preserving data mining, as a foundation for further research in this field.
[bib]
[showhide]
数据挖掘研究院
''this is the bibtex entry ...'' 数据挖掘研究院
数据挖掘研究院
Additive Data Perturbation
• R. Agrawal and R. Srikant, “Privacy Preserving Data Mining,” Proc. ACM SIGMOD Conf. Management of Data, pp. 439-450, May 2000. [link]
[abstract]
[showhide]
数据挖掘研究院
A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers built with the original data.
[bib]
[showhide] 数据挖掘研究院
@INPROCEEDINGS{Agrawal:00, AUTHOR = "R. Agrawal and R. Srikant", TITLE = "Privacy Preserving Data Mining", BOOKTITLE = "Proceedings of the ACM SIGMOD Conference on Management of Data", YEAR = "2000", editor = "", volume = "", number = "", series = "", pages = "439--450", address = "Dallas, TX", month = "May", organization = "", publisher = "", note = "", abstract = "", keywords = "", file = F }
• D. Agrawal and C. C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," Proc. ACM SIGMOD, pp. 247-255, 2001. [link] 数据挖掘研究院
[abstract]
[showhide] 数据挖掘研究院
The increasing ability to track and collect large amounts of data with the use of current hardware technology has lead to an interest in the development of data mining algorithms which preserve user privacy. A recently proposed technique addresses the issue of privacy preservation by perturbing the data and reconstructing distributions at an aggregate level in order to perform the mining. This method is able to re- tain privacy while accessing the information implicit in the original attributes. The distribution reconstruction process naturally leads to some loss of information which is accept- able in many practical situations. This paper discusses an Expectation Maximization (EM) algorithm for distribution reconstruction which is more effective than the currently available method in terms of the level of information loss. Specifically, we prove that the EM algorithm converges to the maximum likelihood estimate of the original distribu- tion based on the perturbed data. We show that when a large amount of data is available, the EM algorithm pro- vides robust estimates of the original distribution. We pro- pose metrics for quantification and measurement of privacy- preserving data mining algorithms. Thus, this paper pro- vides the foundations for measurement of the effectiveness of privacy preserving data mining algorithms. Our privacy metrics illustrate some interesting results on the relative ef- fectiveness of different perturbing distributions.
数据挖掘实验室
[bib]
[showhide] 数据挖掘研究院
@INPROCEEDINGS{Agrawal:01, AUTHOR = "D. Agrawal and C. C. Aggarwal" TITLE = "On the Design and Quantification of Privacy Preserving Data Mining Algorithms", BOOKTITLE = "Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of Database Systems", YEAR = "2001", editor = "", volume = "", number = "", series = "", pages = "247--255", address = "Santa Barbara, CA", month = "", organization = "", publisher = "", note = "", abstract = "", keywords = "", } 数据挖掘研究院
• H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the Privacy Preserving Properties of Random Data Perturbation Techniques,” Proc. IEEE Int’l Conf. Data Mining, Nov. 2003. [link]
[abstract]
[showhide]
Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them use randomized data distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values often using additive noise. This paper questions the utility of the random value distortion technique in privacy preservation. The paper notes that random objects (particularly random matrices) have "predictable" structures in the spectral domain and it develops a random matrix-based spectral filtering technique to retrieve original data from the dataset distorted by adding random values. The paper presents the theoretical foundation of this filtering method and extensive experimental results to demonstrate that in many cases random data distortion preserve very little data privacy. The paper also points out possible avenues for the development of new privacy-preserving data mining techniques like exploiting multiplicative and colored noise for preserving privacy in data mining applications. 数据挖掘研究院
[bib]
[showhide] 数据挖掘实验室
@INPROCEEDINGS{Kargupta:03a, AUTHOR = {H. Kargupta and S. Datta and Q. Wang and K. Sivakumar}, TITLE = "On the Privacy Preserving Properties of Random Data Perturbation Techniques", BOOKTITLE = {{Proceedings of the IEEE International Conference on Data Mining}}, YEAR = {2003}, EDITOR = {}, PAGES = {99}, VOLUME = {}, NUMBER = {}, SERIES = {}, ADDRESS = {Melbourne, FL}, MONTH = {November}, NOTE = {}, KEY = {}, KEYWORDS = {}, ISBN = {}, URL = {}, ABSTRACT = {}, }
• K. Muralidhar and R. Sarathy, "Security of Random Data Perturbation Methods," ACM Transactions on Database Systems, Vol. 24, No. 4, December 1999, Pages 487–493. [link]
数据挖掘研究院
[abstract]
[showhide] 数据挖掘研究院
Statistical databases often use random data perturbation (RDP) methods to protect against disclosure of confidential numerical attributes. One of the key requirements of RDP methods is that they provide the appropriate level of security against snoopers who attempt to obtain information on confidential attributes through statistical inference. In this study, we evaluate the security provided by three methods of perturbation. The results of this study allow the database administrator to select the most effective RDP method that assures adequate protection against disclosure of confidential information. 数据挖掘研究院
[bib]
[showhide]
数据挖掘研究院
''this is the bibtex entry ...'' 数据挖掘研究院
• K. Muralidhar and R. Sarathy, "A theoretical basis for perturbation methods," Statistics and Computing 13: 329–335, 2003. [link] 数据挖掘研究院
[abstract]
[showhide] 数据挖掘研究院
In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as that of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables. 数据挖掘研究院
[bib]
[showhide]
''this is the bibtex entry ...'' 数据挖掘研究院
• Alexandre Evfimievski, "Randomization in Privacy Preserving Data Mining," ACM SIGKDD Explorations Newsletter, Volume 4, Issue 2, Pages: 43-48, December 2002. [link]
[abstract]
[showhide] 数据挖掘研究院
Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.
数据挖掘研究院
[bib]
[showhide] 数据挖掘实验室
''this is the bibtex entry ...'' 数据挖掘研究院
• A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting Privacy Breaches in Privacy Preserving Data Mining,” Proc. ACM SIGMOD/PODS Conf., June 2003. [link]
数据挖掘研究院
[abstract]
[showhide] 数据挖掘研究院
There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average.In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it is possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization. 数据挖掘实验室
[bib]
[showhide]
数据挖掘实验室
@INPROCEEDINGS{Evfimievski:03, AUTHOR = "A. Evfimevski and J. Gehrke and R. Srikant", TITLE = "Limiting Privacy Breaches in Privacy Preserving Data Mining", BOOKTITLE = "Proceedings of the ACM SIGMOD/PODS Conference", YEAR = "2003", editor = "", volume = "", number = "", series = "", pages = "211--222", address = "San Diego, CA", month = "June", organization = "", note = "", abstract = "", keywords = "", } 数据挖掘实验室
• C. W. Wu, "Privacy Preserving Data Mining: a signal processing perspective and a simple data perturbation protocol," 2nd Workshop on Privacy Preserving Data Mining, IEEE International Conference on Data Mining, Melbourne, FL, 2003. [link] 数据挖掘研究院
[abstract]
[showhide]
Privacy concerns over the proliferation of gathering of personal information by various institutions over the internet led to the development of data mining algorithms that preserve the privacy of those whose personal data are collected and analyzed. A novel approach to such privacy preserving data mining algorithms was proposed where the individual datum in a data set is perturbed by adding a random value from a known distribution. In these applications, the distribution of the original data set is important and estimating it is one of the goals of the data mining algorithm. This distribution is estimated via an iterative algorithmsuch as the ExpectationMaximization (EM) algorithmwhich was shown to have desirable properties such as low privacy loss and high fidelity estimates of the distribution. Each iteration of EM requires computation that is proportional to the size of the data set and can require large computation time to estimate the distribution. In this paper we propose two ways to reduce the amount of computation. First, we show that the problem can be recast as a deconvolution problem and signal processing algorithms can be applied to solve this problem. In particular we consider both a direct method and iterative methods which are more robust against noise and ill-conditioning. We show that the Richardson-Lucy deblurring algorithm is equivalent to EM after quantization. The signal processing approach also shows how the choice of perturbation affects information loss and privacy loss and allows us to clarify some points made in the literature. 数据挖掘研究院
In the second part of this paper, we propose a scheme for perturbing data which also has the nice properties of arbitrarily small privacy loss and arbitrarily high fidelity in the estimate. The main advantage of the proposed scheme is the simplicity of the estimation algorithm. In contrast to iterative algorithms such as EM, the proposed scheme estimates the unknown distribution in one step. This is significant in applications where the data set is very large or when the data mining algorithm is run in an online environment.
数据挖掘研究院
[bib]
[showhide] 数据挖掘研究院
''this is the bibtex entry ...'' 数据挖掘实验室
• Z. Huang and W. Du and B. Chen, "Deriving Private Information from Randomized Data," in Proceedings of the 2005 ACM SIGMOD Conference, pp. 37-48, Baltimroe, MD, 2005. [link]
[abstract]
[showhide]
数据挖掘研究院
Randomization has emerged as a useful technique for data disguising in privacy-preserving data mining. Its privacy properties have been studied in a number of papers. Kar- gupta et al. challenged the randomization schemes, and they pointed out that randomization might not be able to preserve privacy. However, it is still unclear what factors cause such a security breach, how they affect the privacy preserving property of the randomization, and what kinds of data have higher risk of disclosing their private contents even though they are randomized. 数据挖掘研究院
We believe that the key factor is the correlations among 数据挖掘研究院
attributes. We propose two data reconstruction methods that are based on data correlations. One method uses the Principal Component Analysis (PCA) technique, and the other method uses the Bayes Estimate (BE) technique. We have conducted theoretical and experimental analysis on the relationship between data correlations and the amount of private information that can be disclosed based our proposed data reconstructions schemes. Our studies have shown that when the correlations are high, the original data can be re- constructed more accurately, i.e., more private information can be disclosed.
数据挖掘研究院
To improve privacy, we propose a modified randomization
数据挖掘研究院
scheme, in which we let the correlation of random noises "similar" to the original data. Our results have shown that the reconstruction accuracy of both PCA-based and BE- based schemes become worse as the similarity increases.
数据挖掘实验室
[bib]
[showhide]
@INPROCEEDINGS{Huang_05, AUTHOR = "Z. Huang and W. Du and B. Chen", TITLE = "Deriving Private Information from Randomized Data", BOOKTITLE = "Proceedings of the 2005 ACM SIGMOD Conference", YEAR = "2005", editor = "", volume = "", number = "", series = "", pages = "37--48", address = "Baltimroe, MD", month = "June", organization = "", publisher = "", note = "", abstract = "", keywords = "", file = F }
• S. Guo and X. Wu, "On the Use of Spectral Filtering for Privacy Preserving Data Mining," Proceedings of the 21st ACM Symposium on Applied Computing (SAC06), Dijon, France, April 23-27, pp. 622-626, 2006. [link] 数据挖掘研究院
[abstract]
[showhide] 数据挖掘研究院
Randomization has been a primary tool to hide sensitive pri- vate information during privacy preserving data mining.The previous work based on spectral ¯ltering, show the noise may be separated from the perturbed data under some conditions and as a result privacy can be seriously compromised. In this paper, we explicitly assess the e®ects of perturbation on the accuracy of the estimated value and give the explicit rela- tion on how the estimation error varies with perturbation. In particular, we derive one upper bound for the Frobenius norm of reconstruction error. This upper bound may be ex- ploited by attackers to determine how close their estimates are from the original data using spectral ¯ltering technique, which imposes a serious threat of privacy breaches.
[bib]
[showhide] 数据挖掘研究院
@INPROCEEDINGS{Guo_06, AUTHOR = "S. Guo and X. Wu", TITLE = "On the Use of Spectral Filtering for Privacy Preserving Data Mining", BOOKTITLE = "Proceedings of the 21st ACM Symposium on Applied Computing (SAC'06)", YEAR = "2006", editor = "", volume = "", number = "", series = "", pages = "622--626", address = "Dijon, France", month = "April", organization = "", publisher = "", note = "", abstract = "", keywords = "", file = F } 数据挖掘研究院
• S.Guo, X.Wu and Y. Li, "On the Lower Bound of Reconstruction Error for Spectral Filting," Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD06), Berlin, Germany, Sept 18-22, 2006. [link] 数据挖掘研究院
[abstract]
[showhide] 数据挖掘实验室
[bib]
[showhide]
数据挖掘实验室
@INPROCEEDINGS{, AUTHOR = "S.Guo and X.Wu and Y. Li", TITLE = "On the Lower Bound of Reconstruction Error for Spectral Filting", BOOKTITLE = "Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06)", YEAR = "2006", editor = "", volume = "", number = "", series = "", pages = "", address = "Berlin, Germany", month = "September", organization = "", publisher = "", note = "", abstract = "", keywords = "", file = F } 数据挖掘研究院