Simultaneous Pattern and Data Hiding in Unsupervised Learning

Jie Wang, Lian Liu, Dianwei Han, and Jun Zhang
Laboratory for High Performance Scientific Computing and Computer Simulation
Department of Computer Science
University of Kentucky
Lexington, KY 40506-0046, USA

Abstract

Data mining techniques enable the discovery of valuable patterns and knowledge in shared data, which can increase profitability and enhance national security. However, the security and privacy threats that arise from the use of these techniques bring a risk that confidential knowledge is disclosed when data is made public. Controlling the level of knowledge disclosure and securing certain confidential patterns is a subtask comparable to confidential data hiding in privacy-preserving data mining. We propose a technique that simultaneously hides data values and confidential patterns without the undesirable side effect of distorting nonconfidential patterns. We use nonnegative matrix factorization to distort the original dataset while preserving its overall characteristics. A factor swapping method is designed to hide particular confidential patterns in unsupervised learning. The effectiveness of this novel hiding technique is examined by conducting k-means clustering on a benchmark dataset. Experimental results indicate that our technique can produce a single modified dataset that achieves both pattern and data value hiding while maintaining the usability of the data. Under certain constraints on the nonnegative matrix factorization iterations, an optimal solution can be computed in which the user-specified confidential memberships or relationships are hidden without undesirable alterations to nonconfidential patterns.
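
The general idea of distorting a dataset through a nonnegative factorization and then altering the factors of a confidential record can be illustrated with a minimal sketch. It is not the procedure of the report: the function names nmf, swap_factors, and hide are hypothetical, the factorization uses the standard Lee-Seung multiplicative updates as an assumed stand-in, and the swap rule (exchanging the largest and smallest factor weights of a row) is only one plausible reading of "factor swapping".

```python
import numpy as np

def nmf(A, k, iters=200, seed=0):
    """Approximate A (n x m, nonnegative) as W @ H with W, H >= 0.

    Uses Lee-Seung multiplicative updates; an assumed stand-in for
    whatever NMF variant and iteration constraints the report uses.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

def swap_factors(W, row):
    """Hypothetical swap rule: exchange the largest and smallest
    factor weights of one row, changing its dominant factor."""
    W = W.copy()
    hi, lo = np.argmax(W[row]), np.argmin(W[row])
    W[row, [hi, lo]] = W[row, [lo, hi]]
    return W

def hide(A, k, confidential_rows, iters=200, seed=0):
    """Return the distorted dataset plus the original and swapped W."""
    W, H = nmf(A, k, iters, seed)
    W_swapped = W
    for r in confidential_rows:
        W_swapped = swap_factors(W_swapped, r)
    # The product is released instead of A: every value is distorted by
    # the low-rank approximation, and only the confidential rows have
    # their factor weights rearranged.
    return W_swapped @ H, W, W_swapped
```

In this sketch the released matrix replaces every original value with its low-rank approximation (data value hiding), while only the rows listed in confidential_rows have their dominant factor changed (pattern hiding). Whether a clustering algorithm such as k-means then assigns those rows to a different cluster depends on the data and on the swap rule, and would have to be checked empirically.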


Key words: Data distortion, nonnegative matrix factorization, clustering, privacy, data mining

Mathematics Subject Classification:


Download the PDF file jiewang5.pdf.
Technical Report No. 487-07, Department of Computer Science, University of Kentucky, Lexington, KY, 2007.

The research work of Jun Zhang was supported in part by the U.S. National Science Foundation under grant CCF-0527967, in part by the National Institutes of Health under grant 1R01HL086644-01, in part by the Kentucky Science and Engineering Foundation under grant KSEF-148-502-06-186, and in part by the Alzheimer's Association under grant NIGR-06-25460.