SMARTech   Library Home
 

Georgia Tech's Institutional Repository >
Georgia Tech Theses and Dissertations >
Georgia Tech Theses and Dissertations >

Please use this identifier to cite or link to this item: http://hdl.handle.net/1853/11561

Title: Geometric Methods for Mining Large and Possibly Private Datasets
Authors: Chen, Keke
Computing
Advisor: Committee Chair: Liu, Ling; Committee Member: Bertino, Elisa; Committee Member: Lee, Chin-hui; Committee Member: Navathe, Shamkant; Committee Member: Omiecinski, Edward
Subjects : Geometric methods
Information visualization
Data mining
Privacy-preserving data mining
Data clustering
Data classification
Distributed collaborative data mining
Categorical data clustering
Issue Date: 7-Jul-2006
Publisher: Georgia Institute of Technology
Abstract: With the wide deployment of data intensive Internet applications and continued advances in sensing technology and biotechnology, large multidimensional datasets, possibly containing privacy-conscious information have been emerging. Mining such datasets has become increasingly common in business integration, large-scale scientific data analysis, and national security. The proposed research aims at exploring the geometric properties of the multidimensional datasets utilized in statistical learning and data mining, and providing novel techniques and frameworks for mining very large datasets while protecting the desired data privacy. The first main contribution of this research is the development of iVIBRATE interactive visualization-based approach for clustering very large datasets. The iVIBRATE framework uniquely addresses the challenges in handling irregularly shaped clusters, domain-specific cluster definition, and cluster-labeling of the data on disk. It consists of the VISTA visual cluster rendering subsystem, and the Adaptive ClusterMap Labeling subsystem. The second main contribution is the development of ``Best K Plot'(BKPlot) method for determining the critical clustering structures in multidimensional categorical data. The BKPlot method uniquely addresses two challenges in clustering categorical data: How to determine the number of clusters (the best K) and how to identify the existence of significant clustering structures. The method consists of the basic theory, the sample BKPlot theory for large datasets, and the testing method for identifying no-cluster datasets. The third main contribution of this research is the development of the theory of geometric data perturbation and its application in privacy-preserving data classification involving single party or multiparty collaboration. The key of geometric data perturbation is to find a good randomly generated rotation matrix and an appropriate noise component that provides satisfactory balance between privacy guarantee and data quality, considering possible inference attacks. When geometric perturbation is applied to collaborative multiparty data classification, it is challenging to unify the different geometric perturbations used by different parties. We study three protocols under the data-mining-service oriented framework for unifying the perturbations: 1) the threshold-satisfied voting protocol, 2) the space adaptation protocol, and 3) the space adaptation protocol with a trusted party. The tradeoffs between the privacy guarantee, the model accuracy and the cost are studied for the protocols.
Type: Dissertation
URI: http://hdl.handle.net/1853/11561
Appears in Collections:Georgia Tech Theses and Dissertations
College of Computing Theses and Dissertations

Files in This Item:

File Description SizeFormat
chen_keke_200608_phd.pdf2.36 MBAdobe PDFView/Open

Items in SMARTech are protected by copyright, with all rights reserved, unless otherwise indicated.

 

Valid XHTML 1.0! DSpace Software Copyright © 2002-2007 MIT and Hewlett-Packard - Feedback