dc.contributor.author: Chen, Keke (en_US)
dc.date.accessioned: 2006-09-01T19:33:08Z
dc.date.available: 2006-09-01T19:33:08Z
dc.date.issued: 2006-07-07 (en_US)
dc.identifier.uri: http://hdl.handle.net/1853/11561
dc.description.abstract: With the wide deployment of data-intensive Internet applications and continued advances in sensing technology and biotechnology, large multidimensional datasets, possibly containing privacy-sensitive information, have been emerging. Mining such datasets has become increasingly common in business integration, large-scale scientific data analysis, and national security. The proposed research aims at exploring the geometric properties of the multidimensional datasets utilized in statistical learning and data mining, and at providing novel techniques and frameworks for mining very large datasets while protecting the desired data privacy. The first main contribution of this research is the development of the iVIBRATE interactive visualization-based approach for clustering very large datasets. The iVIBRATE framework uniquely addresses the challenges in handling irregularly shaped clusters, domain-specific cluster definition, and cluster-labeling of the data on disk. It consists of the VISTA visual cluster rendering subsystem and the Adaptive ClusterMap Labeling subsystem. The second main contribution is the development of the "Best K Plot" (BKPlot) method for determining the critical clustering structures in multidimensional categorical data. The BKPlot method uniquely addresses two challenges in clustering categorical data: how to determine the number of clusters (the best K) and how to identify the existence of significant clustering structures. The method consists of the basic theory, the sample BKPlot theory for large datasets, and the testing method for identifying no-cluster datasets. The third main contribution of this research is the development of the theory of geometric data perturbation and its application in privacy-preserving data classification involving single-party or multiparty collaboration.
The key to geometric data perturbation is to find a good randomly generated rotation matrix and an appropriate noise component that together provide a satisfactory balance between privacy guarantee and data quality, considering possible inference attacks. When geometric perturbation is applied to collaborative multiparty data classification, it is challenging to unify the different geometric perturbations used by different parties. We study three protocols under the data-mining-service-oriented framework for unifying the perturbations: 1) the threshold-satisfied voting protocol, 2) the space adaptation protocol, and 3) the space adaptation protocol with a trusted party. The tradeoffs among the privacy guarantee, the model accuracy, and the cost are studied for the protocols. (en_US)
dc.format.extent: 2417820 bytes
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.publisher: Georgia Institute of Technology (en_US)
dc.subject: Geometric methods (en_US)
dc.subject: Information visualization
dc.subject: Data mining
dc.subject: Privacy-preserving data mining
dc.subject: Data clustering
dc.subject: Data classification
dc.subject: Distributed collaborative data mining
dc.subject: Categorical data clustering
dc.title: Geometric Methods for Mining Large and Possibly Private Datasets (en_US)
dc.type: Dissertation (en_US)
dc.description.degree: Ph.D. (en_US)
dc.contributor.department: Computing (en_US)
dc.description.advisor: Committee Chair: Liu, Ling; Committee Member: Bertino, Elisa; Committee Member: Lee, Chin-hui; Committee Member: Navathe, Shamkant; Committee Member: Omiecinski, Edward (en_US)

