Reconciling data privacy and utility in the era of big data
The widespread use of internet-connected mobile devices, the Internet of Things (IoT), and cloud computing has enabled the large-scale collection of personal data, including user profiles, daily activities, locations, photos, and health states, from millions or even billions of users across a wide range of scenarios such as mobile apps, smart homes, and cloud storage services. The availability of these huge datasets has driven breakthroughs in deep learning and an explosion of data-driven applications that enrich people's lives. At the same time, however, these datasets often encode privacy-sensitive information about individuals, which raises serious privacy concerns for society. It is therefore imperative to develop principled privacy-preserving approaches to harnessing the power of big data. This dissertation research contributes original ideas and innovative techniques for applying differential privacy, a rigorous mathematical framework that offers provable privacy guarantees, to protect data privacy while improving the trade-off between privacy and utility in the era of big data from three perspectives: data collection, data usage, and data publication. The first contribution of this dissertation research is the development of PIVE, a two-phase Bayesian differential location privacy framework that aims to protect users' location privacy in location-based services while preserving service quality. With the popularity of location-based services for navigation, point-of-interest recommendation, and social networking, the companies that offer such services can continuously collect users' locations. The collected location information opens the door to potential misuse and abuse, exposing users' travel patterns and revealing their health states and political views.
PIVE provides a Bayesian differentially private location perturbation mechanism that, before a location is reported to the servers, transforms the user's exact location into a perturbed location in a geo-indistinguishable way while remaining resilient against Bayesian attacks. This approach essentially augments differential location privacy by bounding the inference error of adversaries with specific prior knowledge, while enabling adaptive privacy control to improve utility and user experience. The second contribution of this dissertation research is the development of differentially private deep learning for protecting the privacy of the training data. Because of the breakthroughs in deep learning, more companies are interested in training deep neural networks on the data they collect to gain a competitive edge. However, a deep neural network usually has millions of model parameters, giving it an effective capacity large enough to encode the details of individual data records in those parameters. Our research addresses a collection of related topics within the context of deep learning with differential privacy. We provide a more refined analysis of the privacy loss of differentially private stochastic gradient descent (SGD) algorithms under different data batching strategies, including random reshuffling and random sampling. We also propose a family of methods for allocating the privacy budget non-uniformly across SGD iterations to improve model accuracy while retaining privacy guarantees. Finally, we propose a differentially private data synthesis approach for data publication. Because the collection of individual data by governments and corporations creates tremendous opportunities for knowledge-based decision making, there is demand for the exchange and publication of data among various parties. However, publishing data in its original form would violate individual privacy.
Instead, releasing synthetic data that mimics the original data offers a promising way to publish data in a privacy-preserving manner while still allowing rich data analytics. In particular, we propose using deep generative models with differentially private training for location data synthesis, compare our approach with conventional methods that rely on sophisticated feature engineering, and examine the utility of the synthesized data.
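
To make the differentially private SGD mentioned above more concrete, the following is a minimal sketch of one commonly used formulation: clip each per-example gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, and average. It is not the dissertation's implementation. The function names and the parameters `clip_norm`, `noise_multiplier`, and `lr` are assumptions introduced here for illustration, and the sketch does no privacy accounting.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One noisy SGD update (illustrative sketch, hypothetical names).

    Each per-example gradient is clipped, the clipped gradients are summed,
    Gaussian noise with standard deviation noise_multiplier * clip_norm is
    added to every coordinate, and the result is averaged over the batch.
    """
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    dim = len(params)
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    params = [0.0, 0.0]
    grads = [[3.0, 4.0], [0.0, 0.5]]  # toy per-example gradients
    print(dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0,
                      noise_multiplier=1.1, rng=rng))
```

In this sketch `noise_multiplier` is passed in per call, so a caller could in principle vary it from one iteration to the next; how the privacy budget should actually be allocated across iterations is the subject of the dissertation and is not reproduced here.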