Attacking and Protecting Public Data with Differential Privacy
Garfinkel, Simson L.
Publishing exact statistical data creates mathematical risks and vulnerabilities that have only recently been appreciated. In 2010, the U.S. Census Bureau collected information on more than 308 million residents and published more than 8 billion statistics. Last year a Census Bureau red team performed a simulated attack against this public dataset and was able to reconstruct all of the confidential microdata used in these tabulations with very limited error. The team matched 45% of the reconstructed records to commercial datasets acquired between 2009 and 2011, and 38% of these matches were confirmed in the original 2010 confidential microdata. These rates represent vulnerability levels more than a thousand times higher than had previously been considered acceptable. As a result of this internal test, the Census Bureau has adopted a new privacy protection methodology, differential privacy, to protect the data publications of the 2020 Census. Differential privacy is based on systematically adding statistical "noise" to data products prior to publication. By carefully controlling how the noise is added, and through the use of advanced post-processing, the Census Bureau is able to ensure the analytical validity of its statistical publications while protecting the underlying confidential data on which those publications are based. It is hypothesized that similar approaches could be used to protect other kinds of data products that must be shared outside of a trusted community, such as statistical models and cyber threat intelligence.
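To make the idea of controlled noise concrete, the following is a minimal sketch of the textbook Laplace mechanism applied to a single count, followed by a trivial post-processing step. It is an illustration only and is not the Census Bureau's production method, which the abstract does not specify. The function names, the example count of 37, and the epsilon values are all assumptions chosen for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two independent exponential variables, each with
    # mean `scale`, follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus Laplace noise with scale sensitivity / epsilon.

    Assumption: adding or removing one person changes the count by at most
    `sensitivity` (1 for a simple head count). Smaller epsilon means more
    noise and stronger privacy protection.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def post_process(noisy_value):
    """Illustrative post-processing: round to an integer and clamp at zero."""
    return max(0, round(noisy_value))


if __name__ == "__main__":
    rng = random.Random(2020)   # fixed seed so the demonstration is repeatable
    block_population = 37       # hypothetical confidential count
    for epsilon in (0.1, 1.0, 10.0):
        released = post_process(noisy_count(block_population, epsilon, rng))
        print(f"epsilon={epsilon}: released count = {released}")
```

Running the script prints one released count per epsilon value. Under these assumptions, lower epsilon values tend to produce released counts farther from the confidential value, which is the trade-off between privacy protection and accuracy that the noise-adding method has to manage.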