Improving robustness of DNS graph clustering against noise
Clustering is often the first step in finding structure within unlabeled datasets. Given a small set of labels, clustering also enables us to propagate these labels by discovering groups of objects that are similar to each other. The ever-growing amount of data collected over long periods of time presents both opportunities and challenges for clustering. Analyzing such long-term datasets allows us to address evolving security problems such as botnet forensic analysis, early warning of new threats, and the study of how security phenomena evolve. However, the analysis also faces the challenge presented by noise in the data. This thesis improves the robustness of clustering against noise by focusing on DNS graphs. Noise is either inherent in the dataset or injected by adversaries. The first goal of the thesis is to remediate the effect of the noise inherent in the data. To that end, we perform measurement studies from two different vantage points in the online advertising ecosystem. As a multi-billion dollar industry, the online ad ecosystem naturally attracts ad abuse from miscreants. We propose a new clustering technique to automatically analyze the cost to advertisers of impression fraud generated by the TDSS/TDL4 botnet over four years. In addition, our measurement results, taken from the vantage point of a Demand Side Platform provider, show statistically significant differences between blacklisted publishers and those that were never blacklisted. The second goal of the thesis is to increase the robustness of clustering against adversarial noise. Little work has been done in adversarial clustering to understand the weaknesses of clustering systems. We propose two novel attacks: one that injects noise into existing clusters, and one that moves data points to noisy clusters.
After analyzing the effectiveness and the cost of attacks, we present defense techniques that improve the robustness of clustering in adversarial settings.