Automatic identification and removal of low quality online information
The advent of the Internet has generated a proliferation of online information-rich environments, which provide information consumers with an unprecedented amount of freely available information. However, the openness of these environments has also made them vulnerable to a new class of attacks called Denial of Information (DoI) attacks. Attackers launch these attacks by deliberately inserting low quality information into information-rich environments, either to promote that information or to deny access to high quality information. These attacks directly threaten the usefulness and dependability of online information-rich environments, and as a result, an important research question is how to automatically identify and remove this low quality information from these environments. The first contribution of this thesis research is a set of techniques for automatically recognizing and countering various forms of DoI attacks in email systems. We develop a new DoI attack based on camouflaged messages, and we show that spam producers and information consumers are entrenched in a spam arms race. To break free of this arms race, we propose two solutions. One solution refines the statistical learning process by assigning disproportionate weights to spam and legitimate features, and the other leverages non-textual email features (e.g., URLs) to make the classification process more resilient against attacks. The second contribution of this thesis is a framework for collecting, analyzing, and classifying examples of DoI attacks in the World Wide Web. We propose a fully automatic Web spam collection technique and use it to create the Webb Spam Corpus, a first-of-its-kind, large-scale, and publicly available Web spam data set. Then, we perform the first large-scale characterization of Web spam using content and HTTP session analysis. Next, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information. The final contribution of this thesis research is a collection of techniques that detect and help prevent DoI attacks within social environments. First, we provide detailed descriptions of each of these attacks. Then, we propose a novel technique for capturing examples of social spam, and we use our collected data to perform the first characterization of social spammers and their behaviors.
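As a rough illustration of the idea of weighting spam and legitimate features disproportionately, the sketch below scores a message with a small naive-Bayes-style log-likelihood ratio and scales legitimate-leaning evidence separately from spam-leaning evidence. It is not the scheme used in the thesis, which this abstract does not specify: the function names `train` and `spam_score`, the parameters `spam_weight` and `legit_weight`, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, how many messages contain each token.

    `messages` is a list of (tokens, is_spam) pairs. Hypothetical helper
    for illustration only.
    """
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter()
    for tokens, is_spam in messages:
        counts[is_spam].update(set(tokens))  # presence, not frequency
        n_docs[is_spam] += 1
    return counts, n_docs


def spam_score(tokens, counts, n_docs, spam_weight=1.0, legit_weight=1.0):
    """Sum per-token log-likelihood ratios; a positive total leans spam.

    Spam-leaning tokens are scaled by `spam_weight` and legitimate-leaning
    tokens by `legit_weight` (both are assumed, illustrative parameters).
    """
    score = 0.0
    for tok in set(tokens):
        # Add-one smoothing so unseen tokens do not produce log(0).
        p_spam = (counts[True][tok] + 1) / (n_docs[True] + 2)
        p_legit = (counts[False][tok] + 1) / (n_docs[False] + 2)
        llr = math.log(p_spam / p_legit)
        score += (spam_weight if llr > 0 else legit_weight) * llr
    return score
```

Under these assumptions, setting `legit_weight` below 1.0 discounts legitimate-looking tokens, so a camouflaged message that pads spam content with legitimate vocabulary has a harder time pulling its total score below zero.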