Systems abstractions for big data processing on a single machine
Maass, Steffen Alexander
MetadataShow full item record
Large-scale internet services, such as Facebook or Google, are using clusters of many servers for problems such as search, machine learning, and social networks. However, while it may be possible to apply the tools used at this scale to smaller, more common problems as well, this dissertation presents approaches to large-scale data processing on only a single machine. This approach has obvious cost benefits and lowers the barrier of entrance to large-scale data processing. This dissertation approaches this problem by redesigning applications to enable trillion-scale graph processing on a single machine while also enabling the processing of evolving, billion-scale graphs. First, this dissertation presents a new out-of-core graph processing engine, called Mosaic, for executing graph algorithms on trillion-scale datasets on a single machine. Mosaic makes use of many-core processors and fast I/O devices coupled with a novel graph encoding scheme to allow processing of graphs of up to one trillion edges on a single machine. Mosaic also employs a locality-preserving, space-filling curve to allow for high compression and high locality when storing graphs and executing algorithms. Our evaluation shows that for smaller graphs, Mosaic consistently outperforms other state-of-the-art out-of-core engines by 3.2x–58.6x and shows comparable performance to distributed graph engines. Furthermore, Mosaic can complete one iteration of the Pagerank algorithm on a trillion-edge graph in 21 minutes, outperforming a distributed disk-based engine by 9.2x. Second, while Mosaic addresses the setting of processing static graph, this dissertation presents Cytom, a new engine for processing billion-scale evolving graphs based on insights about achieving high compression and locality while improving load-balancing when processing a graph that changes rapidly. Cytom introduces a novel programming model that takes advantage of its subgraph-centric approach coupled with the setting of evolving graphs. This is an important enabling step for emerging workloads when processing graphs that change over time. Cytom’s programming model allows algorithm developers to quickly react to graph updates, discarding uninteresting ones while focusing on updates that, in fact, change the algorithmic result. We show that Cytom is effective in scaling to billion-edge graphs, as well as providing higher throughput when updating the graph structure (2.0x–192x) and higher throughput (1.5x–200x) when additionally processing an algorithm.