Generalized N-body problems: a framework for scalable computation
Riegel, Ryan Nelson
MetadataShow full item record
In the wake of the Big Data phenomenon, the computing world has seen a number of computational paradigms developed in response to the sudden need to process ever-increasing volumes of data. Most notably, MapReduce has proven quite successful in scaling out an extensible class of simple algorithms to even hundreds of thousands of nodes. However, there are some tasks---even embarrassingly parallelizable ones---that neither MapReduce nor any existing automated parallelization framework is well-equipped to perform. For instance, any computation that (naively) requires consideration of all pairs of inputs becomes prohibitively expensive even when parallelized over a large number of worker nodes. Many of the most desirable methods in machine learning and statistics exhibit these kinds of all-pairs or, more generally, all-tuples computations; accordingly, their application in the Big Data setting may seem beyond hope. However, a new algorithmic strategy inspired by breakthroughs in computational physics has shown great promise for a wide class of computations dubbed generalized N-body problems (GNBPs). This strategy, which involves the simultaneous traversal of multiple space-partitioning trees, has been applied to a succession of well-known learning methods, accelerating each asymptotically and by orders of magnitude. Examples of these include all-k-nearest-neighbors search, k-nearest-neighbors classification, k-means clustering, EM for mixtures of Gaussians, kernel density estimation, kernel discriminant analysis, kernel machines, particle filters, the n-point correlation, and many others. For each of these problems, no overall faster algorithms are known. Further, these dual- and multi-tree algorithms compute either exact results or approximations to within specified error bounds, a rarity amongst fast methods. This dissertation aims to unify a family of GNBPs under a common framework in order to ease implementation and future study. We start by formalizing the problem class and then describe a general algorithm, the generalized fast multipole method (GFMM), capable of solving all problems that fit the class, though with varying degrees of speedup. We then show O(N) and O(log N) theoretical run-time bounds that may be obtained under certain conditions. As a corollary, we derive the tightest known general-dimensional run-time bounds for exact all-nearest-neighbors and several approximated kernel summations. Next, we implement a number of these algorithms in a commercial database, empirically demonstrating dramatic asymptotic speedup over their conventional SQL implementations. Lastly, we implement a fast, parallelized algorithm for kernel discriminant analysis and apply it to a large dataset (40 million points in 4D) from the Sloan Digital Sky Survey, identifying approximately one million quasars with high accuracy. This exceeds the previous largest catalog of quasars in size by a factor of ten and has since been used in a follow-up study to confirm the existence of dark energy.