Show simple item record

dc.contributor.advisorSarkar, Vivek
dc.contributor.advisorShrivastava, Anshumali
dc.contributor.authorMandal, Ankush
dc.date.accessioned2020-09-08T12:48:55Z
dc.date.available2020-09-08T12:48:55Z
dc.date.created2020-08
dc.date.issued2020-07-27
dc.date.submittedAugust 2020
dc.identifier.urihttp://hdl.handle.net/1853/63692
dc.description.abstractToday's data mining tasks aim to extract meaningful information from a large amount of data in a reasonable time mainly via means of --- a) algorithmic advances, such as fast approximate algorithms and efficient learning algorithms, and b) architectural advances, such as machines with massive compute capacity involving distributed multi-core processors and high throughput accelerators. For current and future generation processors, parallel algorithms are critical for fully utilizing computing resources. Furthermore, exploiting data properties for performance gain becomes crucial for data mining applications. In this work, we focus our attention on power-law behavior –-- a common property found in a large class of data, such as text data, internet traffic, and click-stream data. Specifically, we address the following questions in the context of power-law data: How well do the critical data mining algorithms of current interest fit with today's parallel architectures? Which algorithmic and mapping opportunities can be leveraged to further improve performance?, and What are the relative challenges and gains for such approaches? Specifically, we first investigate the suitability of the "frequency estimation" problem for GPU-scale parallelism. Sketching algorithms are a popular choice for this task due to their desirable trade-off between estimation accuracy and space-time efficiency. However, most of the past work on sketch-based frequency estimation focused on CPU implementations. In our work, we propose a novel approach for sketches, which exploits the natural skewness in the power-law data to efficiently utilize the massive amounts of parallelism in modern GPUs. Next, we explore the problem of "identifying top-K frequent elements" for distributed data streams on modern distributed settings with both multi-core and multi-node CPU parallelism. Sketch-based approaches, such as Count-Min Sketch (CMS) with top-K heap, have an excellent update time but lacks the important property of reducibility, which is needed for exploiting data parallelism. On the other end, the popular Frequent Algorithm (FA) leads to reducible summaries, but its update costs are high. Our approach Topkapi, gives the best of both worlds, i.e., it is reducible like FA and has an efficient update time similar to CMS. For power-law data, Topkapi possesses strong theoretical guarantees and leads to significant performance gains, relative to past work. Finally, we study Word2Vec, a popular word embedding method widely used in Machine learning and Natural Language Processing applications, such as machine translation, sentiment analysis, and query answering. This time, we target Single Instruction Multiple Data (SIMD) parallelism. With the increasing vector lengths in commodity CPUs, such as AVX-512 with a vector length of 512 bits, efficient vector processing unit utilization becomes a major performance game-changer. By employing a static multi-version code generation strategy coupled with an algorithmic approximation based on the power-law frequency distribution of words, we achieve significant reductions in training time relative to the state-of-the-art.
dc.format.mimetypeapplication/pdf
dc.publisherGeorgia Institute of Technology
dc.subjectData mining
dc.subjectPerformance optimization
dc.subjectParallel approximate algorithms
dc.subjectPower-law data
dc.subjectSketches
dc.subjectWord embedding
dc.subjectMulti-core
dc.subjectGPU
dc.titleEnabling parallelism and optimizations in data mining algorithms for power-law data
dc.typeDissertation
dc.description.degreePh.D.
dc.contributor.departmentComputer Science
thesis.degree.levelDoctoral
dc.contributor.committeeMemberKim, Hyesoon
dc.contributor.committeeMemberPande, Santosh
dc.contributor.committeeMemberVuduc, Richard
dc.date.updated2020-09-08T12:48:55Z


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record