Show simple item record

dc.contributor.author: Venkatasubramanian, Sundaresan [en_US]
dc.date.accessioned: 2009-08-26T18:14:31Z
dc.date.available: 2009-08-26T18:14:31Z
dc.date.issued: 2009-05-18 [en_US]
dc.identifier.uri: http://hdl.handle.net/1853/29728
dc.description.abstract: We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single and double precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060. Motivated to find a still faster implementation, we further consider "wildly asynchronous" implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading off more flops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly "fast-and-loose" algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs. [en_US]
dc.publisher: Georgia Institute of Technology [en_US]
dc.subject: Hybrid [en_US]
dc.subject: High performance computing [en_US]
dc.subject: Architecture [en_US]
dc.subject: Chaotic relaxation [en_US]
dc.subject: Tesla [en_US]
dc.subject: Linear system of equations [en_US]
dc.subject: Numerical methods [en_US]
dc.subject: Occupancy [en_US]
dc.subject: Algorithms [en_US]
dc.subject: Experimentation [en_US]
dc.subject: Performance [en_US]
dc.subject: Scientific computing [en_US]
dc.subject: Gauss-Seidel [en_US]
dc.subject: Shared memory [en_US]
dc.subject: Coalesced memory [en_US]
dc.subject: Bank conflicts [en_US]
dc.subject: GPU [en_US]
dc.subject: CUDA [en_US]
dc.subject: Nvidia [en_US]
dc.subject: Heterogeneous [en_US]
dc.subject: CPU [en_US]
dc.subject.lcsh: Iterative methods (Mathematics)
dc.subject.lcsh: Kernel functions
dc.title: Tuned and asynchronous stencil kernels for CPU/GPU systems [en_US]
dc.type: Thesis [en_US]
dc.description.degree: M.S. [en_US]
dc.contributor.department: Computing [en_US]
dc.description.advisor: Committee Chair: Vuduc, Richard; Committee Member: Kim, Hyesoon; Committee Member: Vetter, Jeffrey [en_US]
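
The abstract contrasts a synchronized Jacobi iteration with relaxed-synchronization variants. The following is a minimal serial sketch of that contrast, not the thesis implementation: it assumes the model problem -laplace(u) = f on the unit square with zero boundary values and a 5-point stencil, and all function names (make_grid, jacobi_sweep, in_place_sweep, residual_max, solve) are hypothetical. The in-place sweep is only a stand-in for unsynchronized updates; in this fixed serial order it coincides with Gauss-Seidel, whereas a truly asynchronous GPU run mixes old and new values in an order that is not fixed.

```python
"""Jacobi vs. an in-place (relaxed) sweep for -laplace(u) = f on the unit square."""


def make_grid(n, value=0.0):
    # n x n grid stored as a list of rows; boundary rows/columns are included.
    return [[value] * n for _ in range(n)]


def jacobi_sweep(u, f, h):
    # Synchronized sweep: every update reads only values from the previous
    # iterate, so a full copy (the "barrier" between iterations) is required.
    n = len(u)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]
                                + h * h * f[i][j])
    return new


def in_place_sweep(u, f, h):
    # Relaxed sweep (illustrative assumption): updates overwrite u directly,
    # so later points see a mix of old and already-updated neighbours.
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                              + u[i][j - 1] + u[i][j + 1]
                              + h * h * f[i][j])
    return u


def residual_max(u, f, h):
    # Max-norm of f + laplace_h(u) over interior points.
    n = len(u)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                   - 4.0 * u[i][j]) / (h * h)
            worst = max(worst, abs(f[i][j] + lap))
    return worst


def solve(sweep, n=17, tol=1e-6, max_iters=10000):
    # Iterate the given sweep from a zero initial guess with f = 1 everywhere.
    h = 1.0 / (n - 1)
    u = make_grid(n)
    f = make_grid(n, 1.0)
    for it in range(1, max_iters + 1):
        u = sweep(u, f, h)
        if residual_max(u, f, h) < tol:
            return u, it
    return u, max_iters


if __name__ == "__main__":
    _, it_jacobi = solve(jacobi_sweep)
    _, it_relaxed = solve(in_place_sweep)
    print("Jacobi iterations:  ", it_jacobi)
    print("In-place iterations:", it_relaxed)
```

Both sweeps are driven to the same residual tolerance, so they can be compared on iteration count at equal accuracy. The sketch says nothing about GPU timing; the speedups quoted in the abstract come from the thesis measurements, not from this code.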

