Methodologies and tools for computation offloading on heterogeneous multicores
MetadataShow full item record
Frequency scaling in traditional computing systems has hit the power wall and multicore computing is here to stay. Unlike homogeneous multicores which have uniform architecture and instruction set across cores, heterogenous multicores have differentially capable cores to provide optimal performance for specialized functionality. However, this heterogeneity also translates into difficult programming models, and extracting its potential is not trivial. The Cell Broadband Engine by the Sony Toshiba IBM(STI) consortium was amongst the first heterogenous multicore systems with a single Power Processing Unit(PPU) and 8 Synergistic Processor Units (SPUs). We address the issue of porting an existing sequential C/C++ codebase on to the Cell BE through compiler driven program analysis and profiling. Until parallel programming models evolve, the "interim" solution to performance involves speeding up legacy code by offloading computationally intense parts of a sequential thread to the co-processor; thus using it as an accelerator. Unique architectural characteristics of an accelerator makes this problem quite challenging. On the Cell, these characteristics include limited local store of the SPU, high latency of data transfer between PPU and SPU, lack of branch prediction unit, limited SIMDizability, expensive scalar code etc. In particular, the designers of the Cell have opted for software controlled memory on its SPUs to reduce power consumption and to give programmers more control over the predictability of latency. The lack of a hardware cache on the SPU can create performance bottlenecks because any data that needs to be brought in to the SPU must be brought in using a DMA call. The need for supporting a software controlled cache is thus evident for irregular memory accesses on the SPU. For such a cache to result in improved performance, the amount of time spent in book-keeping and tracking at run-time should be minimal. Traditional algorithms like LRU, when implemented in software incur overheads on every cache hit because appropriate data structures need to be updated. Such overheads are on off critical path for traditional hardware cache but on the critical path for a software controlled cache. Thus there is a need for better management of "data movement" for the code that is offloaded on to the SPU. This thesis addresses the "code partitioning" problem as well as the "data movement" problem. We present GLIMPSES - a compiler driven profiling tool that analyzes existing C/C++ code for its suitability for porting to the Cell, and presents its results in an interactive visualizer. Software Controlled Cache - an improved eviction policy that exploits information gleaned from memory traces generated through offline profiling. The trace is analyzed to provide guidance for a run-time state machine within the cache manager; resulting in reduced run-time overhead and better performance. The design tradeoffs and several pros and cons of this approach are brought forth as well. It is shown that with just about the right amount of runtime book-keeping and decision making, one can get to the difficult solution space of the right balance to achieve high performance.