Diagnosing performance bottlenecks in HPC applications
MetadataShow full item record
The software performance optimizations process is one of the most challenging aspects of developing highly performant code because underlying performance limitations are hard to diagnose. In many cases, identifying performance bottlenecks, such as latency stalls, requires a combination of fidelity and usability that existing tools do not provide: traditional performance models and runtime analysis lack the granularity necessary to uncover low-level bottlenecks; while, architectural simulations are too cumbersome and fragile to employ as a primary source of information. To address this need, we propose a performance analysis technique, called Pressure Point Analysis (PPA), which delivers the accessibility of analytical models with the precision of a simulator. The foundation of this approach is based on an autotuning-inspired technique that dynamically perturbs binary code (e.g., inserting/deleting instructions to affect utilization of functional units, altering memory access addresses to change cache hit rate, or swapping registers to alter instruction level dependencies) to then analyze the effects various perturbations have on the overall performance. When systematically applied, a battery of carefully designed perturbations, which target specific microarchitectural features, can glean valuable insight about pressure points in the code. PPA provides actionable information about hardware-software interactions that can be used by the software developer to manually tweak the application code. In some circumstances the performance bottlenecks are unavoidable, in which case this analysis can be used to establish a rigorous performance bound for the application. In other cases, this information can identify the primary performance limitations and project potential performance improvements if these bottlenecks are mitigated.