ADVANCING COMPILER OPTIMIZATIONS FOR GENERAL-PURPOSE & DOMAIN-SPECIFIC PARALLEL ARCHITECTURES
Computer hardware is undergoing a major disruption as we approach the end of Moore's law, in the form of new advancements to general-purpose and domain-specific parallel architectures. Contemporaneously, the demand for higher performance is broadening across application domains ranging from scientific computing to deep learning and graph analytics. These trends raise a plethora of challenges for the de facto approach to achieving higher performance, namely application development using high-performance libraries. The challenges include the cost of porting and adapting libraries to multiple parallel architectures, the difficulty of keeping pace with rapidly advancing domains, and the inhibition of optimizations across library calls. Hence, there is renewed focus in industry and academia on advancing optimizing compilers to address these trends, but doing so requires enabling compilers to work effectively on a wide range of applications and to better exploit current and future parallel architectures. As summarized below, this thesis focuses on compiler advancements for current and future hardware trends.

First, we observe that software with explicit parallelism for general-purpose multi-core CPUs and GPUs is on the rise, yet current compiler frameworks are built on a foundation of optimizing sequential code. Our approach treats the explicit parallelism specified by the programmer as logical parallelism and uses it to refine the conservative dependence analysis inherent in compilers (which arises from program constructs such as pointer aliasing, unknown function calls, non-affine subscript expressions, recursion, and unstructured control flow). This approach makes it possible to combine user-specified parallelism and compiler-generated parallelism in a new unified polyhedral compilation framework (PoPP).
Second, despite the fact that compiler technologies for automatic vectorization targeting general-purpose SIMD (vector processing) units have been under development for over four decades, considerable gaps remain in the ability of modern compilers to perform automatic vectorization. One such gap is the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and output (write-after-write) dependences. A significant limitation of past work is the lack of a unified formulation that synergistically integrates multiple storage transformations to break these cycles, and that further unifies the formulation with loop transformations to enable vectorization. To address this limitation, we propose the PolySIMD approach.

Third, the efficiency of domain-specific spatial accelerators for Deep Learning (DL) depends heavily on the compiler's ability to generate optimized mappings, or code, for various DL operators (the building blocks of DL models, e.g., CONV2D, GEMM) onto the accelerator's compute and memory resources. However, the rapid emergence of new operators and new accelerators poses two key challenges to existing compilers: 1) the ability to perform fine-grained reasoning about the algorithmic aspects of new operators and the complex hardware structures of new accelerators to achieve peak performance, and 2) the ability to quickly explore the enormous space of possible mappings, involving various partitioning schemes, loop transformations, and data-layout choices, while still achieving high performance and energy efficiency. To address these challenges, we introduce a data-centric compiler, Marvel, for optimizing DL operators on flexible spatial accelerators. We also introduce a high-performance vectorizing compiler, Vyasa, for optimizing tensor convolutions on the specialized SIMD units of the Xilinx AI Engine.
Finally, with the emergence of a domain-specific thread-migratory architecture (EMU) to address the locality wall, we developed thread-migration-aware compiler optimizations to enhance the performance of graph analytics on the EMU machine. Our preliminary evaluation of compiler optimizations such as node fusion and edge flipping demonstrates significant benefits relative to the original programs.