Relaxing coherence for modern learning applications
Lee, Joo Hwan
MetadataShow full item record
The main objective of this research is to efficiently execute learning (model training) of modern machine learning (ML) applications. The recent explosion in data has led to the emergence of data-intensive ML applications whose key phase is learning that requires significant amounts of computation. A unique characteristic of learning is that it is iterative- convergent, where a consistent view of memory does not always need to be guaranteed such that parallel workers are allowed to compute using stale values in intermediate computations to relax certain read-after-write data dependencies. While multiple workers read-and- modify shared model parameters multiple times during learning, incurring multiple data communication between workers, most of the data communication is redundant, due to the stale value tolerant characteristic. Relaxing coherence for these learning applications has the potential to provide extraordinary performance and energy benefits but requires innovations across the system stack from hardware and software. While considerable effort has utilized the stale value tolerance on distributed learning, still inefficient utilization of the full performance potential of this characteristic has caused modern ML applications to have low execution efficiency on the state-of-the-art systems. The inefficiency mainly comes from the lack of architectural considerations and detailed understanding of the different stale value tolerance of different ML applications. Today’s architecture, designed to cater to the needs of more traditional workloads, incurs high and often unnecessary overhead. The lack of detailed understanding has led to ambiguity for the stale value tolerance thus failing to take the full performance potential of this characteristic. This dissertation presents several innovations regarding this challenge. First, this dissertation proposes Bounded Staled Sync (BSSync), hardware support for the bounded staleness consistency model, which accompanies simple logic layers in the memory hierarchy, for reducing atomic operation overhead on data synchronization intensive workloads. The long latency and serialization caused by atomic operations have a significant impact on performance. The proposed technique overlaps the long latency atomic operation with the main computation. Compared to previous work that allows stale values for read operations, BSSync utilizes staleness for write operations, allowing stale- writes. It reduces the inefficiency coming from the data movement between where they are stored and where they are processed. Second, this dissertation presents StaleLearn, a learning acceleration mechanism to reduce the memory divergence overhead of GPU learning with sparse data. Sparse data induces divergent memory accesses with low locality, thereby consuming a large fraction of total execution time on transferring data across the memory hierarchy. StaleLearn trans- forms the problem of divergent memory accesses into the synchronization problem by replicating the model, and reduces the synchronization overhead by asynchronous synchronization on Processor-in-Memory (PIM). The stale value tolerance makes possible to clearly decompose tasks between the GPU and PIM, which can effectively exploit parallelism be- tween PIM and GPU cores by overlapping PIM operations with the main computation on GPU cores. Finally, this dissertation provides a detailed understanding of the different stale value tolerance of different ML applications. While relaxing coherence can reduce the data communication overhead, its complicated impact on the progress of learning has not been well studied thus leading to ambiguity for domain experts and modern systems. We define the stale value tolerance of ML training with the effective learning rate. The effective learning rate can be defined by the implicit momentum hyperparameter, the update density, the activation function selection, RNN cell types, and learning rate adaptation. Findings of this work will open further exploration of asynchronous learning including improving the findings laid out in this dissertation.