Accelerated deep learning for the edge-to-cloud continuum: A specialized full stack derived from algorithms
Advances in high-performance computer architecture design have been a major driver of the rapid evolution of Deep Neural Networks (DNNs). Because of DNNs' insatiable demand for compute power, both the research community and industry have turned to accelerators to accommodate modern DNN computation. Furthermore, DNNs are gaining prevalence and have found applications across a wide spectrum of devices, from commodity smartphones to enterprise cloud platforms. However, there is no one-size-fits-all solution for this continuum of devices that can meet both the strict energy, power, and chip-area budgets of edge devices and the high performance requirements of enterprise-grade servers. To this end, this thesis designs a specialized compute stack for DNN acceleration across the edge-to-cloud continuum that flexibly matches the varying constraints of different devices and simultaneously exploits algorithmic properties to maximize the benefits of acceleration. First, this thesis explores a tight integration of Neural Network (NN) accelerators within massively parallel GPUs with minimal area overhead. We show that tightly coupling NN accelerators and GPUs can provide a significant gain in performance and energy efficiency across a diverse set of applications through neural acceleration, which approximates approximation-amenable regions of code using neural networks. Next, this thesis develops a full stack for accelerating DNN inference on FPGAs that aims to provide programmability, performance, and efficiency. We call our specialized compute stack DNNWEAVER, which encompasses (1) high-level algorithmic abstractions, (2) a flexible template accelerator architecture, and (3) a compiler that automatically and efficiently optimizes the template architecture to maximize DNN performance using the limited resources available on the FPGA die.
The third thrust of this thesis explores scale-out acceleration of training using cloud-scale FPGAs for a wide range of machine learning algorithms, including neural networks. The challenge here is to design an accelerator architecture that can scale up to efficiently use the large pool of compute resources available on modern cloud-grade FPGAs. To tackle this challenge, this thesis explores multi-threading to maximize the efficiency of FPGA acceleration by running multiple parallel threads of training. The fourth thrust of this thesis builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, the bitwidth required to prevent loss of accuracy varies significantly across DNNs and may even need to be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator would either offer limited benefits, because it must accommodate the worst-case bitwidth requirements, or inevitably degrade final accuracy. To alleviate these deficiencies, this thrust introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. The final thrust of this thesis explores mixed-signal acceleration to push accelerator efficiency to its limits. As such, it explores executing the low-bitwidth multiply-add operations prevalent in DNNs in the analog domain to gain significant efficiency benefits. Using low-bitwidth analog compute units enables us to overcome the limited range for information encoding, susceptibility to noise, and Analog to Digital (A/D) conversion overheads.
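As an illustration of the arithmetic identity behind bit-level fusion/decomposition, the following minimal Python sketch decomposes an unsigned multiplication into small partial products that are shifted and summed. It is a sketch under stated assumptions rather than the architecture developed in the thesis: the function names, the 2-bit chunk size, and the restriction to unsigned operands are all hypothetical choices made for illustration.

```python
def split_into_chunks(value, total_bits, chunk_bits=2):
    """Split an unsigned integer into little-endian chunks of `chunk_bits` bits.

    Hypothetical helper for illustration; `chunk_bits=2` is an assumed size.
    """
    if value < 0 or value >= (1 << total_bits):
        raise ValueError("value does not fit in the stated bitwidth")
    mask = (1 << chunk_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, chunk_bits)]


def fused_multiply(a, b, a_bits, b_bits, chunk_bits=2):
    """Multiply two unsigned operands using only small chunk-by-chunk products.

    Each partial product is shifted by the combined bit position of its two
    chunks and accumulated, which reproduces the full-width product.
    """
    a_chunks = split_into_chunks(a, a_bits, chunk_bits)
    b_chunks = split_into_chunks(b, b_bits, chunk_bits)
    total = 0
    for i, a_chunk in enumerate(a_chunks):
        for j, b_chunk in enumerate(b_chunks):
            total += (a_chunk * b_chunk) << (chunk_bits * (i + j))
    return total


def num_partial_products(a_bits, b_bits, chunk_bits=2):
    """Count the small multipliers one operation of the given bitwidths needs."""
    chunks_a = (a_bits + chunk_bits - 1) // chunk_bits
    chunks_b = (b_bits + chunk_bits - 1) // chunk_bits
    return chunks_a * chunks_b


if __name__ == "__main__":
    # An 8-bit x 8-bit product built from 2-bit x 2-bit partial products.
    print(fused_multiply(200, 77, 8, 8) == 200 * 77)
    # Lower bitwidths need fewer small multipliers per operation.
    for bits in (8, 4, 2):
        print(bits, num_partial_products(bits, bits))
```

Under these assumptions, an 8-bit by 8-bit multiplication uses 16 of the 2-bit multipliers, whereas a 2-bit by 2-bit multiplication uses only one. The same pool of small units can therefore perform more low-bitwidth operations in parallel, which is the intuition for why matching the hardware to each layer's bitwidth avoids provisioning for the worst case.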