Neurocube: An energy-efficient programmable digital deep learning accelerator based on a processor-in-memory platform
Abstract
Deep learning, a machine learning approach based on artificial neural networks, has shown great success in numerous pattern recognition problems such as image and speech recognition. Most deep learning development, however, is based on software platforms running on general-purpose graphics processing units (GPUs), and in terms of efficiency, operating deep learning on GPUs is limited by the power/thermal budgets of mobile devices and high-performance computing clusters. In this thesis, I present a programmable and scalable deep learning accelerator based on 3D high-density memory integrated with a logic tier. The proposed architecture consists of clusters of processing engines (PEs), and the PE clusters access multiple memory channels (vaults) in parallel. The operating principle, referred to as memory-centric computing, embeds specialized state machines within the vault controllers of the Hybrid Memory Cube (HMC) to drive data into the PE clusters. The next version of NeuroCube is designed to improve the throughput of global (fully connected) layers in deep neural networks, which is critical for recurrent neural networks (RNNs). NeuroCube is then extended to accelerate deep learning training, which requires an additional optimized data flow to improve throughput for both inference and training. For gradient computation, it also supports 32-bit fixed point with stochastic rounding to prevent small gradients from vanishing. A programming model and supporting architecture utilize this flexible data flow to efficiently accelerate training of various types of DNNs. Cycle-level simulation and a synthesized design in 15 nm FinFET show a power efficiency of ~500 GFLOPS/W and nearly uniform throughput across a wide range of DNNs, including convolutional, recurrent, multi-layer perceptron, and mixed (CNN+RNN) networks.
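The abstract names 32-bit fixed point with stochastic rounding as the mechanism that keeps small gradient values from rounding to zero during training. Below is a minimal NumPy sketch of that rounding rule; the function names, the fractional width (frac_bits), and the omission of saturation to the 32-bit range are illustrative assumptions, not details taken from the NeuroCube hardware.

    import numpy as np

    def stochastic_round_fixed(x, frac_bits=24, rng=None):
        """Quantize floating-point values to fixed point with stochastic rounding.

        Round-to-nearest maps any value smaller than half an LSB to zero,
        which makes small gradients vanish; stochastic rounding instead
        rounds up with probability equal to the fractional remainder, so
        the value is preserved in expectation.
        """
        rng = np.random.default_rng() if rng is None else rng
        scaled = np.asarray(x, dtype=np.float64) * (1 << frac_bits)  # shift onto the integer grid
        floor = np.floor(scaled)
        frac = scaled - floor                    # remainder in [0, 1)
        up = rng.random(scaled.shape) < frac     # round up with probability frac
        return (floor + up).astype(np.int64)     # fixed-point integer (saturation omitted)

    def fixed_to_float(q, frac_bits=24):
        """Convert the fixed-point integer representation back to float."""
        return q / (1 << frac_bits)

For example, with frac_bits=24 a gradient of 2^-30 always rounds to zero under round-to-nearest, but under stochastic rounding it becomes one LSB (2^-24) with probability 2^-6, so its expected value of 2^-30 is preserved across many weight updates.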