Distributed learning and inference in deep models
In recent years, the size of deep learning problems has increased significantly, both in the number of available training samples and in the number of parameters and the complexity of the models. In this thesis, we considered the challenges encountered in training and inference of large deep models, especially on nodes with limited computational power and capacity. We studied two classes of related problems: 1) distributed training of deep models, and 2) compression and restructuring of deep models for efficient distributed and parallel execution to reduce inference times. In particular, we considered the communication bottleneck in distributed training and inference of deep models. Data compression is a viable tool to mitigate this bottleneck in distributed deep learning. However, existing methods suffer from drawbacks such as increased variance of the stochastic gradients (SG), slower convergence, or added bias in the SG. We addressed these challenges from three different perspectives: 1) Information Theory and the CEO Problem, 2) Indirect SG Compression via Matrix Factorization, and 3) Quantized Compressive Sampling. We showed, both theoretically and via simulations, that our proposed methods can achieve a smaller MSE than other unbiased compression methods at lower communication bit rates, resulting in superior convergence rates. Next, we considered federated learning over wireless multiple access channels (MAC). Efficient communication requires the communication algorithm to satisfy the constraints imposed by the nodes in the network and by the communication medium. To satisfy these constraints and take advantage of the over-the-air computation inherent in MAC, we proposed a framework based on random linear coding and developed efficient power management and channel usage techniques to manage the trade-off between power consumption and communication bit rate.
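To make the notion of an unbiased compression method concrete, the following is a minimal sketch of a generic stochastic uniform quantizer for a gradient vector. It is not one of the methods proposed in the thesis; the function names, the number of levels, and the sample gradient are all assumptions chosen for illustration. Each entry is rounded up or down at random so that its expectation equals the original value, at the cost of added variance.

```python
import random


def stochastic_quantize(grad, levels=4, rng=random):
    """Generic unbiased uniform quantizer (illustrative, not the thesis method).

    Each entry is mapped onto a grid with `levels` steps between 0 and the
    largest magnitude in `grad`, rounding up with probability equal to the
    fractional position, so that E[output] == grad elementwise.
    """
    scale = max(abs(g) for g in grad)
    if scale == 0.0:
        return list(grad)
    out = []
    for g in grad:
        x = abs(g) / scale * levels          # position in [0, levels]
        low = int(x)                         # floor, since x >= 0
        q = low + (1 if rng.random() < x - low else 0)
        sign = 1.0 if g >= 0 else -1.0
        out.append(sign * q * scale / levels)
    return out


def mean_of_quantized(grad, trials=20000, seed=0):
    """Average many independent quantizations to check unbiasedness."""
    rng = random.Random(seed)
    sums = [0.0] * len(grad)
    for _ in range(trials):
        for i, q in enumerate(stochastic_quantize(grad, rng=rng)):
            sums[i] += q
    return [s / trials for s in sums]


if __name__ == "__main__":
    g = [0.3, -0.7, 1.0, 0.05]               # hypothetical gradient
    print(stochastic_quantize(g, rng=random.Random(1)))
    print(mean_of_quantized(g))
```

A single quantized vector can differ noticeably from the input, but the average over many draws approaches it. This is the variance-for-bit-rate trade-off that the compression methods above aim to improve.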
In the second part of this thesis, we considered the distributed parallel implementation of an already-trained deep model on multiple workers. Since latency due to synchronization and data transfer among workers adversely affects the performance of the parallel implementation, it is desirable to have minimal interdependency among the parallel sub-models on the workers. To achieve this goal, we developed and analyzed RePurpose, an efficient algorithm that rearranges the neurons in the neural network and partitions them (without changing the general topology of the neural network) such that the interdependency among sub-models is minimized under the computation and communication constraints of the workers.
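The reason neurons can be rearranged without affecting the model can be illustrated with a small sketch. This is not the RePurpose algorithm itself; it only demonstrates the underlying fact that reordering the hidden neurons of a layer, together with the matching columns of the next layer's weights, leaves the network output unchanged. The two-layer ReLU network and all names below are assumptions made for illustration.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def forward(W1, W2, x):
    """Two-layer network: y = W2 * relu(W1 * x)."""
    return matvec(W2, relu(matvec(W1, x)))


def permute_hidden(W1, W2, perm):
    """Reorder hidden neurons: rows of W1 and the matching columns of W2."""
    W1p = [W1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, W2p


if __name__ == "__main__":
    rng = random.Random(0)
    n_in, n_hidden, n_out = 3, 6, 2          # hypothetical layer sizes
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    x = [rng.uniform(-1, 1) for _ in range(n_in)]

    perm = list(range(n_hidden))
    rng.shuffle(perm)
    W1p, W2p = permute_hidden(W1, W2, perm)

    print(forward(W1, W2, x))
    print(forward(W1p, W2p, x))              # same values up to rounding
```

Because the output is invariant to such reorderings, an algorithm is free to search over neuron arrangements for one in which the neurons assigned to different workers depend on each other as little as possible.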