Stochastic algorithms for distributed optimization and machine learning
MetadataShow full item record
In the big data era, machine learning acts as a powerful tool to help us make predictions and decisions. It has strong ties to the field of optimization, in the way the latter provides methods and theory. While data tends to be collected in a distributed fashion, the standard machine learning models require centralizing the training data on one machine or in a data center, which incurs significant communication cost and puts data privacy at risk. To circumvent such an issue, a variety of distributed machine learning models have been proposed and studied in the literature. In this thesis, we focus on designing and analyzing stochastic algorithms for distributed machine learning that achieve the best communication and sampling complexities. The first part of this thesis is devoted to investigating randomized incremental gradient (RIG) methods for solving the class of finite-sum optimization problems, which is the core problem in distributed optimization. By developing and proving the optimality of the primal-dual gradient method, which covers the well-known Nesterov's accelerated gradient method as a special case, we propose its randomized version, namely the randomized primal-dual gradient (RPDG) method. RPDG only involves the computation of one randomly selected component function per iteration. Moreover, by providing a lower complexity bound for the class of RIG methods for finite-sum optimization, we demonstrate the optimality of RPDG, and this is the first time such an optimal RIG method has been developed for solving finite-sum optimization problems. In the second part of this thesis, we consider a distributed topology with a central authority, and study the distributed finite-sum optimization problems defined over such a star network topology. We propose an optimal RIG method, namely the random gradient extrapolation method (RGEM) and show that it does not require any exact gradient evaluations even at the initial point, but can still achieve the optimal communication and sampling complexities for solving distributed finite-sum optimization problems. In fact, without any full gradient computation, RGEM possesses iteration costs as low as pure stochastic gradient descent methods, but achieves a much faster and optimal linear rate of convergence for solving deterministic finite-sum problems. Moreover, we extend RGEM for stochastic finite-sum optimization, i.e., at each iteration only one randomly selected network agent needs to compute an estimator of its gradient by sampling from its local data. Note that for these problems, it is difficult to compute the exact gradients even at the initial point. In the third part of this thesis, we study another type of distributed topology, called the decentralized network topology, where there is no central authority in the distributed networks. Consider communication is the major bottleneck, we present a class of communication-efficient algorithms for solving stochastic decentralized optimization problems. We introduce a new stochastic decentralized primal-dual type method, called stochastic decentralized communication sliding (SDCS), where the agents can skip communications while solving their local subproblems iteratively. And we show that SDCS achieves the best-known communication complexities and not improvable sampling complexities. Preliminary numerical experiments are performed to show that SDCS can significantly save communication costs over some existing state-of-the-art decentralized methods in all our tested instances. Consider synchronization is another critical issue in decentralized optimization, we further extend the communication-sliding idea to the asynchronous setting. By randomly activating a subset of network agents per iteration, the proposed asynchronous stochastic decentralized type methods can maintain the communication and sampling complexities obtained by SDCS, but these methods can be applied to solve a broader class of decentralized stochastic problems, for example, composite loss functions with smooth structures.