Theoretical Analysis of Stochastic Gradient Descent in Stochastic Optimization
MetadataShow full item record
Stochastic Gradient Descent (SGD) type algorithms have been widely applied to many stochastic optimization problems, such as machine learning. Despite its empirical success, there is still a lack of theoretical understanding of convergence properties of SGD and its variants. The major bottleneck comes from the highly nonconvex optimization landscape and the complicated noise structure. This thesis aims to provide useful insights on the good performance of SGD type algorithms through theoretical analysis with the help of diffusion approximation and martingale theory. Specifically, we answer the following questions: Chapter 2: What is the effect of Momentum in nonconvex optimization? We propose to analyze the algorithmic behavior of Momentum Stochastic Gradient Descent (MSGD) by diffusion approximation for general nonconvex optimization problems. Our study shows that the momentum helps escape from saddle points, but hurts the convergence within the neighborhood of optima (if without the step size annealing or momentum annealing). Our theoretical discovery partially corroborates the empirical successes of MSGD in training deep neural networks. Chapter 3: How does noise in SGD help the algorithm avoid spurious local optima? We answer this question through a simple two-layer convolutional neural network model, which has a spurious local optimum and a global optimum. Our theory shows that perturbed gradient descent and perturbed mini-batch stochastic gradient algorithms in conjunction with noise annealing is guaranteed to converge to a global optimum in polynomial time with arbitrary initialization. This implies that the noise enables the algorithm to efficiently escape from the spurious local optimum. Chapter 4: How does noise in SGD help select optima that have good generalization performance? We further investigate the role of noise when multiple global optima exist by considering nonconvex rectangular matrix factorization problem, which has infinitely many global minima due to rotation and scaling invariance. Gradient descent (GD) can converge to any optimum, depending on the initialization. In contrast, we show that a perturbed form of GD with an arbitrary initialization converges to a global optimum that is uniquely determined by the injected noise. Our result implies that the noise imposes implicit bias towards certain optima. Chapter 5: Does reusing past samples in SGD help improve the efficiency in simulation optimization? We consider a special type of stochastic optimization problem, simulation optimization. The main challenge of simulation optimization is the limited simulation budget because of the high computational cost of simulation experiments. One approach to overcome this challenge is to reuse simulation outputs from previous iterations in the current iteration of the optimization procedure. However, due to the dependence among iterations, simulation replications from different iterations are not independent, which leads to the lack of theoretical justification for the good empirical performance. We fill this gap by theoretically studying the stochastic gradient descent method with reusing past simulation replications. We show that reusing past replications does not change the convergence of the algorithm, which implies the bias of the gradient estimator is asymptotically negligible. Moreover, we justify that reusing past replications reduces the variance of gradient estimators around local optima, which implies that the algorithm can achieve faster convergence.