Distributionally robust stochastic optimization with applications in statistical learning
In this thesis, we study distributionally robust stochastic optimization (DRSO), a recently emerging framework for decision-making under uncertainty. In this framework, instead of assuming that a known underlying probability distribution drives the uncertain behavior of stochastic systems, one seeks solutions that perform well for a family of distributions, so as to hedge against distributional uncertainty in the future. This thesis focuses on the design of tractable models for DRSO. We develop novel formulations and insights for fundamental problems, and discover connections between different areas in optimization, statistics and learning. We first address the key question of how to construct a good family of distributions to hedge against. We point out that such a family should be chosen to be appropriate for the application at hand, and that some of the choices that have been popular until recently are, for many applications, not good choices. We consider distributions that are within a chosen Wasserstein distance from a nominal distribution, for example an empirical distribution resulting from available data. We demonstrate that the resulting distributions hedged against are more reasonable than those resulting from other popular choices of sets. Moreover, the problem of determining the worst-case expectation over the resulting family of distributions has desirable tractability properties. We derive a dual reformulation of the Wasserstein DRSO problem in a very general setting, by constructing (approximate) worst-case distributions explicitly via the first-order optimality conditions of the dual problem. By construction, the worst-case distributions have a concise structure and a clear interpretation. Next, we establish a connection between Wasserstein DRSO and regularization in statistical learning.
More precisely, we identify a broad class of loss functions for which the Wasserstein DRSO problem is asymptotically equivalent to a regularization problem with a gradient-norm penalty. This relation provides new interpretations for problems involving regularization, including a great number of statistical learning problems and discrete choice models (e.g. multinomial logit). The connection also suggests a principled way to regularize high-dimensional non-convex learning problems, which is demonstrated through the training of Wasserstein generative adversarial networks in deep learning. In the final part of the thesis, we consider robust decision-making when the availability of data on the marginal distributions differs from that on the joint distribution. This occurs, for example, when the data streams of different random variables are collected at different frequencies. We propose a distributionally robust approach which hedges against a family of joint distributions with fixed marginals and a dependence structure similar to that of a nominal joint distribution, such as an empirical distribution or the independent product distribution. Similarity of the dependence structure is measured through the Wasserstein distance between the copula of the joint distribution and the copula of the nominal distribution. We show that our choice of distance can be used as a new measure of dependence among random variables. Tractability of our new formulation is obtained by a novel constructive proof of strong duality, combining ideas from variational analysis and the theory of multi-marginal optimal transport.
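For orientation, the dual reformulation and the gradient-norm regularization mentioned in the abstract can be sketched as follows. The notation is an assumption introduced here and does not appear in the abstract: \nu is the nominal distribution, \theta the Wasserstein radius, p \ge 1 the order of the Wasserstein distance W_p, d the underlying metric on the sample space \Xi, \Psi the loss function, and q the conjugate exponent with 1/p + 1/q = 1. The precise conditions under which these relations hold are those established in the thesis; the second relation in particular is only a first-order statement for small \theta and suitably smooth \Psi.

```latex
% Worst-case expectation over a Wasserstein ball and its dual (sketch; notation assumed)
\[
  \sup_{\mu \,:\, W_p(\mu,\nu) \le \theta} \mathbb{E}_{\xi\sim\mu}\bigl[\Psi(\xi)\bigr]
  \;=\;
  \min_{\lambda \ge 0} \Bigl\{ \lambda\,\theta^{p}
    + \mathbb{E}_{\zeta\sim\nu}\Bigl[\, \sup_{\xi\in\Xi}
      \bigl\{ \Psi(\xi) - \lambda\, d(\xi,\zeta)^{p} \bigr\} \Bigr] \Bigr\}
\]

% Asymptotic link to gradient-norm regularization (first order in \theta; sketch)
\[
  \sup_{\mu \,:\, W_p(\mu,\nu) \le \theta} \mathbb{E}_{\xi\sim\mu}\bigl[\Psi(\xi)\bigr]
  \;\approx\;
  \mathbb{E}_{\zeta\sim\nu}\bigl[\Psi(\zeta)\bigr]
  + \theta \,\Bigl( \mathbb{E}_{\zeta\sim\nu}\bigl[\, \|\nabla \Psi(\zeta)\|_{*}^{\,q} \bigr] \Bigr)^{1/q}
\]
```

Here \|\cdot\|_{*} denotes the dual of the norm that induces d, again as an assumption of this sketch. The inner supremum in the first relation is what makes the worst-case distributions interpretable: each nominal point \zeta is moved to a maximizer \xi of the penalized loss.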