## New progress in hot-spots detection, partial-differential-equation-based model identification and statistical computation

##### Abstract

This thesis discusses new progress in (1) hot-spot detection in spatio-temporal data, (2) partial-differential-equation-based (PDE-based) model identification, and (3) optimization for Least Absolute Shrinkage and Selection Operator (Lasso) type problems.
This thesis consists of four main works. Chapters 1 and 2 fall into the first area, hot-spot detection in spatio-temporal data; Chapter 3 belongs to the second area, PDE-based model identification; and Chapter 4 addresses the third area, optimization for Lasso-type problems. The four chapters are summarized as follows.
In Chapter 1, we aim to detect hot-spots in multivariate spatio-temporal datasets that are non-stationary over time. To this end, we propose a statistical method under the framework of tensor decomposition that proceeds in three steps. First, we fit the observed data to a Smooth Sparse Decomposition Tensor (SSD-Tensor) model that serves as a dimension-reduction and de-noising technique: it is an additive model that decomposes the original data into three components: a smooth but non-stationary global mean, sparse local anomalies, and random noise. Next, we estimate the model parameters in a penalized framework that combines a Lasso penalty and a fused-Lasso penalty to encourage spatial sparsity and temporal consistency, respectively. Finally, we apply a Cumulative Sum (CUSUM) control chart to monitor the model residuals, which allows us to detect when and where a hot-spot event occurs. To demonstrate the usefulness of the proposed SSD-Tensor method, we compare it with several other methods in extensive numerical simulation studies and on a real crime-rate dataset. The material of this chapter was published in the Journal of Applied Statistics in January 2021 under the title ``Rapid Detection of Hot-spots via Tensor Decomposition with Applications to Crime Rate Data'' with co-authors Hao Yan, Sarah E. Holte and Yajun Mei.
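The monitoring step can be illustrated with a standard one-sided CUSUM recursion applied to a residual stream; this is a minimal sketch, and the allowance `k` and threshold `h` below are illustrative values rather than the tuned choices used in the chapter.

```python
import numpy as np

def cusum_alarm(residuals, k=0.5, h=8.0):
    """One-sided upper CUSUM on a residual stream.

    k: allowance (reference value); h: decision threshold.
    Returns the first index at which the statistic exceeds h, or None.
    """
    s = 0.0
    for t, r in enumerate(residuals):
        s = max(0.0, s + r - k)  # accumulate evidence of an upward shift
        if s > h:
            return t
    return None

# A synthetic residual stream whose mean shifts upward at t = 50.
rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
alarm = cusum_alarm(stream)  # flags the change shortly after t = 50
```

The same recursion, run per spatial location on the SSD-Tensor residuals, yields both the detection time (when the statistic crosses `h`) and the hot-spot location (which stream crossed it).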
In Chapter 2, we improve the methodology in Chapter 1 both statistically and computationally.
The statistical improvement is a new methodology to detect hot-spots with temporal circularity, instead of the temporal continuity assumed in Chapter 1. This helps us handle many bio-surveillance and healthcare applications, where data sources are measured at many spatial locations repeatedly over time, say daily, weekly, or monthly. The computational improvement is the development of a more efficient algorithm. The main tool we use to accelerate the computation is tensor decomposition, analogous to the matrix setting: inverting a large general matrix can be difficult, but inverting a large block-diagonal matrix is straightforward because it reduces to inverting the sub-matrices on the diagonal. The usefulness of the improved methodology is validated through numerical simulations and a real-world dataset of weekly gonorrhea case counts from 2006 to 2018 for 50 states in the U.S. The material of this chapter was accepted as a book chapter in Frontiers in Statistical Quality Control 13 in February 2021 under the title ``Rapid Detection of Hot-spot by Tensor Decomposition with Application to Weekly Gonorrhea Data'' with co-authors Hao Yan, Sarah E. Holte, Roxanne P. Kerani and Yajun Mei.
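The block-diagonal analogy can be made concrete with a small sketch: inverting each diagonal block separately recovers exactly the corresponding blocks of the full inverse, at a fraction of the cost. The dimensions below are arbitrary toy values.

```python
import numpy as np

def blockwise_inverse(blocks):
    """Invert a block-diagonal matrix one diagonal block at a time."""
    return [np.linalg.inv(b) for b in blocks]

# Build a 12 x 12 block-diagonal matrix from four 3 x 3 blocks
# (diagonally dominant, hence safely invertible).
rng = np.random.default_rng(1)
blocks = [rng.normal(size=(3, 3)) + 3.0 * np.eye(3) for _ in range(4)]
full = np.zeros((12, 12))
for i, b in enumerate(blocks):
    full[3 * i:3 * i + 3, 3 * i:3 * i + 3] = b

direct = np.linalg.inv(full)          # O((4 * 3)^3) on the full matrix
piecewise = blockwise_inverse(blocks)  # four O(3^3) inversions
# direct[3i:3i+3, 3i:3i+3] matches piecewise[i] for every block i.
```

Inverting each block costs a cube of the block size, so the blockwise route scales linearly in the number of blocks rather than cubically in the total dimension; the tensor-decomposition structure in this chapter plays the same role as the block structure here.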
In Chapter 3, we propose a two-stage method, called the Spline-Assisted Partial Differential Equation involved Model Identification (SAPDEMI) method, to efficiently identify underlying PDE models from noisy data. In the first stage -- the functional-estimation stage -- we employ cubic splines to estimate the unobservable derivatives, which serve as candidate terms of the underlying PDE models. The contribution of this stage is its computational efficiency: its complexity is linear in the sample size, which achieves the lowest possible order of complexity. In the second stage -- the model-identification stage -- we apply the Lasso to identify the underlying PDE model. The contribution of this stage is our focus on model selection, whereas the existing literature mostly focuses on parameter estimation. Moreover, we develop statistical properties of our method for correct identification, where the main tool we use is the primal-dual witness (PDW) method. Finally, we validate our theory through various numerical examples.
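The functional-estimation stage can be illustrated on a noise-free toy function, using SciPy's `CubicSpline` as a stand-in for the chapter's cubic-spline machinery: fit a spline to the samples, then differentiate the spline analytically to estimate the derivative that was never observed.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Sample u(x) = sin(x) on a grid; its derivative u_x is unobservable
# and must be recovered from the samples alone.
x = np.linspace(0.0, 2.0 * np.pi, 200)
u = np.sin(x)

# Fit a cubic spline to the samples, then differentiate it analytically.
spline = CubicSpline(x, u)
du_est = spline.derivative()(x)  # spline-based estimate of u_x
du_true = np.cos(x)              # ground truth for this toy function
# On this grid the spline derivative tracks cos(x) to high accuracy.
```

In the second stage, estimated derivatives such as `du_est` (and higher-order analogues) become columns of a candidate library, and the Lasso selects which terms enter the identified PDE.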
In Chapter 4, we focus on developing an algorithm to solve optimization problems with an L1 regularization term, namely Lasso-type problems. The algorithm developed in this chapter can greatly reduce the computational complexity in Chapters 1, 2 and 3, where we pursue sparse identification. The challenge in developing an efficient algorithm for the Lasso-type problem is that its objective function is not strictly convex when the number of samples is less than the number of features. Because of this property, existing algorithms for Lasso-type estimators in general cannot achieve the optimal rate, owing to the non-smoothness of the absolute-value function at the origin. To overcome this challenge, we develop a homotopic method that uses a sequence of surrogate functions to approximate the L1 penalty used in Lasso-type estimators. The surrogate functions converge to the L1 penalty in the Lasso estimator, while each surrogate function is strictly convex, which enables a provably faster numerical rate of convergence. In this chapter, we demonstrate that by meticulously designing the surrogate functions, one can prove a faster numerical convergence rate than any existing method for computing Lasso-type estimators: state-of-the-art algorithms can only guarantee O(1/\epsilon) or O(1/\sqrt{\epsilon}) convergence rates, while we prove an O([\log(1/\epsilon)]^2) rate for the newly proposed algorithm. Our numerical simulations show that the new algorithm also performs better empirically.
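The homotopy idea can be sketched on a one-dimensional Lasso toy problem. The surrogate sqrt(x^2 + mu) below is a generic smooth stand-in for the chapter's carefully designed surrogates: it is strictly convex for every mu > 0 and converges to |x| as mu shrinks, so each smoothed stage can be solved fast and warm-started from the previous one.

```python
import numpy as np

def homotopy_lasso_1d(b, lam, mus, newton_steps=30):
    """Minimize 0.5*(x - b)^2 + lam*|x| (a 1-D Lasso problem) by solving
    a sequence of smoothed problems 0.5*(x - b)^2 + lam*sqrt(x^2 + mu)
    with decreasing mu, warm-starting each stage at the previous solution."""
    x = 0.0
    for mu in mus:
        for _ in range(newton_steps):
            g = (x - b) + lam * x / np.sqrt(x * x + mu)    # gradient
            h = 1.0 + lam * mu / (x * x + mu) ** 1.5       # curvature > 0
            x -= g / h                                      # Newton step
    return x

# The exact 1-D Lasso solution is soft-thresholding: for b = 2, lam = 0.5
# it equals b - lam = 1.5; the homotopy iterates approach this value
# as mu decreases from 1 down to 1e-10.
x_hat = homotopy_lasso_1d(b=2.0, lam=0.5,
                          mus=[10.0 ** (-k) for k in range(0, 11, 2)])
```

Because each smoothed stage is strictly convex and smooth, second-order steps converge rapidly, and the schedule of decreasing mu carries the iterates to the non-smooth Lasso solution; the chapter's analysis makes this intuition precise with the O([\log(1/\epsilon)]^2) rate.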