High-dimensional classification and attribute-based forecasting
MetadataShow full item record
This thesis consists of two parts. The first part focuses on high-dimensional classification problems in microarray experiments. The second part deals with forecasting problems with a large number of categories in predictors. Classification problems in microarray experiments refer to discriminating subjects with different biologic phenotypes or known tumor subtypes as well as to predicting the clinical outcomes or the prognostic stages of subjects. One important characteristic of microarray data is that the number of genes is much larger than the sample size. The penalized logistic regression method is known for simultaneous variable selection and classification. However, the performance of this method declines as the number of variables increases. With this concern, in the first study, we propose a new classification approach that employs the penalized logistic regression method iteratively with a controlled size of gene subsets to maintain variable selection consistency and classification accuracy. The second study is motivated by a modern microarray experiment that includes two layers of replicates. This new experimental setting causes most existing classification methods, including penalized logistic regression, not appropriate to be directly applied because the assumption of independent observations is violated. To solve this problem, we propose a new classification method by incorporating random effects into penalized logistic regression such that the heterogeneity among different experimental subjects and the correlations from repeated measurements can be taken into account. An efficient hybrid algorithm is introduced to tackle computational challenges in estimation and integration. Applications to a breast cancer study show that the proposed classification method obtains smaller models with higher prediction accuracy than the method based on the assumption of independent observations. The second part of this thesis develops a new forecasting approach for large-scale datasets associated with a large number of predictor categories and with predictor structures. The new approach, beyond conventional tree-based methods, incorporates a general linear model and hierarchical splits to make trees more comprehensive, efficient, and interpretable. Through an empirical study in the air cargo industry and a simulation study containing several different settings, the new approach produces higher forecasting accuracy and higher computational efficiency than existing tree-based methods.