Data Mining in Tree-Based Models and Large-Scale Contingency Tables
Kim, Seoung Bum
MetadataShow full item record
This thesis is composed of two parts. The first part pertains to tree-based models. The second part deals with multiple testing in large-scale contingency tables. Tree-based models have gained enormous popularity in statistical modeling and data mining. We propose a novel tree-pruning algorithm called frontier-based tree-pruning algorithm (FBP). The new method has an order of computational complexity comparable to cost-complexity pruning (CCP). Regarding tree pruning, it provides a full spectrum of information. Numerical study on real data sets reveals a surprise: in the complexity-penalization approach, most of the tree sizes are inadmissible. FBP facilitates a more faithful implementation of cross validation, which is favored by simulations. One of the most common test procedures using two-way contingency tables is the test of independence between two categorizations. Current test procedures such as chi-square or likelihood ratio tests provide overall independency but bring limited information about the nature of the association in contingency tables. We propose an approach of testing independence of categories in individual cells of contingency tables based on a multiple testing framework. We then employ the proposed method to identify the patterns of pair-wise associations between amino acids involved in beta-sheet bridges of proteins. We identify a number of amino acid pairs that exhibit either strong or weak association. These patterns provide useful information for algorithms that predict secondary and tertiary structures of proteins.