Classification models for disease diagnosis and outcome analysis
MetadataShow full item record
In this dissertation we study the feature selection and classification problems and apply our methods to real-world medical and biological data sets for disease diagnosis. Classification is an important problem in disease diagnosis to distinguish patients from normal population. DAMIP (discriminant analysis -- mixed integer program) was shown to be a good classification model, which can directly handle multigroup problems, enforce misclassification limits, and provide reserved judgement region. However, DAMIP is NP-hard and presents computational challenges. Feature selection is important in classification to improve the prediction performance, prevent over-fitting, or facilitate data understanding. However, this combinatorial problem becomes intractable when the number of features is large. In this dissertation, we propose a modified particle swarm optimization (PSO), a heuristic method, to solve the feature selection problem, and we study its parameter selection in our applications. We derive theories and exact algorithms to solve the two-group DAMIP in polynomial time. We also propose a heuristic algorithm to solve the multigroup DAMIP. Computational studies on simulated data and data from UCI machine learning repository show that the proposed algorithm performs very well. The polynomial solution time of the heuristic method allows us to solve DAMIP repeatedly within the feature selection procedure. We apply the PSO/DAMIP classification framework to several real-life medical and biological prediction problems. (1) Alzheimer's disease: We use data from several neuropsychological tests to discriminate subjects of Alzheimer's disease, subjects of mild cognitive impairment, and control groups. (2) Cardiovascular disease: We use traditional risk factors and novel oxidative stress biomarkers to predict subjects who are at high or low risk of cardiovascular disease, in which the risk is measured by the thickness of the carotid intima-media or/and the flow-mediated vasodilation. (3) Sulfur amino acid (SAA) intake: We use 1H NMR spectral data of human plasma to classify plasma samples obtained with low SAA intake or high SAA intake. This shows that our method helps for metabolomics study. (4) CpG islands for lung cancer: We identify a large number of sequence patterns (in the order of millions), search candidate patterns from DNA sequences in CpG islands, and look for patterns which can discriminate methylation-prone and methylation-resistant (or in addition, methylation-sporadic) sequences, which relate to early lung cancer prediction.