    Efficient data integration techniques in some modern applications

    View/Open
    LIU-DISSERTATION-2018.pdf (1.261Mb)
    Date
    2018-04-10
    Author
    Liu, Kun
    Abstract
Data science is changing our society and economy, and complicated data from heterogeneous sources is often collected in industries such as finance, manufacturing, security, and pharmaceuticals. The main challenge is often how to analyze these complicated data from heterogeneous sources. One useful data analysis technique is data integration, which allows one to extract invaluable information from heterogeneous sources to make intelligent decisions at the global level. This dissertation aims to develop efficient data integration techniques in some modern real-world applications. We consider four different contexts: (i) online monitoring of large-scale data streams, (ii) consensus sequential detection over distributed networks, (iii) combining different patients' responses to assess the treatment effects of new drugs, and (iv) robust statistical inference in the presence of contaminated data. Chapter 1 investigates the problem of online monitoring of large-scale data streams, where an undesired event may occur at some unknown time and affect only a few unknown data streams. Existing methods are either statistically inefficient or computationally infeasible. Motivated by parallel and distributed computing, we propose a new information fusion technique, which we call the “SUM-Shrinkage” approach, that is efficient and scalable. The main idea is to run local detection procedures in parallel and to use the sum of the shrinkage transformations of the local detection statistics as a global statistic to make a decision. The proposed shrinkage transformation automatically filters out the unaffected data streams and uses only information from the affected data streams to make the decision. The usefulness of our proposed SUM-Shrinkage approach is illustrated in an example of monitoring large-scale independent normally distributed data streams when the local post-change mean shifts are unknown and can be positive or negative. 
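The SUM-Shrinkage idea described for Chapter 1 can be sketched in a few lines. This is only an illustration under stated assumptions, not the dissertation's procedure: the abstract does not specify the local statistic or the shrinkage function, so the sketch assumes a two-sided CUSUM recursion for unit-variance normal streams with a hypothetical design shift `delta`, a soft-thresholding shrinkage with hypothetical threshold `b`, and a hypothetical global alarm level `alarm_level`.

```python
def soft_threshold(w, b):
    """Shrinkage transformation (assumed form): keep only the excess over b."""
    return max(w - b, 0.0)


def sum_shrinkage_monitor(observations, delta=1.0, b=2.0, alarm_level=10.0):
    """Return the first time step at which the global statistic raises an alarm.

    observations: iterable of equal-length lists, one list per time step,
                  holding one observation per data stream.
    delta, b, alarm_level: hypothetical tuning parameters (assumptions).
    Returns None if no alarm is raised.
    """
    up = down = None
    for t, x in enumerate(observations, start=1):
        if up is None:
            up = [0.0] * len(x)
            down = [0.0] * len(x)
        for k, xk in enumerate(x):
            # Local CUSUM recursions for N(0,1) -> N(+delta,1) and N(-delta,1);
            # each stream can be updated independently (in parallel in practice).
            up[k] = max(up[k] + delta * xk - delta ** 2 / 2, 0.0)
            down[k] = max(down[k] - delta * xk - delta ** 2 / 2, 0.0)
        # Global statistic: sum of shrunken local statistics. Streams whose
        # local statistic stays below b contribute nothing.
        global_stat = sum(soft_threshold(max(u, d), b) for u, d in zip(up, down))
        if global_stat >= alarm_level:
            return t
    return None


# Toy data: 5 streams at zero; stream 0 shifts to -2.0 after time step 10.
toy = [[-2.0 if (k == 0 and t > 10) else 0.0 for k in range(5)]
       for t in range(1, 31)]
print(sum_shrinkage_monitor(toy))
```

In this toy run only the shifted stream ever exceeds the threshold `b`, so the unaffected streams are filtered out of the global sum, which is the behavior the abstract attributes to the shrinkage transformation.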
In Chapter 2, we consider the consensus sequential detection problem over distributed sensor networks, in which each local sensor can only communicate local information with its immediate neighboring sensors at each time step, and the question is how the sensors can work together to make a quick but accurate decision when testing binary hypotheses on the true raw sensor distributions. An interesting data integration technique is based on weighted local likelihood ratio statistics, which yields the Consensus-Innovation Sequential Probability Ratio Test (CISPRT) algorithm proposed by Sahu and Kar (IEEE Trans. Signal Process., 2016). Our new contribution is to present improved, non-asymptotic properties of the CISPRT algorithm for Gaussian data in terms of network connectivity, no matter how large the number of sensors is. Moreover, we provide sharp upper bounds on the information loss of the CISPRT algorithm relative to the centralized optimal SPRT algorithm, in terms of expected sample sizes, in the asymptotic regime where the Type I and Type II error probabilities go to 0. Numerical simulations suggest that our results are useful in practical settings where the number of sensors is moderately large. Chapter 3 aims to develop an efficient method for combining different patients' responses to assess the treatment effects of new drugs. Our research is motivated by Biogen's ongoing Phase 3 clinical trial of a new drug, “Aducanumab”, for Alzheimer's disease (AD), where the primary outcome is the change in the Clinical Dementia Rating-Sum of Boxes (CDR-SB) scores. The current gold standard method is the so-called responder analysis based on the two-sample proportion test, which only uses information at Month 0 and Month 18. 
This might lose detection power for two reasons: (i) not every subject will have CDR-SB scores at Month 18, for reasons such as missed appointments or dropping out; (ii) it does not take advantage of the longitudinal study design, in which the CDR-SB scores are collected multiple times for most subjects (e.g., at Months 0, 6, 12, 18, 24, and 36 after enrollment in the study). We propose to model the CDR-SB scores by the Beta distribution and to use a mixed-effects Beta regression model that combines all observed CDR-SB values to increase the power to detect changes in the CDR-SB scores. The usefulness of our proposed models and methods is demonstrated through the Alzheimer's Disease Neuroimaging Initiative (ADNI) database and simulation studies. In Chapter 4 of the dissertation, we investigate the problem of robust statistical inference in the presence of contaminated data. Corrupted or contaminated data is often a serious issue when we integrate data from different sources, and thus it is crucial to have robust local inference before combining different local information. We present our research on robust point estimation in the mixture model. Our main contribution is to consider an exponential loss function that better mitigates the effect of outliers and to develop an asymptotic theory in a new asymptotic regime in which the outlier means go to infinity at a suitable rate as the proportion of outliers goes to 0.
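To make the Chapter 4 idea concrete, the following sketch estimates a location parameter by minimizing an exponential-type loss. It is an illustration only and should not be read as the dissertation's estimator: the abstract does not state the exact loss, so the sketch assumes the bounded loss rho(r) = 1 - exp(-r^2 / (2 * scale^2)), with a hypothetical tuning parameter `scale`, and solves the resulting stationarity condition by a fixed-point (iteratively reweighted mean) iteration started at the sample median.

```python
import math
import statistics


def exp_loss_location(xs, scale=1.0, tol=1e-10, max_iter=200):
    """Location estimate minimizing sum(1 - exp(-(x - mu)^2 / (2 * scale^2))).

    The loss form and `scale` are assumptions made for illustration.
    Setting the derivative to zero gives mu = weighted mean of xs with
    weights exp(-(x - mu)^2 / (2 * scale^2)), iterated to a fixed point.
    """
    mu = statistics.median(xs)  # robust starting value
    for _ in range(max_iter):
        weights = [math.exp(-((x - mu) ** 2) / (2 * scale ** 2)) for x in xs]
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Toy sample: five clean points symmetric about 0 plus one gross outlier.
sample = [-1.0, -0.5, 0.0, 0.5, 1.0, 50.0]
print("sample mean:        ", statistics.mean(sample))
print("exp-loss estimate:  ", exp_loss_location(sample))
```

Because the weight of an observation decays exponentially in its squared distance from the current estimate, the outlier at 50 receives essentially zero weight, while the sample mean is pulled far from the bulk of the data.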
    URI
    http://hdl.handle.net/1853/61160
    Collections
    • Georgia Tech Theses and Dissertations [23877]
    • School of Industrial and Systems Engineering Theses and Dissertations [1457]
