Novel statistical learning and data mining methods for service systems improvement
This dissertation develops new data mining methods for improving service systems. Of the many problems in this area, it addresses three distinct and critical research topics. The first topic is a classical problem in hospital management: accurately forecasting patient census, and thereby workloads. The majority of the current literature focuses on optimal scheduling of inpatients but largely ignores accurate estimation of patients' paths through the treatment and recovery process. As a result, current scheduling models are optimized on inaccurate input data. We develop a Clustering and Scheduling Integrated (CSI) approach to capture patient flows through a network of hospital services. CSI differs from existing approaches in that it clusters patients into groups based on the similarity of their paths, rather than on admit, condition, or other physical attributes. To that end, we develop a novel Semi-Markov model (SMM)-clustering scheme. The methodology is validated by simulation and then applied to real patient data from a partner hospital, where it outperforms current methods. Further, we demonstrate that extant optimization methods achieve significantly better results on key hospital performance measures under CSI than under traditional estimation approaches. From a methodological standpoint, SMM-clustering is a novel approach applicable to any temporal-spatial stochastic data, which are prevalent in many industries and application areas. The second topic concerns data analysis in a special scenario: longitudinal data with measurement errors but no replicates. Longitudinal data are common across fields and sometimes contain measurement errors, especially when data collection involves several processing stages, as with MRI scans in medicine.
Multiple measurements (replicates) are often taken at the same time to gauge the measurement error and correct the analysis. However, obtaining replicates is sometimes not possible because of cost or associated risks; for instance, MRI scans are taken at long intervals due to high costs. Inferences derived from such erroneous data can be unreliable and, in medical diagnosis, can be fatal. We therefore devise a new estimation approach, called EM-Variogram, that uses the autocorrelation in longitudinal data to isolate the variance due to measurement errors. This estimation approach enables more reliable data analysis and a powerful statistical test of model parameters. Applying this methodology to data from Alzheimer's disease patients, we could quickly and precisely detect any signal of decline in patients' conditions. This can be extremely useful for providing any required treatment to improve patients' conditions. Other possible applications are also discussed. The last topic concerns one of the most common data types: sequences. Sequences are ubiquitous across fields such as the web, healthcare, bioinformatics, and text mining, which has made sequence mining a vital research area. However, sequence mining is particularly challenging because there is no accurate and fast approach for finding the (dis)similarity between sequences. As a measure of (dis)similarity, distance between data points in a Euclidean space has proved most effective in mainstream data mining methods such as k-means, kNN, and regression. But a distance measure between sequences is not obvious because sequences are unstructured: they are arbitrary strings of arbitrary length. We therefore propose a new function, called Sequence Graph Transform (SGT), that extracts sequence features and embeds them in a finite-dimensional Euclidean space. It is scalable owing to its low computational complexity and is universally applicable to any sequence problem.
We show theoretically that SGT can capture both short- and long-term patterns in sequences and provides an accurate distance-based measure of (dis)similarity between them. This is also validated experimentally. Finally, we demonstrate its real-world application to clustering, classification, search, and visualization on different sequence problems.
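To make the idea of embedding arbitrary-length sequences in a finite-dimensional Euclidean space concrete, the following is a minimal sketch. It is not the SGT definition from the dissertation, which the abstract does not give. It is a simplified stand-in that scores each ordered pair of symbols by an exponentially decayed function of their positional gap. The function names `pair_decay_embedding` and `euclidean`, the decay parameter `kappa`, and the averaging rule are all assumptions made for illustration.

```python
import math
from itertools import product


def pair_decay_embedding(seq, alphabet, kappa=1.0):
    """Map a sequence to a fixed-length vector with len(alphabet)**2 entries.

    Illustrative stand-in, not the dissertation's SGT formula.
    Entry (u, v) is the mean of exp(-kappa * (j - i)) over all position
    pairs i < j with seq[i] == u and seq[j] == v, or 0.0 if there is none.
    Every symbol in seq must appear in alphabet.
    """
    pairs = list(product(alphabet, repeat=2))
    sums = {pair: 0.0 for pair in pairs}
    counts = {pair: 0 for pair in pairs}
    # Naive double loop over positions: quadratic in the sequence length.
    for i, u in enumerate(seq):
        for j in range(i + 1, len(seq)):
            pair = (u, seq[j])
            sums[pair] += math.exp(-kappa * (j - i))
            counts[pair] += 1
    return [sums[p] / counts[p] if counts[p] else 0.0 for p in pairs]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Sequences of different lengths land in the same 4-dimensional space.
alphabet = "AB"
short_seq = pair_decay_embedding("ABAB", alphabet)
long_seq = pair_decay_embedding("ABABABAB", alphabet)
other_seq = pair_decay_embedding("BBAA", alphabet)
print(euclidean(short_seq, long_seq), euclidean(short_seq, other_seq))
```

Because every sequence maps to a vector of the same length, distance-based methods such as k-means or kNN can be applied directly to the embeddings. In this sketch, a larger `kappa` emphasizes nearby symbol pairs and a smaller `kappa` lets distant pairs contribute more. The naive loop is quadratic in sequence length and says nothing about the computational complexity of SGT itself.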