Biological and clinical data integration and its applications in healthcare
Answers to the most complex biological questions are rarely determined solely from experimental evidence. They require subsequent analysis of many data sources that are often heterogeneous. Most biological data repositories focus on providing only one particular type of data, such as sequences, molecular interactions, protein structure, or gene expression. In many cases, researchers must visit several different databases to answer a single scientific question. It is therefore essential to develop efficient and seamless strategies for integrating disparate biological data sources, to facilitate the discovery of novel associations and the validation of existing hypotheses. This thesis presents the design and development of different strategies for integrating biological and clinical systems. The BioSPIDA system is a data warehousing solution that integrates many NCBI databases and other biological sources on protein sequences, protein domains, and biological pathways. It utilizes a universal parser that facilitates integration without developing separate source code for each data site. This enables users to execute fine-grained queries that filter genes by their protein interactions, gene expression, functional annotation, and protein domain representation. Relational databases can quickly generate and return filtered results for research questions, but they are not the most suitable solution in all cases. Clinical patients and genes are typically annotated with concepts from hierarchical ontologies, and the performance of relational databases weakens considerably when traversing and representing graph structures. This thesis illustrates when relational databases are most suitable and compares performance benchmarks of semantic web technologies and graph databases when comparing ontological concepts. Several approaches to analyzing integrated data are discussed to demonstrate the advantages of local integration over dependence on remote data centers.
Intensive care patients are prioritized by their length of stay, and their severity class is estimated from their diagnosis, to help minimize wait time and preferentially treat patients according to their condition. In a separate study, semantic clustering of patients is conducted by integrating a clinical database with a medical ontology to help identify multi-morbidity patterns. In the biological area, gene pathways, protein interaction networks, and functional annotation are integrated to help predict and prioritize candidate disease genes. This thesis presents the results generated in each project by utilizing a local repository of genes, functional annotations, protein interactions, clinical patients, and medical ontologies.