Middleware for online scientific data analytics at extreme scale
MetadataShow full item record
Scientific simulations running on High End Computing machines in domains like Fusion, Astrophysics, and Combustion now routinely generate terabytes of data in a single run, and these data volumes are only expected to increase. Since such massive simulation outputs are key to scientific discovery, the ability to rapidly store, move, analyze, and visualize data is critical to scientists' productivity. Yet there are already serious I/O bottlenecks on current supercomputers, and movement toward the Exascale is further accelerating this trend. This dissertation is concerned with the design, implementation, and evaluation of middleware-level solutions to enable high performance and resource efficient online data analytics to process massive simulation output data at large scales. Online data analytics can effectively overcome the I/O bottleneck for scientific applications at large scales by processing data as it moves through the I/O path. Online analytics can extract valuable insights from live simulation output in a timely manner, better prepare data for subsequent deep analysis and visualization, and gain improved performance and reduced data movement cost (both in time and in power) compared to the conventional post-processing paradigm. The thesis identifies the key challenges for online data analytics based on the needs of a variety of large-scale scientific applications, and proposes a set of novel and effective approaches to efficiently program, distribute, and schedule online data analytics along the critical I/O path. In particular, its solution approach i) provides a high performance data movement substrate to support parallel and complex data exchanges between simulation and online data analytics, ii) enables placement flexibility of analytics to exploit distributed resources, iii) for co-placement of analytics with simulation codes on the same nodes, it uses fined-grained scheduling to harvest idle resources for running online analytics with minimal interference to the simulation, and finally, iv) it supports scalable efficient online spatial indices to accelerate data analytics and visualization on the deep memory hierarchies of high end machines. Our middleware approach is evaluated with leadership scientific applications in domains like Fusion, Combustion, and Molecular Dynamics, and on different High End Computing platforms. Substantial improvements are demonstrated in end-to-end application performance and in resource efficiency at scales of up to 16384 cores, for a broad range of analytics and visualization codes. The outcome is a useful and effective software platform for online scientific data analytics facilitating large-scale scientific data exploration.