Measurement and resource allocation problems in data streaming systems
In a data streaming system, each component consumes one or several streams of data on the fly and produces one or several streams of data for other components. The entire Internet can be viewed as a giant data streaming system. Other examples include real-time exploratory data mining and high-performance transaction processing. In this thesis we study several measurement and resource allocation optimization problems in data streaming systems.
Measuring quantities associated with one or several data streams is often challenging because the sheer volume of data makes it impractical to store the streams in memory or ship them across the network. A data streaming algorithm processes a long stream of data in one pass using a small working memory (called a sketch). Estimation queries can then be answered from one or more such sketches. An important task is to analyze the performance guarantees of such algorithms. In this thesis we describe a tail bound problem that often occurs in this analysis and present a technique for solving it using majorization and convex ordering theories. We present two algorithms that utilize our technique. The first stores a large array of counters in DRAM while achieving the update speed of SRAM. The second detects global icebergs across distributed data streams.
Resource allocation decisions are important for the performance of a data streaming system. The processing graph of a data streaming system forms a fork and join network. The underlying data processing tasks involve a rich set of semantics that includes synchronous and asynchronous data fork and data join. The different types of semantics and processing requirements introduce complex interdependence between the various data streams within the network. We study the distributed resource allocation problem in such systems with the goal of maximizing the total utility of the output streams.
For networks with only synchronous fork and join semantics, we present several decentralized iterative algorithms using primal- and dual-based optimization techniques. For general networks with both synchronous and asynchronous fork and join semantics, we present a novel modeling framework to formulate the resource allocation problem and a shadow-queue based decentralized iterative algorithm to solve it. We show that all the algorithms guarantee optimality and demonstrate through simulation that they can adapt quickly to dynamically changing environments.
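As a rough illustration of the one-pass, small-memory idea behind a sketch, the following Python example implements a generic Count-Min sketch. It is not one of the algorithms developed in this thesis; the class name `CountMinSketch`, the helper `sketch_stream`, and the `width` and `depth` parameters are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Generic Count-Min sketch (illustrative only, not the thesis's algorithm)."""

    def __init__(self, width=1024, depth=4):
        # Hypothetical sizing: depth rows of width counters each.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive one hash function per row by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Each arriving item touches exactly one counter per row.
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # never underestimates the true count (for non-negative updates).
        return min(
            self.rows[row][self._index(row, item)] for row in range(self.depth)
        )


def sketch_stream(stream, width=1024, depth=4):
    """Process the stream in a single pass, keeping only the sketch."""
    sketch = CountMinSketch(width, depth)
    for item in stream:
        sketch.update(item)
    return sketch


if __name__ == "__main__":
    stream = ["a"] * 50 + ["b"] * 5 + ["c"]
    sketch = sketch_stream(stream)
    print(sketch.estimate("a"), sketch.estimate("b"), sketch.estimate("c"))
```

The memory used is `width * depth` counters regardless of stream length, and frequency queries are answered from the sketch alone. Bounding how far such an estimate can exceed the true count is the kind of performance guarantee the abstract refers to.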