Internet-Scale Information Monitoring: A Continual Query Approach
Information monitoring systems are publish-subscribe systems that continuously track information changes and notify users (or programs acting on behalf of humans) of relevant updates according to specified thresholds. Internet-scale information monitoring presents a number of new challenges. First, automated change detection is harder when sources are autonomous and updates are performed asynchronously. Second, information source heterogeneity makes the problem of modelling and representing changes harder than ever. Third, efficient and scalable mechanisms are needed to handle a large and growing number of users and thousands or even millions of monitoring triggers fired at multiple sources. In this dissertation, we model users' monitoring requests using continual queries (CQs) and present a suite of efficient and scalable solutions to large-scale information monitoring over structured or semi-structured data sources. A CQ is a standing query that monitors information sources for interesting events (triggers) and notifies users when new information changes meet specified thresholds. We first present the system-level facilities for building an Internet-scale continual query system, including the design and development of two operational CQ monitoring systems, OpenCQ and WebCQ, the engineering issues involved, and our solutions. We then describe a number of research challenges that are specific to large-scale information monitoring and the techniques developed in the context of OpenCQ and WebCQ to address these challenges. Example issues include how to efficiently process a large number of continual queries, what mechanisms are effective for building a scalable distributed trigger system that is capable of handling tens of thousands of triggers firing at hundreds of data sources, and how to effectively disseminate fresh information to the right users at the right time.
We have developed a suite of techniques to optimize the processing of continual queries, including an effective CQ grouping scheme, an auxiliary data structure to support group-based indexing of CQs, and a differential CQ evaluation algorithm (DRA). The third contribution is the design of an experimental evaluation model and testbed to validate the solutions. We conducted our evaluation using both measurements on real systems (OpenCQ/WebCQ) and a simulation-based approach. To our knowledge, the research documented in this dissertation is the first to present a focused study of the research and engineering issues in building large-scale information monitoring systems using continual queries.
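To make the notion of a continual query concrete, the following is a minimal illustrative sketch, not the OpenCQ/WebCQ implementation and not the DRA algorithm itself. It assumes a CQ can be modelled as a standing predicate plus a notification threshold, and that evaluation looks only at the changes between two snapshots of a source rather than re-running the query over all data. All names here (`ContinualQuery`, `diff`, `evaluate`) and the interpretation of the threshold as a minimum count of relevant changes are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Snapshot = Dict[str, Row]          # hypothetical source state: key -> row
Change = Tuple[str, str]           # (kind, key), kind in {"insert", "update", "delete"}


@dataclass
class ContinualQuery:
    """Hypothetical CQ: a standing predicate plus a notification threshold."""
    name: str
    predicate: Callable[[Row], bool]   # the standing query condition
    threshold: int                     # assumed: min. relevant changes before notifying
    notifications: List[List[Change]] = field(default_factory=list)


def diff(old: Snapshot, new: Snapshot) -> List[Change]:
    """Compute the changes between two snapshots of a source."""
    changes: List[Change] = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key))
        elif old[key] != row:
            changes.append(("update", key))
    for key in old:
        if key not in new:
            changes.append(("delete", key))
    return changes


def evaluate(cq: ContinualQuery, old: Snapshot, new: Snapshot) -> List[Change]:
    """Evaluate the CQ over the changes only, not over the full snapshot."""
    relevant: List[Change] = []
    for kind, key in diff(old, new):
        # Inserts and updates are tested on the new row, deletes on the old row.
        row = old[key] if kind == "delete" else new[key]
        if cq.predicate(row):
            relevant.append((kind, key))
    # Notify only when the number of relevant changes meets the threshold.
    if len(relevant) >= cq.threshold:
        cq.notifications.append(relevant)
    return relevant


old = {"IBM": {"price": 100.0}, "SUN": {"price": 50.0}}
new = {"IBM": {"price": 120.0}, "SUN": {"price": 50.0}, "HP": {"price": 30.0}}
cq = ContinualQuery("price_above_100", lambda r: r["price"] > 100, threshold=1)
result = evaluate(cq, old, new)
```

In this example only the updated IBM row satisfies the predicate, so `result` holds a single change and one notification is recorded. The unchanged SUN row is never examined, which is the intuition behind evaluating changes differentially.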