Partial persistent sequences and their applications to collaborative text document editing and processing
In a variety of text document editing and processing applications, it is necessary to keep track of the revision history of text documents by recording changes and the metadata of those changes (e.g., user names and modification timestamps). Recent Web 2.0 document editing and processing applications, such as real-time collaborative note taking and wikis, require fine-grained shared access to collaborative text documents as well as efficient retrieval of metadata associated with different parts of those documents. Current revision control techniques support only coarse-grained shared access and are inefficient at retrieving metadata of changes at sub-document granularity. In this dissertation, we design and implement partial persistent sequences (PPSs) to support real-time collaboration and to manage metadata of changes at fine granularities for collaborative text document editing and processing applications. As a persistent data structure, a PPS has two important features. First, items in the data structure are never removed. We maintain the timestamp information needed to keep track of both inserted and deleted items and use it to reconstruct the state of a document at any point in time. Second, PPSs create unique, persistent, and ordered identifiers for items of a document at fine granularities (e.g., a word or a sentence). As a result, we are able to support consistent and fine-grained shared access to collaborative text documents by detecting and resolving editing conflicts based on the revision history, and to efficiently index and retrieve metadata associated with different parts of collaborative text documents. We demonstrate the capabilities of PPSs through two important problems in collaborative text document editing and processing applications: data consistency control and fine-grained document provenance management.
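The two features described above (items are never removed, and each item carries a unique, persistent, ordered identifier) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the class and method names (`PartialPersistentSequence`, `insert_after`, `delete`, `snapshot`), the integer logical clock, and the use of fractional midpoints as identifiers are all assumptions chosen for brevity.

```python
from bisect import bisect_right
from dataclasses import dataclass
from fractions import Fraction
from typing import List, Optional


@dataclass
class Item:
    ident: Fraction              # persistent, ordered identifier (assumed scheme)
    text: str                    # e.g. a word or a sentence
    author: str                  # metadata recorded with the change
    inserted_at: int             # logical timestamp of the insertion
    deleted_at: Optional[int] = None  # set on deletion; the item itself stays


class PartialPersistentSequence:
    """Hypothetical sketch: a sequence whose items are marked, never removed."""

    def __init__(self) -> None:
        self._items: List[Item] = []  # kept sorted by ident
        self._clock = 0               # logical clock, one tick per operation

    def insert_after(self, left: Optional[Fraction], text: str, author: str) -> Fraction:
        """Insert after the item identified by `left` (None means at the front)."""
        self._clock += 1
        idents = [item.ident for item in self._items]
        pos = 0 if left is None else bisect_right(idents, left)
        lo = Fraction(0) if pos == 0 else idents[pos - 1]
        hi = Fraction(1) if pos == len(idents) else idents[pos]
        ident = (lo + hi) / 2  # strictly between both neighbours, hence unique
        self._items.insert(pos, Item(ident, text, author, self._clock))
        return ident

    def delete(self, ident: Fraction) -> int:
        """Mark an item as deleted and return the deletion timestamp."""
        self._clock += 1
        for item in self._items:
            if item.ident == ident and item.deleted_at is None:
                item.deleted_at = self._clock
                return self._clock
        raise KeyError(ident)

    def snapshot(self, time: int) -> List[str]:
        """Reconstruct the document as it was at logical time `time`."""
        return [
            item.text
            for item in self._items
            if item.inserted_at <= time
            and (item.deleted_at is None or item.deleted_at > time)
        ]


pps = PartialPersistentSequence()
a = pps.insert_after(None, "Hello", "alice")  # time 1
b = pps.insert_after(a, "world", "bob")       # time 2
c = pps.insert_after(a, "brave", "carol")     # time 3, lands between a and b
pps.delete(b)                                 # time 4
```

Because deleted items keep their position and identifier, any earlier state can be rebuilt by filtering on timestamps, and an identifier handed out once keeps pointing at the same item regardless of later edits.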
The first problem studies how to detect and resolve editing conflicts in collaborative text document editing systems. We approach this problem in two steps. In the first step, we use PPSs to capture data dependencies between different editing operations and define a consistency model more suitable for real-time collaborative editing systems. In the second step, we extend our work to the entire spectrum of collaborations and adapt transactional techniques to build a flexible framework for the development of various collaborative editing systems. The generality of this framework is demonstrated by its ability to specify three different types of collaboration, as exemplified by RCS, MediaWiki, and Google Docs respectively. We precisely specify the programming interfaces of this framework and describe a prototype implementation over Oracle Berkeley DB High Availability, a replicated database management engine.
The second problem, fine-grained document provenance management, studies how to efficiently index and retrieve fine-grained metadata for different parts of collaborative text documents. We use PPSs to design both disk-economic and computation-efficient techniques to index provenance data for millions of Wikipedia articles. Our approach is disk-economic because we save only a few full versions of a document and keep only the delta changes between those full versions. Our approach is also computation-efficient because we avoid parsing the revision history of collaborative documents to retrieve fine-grained metadata. Compared to MediaWiki, the revision control system for Wikipedia, our system uses less than 10% of the disk space and achieves at least an order of magnitude speed-up in retrieving fine-grained metadata for documents with thousands of revisions.
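The disk-saving idea mentioned above, storing a few full versions and only deltas between them, can be sketched as follows. The delta format, the `CheckpointedHistory` class, and the fixed checkpoint interval are assumptions made for illustration; the abstract does not say how the dissertation chooses which full versions to keep or how its deltas are encoded.

```python
from typing import Dict, List, Tuple

# Assumed delta format: a list of (op, index, token), where op is "ins" or "del".
Delta = List[Tuple[str, int, str]]


def apply_delta(tokens: List[str], delta: Delta) -> List[str]:
    """Return a new token list with the delta's operations applied in order."""
    out = list(tokens)
    for op, index, token in delta:
        if op == "ins":
            out.insert(index, token)
        elif op == "del":
            del out[index]
        else:
            raise ValueError(f"unknown operation: {op}")
    return out


class CheckpointedHistory:
    """Hypothetical store: a full copy every N revisions, deltas in between."""

    def __init__(self, checkpoint_every: int = 3) -> None:
        self.checkpoint_every = checkpoint_every
        self.checkpoints: Dict[int, List[str]] = {0: []}  # revision -> full copy
        self.deltas: List[Delta] = []  # deltas[i] turns revision i into i + 1
        self._latest: List[str] = []   # working copy of the newest revision

    def commit(self, delta: Delta) -> int:
        """Record a new revision and return its number."""
        self._latest = apply_delta(self._latest, delta)
        self.deltas.append(delta)
        rev = len(self.deltas)
        if rev % self.checkpoint_every == 0:
            self.checkpoints[rev] = list(self._latest)
        return rev

    def revision(self, rev: int) -> List[str]:
        """Rebuild a revision from the nearest earlier checkpoint."""
        if not 0 <= rev <= len(self.deltas):
            raise IndexError(rev)
        base = max(r for r in self.checkpoints if r <= rev)
        tokens = list(self.checkpoints[base])
        for delta in self.deltas[base:rev]:
            tokens = apply_delta(tokens, delta)
        return tokens


history = CheckpointedHistory(checkpoint_every=2)
history.commit([("ins", 0, "a")])
history.commit([("ins", 1, "b")])
history.commit([("del", 0, "a")])
```

Rebuilding a revision replays at most `checkpoint_every - 1` deltas, so the interval trades disk space against reconstruction time.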