Consistency and High Availability of Information Dissemination in Multi-Processor Networks.

Author: Idit Keidar.

Supervised by: Danny Dolev.

Dissertation for the degree ``Doctor of Philosophy''. Institute of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel, October 1998.

Abstract:

This dissertation presents general tools for the development of highly available distributed applications such as replicated servers and Computer Supported Cooperative Work (CSCW) applications. A desktop and multi-media conferencing tool is an example of a CSCW application, incorporating various activities such as video transmission and management of replicated work space. These services are becoming popular today, with the world-wide increase of communication capacity: Replicated servers in clusters are becoming a leading solutions for scalability, fault tolerance and performance, and world-spanning conferences and interactive games over the Internet are becoming more and more popular. Unfortunately, the subtleties involved in such systems are not well understood and many industries apply ad-hoc solutions without fully understanding their limitations and guarantees.

The contribution of this dissertation is in providing application builders with tools and concepts that facilitate the development of such systems while accurately understanding their limitations and guarantees. These concepts are demonstrated and were tested in prototype implementations.

This dissertation suggests a comprehensive framework for the development of highly available groupware and CSCW applications, geared towards multi-process failure prone environments (e.g., the Internet). The services are fault tolerant and scalable. The suggested framework incorporates a wide variety of services ranging from efficient communication solutions to tools for maintaining consistency of distributed information in the face of faults. These services support multi-party conferencing in dynamic discussion groups, while keeping track of the dynamically changing set of participants in each group.

The services exploit the group communication paradigm: Group communication systems provide application builders with reliable multicast communication services within dynamically changing groups, as well as membership services which inform the members when other members crash or join the group.

Unlike classical group communication systems, our design separates the membership services from the multicast communication substrate. The membership is implemented as a separate server (daemon) on each machine that interacts with the multicast communication substrate and with the application. This separation makes the communication services more efficient since most of the time the membership does not change. This design also facilitates using the same membership services for a variety of quality of service (QoS) communication options. Examples of QoS options include high bandwidth, low latency, and also reliable (loss free) multicast. Decomposing the service into separate modules also makes it easier to reason about, i.e., formally specify the service guarantees and assumptions, and prove correctness. The membership service is presented in detail in Chapter 5.

In addition to the multicast communication service and the membership service, we provide application builders with session level services. These services relieve the application builder of the need to explicitly deal with the subtleties of changes in the network situation. The session services exploit the strong group communication semantics in order to efficiently maintain consistency of objects in the face of failures. This dissertation focuses on important building blocks for consistent replication: Chapter 6 presents a highly available Totally Ordered Broadcast service, which may be used, e.g., for consistent replication. Chapter 7 presents a highly available service for maintaining the primary network component in the network.

Chapter 6 presents a Totally Ordered Broadcast protocol, which guarantees a fully serializable history of object updates. This is achieved by prohibiting arbitrary updates of the object in disjoint network components; often, only the members of a primary component may update the object. The algorithm in Chapter 6 exploits group communication as a building block. It always allows members of a primary component in the system to update the object. It may be used in conjunction with several types of primary component services that notify processes when they are members of the current primary component e.g., a service based on dynamic voting presented in Chapter 7.

The underlying concepts demonstrated by the services constructed in this dissertation are general and apply to a large family of distributed systems and applications.

Fault tolerant distributed services are now being developed by many commercial companies; highly available servers running in clusters are the leading new generation solution for scalability and performance. At the basis of many of these commercial systems lie concepts that were developed in academic projects and systems. The underlying concepts of this work will play a role in future development of such systems. In particular, the formal reasoning we apply to our systems will assists commercial system builders understand the guarantees and limitations behind the systems they construct, and also help identify the tradeoffs involved.

Chapter 8 presents a novel protocol, E3PC, for atomic commitment, that always allows a majority to make progress. The ``classical'' three phase commit (3PC) protocol sometimes allows a majority to make progress, but if failures cascade, a majority may become connected and still remain blocked. We have identified this shortcoming, and have developed a new structure that allows information to propagate through the sequence of majorities formed in the system. E3PC improves the classical 3PC without adding extra communication by following this structure.

Download (postscript or pdf): ps, pdf, ps.gz. Israel mirror site: ps.gz.

Last modified: Mon Jul 1 14:34:08 EDT 2002