Lattice Agreement in Message Passing Systems

LIPIcs - Leibniz International Proceedings in Informatics, Sep 2018

This paper studies the lattice agreement problem and the generalized lattice agreement problem in distributed message passing systems. In the lattice agreement problem, given input values from a lattice, processes have to non-trivially decide output values that lie on a chain. We consider the lattice agreement problem in both synchronous and asynchronous systems. For synchronous lattice agreement, we present two algorithms which run in log(f) and min{O(log^2 h(L)), O(log^2 f)} rounds, respectively, where h(L) denotes the height of the input sublattice L, f < n is the number of crash failures the system can tolerate, and n is the number of processes in the system. These algorithms have significantly better round complexity than previously known algorithms. The algorithm by Attiya et al. [Attiya et al. DISC, 1995] takes log(n) synchronous rounds, and the algorithm by Mavronicolasa [Mavronicolasa, 2018] takes min{O(h(L)), O(sqrt(f))} rounds. For asynchronous lattice agreement, we propose an algorithm with a time complexity of 2*min{h(L), f + 1} message delays, which improves on the previously known time complexity of O(n) message delays. The generalized lattice agreement problem, defined by Faleiro et al. in [Faleiro et al. PODC, 2012], is a generalization of the lattice agreement problem that is applied to replicated state machines. We propose an algorithm which guarantees liveness when a majority of the processes are correct in asynchronous systems. Our algorithm requires min{O(h(L)), O(f)} units of time in the worst case, which is better than the O(n) units of time required by the algorithm in [Faleiro et al. PODC, 2012].

http://drops.dagstuhl.de/opus/volltexte/2018/9830/pdf/LIPIcs-DISC-2018-41.pdf


Xiong Zheng, Changyong Hu, Vijay K. Garg
University of Texas at Austin, Austin, TX 78712, USA

2012 ACM Subject Classification: Theory of computation →
Distributed algorithms

Related Version: A full version of the paper is available at https://arxiv.org/abs/1807.11557.

Funding: Supported by NSF CNS-1812349, NSF CNS-1563544, NSF CNS-1346245, Huawei Inc., and the Cullen Trust for Higher Education Endowed Professorship.

Acknowledgements: We want to thank John Kaippallimalil for providing some useful application cases for CRDT and generalized lattice agreement.

Keywords and phrases: Lattice Agreement; Replicated State Machine; Consensus

1 Introduction

Lattice agreement, introduced in [2] to solve the atomic snapshot problem [1] in shared memory, is an important decision problem in distributed systems. In this problem, processes start with input values from a lattice and need to decide values which are comparable to each other. The lattice agreement problem is a weaker decision problem than consensus. In synchronous systems, consensus cannot be solved in fewer than $f + 1$ rounds [6], but lattice agreement can be solved in $\log f$ rounds (as shown by an algorithm we propose). In asynchronous systems, the consensus problem cannot be solved even with one failure [8], whereas the lattice agreement problem can be solved when a majority of processes is correct [7].

In synchronous message passing systems, a recursive $\log n$-round algorithm based on a "branch-and-bound" approach is proposed in [2] to solve the lattice agreement problem with message complexity of $O(n^2)$. It can tolerate at most $n - 1$ process failures. Later, [12] gave an algorithm with round complexity of $\min\{1 + h(L), \lfloor (3 + \sqrt{8f + 1})/2 \rfloor\}$ for any execution where at most $f < n$ processes may crash. Their algorithm has the early-stopping property and is the first algorithm with round complexity that depends on the actual height of the input lattice. Our first algorithm for synchronous lattice agreement, $LA_\alpha$, requires $\log h(L)$ rounds. It assumes that the height of the input lattice is known to all processes.
By applying this algorithm as a building block, we give an algorithm, $LA_\beta$, which requires only $\log f$ rounds without the height assumption of $LA_\alpha$. Instead of directly trying to decide on comparable output values from a lattice with an unknown height, this algorithm first performs lattice agreement on the failure set known by each process by using $LA_\alpha$. Then each process removes the values of the faulty processes it knows and outputs the join of all the remaining values. Our third algorithm, $LA_\gamma$, has round complexity of $\min\{O(\log^2 h(L)), O(\log^2 f)\}$, which depends on the height of the input lattice but does not assume that the height is known. This algorithm iteratively guesses the actual height of the input lattice and applies $LA_\alpha$ with the guessed height as input, until all processes terminate.

Lattice agreement in asynchronous message passing systems is useful due to its applications in atomic snapshot objects and fault-tolerant replicated state machines. Efficient implementation of atomic snapshot objects in crash-prone asynchronous message passing systems is important because they can make the design of algorithms in such systems easier (examples of algorithms in message passing systems based on snapshot objects can be found in [18], [13] and [4]). As shown in [2], any algorithm for lattice agreement can be applied to solve the atomic snapshot problem in a shared memory system. We note that [3] does not directly use lattice agreement to solve the atomic snapshot problem, but their idea of producing comparable views for processes is essentially lattice agreement. Thus, by using the same transformation techniques as in [2] and [3], algorithms for the lattice agreement problem can be directly applied to implement atomic snapshot objects in crash-prone message passing systems. We give an algorithm for the asynchronous lattice agreement problem which requires $\min\{O(h(L)), O(f)\}$ message delays.
Then, by applying the technique in [3], our algorithm can be used to implement atomic snapshot objects on top of crash-prone asynchronous message passing systems and achieve a time complexity of $O(f)$ message delays in the worst case. Our result significantly improves the message delays of the previous work by Delporte-Gallet, Fauconnier et al. [5]. The algorithm in [5] directly implements an atomic snapshot object on top of crash-prone message passing systems and requires $O(n)$ message delays in the worst case.

Another related work on lattice agreement in asynchronous systems is by Faleiro et al. [7]. They solve the lattice agreement problem in asynchronous systems by giving a Paxos-style protocol [10, 11], in which each proposer keeps proposing a value until it gets accept messages from a majority of acceptors. An acceptor only accepts a proposal when the proposal has a bigger value than its accepted value. Their algorithm requires $O(n)$ message delays. Our asynchronous lattice agreement algorithm is not Paxos-style. Instead, it runs in round-trips. Each round-trip is composed of sending a message to all and getting $n - f$ acknowledgements back. Our algorithm guarantees termination in $\min\{O(h(L)), O(f)\}$ message delays, which is a significant improvement over $O(n)$ message delays.

The generalized lattice agreement problem defined in [7] is a generalization of the lattice agreement problem in asynchronous systems. It is applied to implement a specific class of replicated state machines. In the conventional replicated state machine approach [14], a consensus-based mechanism is used to implement strong consistency. For performance reasons, many systems relax the strong consistency requirement and support eventual consistency [17], i.e., all copies are eventually consistent. However, there is no guarantee on when this eventual consistency happens. Also, different copies could be in an inconsistent state before this eventual situation happens.
A conflict-free replicated data type (CRDT) [15, 16] is a data structure which supports such eventual consistency. In a CRDT, all operations are designed to be commutative so that they can be executed concurrently without coordination. As shown in [7], by applying generalized lattice agreement on top of a CRDT, the states of any two copies can be made comparable, which provides a linearizability guarantee [9] for the CRDT.

The following example from [7] motivates generalized lattice agreement. Consider a replicated set data structure which supports adds and reads. Suppose there are two concurrent updates, add(a) and add(b), and two concurrent reads on copy one and copy two, respectively. With a CRDT, it could happen that the two reads return {a} and {b}, respectively. This execution is not linearizable [9], because if add(a) appears before add(b) in the linear order, then no read can return {b}. On the other hand, if we use the conventional consensus-based replicated state machine technique, then all operations would be coordinated, including the two reads. This greatly impacts the throughput of the system. By applying generalized lattice agreement on top of the CRDT, all operations can be executed concurrently and any two reads always return comparable views of the system. In the above example, the two reads return either (i) {a} and {a, b} or (ii) {b} and {a, b}, which is linearizable. Therefore, generalized lattice agreement can be applied on top of a CRDT to provide a better consistency guarantee than the CRDT alone and better availability than the conventional replicated state machine technique.

Since the generalized lattice agreement problem has applications in building replicated state machines, it is important to reduce the message delays for a value to be learned. Faleiro et al. [7] propose an algorithm for generalized lattice agreement by using their algorithm for the lattice agreement problem as a building block.
Their generalized lattice agreement algorithm satisfies safety and liveness assuming $f < \lceil n/2 \rceil$. A value is eventually learned in their algorithm after $O(n)$ message delays in the worst case. Our algorithm guarantees that a value is learned in $\min\{O(h(L)), O(f)\}$ message delays.

In summary, this paper makes the following contributions:
- We present an algorithm, $LA_\alpha$, to solve lattice agreement in synchronous systems in $\log h(L)$ rounds, assuming $h(L)$ is known.
- Using $LA_\alpha$, we propose an algorithm, $LA_\beta$, to solve the standard lattice agreement problem in $\log f$ rounds. This bound is significantly better than the previously known upper bounds of $\log n$ by [3] and $\min\{1 + h(L), \lfloor (3 + \sqrt{8f + 1})/2 \rfloor\}$ by [12] (and solves the open problem posed there). We also give an algorithm, $LA_\gamma$, which runs in $\min\{O(\log^2 h(L)), O(\log^2 f)\}$ rounds.
- For the lattice agreement problem in asynchronous systems, we give an algorithm, $LA_\delta$, which requires $2 \cdot \min\{h(L), f + 1\}$ message delays, improving the $O(n)$ bound by [7].
- Based on the asynchronous lattice agreement algorithm, we present an algorithm, $GLA_\alpha$, to solve generalized lattice agreement with a time complexity of $\min\{O(h(L)), O(f)\}$ message delays, improving the $O(n)$ bound by [7].

Related previous work and our results are summarized in Table 1. LA sync and LA async represent lattice agreement in synchronous systems and asynchronous systems, respectively. GLA async represents generalized lattice agreement in asynchronous systems. $LA_\alpha$ is designed to solve the lattice agreement problem under the assumption that the height of the input lattice is given. It serves as a building block for $LA_\beta$ and $LA_\gamma$. For synchronous systems, the time complexity is given in terms of synchronous rounds. For asynchronous systems, the time complexity is given in terms of message delays. The message column represents the total number of messages sent by all processes in one execution.
For the generalized lattice agreement problem, the message complexity is given in terms of the number of messages needed for a value to be learned.

2 System Model and Problem Definitions

2.1 System Model

We assume a distributed message passing system with $n$ processes, denoted as $p_1, ..., p_n$, in a completely connected topology. We consider both synchronous and asynchronous systems. Synchronous means that message delays and the duration of the operations performed by a process have a known upper bound. Asynchronous means that there is no upper bound on the time for a message to reach its destination. The model assumes that processes may have crash failures but no Byzantine failures. The model parameter $f$ denotes the maximum number of processes that may crash in a run. We assume that the underlying communication system is reliable but the message channel may not be FIFO. We say a process is faulty in a run if it crashes, and correct or non-faulty otherwise. In the algorithms that follow, when a process sends a message to all, it also sends this message to itself.

2.2 Lattice Agreement

Let $(X, \le, \sqcup)$ be a finite join semi-lattice with a partial order $\le$ and join $\sqcup$. Two values $u$ and $v$ in $X$ are comparable iff $u \le v$ or $v \le u$. The join of $u$ and $v$ is denoted as $\sqcup\{u, v\}$. $X$ is a join semi-lattice if a join exists for every nonempty finite subset of $X$. As is customary in this area, we use the term lattice instead of join semi-lattice in this paper for simplicity.

In the lattice agreement problem [2], each process $p_i$ can propose a value $x_i$ in $X$ and must decide on some output $y_i$ also in $X$. An algorithm is said to solve the lattice agreement problem if the following properties are satisfied:
- Downward-Validity: For all $i \in [1..n]$, $x_i \le y_i$.
- Upward-Validity: For all $i \in [1..n]$, $y_i \le \sqcup\{x_1, ..., x_n\}$.
- Comparability: For all $i \in [1..n]$ and $j \in [1..n]$, either $y_i \le y_j$ or $y_j \le y_i$.

In this paper, all the algorithms that we propose apply the join operation to some subset of the input values.
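
The three properties above can be checked mechanically on a finished run. The following is a minimal sketch, assuming values are modeled as Python `frozenset`s ordered by inclusion with union as the join; the names `join` and `check_lattice_agreement` are hypothetical helpers introduced here for illustration and do not come from the paper.

```python
from itertools import combinations

def join(values):
    """Join of a collection of set-valued lattice elements (set union)."""
    out = frozenset()
    for v in values:
        out |= v
    return out

def check_lattice_agreement(inputs, outputs):
    """inputs, outputs: {process id: frozenset}.
    `outputs` may omit processes that crashed before deciding."""
    top = join(inputs.values())
    return {
        # x_i <= y_i for every process that decided
        "downward_validity": all(inputs[p] <= y for p, y in outputs.items()),
        # y_i <= join of all inputs
        "upward_validity": all(y <= top for y in outputs.values()),
        # every pair of decided values is ordered by inclusion
        "comparability": all(a <= b or b <= a
                             for a, b in combinations(outputs.values(), 2)),
    }

inputs = {1: frozenset("a"), 2: frozenset("b"), 3: frozenset("c")}
chain = {1: frozenset("a"), 2: frozenset("ab"), 3: frozenset("abc")}
fork = {1: frozenset("a"), 2: frozenset("b"), 3: frozenset("abc")}
assert all(check_lattice_agreement(inputs, chain).values())
assert not check_lattice_agreement(inputs, fork)["comparability"]
```

In this example `chain` is a valid set of decisions, while `fork` violates Comparability because {a} and {b} are not ordered by inclusion.
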
Since the algorithms only ever combine input values through joins, it is sufficient to focus on the join-closed subset of $X$ that includes all input values. Let $L$ be this join-closed subset. $L$ is also a join semi-lattice. We call $L$ the input sublattice of $X$. All algorithms proposed in this paper are based on $L$. Since the complexity of our algorithms depends on the height of lattice $L$, we give the formal definitions below:

Definition 1. The height of a value $v$ in a lattice $X$ is the length of the longest path from any minimal value to $v$, denoted as $h_X(v)$, or $h(v)$ when the lattice is clear from context.

Definition 2. The height of a lattice $X$ is the height of its largest value, denoted as $h(X)$.

Each process proposes a value from a boolean lattice. Thus, the largest value in this lattice is the set consisting of all the $n$ values. From Definition 2, we have $h(L) \le n$.

2.3 Generalized Lattice Agreement

In the generalized lattice agreement problem, each process may receive a possibly infinite sequence of values as inputs that belong to a lattice, at any point in time. Let $x_i^p$ denote the $i$-th value received by process $p$. The aim is for each process $p$ to learn a sequence of output values $y_j^p$ which satisfies the following conditions:
- Validity: Any learned value $y_j^p$ is a join of some set of received input values.
- Stability: The value learned by any process $p$ is non-decreasing: $j < k \Rightarrow y_j^p \le y_k^p$.
- Comparability: Any two values $y_j^p$ and $y_k^q$ learned by any two processes $p$ and $q$ are comparable.
- Liveness: Every value $x_i^p$ received by a correct process $p$ is eventually included in some learned value $y_k^q$ of every correct process $q$, i.e., $x_i^p \le y_k^q$.

3 Lattice Agreement in Synchronous Systems

3.1 Lattice Agreement with Known Height

In this section, we first consider a simpler version of the standard lattice agreement problem by assuming that the height of the input sublattice $L$ is known in advance, i.e., $h(L)$ is given. We propose an algorithm, $LA_\alpha$, to solve this problem in $\log h(L)$ synchronous rounds.
In Section 3.2, we use this algorithm to solve the lattice agreement problem when the height is not given.

Algorithm $LA_\alpha$ runs in synchronous rounds. At each round, by calling a Classifier procedure (described below), processes within the same group (to be defined later) are classified into different groups. The algorithm guarantees that, at the end, any two processes within the same group have equal values and any two processes in different groups have comparable values. Thus, the values of all processes are comparable to each other at the end. We present the algorithm by first introducing the fundamental Classifier procedure.

3.1.1 The Classifier Procedure

The Classifier procedure is inspired by the Classifier procedure given by Attiya and Rachman in [3], called AR-Classifier, where it is applied to solve the atomic snapshot problem in the shared memory system. The intuition behind the Classifier procedure is to classify processes as master or slave and to ensure that all master processes have values greater than all slave processes. The pseudo-code for Classifier is given in Figure 1. It takes two parameters: the input value $v$ and the threshold value $k$. The output is composed of three items: the output value, the classification result and the decision status. A process which calls the Classifier procedure should update its value to the output value. The classification result is either master or slave. The decision status is a Boolean value which indicates whether the invoking process can decide on the output value or not. The main functionality of the Classifier procedure is either to tell the invoking process to decide, or to classify the invoking process as a master or a slave. Details of the Classifier procedure are given below:
- Line 1-3: The invoking process sends a message with its input value $v$ and the threshold value $k$ to all. It then collects all the received values associated with the threshold value $k$ in a set $U$.
- Line 5-6: It checks whether all values in $U$ are comparable to the input value. If they are, it terminates the Classifier procedure and returns the input value as the output value and true as the decision status.
- Line 8-12: It performs classification based on the received values. Let $w$ be the join of all received values associated with the threshold value $k$. If the height of $w$ in lattice $L$ is greater than the threshold value $k$, then the Classifier returns $w$ as the output value, master as the classification result and false as the decision status. Otherwise, it returns the input value as the output value, slave as the classification result and false as the decision status.

From the classification steps, it is easy to see that the processes classified as master have values greater than those classified as slave, because $w$ is the join of all values in $U$.

There are four main differences between the AR-Classifier and our Classifier:
1) The AR-Classifier is based on the shared memory model, whereas our algorithm is based on synchronous message passing.
2) The AR-Classifier does not allow early termination.
3) Each process in the AR-Classifier needs values from all processes, whereas our Classifier uses values only from processes within its group.
4) The AR-Classifier procedure requires the invoking process to read the values of all processes again if the invoking process is classified as master, whereas our algorithm needs to receive values from all processes only once.

3.1.2 Algorithm $LA_\alpha$

Algorithm $LA_\alpha$ (shown in Figure 2) runs in at most $\log h(L)$ rounds. It assumes knowledge of $H = h(L)$, the height of the input lattice. Let $x_i$ denote the initial input value of process $i$, $v_i^r$ denote the value held by process $i$ at the beginning of round $r$, and class denote the classification result of the Classifier procedure. The class indicates whether the process is classified as a master or a slave. The decided variable shows whether the process has decided or not.
Each process $i$ has a label denoted as $l_i$. This label is updated at each round. Processes which have the same label $l$ are said to be in the same group with label $l$. The definitions of label and group are formally given as:

Definition 3 (label). Each process has a label, which serves as a knowledge threshold and is passed as the threshold value $k$ whenever the process calls the Classifier procedure.

Definition 4 (group). A group is a set of processes which have the same label. The label of a group is the label of the processes in this group.

Figure 1: The Classifier procedure

    Classifier(v, k):
        v: input value
        k: threshold value
     1: Send (v, k) to all
     2: Receive messages of the form (-, k)
     3: Let U be values contained in received messages
     4: /* Early Termination */
     5: if $|U| = 0$ or $\forall u \in U : v \le u \lor u \le v$
     6:     return (v, -, true)
     7: /* Classification */
     8: Let $w := \sqcup\{u : u \in U\}$
     9: if $h(w) > k$
    10:     return (w, master, false)
    11: else
    12:     return (v, slave, false)

A process has decided if it has set its decision status to true. Otherwise, it is undecided. At each round $r$, an undecided process invokes the Classifier procedure with its current value and its current label $l_i$ as parameters $v$ and $k$, respectively. Since each process passes its label as the threshold value $k$ when invoking the Classifier procedure, line 2 of the Classifier is equivalent to receiving messages from processes within the same group; that is, at each round, a process performs the Classifier procedure within its group. Processes which are in different groups do not affect each other. At round $r$, by invoking the Classifier procedure, each process $i$ sets $v_i^{r+1}$, class and decided to the returned output value, the classification result and the decision status. Each process first checks the value of decided. If it is true, process $i$ decides on $v_i^{r+1}$ and terminates the algorithm. Otherwise, if it is classified as a master, it increases its label by $\frac{H}{2^{r+1}}$. If it is classified as a slave, it decreases its label by $\frac{H}{2^{r+1}}$.
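
The Classifier procedure and the label update rule described above can be sketched as a small round-by-round simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: values are modeled as nonempty `frozenset`s with union as the join, set cardinality stands in for the height function (so `H` is taken to be the size of the union of all inputs), and a crash is modeled as a sender reaching only some recipients in the round it fails. The names `classifier`, `run_la_alpha` and the `crashes` schedule are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction
from itertools import combinations

def join(values):
    """Join in the lattice of finite sets is set union."""
    out = frozenset()
    for v in values:
        out |= v
    return out

def comparable(u, v):
    return u <= v or v <= u

def height(v):
    # Assumption: cardinality stands in for the height h(v) of a value.
    return len(v)

def classifier(v, k, received):
    """One Classifier call. `received` holds the values of all messages
    tagged with threshold k that reached the caller (its own included)."""
    if all(comparable(v, u) for u in received):
        return v, None, True              # early termination: decide v
    w = join(received)
    if height(w) > k:
        return w, "master", False
    return v, "slave", False

def run_la_alpha(inputs, H, crashes=None):
    """inputs:  {pid: nonempty frozenset}
    crashes: {pid: (round, recipients reached before crashing)}
    Returns ({pid: decided value}, number of rounds used)."""
    crashes = crashes or {}
    value = dict(inputs)
    label = {p: Fraction(H, 2) for p in inputs}   # everyone starts at H/2
    running = set(inputs)                          # alive and undecided
    decided, r = {}, 0
    while running:
        r += 1
        inbox = {p: [] for p in running}
        crashing = {p for p in running if p in crashes and crashes[p][0] == r}
        for sender in running:
            reached = crashes[sender][1] if sender in crashing else running
            for dest in reached:
                # Only messages carrying the receiver's own label are kept.
                if dest in inbox and label[dest] == label[sender]:
                    inbox[dest].append(value[sender])
        running -= crashing
        step = Fraction(H, 2 ** (r + 1))
        for p in list(running):
            value[p], cls, done = classifier(value[p], label[p], inbox[p])
            if done:
                decided[p] = value[p]
                running.discard(p)
            else:
                label[p] += step if cls == "master" else -step
    return decided, r

inputs = {p: frozenset({p}) for p in (1, 2, 3, 4)}
# Processes 3 and 4 crash in round 1 after reaching only process 1.
decided, rounds = run_la_alpha(inputs, H=4, crashes={3: (1, {1}), 4: (1, {1})})
assert all(comparable(a, b) for a, b in combinations(decided.values(), 2))
assert all(inputs[p] <= y for p, y in decided.items())
```

In this run process 1 sees all four inputs and becomes a master, while process 2 sees only two and becomes a slave; in the next round each is alone in its group and decides, and the two decided values are comparable.
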
Now we show how the Classifier procedure combined with this label update mechanism makes any two processes have comparable values at the end. Let $G$ be a group of processes at round $r$. Let $M(G)$ and $S(G)$ be the groups of processes which are classified as master and slave, respectively, when they run the Classifier procedure in group $G$. We say that $G$ is the parent of $M(G)$ and $S(G)$. Thus, $M(G)$ and $S(G)$ are both groups at round $r + 1$. Process $i \in M(G)$ or $i \in S(G)$ indicates that $i$ does not decide in group $G$ at round $r$. Initially, all processes have the same label $\frac{H}{2}$ and are in the same group with label $\frac{H}{2}$. When they execute the Classifier, they will be classified into different groups. We can view the execution as processes traversing a binary tree. Initially, all of them are at the root of the tree. As the program executes, if they are classified as master, they go to the right child. Otherwise, they go to the left child.

Before we prove the correctness of the given algorithm, we first give some useful properties satisfied by the Classifier procedure. Although Lemma 5 is similar to a lemma given in [5], it is discussed here in message passing systems and the proofs are different.

Lemma 5. Let $G$ be a group at round $r$ with label $k$. Let $L$ and $R$ be two nonnegative integers such that $L \le k \le R$. If $L < h(v_i^r) \le R$ for every process $i \in G$, and $h(\sqcup\{v_i^r : i \in G\}) \le R$, then
(p1) for each process $i \in M(G)$, $k < h(v_i^{r+1}) \le R$,
(p2) for each process $i \in S(G)$, $L < h(v_i^{r+1}) \le k$,
(p3) $h(\sqcup\{v_i^{r+1} : i \in M(G)\}) \le R$,
(p4) $h(\sqcup\{v_i^{r+1} : i \in S(G)\}) \le k$, and
(p5) for each process $i \in M(G)$, $v_i^{r+1} \ge \sqcup\{v_i^{r+1} : i \in S(G)\}$.

Proof. (p1)-(p3): Immediate from the Classifier procedure. (p4): Since $S(G)$ is a group of processes which are at round $r + 1$, all processes in $S(G)$ are correct (non-faulty) at round $r$. So, all processes in $S(G)$ must have received the values of each other in the Classifier procedure at round $r$ in group $G$. Thus, $h(\sqcup\{v_i^{r+1} : i \in S(G)\}) \le$
$k$; otherwise all of them would be in group $M(G)$ instead of $S(G)$, according to the condition at line 9 of the Classifier procedure. (p5): Since all processes in $S(G)$ are correct at round $r$, all processes in $M(G)$ must have received the values of all processes in $S(G)$ in the Classifier procedure at round $r$. Any process which proceeds to group $M(G)$ takes the join of all received values at round $r$, according to line 10. Thus, for every process $i \in M(G)$, $v_i^{r+1} \ge \sqcup\{v_i^{r+1} : i \in S(G)\}$. ∎

Lemma 6. Let $x$ be a value from a lattice $L$, and $V$ be a set of values from $L$. Let $U$ be any subset of $V$. If $x$ is comparable with every $v \in V$, then $x$ is comparable with $\sqcup\{u \mid u \in U\}$.

Proof. If $\forall u \in U : u \le x$, then $\sqcup\{u \mid u \in U\} \le x$. Otherwise, $\exists y \in U : x \le y$. Since $y \le \sqcup\{u \mid u \in U\}$, we have $x \le \sqcup\{u \mid u \in U\}$. ∎

Lemma 7. If process $i$ decides at round $r$ on value $y_i$, then $y_i$ is comparable with $v_j^r$ for any correct process $j$.

Proof. Let process $i$ decide in group $G$ at round $r$. Consider the two cases below:

Case 1: $j \notin G$. Let $G'$ be a group at the maximum round $r'$ such that both $i$ and $j$ belong to $G'$. Then, either $i \in M(G') \land j \in S(G')$ or $j \in M(G') \land i \in S(G')$. We only consider the case $i \in M(G') \land j \in S(G')$. The other case is similar. From (p5) of Lemma 5, we have $\sqcup\{v_p^r : p \in S(G')\} \le y_i$. Since $j \in S(G')$, $v_j^r \le \sqcup\{v_p^r : p \in S(G')\}$. Thus, $v_j^r \le y_i$. For the other case, we have $y_i \le v_j^r$. Therefore, $y_i$ is comparable with $v_j^r$.

Case 2: $j \in G$. Since process $j$ is correct, $i$ must have received $v_j^r$ at round $r$. Thus, by line 5 of the Classifier procedure, $y_i$ is comparable with $v_j^r$. ∎

Now we show that any two processes decide on comparable values.

Lemma 8 (Comparability). Let processes $i$ and $j$ decide on $y_i$ and $y_j$, respectively. Then $y_i$ and $y_j$ are comparable.

Proof. Let processes $i$ and $j$ decide at rounds $r_i$ and $r_j$, respectively. Without loss of generality, assume $r_i \le r_j$. At round $r_i$, from Lemma 7 we have that $y_i$ is comparable with $v_k^{r_i}$ for any correct undecided process $k$.
Let $V = \{v_k^{r_i} \mid$ process $k$ is undecided and correct$\}$. Since $r_j \ge r_i$, $y_j$ is at most the join of a subset of $V$. Thus, from Lemma 6, $y_i$ and $y_j$ are comparable. ∎

Now we prove that all processes decide within $\log H + 1$ rounds by showing that all processes in the same group at the beginning of round $\log H + 1$ have equal values, as given by Lemma 9 and Lemma 10. Since Lemma 9 and Lemma 10 and the corresponding proofs are similar to the ones given in [3], the proofs are omitted here and can be found in the full paper. The proof of Lemma 9 is by induction, based on (p1)-(p4) of Lemma 5. The proof of Lemma 10 is based on Lemma 9.

Lemma 9. Let $G$ be a group of processes at round $r$ with label $k$. Then
(1) for each process $i \in G$, $k - \frac{H}{2^r} < h(v_i^r) \le k + \frac{H}{2^r}$, and
(2) $h(\sqcup\{v_i^r : i \in G\}) \le k + \frac{H}{2^r}$.

Lemma 10. Let $i$ and $j$ be two processes that are within the same group $G$ at the beginning of round $r = \log H + 1$. Then $v_i^r$ and $v_j^r$ are equal.

Lemma 11. All processes decide within $\log H + 1$ rounds.

Proof. From Lemma 10, we know that any two processes which are in the same group at the beginning of round $\log H + 1$ have equal values. Then, the condition in line 5 of the Classifier procedure is satisfied. Thus, all undecided processes decide at round $\log H + 1$. ∎

Remark 12. Since at the beginning of round $\log H + 1$ all undecided processes have comparable values, $LA_\alpha$ only needs $\log H$ rounds. For simplicity, one more round is executed to make all processes decide at line 5 of the Classifier procedure.

Theorem 13. Algorithm $LA_\alpha$ solves the lattice agreement problem in $\log H$ rounds and can tolerate $f < n$ failures.

Proof. Downward-Validity follows from the fact that the value of each process is non-decreasing at each round. For Upward-Validity, according to the Classifier procedure, each process either keeps its value unchanged or takes the join of the values proposed by other processes, which can never be greater than $\sqcup\{x_1, ..., x_n\}$.
For Comparability, from Lemma 8, we know that for any two processes $i$ and $j$, if they decide, then their decision values must be comparable. From Lemma 11, we know that all processes decide. Thus, Comparability holds. ∎

Complexity. The time complexity is $\log H$ rounds. For message complexity, since each process sends $n$ messages per round, $\log H$ rounds result in $n^2 \log H$ messages in total. Notice that the number of messages can be further reduced if each process keeps a set of processes which are not in its group. If a process $p$ receives a message from process $q$ with a threshold value different from its own threshold value, it knows that $q$ is not in its group. A process does not send messages to the processes in this set.

Algorithm $LA_\alpha$ runs in $\log h(L)$ rounds by assuming that $h(L)$ is given. However, to know the actual height of the input lattice, we would need to know how many distinct values all processes propose, which requires extra effort. For this reason, in the following sections we introduce algorithms to solve the lattice agreement problem without this assumption.

3.2 Lattice Agreement with Unknown Height

In this section, we consider the standard lattice agreement problem, in which the height of the lattice is not known to any process. We propose an algorithm, $LA_\beta$ (shown in Figure 3), based on algorithm $LA_\alpha$.

3.2.1 Algorithm $LA_\beta$

Algorithm $LA_\beta$ runs in $\log f + 1$ synchronous rounds. It makes use of algorithm $LA_\alpha$ as a building block. Instead of directly agreeing on input values which are taken from a lattice with unknown height, we first do lattice agreement on the failure set that each process knows after one round of broadcast. The set of all failure sets forms a boolean lattice with union as the join operation and with height equal to $f$ (since there are at most $f$ failures). The algorithm consists of two phases. In Phase A, all processes exchange their values. Process $i$ includes $j$ in its failure set if it does not receive a value from process $j$ during this first phase.
After the first phase, each process has a failure set which contains the failed processes it knows of. Then, in Phase B, each process invokes algorithm $LA_\alpha$ with $f$ as the height and its failure set as input. After that, each process decides on a failure set which satisfies the lattice agreement properties. The new failure sets of any two processes $i$ and $j$ are comparable to each other, i.e., $F_i'$ is comparable to $F_j'$. Equipped with this comparable failure set, each process removes the values it received from processes which are in its failure set and decides on the join of the remaining values.

The following lemma shows that any two processes decide on comparable values. We only give a sketch of the proof; the detailed proof is available in the full paper.

Lemma 14 (Comparability). Let processes $i$ and $j$ decide on $y_i$ and $y_j$, respectively. Then $y_i$ and $y_j$ are comparable.

Proof sketch. According to the Comparability of $LA_\alpha$, all processes have comparable failure sets. Then, the sets of values they received at Phase A from correct processes must be comparable, i.e., $C_i$ is comparable with $C_j$. Therefore, $y_i$ and $y_j$ are comparable. ∎

Theorem 15. $LA_\beta$ solves the lattice agreement problem in $\log f + 1$ rounds, where $f < n$ is the maximum number of failures the system can tolerate.

Proof. Downward-Validity: Initially, for a correct process $i$, $v_i = x_i$. After Phase A, since $i$ is correct, $i$ is not in the failure set of any process. At Phase B, process $i$ invokes algorithm $LA_\alpha$ with its failure set as the input value. Thus, according to the Upward-Validity of $LA_\alpha$, $i$ is not included in $F_i'$. So, $x_i \in C_i$. Therefore, $x_i \le y_i$. Upward-Validity is immediate from the fact that each process receives at most all the values of all processes. Comparability follows from Lemma 14. ∎

3.2.2 Algorithm $LA_\gamma$

Algorithm $LA_\beta$ solves lattice agreement in $\log f + 1$ rounds, whereas Algorithm $LA_\alpha$ solves lattice agreement in $\log h(L)$ rounds assuming $h(L)$ is given.
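
For comparison with what follows, the final decision step of $LA_\beta$ can be sketched in a few lines. This is only an illustration under stated assumptions: values are modeled as `frozenset`s with union as the join, the agreed failure sets are supplied by hand instead of being produced by a run of $LA_\alpha$, and the names `join` and `la_beta_decide` are hypothetical.

```python
def join(values):
    """Join of set-valued lattice elements (set union)."""
    out = frozenset()
    for v in values:
        out |= v
    return out

def la_beta_decide(received, agreed_failures):
    """received:        {sender pid: value} collected in Phase A
    agreed_failures: failure set this process obtained from Phase B
    Returns the join of the values whose senders are not in the failure set."""
    return join(v for p, v in received.items() if p not in agreed_failures)

# Process 4 crashed during Phase A and reached only process 1.
received_1 = {1: frozenset("a"), 2: frozenset("b"), 3: frozenset("c"), 4: frozenset("d")}
received_2 = {1: frozenset("a"), 2: frozenset("b"), 3: frozenset("c")}

# Hand-picked comparable failure sets: {} for process 1, {4} for process 2.
y1 = la_beta_decide(received_1, frozenset())
y2 = la_beta_decide(received_2, frozenset({4}))
assert y2 <= y1   # comparable failure sets give comparable outputs
```

A smaller agreed failure set keeps more values, so nested failure sets yield nested outputs, which is the intuition behind the proof sketch of Lemma 14.
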
We now propose an algorithm to solve lattice agreement whose round complexity depends on $h(L)$ even when $h(L)$ is not known. This algorithm, called $LA_\gamma$ (shown in Figure 4), solves the standard lattice agreement problem in $O(\min\{\log^2 h(L), \log^2 f\})$ rounds. The basic idea is to "guess" the height of $L$ and apply algorithm $LA_\alpha$ using the guessed height as input. The algorithm is composed of two phases. In Phase A, each process simply broadcasts its value and takes the join of all received values. Phase B is the guessing phase, which invokes algorithm $LA_\alpha$ repeatedly. Notice that the decided variable is updated at line 6 of $LA_\alpha$.

Let $w_i$ denote the value of $v_i$ after Phase A. Let $\Lambda$ denote the sublattice formed by the values of all correct processes after Phase A, i.e., $\Lambda = \{u \mid (u \in L) \land (\exists i : w_i \le u)\}$. Since there are at most $f$ failures, we have $h(\Lambda) \le f$. We now show that Phase B terminates in at most $\lceil \log h(\Lambda) \rceil$ executions of $LA_\alpha$. We call the $i$-th execution of $LA_\alpha$ iteration $i$. Notice that the guessed height of iteration $i$ is $2^i$.

Lemma 16. After iteration $\lceil \log h(\Lambda) \rceil$ of $LA_\alpha$ in Phase B, all processes decide.

Proof. Since $2^{\lceil \log h(\Lambda) \rceil} \ge h(\Lambda)$, Lemma 9 still holds, which implies Lemma 10. Thus, all undecided processes have equal values at the last round of iteration $\lceil \log h(\Lambda) \rceil$. Therefore, all undecided processes decide after iteration $\lceil \log h(\Lambda) \rceil$. ∎

We now show that two processes decide on comparable values irrespective of whether they decide in the same iteration of $LA_\alpha$.

Lemma 17 (Comparability). Let $i$ and $j$ be any two processes that decide on values $y_i$ and $y_j$, respectively. Then $y_i$ and $y_j$ are comparable.

Proof. Assume process $i$ decides on $G_i$ at round $r_i$ of execution $e_i$ of $LA_\alpha$ and process $j$ decides on $G_j$ at round $r_j$ of execution $e_j$ of $LA_\alpha$. If $e_i = e_j$, then $y_i$ and $y_j$ are comparable by Lemma 8. Otherwise, $e_i \ne e_j$. Without loss of generality, suppose $e_i < e_j$. Consider round $r_i$ of execution $e_i$ of $LA_\alpha$.
Since $i$ decides on value $y_i$ at this round, Lemma 7 implies that $y_i$ is comparable with $v_k^r$ for any correct process $k$. Let $V = \{v_k^r \mid k \text{ is correct}\}$. Then $y_j$ is at most the join of a subset of $V$. From Lemma 6, it follows that $y_i$ is comparable with $y_j$. ∎

Theorem 18. $LA_\gamma$ solves the lattice agreement problem and can tolerate $f < n$ failures.

Proof. Downward-Validity follows from the fact that the value of each process is non-decreasing along the execution. Upward-Validity follows since each process can receive at most all values from all processes. Comparability holds by Lemma 17. ∎

Complexity. From Lemma 16, Phase B terminates in at most $\lceil \log h(\Lambda) \rceil$ executions of $LA_\alpha$. Thus, Phase B takes $\log 2 + \log 4 + \dots + \lceil \log h(\Lambda) \rceil = \frac{(\lceil \log h(\Lambda) \rceil + 1) \cdot \lceil \log h(\Lambda) \rceil}{2}$ rounds in the worst case. Since $h(\Lambda) \le f$ and $h(\Lambda) \le h(L)$, $LA_\gamma$ has round complexity $\min\{O(\log^2 h(L)), O(\log^2 f)\}$. Each process sends $n$ messages in each round, so the message complexity is $n^2 \cdot \min\{O(\log^2 h(L)), O(\log^2 f)\}$.

In this section, we discuss the lattice agreement problem in asynchronous systems. The algorithm proposed in [7] requires $O(n)$ units of time, whereas our algorithm ($LA_\delta$, shown in Figure 5) requires only $O(f)$ units of time. We first note the following.

Theorem 19. The lattice agreement problem cannot be solved in asynchronous message systems if $f \ge n/2$.

Proof. The proof follows from the standard partition argument. If two partitions have incomparable values, then they can never decide on comparable values. ∎

Algorithm $LA_\delta$ for $p_i$ (fragment of Figure 5):

    acceptVal := x_i        // accept value
    learnedVal := ⊥         // learned value

    on receiving prop(v_j, r) from p_j:
        if v_j ≥ acceptVal
            Send ACK("accept", -, r)
            acceptVal := v_j
        else
            Send ACK("reject", acceptVal, r)

4.1 Algorithm $LA_\delta$

On account of Theorem 19, we assume that $f < n/2$. The algorithm proceeds in round-trips. A single round-trip consists of sending messages to all processes and receiving $n - f$ acknowledgement messages back.
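The prop handler from the pseudocode fragment of the asynchronous algorithm can be rendered as a small runnable Python sketch. It assumes, purely for illustration, a lattice of finite sets ordered by inclusion; the class name `Acceptor`, the method name `on_prop`, and the tuple encoding of ACK messages are hypothetical and not taken from the paper.

```python
class Acceptor:
    """Acceptor-side state of one process (illustrative names)."""

    def __init__(self, x):
        # acceptVal starts as the process's own input value.
        self.accept_val = frozenset(x)

    def on_prop(self, v, r):
        """Handle prop(v, r); return the ACK as a (kind, value, round) tuple."""
        v = frozenset(v)
        if v >= self.accept_val:
            # The proposal is at least the current accept value: adopt it.
            self.accept_val = v
            return ("accept", None, r)
        # Otherwise reject and report the current accept value.
        return ("reject", self.accept_val, r)
```

A proposal that contains the current accept value is accepted and adopted; any other proposal is rejected, and the reject ACK carries the accept value so that the proposer can join it into its next proposal.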
At each round-trip, a process sends a prop message to all processes, with its current accepted value as the proposal value, and waits for $n - f$ ACK messages. If a majority of these ACK messages are accept, then it decides on its current proposed value. Otherwise, it updates its current accept value to be the join of all values received and starts the next round-trip. Whenever a process receives a proposal, i.e., a prop message, it compares the proposed value with its current accept value. If the proposed value is at least as big, it sends back an ACK message with accept and updates its accept value to the received proposal value. Otherwise, it sends back an ACK message with reject.

Let $\mathit{acceptVal}_i^r$ denote the accept value (variable acceptVal) held by $p_i$ at the beginning of round-trip $r$. Let $L(r) = \{u \mid (u \in L) \land (\exists i : \mathit{acceptVal}_i^r \le u)\}$, i.e., $L(r)$ denotes the join-closed subset of $L$ that includes the accept values held by all undecided processes at the beginning of round-trip $r$. Notice that $L(1) = L$.

Lemma 20. For any round-trip $r$, $h(L(r+1)) < h(L(r))$.

Proof. If a process decides at round-trip $r$, its value is not in $L(r+1)$. So we only need to prove that $h(\mathit{acceptVal}_i^r) < h(\mathit{acceptVal}_i^{r+1})$ for any process $i$ that does not decide at round-trip $r$. The fact that process $i$ does not decide at round-trip $r$ implies that $i$ must have received at least one reject ACK with a greater value. Since $\mathit{acceptVal}_i^{r+1}$ is the join of all values received at round-trip $r$, we have $\mathit{acceptVal}_i^r < \mathit{acceptVal}_i^{r+1}$. Hence, $h(\mathit{acceptVal}_i^r) < h(\mathit{acceptVal}_i^{r+1})$ for any undecided process $i$. Therefore, $h(L(r+1)) < h(L(r))$. ∎

Lemma 21. All processes decide within $\min\{h(L), f + 1\}$ asynchronous round-trips.

Proof. We first show that $h(L(2)) \le f$. In the first round-trip, each process receives $n - f$ ACKs, which is equivalent to receiving $n - f$ values. Therefore, $h(L(2)) \le f$. Let $r_{min} = \min\{h(L), f + 1\}$. Combining the fact that $h(L(2)) \le f$ with Lemma 20, we have $h(L(r_{min})) \le 1$. This means that undecided correct processes have the same value.
Thus, all of them receive $n - f$ ACK messages with accept and decide. Therefore, all processes decide within $\min\{h(L), f + 1\}$ round-trips. ∎

We note here that the algorithm in [7] takes $O(n)$ message delays for a value to be learned in the worst case. A crucial difference between $LA_\delta$ and the algorithm in [7] is that in $LA_\delta$ each process initializes its accepted value to its input value. Hence, after the first round-trip, the height of the sublattice drops significantly, from $n$ initially (in the worst case) to $f$. In [7], acceptors start with a null accepted value, so the height is reduced by only 1 in the worst case. Since acceptors in their algorithm are separate from proposers (in the style of Paxos), acceptors do not have access to the proposed values.

Theorem 22. Algorithm $LA_\delta$ solves the lattice agreement problem in $\min\{h(L), f + 1\}$ round-trips.

Proof. Downward-Validity holds since the accept value is non-decreasing for any process $i$. Upward-Validity follows because each learned value must be the join of a subset of all initial values, which is at most $\sqcup\{x_1, \dots, x_n\}$. For Comparability, suppose processes $i$ and $j$ decide on values $y_i$ and $y_j$. There must be at least one process that has accepted both $y_i$ and $y_j$. Since each process accepts only comparable values, we have either $y_i \le y_j$ or $y_j \le y_i$. ∎

Complexity. From Lemma 21, $LA_\delta$ takes at most $\min\{h(L), f + 1\}$ round-trips, which results in $2 \cdot \min\{h(L), f + 1\}$ message delays, since one round-trip takes two message delays. In each round-trip, each process sends at most $2n$ messages. Thus, the number of messages for all processes is at most $2 \cdot n^2 \cdot \min\{h(L), f + 1\}$.

5 Generalized Lattice Agreement

In this section, we discuss the generalized lattice agreement problem as defined in Section 2.3. Since it is easy to adapt the synchronous lattice agreement algorithms to solve the generalized lattice agreement problem, we consider only asynchronous systems. We show how to adapt $LA_\delta$
to solve the generalized lattice agreement problem (algorithm $GLA_\alpha$, shown in Figure 6) in $\min\{O(h(L)), O(f)\}$ units of time.

5.1 Algorithm $GLA_\alpha$

$GLA_\alpha$ invokes the Agree() procedure multiple times, each time to learn a new value. The Agree() procedure is an execution of $LA_\delta$ with some modifications (given later). A sequence number is associated with each execution of the Agree() procedure, so each correct process has a learned value for each sequence number. The basic idea of $GLA_\alpha$ is to let all processes sequentially execute $LA_\delta$ to learn values, and to make sure that 1) any two learned values for the same sequence number are comparable, and 2) any learned value for a bigger sequence number is at least as big as any learned value for a smaller sequence number. The first goal can be achieved simply by invoking $LA_\delta$ with the sequence number. To achieve the second goal, the key idea is to make any proposal for sequence number $s + 1$ at least as big as the largest learned value for sequence number $s$. Notice that in each round-trip of an $LA_\delta$ execution, a process waits for $n - f$ ACKs, and any two sets of $n - f$ processes have at least one process in common. Thus, the second goal can be achieved by making sure that at least $n - f$ processes know the largest learned value after the execution of $LA_\delta$ for a sequence number.

Upon receiving a value $v$ from a client in a message tagged with ClientValue, a process adds $v$ to its buffer and sends a ServerValue message with $v$ to all other processes. The process can start to learn new values only when it succeeds at its current proposal. Otherwise, $LA_\delta$ may not terminate, as shown by an example in [7]. Upon receiving a ServerValue message with value $v$, a process simply adds $v$ to its buffer. The Agree() procedure is executed automatically when its guard condition is satisfied; that is, when the process is not currently proposing a value and either has some value in its buffer or has seen a sequence number bigger than its current sequence number.
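The quorum-intersection fact used in this section (any two sets of $n - f$ processes share at least one process) follows from a short counting argument, given the assumption $f < n/2$ from Section 4.1. The set names $A$ and $B$ are introduced here only for illustration; they denote any two sets of $n - f$ processes out of the $n$ processes in the system.

```latex
\begin{align*}
|A \cap B| &= |A| + |B| - |A \cup B| \\
           &\ge (n - f) + (n - f) - n \\
           &= n - 2f \\
           &> 0 \qquad \text{since } f < n/2 .
\end{align*}
```

So a process that collects $n - f$ ACKs for sequence number $s + 1$ always hears from at least one process that was among any given set of $n - f$ processes for sequence number $s$.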
Inside the Agree() procedure, a process first updates its acceptVal to be the join of its current acceptVal and buffVal. Then it starts an adapted $LA_\delta$ execution. The original $LA_\delta$ and the adapted $LA_\delta$ differ in the following ways: 1) Each message in the adapted $LA_\delta$ is associated with a sequence number. 2) A process can also decide on a value for a sequence number if it receives any decide ACK message for that sequence number. 3) On receiving a prop message associated with a sequence number $s'$, if $s'$ is smaller than its current sequence number, which means it has already learned a value for $s'$, then it simply sends back an ACK message with its learned value for $s'$. If $s'$ is greater than its current sequence number, it updates its maxSeq and waits until its current sequence number matches $s'$. After that, it sends back an ACK message with accept or reject, depending on whether the proposal value is at least as big as its current accept value.

The reason a process keeps track of the maximum sequence number it has seen is to make sure that each process has a learned value for each sequence number. When the maximum sequence number is bigger than its current sequence number, the process has to invoke the Agree() procedure even if it does not have any new value to propose. After the execution of the adapted $LA_\delta$, a process increments its current sequence number.

We next show the correctness of $GLA_\alpha$. Let $\mathit{acceptVal}_p^s$ denote the acceptVal of process $p$ at the end of the Agree() procedure for sequence number $s$. Let $LV_p$ denote the map from sequence numbers to learned values (variable LV) for process $p$, and let $m_s = \sqcup\{LV_p[s] : p \in [1..n]\}$, i.e., $m_s$ denotes the join of all learned values for sequence number $s$. Let $LP_s = \{p \mid (p \in [1..n]) \land (m_s \le \mathit{acceptVal}_p^s)\}$, i.e., $LP_s$ is the set of processes whose acceptVal is at least the join of all learned values for sequence number $s$. Notice that a process has two ways to learn a value for its current sequence number in the Agree() procedure: 1) by receiving a majority of accept ACKs.
2) by receiving some decide ACKs. The following lemma proves that the adapted $LA_\delta$ satisfies the first goal.

Lemma 23. For any sequence number $s$, $LV_p[s]$ is comparable with $LV_q[s]$ for any two processes $p$ and $q$.

Proof. We only need to show that any two processes that learn in the first way learn comparable values, since processes that learn in the second way simply adopt values from processes that learned in the first way. This follows by the same reasoning as for Comparability in Theorem 22. ∎

From Lemma 23, we know that $m_s$ is the largest learned value for sequence number $s$.

Lemma 24. For any sequence number $s$, $|LP_s| > n/2$.

Proof. Consider the Agree() procedure for $s$. Since $m_s$ is the largest learned value for sequence number $s$, there must exist a process $p$ that learns $m_s$ in the first way. Thus, $p$ must have received a majority of accept ACKs, which means that at least a majority of processes have acceptVal at least $m_s$ after the Agree() procedure for $s$. Therefore, $|LP_s| > n/2$. ∎

Algorithm $GLA_\alpha$ for $p_i$ (fragment of Figure 6):

    s := 0              // sequence number
    maxSeq := -1        // max seq number seen
    buffVal := ⊥        // received values
    /* map from seq to learned value */
    LV := ⊥
    acceptVal := ⊥
    active := false

    on receiving ClientValue(v):
        buffVal := buffVal ⊔ v
        Send ServerValue(v) to all

    on receiving ServerValue(v):
        buffVal := buffVal ⊔ v

    on receiving prop(v_j, r, s') from p_j:
        if s' < s
            Send ACK("decide", LV[s'], r, s')
            break
        if s' > s
            maxSeq := max{s', maxSeq}
            wait until s = s'
        if v_j ≥ acceptVal
            Send ACK("accept", -, r, s')
            acceptVal := v_j
        else
            Send ACK("reject", acceptVal, r, s')

The lemma below shows that $GLA_\alpha$ achieves the second goal.

Lemma 25. $m_s \le LV_p[s + 1]$ for any process $p$ and any sequence number $s$.

Proof. From Lemma 24, for sequence number $s$ at least a majority of processes have acceptVal at least $m_s$. To decide on $LV_p[s + 1]$, process $p$ must receive a majority of accept ACKs. Since any two majorities have at least one process in common, $m_s \le LV_p[s + 1]$. ∎

Theorem 26. Algorithm $GLA_\alpha$
solves generalized lattice agreement when a majority of processes is correct.

Proof. Validity holds since any learned value is the join of a subset of the values received. Stability. From Lemma 25 and the fact that $LV_p[s] \le m_s$, we have $LV_p[s] \le LV_p[s + 1]$ for any process $p$ and any sequence number $s$, which implies Stability. Comparability. We need to show that $LV_p[s]$ and $LV_q[s']$ are comparable for any two processes $p$ and $q$ and for any two sequence numbers $s$ and $s'$. If $s = s'$, this is immediate from Lemma 23. Now consider the case $s \ne s'$. Without loss of generality, assume $s < s'$. From Lemma 25, we can conclude that $LV_p[s] \le LV_q[s']$. Thus, Comparability holds. Liveness. Any received value $v$ is eventually included in some proposal, i.e., a prop message. From Theorem 22, within at most $2 \cdot \min\{h(L), f + 1\}$ message delays that proposal value is included in some learned value. Thus, $v$ is eventually learned. ∎

Complexity. For time complexity, the liveness analysis in Theorem 26 shows that a received value is learned in at most $2 \cdot \min\{h(L), f + 1\}$ message delays. For message complexity, since each process sends out $n$ messages per round-trip, the total number of messages needed to learn a value is $2 \cdot n^2 \cdot \min\{h(L), f + 1\}$.

6 Conclusions

We have presented algorithms for the lattice agreement problem and the generalized lattice agreement problem. These algorithms achieve significantly better time complexity than previous algorithms. For future work, we would like to answer the following two questions: 1) Is $\log f$ rounds a lower bound for lattice agreement in synchronous message passing systems? 2) Is $O(f)$ message delays optimal for the lattice agreement and generalized lattice agreement problems in asynchronous message passing systems?

References

[1] Atomic snapshots of shared memory. Journal of the ACM (JACM), 40(4):873-890, 1993.
[2] Hagit Attiya, Maurice Herlihy, and Ophir Rachman. Atomic snapshots using lattice agreement. Distributed Computing, 8(3):121-132, 1995.
[3] Hagit Attiya and Ophir Rachman. Atomic snapshots in O(n log n) operations. SIAM Journal on Computing, 27(2):319-340, 1998.
[4] Hagit Attiya and Jennifer Welch. Distributed computing: fundamentals, simulations, and advanced topics, volume 19. John Wiley & Sons, 2004.
[5] Carole Delporte-Gallet, Hugues Fauconnier, Sergio Rajsbaum, and Michel Raynal. Implementing snapshot objects on top of crash-prone asynchronous message-passing systems. In International Conference on Algorithms and Architectures for Parallel Processing, pages 341-355. Springer, 2016.
[6] SIAM Journal on Computing, 12(4):656-666, 1983.
[7] Generalized lattice agreement. In Proceedings of the 2012 ACM symposium on Principles of distributed computing, pages 125-134. ACM, 2012.
[8] Michael J Fischer, Nancy A Lynch, and Michael S Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2):374-382, 1985.
[9] Maurice P Herlihy and Jeannette M Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(3):463-492, 1990.
[10] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133-169, 1998.
[11] Leslie Lamport et al. Paxos made simple. ACM Sigact News, 32(4):18-25, 2001.
[12] Marios Mavronicolas. A bound on the rounds to reach lattice agreement, 2000. URL: http://www.cs.ucy.ac.cy/~mavronic/pdf/lattice.pdf.
[13] Michel Raynal. Concurrent programming: algorithms, principles, and foundations. Springer Science & Business Media, 2012.
[14] Fred B Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys (CSUR), 22(4):299-319, 1990.
[15] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Conflict-free replicated data types. In Symposium on Self-Stabilizing Systems, pages 386-400. Springer, 2011.
[16] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Convergent and commutative replicated data types. Bulletin-European Association for Theoretical Computer Science, 104:67-88, 2011.
[17] Andrew S Tanenbaum and Maarten Van Steen. Distributed systems: principles and paradigms. Prentice-Hall, 2007.
[18] Gadi Taubenfeld. Synchronization algorithms and concurrent programming. Pearson Education, 2006.



Xiong Zheng, Changyong Hu, Vijay K. Garg. Lattice Agreement in Message Passing Systems, LIPICS - Leibniz International Proceedings in Informatics, 2018, 41:1-41:17, DOI: 10.4230/LIPIcs.DISC.2018.41