
[authority] Batch crash robustness #714

Closed · wants to merge 7 commits from the batch-crash-robustness branch

Conversation

@gdanezis (Collaborator) commented Mar 9, 2022

Following a conversation with @lxfind, this PR attempts to make the interaction between AuthorityState and BatchManager more robust to crashes on either side.

What is the problem:

In the current design the AuthorityStore atomically picks a consecutive number for each transaction, and then non-atomically sends it, along with the digest, over an mpsc channel to the Batch manager. Since multiple threads may operate on the AuthorityState in parallel on different tokio tasks, the sequence numbers may not arrive in order. To cope with this, the Batch manager waits until a complete, in-order sequence has been received before sequencing and batching.

However, what if an AuthorityState task crashes? Then the sequence number will not be sent to the Batch Manager. That may produce a gap in the sequence, resulting in the Batch manager buffering forever, waiting for a missing sequence number that will never arrive. This is bad.

This PR's solution

We augment the sender end of the mpsc channel with logic that allows a thread to request a ticket -- assigning it a sequence number -- and then send a value (the transaction digest) to the batch manager using this ticket. The twist is that if the ticket is dropped -- for example because an Err is returned, or a panic occurs within a tokio task -- the ticket's Drop logic notifies the sender that this sequence number will never be sent, and unblocks subsequent sends.
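
As a rough illustration of this ticketing scheme, here is a minimal, hand-written sketch. It is not the code in this PR: the names (AutoIncSender, Ticket, waiting, next_expected_sequence_number) only loosely mirror the fragments quoted later in the thread, it uses a std Mutex plus a tokio unbounded channel, and error handling is elided.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::mpsc::UnboundedSender;

// Shared state behind the sender end of the channel.
struct Inner<T> {
    next_sequence_number: u64,          // next ticket number to hand out
    next_expected_sequence_number: u64, // next number the channel still waits for
    waiting: HashMap<u64, Option<T>>,   // Some(item) = redeemed, None = dropped ticket
    sender: UnboundedSender<(u64, T)>,
}

struct AutoIncSender<T>(Arc<Mutex<Inner<T>>>);

impl<T> Clone for AutoIncSender<T> {
    fn clone(&self) -> Self {
        AutoIncSender(self.0.clone())
    }
}

// A reserved sequence number that must be redeemed (send) or forfeited (Drop).
struct Ticket<T> {
    sequence_number: u64,
    sent: bool,
    autoinc_sender: AutoIncSender<T>,
}

impl<T> AutoIncSender<T> {
    fn next_ticket(&self) -> Ticket<T> {
        let mut inner = self.0.lock().unwrap();
        let sequence_number = inner.next_sequence_number;
        inner.next_sequence_number += 1;
        drop(inner);
        Ticket { sequence_number, sent: false, autoinc_sender: self.clone() }
    }
}

impl<T> Ticket<T> {
    // Redeem the ticket: the item is released to the channel as soon as every
    // earlier sequence number has been redeemed or forfeited.
    fn send(mut self, item: T) {
        self.sent = true;
        let mut inner = self.autoinc_sender.0.lock().unwrap();
        inner.waiting.insert(self.sequence_number, Some(item));
        flush_in_order(&mut inner);
    }
}

impl<T> Drop for Ticket<T> {
    // If the task returns Err or panics before redeeming, mark the number as a
    // permanent gap so later sequence numbers are not blocked behind it.
    fn drop(&mut self) {
        if !self.sent {
            let mut inner = self.autoinc_sender.0.lock().unwrap();
            inner.waiting.insert(self.sequence_number, None);
            flush_in_order(&mut inner);
        }
    }
}

// Push every in-order item to the channel, skipping forfeited numbers.
fn flush_in_order<T>(inner: &mut Inner<T>) {
    while let Some(item_opt) = inner.waiting.remove(&inner.next_expected_sequence_number) {
        if let Some(item) = item_opt {
            // Ignore send errors: if the BatchManager is gone, recovery from
            // the DB will rebuild the batches anyway.
            let _ = inner.sender.send((inner.next_expected_sequence_number, item));
        }
        inner.next_expected_sequence_number += 1;
    }
}
```

The key property is that a ticket can leave the critical section in exactly two ways, send or Drop, so next_expected_sequence_number always advances.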

As a result we no longer expect consecutive sequence numbers (when there are crashes), but we do expect them to be increasing. The receiver on the Batch manager side does not have to do anything fancy to wait for older sequence numbers; the output of the channel can be directly sequenced and broadcast.

When the Batch manager task crashes, batching and notifications stop. However, the transactions and sequence numbers are still written to the DB, ready for batching upon recovery.

Remaining issues (fixed)

There is still a window of vulnerability: if the authority crashes between successfully writing to the DB and signalling the batch manager, the batch manager will miss that update forever. In this case the transaction has been written to storage, so we can include some logic in the batch manager to recover: when there is a gap in the sequence numbers, the batch manager can attempt to read the missing transactions from the DB. Those found are sequenced; the others are skipped. This slower path of reading from the DB should be activated infrequently (we do not expect the authority state to crash all the time).
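
A minimal sketch of that slow path, with the store's sequenced-transaction table modelled as an in-memory map; the real code iterates the DB with skip_to / take_while, as in the fragments quoted further down, and the names (recover_gap, Digest) are purely illustrative.

```rust
use std::collections::BTreeMap;

// Stand-in for TransactionDigest, purely for illustration.
type Digest = [u8; 32];

// When the BatchManager receives `seq` but expected `next_expected`, everything
// in between either made it to disk (sequence it now) or was never written
// (skip it for good).
fn recover_gap(
    db_sequence: &BTreeMap<u64, Digest>,    // stand-in for the sequenced-Tx table
    current_batch: &mut Vec<(u64, Digest)>, // batch under construction
    next_expected: u64,
    seq: u64,
) {
    for (&store_seq, &store_digest) in db_sequence.range(next_expected..seq) {
        current_batch.push((store_seq, store_digest));
    }
}
```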

Potential improvements

  • We could use an AtomicU64 to assign tickets, but probably not to protect the sending logic.
  • If the number being sent or dropped is the next expected one, we do not need to store the value in the HashMap.
  • Since we already store values in the HashMap, we could do away with the channel entirely, implement a recv on the sender side, and wait on that. This involves quasi-manual Futures coding, which is not for the faint-hearted.

// Add to batch and broadcast
current_batch.push((seq, tx_digest));
Contributor:

Does this guarantee that the output transactions' sequence numbers always increase?
I imagine it should also be common for the authority to send transactions to the batch out of order (i.e. seq < next_expected_sequence_number).
I was thinking of keeping the loose_transactions, but making progress when its size goes beyond a threshold.

@gdanezis (Collaborator, author):

Now the autoinc sender ensures the sequence numbers are sent down the channel in order, so we do not need this tricky logic in the batch maker, aside from the cases where the AuthorityState may have crashed and there is a genuine gap.

@gdanezis force-pushed the batch-crash-robustness branch from 5708153 to df9980f on March 10, 2022, 12:05.
@huitseeker (Contributor) left a comment:

Before I dive into the code, let me check if I understand the design:

  1. the batch manager has a contract of in-order delivery, and does head-of-line blocking, is that correct?
  2. since the sequence numbers are somewhat idiosyncratic (they're not reproducible across authorities), and the DB writes are not coerced to follow a meaningful (causal) order (as the header of this PR explains) is the in-order delivery important in the first place?
  3. in case of a task panic, the batch manager may wait forever for a specific ticket number that may never arrive, and that's the core of what we're trying to fix, is that correct?
  4. this PR makes the assumption that what may lead to a crash is around the DB write, and so tries to register a sequence # which registration cancels on a panic (through a drop), is that correct?
  5. compared to the somewhat simple classic answer to this problem (a watermark), this offers a better latency on crashes. Is latency important here?

cc. @velvia for relevant expertise, @laura-makdah for impact on / obsoletion by #382

@lxfind (Contributor) left a comment:

LGTM!

@lanvidr (Contributor) left a comment:

To answer @huitseeker on the impact on the resilient components: I see this as a good candidate to use them, and we can fix any sequence gaps in handle_irrecoverable.

if !self.sent {
    let mut aic = self.autoinc_sender.lock();
    if aic.sender.is_closed() {
        return;
Contributor:

Do we need to unlock here before returning?

@gdanezis (Collaborator, author):

Upon return, the lock guard falls out of scope and the mutex is unlocked automatically -- this is the flip side of RAII: dropping a resource frees it, exactly in the same way that dropping the ticket performs the cleanup.
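
For readers less familiar with Rust, a toy illustration of that point (std::sync::Mutex with unwrap here; the quoted fragment appears to use a non-poisoning lock, hence no unwrap):

```rust
use std::sync::Mutex;

fn observe(counter: &Mutex<u64>) -> Option<u64> {
    let guard = counter.lock().unwrap();
    if *guard == 0 {
        return None; // `guard` is dropped here; the mutex unlocks automatically
    }
    Some(*guard)     // ...and likewise here, at the end of the scope
}
```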

Contributor:

Locks drop when they go out of scope.

.skip_to(&next_expected_sequence_number)?
.take_while(|(store_seq, _)| store_seq < &seq)
.for_each(|(store_seq, store_digest)|
{
Contributor:

What do we do if we can't recover the Tx, but the store is not dead?

@gdanezis (Collaborator, author):

Ah, that means that no Tx was written to the store, i.e. the task processing the transaction acquired a ticket and then crashed before the database commit. In that case there will simply be a gap in the batch sequence, and life goes on.

@huitseeker (Contributor) commented Mar 12, 2022:

We need to adapt #714 batches in some way similar to #798 batches to better convey intent: otherwise the client might keep requesting a seq num they feel we signed off on (through considering the batch as a commitment) without the authority being able to serve it.

@lanvidr (Contributor) commented Mar 10, 2022:

@gdanezis If I take a step back, might I ask why we need sequenced batches?

@gdanezis (Collaborator, author):

@gdanezis If I take a step back, might I ask why we need sequenced batches?

Sure, this is to support the follower functionality, namely when a client wants a list of all transactions that have been processed by an authority. Batching them allows the authority to sign them, and sequencing the transactions allows clients to request historical ranges of the sequence.

@gdanezis (Collaborator, author):

  1. the batch manager has a contract of in-order delivery, and does head-of-line blocking, is that correct?

In the current PR the batch manager both receives transactions in increasing sequence-number order (managed by the autoinc sender) and emits them in increasing sequence-number order. However, gaps are allowed where crashes may have occurred.

  2. since the sequence numbers are somewhat idiosyncratic (they're not reproducible across authorities), and the DB writes are not coerced to follow a meaningful (causal) order (as the header of this PR explains) is the in-order delivery important in the first place?

I think the sequence numbers do provide a causal sequence, i.e. a follower executing them in this order will be able to update their DB without ever needing to re-order. The reason is that transaction execution, which drives the acquisition of tickets, is causal. But it is something we need to re-check.

  3. in case of a task panic, the batch manager may wait forever for a specific ticket number that may never arrive, and that's the core of what we're trying to fix, is that correct?

That was the bug this PR is fixing. Now the Batch manager will not wait forever any more.

  4. this PR makes the assumption that what may lead to a crash is around the DB write, and so tries to register a sequence # which registration cancels on a panic (through a drop), is that correct?

This is not an assumption, it is the critical region where crashes matter. We are trying to keep two structures consistent: (1) the transactions written to disk in parallel, each with a sequence number, and (2) the sequence we emit as part of batching. If a crash happens before we assign a unique sequence number, really no biggie, as nothing has to happen. The opportunity for inconsistency occurs if there is a crash between Get Ticket and DB write, or between DB write and send update. After the DB write has happened and the update has been sent / the batch generated, there is again no opportunity for inconsistency.
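
To make that critical region concrete, here is a hedged sketch of the write path. It reuses the AutoIncSender / Ticket names from the sketch in the PR description above, and write_effects_to_db is a hypothetical stand-in, not an API in this PR.

```rust
// Hypothetical stand-in for the real AuthorityStore write.
async fn write_effects_to_db(_digest: [u8; 32]) -> Result<(), std::io::Error> {
    Ok(())
}

// The critical region runs from (1) to (3).
async fn process_certificate(
    autoinc: AutoIncSender<[u8; 32]>,
    digest: [u8; 32],
) -> Result<(), std::io::Error> {
    // (1) Get Ticket: reserve a unique, increasing sequence number.
    let ticket = autoinc.next_ticket();

    // (2) DB write: an Err here (or a panic in this task) drops `ticket`
    //     unredeemed, which records a permanent gap and unblocks later numbers.
    write_effects_to_db(digest).await?;

    // (3) Redeem: only after a successful write does the BatchManager hear the
    //     sequence number. A crash between (2) and (3) is the remaining window
    //     covered by the gap-recovery read from the DB described above.
    ticket.send(digest);
    Ok(())
}
```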

  5. compared to the somewhat simple classic answer to this problem (a watermark), this offers a better latency on crashes. Is latency important here?

I have no idea what a 'watermark' is in this context, so I guess it cannot be as simple. Do you mean you keep the sequence number on the sending end, and on the receiving end? If so, yes, this is exactly what we are doing here.

@huitseeker (Contributor) left a comment:

Thanks a lot for the explanation @gdanezis . I think this is suboptimal, in that it's still affected by stragglers, for no good reason.

Stragglers

If the DB is ticketed to write TXes 42, 43, 44 concurrently and writing 42 takes 5x as much time as writing any of the others (1 unit of latency), the time to observe 44 is at least 5 units.

Yet 43, 44, may be transactions completely unrelated to 42, and it may be valuable to see them before 42. We could theoretically observe them in 1 unit.

Edit:

Dear reader, you might want to jump ahead to my realization, further down, of what this consistency logic aims to do.

Higher level

  • Within a small set of interdependent transactions, it's likely the causal order is a total order, i.e. there's just one way to execute them in order,
  • but if we consider a larger set of transactions, e.g. what the node is executing over an hour, there isn't always a precedence link between any two transactions, they may stand on incomparable strands of a lattice, or in family terms, be incomparable siblings of a distant common parent (something we touched on a while ago),
  • in so far as this is trying to keep the linear numbering scheme in the DB and the observed one in the batch manager the same, the PR is definitely getting at a more robust solution to the problem,
  • but is it meaningful to try to keep those numberings, and the order of delivery the same?
  • since the Authority Store numbering follows execution, it will represent one linearization of the many possible valid (causal) execution orders,
  • if we just "let the DB crash" and perform no reordering or wait whatsoever at the Batch sender, what happens?
  • it seems to me there's no non-causal reordering due to concurrency,

Property I'm assuming

  • We have a pre-condition on the DB read: if TX2 is a direct dependent of TX1, we cannot execute TX2 (and a fortiori, write it or batch it) before the effects of TX1 have been successfully written to the DB,
  • seq_num assignments are in strictly increasing order,
  • channels conserve sending order,

Consequences

  • if a TX registers number k with the store, any of its (transitive) dependencies has been successfully written to the store with a number j < k,
  • consequence: for a TX registering number k with the store to be sent on the batch channel earlier than any of its dependencies TXj, j < k, TXk would need to be written and sent to the channel faster than it takes to just send TXj over (since that's the last remaining thing to do after TXj is readable),
  • since this is implausible, the channel output order respects causal order in all plausible concurrent executions.

Example

  • the authority executes TX2, the store assigns seq_num 42 for it,
  • the authority tries to write TX2,
  • before that completes, the TXes at seq_num 43, 44 go through to the BatchManager
  • the write crashes,
  • TX3 comes in with a dependency on TX2,
  • the authority fetches its parent TX2 (from the gateway) and reexecutes TX2,
  • TXes [46, 48, 47, 49, 51, 50, 52, 53, 55, 54] go through to the BatchManager in the meanwhile, in this order,
  • the authority assigns seq_num 56 for TX2,
  • the write succeeds, TX2 is sent at 56 to the BatchManager,
  • TXes [58, 57] go through in this order,
  • the authority processes TX3, assigns seq_num 59 for it,
  • the write succeeds, TX3 is sent at 59 to the BatchManager

UX issues

Seq_nums can totally be jumbled on the output of the BatchManager. My point is that this only happens for causally non-comparable TXes (if two TXes are related by the partial causal order, they will be {executed + written} one after the other), and that the fix is entirely contained on the receiving end: periodically re-order the items received out of the channel that are older than a moving time deadline, aka a watermark (equivalently, that have been received more than ~5s ago), and expect this re-ordering to manifest the final order.
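
A minimal sketch of that watermark-style re-ordering on the receiving end; the names, the deadline value, and the digest type are illustrative, not part of this PR.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

// Buffer out-of-order (seq, digest) pairs briefly; emit, in sequence order,
// whatever has waited longer than the watermark deadline.
struct ReorderBuffer {
    pending: BTreeMap<u64, ([u8; 32], Instant)>,
    deadline: Duration, // e.g. ~5s, per the comment above
}

impl ReorderBuffer {
    fn new(deadline: Duration) -> Self {
        ReorderBuffer { pending: BTreeMap::new(), deadline }
    }

    fn push(&mut self, seq: u64, digest: [u8; 32]) {
        self.pending.insert(seq, (digest, Instant::now()));
    }

    // Called periodically: drain every entry older than the deadline, in
    // sequence order (BTreeMap iteration order), leaving younger ones buffered.
    fn flush(&mut self, now: Instant) -> Vec<(u64, [u8; 32])> {
        let expired: Vec<u64> = self
            .pending
            .iter()
            .filter(|(_, (_, arrived))| now.duration_since(*arrived) >= self.deadline)
            .map(|(&seq, _)| seq)
            .collect();
        expired
            .into_iter()
            .map(|seq| (seq, self.pending.remove(&seq).unwrap().0))
            .collect()
    }
}
```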

Or the end-user of the Batch information (e.g. an Explorer) may just consider the seq_nums to be information that is only useful when speaking to that specific authority, and, for the purposes of talking to its own end-user (e.g. somebody outside the chain), re-number the TXes in exactly the reception order. I think that order will respect all the partial causal relationships.

What I'm suggesting we do

Remove the head of line blocking at the BatchManager, just send through whatever we receive.


@gdanezis (Collaborator, author):

If the DB is ticketed to write TXes 42, 43, 44 concurrently and writing 42 takes 5x as much time as writing any of the others (1 unit of latency), the time to observe 44 is at least 5 units.

At the point where the ticketing business happens, the execution has already been performed, and all that is left is writing effects to the DB. Of course a variable amount of data may lead to variable write times here, but just to point this out.

if we just "let the DB crash" and perform no reordering or wait whatsoever at the Batch sender, what happens?

It depends on the nature of the crash. If the crash takes down all future tasks, i.e. the DB has really died for good and we are going to restart the authority (with a bigger HD), then nothing bad happens: the recovery logic will just make a new batch out of all the trailing transactions not yet in a batch.

However, if there is a crash between a DB write and the digest being sent through the channel to the batch manager, then this update may never be included in a batch, which means that the sequence of batches no longer represents the sequence that re-creates the DB of the authority. This is inconsistent state.

@huitseeker (Contributor):

However, if there is a crash between a DB write and the digest being sent through the channel to the batch manager, then this update may never be included in a batch, which means that the sequence of batches no longer represents the sequence that re-creates the DB of the authority. This is inconsistent state.

Right, but I'm confused as to how this came about: the current PR does not mention or claim anything about recovery logic. It's just about not letting crashes block anything else. Or did I miss anything?

When I said "let the DB crash and perform no reordering or wait whatsoever at the Batch sender", I meant that to me this behaves just as "good" as the ticket-based reordering, for some priors on "good". But I may have missed something about why the re-ordering is important?

@gdanezis (Collaborator, author):

@huitseeker I am a great fan of simpler solutions and less code. If you think there is a simpler design for the Batch Manager, do suggest a PR.

@huitseeker (Contributor):

@gdanezis see #798


-) -> (BatchSender, BatchManager, BroadcastPair) {
-    let (tx_send, tx_recv) = channel(capacity);
+) -> Result<(BatchSender, BatchManager, BroadcastPair), SuiError> {
+    let (tx_send, tx_recv) = unbounded_channel();
@huitseeker (Contributor) commented Mar 12, 2022:

We can't allow a crash to occur between a successful TX write and the ticket send for it, because we'd be toast anyway: the batch goes out w/o the TX, yet the TX's sequence was not after the end of the last batch (cost of recovery is O(N)). One leading reason this might happen is that one of the channels on which the ticket redemption flows is out of capacity: ticket sending would fail.

This forces the use of an unbounded channel at the AutoIncSender and in the BatchManager, and of an unbounded waiting table at the AutoIncSender, to push back ticket redemption failure at the threshold of whole-process failure.

This means that tickets are Kryptonite: if anyone manages to put any one of them in a long-running task (where they will be stored but not de-allocated), this will crash the whole node through an OOM, irrecoverably.

Could we put those tickets on a reasonable timeout somehow? (e.g. forcibly flushing the waiting past a max size) I'm way more comfortable with an O(N) recovery than I am with a full node crash (no more TX voting!) because of some auxiliary read functionality.
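
A sketch of the "forcibly flush past a max size" option. The cap, the helper name, and the Digest type are illustrative, and it assumes the waiting table is an ordered map (a BTreeMap rather than the PR's HashMap) so the lowest buffered number is cheap to find; a complete version would also discard late redemptions that fall below the advanced watermark.

```rust
use std::collections::BTreeMap;

type Digest = [u8; 32];

const MAX_WAITING: usize = 100_000; // illustrative cap, not a value from this PR

// If the waiting table grows past the cap, give up on the sequence numbers that
// block its head: advance next_expected to the lowest buffered entry so the
// normal in-order drain can resume, turning the skipped numbers into gaps.
fn force_flush_waiting(waiting: &BTreeMap<u64, Option<Digest>>, next_expected: &mut u64) {
    if waiting.len() > MAX_WAITING {
        if let Some((&lowest_buffered, _)) = waiting.iter().next() {
            // The numbers in (next_expected..lowest_buffered) belong to tickets
            // that were issued but never redeemed or dropped (e.g. parked in a
            // long-running task); they become permanent gaps.
            *next_expected = (*next_expected).max(lowest_buffered);
        }
    }
}
```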

Comment on lines +18 to +32
while let Some(item_opt) = self.waiting.remove(&self.next_expected_sequence_number) {
    if let Some(item) = item_opt {
        if let Err(_err) = self.sender.send((self.next_expected_sequence_number, item)) {
            /*
                An error here indicates the other side of the channel is closed.
                There is not very much we can do: if the batcher is closed we
                will write to the DB and then recover when we restart.
            */
            self.waiting.clear();
        }
    }
    self.next_expected_sequence_number += 1;
}
Contributor:

One thing we could do is at least post a log line if the size of this is past a threshold, and possibly turn that into a bigger warning if it's growing w/o removal.
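
A tiny sketch of that suggestion; the threshold, the helper name, and the use of the tracing crate are assumptions, not part of this PR.

```rust
fn warn_if_waiting_table_grows(waiting_len: usize, next_expected_sequence_number: u64) {
    const WAITING_WARN_THRESHOLD: usize = 10_000; // illustrative value
    if waiting_len > WAITING_WARN_THRESHOLD {
        tracing::warn!(
            waiting_len,
            next_expected_sequence_number,
            "AutoIncSender waiting table is past its threshold; tickets may be leaking"
        );
    }
}
```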

@velvia (Contributor) commented Mar 14, 2022:

Sorry reposting here because the other PR is closed:

@gdanezis @huitseeker I'm hoping the two of you can expand on the following and explain a little bit. I'd like to understand what the complexities here are.

If we accept an O(N) recovery on crash, and no recovery / very expensive recovery upon task crash (basically no consistency between DB and batch sequence), we can do the following: we write the effects of the cert to the DB without sequencing the transaction; we optimistically send the transaction digest down the channel; upon reception the channel sequences the transaction (easy on a single task), batches, and writes both the transactions with sequence numbers and the batches to disk. This offers no consistency (we may not have a complete sequence upon crash, and recovery is expensive), but it offers a very regular, in-sequence, easy-to-query, no-gaps sequence and batches.
If we want consistency and ease of recovery we just take the cost of a DB write head-of-line blocking as suggested in other PRs, and work from there to optimise.

  1. What is the source of inconsistency? Is it that, at the time a batch is defined, you want to have it represent a range of sequence numbers, but you believe that later transactions might still fall into that range? If this is the problem, then the late-sequencing idea surely solves it?
  2. I'm used to consistency and recovery being done with checkpointing and watermarks. The way I see it is that there is no consistent watermarking being done in the DB and state today, which makes it hard to reason about what exactly needs to be recovered. Maybe we can discuss this a little bit.
  3. I do see having a WAL-like sequencing of TX's being useful to a single authority, and syncing with a storage system. However, I don't think it will be useful to clients and fat replicas that need to sync state as the sequence numbers are not universally comparable at all -- and so I'm not sure what a consistent stream will help in the global context, since even for 2 authorities I'd have to look at the entire streams and compare detailed contents to sync between 2.
  4. If the stream was able to use something universal like timestamps, that would make global comparisons much easier and checkpoints easier to reason about
  5. As for going over the network - this will happen as storage becomes remote. However, consistent, ordered sequencing of things over the network is a solved problem in the database world. I'm probably not understanding things though.

Thanks, and I look forward to your comments.

@gdanezis closed this on Mar 15, 2022.
@huitseeker deleted the batch-crash-robustness branch on April 22, 2022.