
[authority] Batch crash robustness #714

Closed · wants to merge 7 commits from the batch-crash-robustness branch

Conversation

@gdanezis (Collaborator) commented Mar 9, 2022

Following a conversation with @lxfind, this PR attempts to make the interaction between AuthorityState and BatchManager more robust to crashes on either side.

What is the problem:

In the current design the AuthorityStore atomically picks a consecutive number for each transaction, and then non-atomically sends it, along with the digest, over an mpsc channel to the Batch manager. Since multiple threads may operate on the AuthorityState in parallel on different tokio tasks, the sequence numbers may not arrive in order. To cope with this, the Batch manager waits until a complete, in-order sequence has been received before sequencing and batching.

However, what if an AuthorityState task crashes? Then the sequence number will not be sent to the Batch Manager. That may produce a gap in the sequence, resulting in the Batch manager buffering forever, waiting for a missing sequence number that will never arrive. This is bad.

This PR's solution

We augment the sender end of the mpsc channel with logic that allows a thread to request a ticket -- assigning it a sequence number -- and then send a value (the transaction digest) to the batch manager using this ticket. The twist is that if the ticket is dropped -- for example because an Err is returned, or a panic occurs within a tokio task -- the ticket's Drop logic notifies the sender that this sequence number will never be sent, and unblocks subsequent sends.
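
As a rough illustration of this ticketing scheme, here is a minimal, hand-written sketch. It is not the code in this PR: the names (AutoIncSender, Ticket, waiting, next_expected_sequence_number) only loosely mirror the fragments quoted later in the thread, it uses a std Mutex plus a tokio unbounded channel, and error handling is elided.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::mpsc::UnboundedSender;

// Shared state behind the sender end of the channel.
struct Inner<T> {
    next_sequence_number: u64,          // next ticket number to hand out
    next_expected_sequence_number: u64, // next number the channel still waits for
    waiting: HashMap<u64, Option<T>>,   // Some(item) = redeemed, None = dropped ticket
    sender: UnboundedSender<(u64, T)>,
}

struct AutoIncSender<T>(Arc<Mutex<Inner<T>>>);

impl<T> Clone for AutoIncSender<T> {
    fn clone(&self) -> Self {
        AutoIncSender(self.0.clone())
    }
}

// A reserved sequence number that must be redeemed (send) or forfeited (Drop).
struct Ticket<T> {
    sequence_number: u64,
    sent: bool,
    autoinc_sender: AutoIncSender<T>,
}

impl<T> AutoIncSender<T> {
    fn next_ticket(&self) -> Ticket<T> {
        let mut inner = self.0.lock().unwrap();
        let sequence_number = inner.next_sequence_number;
        inner.next_sequence_number += 1;
        drop(inner);
        Ticket { sequence_number, sent: false, autoinc_sender: self.clone() }
    }
}

impl<T> Ticket<T> {
    // Redeem the ticket: the item is released to the channel as soon as every
    // earlier sequence number has been redeemed or forfeited.
    fn send(mut self, item: T) {
        self.sent = true;
        let mut inner = self.autoinc_sender.0.lock().unwrap();
        inner.waiting.insert(self.sequence_number, Some(item));
        flush_in_order(&mut inner);
    }
}

impl<T> Drop for Ticket<T> {
    // If the task returns Err or panics before redeeming, mark the number as a
    // permanent gap so later sequence numbers are not blocked behind it.
    fn drop(&mut self) {
        if !self.sent {
            let mut inner = self.autoinc_sender.0.lock().unwrap();
            inner.waiting.insert(self.sequence_number, None);
            flush_in_order(&mut inner);
        }
    }
}

// Push every in-order item to the channel, skipping forfeited numbers.
fn flush_in_order<T>(inner: &mut Inner<T>) {
    while let Some(item_opt) = inner.waiting.remove(&inner.next_expected_sequence_number) {
        if let Some(item) = item_opt {
            // Ignore send errors: if the BatchManager is gone, recovery from
            // the DB will rebuild the batches anyway.
            let _ = inner.sender.send((inner.next_expected_sequence_number, item));
        }
        inner.next_expected_sequence_number += 1;
    }
}
```

The key property is that a ticket can leave the critical section in exactly two ways, send or Drop, so next_expected_sequence_number always advances.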

As a result we no longer expect consecutive sequence numbers (when there are crashes), but we do expect them to be increasing. The receiver on the Batch manager side does not have to do anything fancy to wait for older sequence numbers; the output of the channel can be directly sequenced and broadcast.

When the Batch manager task crashes, batching and notifications stop. However, the transactions and sequence numbers are still written to the DB, ready for batching upon recovery.

Remaining issues (fixed)

There is still a window of vulnerability: if the authority crashes between successfully writing to the DB and signalling the batch manager, the batch manager will miss that update forever. In this case the transaction has been written to storage, so we can include some logic in the batch manager to recover: when there is a gap in the sequence numbers, the batch manager can attempt to read the missing transactions from the DB. Those found are sequenced; the others are skipped. This slower path of reading from the DB should be activated infrequently (we do not expect the authority state to crash all the time).
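
A minimal sketch of that slow path, with the store's sequenced-transaction table modelled as an in-memory map; the real code iterates the DB with skip_to / take_while, as in the fragments quoted further down, and the names (recover_gap, Digest) are purely illustrative.

```rust
use std::collections::BTreeMap;

// Stand-in for TransactionDigest, purely for illustration.
type Digest = [u8; 32];

// When the BatchManager receives `seq` but expected `next_expected`, everything
// in between either made it to disk (sequence it now) or was never written
// (skip it for good).
fn recover_gap(
    db_sequence: &BTreeMap<u64, Digest>,    // stand-in for the sequenced-Tx table
    current_batch: &mut Vec<(u64, Digest)>, // batch under construction
    next_expected: u64,
    seq: u64,
) {
    for (&store_seq, &store_digest) in db_sequence.range(next_expected..seq) {
        current_batch.push((store_seq, store_digest));
    }
}
```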

Potential improvements

  • We could use an AtomicU64 to assign tickets, but probably not to protect the sending logic.
  • If the number being sent or dropped is the next expected one, we do not need to store the value in the HashMap.
  • Since we already store values in the HashMap, we could do away with the channel entirely, implement a recv on the sender side, and wait on that. This involves quasi-manual Futures coding, which is not for the faint-hearted.

// Add to batch and broadcast
current_batch.push((seq, tx_digest));
Contributor:

Does this guarantee that the output transactions' sequence numbers always increase?
I imagine it should also be common for the authority to send transactions to the batch out of order (i.e. seq < next_expected_sequence_number).
I was thinking of keeping the loose_transactions, but making progress when its size goes beyond a threshold.

@gdanezis (Collaborator, author):

Now the autoinc sender ensures the sequence numbers are sent down the channel in order, so we do not need this tricky logic in the batch maker, aside from the cases where the AuthorityState may have crashed and there is a genuine gap.

@gdanezis force-pushed the batch-crash-robustness branch from 5708153 to df9980f on March 10, 2022, 12:05.
@huitseeker (Contributor) left a comment:

Before I dive into the code, let me check if I understand the design:

  1. the batch manager has a contract of in-order delivery, and does head-of-line blocking, is that correct?
  2. since the sequence numbers are somewhat idiosyncratic (they're not reproducible across authorities), and the DB writes are not coerced to follow a meaningful (causal) order (as the header of this PR explains) is the in-order delivery important in the first place?
  3. in case of a task panic, the batch manager may wait forever for a specific ticket number that may never arrive, and that's the core of what we're trying to fix, is that correct?
  4. this PR makes the assumption that what may lead to a crash is around the DB write, and so tries to register a sequence # which registration cancels on a panic (through a drop), is that correct?
  5. compared to the somewhat simple classic answer to this problem (a watermark), this offers a better latency on crashes. Is latency important here?

cc. @velvia for relevant expertise, @laura-makdah for impact on / obsoletion by #382

@lxfind (Contributor) left a comment:

LGTM!

@lanvidr (Contributor) left a comment:

To answer @huitseeker on the impact on the resilient components: I see this as a good candidate to use them, and we can fix any sequence gaps in handle_irrecoverable.

if !self.sent {
    let mut aic = self.autoinc_sender.lock();
    if aic.sender.is_closed() {
        return;
Contributor:

Do we need to unlock here before returning?

@gdanezis (Collaborator, author):

Upon return, the lock guard falls out of scope and the mutex is unlocked automatically -- this is the flip side of RAII: dropping a resource frees it, exactly in the same way that dropping the ticket performs the cleanup.
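
For readers less familiar with Rust, a toy illustration of that point (std::sync::Mutex with unwrap here; the quoted fragment appears to use a non-poisoning lock, hence no unwrap):

```rust
use std::sync::Mutex;

fn observe(counter: &Mutex<u64>) -> Option<u64> {
    let guard = counter.lock().unwrap();
    if *guard == 0 {
        return None; // `guard` is dropped here; the mutex unlocks automatically
    }
    Some(*guard)     // ...and likewise here, at the end of the scope
}
```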

Contributor:

Locks drop when they go out of scope.

.skip_to(&next_expected_sequence_number)?
.take_while(|(store_seq, _)| store_seq < &seq)
.for_each(|(store_seq, store_digest)|
{
Contributor:

What do we do if we can't recover the Tx, but the store is not dead?

@gdanezis (Collaborator, author):

Ah, that means that no Tx was written to the store, i.e. the task processing the transaction acquired a ticket and then crashed before the database commit. In that case there will simply be a gap in the batch sequence, and life goes on.

@huitseeker (Contributor) commented Mar 12, 2022:

We need to adapt #714 batches in some way similar to #798 batches to better convey intent: otherwise the client might keep requesting a seq num they feel we signed off on (through considering the batch as a commitment) without the authority being able to serve it.

@lanvidr (Contributor) commented Mar 10, 2022:

@gdanezis If I take a step back, might I ask why we need sequenced batches?

@gdanezis (Collaborator, author):

@gdanezis If I take a step back, might I ask why we need sequenced batches?

Sure, this is to support the follower functionality, namely when a client wants a list of all transactions that have been processed by an authority. Batching them allows the authority to sign them, and sequencing the transactions allows clients to request historical ranges of the sequence.

@gdanezis (Collaborator, author):

  1. the batch manager has a contract of in-order delivery, and does head-of-line blocking, is that correct?

In the current PR the batch manager both receives transactions in increasing sequence-number order (managed by the autoinc sender) and emits them in increasing sequence-number order. However, gaps are allowed where crashes may have occurred.

  2. since the sequence numbers are somewhat idiosyncratic (they're not reproducible across authorities), and the DB writes are not coerced to follow a meaningful (causal) order (as the header of this PR explains) is the in-order delivery important in the first place?

I think the sequence numbers do provide a causal sequence, i.e. a follower executing them in this order will be able to update their DB without ever needing to re-order. The reason is that transaction execution, which drives the acquisition of tickets, is causal. But it is something we need to re-check.

  3. in case of a task panic, the batch manager may wait forever for a specific ticket number that may never arrive, and that's the core of what we're trying to fix, is that correct?

That was the bug this PR is fixing. Now the Batch manager will not wait forever any more.

  4. this PR makes the assumption that what may lead to a crash is around the DB write, and so tries to register a sequence # which registration cancels on a panic (through a drop), is that correct?

This is not an assumption, it is the critical region where crashes matter. We are trying to keep two structures consistent: (1) the transactions written to disk in parallel, each with a sequence number, and (2) the sequence we emit as part of batching. If a crash happens before we assign a unique sequence number, really no biggie, as nothing has to happen. The opportunity for inconsistency occurs if there is a crash between Get Ticket and DB write, or between DB write and send update. After the DB write has happened and the update has been sent / the batch generated, there is again no opportunity for inconsistency.
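
To make that critical region concrete, here is a hedged sketch of the write path. It reuses the AutoIncSender / Ticket names from the sketch in the PR description above, and write_effects_to_db is a hypothetical stand-in, not an API in this PR.

```rust
// Hypothetical stand-in for the real AuthorityStore write.
async fn write_effects_to_db(_digest: [u8; 32]) -> Result<(), std::io::Error> {
    Ok(())
}

// The critical region runs from (1) to (3).
async fn process_certificate(
    autoinc: AutoIncSender<[u8; 32]>,
    digest: [u8; 32],
) -> Result<(), std::io::Error> {
    // (1) Get Ticket: reserve a unique, increasing sequence number.
    let ticket = autoinc.next_ticket();

    // (2) DB write: an Err here (or a panic in this task) drops `ticket`
    //     unredeemed, which records a permanent gap and unblocks later numbers.
    write_effects_to_db(digest).await?;

    // (3) Redeem: only after a successful write does the BatchManager hear the
    //     sequence number. A crash between (2) and (3) is the remaining window
    //     covered by the gap-recovery read from the DB described above.
    ticket.send(digest);
    Ok(())
}
```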

  5. compared to the somewhat simple classic answer to this problem (a watermark), this offers a better latency on crashes. Is latency important here?

I have no idea what a 'watermark' is in this context, so I guess it cannot be as simple. Do you mean you keep the sequence number on the sending end, and on the receiving end? If so, yes, this is exactly what we are doing here.

@huitseeker (Contributor) left a comment:

Thanks a lot for the explanation @gdanezis . I think this is suboptimal, in that it's still affected by stragglers, for no good reason.

Stragglers

If the DB is ticketed to write TXes 42, 43, 44 concurrently and writing 42 takes 5x as much time as writing any of the others (1 unit of latency), the time to observe 44 is at least 5 units.

Yet 43, 44, may be transactions completely unrelated to 42, and it may be valuable to see them before 42. We could theoretically observe them in 1 unit.

Edit:

Dear reader, you might want to jump ahead to my realization, further down, of what this consistency logic aims to do.

Higher level

  • Within a small set of interdependent transactions, it's likely the causal order is a total order, i.e. there's just one way to execute them in order,
  • but if we consider a larger set of transactions, e.g. what the node is executing over an hour, there isn't always a precedence link between any two transactions, they may stand on incomparable strands of a lattice, or in family terms, be incomparable siblings of a distant common parent (something we touched on a while ago),
  • in so far as this is trying to keep the linear numbering scheme in the DB and the observed one in the batch manager the same, the PR is definitely getting at a more robust solution to the problem,
  • but is it meaningful to try to keep those numberings, and the order of delivery the same?
  • since the Authority Store numbering follows execution, it will represent one linearization of the many possible valid (causal) execution orders,
  • if we just "let the DB crash" and perform no reordering or wait whatsoever at the Batch sender, what happens?
  • it seems to me there's no non-causal reordering due to concurrency,

Property I'm assuming

  • We have a pre-condition on the DB read: if TX2 is a direct dependent of TX1, we cannot execute TX2 (and a fortiori, write it or batch it) before the effects of TX1 have been successfully written to the DB,
  • seq_num assignments are in strictly increasing order,
  • channels conserve sending order,

Consequences

  • if a TX registers number k with the store, any of its (transitive) dependencies has been successfully written to the store with a number j < k,
  • consequence: for a TX registering number k with the store to be sent on the batch channel earlier than any of its dependencies TXj, j < k, TXk would need to be written and sent to the channel faster than it takes to just send TXj over (since that's the last remaining thing to do after TXj is readable),
  • since this is implausible, the channel output order respects causal order in all plausible concurrent executions.

Example

  • the authority executes TX2, the store assigns seq_num 42 for it,
  • the authority tries to write TX2,
  • before that completes, the TXes at seq_num 43, 44 go through to the BatchManager
  • the write crashes,
  • TX3 comes in with a dependency on TX2,
  • the authority fetches its parent TX2 (from the gateway) and reexecutes TX2,
  • TXes [46, 48, 47, 49, 51, 50, 52, 53, 55, 54] go through to the BatchManager in the meanwhile, in this order,
  • the authority assigns seq_num 56 for TX2,
  • the write succeeds, TX2 is sent at 56 to the BatchManager,
  • TXes [58, 57] go through in this order,
  • the authority processes TX3, assigns seq_num 59 for it,
  • the write succeeds, TX3 is sent at 59 to the BatchManager

UX issues

Seq_nums can totally be jumbled on the output of the BatchManager. My point is that this only happens for causally non-comparable TXes (if two TXes are related by the partial causal order, they will be {executed + written} one after the other), and that the fix is entirely contained on the receiving end: periodically re-order the items received out of the channel that are older than a moving time deadline, aka a watermark (equivalently, that have been received more than ~5s ago), and expect this re-ordering to manifest the final order.
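
A minimal sketch of that watermark-style re-ordering on the receiving end; the names, the deadline value, and the digest type are illustrative, not part of this PR.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

// Buffer out-of-order (seq, digest) pairs briefly; emit, in sequence order,
// whatever has waited longer than the watermark deadline.
struct ReorderBuffer {
    pending: BTreeMap<u64, ([u8; 32], Instant)>,
    deadline: Duration, // e.g. ~5s, per the comment above
}

impl ReorderBuffer {
    fn new(deadline: Duration) -> Self {
        ReorderBuffer { pending: BTreeMap::new(), deadline }
    }

    fn push(&mut self, seq: u64, digest: [u8; 32]) {
        self.pending.insert(seq, (digest, Instant::now()));
    }

    // Called periodically: drain every entry older than the deadline, in
    // sequence order (BTreeMap iteration order), leaving younger ones buffered.
    fn flush(&mut self, now: Instant) -> Vec<(u64, [u8; 32])> {
        let expired: Vec<u64> = self
            .pending
            .iter()
            .filter(|(_, (_, arrived))| now.duration_since(*arrived) >= self.deadline)
            .map(|(&seq, _)| seq)
            .collect();
        expired
            .into_iter()
            .map(|seq| (seq, self.pending.remove(&seq).unwrap().0))
            .collect()
    }
}
```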

Or the end-user of the Batch information (e.g. an Explorer) may just consider the seq_nums to be information that is only useful when speaking to that specific authority, and, for the purposes of talking to its own end-user (e.g. somebody outside the chain), re-number the TXes in exactly the reception order. I think that order will respect all the partial causal relationships.

What I'm suggesting we do

Remove the head of line blocking at the BatchManager, just send through whatever we receive.


@gdanezis (Collaborator, author):

If the DB is ticketed to write TXes 42, 43, 44 concurrently and writing 42 takes 5x as much time as writing any of the others (1 unit of latency), the time to observe 44 is at least 5 units.

At the point where the ticketing business happens, the execution has already been performed, and all that is left is writing effects to the DB. Of course a variable amount of data may lead to variable write times here, but just to point this out.

if we just "let the DB crash" and perform no reordering or wait whatsoever at the Batch sender, what happens?

It depends on the nature of the crash. If the crash takes down all future tasks, i.e. the DB has really died for good and we are going to restart the authority (with a bigger HD), then nothing bad happens: the recovery logic will just make a new batch out of all the trailing transactions not yet in a batch.

However, if there is a crash between a DB write and the digest being sent through the channel to the batch manager, then this update may never be included in a batch, which means that the sequence of batches no longer represents the sequence that re-creates the DB of the authority. This is inconsistent state.

@huitseeker (Contributor):

However, if there is a crash between a DB write and the digest being sent through the channel to the batch manager, then this update may never be included in a batch, which means that the sequence of batches no longer represents the sequence that re-creates the DB of the authority. This is inconsistent state.

Right, but I'm confused as to how this came about: the current PR does not mention or claim anything about recovery logic. It's just about not letting crashes block anything else. Or did I miss anything?

When I said "let the DB crash and perform no reordering or wait whatsoever at the Batch sender", I meant that to me this behaves just as "good" as the ticket-based reordering, for some priors on "good". But I may have missed something about why the re-ordering is important?

@gdanezis (Collaborator, author):

@huitseeker I am a great fan of simpler solutions and less code. If you think there is a simpler design for the Batch Manager, do suggest a PR.

@huitseeker (Contributor):

@gdanezis see #798


-) -> (BatchSender, BatchManager, BroadcastPair) {
-    let (tx_send, tx_recv) = channel(capacity);
+) -> Result<(BatchSender, BatchManager, BroadcastPair), SuiError> {
+    let (tx_send, tx_recv) = unbounded_channel();
@huitseeker (Contributor) commented Mar 12, 2022:

We can't allow a crash to occur between a successful TX write and the ticket send for it, because we'd be toast anyway: the batch goes out w/o the TX, yet the TX's sequence was not after the end of the last batch (cost of recovery is O(N)). One leading reason this might happen is that one of the channels on which the ticket redemption flows is out of capacity: ticket sending would fail.

This forces the use of an unbounded channel at the AutoIncSender and in the BatchManager, and of an unbounded waiting table at the AutoIncSender, to push back ticket redemption failure at the threshold of whole-process failure.

This means that tickets are Kryptonite: if anyone manages to put any one of them in a long-running task (where they will be stored but not de-allocated), this will crash the whole node through an OOM, irrecoverably.

Could we put those tickets on a reasonable timeout somehow? (e.g. forcibly flushing the waiting past a max size) I'm way more comfortable with an O(N) recovery than I am with a full node crash (no more TX voting!) because of some auxiliary read functionality.
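
A sketch of the "forcibly flush past a max size" option. The cap, the helper name, and the Digest type are illustrative, and it assumes the waiting table is an ordered map (a BTreeMap rather than the PR's HashMap) so the lowest buffered number is cheap to find; a complete version would also discard late redemptions that fall below the advanced watermark.

```rust
use std::collections::BTreeMap;

type Digest = [u8; 32];

const MAX_WAITING: usize = 100_000; // illustrative cap, not a value from this PR

// If the waiting table grows past the cap, give up on the sequence numbers that
// block its head: advance next_expected to the lowest buffered entry so the
// normal in-order drain can resume, turning the skipped numbers into gaps.
fn force_flush_waiting(waiting: &BTreeMap<u64, Option<Digest>>, next_expected: &mut u64) {
    if waiting.len() > MAX_WAITING {
        if let Some((&lowest_buffered, _)) = waiting.iter().next() {
            // The numbers in (next_expected..lowest_buffered) belong to tickets
            // that were issued but never redeemed or dropped (e.g. parked in a
            // long-running task); they become permanent gaps.
            *next_expected = (*next_expected).max(lowest_buffered);
        }
    }
}
```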

Comment on lines +18 to +32
while let Some(item_opt) = self.waiting.remove(&self.next_expected_sequence_number) {
    if let Some(item) = item_opt {
        if let Err(_err) = self.sender.send((self.next_expected_sequence_number, item)) {
            /*
                An error here indicates the other side of the channel is closed.
                There is not very much we can do: if the batcher is closed we
                will write to the DB and then recover when we restart.
            */
            self.waiting.clear();
        }
    }
    self.next_expected_sequence_number += 1;
}
Contributor:

One thing we could do is at least post a log line if the size of this is past a threshold, and possibly turn that into a bigger warning if it's growing w/o removal.
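
A tiny sketch of that suggestion; the threshold, the helper name, and the use of the tracing crate are assumptions, not part of this PR.

```rust
fn warn_if_waiting_table_grows(waiting_len: usize, next_expected_sequence_number: u64) {
    const WAITING_WARN_THRESHOLD: usize = 10_000; // illustrative value
    if waiting_len > WAITING_WARN_THRESHOLD {
        tracing::warn!(
            waiting_len,
            next_expected_sequence_number,
            "AutoIncSender waiting table is past its threshold; tickets may be leaking"
        );
    }
}
```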

@velvia (Contributor) commented Mar 14, 2022:

Sorry reposting here because the other PR is closed:

@gdanezis @huitseeker I'm hoping the two of you can expand on the following and explain a little bit. I'd like to understand what the complexities here are.

If we accept an O(N) recovery on crash, and no recovery / very expensive recovery upon task crash (basically no consistency between DB and batch sequence), we can do the following: we write the effects of the cert to the DB without sequencing the transaction; we optimistically send the transaction digest down the channel; upon reception the channel sequences the transaction (easy on a single task), batches, and writes both the transactions with sequence numbers and the batches to disk. This offers no consistency (we may not have a complete sequence upon crash, and recovery is expensive), but it offers a very regular, in-sequence, easy-to-query, no-gaps sequence and batches.
If we want consistency and ease of recovery we just take the cost of a DB write head-of-line blocking as suggested in other PRs, and work from there to optimise.

  1. What is the source of inconsistency? Is it that, at the time a batch is defined, you want to have it represent a range of sequence numbers, but you believe that later transactions might still fall into that range? If this is the problem, then the late-sequencing idea surely solves it?
  2. I'm used to consistency and recovery being done with checkpointing and watermarks. The way I see it is that there is no consistent watermarking being done in the DB and state today, which makes it hard to reason about what exactly needs to be recovered. Maybe we can discuss this a little bit.
  3. I do see having a WAL-like sequencing of TX's being useful to a single authority, and syncing with a storage system. However, I don't think it will be useful to clients and fat replicas that need to sync state as the sequence numbers are not universally comparable at all -- and so I'm not sure what a consistent stream will help in the global context, since even for 2 authorities I'd have to look at the entire streams and compare detailed contents to sync between 2.
  4. If the stream was able to use something universal like timestamps, that would make global comparisons much easier and checkpoints easier to reason about
  5. As for going over the network - this will happen as storage becomes remote. However, consistent, ordered sequencing of things over the network is a solved problem in the database world. I'm probably not understanding things though.

Thanks, and I look forward to your comments.

@gdanezis closed this on Mar 15, 2022.
@huitseeker deleted the batch-crash-robustness branch on April 22, 2022.