[authority] Batch crash robustness #714
Conversation
}

// Add to batch and broadcast
current_batch.push((seq, tx_digest));
Does this guarantee that the output transactions' sequence numbers always increase?
I imagine that it should also be common for the authority to send transactions to the batch out-of-order (i.e. seq < next_expected_sequence_number).
I was thinking of keeping the loose_transactions, but making progress when its size goes beyond a threshold.
Now the autoinc sender ensures the sequence numbers are sent down the channel in order. So we do not need this tricky logic in the batch maker, aside from the cases where the AuthorityState may have crashed, and there is a genuine gap.
Before I dive into the code, let me check if I understand the design:
- the batch manager has a contract of in-order delivery, and does head-of-line blocking, is that correct?
- since the sequence numbers are somewhat idiosyncratic (they're not reproducible across authorities), and the DB writes are not coerced to follow a meaningful (causal) order (as the header of this PR explains) is the in-order delivery important in the first place?
- in case of a task panic, the batch manager may wait forever for a specific ticket number that may never arrive, and that's the core of what we're trying to fix, is that correct?
- this PR makes the assumption that what may lead to a crash is around the DB write, and so tries to register a sequence # which registration cancels on a panic (through a drop), is that correct?
- compared to the somewhat simple classic answer to this problem (a watermark), this offers a better latency on crashes. Is latency important here?
cc. @velvia for relevant expertise, @laura-makdah for impact on / obsoletion by #382
LGTM!
To answer @huitseeker on the impact on the resilient components, I see this would be a good candidate to use it, and we can fix any sequence gaps in handle_irrecoverable.
if !self.sent {
    let mut aic = self.autoinc_sender.lock();
    if aic.sender.is_closed() {
        return;
Do we need to unlock here before returning?
Upon return the lock falls out of scope and the mutex is unlocked automatically -- this is the flip side of RAII: dropping a resource frees it, exactly in the same way as dropping the ticket does the cleanup.
Locks drop when they go out of scope.
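For illustration, a minimal sketch of this guard-drop behavior using std::sync::Mutex (the function and values here are made up for the example; the PR's lock may well be a different mutex type):

```rust
use std::sync::Mutex;

fn maybe_update(sent: bool, counter: &Mutex<u64>) {
    if !sent {
        let mut guard = counter.lock().unwrap(); // mutex locked here
        if *guard == 0 {
            return; // `guard` is dropped on return, which unlocks the mutex
        }
        *guard += 1;
    } // `guard` is also dropped here when the block ends
}

fn main() {
    let counter = Mutex::new(0);
    maybe_update(false, &counter);
    // The early return above released the lock, so this does not block.
    assert_eq!(*counter.lock().unwrap(), 0);
}
```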
.skip_to(&next_expected_sequence_number)?
.take_while(|(store_seq, _)| store_seq < &seq)
.for_each(|(store_seq, store_digest)| {
What do we do if we can't recover the Tx, but the store is not dead?
Ah, that means that no Tx was written to the store, i.e. the task processing the transaction acquired a ticket, then crashed before the database commit. In that case there will simply be a gap in the batch sequence. And life goes on.
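To make the "gap in the batch sequence" concrete, here is a small sketch of the recovery idea, with a BTreeMap standing in for the store's sequenced transaction table (fill_gap and the types are illustrative, not the actual store API):

```rust
use std::collections::BTreeMap;

type TxDigest = [u8; 32];

/// Pull everything the store actually holds in [next_expected, seq) into the
/// batch; sequence numbers with no store entry were ticketed but never
/// committed (the task crashed before its DB write) and are simply skipped.
fn fill_gap(
    store: &BTreeMap<u64, TxDigest>, // stand-in for the sequenced TX table
    next_expected: u64,
    seq: u64,
    batch: &mut Vec<(u64, TxDigest)>,
) {
    for (&store_seq, &digest) in store.range(next_expected..seq) {
        batch.push((store_seq, digest));
    }
}

fn main() {
    let mut store = BTreeMap::new();
    store.insert(40u64, [1u8; 32]);
    store.insert(42, [2u8; 32]); // 41 was ticketed but its write never landed
    let mut batch = Vec::new();
    fill_gap(&store, 40, 43, &mut batch);
    // 41 is absent: a gap in the batch sequence, and life goes on.
    assert_eq!(batch.iter().map(|(s, _)| *s).collect::<Vec<_>>(), vec![40, 42]);
}
```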
@gdanezis If I take a step back, might I ask why we need sequenced batches?
Sure, this is to support the follower functionality, namely when a client wants a list of all transactions that have been processed by an authority. Batching them allows the authority to sign them, and sequencing the transactions allows clients to request historical ranges of the past sequence.
In the current PR the batch manager both receives transactions in an increasing sequence number fashion (managed by the autoinc sender) and also emits them in increasing sequence number order. However, gaps are allowed when crashes may have occurred.
I think the sequence numbers are providing a causal sequence. I.e. a follower executing them in this sequence will be able to update their DB without ever needing to re-order. The reason for this is that transaction execution, which drives the acquisition of tickets, is causal. But it is something we need to re-check.
That was the bug this PR is fixing. Now the Batch manager will not wait forever any more.
This is not an assumption, it is the critical region where crashes matter. We are trying to keep two structures consistent: (1) the transactions written to disk in parallel, but with a sequence number each, and (2) the sequence we are emitting as part of batching. If a crash happens before we assign a unique sequence number, really no biggie, as nothing has to happen. The opportunity for inconsistency occurs if there is a crash between Get Ticket and DB write, or between DB write and send update. After the DB write has happened and the update has been sent / batch generated, there is again no opportunity for inconsistency.
I have no idea what a 'watermark' is in this context, so I guess it cannot be that simple. Do you mean you keep the sequence number on the sending end, and on the receiving end? If so, yes, this is exactly what we are doing here.
Thanks a lot for the explanation @gdanezis. I think this is suboptimal, in that it's still affected by stragglers, for no good reason.
Stragglers
If the DB is ticketed to write TXes 42, 43, 44 concurrently and writing 42 takes 5x as much time as writing any of the others (1 unit of latency), the time to observe 44 is at least 5 units.
Yet 43, 44, may be transactions completely unrelated to 42, and it may be valuable to see them before 42. We could theoretically observe them in 1 unit.
Edit:
Dear reader, you might want to jump to my realization of what this consistency logic aims to do here
Higher level
- Within a small set of interdependent transactions, it's likely the causal order is a total order, i.e. there's just one way to execute them in order,
- but if we consider a larger set of transactions, e.g. what the node is executing over an hour, there isn't always a precedence link between any two transactions, they may stand on incomparable strands of a lattice, or in family terms, be incomparable siblings of a distant common parent (something we touched on a while ago),
- in so far as this is trying to keep the linear numbering scheme in the DB and the observed one in the batch manager the same, the PR is definitely getting at a more robust solution to the problem,
- but is it meaningful to try to keep those numberings, and the order of delivery the same?
- since the Authority Store numbering follows execution, it will represent one linearization of the many possible valid (causal) execution orders,
- if we just "let the DB crash" and perform no reordering or wait whatsoever at the Batch sender, what happens?
- it seems to me there's no non-causal reordering due to concurrency,
Properties I'm assuming
- We have a pre-condition on the DB read: if TX2 is a direct dependent of TX1, we cannot execute TX2 (and a fortiori, write it or batch it) before the effects of TX1 have been successfully written to the DB,
- seq_num assignments are in strictly increasing order,
- channels preserve sending order,
Consequences
- if a TX registers number k with the store, all of its (transitive) dependencies have already been successfully written to the store with a number j < k,
- consequence: for TXk to be sent on the batch channel earlier than one of its dependencies TXj, j < k, TXk would need to be written and sent to the channel faster than it takes to just send TXj over (since that's the last remaining thing to do once TXj is readable),
- since this is implausible, the channel output order respects causal orders in all plausible concurrent executions.
Example
- the authority executes TX2, the store assigns seq_num 42 for it,
- the authority tries to write TX2,
- before that completes, the TXes at seq_num 43, 44 go through to the BatchManager
- the write crashes,
- TX3 comes in with a dependency on TX2,
- the authority fetches its parent TX2 (from the gateway) and reexecutes TX2,
- TXes [46, 48, 47, 49, 51, 50, 52, 53, 55, 54] go through to the BatchManager in the meanwhile, in this order,
- the authority assigns seq_num 56 for TX2,
- the write succeeds, TX2 is sent at 56 to the BatchManager,
- TXes [58, 57] go through in this order,
- the authority processes TX3, assigns seq_num 59 for it,
- the write succeeds, TX3 is sent at 59 to the BatchManager
UX issues
Seq_nums can totally be jumbled on the output of the BatchManager. My point is that this is for causally non-comparable TXes (if 2 TXes are related by the partial causal order, they will be {executed + written} one after the other) and that the fix is entirely contained on the receiving end: periodically re-order the items received out of the channel that are older than a moving time deadline, aka a watermark (equivalently, that have been received for more than ~5s), and expect this re-ordering to manifest the final order.
Or the end-user of the Batch information (e.g. an Explorer) may just consider the seq_nums are information that's only useful in speaking to that specific authority, and that for the purposes of talking to its end-user (e.g. somebody outside the chain) it can re-number the TXes in exactly the reception order. I think that order will respect all the partial causal relationships.
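For concreteness, a sketch of the receiver-side watermark described above (the type names and the delay are illustrative, not proposed code for the repo):

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Buffer what the channel delivers, and periodically release -- in seq_num
/// order -- only the items that have been sitting here longer than the
/// watermark delay.
struct WatermarkReorder<T> {
    buffer: BTreeMap<u64, (Instant, T)>,
    delay: Duration,
}

impl<T> WatermarkReorder<T> {
    fn new(delay: Duration) -> Self {
        Self { buffer: BTreeMap::new(), delay }
    }

    fn push(&mut self, seq: u64, item: T) {
        self.buffer.insert(seq, (Instant::now(), item));
    }

    /// Drain, in increasing seq_num order, every buffered item older than the
    /// watermark; anything younger keeps waiting for the next call.
    fn release(&mut self) -> Vec<(u64, T)> {
        let now = Instant::now();
        let ready: Vec<u64> = self
            .buffer
            .iter()
            .filter(|(_, (arrived, _))| now.duration_since(*arrived) >= self.delay)
            .map(|(seq, _)| *seq)
            .collect();
        ready
            .into_iter()
            .map(|seq| {
                let (_, item) = self.buffer.remove(&seq).unwrap();
                (seq, item)
            })
            .collect()
    }
}

fn main() {
    let mut reorder = WatermarkReorder::new(Duration::from_millis(0)); // zero delay for the demo
    for seq in [46u64, 48, 47] {
        reorder.push(seq, format!("tx{seq}"));
    }
    // Items come back sorted by seq_num once they are past the watermark.
    let seqs: Vec<u64> = reorder.release().into_iter().map(|(s, _)| s).collect();
    assert_eq!(seqs, vec![46, 47, 48]);
}
```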
What I'm suggesting we do
Remove the head-of-line blocking at the BatchManager, just send through whatever we receive.
At the point where the ticketing business happens, the execution has already been performed, and all that is left is writing effects to the DB. Of course a variable amount of data may lead to variable time here, but just to point this out.
It depends on the nature of the crash. If there is a crash that takes down all future tasks, i.e. the DB really has died for good and we are going to restart the authority (with a bigger HD), then nothing bad happens: the recovery logic will just make a new batch with all the last transactions outside a batch. However, if there is a crash between a DB write and the digest being sent through the channel to the batch manager, then this update may never be included in a batch, which means that the sequence of batches no longer represents the sequence that re-creates the DB of the authority. This is inconsistent state.
Right, but I'm confused as to how this came about: the current PR does not mention or claim anything about recovery logic. It's just about not letting crashes block anything else. Or did I miss anything? When I said "let the DB crash and perform no reordering or wait whatsoever at the Batch sender", I meant that to me this behaves just as "good" as the ticket-based reordering, for some priors on "good". But I may have missed something on how the re-ordering is important?
@huitseeker I am a great fan of simpler solutions and less code. If you think there is a simpler design for the Batch Manager, do suggest a PR.
-) -> (BatchSender, BatchManager, BroadcastPair) {
-    let (tx_send, tx_recv) = channel(capacity);
+) -> Result<(BatchSender, BatchManager, BroadcastPair), SuiError> {
+    let (tx_send, tx_recv) = unbounded_channel();
We can't allow a crash to occur between a successful TX write and the ticket send for it, because we'd be toast anyway: the batch goes out w/o the TX, yet the TX's sequence was not after the end of the last batch (cost of recovery is O(N)). One leading reason why this might happen is that one of the channels on which the ticket redemption flows is out of capacity: ticket sending would fail.
This forces the use of an unbounded channel at the AutoIncSender and in the BatchManager, and of an unbounded waiting table at the AutoIncSender, to push ticket redemption failure back to the threshold of whole-process failure.
This means that tickets are Kryptonite: if anyone manages to put any one of them in a long-running task (where they will be stored but not de-allocated), this will crash the whole node through an OOM, irrecoverably.
Could we put those tickets on a reasonable timeout somehow (e.g. forcibly flushing the waiting table past a max size)? I'm way more comfortable with an O(N) recovery than I am with a full node crash (no more TX voting!) because of some auxiliary read functionality.
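A sketch of the "flush past a max size" escape hatch (MAX_WAITING and insert_with_cap are hypothetical names, not part of the PR):

```rust
use std::collections::BTreeMap;

/// Hypothetical cap: if the waiting table grows past MAX_WAITING, give up on
/// in-order delivery, flush whatever is present in seq order, and accept that
/// the O(N) recovery path repairs the batch sequence later.
const MAX_WAITING: usize = 10_000;

fn insert_with_cap<T>(
    waiting: &mut BTreeMap<u64, Option<T>>,
    seq: u64,
    item: Option<T>,
) -> Vec<(u64, T)> {
    waiting.insert(seq, item);
    if waiting.len() <= MAX_WAITING {
        return Vec::new(); // normal path: keep waiting for the expected sequence number
    }
    // Forced flush: emit what we actually have, in seq order, skipping the holes.
    std::mem::take(waiting)
        .into_iter()
        .filter_map(|(s, maybe_item)| maybe_item.map(|it| (s, it)))
        .collect()
}

fn main() {
    let mut waiting: BTreeMap<u64, Option<&str>> = BTreeMap::new();
    // Well under the cap, so nothing is flushed here.
    assert!(insert_with_cap(&mut waiting, 42, Some("digest")).is_empty());
}
```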
while let Some(item_opt) = self.waiting.remove(&self.next_expected_sequence_number) {
    if let Some(item) = item_opt {
        if let Err(_err) = self.sender.send((self.next_expected_sequence_number, item)) {
            /*
                An error here indicates the other side of the channel is closed.
                There is not very much we can do: if the batcher is closed we
                will write to the DB and then recover when we restart.
            */
            self.waiting.clear();
        }
    }
    self.next_expected_sequence_number += 1;
}
}
One thing we could do is at least post a log line if the size of this is past a threshold, and possibly turn that into a bigger warning if it's growing w/o removal.
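A sketch of that safeguard (the threshold and function name are made up, and eprintln! stands in for the project's logging):

```rust
/// Hypothetical threshold above which the waiting table size is worth reporting.
const WAITING_WARN_THRESHOLD: usize = 1_000;

/// Log when the waiting table is past the threshold, and escalate if it keeps
/// growing without removals between two checks.
fn report_waiting_size(current: usize, previous: usize) {
    if current > WAITING_WARN_THRESHOLD {
        if current > previous {
            eprintln!("warning: batch waiting table growing without removal: {current} entries");
        } else {
            eprintln!("note: batch waiting table above threshold: {current} entries");
        }
    }
}

fn main() {
    report_waiting_size(1_500, 1_200); // would print the growth warning
}
```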
Sorry, reposting here because the other PR is closed: @gdanezis @huitseeker I'm hoping the two of you can expand on the following and explain a little bit. I'd like to understand what the complexities here are.
Thanks, and I look forward to your comments.
Following a conversation with @lxfind this PR attempts to make the interaction between AuthorityState and BatchManager more robust to crashes on either side.
What is the problem:
In the current design the AuthorityStore atomically picks a consecutive number for each transaction, and then non-atomically sends it over an mpsc channel to the Batch manager along with the digest. Since multiple threads may operate on the AuthorityState in parallel on different tokio tasks, the sequence numbers may not arrive in order. To fix this the Batch manager tries to wait until the complete in-order sequence is received before sequencing and batching.
However, what if an AuthorityState task crashes? Then the sequence number will not be sent to the Batch Manager. That may produce a gap in the sequence, resulting in the Batch manager buffering forever, waiting for the missing sequence number which will never arrive. This is bad.
This PR's solution
We augment the sender end of the mpsc channel with logic that allows a thread to request a ticket -- assigning it a sequence number -- and then send a value (the transaction digest) using this ticket to the batch manager. The twist is that if the ticket is dropped, for example when an Err is returned or a panic occurs within a tokio task, the ticket's Drop logic notifies the Sender that the ticket will never be sent, and unblocks subsequent sends.
As a result we no longer expect consecutive sequence numbers (if there are crashes), but we do expect them to be increasing. The receiver on the Batch manager side does not have to do anything fancy to wait for old sequences; the output of the channel can be directly sequenced and broadcast.
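As a sketch of this ticketing idea (Sequencer, Ticket and deliver are illustrative names, a Vec stands in for the mpsc channel, and the actual PR types differ):

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Shared ordering state: the next sequence number the consumer expects, plus
/// out-of-order arrivals parked until their turn (None marks a gap).
struct Inner<T> {
    next_expected: u64,
    waiting: BTreeMap<u64, Option<T>>,
    output: Vec<(u64, T)>, // stand-in for the channel to the batch manager
}

struct Sequencer<T> {
    inner: Arc<Mutex<Inner<T>>>,
    next_ticket: Mutex<u64>,
}

struct Ticket<T> {
    seq: u64,
    sent: bool,
    inner: Arc<Mutex<Inner<T>>>,
}

impl<T> Sequencer<T> {
    fn new() -> Self {
        Self {
            inner: Arc::new(Mutex::new(Inner {
                next_expected: 0,
                waiting: BTreeMap::new(),
                output: Vec::new(),
            })),
            next_ticket: Mutex::new(0),
        }
    }

    /// Atomically reserve the next sequence number.
    fn ticket(&self) -> Ticket<T> {
        let mut n = self.next_ticket.lock().unwrap();
        let seq = *n;
        *n += 1;
        Ticket { seq, sent: false, inner: self.inner.clone() }
    }
}

impl<T> Ticket<T> {
    /// Redeem the ticket with a value (e.g. a transaction digest).
    fn send(mut self, item: T) {
        self.sent = true;
        deliver(&self.inner, self.seq, Some(item));
    }
}

impl<T> Drop for Ticket<T> {
    fn drop(&mut self) {
        if !self.sent {
            // The holder returned an Err or panicked before sending:
            // record a gap so later sequence numbers are not blocked.
            deliver(&self.inner, self.seq, None);
        }
    }
}

fn deliver<T>(inner: &Arc<Mutex<Inner<T>>>, seq: u64, item: Option<T>) {
    let mut state = inner.lock().unwrap();
    state.waiting.insert(seq, item);
    // Release everything that is now in order; gaps just advance the counter.
    loop {
        let expected = state.next_expected;
        match state.waiting.remove(&expected) {
            Some(Some(item)) => state.output.push((expected, item)),
            Some(None) => {} // a dropped ticket: skip this sequence number
            None => break,   // still waiting on an in-flight ticket
        }
        state.next_expected += 1;
    }
}

fn main() {
    let sequencer = Sequencer::new();
    let (t0, t1, t2) = (sequencer.ticket(), sequencer.ticket(), sequencer.ticket());
    t2.send("tx-c"); // arrives out of order: parked in `waiting`
    drop(t1);        // simulated crash before the DB write: becomes a gap
    t0.send("tx-a"); // releases 0, skips 1, then releases the parked 2
    let state = sequencer.inner.lock().unwrap();
    assert_eq!(state.output, vec![(0, "tx-a"), (2, "tx-c")]);
}
```

Dropping t1 without sending stands in for a task that failed between acquiring its ticket and the DB write; its sequence number becomes a permanent gap rather than blocking everything behind it.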
When the task of the batch manager crashes, then the batching and notifications end. However, the transactions and sequence numbers are still written to the DB for batching upon recovery.
Remaining issues (fixed)
There is still a window of vulnerability: if the authority crashes between successfully writing to the DB and signaling to the batch manager, the batch manager will miss the update forever. Here the transaction has been written to storage, so we can include some logic in the batch manager to recover: when there is a gap in the sequence numbers the batch manager can attempt to read any missing transactions from the DB. If found they are sequenced, otherwise they are skipped. This slower path of reading from the DB should be activated infrequently (we do not expect authority state to crash all the time).
Potential improvements