
[authority] Batch crash robustness v3 #835

Merged: 11 commits into main on Mar 21, 2022

Conversation

@gdanezis (Collaborator) commented Mar 15, 2022

The Background

This is the v3 attempt at creating the authority-side infrastructure to support Priority A, namely producing a sequence of transactions and interspersed batches that provide meta-data and authentication. It follows the saga of the previous attempts.

The issues with previous approaches were:

  • When we had a consistent sequence, we either had to block quite a bit and/or use potentially unbounded buffers.
  • When we did not enforce consistency, the sequence was simply not consistent, which reduces its value as a canonical representation of the authority state that can be reliably and efficiently queried and followed (with no surprises).

What this PR does

  • This PR refactors the (now defunct) batch manager to use the database directly to buffer / store the transaction sequence (we store the sequence there anyway) instead of an extra in-memory structure that has to be kept consistent with the DB and prevented from overfilling. No extra in-memory state is held, so nothing is lost if the batch manager or any task handling transactions dies, and concerns around memory exhaustion are covered.
  • It uses a Notifier-based infrastructure to nudge batch creation to read new transactions from the DB. The notifier also keeps a "low watermark" of safe transactions that can be stored in batches, since they will never be re-ordered and the sequence is complete up to that point. This is done with the help of (much simplified) tickets that either correspond to transactions or signal, when they are dropped, that the transaction will never commit and the watermark can rise.
  • A notable feature of the notifier is that it exposes the transaction sequence as an asynchronous stream, which is consumed by the batch creation process (a minimal sketch of such a consumer follows this list). The infra for listening to UpdateItems is unchanged and based on the tokio broadcast primitive.
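
A minimal sketch of what such a stream consumer could look like, using the min_batch_size / max_delay parameters mentioned later in the thread; the names run_batch_loop and SeqItem are hypothetical, and this illustrates the batching loop rather than the actual run_batch_service implementation:

use std::time::Duration;

use futures::stream::{self, Stream, StreamExt};
use tokio::time::timeout;

// Hypothetical item type standing in for a (sequence number, digest) pair.
type SeqItem = (u64, [u8; 32]);

/// Drain an async stream of sequenced transactions and cut a batch either
/// when `min_batch_size` items have accumulated or when `max_delay` elapses.
async fn run_batch_loop<S>(mut txs: S, min_batch_size: usize, max_delay: Duration)
where
    S: Stream<Item = SeqItem> + Unpin,
{
    let mut current: Vec<SeqItem> = Vec::new();
    loop {
        match timeout(max_delay, txs.next()).await {
            // A new sequenced transaction arrived: buffer it and cut a
            // batch once the size threshold is reached.
            Ok(Some(item)) => {
                current.push(item);
                if current.len() >= min_batch_size {
                    println!("batch of {} transactions", current.len());
                    current.clear();
                }
            }
            // The stream ended (notifier closed): flush what is left.
            Ok(None) => {
                if !current.is_empty() {
                    println!("final batch of {} transactions", current.len());
                }
                break;
            }
            // max_delay elapsed without a new transaction: cut a (possibly
            // small) batch so followers are never starved for too long.
            Err(_) => {
                if !current.is_empty() {
                    println!("timed batch of {} transactions", current.len());
                    current.clear();
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let items = stream::iter((0u64..10).map(|i| (i, [0u8; 32])));
    run_batch_loop(items, 4, Duration::from_millis(50)).await;
}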

Minor issues fixed:

  • Thanks to the tests, we no longer try to re-genesis an authority whose store already contains objects (check @awelc ?).

What this PR does not do

Sadly this PR does not resolve the fact that if two transactions interact with each other poorly, and are given sequence numbers in the opposite order from the one in which they are written to disk, the second may have to wait for the first to be written to disk. This is a slight delay.

@gdanezis gdanezis force-pushed the batch-crash-robustness-v3 branch from 49dcc49 to ddef948 on March 15, 2022 02:19
@oxade oxade self-requested a review March 15, 2022 17:10
@gdanezis gdanezis mentioned this pull request Mar 16, 2022
@lanvidr lanvidr self-requested a review March 16, 2022 16:32
.expect("Init batches failed!");

// Only initialize an empty database.
if store
Contributor

Re https://github.com/MystenLabs/sui/pull/807/files#r828079220: this piece of code is only expected to run during Genesis. And when I (as a new developer) think of genesis, I will assume everything starts from a fresh copy. Should we add code to clean up the DB if it's not empty during genesis?

Collaborator Author

I am quite worried about deleting data here. It's better if we throw an error maybe, and let the user know they have pointed to a database that already contains data? Maybe it's a mistake and the data is valuable to them.
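
Something along these lines, for illustration only; store_is_empty and the error text are hypothetical stand-ins, not the actual code:

// Hypothetical genesis guard: refuse to initialize over existing data
// rather than silently wiping it.
fn check_genesis_target(store_is_empty: bool) -> Result<(), String> {
    if !store_is_empty {
        return Err(
            "genesis aborted: the target database already contains data; \
             point the authority at an empty directory or remove the data explicitly"
                .to_string(),
        );
    }
    Ok(())
}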

            .high_watermark
            .fetch_add(1, std::sync::atomic::Ordering::SeqCst);
        Ok(TransactionNotifierTicket {
            transaction_notifier: self.clone(),
Contributor

I notice that we do a lot of self.clone() on the transaction notifier, which has an AuthorityStore inside. Does this not create an opportunity for the store contents of different notifiers to diverge?

Collaborator Author

Our authority store is wrapped in an Arc<AuthorityStore> within the notifier structure, so .clone() copies the pointer (and increments the atomic reference count inside the Arc), and all the clones point to the same store.
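
For illustration, the standard Arc cloning behaviour this relies on (a self-contained example, not the notifier code itself):

use std::sync::Arc;

fn main() {
    let store = Arc::new(vec![1, 2, 3]);

    // Cloning the Arc copies the pointer and bumps the reference count;
    // it does not deep-copy the underlying store.
    let notifier_a = Arc::clone(&store);
    let notifier_b = Arc::clone(&store);

    assert!(Arc::ptr_eq(&store, &notifier_a));
    assert!(Arc::ptr_eq(&store, &notifier_b));
    assert_eq!(Arc::strong_count(&store), 3);
}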

        Ok(TransactionNotifier {
            state,
            low_watermark: AtomicU64::new(seq),
            high_watermark: AtomicU64::new(seq),
Contributor

the watermarks don't seem to get updated yet, is that a future PR?

Collaborator Author

The low watermark gets updated on ticket drop:

impl Drop for TransactionNotifierTicket {
    fn drop(&mut self) {
        self.transaction_notifier
            .low_watermark
            .fetch_add(1, std::sync::atomic::Ordering::SeqCst);
        self.transaction_notifier.notify.notify_one();
    }
}

And the high watermark on the ticket issuance:

    /// Get a ticket with a sequence number
    pub fn ticket(self: &Arc<Self>) -> SuiResult<TransactionNotifierTicket> {
        if self.is_closed.load(std::sync::atomic::Ordering::SeqCst) {
            return Err(SuiError::ClosedNotifierError);
        }

        let seq = self
            .high_watermark
            .fetch_add(1, std::sync::atomic::Ordering::SeqCst);
        Ok(TransactionNotifierTicket {
            transaction_notifier: self.clone(),
            seq,
        })
    }

I think that ensures low_watermark <= high_watermark, and the "gap" between them represents the number of certificates whose write to the db is still pending.

            .is_err()
        {
            return Err(SuiError::ConcurrentIteratorError);
        }
Contributor

why is only one stream allowed?

Collaborator Author

This stream is read by the task running the run_batch_service that creates batches. I wanted to make it hard for anyone to make a mistake here and create two concurrent services that create batches, since they would (a) potentially take turns being notified (not sure if notify wakes everyone up), and (b) compete in making batches for transactions within the same range.
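
For reference, a one-shot guard like the check above can be sketched with an atomic flag and compare_exchange; the type, field name, and error value here are illustrative only, not the actual notifier code:

use std::sync::atomic::{AtomicBool, Ordering};

// One-shot guard: the first caller flips the flag and gets the stream;
// any later caller sees the compare_exchange fail.
struct OneShotStreamGuard {
    has_stream: AtomicBool,
}

impl OneShotStreamGuard {
    fn new() -> Self {
        Self { has_stream: AtomicBool::new(false) }
    }

    fn try_take(&self) -> Result<(), &'static str> {
        if self
            .has_stream
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_err()
        {
            return Err("ConcurrentIteratorError");
        }
        Ok(())
    }
}

fn main() {
    let guard = OneShotStreamGuard::new();
    assert!(guard.try_take().is_ok()); // first consumer gets the stream
    assert!(guard.try_take().is_err()); // a second consumer is rejected
}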

store.side_sequence(t8.seq(), &TransactionDigest::random());
drop(t8);

assert!(matches!(iter.next().await, Some((5, _))));
Contributor

so dropping a sequence leaves it intact in the batch

Collaborator Author

The drop calls above are on tickets. Dropping a ticket increases the low watermark (see above) and notifies the waiting stream that there may be a new transaction available and safe to sequence. The sequence itself is stored in the db (when the write has not failed), so it is there for the stream generated by TransactionNotifier::iter_from to pick up and deliver (if it exists).

Comment on lines +55 to +72
let _batch_join_handle = tokio::task::spawn(async move {
    local_server
        .state
        .run_batch_service(min_batch_size, max_delay)
        .await
});
@lanvidr (Contributor) Mar 16, 2022

This spawn_batch_subsystem can be retrofitted to run inside a resilient component.

We would create a wrapper struct holding local_server, min_batch_size, max_delay, etc., and then implement a trait that calls spawn_batch_subsystem, which would clone local_server etc. from self and pass them into the tokio task. This will work because it is cloned. 👍
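
A rough sketch of that wrapper pattern, with hypothetical names (BatchSubsystem, Spawnable) and the local_server field omitted for brevity; an illustration of the idea rather than the actual component:

use std::time::Duration;

// Bundle the parameters the batch task needs; everything is cloneable so
// the spawned task can own its own copies.
#[derive(Clone)]
struct BatchSubsystem {
    min_batch_size: usize,
    max_delay: Duration,
    // local_server (behind an Arc) would live here too in the real code.
}

trait Spawnable {
    fn spawn(&self) -> tokio::task::JoinHandle<()>;
}

impl Spawnable for BatchSubsystem {
    fn spawn(&self) -> tokio::task::JoinHandle<()> {
        let this = self.clone(); // works because every field is cloneable
        tokio::task::spawn(async move {
            // placeholder for this.local_server.state.run_batch_service(...)
            let _ = (this.min_batch_size, this.max_delay);
        })
    }
}

#[tokio::main]
async fn main() {
    let subsystem = BatchSubsystem {
        min_batch_size: 100,
        max_delay: Duration::from_millis(500),
    };
    subsystem.spawn().await.unwrap();
}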

Collaborator Author

Yes, this is where the re-startable component would come in very handy. When the task handling the run_batch_service fails, we need to reclaim any resources (there are none for the moment actually) and then start a new one.
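
A minimal sketch of such a restartable component: a supervision loop that respawns the batch service when its task errors or panics. run_batch_service_once is a hypothetical stand-in for the real entry point, and any resource reclamation would go before the respawn:

use std::time::Duration;

async fn supervise_batch_service() {
    loop {
        let handle = tokio::spawn(async { run_batch_service_once().await });
        match handle.await {
            Ok(Ok(())) => break, // clean shutdown, stop supervising
            Ok(Err(e)) => eprintln!("batch service error: {e}, restarting"),
            Err(join_err) => eprintln!("batch service panicked: {join_err}, restarting"),
        }
        // Reclaim resources here if needed, then back off briefly and respawn.
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}

// Placeholder for the real run_batch_service(min_batch_size, max_delay).
async fn run_batch_service_once() -> Result<(), String> {
    Ok(())
}

#[tokio::main]
async fn main() {
    supervise_batch_service().await;
}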

@lanvidr lanvidr requested review from lanvidr and 666lcz March 21, 2022 17:06
@gdanezis gdanezis force-pushed the batch-crash-robustness-v3 branch from ddef948 to 69c3f55 on March 21, 2022 18:46
@gdanezis gdanezis merged commit c0bc30a into main Mar 21, 2022
@gdanezis gdanezis deleted the batch-crash-robustness-v3 branch March 21, 2022 19:16