
[authority] Batch crash robustness v3 #835

Merged: 11 commits into main on Mar 21, 2022

Conversation

@gdanezis (Collaborator) commented Mar 15, 2022

The Background

This is the v3 attempt at creating the authority-side infrastructure to support Priority A, namely producing a sequence of transactions and interspersed batches that provide meta-data and authentication. It follows the saga of the previous attempts.

The issues with previous approaches were:

  • When we had a consistent sequence, we either had to block quite a bit and/or use potentially unbounded buffers.
  • When we did not enforce consistency, the sequence was simply not consistent, which reduces its value as a canonical representation of the authority state that can be reliably and efficiently queried and followed (with no surprises).

What this PR does

  • This PR refactors the (now defunct) batch manager to use the database directly to buffer / store the transaction sequence (we store the sequence there anyway) instead of an extra in-memory structure that has to be kept consistent with the DB and prevented from overfilling. No extra in-memory state is held, so nothing is lost if the batch manager or any task handling transactions dies, and concerns around memory exhaustion are covered.
  • It uses a Notifier-based infrastructure to nudge batch creation to read new transactions from the DB. The notifier also keeps a "low watermark" of safe transactions that can be stored in batches, since they will never be re-ordered and the sequence is complete up to that point. This is done with the help of (much simplified) tickets that either correspond to transactions or signal, when they are dropped, that the transaction will never commit and the watermark can rise.
  • A notable feature of the notifier is that it exposes the transaction sequence as an asynchronous stream, which is consumed by the batch creation process (a minimal sketch of such a consumer follows this list). The infra for listening to UpdateItems is unchanged and based on the tokio broadcast primitive.
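
A minimal sketch of what such a stream consumer could look like, using the min_batch_size / max_delay parameters mentioned later in the thread; the names run_batch_loop and SeqItem are hypothetical, and this illustrates the batching loop rather than the actual run_batch_service implementation:

use std::time::Duration;

use futures::stream::{self, Stream, StreamExt};
use tokio::time::timeout;

// Hypothetical item type standing in for a (sequence number, digest) pair.
type SeqItem = (u64, [u8; 32]);

/// Drain an async stream of sequenced transactions and cut a batch either
/// when `min_batch_size` items have accumulated or when `max_delay` elapses.
async fn run_batch_loop<S>(mut txs: S, min_batch_size: usize, max_delay: Duration)
where
    S: Stream<Item = SeqItem> + Unpin,
{
    let mut current: Vec<SeqItem> = Vec::new();
    loop {
        match timeout(max_delay, txs.next()).await {
            // A new sequenced transaction arrived: buffer it and cut a
            // batch once the size threshold is reached.
            Ok(Some(item)) => {
                current.push(item);
                if current.len() >= min_batch_size {
                    println!("batch of {} transactions", current.len());
                    current.clear();
                }
            }
            // The stream ended (notifier closed): flush what is left.
            Ok(None) => {
                if !current.is_empty() {
                    println!("final batch of {} transactions", current.len());
                }
                break;
            }
            // max_delay elapsed without a new transaction: cut a (possibly
            // small) batch so followers are never starved for too long.
            Err(_) => {
                if !current.is_empty() {
                    println!("timed batch of {} transactions", current.len());
                    current.clear();
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let items = stream::iter((0u64..10).map(|i| (i, [0u8; 32])));
    run_batch_loop(items, 4, Duration::from_millis(50)).await;
}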

Minor issues fixed:

  • Thanks to the tests, we no longer try to re-genesis an authority whose store already contains objects (check @awelc ?).

What this PR does not do

Sadly this PR does not resolve the fact that if two transactions interact with each other poorly, and are given sequence numbers in the opposite order from the one in which they are written to disk, the second may have to wait for the first to be written to disk. This is a slight delay.

@gdanezis gdanezis force-pushed the batch-crash-robustness-v3 branch from 49dcc49 to ddef948 on March 15, 2022 02:19
@oxade oxade self-requested a review March 15, 2022 17:10
@gdanezis gdanezis mentioned this pull request Mar 16, 2022
@lanvidr lanvidr self-requested a review March 16, 2022 16:32
.expect("Init batches failed!");

// Only initialize an empty database.
if store
Contributor

Re https://github.com/MystenLabs/sui/pull/807/files#r828079220: this piece of code is only expected to run during Genesis. And when I (as a new developer) think of genesis, I will assume everything starts from a fresh copy. Should we add code to clean up the DB if it's not empty during genesis?

Collaborator Author

I am quite worried about deleting data here. It's better if we throw an error maybe, and let the user know they have pointed to a database that already contains data? Maybe it's a mistake and the data is valuable to them.
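
Something along these lines, for illustration only; store_is_empty and the error text are hypothetical stand-ins, not the actual code:

// Hypothetical genesis guard: refuse to initialize over existing data
// rather than silently wiping it.
fn check_genesis_target(store_is_empty: bool) -> Result<(), String> {
    if !store_is_empty {
        return Err(
            "genesis aborted: the target database already contains data; \
             point the authority at an empty directory or remove the data explicitly"
                .to_string(),
        );
    }
    Ok(())
}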

            .high_watermark
            .fetch_add(1, std::sync::atomic::Ordering::SeqCst);
        Ok(TransactionNotifierTicket {
            transaction_notifier: self.clone(),
Contributor

I notice that we do a lot of self.clone() on the transaction notifier, which has an AuthorityStore inside. Does this not create an opportunity for the store contents of different notifiers to diverge?

Collaborator Author

Our authority store is wrapped in an Arc<AuthorityStore> within the notifier structure, so .clone() copies the pointer (and increments the atomic reference count inside the Arc), and all the clones point to the same store.
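
For illustration, the standard Arc cloning behaviour this relies on (a self-contained example, not the notifier code itself):

use std::sync::Arc;

fn main() {
    let store = Arc::new(vec![1, 2, 3]);

    // Cloning the Arc copies the pointer and bumps the reference count;
    // it does not deep-copy the underlying store.
    let notifier_a = Arc::clone(&store);
    let notifier_b = Arc::clone(&store);

    assert!(Arc::ptr_eq(&store, &notifier_a));
    assert!(Arc::ptr_eq(&store, &notifier_b));
    assert_eq!(Arc::strong_count(&store), 3);
}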

        Ok(TransactionNotifier {
            state,
            low_watermark: AtomicU64::new(seq),
            high_watermark: AtomicU64::new(seq),
Contributor

the watermarks don't seem to get updated yet, is that a future PR?

Collaborator Author

The low watermark gets updated on ticket drop:

impl Drop for TransactionNotifierTicket {
    fn drop(&mut self) {
        self.transaction_notifier
            .low_watermark
            .fetch_add(1, std::sync::atomic::Ordering::SeqCst);
        self.transaction_notifier.notify.notify_one();
    }
}

And the high watermark on the ticket issuance:

    /// Get a ticket with a sequence number
    pub fn ticket(self: &Arc<Self>) -> SuiResult<TransactionNotifierTicket> {
        if self.is_closed.load(std::sync::atomic::Ordering::SeqCst) {
            return Err(SuiError::ClosedNotifierError);
        }

        let seq = self
            .high_watermark
            .fetch_add(1, std::sync::atomic::Ordering::SeqCst);
        Ok(TransactionNotifierTicket {
            transaction_notifier: self.clone(),
            seq,
        })
    }

I think that ensures low_watermark <= high_watermark, and the "gap" between them represents the number of certificates whose write to the db is still pending.

            .is_err()
        {
            return Err(SuiError::ConcurrentIteratorError);
        }
Contributor

why is only one stream allowed?

Collaborator Author

This stream is read by the task running the run_batch_service that creates batches. I wanted to make it hard for anyone to make a mistake here and create two concurrent services that create batches, since they would (a) potentially take turns being notified (not sure if notify wakes everyone up), and (b) compete in making batches for transactions within the same range.
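
For reference, a one-shot guard like the check above can be sketched with an atomic flag and compare_exchange; the type, field name, and error value here are illustrative only, not the actual notifier code:

use std::sync::atomic::{AtomicBool, Ordering};

// One-shot guard: the first caller flips the flag and gets the stream;
// any later caller sees the compare_exchange fail.
struct OneShotStreamGuard {
    has_stream: AtomicBool,
}

impl OneShotStreamGuard {
    fn new() -> Self {
        Self { has_stream: AtomicBool::new(false) }
    }

    fn try_take(&self) -> Result<(), &'static str> {
        if self
            .has_stream
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_err()
        {
            return Err("ConcurrentIteratorError");
        }
        Ok(())
    }
}

fn main() {
    let guard = OneShotStreamGuard::new();
    assert!(guard.try_take().is_ok()); // first consumer gets the stream
    assert!(guard.try_take().is_err()); // a second consumer is rejected
}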

store.side_sequence(t8.seq(), &TransactionDigest::random());
drop(t8);

assert!(matches!(iter.next().await, Some((5, _))));
Contributor

so dropping a sequence leaves it intact in the batch

Collaborator Author

The drop calls above are on tickets. Dropping a ticket increases the low watermark (see above) and notifies the waiting stream that there may be a new transaction available and safe to sequence. The sequence itself is stored in the db (when the write has not failed), so it is there for the stream generated by TransactionNotifier::iter_from to pick up and deliver (if it exists).

Comment on lines +55 to +72
let _batch_join_handle = tokio::task::spawn(async move {
    local_server
        .state
        .run_batch_service(min_batch_size, max_delay)
        .await
});
@lanvidr (Contributor) Mar 16, 2022

This spawn_batch_subsystem can be retrofitted to run inside a resilient component.

We would create a wrapper struct holding local_server, min_batch_size, max_delay, etc., and then implement a trait that calls spawn_batch_subsystem, which would clone local_server etc. from self and pass them into the tokio task. This will work because it is cloned. 👍
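
A rough sketch of that wrapper pattern, with hypothetical names (BatchSubsystem, Spawnable) and the local_server field omitted for brevity; an illustration of the idea rather than the actual component:

use std::time::Duration;

// Bundle the parameters the batch task needs; everything is cloneable so
// the spawned task can own its own copies.
#[derive(Clone)]
struct BatchSubsystem {
    min_batch_size: usize,
    max_delay: Duration,
    // local_server (behind an Arc) would live here too in the real code.
}

trait Spawnable {
    fn spawn(&self) -> tokio::task::JoinHandle<()>;
}

impl Spawnable for BatchSubsystem {
    fn spawn(&self) -> tokio::task::JoinHandle<()> {
        let this = self.clone(); // works because every field is cloneable
        tokio::task::spawn(async move {
            // placeholder for this.local_server.state.run_batch_service(...)
            let _ = (this.min_batch_size, this.max_delay);
        })
    }
}

#[tokio::main]
async fn main() {
    let subsystem = BatchSubsystem {
        min_batch_size: 100,
        max_delay: Duration::from_millis(500),
    };
    subsystem.spawn().await.unwrap();
}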

Collaborator Author

Yes, this is where the re-startable component would come in very handy. When the task handling the run_batch_service fails, we need to reclaim any resources (there are none for the moment actually) and then start a new one.
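
A minimal sketch of such a restartable component: a supervision loop that respawns the batch service when its task errors or panics. run_batch_service_once is a hypothetical stand-in for the real entry point, and any resource reclamation would go before the respawn:

use std::time::Duration;

async fn supervise_batch_service() {
    loop {
        let handle = tokio::spawn(async { run_batch_service_once().await });
        match handle.await {
            Ok(Ok(())) => break, // clean shutdown, stop supervising
            Ok(Err(e)) => eprintln!("batch service error: {e}, restarting"),
            Err(join_err) => eprintln!("batch service panicked: {join_err}, restarting"),
        }
        // Reclaim resources here if needed, then back off briefly and respawn.
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}

// Placeholder for the real run_batch_service(min_batch_size, max_delay).
async fn run_batch_service_once() -> Result<(), String> {
    Ok(())
}

#[tokio::main]
async fn main() {
    supervise_batch_service().await;
}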

@lanvidr lanvidr requested review from lanvidr and 666lcz March 21, 2022 17:06
@gdanezis gdanezis force-pushed the batch-crash-robustness-v3 branch from ddef948 to 69c3f55 on March 21, 2022 18:46
@gdanezis gdanezis merged commit c0bc30a into main Mar 21, 2022
@gdanezis gdanezis deleted the batch-crash-robustness-v3 branch March 21, 2022 19:16