
update elastic scaling guide #6739

Draft

wants to merge 13 commits into master

Conversation

Contributor

@alindima alindima commented Dec 3, 2024

Resolves #5050

Updates the elastic scaling guide, taking into consideration:

  • the completed implementation of RFC-103, which enables an untrusted collator set for elastic scaling. Adds the necessary instructions for configuring the collator so that it can leverage this implementation
  • general updates for bits that became out of date

This PR should not be merged until:

  1. the `CandidateReceiptV2` node feature bit is enabled on all networks
  2. the functionality hidden under the `experimental-ump-signals` feature of the `parachain-system` pallet is turned on by default (which can only be done after 1)

TODO:

@alindima alindima added the T11-documentation This PR/Issue is related to documentation. label Dec 3, 2024
@alindima alindima marked this pull request as draft December 3, 2024 09:59
@sandreim sandreim mentioned this pull request Dec 9, 2024
@alindima alindima requested a review from sandreim December 10, 2024 09:38
//!
//! - The `DefaultCoreSelector` implements a round-robin selection on the cores that can be
//! occupied by the parachain at the very next relay parent. This is equivalent to what all
//! parachains on production networks have been using so far.
Member

Hmm. Shall we rename this as part of this PR? It seems like `LookaheadCoreSelector` should be the "default", since we expect any new parachain to use asynchronous backing.
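For illustration, a minimal sketch of the round-robin idea described in the quoted doc (this is only the concept, not the actual `DefaultCoreSelector` implementation):

```rust
/// Sketch only: rotate through the cores the parachain may occupy at the very next
/// relay parent, keyed by the parachain block number.
fn round_robin_core(block_number: u32, assigned_cores: &[u32]) -> Option<u32> {
    if assigned_cores.is_empty() {
        return None;
    }
    let idx = block_number as usize % assigned_cores.len();
    Some(assigned_cores[idx])
}
```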

//! <div class="warning">If you configure a velocity which is different from the number of assigned
//! cores, the measured velocity in practice will be the minimum of these two. However, be mindful
//! that if the velocity is higher than the number of assigned cores, it's possible that
//! <a href="https://github.com/paritytech/polkadot-sdk/issues/6667"> only a subset of the collator set will be authoring blocks.</a></div>
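To make the quoted warning concrete, a hypothetical helper (not an existing API):

```rust
/// The velocity measured in practice is capped by the number of assigned cores.
fn effective_velocity(configured_velocity: u32, assigned_cores: u32) -> u32 {
    configured_velocity.min(assigned_cores)
}
```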
Member

The question is why we need to configure a velocity at all; it seems redundant.

Member

Once the slot-based collator can produce multiple blocks per slot, we should also add a recommendation for slot durations of at least 6s, preferably even 12s (better censorship resistance).
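As a sketch of what that recommendation could look like as runtime constants (values are illustrative assumptions, not template defaults):

```rust
// Sketch: once the slot-based collator can author multiple blocks per slot, the slot
// duration can be longer than the block time.
pub const BLOCK_TIME_MS: u64 = 2_000; // target parachain block time
pub const SLOT_DURATION_MS: u64 = 12_000; // recommended: at least 6s, preferably 12s
pub const BLOCKS_PER_SLOT: u64 = SLOT_DURATION_MS / BLOCK_TIME_MS; // 6 blocks per slot
```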

//! `overseer_handle` and `relay_chain_slot_duration` params passed to `start_consensus` and pass
//! in the `slot_based_handle`.
//!
//! ### Phase 2 - Configure core selection policy in the parachain runtime
Contributor Author

Phase 2 assumes the candidate receipt v2 feature bit is enabled.
This phase will change after the feature bit is enabled on all networks and a form of #6939 is merged.

@@ -15,7 +15,9 @@ use polkadot_sdk::*;
use cumulus_client_cli::CollatorOptions;
use cumulus_client_collator::service::CollatorService;
#[docify::export(lookahead_collator)]
use cumulus_client_consensus_aura::collators::lookahead::{self as aura, Params as AuraParams};
use cumulus_client_consensus_aura::collators::slot_based::{
Contributor Author

Changes in this file will be rolled back before merge, but they currently showcase what a parachain team using the template would need to do on the node side to use elastic scaling.

//!
//! ### Phase 3 - Configure maximum scaling factor in the runtime
//!
//! First of all, you need to decide the upper limit to how many parachain blocks you need to
Contributor

Actually, the thinking is the other way around: what is the minimum target block time? It is then no longer necessary to configure any other parameters manually, as you can compute them from this value.

Contributor Author

you can also make all the calculations based on the velocity, which is what I describe here

Contributor

I can see what is described here, but I want a better DX.

As you've noticed recently, people don't ask "how many parachain blocks can I produce per relay chain block?". Instead they ask "How can I get 500ms blocks?", because that is what their end users care about. The velocity of the parachain is largely an implementation detail.

With that being said, we can then remove all of the details about velocity and the concern that teams need to compute all sorts of other constants.
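A sketch of that derivation as a hypothetical helper (assumes a 6s relay chain block time; not an existing API):

```rust
/// Relay chain block time in milliseconds (6s on production networks).
const RELAY_BLOCK_TIME_MS: u64 = 6_000;

/// Parachain blocks per relay chain block needed to hit a target block time.
const fn velocity_for(target_block_time_ms: u64) -> u64 {
    RELAY_BLOCK_TIME_MS / target_block_time_ms
}

// "How can I get 500ms blocks?" translates to a velocity of 12, which roughly
// corresponds to 12 assigned cores (one candidate per core per relay chain block).
const TARGET_VELOCITY: u64 = velocity_for(500);
```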

//!
//! ## Current constraints
//!
//! Elastic scaling is still considered experimental software, so stability is not guaranteed.
Contributor

After launching on Polkadot this will no longer be true.

Contributor Author

True, will update when that is the case

Comment on lines 28 to 38
//! duration of 2 seconds per block.** Using the current implementation with multiple collators
//! adds additional latency to the block production pipeline. Assuming block execution takes
//! about the same as authorship, the additional overhead is equal to the duration of the authorship
//! plus the block announcement. Each collator must first import the previous block before
//! authoring a new one, so it is clear that the highest throughput can be achieved using a
//! single collator. Experiments show that the peak performance using more than one collator
//! (measured up to 10 collators) is utilising 2 cores with authorship time of 1.3 seconds per
//! block, which leaves 400ms for networking overhead. This would allow for 2.6 seconds of
//! execution, compared to the 2 seconds async backing enabled.
//! The development required for enabling maximum compute throughput for multiple collators is tracked by
//! [this issue](https://github.com/paritytech/polkadot-sdk/issues/5190).
Contributor

I think we can do much better in terms of structure here than a large blob of text, which is not that easy to read and doesn't focus attention on the important information.

Contributor Author

I rewrote this section. Let me know how it looks.

//! this should obviously only be used for testing purposes, due to the clear lack of decentralisation
//! and resilience. Experiments show that the peak compute throughput using more than one collator
//! (measured up to 10 collators) is utilising 2 cores with authorship time of 1.3 seconds per block,
//! which leaves 400ms for networking overhead. This would allow for 2.6 seconds of execution, compared
Contributor

Let's add the formula as a function of latency to compute the max usable execution time.
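One way to write that down, using the figures from the quoted paragraph (a sketch of the model implied by the text, treating import time as roughly equal to authorship time; not an official formula):

```rust
// With multiple collators, each collator must import the previous block before
// authoring, so one block "cycle" is roughly: import (~authorship) + authorship
// + announcement latency.
const RELAY_BLOCK_TIME_MS: u64 = 6_000; // relay chain block time
const AUTHORSHIP_MS: u64 = 1_300; // measured authorship (~execution) time per block
const ANNOUNCEMENT_MS: u64 = 400; // networking overhead per block

const CYCLE_MS: u64 = AUTHORSHIP_MS + AUTHORSHIP_MS + ANNOUNCEMENT_MS; // 3_000 ms
const BLOCKS_PER_RELAY_BLOCK: u64 = RELAY_BLOCK_TIME_MS / CYCLE_MS; // 2 (i.e. 2 cores)
const MAX_USABLE_EXECUTION_MS: u64 = BLOCKS_PER_RELAY_BLOCK * AUTHORSHIP_MS; // 2_600 ms
```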

Labels
T11-documentation This PR/Issue is related to documentation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Elastic scaling: update guide for RFC103
3 participants