Merge branch 'master' into ao-glossary-update

* master:
  ig: Fix description of execution retry delay (#6342)
  Added Amforc bootnodes for Polkadot and Kusama (#6077)
  [ci] fix build-implementers-guide (#6335)
  Rate limit improvements (#6315)
  Add PVF module documentation (#6293)
  Update async-trait version to v0.1.58 (#6319)
paritytech · Nov 25, 2022 · 5aa3fca
2 parents d1edc0b + 795b20c

Showing 19 changed files with 300 additions and 245 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/node/core/pvf/src/executor_intf.rs b/node/core/pvf/src/executor_intf.rs
@@ -96,7 +96,7 @@ pub fn prevalidate(code: &[u8]) -> Result<RuntimeBlob, sc_executor_common::error
 }
 
 /// Runs preparation on the given runtime blob. If successful, it returns a serialized compiled
-/// artifact which can then be used to pass into [`execute`] after writing it to the disk.
+/// artifact which can then be used to pass into `Executor::execute` after writing it to the disk.
 pub fn prepare(blob: RuntimeBlob) -> Result<Vec<u8>, sc_executor_common::error::WasmError> {
 	sc_executor_wasmtime::prepare_runtime_artifact(blob, &CONFIG.semantics)
 }

diff --git a/node/core/pvf/src/lib.rs b/node/core/pvf/src/lib.rs
@@ -16,18 +16,27 @@
 
 #![warn(missing_docs)]
 
-//! A crate that implements PVF validation host.
+//! A crate that implements the PVF validation host.
+//!
+//! For more background, refer to the Implementer's Guide: [PVF
+//! Pre-checking](https://paritytech.github.io/polkadot/book/pvf-prechecking.html) and [Candidate
+//! Validation](https://paritytech.github.io/polkadot/book/node/utility/candidate-validation.html#pvf-host).
+//!
+//! # Entrypoint
 //!
 //! This crate provides a simple API. You first [`start`] the validation host, which gives you the
 //! [handle][`ValidationHost`] and the future you need to poll.
 //!
-//! Then using the handle the client can send two types of requests:
+//! Then using the handle the client can send three types of requests:
+//!
+//! (a) PVF pre-checking. This takes the PVF [code][`Pvf`] and tries to prepare it (verify and
+//! compile) in order to pre-check its validity.
 //!
-//! (a) PVF execution. This accepts the PVF [`params`][`polkadot_parachain::primitives::ValidationParams`]
+//! (b) PVF execution. This accepts the PVF [`params`][`polkadot_parachain::primitives::ValidationParams`]
 //!     and the PVF [code][`Pvf`], prepares (verifies and compiles) the code, and then executes PVF
 //!     with the `params`.
 //!
-//! (b) Heads up. This request allows to signal that the given PVF may be needed soon and that it
+//! (c) Heads up. This request allows to signal that the given PVF may be needed soon and that it
 //!     should be prepared for execution.
 //!
 //! The preparation results are cached for some time after they were either used or signaled in a heads up.
@@ -39,7 +48,7 @@
 //! PVF execution requests can specify the [priority][`Priority`] with which the given request should
 //! be handled. Different priority levels have different effects. This is discussed below.
 //!
-//! Preparation started by a heads up signal always starts in with the background priority. If there
+//! Preparation started by a heads up signal always starts with the background priority. If there
 //! is already a request for that PVF preparation under way the priority is inherited. If after heads
 //! up, a new PVF execution request comes in with a higher priority, then the original task's priority
 //! will be adjusted to match the new one if it's larger.
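
The priority inheritance rule described in this hunk can be illustrated with a minimal sketch. It assumes a three-level `Priority` ordering (`Background`, `Normal`, `Critical`) and a hypothetical `PreparationJob` record with a `raise_to` helper; neither the record nor the helper is taken from the crate, and the real host may track this differently.

```rust
/// Request priority, declared from least to most urgent so the derived `Ord`
/// can be used to compare levels. The ordering is an assumption of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Priority {
    Background,
    Normal,
    Critical,
}

/// Hypothetical bookkeeping for one in-flight preparation job.
#[derive(Debug)]
pub struct PreparationJob {
    pub priority: Priority,
}

impl PreparationJob {
    /// A job started by a heads up signal begins with the background priority.
    pub fn from_heads_up() -> Self {
        Self { priority: Priority::Background }
    }

    /// Raise the job's priority to `requested` if that is higher than the
    /// current one. Returns `true` when the priority actually changed.
    pub fn raise_to(&mut self, requested: Priority) -> bool {
        if requested > self.priority {
            self.priority = requested;
            true
        } else {
            false
        }
    }
}
```

In this sketch a later request with a lower priority leaves the job untouched, which matches the "adjusted to match the new one if it's larger" wording.
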
@@ -48,18 +57,22 @@
 //!
 //! # Under the hood
 //!
+//! ## The flow
+//!
 //! Under the hood, the validation host is built using a bunch of communicating processes, not
 //! dissimilar to actors. Each of such "processes" is a future task that contains an event loop that
 //! processes incoming messages, potentially delegating sub-tasks to other "processes".
 //!
 //! Two of these processes are queues. The first one is for preparation jobs and the second one is for
 //! execution. Both of the queues are backed by separate pools of workers of different kind.
 //!
-//! Preparation workers handle preparation requests by preverifying and instrumenting PVF wasm code,
+//! Preparation workers handle preparation requests by prevalidating and instrumenting PVF wasm code,
 //! and then passing it into the compiler, to prepare the artifact.
 //!
-//! Artifact is a final product of preparation. If the preparation succeeded, then the artifact will
-//! contain the compiled code usable for quick execution by a worker later on.
+//! ## Artifacts
+//!
+//! An artifact is the final product of preparation. If the preparation succeeded, then the artifact
+//! will contain the compiled code usable for quick execution by a worker later on.
 //!
 //! If the preparation failed, then the worker will still write the artifact with the error message.
 //! We save the artifact with the error so that we don't try to prepare the artifacts that are broken
@@ -68,12 +81,14 @@
 //! The artifact is saved on disk and is also tracked by an in memory table. This in memory table
 //! doesn't contain the artifact contents though, only a flag that the given artifact is compiled.
 //!
+//! A pruning task will run at a fixed interval of time. This task will remove all artifacts that
+//! weren't used or received a heads up signal for a while.
+//!
+//! ## Execution
+//!
 //! The execute workers will be fed by the requests from the execution queue, which is basically a
 //! combination of a path to the compiled artifact and the
 //! [`params`][`polkadot_parachain::primitives::ValidationParams`].
-//!
-//! Each fixed interval of time a pruning task will run. This task will remove all artifacts that
-//! weren't used or received a heads up signal for a while.
 
 mod artifacts;
 mod error;
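
The pruning behaviour documented above (drop artifacts that were neither used nor signaled in a heads up for a while) might look roughly like the following. This is a sketch under stated assumptions: `ArtifactTable`, `touch`, `prune`, and the string artifact ids are hypothetical names for illustration, and the real table also has to delete the files on disk.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory table: artifact id -> time it was last used or
/// last received a heads up signal. It holds no artifact contents.
#[derive(Default)]
pub struct ArtifactTable {
    last_needed: HashMap<String, Instant>,
}

impl ArtifactTable {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record that the artifact was used or signaled at `now`.
    pub fn touch(&mut self, artifact_id: &str, now: Instant) {
        self.last_needed.insert(artifact_id.to_string(), now);
    }

    /// Number of artifacts currently tracked.
    pub fn len(&self) -> usize {
        self.last_needed.len()
    }

    /// Remove every artifact whose last use is older than `ttl` and return
    /// the removed ids (sorted), so the caller can delete the files on disk.
    pub fn prune(&mut self, now: Instant, ttl: Duration) -> Vec<String> {
        let mut stale: Vec<String> = self
            .last_needed
            .iter()
            .filter(|(_, last)| now.saturating_duration_since(**last) > ttl)
            .map(|(id, _)| id.clone())
            .collect();
        stale.sort();
        for id in &stale {
            self.last_needed.remove(id);
        }
        stale
    }
}
```

A periodic task would call `prune` at its fixed interval. Passing `now` in explicitly keeps the sketch testable without sleeping.
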

diff --git a/node/core/pvf/src/priority.rs b/node/core/pvf/src/priority.rs
@@ -24,7 +24,7 @@ pub enum Priority {
 	Normal,
 	/// This priority is used for requests that are required to be processed as soon as possible.
 	///
-	/// For example, backing is on critical path and require execution as soon as possible.
+	/// For example, backing is on a critical path and requires execution as soon as possible.
 	Critical,
 }
 

diff --git a/node/network/dispute-distribution/src/receiver/mod.rs b/node/network/dispute-distribution/src/receiver/mod.rs
@@ -302,6 +302,12 @@ where
 
 		// Queue request:
 		if let Err((authority_id, req)) = self.peer_queues.push_req(authority_id, req) {
+			gum::debug!(
+				target: LOG_TARGET,
+				?authority_id,
+				?peer,
+				"Peer hit the rate limit - dropping message."
+			);
 			req.send_outgoing_response(OutgoingResponse {
 				result: Err(()),
 				reputation_changes: vec![COST_APPARENT_FLOOD],

diff --git a/node/network/dispute-distribution/src/sender/mod.rs b/node/network/dispute-distribution/src/sender/mod.rs
@@ -108,8 +108,6 @@ impl DisputeSender {
 		runtime: &mut RuntimeInfo,
 		msg: DisputeMessage,
 	) -> Result<()> {
-		self.rate_limit.limit().await;
-
 		let req: DisputeRequest = msg.into();
 		let candidate_hash = req.0.candidate_receipt.hash();
 		match self.disputes.entry(candidate_hash) {
@@ -118,6 +116,8 @@ impl DisputeSender {
 				return Ok(())
 			},
 			Entry::Vacant(vacant) => {
+				self.rate_limit.limit("in start_sender", candidate_hash).await;
+
 				let send_task = SendTask::new(
 					ctx,
 					runtime,
@@ -169,10 +169,12 @@ impl DisputeSender {
 
 		// Iterates in order of insertion:
 		let mut should_rate_limit = true;
-		for dispute in self.disputes.values_mut() {
+		for (candidate_hash, dispute) in self.disputes.iter_mut() {
 			if have_new_sessions || dispute.has_failed_sends() {
 				if should_rate_limit {
-					self.rate_limit.limit().await;
+					self.rate_limit
+						.limit("while going through new sessions/failed sends", *candidate_hash)
+						.await;
 				}
 				let sends_happened = dispute
 					.refresh_sends(ctx, runtime, &self.active_sessions, &self.metrics)
@@ -193,7 +195,7 @@ impl DisputeSender {
 		// recovered at startup will be relatively "old" anyway and we assume that no more than a
 		// third of the validators will go offline at any point in time anyway.
 		for dispute in unknown_disputes {
-			self.rate_limit.limit().await;
+			self.rate_limit.limit("while going through unknown disputes", dispute.1).await;
 			self.start_send_for_dispute(ctx, runtime, dispute).await?;
 		}
 		Ok(())
@@ -383,14 +385,18 @@ impl RateLimit {
 	}
 
 	/// Wait until ready and prepare for next call.
-	async fn limit(&mut self) {
+	///
+	/// The occasion string and the candidate hash are logged if the rate limit is hit.
+	async fn limit(&mut self, occasion: &'static str, candidate_hash: CandidateHash) {
 		// Wait for rate limit and add some logging:
 		poll_fn(|cx| {
 			let old_limit = Pin::new(&mut self.limit);
 			match old_limit.poll(cx) {
 				Poll::Pending => {
 					gum::debug!(
 						target: LOG_TARGET,
+						?occasion,
+						?candidate_hash,
 						"Sending rate limit hit, slowing down requests"
 					);
 					Poll::Pending
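
To show what the extra `occasion` and `candidate_hash` arguments buy, here is a simplified synchronous sketch of a rate limiter that reports which call site hit the limit. It is not the crate's implementation: the real `RateLimit::limit` is async and polls a timer, whereas this version takes the current time as a parameter, uses a plain `u64` in place of `CandidateHash`, and returns the log line instead of emitting it.

```rust
use std::time::{Duration, Instant};

/// Simplified stand-in for the sender's rate limiter (an assumption of this
/// sketch, not the real type).
pub struct RateLimit {
    interval: Duration,
    next_ready: Instant,
}

impl RateLimit {
    pub fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, next_ready: now }
    }

    /// Reserve the next send slot. Returns `None` if the caller may send
    /// immediately, or `Some(log_line)` naming the occasion and candidate
    /// when the caller would have to wait for the limit.
    pub fn limit(&mut self, occasion: &'static str, candidate: u64, now: Instant) -> Option<String> {
        // Time still to wait; zero if the slot is already free.
        let wait = self.next_ready.saturating_duration_since(now);
        // The slot after this one opens one interval after we become ready.
        self.next_ready = now + wait + self.interval;
        if wait.is_zero() {
            None
        } else {
            Some(format!(
                "Sending rate limit hit, slowing down requests: occasion={:?} candidate={} wait={:?}",
                occasion, candidate, wait
            ))
        }
    }
}
```

With the occasion in the message, a log reader can tell a limit hit in `start_sender` apart from one hit while iterating unknown disputes. The diff above also moves the call into the `Entry::Vacant` branch, so a dispute that is already being sent no longer consumes a slot.
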

diff --git a/node/service/chain-specs/kusama.json b/node/service/chain-specs/kusama.json
@@ -23,7 +23,9 @@
     "/dns/boot.stake.plus/tcp/31333/p2p/12D3KooWLa1UyG5xLPds2GbiRBCTJjpsVwRWHWN7Dff14yiNJRpR",
     "/dns/boot.stake.plus/tcp/31334/wss/p2p/12D3KooWLa1UyG5xLPds2GbiRBCTJjpsVwRWHWN7Dff14yiNJRpR",
     "/dns/boot-node.helikon.io/tcp/7060/p2p/12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD",
-    "/dns/boot-node.helikon.io/tcp/7062/wss/p2p/12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD"
+    "/dns/boot-node.helikon.io/tcp/7062/wss/p2p/12D3KooWL4KPqfAsPE2aY1g5Zo1CxsDwcdJ7mmAghK7cg6M2fdbD",
+    "/dns/kusama.bootnode.amforc.com/tcp/30333/p2p/12D3KooWLx6nsj6Fpd8biP1VDyuCUjazvRiGWyBam8PsqRJkbUb9",
+    "/dns/kusama.bootnode.amforc.com/tcp/30334/wss/p2p/12D3KooWLx6nsj6Fpd8biP1VDyuCUjazvRiGWyBam8PsqRJkbUb9"
   ],
   "telemetryEndpoints": [
     [

diff --git a/node/service/chain-specs/polkadot.json b/node/service/chain-specs/polkadot.json
@@ -23,7 +23,9 @@
     "/dns/boot.stake.plus/tcp/30333/p2p/12D3KooWKT4ZHNxXH4icMjdrv7EwWBkfbz5duxE5sdJKKeWFYi5n",
     "/dns/boot.stake.plus/tcp/30334/wss/p2p/12D3KooWKT4ZHNxXH4icMjdrv7EwWBkfbz5duxE5sdJKKeWFYi5n",
     "/dns/boot-node.helikon.io/tcp/7070/p2p/12D3KooWS9ZcvRxyzrSf6p63QfTCWs12nLoNKhGux865crgxVA4H",
-    "/dns/boot-node.helikon.io/tcp/7072/wss/p2p/12D3KooWS9ZcvRxyzrSf6p63QfTCWs12nLoNKhGux865crgxVA4H"
+    "/dns/boot-node.helikon.io/tcp/7072/wss/p2p/12D3KooWS9ZcvRxyzrSf6p63QfTCWs12nLoNKhGux865crgxVA4H",
+    "/dns/polkadot.bootnode.amforc.com/tcp/30333/p2p/12D3KooWAsuCEVCzUVUrtib8W82Yne3jgVGhQZN3hizko5FTnDg3",
+    "/dns/polkadot.bootnode.amforc.com/tcp/30334/wss/p2p/12D3KooWAsuCEVCzUVUrtib8W82Yne3jgVGhQZN3hizko5FTnDg3"
   ],
   "telemetryEndpoints": [
     [

diff --git a/roadmap/implementers-guide/README.md b/roadmap/implementers-guide/README.md
@@ -22,6 +22,11 @@ Then install and build the book:
 ```sh
 cargo install mdbook mdbook-linkcheck mdbook-graphviz mdbook-mermaid mdbook-last-changed
 mdbook serve roadmap/implementers-guide
+```
+
+and in a second terminal window run:
+
+```sh
 open http://localhost:3000
 ```
 

diff --git a/roadmap/implementers-guide/src/SUMMARY.md b/roadmap/implementers-guide/src/SUMMARY.md
@@ -75,7 +75,6 @@
     - [Availability](types/availability.md)
     - [Overseer and Subsystem Protocol](types/overseer-protocol.md)
     - [Runtime](types/runtime.md)
-    - [Chain](types/chain.md)
     - [Messages](types/messages.md)
     - [Network](types/network.md)
     - [Approvals](types/approval.md)

diff --git a/roadmap/implementers-guide/src/glossary.md b/roadmap/implementers-guide/src/glossary.md
@@ -47,4 +47,3 @@ exactly one downward message queue.
 Also of use is the [Substrate Glossary](https://substrate.dev/docs/en/knowledgebase/getting-started/glossary).
 
 [0]: https://wiki.polkadot.network/docs/learn-consensus
-[1]: #pvf
diff --git a/roadmap/implementers-guide/src/node/utility/candidate-validation.md b/roadmap/implementers-guide/src/node/utility/candidate-validation.md
@@ -48,4 +48,39 @@ Once we have all parameters, we can spin up a background task to perform the val
 
 If we can assume the presence of the relay-chain state (that is, during processing [`CandidateValidationMessage`][CVM]`::ValidateFromChainState`) we can run all the checks that the relay-chain would run at the inclusion time thus confirming that the candidate will be accepted.
 
+### PVF Host
+
+The PVF host is responsible for handling requests to prepare and execute PVF
+code blobs.
+
+One high-level goal is to make PVF operations as deterministic as possible, to
+reduce the rate of disputes. Disputes can happen due to e.g. a job timing out on
+one machine, but not another. While we do not yet have full determinism, there
+are some dispute reduction mechanisms in place right now.
+
+#### Retrying execution requests
+
+If the execution request fails during **preparation**, we will retry if it is
+possible that the preparation error was transient (e.g. if the error was a panic
+or time out). We will only retry preparation if another request comes in after
+15 minutes, to ensure any potential transient conditions had time to be
+resolved. We will retry up to 5 times.
+
+If the actual **execution** of the artifact fails, we will retry once if it was
+an ambiguous error after a brief delay, to allow any potential transient
+conditions to clear.
+
+#### Preparation timeouts
+
+We use timeouts for both preparation and execution jobs to limit the amount of
+time they can take. As the time for a job can vary depending on the machine and
+load on the machine, this can potentially lead to disputes where some validators
+successfully execute a PVF and others don't.
+
+One mitigation we have in place is a more lenient timeout for preparation during
+execution than during pre-checking. The rationale is that the PVF has already
+passed pre-checking, so we know it should be valid, and we allow it to take
+longer than expected, as this is likely due to an issue with the machine and not
+the PVF.
+
 [CVM]: ../../types/overseer-protocol.md#validationrequesttype
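
The preparation retry rule added to the guide (retry only transient errors, only after 15 minutes, at most 5 times) can be written down as a small predicate. This is a sketch: the `PrepareError` variants, the constant names, and `should_retry_preparation` are hypothetical and chosen for illustration; only the 15 minute cooldown and the limit of 5 retries come from the text.

```rust
use std::time::Duration;

/// Cooldown before a failed preparation may be retried (from the guide text).
pub const PREPARE_RETRY_COOLDOWN: Duration = Duration::from_secs(15 * 60);
/// Maximum number of preparation retries (from the guide text).
pub const MAX_PREPARE_RETRIES: u32 = 5;

/// Hypothetical classification of preparation failures.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrepareError {
    /// The worker panicked; possibly transient.
    Panic,
    /// The job timed out; possibly transient.
    TimedOut,
    /// The code itself is invalid; retrying cannot help.
    Invalid,
}

impl PrepareError {
    fn is_transient(self) -> bool {
        matches!(self, PrepareError::Panic | PrepareError::TimedOut)
    }
}

/// Decide whether a new request for a previously failed PVF should trigger
/// another preparation attempt.
pub fn should_retry_preparation(
    error: PrepareError,
    since_failure: Duration,
    retries_so_far: u32,
) -> bool {
    error.is_transient()
        && since_failure >= PREPARE_RETRY_COOLDOWN
        && retries_so_far < MAX_PREPARE_RETRIES
}
```

Note that the check only runs when another request comes in; nothing in this sketch schedules a retry on its own, which mirrors the wording "we will only retry preparation if another request comes in after 15 minutes".
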
diff --git a/roadmap/implementers-guide/src/node/utility/pvf-prechecker.md b/roadmap/implementers-guide/src/node/utility/pvf-prechecker.md
@@ -12,11 +12,11 @@ This subsytem does not produce any output messages either. The subsystem will, h
 
 If the node is running in a collator mode, this subsystem will be disabled. The PVF pre-checker subsystem keeps track of the PVFs that are relevant for the subsystem. 
 
-To be relevant for the subsystem, a PVF must be returned by `pvfs_require_precheck` [`pvfs_require_precheck` runtime API][PVF pre-checking runtime API] in any of the active leaves. If the PVF is not present in any of the active leaves, it ceases to be relevant.
+To be relevant for the subsystem, a PVF must be returned by the [`pvfs_require_precheck` runtime API][PVF pre-checking runtime API] in any of the active leaves. If the PVF is not present in any of the active leaves, it ceases to be relevant.
 
 When a PVF just becomes relevant, the subsystem will send a message to the [Candidate Validation] subsystem asking for the pre-check.
 
-Upon receving a message from the candidate-validation subsystem, the pre-checker will note down that the PVF has its judgement and will also sign and submit a [`PvfCheckStatement`] via the [`submit_pvf_check_statement` runtime API][PVF pre-checking runtime API]. In case, a judgement was received for a PVF that is no longer in view it is ignored. It is possible that the candidate validation was not able to check the PVF. In that case, the PVF pre-checker will abstain and won't submit any check statements.
+Upon receving a message from the candidate-validation subsystem, the pre-checker will note down that the PVF has its judgement and will also sign and submit a [`PvfCheckStatement`][PvfCheckStatement] via the [`submit_pvf_check_statement` runtime API][PVF pre-checking runtime API]. In case, a judgement was received for a PVF that is no longer in view it is ignored. It is possible that the candidate validation was not able to check the PVF. In that case, the PVF pre-checker will abstain and won't submit any check statements.
 
 Since a vote is only valid during [one session][overview], the subsystem will have to re-sign and submit the statements for the new session. The new session is assumed to be started if at least one of the leaves has a greater session index than was previously observed in any of the leaves.
 
@@ -28,4 +28,4 @@ If the node is not in the active validator set, it will still perform all the ch
 [Runtime API]: runtime-api.md
 [PVF pre-checking runtime API]: ../../runtime-api/pvf-prechecking.md
 [Candidate Validation]: candidate-validation.md
-[`PvfCheckStatement`]: ../../types/pvf-prechecking.md
+[PvfCheckStatement]: ../../types/pvf-prechecking.md#pvfcheckstatement
diff --git a/roadmap/implementers-guide/src/types/candidate.md b/roadmap/implementers-guide/src/types/candidate.md
@@ -92,6 +92,22 @@ struct CandidateDescriptor {
 }
 ```
 
+## `ValidationParams`
+
+```rust
+/// Validation parameters for evaluating the parachain validity function.
+pub struct ValidationParams {
+	/// Previous head-data.
+	pub parent_head: HeadData,
+	/// The collation body.
+	pub block_data: BlockData,
+	/// The current relay-chain block number.
+	pub relay_parent_number: RelayChainBlockNumber,
+	/// The relay-chain block's storage root.
+	pub relay_parent_storage_root: Hash,
+}
+```
+
 ## `PersistedValidationData`
 
 The validation data provides information about how to create the inputs for validation of a candidate. This information is derived from the chain state and will vary from para to para, although some of the fields may be the same for every para.

diff --git a/roadmap/implementers-guide/src/types/chain.md b/roadmap/implementers-guide/src/types/chain.md
diff --git a/roadmap/implementers-guide/src/types/overseer-protocol.md b/roadmap/implementers-guide/src/types/overseer-protocol.md
@@ -681,9 +681,7 @@ enum ProvisionerMessage {
 
 The Runtime API subsystem is responsible for providing an interface to the state of the chain's runtime.
 
-This is fueled by an auxiliary type encapsulating all request types defined in the Runtime API section of the guide.
-
-> To do: link to the Runtime API section. Not possible currently because of https://github.com/Michael-F-Bryan/mdbook-linkcheck/issues/25. Once v0.7.1 is released it will work.
+This is fueled by an auxiliary type encapsulating all request types defined in the [Runtime API section](../runtime-api) of the guide.
 
 ```rust
 enum RuntimeApiRequest {

diff --git a/roadmap/implementers-guide/src/types/pvf-prechecking.md b/roadmap/implementers-guide/src/types/pvf-prechecking.md
@@ -1,5 +1,7 @@
 # PVF Pre-checking types
 
+## `PvfCheckStatement`
+
 > ⚠️ This type was added in v2.
 
 One of the main units of information on which PVF pre-checking voting is built is the `PvfCheckStatement`.

diff --git a/scripts/ci/gitlab/lingua.dic b/scripts/ci/gitlab/lingua.dic
@@ -209,6 +209,7 @@ preconfigured
 preimage/MS
 preopen
 prepend/G
+prevalidating
 prevalidation
 preverify/G
 programmatically

diff --git a/scripts/ci/gitlab/pipeline/build.yml b/scripts/ci/gitlab/pipeline/build.yml
@@ -171,14 +171,12 @@ build-implementers-guide:
   # git depth is set on purpose: https://github.com/paritytech/polkadot/issues/6284
   variables:
     GIT_DEPTH:                     0
+    CI_IMAGE:                      paritytech/mdbook-utils:e14aae4a-20221123
   script:
-    - apt-get -y update; apt-get install -y graphviz
-    - cargo install mdbook mdbook-mermaid mdbook-linkcheck mdbook-graphviz mdbook-last-changed
     - mdbook build ./roadmap/implementers-guide
     - mkdir -p artifacts
     - mv roadmap/implementers-guide/book artifacts/
-    # FIXME: remove me after CI image gets nonroot
-    - chown -R nonroot:nonroot artifacts/
+    - ls -la artifacts/
 
 build-short-benchmark:
   stage:                           build