Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Implement PVF pre-checking using pending ExecutorParams #7139

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

s0me0ne-unkn0wn
Copy link
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn commented Apr 26, 2023

Closes paritytech/polkadot-sdk#694 (at least, I hope so).

Configuration changes are postponed for two sessions, and enacting a pre-checked PVF is also delayed for two sessions. So if we use executor parameters from the pending configuration to pre-check PVFs, we ensure the pending changes would not render a PVF just enacted unusable.

Cumulus companion: paritytech/cumulus#2499

@s0me0ne-unkn0wn s0me0ne-unkn0wn added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D5-nicetohaveaudit ⚠️ PR contains trivial changes to logic that should be properly reviewed. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels Apr 26, 2023
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document this in the impl guide?

Also, will we eventually have some mechanism to re-do pre-checking when pending exec params are queued? Because say some params are queued one session after we do pre-checking for the first time - that only gives us a window of one session to do any re-checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some docs to the implementer's guide. Regarding the situation you're describing, I hope that's not a valid concern. Say we pre-check PVF in session N and configuration changes are scheduled in session N+1. So, the PVF will be enacted in session N+2, and the configuration will change in session N+3. That means we still need to check the PVF against the session N+2 configuration as we want it to be active during that session. Could the configuration changes in session N+3 render it unusable? I hope not. As @bkchr mentioned, we should carefully evaluate any executor parameters change against every PVF which is on-chain before applying them, and I believe we'll have some tooling for that in the near future. And the PVF in question is already on-chain at session N, although it's not enacted yet, and we still have to check the configuration changes against it, too.

Something that bothers me more than that is the race condition inside a single session. Say, PVF is submitted to upgrade at block 10, and validators start pre-check voting at block 11, and the configuration change is scheduled at block 15, everything is inside a single session. In that case, we really can end up with a PVF which is enacted in session N+2 but cannot be executed due to configuration changes. It's a super corner case, but nothing is impossible. That risk should be evaluated, and if proved to be a valid vector of attack we should do something to mitigate it, but I didn't think much about it yet. Purely intuitively, I don't like the idea of re-pre-checking. It seems to me the solution could be too heavyweight for such a corner case. Probably there are some simpler ways like prohibiting configuration changes in the session where any PVF upgrades are submitted or something like that (I'm just thinking aloud, not saying it's a valid solution).

@s0me0ne-unkn0wn
Copy link
Contributor Author

Oh, wait a bit; if the new node software is running an old runtime, the pre-checking will always fail. Let me rework it a little.

@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable
Logs: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2746553

Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look right. pending params are not session buffered, but can change anytime. Hence if we have multiple changes within a session, we are checking with something, but not with what will eventually be enacted. Also this seems really racy/brittle: Relying on pending changes to be enacted the same time as the PVF.

The argument raised by @bkchr that we should only ever increase limits or otherwise make sure we are not breaking things (which includes PVFs that are only about to be upgraded) should be good enough.

Is there any parameter right now, where we might fear this would not be possible in all cases? If so we could think of banning runtime upgrades for the change period for example - at least optionally.

Err(RuntimeRequestFailed::NotSupported) => {
// Old runtime that doesn't support retrieving pending configuration. Use the
// current configuration as a backward compatibility measure.
// TODO: Remove this after all the networks are upgraded to runtimes supporting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todos should reference a ticket.

node/core/candidate-validation/src/lib.rs Show resolved Hide resolved
@s0me0ne-unkn0wn
Copy link
Contributor Author

This does not look right. pending params are not session buffered, but can change anytime. Hence if we have multiple changes within a session, we are checking with something, but not with what will eventually be enacted. Also this seems really racy/brittle: Relying on pending changes to be enacted the same time as the PVF.

Yes, there's a race inside a single session (I summarized that above), but I believe it's not that brittle, pending configuration changes are always applied on session start, as well as PVFs are enacted, so they are still in sync. I hope to address that single-session race in a follow-up when I figure out how to do that the right way.

The argument raised by @bkchr that we should only ever increase limits or otherwise make sure we are not breaking things (which includes PVFs that are only about to be upgraded) should be good enough.

Is there any parameter right now, where we might fear this would not be possible in all cases? If so we could think of banning runtime upgrades for the change period for example - at least optionally.

Afair, that argument was raised when we only had things like max. memory and stack limits in the executor params, and it was a perfectly valid point. Now, when we also have timeouts there, they may become the params we may want to decrease someday. At least I think so, correct me if I'm wrong.

@eskimor
Copy link
Member

eskimor commented May 3, 2023

What I meant with brittle is that the enforcement at the same time (of new PVFs and Execution Env Parameters) is rather coincidental than on purpose - right? If one of those delays gets changed for any reason, the assumptions made here break without anyone noticing. I think this is what bothers me the most. The race goes into a similar vein - I think we should be more explicit here. E.g. encode the actual intent more directly:

"We assume any already existing PVF has been checked to work with the new parameters. Any new PVF that is registered in the session where new parameters are set will be prechecked with those new parameter and therefore also only enacted once they are active."

If I understood correctly, we are buffered for two session. If that's true, than one piece of the puzzle could be that we delay prechecking by one session. This would avoid the race.

@s0me0ne-unkn0wn
Copy link
Contributor Author

What I meant with brittle is that the enforcement at the same time (of new PVFs and Execution Env Parameters) is rather coincidental than on purpose - right? If one of those delays gets changed for any reason, the assumptions made here break without anyone noticing. I think this is what bothers me the most.

I was thinking about that, too. I had some considerations about why that's not really a problem, but now, re-thinking that once more, I realize they weren't sound. Well, we can either document that those two must be equal, or just do a small refactoring to use a single constant for them to force them to be always equal (and also document why that's needed).

The race goes into a similar vein - I think we should be more explicit here. E.g. encode the actual intent more directly:

"We assume any already existing PVF has been checked to work with the new parameters. Any new PVF that is registered in the session where new parameters are set will be prechecked with those new parameter and therefore also only enacted once they are active."

If I understood correctly, we are buffered for two session. If that's true, than one piece of the puzzle could be that we delay prechecking by one session. This would avoid the race.

Sounds like a good idea! I'll give it a try, ty!

@eskimor
Copy link
Member

eskimor commented May 5, 2023

The shared constant with documentation would already be pretty good. Maybe even add a test, with a comment saying: "This test exists so that this is not accidentally changed - read the documentation at CONSTANT_NAME"

Then to break this one has to ignore documentation at two places + adjust a test.

@s0me0ne-unkn0wn
Copy link
Contributor Author

Just a quick update: I'm exploring the code, and it seems like the overall idea of pre-checking with the pending executor params is screwed up as the documentation about the PVF upgrade process is somewhat misleading (it's not wrong, just not clear enough).

The configuration update delay and the parachain onboarding delay are indeed derived from the single constant SESSION_DELAY. However, configuration is using shared::scheduled_session(), which just adds SESSION_DELAY to the current session index, and the PVF onboarding uses other logic which I cannot say I understand despite well-commented code.

But the PVF upgrade process does not follow that delay at all, it uses its own on-chain-configurable delay. I checked the current Polkadot activeConfig, and it's only 600 blocks, that is one hour.

So the following steps are:

  1. Figure out what is actual logic for determining delays in cases of onboarding and upgrade
  2. Fix the documentation accordingly to make it clear that the delays are different and how they are calculated
  3. Figure out if pre-checking with the pending config is still helpful or not (maybe in case of initial onboarding only?)

@mrcnski
Copy link
Contributor

mrcnski commented May 22, 2023

the PVF onboarding uses other logic which I cannot say I understand despite well-commented code.

I think this is saying that the delay takes into account how long the pre-checking already took. But the delay must always be at least 1 from now. Looks like that's what the code does too.

		// we should onboard only after `SESSION_DELAY` sessions but we should take
		// into account the number of sessions the PVF pre-checking occupied.
		//
		// we cannot onboard at the current session, so it must be at least one
		// session ahead.

I don't know what the reasoning for this is though. Maybe you can get the original PR from the blame and see if there's more info there.

@eskimor
Copy link
Member

eskimor commented May 23, 2023

Yeah we should get the reasoning straight. I am assuming the reasoning for the delay is in both cases probabilistic finality. If so, we should be able to

  1. make them the same.
  2. add documentation explaining the reasoning for the delay and its length.

Related: paritytech/polkadot-sdk#633

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D5-nicetohaveaudit ⚠️ PR contains trivial changes to logic that should be properly reviewed. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
Status: Review in progress
Status: Review in progress
Development

Successfully merging this pull request may close these issues.

Executor parameters change may affect PVF validity and executability retrospectively
4 participants