Implement PVF pre-checking using pending ExecutorParams #7139

s0me0ne-unkn0wn · 2023-04-26T19:05:30Z

Closes paritytech/polkadot-sdk#694 (at least, I hope so).

Configuration changes are postponed for two sessions, and enacting a pre-checked PVF is also delayed for two sessions. So if we use executor parameters from the pending configuration to pre-check PVFs, we ensure the pending changes would not render a PVF just enacted unusable.

Cumulus companion: paritytech/cumulus#2499

mrcnski · 2023-04-26T19:31:47Z

node/core/candidate-validation/src/lib.rs

Can we document this in the impl guide?

Also, will we eventually have some mechanism to re-do pre-checking when pending exec params are queued? Because say some params are queued one session after we do pre-checking for the first time - that only gives us a window of one session to do any re-checking.

Added some docs to the implementer's guide. Regarding the situation you're describing, I hope that's not a valid concern. Say we pre-check PVF in session N and configuration changes are scheduled in session N+1. So, the PVF will be enacted in session N+2, and the configuration will change in session N+3. That means we still need to check the PVF against the session N+2 configuration as we want it to be active during that session. Could the configuration changes in session N+3 render it unusable? I hope not. As @bkchr mentioned, we should carefully evaluate any executor parameters change against every PVF which is on-chain before applying them, and I believe we'll have some tooling for that in the near future. And the PVF in question is already on-chain at session N, although it's not enacted yet, and we still have to check the configuration changes against it, too.

Something that bothers me more than that is the race condition inside a single session. Say, PVF is submitted to upgrade at block 10, and validators start pre-check voting at block 11, and the configuration change is scheduled at block 15, everything is inside a single session. In that case, we really can end up with a PVF which is enacted in session N+2 but cannot be executed due to configuration changes. It's a super corner case, but nothing is impossible. That risk should be evaluated, and if proved to be a valid vector of attack we should do something to mitigate it, but I didn't think much about it yet. Purely intuitively, I don't like the idea of re-pre-checking. It seems to me the solution could be too heavyweight for such a corner case. Probably there are some simpler ways like prohibiting configuration changes in the session where any PVF upgrades are submitted or something like that (I'm just thinking aloud, not saying it's a valid solution).

s0me0ne-unkn0wn · 2023-04-27T18:48:38Z

Oh, wait a bit; if the new node software is running an old runtime, the pre-checking will always fail. Let me rework it a little.

paritytech-cicd-pr · 2023-04-27T20:47:06Z

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable
Logs: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2746553

…th-pending-config

eskimor

This does not look right. pending params are not session buffered, but can change anytime. Hence if we have multiple changes within a session, we are checking with something, but not with what will eventually be enacted. Also this seems really racy/brittle: Relying on pending changes to be enacted the same time as the PVF.

The argument raised by @bkchr that we should only ever increase limits or otherwise make sure we are not breaking things (which includes PVFs that are only about to be upgraded) should be good enough.

Is there any parameter right now, where we might fear this would not be possible in all cases? If so we could think of banning runtime upgrades for the change period for example - at least optionally.

eskimor · 2023-05-02T17:00:30Z

node/core/candidate-validation/src/lib.rs

+		Err(RuntimeRequestFailed::NotSupported) => {
+			// Old runtime that doesn't support retrieving pending configuration. Use the
+			// current configuration as a backward compatibility measure.
+			// TODO: Remove this after all the networks are upgraded to runtimes supporting


Todos should reference a ticket.

node/core/candidate-validation/src/lib.rs

s0me0ne-unkn0wn · 2023-05-02T19:10:17Z

This does not look right. pending params are not session buffered, but can change anytime. Hence if we have multiple changes within a session, we are checking with something, but not with what will eventually be enacted. Also this seems really racy/brittle: Relying on pending changes to be enacted the same time as the PVF.

Yes, there's a race inside a single session (I summarized that above), but I believe it's not that brittle, pending configuration changes are always applied on session start, as well as PVFs are enacted, so they are still in sync. I hope to address that single-session race in a follow-up when I figure out how to do that the right way.

The argument raised by @bkchr that we should only ever increase limits or otherwise make sure we are not breaking things (which includes PVFs that are only about to be upgraded) should be good enough.

Is there any parameter right now, where we might fear this would not be possible in all cases? If so we could think of banning runtime upgrades for the change period for example - at least optionally.

Afair, that argument was raised when we only had things like max. memory and stack limits in the executor params, and it was a perfectly valid point. Now, when we also have timeouts there, they may become the params we may want to decrease someday. At least I think so, correct me if I'm wrong.

eskimor · 2023-05-03T15:32:35Z

What I meant with brittle is that the enforcement at the same time (of new PVFs and Execution Env Parameters) is rather coincidental than on purpose - right? If one of those delays gets changed for any reason, the assumptions made here break without anyone noticing. I think this is what bothers me the most. The race goes into a similar vein - I think we should be more explicit here. E.g. encode the actual intent more directly:

"We assume any already existing PVF has been checked to work with the new parameters. Any new PVF that is registered in the session where new parameters are set will be prechecked with those new parameter and therefore also only enacted once they are active."

If I understood correctly, we are buffered for two session. If that's true, than one piece of the puzzle could be that we delay prechecking by one session. This would avoid the race.

s0me0ne-unkn0wn · 2023-05-03T19:46:53Z

What I meant with brittle is that the enforcement at the same time (of new PVFs and Execution Env Parameters) is rather coincidental than on purpose - right? If one of those delays gets changed for any reason, the assumptions made here break without anyone noticing. I think this is what bothers me the most.

I was thinking about that, too. I had some considerations about why that's not really a problem, but now, re-thinking that once more, I realize they weren't sound. Well, we can either document that those two must be equal, or just do a small refactoring to use a single constant for them to force them to be always equal (and also document why that's needed).

The race goes into a similar vein - I think we should be more explicit here. E.g. encode the actual intent more directly:

"We assume any already existing PVF has been checked to work with the new parameters. Any new PVF that is registered in the session where new parameters are set will be prechecked with those new parameter and therefore also only enacted once they are active."

If I understood correctly, we are buffered for two session. If that's true, than one piece of the puzzle could be that we delay prechecking by one session. This would avoid the race.

Sounds like a good idea! I'll give it a try, ty!

eskimor · 2023-05-05T05:35:46Z

The shared constant with documentation would already be pretty good. Maybe even add a test, with a comment saying: "This test exists so that this is not accidentally changed - read the documentation at CONSTANT_NAME"

Then to break this one has to ignore documentation at two places + adjust a test.

s0me0ne-unkn0wn · 2023-05-22T18:16:08Z

Just a quick update: I'm exploring the code, and it seems like the overall idea of pre-checking with the pending executor params is screwed up as the documentation about the PVF upgrade process is somewhat misleading (it's not wrong, just not clear enough).

The configuration update delay and the parachain onboarding delay are indeed derived from the single constant SESSION_DELAY. However, configuration is using shared::scheduled_session(), which just adds SESSION_DELAY to the current session index, and the PVF onboarding uses other logic which I cannot say I understand despite well-commented code.

But the PVF upgrade process does not follow that delay at all, it uses its own on-chain-configurable delay. I checked the current Polkadot activeConfig, and it's only 600 blocks, that is one hour.

So the following steps are:

Figure out what is actual logic for determining delays in cases of onboarding and upgrade
Fix the documentation accordingly to make it clear that the delays are different and how they are calculated
Figure out if pre-checking with the pending config is still helpful or not (maybe in case of initial onboarding only?)

mrcnski · 2023-05-22T19:07:03Z

the PVF onboarding uses other logic which I cannot say I understand despite well-commented code.

I think this is saying that the delay takes into account how long the pre-checking already took. But the delay must always be at least 1 from now. Looks like that's what the code does too.

		// we should onboard only after `SESSION_DELAY` sessions but we should take
		// into account the number of sessions the PVF pre-checking occupied.
		//
		// we cannot onboard at the current session, so it must be at least one
		// session ahead.

I don't know what the reasoning for this is though. Maybe you can get the original PR from the blame and see if there's more info there.

eskimor · 2023-05-23T13:04:40Z

Yeah we should get the reasoning straight. I am assuming the reasoning for the delay is in both cases probabilistic finality. If so, we should be able to

make them the same.
add documentation explaining the reasoning for the delay and its length.

Related: paritytech/polkadot-sdk#633

Implement PVF pre-checking using pending ExecutorParams

9ea4f3c

s0me0ne-unkn0wn requested review from tdimitrov, eskimor, mrcnski and sandreim April 26, 2023 19:05

mrcnski suggested changes Apr 26, 2023

View reviewed changes

Add documentations

672ea6c

Add backward compatibility layer

51cf479

mrcnski approved these changes Apr 28, 2023

View reviewed changes

s0me0ne-unkn0wn added 2 commits May 1, 2023 15:15

Fix tests

2752d6d

Merge remote-tracking branch 'origin/master' into s0me0ne/precheck-wi…

d17b555

…th-pending-config

s0me0ne-unkn0wn mentioned this pull request May 1, 2023

Add pending_executor_params() call paritytech/cumulus#2499

Open

eskimor reviewed May 2, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement PVF pre-checking using pending ExecutorParams #7139

Implement PVF pre-checking using pending ExecutorParams #7139

s0me0ne-unkn0wn commented Apr 26, 2023 •

edited

Loading

mrcnski Apr 26, 2023

s0me0ne-unkn0wn Apr 27, 2023

s0me0ne-unkn0wn commented Apr 27, 2023

paritytech-cicd-pr commented Apr 27, 2023

eskimor left a comment

eskimor May 2, 2023

s0me0ne-unkn0wn commented May 2, 2023

eskimor commented May 3, 2023

s0me0ne-unkn0wn commented May 3, 2023

eskimor commented May 5, 2023

s0me0ne-unkn0wn commented May 22, 2023

mrcnski commented May 22, 2023

eskimor commented May 23, 2023

Implement PVF pre-checking using pending ExecutorParams #7139

Are you sure you want to change the base?

Implement PVF pre-checking using pending ExecutorParams #7139

Conversation

s0me0ne-unkn0wn commented Apr 26, 2023 • edited Loading

mrcnski Apr 26, 2023

Choose a reason for hiding this comment

s0me0ne-unkn0wn Apr 27, 2023

Choose a reason for hiding this comment

s0me0ne-unkn0wn commented Apr 27, 2023

paritytech-cicd-pr commented Apr 27, 2023

eskimor left a comment

Choose a reason for hiding this comment

eskimor May 2, 2023

Choose a reason for hiding this comment

s0me0ne-unkn0wn commented May 2, 2023

eskimor commented May 3, 2023

s0me0ne-unkn0wn commented May 3, 2023

eskimor commented May 5, 2023

s0me0ne-unkn0wn commented May 22, 2023

mrcnski commented May 22, 2023

eskimor commented May 23, 2023

s0me0ne-unkn0wn commented Apr 26, 2023 •

edited

Loading