-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Implement PVF pre-checking using pending ExecutorParams #7139
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document this in the impl guide?
Also, will we eventually have some mechanism to re-do pre-checking when pending exec params are queued? Because say some params are queued one session after we do pre-checking for the first time - that only gives us a window of one session to do any re-checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some docs to the implementer's guide. Regarding the situation you're describing, I hope that's not a valid concern. Say we pre-check PVF in session N and configuration changes are scheduled in session N+1. So, the PVF will be enacted in session N+2, and the configuration will change in session N+3. That means we still need to check the PVF against the session N+2 configuration as we want it to be active during that session. Could the configuration changes in session N+3 render it unusable? I hope not. As @bkchr mentioned, we should carefully evaluate any executor parameters change against every PVF which is on-chain before applying them, and I believe we'll have some tooling for that in the near future. And the PVF in question is already on-chain at session N, although it's not enacted yet, and we still have to check the configuration changes against it, too.
Something that bothers me more than that is the race condition inside a single session. Say, PVF is submitted to upgrade at block 10, and validators start pre-check voting at block 11, and the configuration change is scheduled at block 15, everything is inside a single session. In that case, we really can end up with a PVF which is enacted in session N+2 but cannot be executed due to configuration changes. It's a super corner case, but nothing is impossible. That risk should be evaluated, and if proved to be a valid vector of attack we should do something to mitigate it, but I didn't think much about it yet. Purely intuitively, I don't like the idea of re-pre-checking. It seems to me the solution could be too heavyweight for such a corner case. Probably there are some simpler ways like prohibiting configuration changes in the session where any PVF upgrades are submitted or something like that (I'm just thinking aloud, not saying it's a valid solution).
Oh, wait a bit; if the new node software is running an old runtime, the pre-checking will always fail. Let me rework it a little. |
The CI pipeline was cancelled due to failure one of the required jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not look right. pending params are not session buffered, but can change anytime. Hence if we have multiple changes within a session, we are checking with something, but not with what will eventually be enacted. Also this seems really racy/brittle: Relying on pending changes to be enacted the same time as the PVF.
The argument raised by @bkchr that we should only ever increase limits or otherwise make sure we are not breaking things (which includes PVFs that are only about to be upgraded) should be good enough.
Is there any parameter right now, where we might fear this would not be possible in all cases? If so we could think of banning runtime upgrades for the change period for example - at least optionally.
Err(RuntimeRequestFailed::NotSupported) => { | ||
// Old runtime that doesn't support retrieving pending configuration. Use the | ||
// current configuration as a backward compatibility measure. | ||
// TODO: Remove this after all the networks are upgraded to runtimes supporting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todos should reference a ticket.
Yes, there's a race inside a single session (I summarized that above), but I believe it's not that brittle, pending configuration changes are always applied on session start, as well as PVFs are enacted, so they are still in sync. I hope to address that single-session race in a follow-up when I figure out how to do that the right way.
Afair, that argument was raised when we only had things like max. memory and stack limits in the executor params, and it was a perfectly valid point. Now, when we also have timeouts there, they may become the params we may want to decrease someday. At least I think so, correct me if I'm wrong. |
What I meant with brittle is that the enforcement at the same time (of new PVFs and Execution Env Parameters) is rather coincidental than on purpose - right? If one of those delays gets changed for any reason, the assumptions made here break without anyone noticing. I think this is what bothers me the most. The race goes into a similar vein - I think we should be more explicit here. E.g. encode the actual intent more directly: "We assume any already existing PVF has been checked to work with the new parameters. Any new PVF that is registered in the session where new parameters are set will be prechecked with those new parameter and therefore also only enacted once they are active." If I understood correctly, we are buffered for two session. If that's true, than one piece of the puzzle could be that we delay prechecking by one session. This would avoid the race. |
I was thinking about that, too. I had some considerations about why that's not really a problem, but now, re-thinking that once more, I realize they weren't sound. Well, we can either document that those two must be equal, or just do a small refactoring to use a single constant for them to force them to be always equal (and also document why that's needed).
Sounds like a good idea! I'll give it a try, ty! |
The shared constant with documentation would already be pretty good. Maybe even add a test, with a comment saying: "This test exists so that this is not accidentally changed - read the documentation at Then to break this one has to ignore documentation at two places + adjust a test. |
Just a quick update: I'm exploring the code, and it seems like the overall idea of pre-checking with the pending executor params is screwed up as the documentation about the PVF upgrade process is somewhat misleading (it's not wrong, just not clear enough). The configuration update delay and the parachain onboarding delay are indeed derived from the single constant But the PVF upgrade process does not follow that delay at all, it uses its own on-chain-configurable delay. I checked the current Polkadot So the following steps are:
|
I think this is saying that the delay takes into account how long the pre-checking already took. But the delay must always be at least 1 from now. Looks like that's what the code does too. // we should onboard only after `SESSION_DELAY` sessions but we should take
// into account the number of sessions the PVF pre-checking occupied.
//
// we cannot onboard at the current session, so it must be at least one
// session ahead. I don't know what the reasoning for this is though. Maybe you can get the original PR from the blame and see if there's more info there. |
Yeah we should get the reasoning straight. I am assuming the reasoning for the delay is in both cases probabilistic finality. If so, we should be able to
Related: paritytech/polkadot-sdk#633 |
Closes paritytech/polkadot-sdk#694 (at least, I hope so).
Configuration changes are postponed for two sessions, and enacting a pre-checked PVF is also delayed for two sessions. So if we use executor parameters from the pending configuration to pre-check PVFs, we ensure the pending changes would not render a PVF just enacted unusable.
Cumulus companion: paritytech/cumulus#2499