
Offchain runtime upgrades #102

Draft · wants to merge 6 commits into base: main

Conversation

@eskimor (Author) commented Jul 13, 2024

One step closer to making a reduction of PVF storage deposits feasible, and to generally improving performance and reliability for parachains.

### Introduce a new UMP message type `RequestCodeUpgrade`

As part of elastic scaling we are already planning to increase the flexibility of [UMP messages](https://github.com/polkadot-fellows/RFCs/issues/92#issuecomment-2144538974); we can use this to our advantage and introduce another UMP message:
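A rough, purely illustrative sketch of what such a message could look like (the type name, fields, and encoding are assumptions; the final design is left to the upcoming UMP RFC):

```rust
/// Illustrative only: an upward message asking the relay chain to schedule
/// a code upgrade identified by hash. The code itself stays off-chain; the
/// relay chain never stores the full blob.
pub struct RequestCodeUpgrade {
    /// Hash of the new validation code to upgrade to.
    pub code_hash: [u8; 32],
}
```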
Contributor:

"We just need this hack for one thing and will not use it for anything else" ;)

Contributor:

It does indeed feel like it is creeping out into other features/changes like this one, but it offers a lot of advantages in the short term. I would not call it a hack, but more of a generalisation of the UMP queue. The alternative is PVF versioning, which I believe is the long-term solution that we'll likely develop in 2025.

Contributor:

I mean, for CoreIndex it is clearly a hack. RequestCodeUpgrade is an actual message that is sort of fine to be passed here. However, this brings up the question: should we add it to XCM? Should we make UMP messages generic, where one variant is XCM and the others are more UMP-related?

Contributor:

We don't want to add it to XCM; instead, we will have a UMP queue separator between regular XCM messages and the possible additional ones for CoreIndex and the RequestCodeUpgrade. I will soon post the RFC which explains the UMP changes in detail.

Contributor:

> We don't want to add it to XCM; instead, we will have a UMP queue separator between regular XCM messages and the possible additional ones for CoreIndex and the RequestCodeUpgrade

I know what the plan was/is. However, this doesn't really invalidate what I said above.

@sandreim (Contributor) Jul 15, 2024:

Ah, I see. I would prefer to make the UMP messages more generic in this case, having two variants: one wrapping XCM and the other the UMPSignal as defined here. That sounds much better than using a separator. If we agree on this I will also update it in #103.
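A minimal sketch of that layout, with hypothetical names (`UMPSignal` is the type proposed in #103; the exact variants shown are assumptions for illustration):

```rust
/// Sketch: generic upward messages with two variants, one wrapping
/// plain XCM and one wrapping relay-chain-interpreted signals.
pub enum UpwardMessage {
    /// A regular XCM message, opaque to the relay chain runtime.
    Xcm(Vec<u8>),
    /// Non-XCM signals interpreted by the relay chain itself.
    Signal(UMPSignal),
}

/// Sketch of the signal side, covering the two uses discussed above.
pub enum UMPSignal {
    /// Core selection, as needed for elastic scaling.
    SelectCore(u32),
    /// Request a runtime upgrade to the code with this hash.
    RequestCodeUpgrade { code_hash: [u8; 32] },
}
```

This avoids the separator: the queue stays a single stream of self-describing messages.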

Change the process of a parachain runtime upgrade to become an off-chain process with regard to the relay chain. Upgrades are still contained in parachain blocks, but will no longer need to end up in relay chain blocks nor in relay chain state.

Yes, off-chain upgrades make sense: I mildly pushed for PVF upgrades to live in parablocks early on, but we decided on upgrades via the relay chain since all validators need the data eventually anyway. It's true however that validator set churn makes off-chain an optimization, and being on-chain incurs extra costs, like repeated downloads.


In case they received the collation via PoV distribution instead of from the
collator itself, they will use the exact same message to fetch from the validator
they got the PoV from.
@burdges Jul 17, 2024:

Why not make the code upgrade simply be the parachain block? Isn't that how Substrate worked from the beginning?

If the code were bigger than a block, then you could incrementally build the PVF in parachain state, and incrementally hash it. Or do some special larger code block type.

Then, on each further candidate from that chain, that counter gets decremented.
Validators which have not yet succeeded in fetching will now try again. This
continues until the counter reaches `0`. From then on, it is mandatory to have the
code in order to sign a `1` in the bitfield.
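A minimal sketch of the countdown bookkeeping this describes, with hypothetical names (where this state lives and how it is persisted is not specified here):

```rust
/// Sketch: per-parachain state while a code upgrade is being fetched.
pub struct PendingCodeUpgrade {
    /// Hash of the new validation code being distributed off-chain.
    pub code_hash: [u8; 32],
    /// Remaining candidates before holding the code becomes mandatory.
    pub countdown: u32,
}

impl PendingCodeUpgrade {
    /// Called for each further included candidate of the requesting chain.
    pub fn on_candidate_included(&mut self) {
        self.countdown = self.countdown.saturating_sub(1);
    }

    /// Once the counter has reached `0`, signing a `1` in the bitfield
    /// requires actually holding the new code.
    pub fn code_mandatory(&self) -> bool {
        self.countdown == 0
    }
}
```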
@burdges Jul 17, 2024:

You've just pushed the availability into the last of these fake blocks here. I guess this works, but I'm not convinced this is better than doing some big-block availability variant:

We'd process the code availability in a single big parachain block, which only provides data but never gets executed. This takes as long as it takes, maybe running at some lower priority. It occupies the availability core for that whole time, exactly like this scheme does.

After that runs, we have the code available on chain, so everyone must fetch it and build the artifact. We must delay the PVF upgrade becoming usable until those builds succeed, which could be done either by a second fake parablock type, or else by some message of the sort discussed here.

Validators in availability distribution will be changed to only sign a `1` in
the bitfield of a candidate if they not only have the chunk, but also the
currently active PVF. They will fetch it from backers in case they don't have it
yet.
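As a sketch, the changed bitfield rule could look like this (the helper names are made up; the real checks would query the node's availability store and PVF host):

```rust
/// Sketch: decide the bitfield bit for one candidate. We sign a `1` only
/// if we hold both our chunk and the parachain's currently active PVF.
pub fn bitfield_bit(
    has_chunk: bool,
    has_active_pvf: bool,
    fetch_pvf_from_backers: impl FnOnce(),
) -> bool {
    if has_chunk && !has_active_pvf {
        // Kick off a PVF fetch from the backers so we can sign next time;
        // for this round the bit stays `0`.
        fetch_pvf_from_backers();
    }
    has_chunk && has_active_pvf
}
```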

Yeah this makes sense regardless.

But the majority of validators should always keep the latest code of any
parachain and only prune the previous one once the first candidate using the
new code has been finalized. This ensures that disputes will always be able to
resolve.

Yeah, this is an improvement here; previously I'd envisioned parachains doing code re-uploads once per day, just so the code stays in availability.


1. They received a collation sending `RequestCodeUpgrade`.
2. They received a collation, but they don't yet have the code that was
previously registered on the relay chain (e.g. disk pruned, new validator).
Contributor:

Is it still feasible to prepare PVFs in advance (when a node becomes a validator in the next session)?


1. Fetching can happen over a longer period of time with low priority. E.g. if
we waited for the PVF at the very first availability distribution, this might
actually affect liveness of other chains on the same core. Distributing
Contributor:

Don't we still starve the next parachain if the inclusion is delayed until the code has been fetched by 2/3 of validators? I mean, if we treat these as low priority this can be an issue.

@eskimor (Author):

That's why we have a configurable number of parachain blocks to do the fetching. If we ever run into availability problems we can:

  1. Increase the number of blocks we allow for fetching the PVF.
  2. Limit the number of runtime upgrades we are willing to do in a timespan and add priority fees (already planned) to requests, to secure a spot in case of competition.

Note however that right now we distribute those upgrades twice within a single relay chain slot, once via statement distribution and then again via the relay chain block. In the new scheme, if we set the number of required parachain blocks to 10, we reduce the pressure 20-fold. Thus I doubt it will be a problem in practice, and if it ever were, we have means to fix it.

order to sign a `1` in the bitfield.

PVF pre-checking will happen after the candidate which brought the counter to
`0` has been successfully included and thus is also able to assume that 2/3 of
Contributor:

Is there an expiry by which the parachain needs to reach 0, after which the code upgrade is dropped?

@eskimor (Author):

Good point. Will add a section.

@eskimor (Author) commented Jul 17, 2024

For fees with this proposal, given that storage cost is now essentially limited to validator disk space, we should be able to bring down deposit costs significantly. E.g. if we assume that 1 TB costs 100 Euro and such a disk lives for 3 years, we have yearly costs of 33 Euro (all very rough). Everything we store is stored on a thousand validators, thus 1 MB of data needs 1 GB of storage network-wide. This means yearly storage costs of roughly 3.3 cents for a 1 MB PVF, so a full-blown PVF of 5 MB costs roughly 17 cents. With 10% staking rewards, this means we would only need to lock up tokens worth roughly 2 Euro. Even if we made it 10x or even 100x that, we would still be way lower than the current on-chain storage deposit.

Obviously this is just a back-of-the-envelope calculation, but even assuming I missed a few cost factors (like electricity, ...), going 10x my calculation would still be cheap (20 Euro worth of tokens locked).
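Spelling the numbers out (same rough assumptions as in the comment above, nothing more):

```rust
fn main() {
    // Assumptions from the comment: a 1 TB disk costs 100 EUR, lives 3 years.
    let yearly_cost_per_tb = 100.0 / 3.0; // ~33 EUR/year
    // 1 TB = 1_000_000 MB; everything is replicated on ~1000 validators,
    // so 1 MB of PVF occupies ~1 GB network-wide.
    let yearly_cost_per_mb = yearly_cost_per_tb / 1_000_000.0 * 1000.0; // ~0.033 EUR

    let pvf_cost = yearly_cost_per_mb * 5.0; // 5 MB PVF: ~0.17 EUR/year
    let deposit = pvf_cost / 0.10; // covered by 10% staking rewards: ~1.7 EUR

    println!("per MB: {yearly_cost_per_mb:.3} EUR/y, 5 MB PVF: {pvf_cost:.2} EUR/y, deposit: ~{deposit:.1} EUR");
}
```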

I love it and I will extend the RFC a bit more to at least have everything prepared for smart contract storage.

@bkchr (Contributor) commented Jul 17, 2024

> For fees with this proposal, given that storage cost is now essentially limited to validator disk space, we should be able to bring down deposit costs significantly.

Even before this, the deposit already took validators into account. This is a decentralized network and you have no control over how many nodes are running, i.e. how many copies exist. Thus, the previous model also could not include any kind of storage costs from random nodes in the network.

Your biggest argument last week was also the costs for compiling the code. I don't see how this RFC changes the cost for compiling the PVFs.

@eskimor (Author) commented Jul 18, 2024

> Your biggest argument last week was also the costs for compiling the code. I don't see how this RFC changes the cost for compiling the PVFs.

The biggest concern was actually the blockspace used on the relay chain, which this proposal fully solves. For preparation, indeed nothing changed. The best solution to that problem would be PolkaVM. Until we have that or some other solution, you indeed brought up a good argument for not going that low with fees. Although we should probably differentiate between storage used for the PVF, which needs to be prepared, and additional storage offered (e.g. for smart contracts), which doesn't impose a cost on PVF compilation. 🤔

@burdges commented Jul 18, 2024

All my above comments can be summarized like this:

Why is this availability voting countdown hack better than simply occupying one availability core for longer?

We're not going to starve the system of cores, of course. A priori, we do not really care how long an availability core stays occupied, since they never delay finality.

Are you worried there are parablocks which must be assigned to one particular core?

If this were the concern, then we could solve it in other ways, some of which may be more "orthogonal" in some sense. We could have a "code upgrade" system parachain into which all parachains post their code. It'd be "virtual" in that it has no state, no collators, and no PVF of its own, but it takes arbitrarily large blocks.

You want this countdown for billing perhaps? I'd buy that reasoning; not much point having a whole separate billing system.

@burdges commented Jul 18, 2024

I noticed "chunk" twice in this document. If you envision ever doing reconstruction from erasure-coded data, then you need approval checkers who check the erasure coding, otherwise someone could replace some chunks with garbage.

Instead, you could have some notion of a mirroring/code core, or state of an availability core, in which validators only sign the bit once they've fetched the whole data block. This saves some nodes re-encoding the PVF, since everyone wants the PVF eventually anyway.

@eskimor (Author) commented Jul 18, 2024

> Why is this availability voting countdown hack better than simply occupying one availability core for longer?

Because that would affect the block times of that parachain, and if it was sharing a core, even the block times of other chains on that core.

The counter is just an easy solution to:

  1. Give it more time than normal availability - so we can re-use availability for this, without introducing another protocol.
  2. Indeed, it is also an easy way to do some billing. Not perfect, but apt for the purpose. Updating your runtime affects all validators and is therefore rather costly; hence the worst case of wasting a bit of coretime on empty blocks seems fine (low-volume, on-demand chains), and we charge for it anyway.

(1) is more important. My biggest concern is usability issues, but that should be fine as well with good documentation and emitting events about the counter state.

Virtual cores are an interesting idea, although I think this actually adds more complexity, both to code and to cognitive load. It would be complexity we don't need to expose though, so maybe it's good. Will think about it.

In fact I plan on using the coretime chain for the initial upload of the PVF (parachain registration).

> I noticed "chunk" twice in this document. If you envision ever doing reconstruction from erasure-coded data, then you need approval checkers who check the erasure coding, otherwise someone could replace some chunks with garbage.

I don't think it makes sense to chunk the data given that all validators need the full data anyway.

continues until the counter reaches `0`. From then on, it is mandatory to have the
code in order to sign a `1` in the bitfield.

PVF pre-checking will happen after the candidate which brought the counter to
@eskimor (Author):

Question: Do we need to use availability bitfields here or can we rely on pre-checking only?

@eskimor (Author):

Bitfields offer the advantage that we have an incentive for backers (at least for the last one), and they avoid having to impose the work of pre-checking before the "attacker" has paid their bill (produced enough blocks).

@eskimor (Author):

Other things to consider:

  1. To remove the wart for on-demand chains of having to produce n blocks, we could introduce a backwards-compatible "fast-track" fee. With this, you either produce n blocks (backwards compatible with existing chains) or you pay the fast-track fee, which removes this restriction and also removes the 2-session delay: we just have pre-checking only succeed if either those two sessions have passed, or validators have seen the including block finalized and the fast-track fee has been paid. Backers can then be incentivized to provide the code by getting a cut of that fast-track fee iff pre-checking succeeds, which will obviously only happen if validators were able to fetch the code.

  2. We could have a stop-gap solution until we go fully off-chain by doing the following:
    2.1. Introduce a requirement for being eligible for a runtime upgrade: having produced n blocks since the last one, with n being something like 1000. This will hardly be noticeable by existing chains (backwards compatible), but will rate-limit upgrades for on-demand chains and ramp up the cost. The effective rate limit with x cores available is x/n upgrades per relay chain block; so with 100 cores and n = 1000, this would be 1/10, meaning that by fully utilizing 100 cores someone could trigger a runtime upgrade every 10 relay chain blocks, causing 10% service degradation in the worst case (see the sketch after this list). We can do even better by either fully implementing this RFC or by increasing n further.
    2.2. Have the above fast-track fee to cater to legit on-demand chains and also to allow for secure fast-tracking of upgrades in general.
    2.3. A relay chain block containing a candidate which contains a runtime upgrade is illegal if the parachain has not produced n blocks and is not paying the fast-track fee. Note: this might be problematic, as depending on n it might no longer be that backwards compatible, and more importantly a parachain could end up permanently DoSing itself.
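The rate-limit arithmetic from 2.1, spelled out under the assumption of one parachain block per core per relay chain block:

```rust
/// Worst-case runtime upgrades per relay chain block when an attacker
/// fully utilizes `cores` cores and `n` blocks are required per upgrade.
fn upgrades_per_relay_block(cores: u32, n: u32) -> f64 {
    cores as f64 / n as f64
}

fn main() {
    let rate = upgrades_per_relay_block(100, 1000); // 0.1
    // One upgrade every 10 relay chain blocks, i.e. at worst 10% of relay
    // chain blocks carry a runtime upgrade.
    println!("{rate} upgrades/block -> one every {} blocks", 1.0 / rate);
}
```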
