Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF host: Investigate how to better launch worker processes #880

Open
pepyakin opened this issue Jan 14, 2022 · 0 comments
Open

PVF host: Investigate how to better launch worker processes #880

pepyakin opened this issue Jan 14, 2022 · 0 comments

Comments

@pepyakin
Copy link
Contributor

Right now the PVF validation host spawns worker processes for doing compilation of wasm blobs and executing wasm code during validation of candidates.

The processes are spawned by taking the polkadot executable and invoking it via Command with additional parameters that specify the type of the worker and the path to the IPC socket that the worker will receive commands with. At the same time, the polkadot binary provides command-line arguments that would pass the control to the workers if a special worker-type command was passed.

This proved to be an awkward way of doing things. It is one of the issues what prevents us from testing the validation host thoroughly (#2342) and already shot us in the foot (e.g. this).

I did a preliminary research on how to do the things better.

First what comes to mind is fork(2). We fork the process and then pass the control to the worker entrypoint. However, this seems to be rather fragile. The reason is that fork preserves the memory content of the process, it's opened fds and at the same time only the thread that called fork will be running. The implication of that, is that the state of the program in the child preserves all the mutexes and the control flow can be anywhere. One of obvious examples of what can go wrong, is that if another thread called malloc and got the lock then if the thread that called the fork calls malloc as well, then it will get deadlocked. So therefore, only async-signal-safe are callable. Theoretically this may be solved by resuming threads but that's also obviously a bad idea: an example problem is that the same thread will write into the same descriptor in both parent and child process. That can corrupt data and send gibberish into sockets.

vfork is a special case of fork which is typically followed by execve. The latter basically the primitive that is used by Command and requires the filename to execute, which basically brings us to the same problem as the existing solution.

If we step back and think what we actually need here we would notice that we just need to reuse the same executable image but specify a different entrypoint. Ideally, this entrypoint should be represented by a function address and would not require the executable to handle special arguments for invoking the required worker.

On Linux, there is a clone(2) syscall which is basically a pumped up fork. It allows for specifying the entrypoint as a function pointer and it also takes flags to specify what parts of the parent to be carried on to the child process. The parent can either share or not share the address space, can put the child into new namespace, and so on. (In fact, using this syscall is how pthread on Linux is implemented).

If we were to use it it would also allow us to introduce sandboxing of the worker processes, a feature envisioned in the initial design of the overhauled PVF validation host.

So in our case the parent will invoke clone, sharing almost nothing with the child process, passing an entrypoint to the required worker.

However, there is a problem with that. Even though Rust has a minimal runtime, it still has to be initialized. For that, we must make sure that the pre-main stuff is executed as if we were invoking a program but with a different main entrypoint. I am not sure if clone(2) handles that by default or if that should be taken care of manually. In the end, C also has pre-main and from what I have seen in C there is no special fluffing needed. In any case, this should be investigated.

One implication of going this way, is that clone(2) is Linux only syscall with no obvious (to me) alternatives in macOS, and we still want to support that. There is this discussion about that #881

@pepyakin pepyakin changed the title Investigate how to better launch worker processes PVF host: Investigate how to better launch worker processes Jan 14, 2022
mrcnski referenced this issue in paritytech/polkadot Aug 6, 2023
Clearing env vars with the `std::process::Command` API didn't get everything on
Mac, namely `__CF_USER_TEXT_ENCODING` was still present. While we don't support
Mac itself as a secure system, the same issue could exist on some Linux systems
either now or in the future. So it is better to just clear it on the child-side
and not worry about it. We may not use the `Command` API in the future, anyway:
https://github.com/paritytech/polkadot/issues/4721
@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
claravanstaden added a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
* Upgrade ssz_rs crate.

* Upgrade ssz_rs crate.

---------

Co-authored-by: claravanstaden <Cats 4 life!>
claravanstaden added a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
* Improve contracts

* More improvements to contracts

* Fix parachain build in proxy-contracts branch (paritytech#889)

* Upgrade ssz_rs crate. (paritytech#880)

* Upgrade ssz_rs crate.

* Upgrade ssz_rs crate.

---------

Co-authored-by: claravanstaden <Cats 4 life!>

* Bump nixpkgs to use its geth package (paritytech#885)

* Bump nixpkgs to use its geth package

* Use source config in PolkadotListener

* Match field order in struct def

* Move info log to relay creation

* Remove unused channel

* Whitespace

* Add context to errors

* Add logs

* Bump node & pnpm in workflow

* Rename locals

* Remove unused variable

* Add Troubleshooting README section

* Fix up .envrc-example files

* Add note about pure shells

* Update cumulus submodule (paritytech#886)

* Inbound queue benchmarks (paritytech#876)

* Start with inbound channel benchmarks.

* Add method to set execution header storage for benchmark test.

* Working on benchmarks

* Basic working version

* Cleanup

* Removes cleanup.

* Adds some comments for Alistair.

* Adds branch name.

* Makes note

* Test transactions

* Cleaning up beacon client deps.

* Clean up comments.

* Tests cleanup.

* Fixes non-benchmark test runs.

* Cleanup.

* Update fixtures and generates benchmarks.

* Revert relayer logs.

* Cleanup BenchmarkHelper impl and inbound queue dependencies.

* fmt

* Cleanup imports.

* Cleanup imports.

* Touch

* Adds weights in inbound queue pallet.

* Fix tests.

* Update cumulus.

---------

Co-authored-by: claravanstaden <Cats 4 life!>

* Fix parachain build

* Move BalanceOf outside of pallet

* remove benchmark for non existing method

* downgrade cargo.lock to match cumulus

* fix benchmarks

---------

Co-authored-by: Clara van Staden <[email protected]>
Co-authored-by: Alistair Singh <[email protected]>

* forge install: openzeppelin-contracts

v4.9.2

* sno-472 rebased (paritytech#890)

* Halting & resuming bridge pallets

* Ignore .env

* Remove .env

* Some polish

* Set owner of bridge pallets

* Upgrade ssz_rs crate. (paritytech#880)

* Upgrade ssz_rs crate.

* Upgrade ssz_rs crate.

---------

Co-authored-by: claravanstaden <Cats 4 life!>

* Update cumulus

* Bump nixpkgs to use its geth package (paritytech#885)

* Bump nixpkgs to use its geth package

* Use source config in PolkadotListener

* Match field order in struct def

* Move info log to relay creation

* Remove unused channel

* Whitespace

* Add context to errors

* Add logs

* Bump node & pnpm in workflow

* Rename locals

* Remove unused variable

* Add Troubleshooting README section

* Fix up .envrc-example files

* Add note about pure shells

* Relax RANDAO_COMMIT_DELAY for local setup

* Update cumulus

* Update cumulus submodule (paritytech#886)

* Update cumulus

* Inbound queue benchmarks (paritytech#876)

* Start with inbound channel benchmarks.

* Add method to set execution header storage for benchmark test.

* Working on benchmarks

* Basic working version

* Cleanup

* Removes cleanup.

* Adds some comments for Alistair.

* Adds branch name.

* Makes note

* Test transactions

* Cleaning up beacon client deps.

* Clean up comments.

* Tests cleanup.

* Fixes non-benchmark test runs.

* Cleanup.

* Update fixtures and generates benchmarks.

* Revert relayer logs.

* Cleanup BenchmarkHelper impl and inbound queue dependencies.

* fmt

* Cleanup imports.

* Cleanup imports.

* Touch

* Adds weights in inbound queue pallet.

* Fix tests.

* Update cumulus.

---------

Co-authored-by: claravanstaden <Cats 4 life!>

* Fix test

* Fix parachain build

* Move BalanceOf outside of pallet

* remove benchmark for non existing method

* downgrade cargo.lock to match cumulus

* fix benchmarks

* Clara/sno 552 (paritytech#887)

* Spacing

* Spacing

* Undo typo.

* Minor updates.

* Adds comment about IrrelevantUpdate.

* One more comment.

* Update error name.

---------

Co-authored-by: claravanstaden <Cats 4 life!>

* Halting & resuming bridge pallets (paritytech#883)

* Halting & resuming bridge pallets

* Ignore .env

* Remove .env

* Some polish

* Set owner of bridge pallets

* Update cumulus

* Relax RANDAO_COMMIT_DELAY for local setup

* Update cumulus

* Update cumulus

* Fix test

* Fix Warnings

* Fix test

* Fix build & format

* Fix benchmark test

* Check for duplicate versions of substrate and polkadot (paritytech#891)

* modified pre-commit

* fixes

* testing

* testing

* testing

* fixed tests

* Format

* Some fix

* Update cumulus

* Update cumulus

---------

Co-authored-by: Clara van Staden <[email protected]>
Co-authored-by: David Dunn <[email protected]>
Co-authored-by: Alistair Singh <[email protected]>

* Proxy contracts Tests fixes (paritytech#892)

* fixed tests

* warnings and imports

* rustfmt

* updated cumulus

* Revert rustfmt (paritytech#896)

* Revert "rustfmt"

This reverts commit b83cec7929cdcc6ac972a2b1cad3a0e1fde81870.

* reverted xcm-builder

* Create agent (paritytech#895)

* base

* removed location conversion

* completed implementation

* remove xcm-builder

* update cumulus

* update cumulus

* use contains_key

* Fix openzeppelin-contracts submodule

---------

Co-authored-by: David Dunn <[email protected]>

* improve API for sending tokens

* Messy working version.

* Cleanup.

* More cleanup.

* Rollback unnecessary changes.

* Cleanup whitespace.

* Fix tests.

* Fuzz the submit method.

* Revert rebase oopsies.

* More fuzzing.

* More fuzzing.

* Last extrinsic.

* Remove unnecessary dependency.

* Revert readme.

* Fix tests and feature issues.

* Cleanup types.

* More cleanup.

* Call extrinsics directly. Adds readme.

* Adds CI.

* cd to correct dir

* Update CI.

* Correct nightly param.

* Remove runs.

* Own impl for SyncCommittee.

* Remove rng.
Cleans up workflow.

* Last cleanup.

* Update parachain/pallets/ethereum-beacon-client/fuzz/Cargo.toml

Co-authored-by: David Dunn <[email protected]>

* Adds cargo-fuzz to init script.

* PR comment changes.

* Update parachain/pallets/ethereum-beacon-client/fuzz/src/impls.rs

Co-authored-by: David Dunn <[email protected]>

* Update rust-toolchain.toml

Co-authored-by: David Dunn <[email protected]>

* Update .github/workflows/parachain.yml

Co-authored-by: David Dunn <[email protected]>

* Less runs for shorter Github actions.

---------

Co-authored-by: Vincent Geddes <[email protected]>
Co-authored-by: David Dunn <[email protected]>
Co-authored-by: Alistair Singh <[email protected]>
Co-authored-by: Ron <[email protected]>
Co-authored-by: claravanstaden <Cats 4 life!>
helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024
* Update substrate to polkadot-v0.9.30

 - Update substrate to commit 1b1a5e1
 - Upgrade rust-toolchain to rustc 1.64.0-nightly (7fe022f5a 2022-07-24)
 - Make benchmark command as an optional feature
 - Fix benchmark template and script
 - Upgrade web3 to v1.8 and update ts-tests/package-lock.json

Signed-off-by: koushiro <[email protected]>

* Remove useless dep

Signed-off-by: koushiro <[email protected]>

* Fix ident

Signed-off-by: koushiro <[email protected]>

* Install protoc

* Update some deps

* Fix test

* update prometheus to v0.13.3

Signed-off-by: koushiro <[email protected]>

* Use evm 0.36.0

Signed-off-by: koushiro <[email protected]>

* update lru 0.8.0 ==> 0.8.1

Signed-off-by: koushiro <[email protected]>

* update scale-codec and scale-info

Signed-off-by: koushiro <[email protected]>

* Fix fmt

Signed-off-by: koushiro <[email protected]>

Signed-off-by: koushiro <[email protected]>
bkchr pushed a commit that referenced this issue Apr 10, 2024
* update CA certificates in relay images

* clippy

* clippy

* clippy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

1 participant