Vault client indefinitely waiting for other tasks to shut down #538
Comments
@pendulum-chain/product this is a bug that seems to occur more frequently when running the client in a cluster. We should try to investigate and fix this soon.
@b-yap I checked the logs of a successful restart on a Pendulum vault and it looks like this
and comparing it to the logs of the incidents it seems like this is missing
Maybe this is the task that is unable to shut down.
@ebma The wait for x tasks to shut down applies only to the tasks that are calling spacewalk/clients/vault/src/oracle/agent.rs Lines 87 to 135 in 53f81a2
I am trying to rewrite the code to handle this better. This was difficult from the start. See the attempt of stopping this loop with 2
Hmm, you are right, that's likely the problem 🤔
* add agent to `VaultService`
* remove shutdown subscriber
* remove unused function
* Update clients/stellar-relay-lib/src/overlay.rs Co-authored-by: Marcel Ebert
* show error response
* update slot for archive unit tests, and update configs
* fix https://github.com/pendulum-chain/spacewalk/actions/runs/9657641728/job/26648202437?pr=539#step:12:4486
* cargo fmt
---------
Co-authored-by: Marcel Ebert
Fixed with #539
* update dependencies, fixing compilation errors (WIP) * update pallets to 1.1.0 * fix subxt feature flags * fix client configuration new fields * fix clippy issues and missing std flag * update fixes and missing configs for runtimes * fixes to node configuration (wip) * node client fixes * bump subxt to minimum viable version * use static wrapper for account_id and currency_id metadata replacement * use account_id struct from subxt implementing Encode and Decode as type to avoid Static wrapper * fix account type conversion using new subxt type * missing flag, cleanup unused types * temprary remove conditional metadata * use foucoco metadata only * warnings and type conversion * re-implement hashable trait * replace account_id on tests * use subxt version 0.29 across crates * modification of service config values * testing increase in timeout for tests * prints for testing * bump to subxt 0.30.0 * temp remove unrelated jobs * work with ubuntu latest * bump to subxt 0.33 seal block periodically seal blocks more often seal blocks more often block finalization force hack * local linux tests * testing finalizing first block solution * setup working locally * test only integration on ci * cleaning up temporary changes * small code improvements * more comment removal * use ubuntu latest os for main test * Use different function to set exchange rate * `lookup_bytes` is only used here. Removing `.clone()`. [`fn fetch_raw(...)` accepts `Vec<u8>`](https://github.com/paritytech/subxt/blob/f06a95d687605bf826db9d83b2932a73a57b169f/subxt/src/storage/storage_type.rs#L51). 
`fn build()` already [accepts Url](https://github.com/paritytech/jsonrpsee/blob/v0.20.0/client/transport/src/ws/mod.rs#L279) * cleanup * handle all metadata instances * temporary disable unused code on test file * add all-features flag to vault * addresses the confusing `all-features` feature of `runtime` package; Reuse the `UrlParseError` by changing it to `url::ParseError`, instead of jsonrpsee's `InvalidUri` * comment cleanups * use macos image for main workflow testing * revert * print current directory * remove toolchain * ran cargo +nightly-2024-02-09 fmt --all * revert, to use ubuntu * Revert to mac-os This reverts commit 7f25b8f. * uncomment failing test * force install rustup * use macos-13 * ignore subxt issues, and print out proof error * Revert "ignore subxt issues, and print out proof error" This reverts commit 08da9d6. * Revert "Merge branch 'main' into polkadot-v1.1.0" This reverts commit 255845c, reversing changes made to 08da9d6. * ignore failing subxt tests and print out proof error * update cargo lock https://github.com/pendulum-chain/spacewalk/actions/runs/10079792308/job/27899770357?pr=536#step:14:24 * fix cargo.lock; update only dia-oracle * update network * add back iowa * [DO NOT MERGE] Test issue on Linux due to update. (#540) * add extra cargo.toml specially for CI testing. * stellar relay config sdftest2 ip change * fixing some regressions from incomplete merge * fixing more merge issues * missing changes from #538 * fix AccountId PrettyPrint impl * fix compare spec version name * fix benchmark changes merge issue * remove when possible explicit declaration of pallets in construct_runtime macro * cargo fmt * remove all-features feature * testing inherent modification * Revert "remove when possible explicit declaration of pallets in construct_runtime macro" This reverts commit 33a6067. * Revert "Revert "remove when possible explicit declaration of pallets in construct_runtime macro"" This reverts commit 17f8100. 
* regenerate metadata after removal of explicit pallet delcaration in macro * Revert "testing inherent modification" This reverts commit 3345c90. * use pendulum's polkadot-sdk fork with modified constant * modification of stellar sdftest3 ip * modify also vault resources sdftest3 ip --------- Co-authored-by: Marcel Ebert <[email protected]> Co-authored-by: b-yap <[email protected]>
Context
For some non-recoverable errors, the vault client tries to restart. Before restarting, it waits for pending tasks to shut down. It seems that not all tasks receive the shutdown signal, or that they receive it but remain stuck. This causes the vault client to wait indefinitely (not even the periodic restart works in this state).
The following incidents happened on lower-spec machines, so they may be more likely to occur when the clients have few resources available.
TODO
Try to find the tasks that are not successfully shut down.
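One way to narrow down the TODO above is to bound the shutdown wait with a deadline and report which tasks never acknowledged the signal. The following is only a minimal sketch of that idea, not the vault client's actual code: it uses plain `std` threads and channels instead of the client's async runtime, and the names `shutdown_with_timeout`, `ShutdownOutcome`, and the `honours_signal` flag are hypothetical, introduced purely for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of waiting for tasks to acknowledge a shutdown signal (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ShutdownOutcome {
    /// Every task reported that it stopped.
    AllStopped,
    /// The deadline passed; these tasks never reported back.
    TimedOut { pending: Vec<String> },
}

/// Spawns one thread per entry in `tasks`, signals shutdown, and waits at most `timeout`.
/// Each entry pairs a task name with whether the task honours the shutdown flag;
/// `false` simulates a task that is stuck and never checks the signal.
pub fn shutdown_with_timeout(tasks: &[(&str, bool)], timeout: Duration) -> ShutdownOutcome {
    let shutdown = Arc::new(AtomicBool::new(false));
    let (done_tx, done_rx) = mpsc::channel::<String>();

    for (name, honours_signal) in tasks {
        let shutdown = Arc::clone(&shutdown);
        let done_tx = done_tx.clone();
        let name = name.to_string();
        let honours_signal = *honours_signal;
        thread::spawn(move || {
            // A well-behaved task polls the flag; a stuck task loops forever.
            loop {
                if honours_signal && shutdown.load(Ordering::SeqCst) {
                    break;
                }
                thread::sleep(Duration::from_millis(5));
            }
            // Acknowledge the shutdown; ignore the error if the receiver is gone.
            let _ = done_tx.send(name);
        });
    }
    drop(done_tx);

    // Broadcast the shutdown signal, then wait for acknowledgements until the deadline.
    shutdown.store(true, Ordering::SeqCst);
    let deadline = Instant::now() + timeout;
    let mut pending: Vec<String> = tasks.iter().map(|(n, _)| n.to_string()).collect();

    while !pending.is_empty() {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match done_rx.recv_timeout(remaining) {
            Ok(name) => pending.retain(|p| p != &name),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }

    if pending.is_empty() {
        ShutdownOutcome::AllStopped
    } else {
        ShutdownOutcome::TimedOut { pending }
    }
}
```

Calling `shutdown_with_timeout(&[("listener", true), ("stuck", false)], Duration::from_millis(500))` returns `TimedOut` with `"stuck"` in the pending list, so the caller can log the offending task and proceed with the restart instead of waiting forever. Whether a forced restart is safe for a given task is a separate design question that this sketch does not address.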
Incident 1
Incident 2
Incident 3