[dag] decouple payload 3/n #12126

Closed

ibalajiarun

Description

Test Plan

trunk-io bot commented Feb 21, 2024

⏱️ 10h 55m total CI duration on this PR
| Job | Cumulative Duration | Recent Runs |
| --- | --- | --- |
| rust-smoke-coverage | 3h 39m | 🟩 |
| windows-build | 3h 10m | 🟩🟩🟩🟩🟩 (+3 more) |
| rust-unit-tests | 1h 36m | 🟥🟥🟥🟥 |
| forge-e2e-test / forge | 35m | 🟥🟥 |
| rust-images / rust-all | 28m | 🟩🟩 |
| run-tests-main-branch | 26m | 🟥🟥🟥🟥 |
| check-dynamic-deps | 17m | 🟩🟩🟩🟩🟩 (+3 more) |
| rust-unit-coverage | 14m | 🟥 |
| rust-lints | 13m | 🟥🟥🟥🟥 |
| general-lints | 10m | 🟩🟩🟩🟩 |
| semgrep/ci | 3m | 🟩🟩🟩🟩🟩 (+3 more) |
| check | 1m | 🟥🟥🟥 (+2 more) |
| file_change_determinator | 55s | 🟩🟩🟩 (+2 more) |
| file_change_determinator | 50s | 🟩🟩🟩🟩🟩 |
| permission-check | 24s | 🟩🟩🟩🟩🟩 (+3 more) |
| permission-check | 20s | 🟩🟩🟩🟩🟩 (+3 more) |
| permission-check | 18s | 🟩🟩🟩🟩 (+1 more) |
| file_change_determinator | 17s | 🟩🟩 |
| permission-check | 14s | 🟩🟩🟩🟩 (+1 more) |
| permission-check | 4s | 🟩🟩 |
| determine-docker-build-metadata | 3s | 🟩🟩 |

🚨 3 jobs on the last run were significantly faster/slower than expected

| Job | Duration | vs 7d avg | Delta |
| --- | --- | --- | --- |
| forge-e2e-test / forge | 20m | 14m | +37% |
| windows-build | 26m | 19m | +31% |
| rust-images / rust-all | 16m | 13m | +22% |

ibalajiarun commented Feb 21, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@ibalajiarun ibalajiarun marked this pull request as ready for review February 21, 2024 00:51
@ibalajiarun ibalajiarun force-pushed the balaji/payload-manager-3 branch from 5ec01ca to 8c4684d Compare February 21, 2024 17:42
@ibalajiarun ibalajiarun added the CICD:build-images and CICD:run-forge-e2e-perf labels Feb 21, 2024

❌ Forge suite realistic_env_max_load failure on 3c3b2504afd4c095ba9a0c441d3c4253ad58638a

Test Failed: test NetworkLoadTest

Caused by:
    Waiting for nodes to catch up to target version and epoch (Some(3918759), None) timed out after 360 seconds, current status: Ok([("validator-0", 4359511, 3), ("validator-1", 4359511, 3), ("validator-2", 4359511, 3), ("validator-3", 4359511, 3), ("validator-4", 4359511, 3), ("validator-5", 3652146, 2), ("validator-6", 4359511, 3), ("validator-7", 4359511, 3), ("validator-8", 4359511, 3), ("validator-9", 4359511, 3), ("validator-10", 4359511, 3), ("validator-11", 4359511, 3), ("validator-12", 4359511, 3), ("validator-13", 4359511, 3), ("validator-14", 4359511, 3), ("validator-15", 4359511, 3), ("validator-16", 4359511, 3), ("validator-17", 4359511, 3), ("validator-18", 4359511, 3), ("validator-19", 4359511, 3), ("fullnode-0", 4359511, 3), ("fullnode-1", 4359511, 3), ("fullnode-2", 4359511, 3), ("fullnode-3", 4359511, 3), ("fullnode-4", 4359511, 3), ("fullnode-5", 4357574, 3), ("fullnode-6", 4359511, 3), ("fullnode-7", 4359511, 3), ("fullnode-8", 4359511, 3), ("fullnode-9", 4359511, 3), ("fullnode-10", 4359511, 3), ("fullnode-11", 4359511, 3), ("fullnode-12", 4359511, 3), ("fullnode-13", 4359511, 3), ("fullnode-14", 4359511, 3), ("fullnode-15", 4359511, 3), ("fullnode-16", 4359511, 3), ("fullnode-17", 4359511, 3), ("fullnode-18", 4359511, 3), ("fullnode-19", 4359511, 3)])

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.79/src/error.rs:83:36
   1: aptos_forge::interface::swarm::wait_for_all_nodes_to_catchup_to_target_version_or_epoch::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:457:24
   2: aptos_forge::interface::swarm::wait_for_all_nodes_to_catchup_to_version::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:375:6
   3: aptos_forge::interface::swarm::wait_for_all_nodes_to_catchup::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:482:88
   4: aptos_forge::interface::swarm::SwarmExt::wait_for_all_nodes_to_catchup::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:283:90
   5: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
   6: <aptos_testcases::dag_onchain_enable_test::DagOnChainEnableTest as aptos_testcases::NetworkLoadTest>::test::{{closure}}
             at ./testsuite/testcases/src/dag_onchain_enable_test.rs:105:18
   7: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/park.rs:282:63
   8: tokio::runtime::coop::with_budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/coop.rs:107:5
   9: tokio::runtime::coop::budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/coop.rs:73:5
  10: tokio::runtime::park::CachedParkThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/park.rs:282:31
  11: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/context/blocking.rs:66:9
  12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/scheduler/multi_thread/mod.rs:87:13
  13: tokio::runtime::context::runtime::enter_runtime
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/context/runtime.rs:65:16
  14: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/scheduler/multi_thread/mod.rs:86:9
  15: tokio::runtime::runtime::Runtime::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/runtime.rs:350:50
  16: <aptos_testcases::dag_onchain_enable_test::DagOnChainEnableTest as aptos_testcases::NetworkLoadTest>::test
             at ./testsuite/testcases/src/dag_onchain_enable_test.rs:102:9
  17: <dyn aptos_testcases::NetworkLoadTest>::network_load_test
             at ./testsuite/testcases/src/lib.rs:311:13
  18: <dyn aptos_testcases::NetworkLoadTest as aptos_forge::interface::network::NetworkTest>::run
             at ./testsuite/testcases/src/lib.rs:180:30
  19: aptos_forge::runner::Forge<F>::run::{{closure}}
             at ./testsuite/forge/src/runner.rs:598:42
  20: aptos_forge::runner::run_test
             at ./testsuite/forge/src/runner.rs:666:11
  21: aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:598:30
  22: forge::run_forge
             at ./testsuite/forge-cli/src/main.rs:425:11
  23: forge::main
             at ./testsuite/forge-cli/src/main.rs:351:21
  24: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
  25: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:154:18
  26: std::rt::lang_start::{{closure}}
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:167:18
  27: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:284:13
  28: std::panicking::try::do_call
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
  29: std::panicking::try
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  30: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  31: std::rt::lang_start_internal::{{closure}}
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:48
  32: std::panicking::try::do_call
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
  33: std::panicking::try
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  34: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  35: std::rt::lang_start_internal
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:20
  36: std::rt::lang_start
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:166:17
  37: __libc_start_main
  38: _start


Swarm logs can be found here: See fgi output for more information.
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:292"},"thread_name":"main","hostname":"forge-e2e-pr-12126-1708554039-3c3b2504afd4c095ba9a0c441d3c4253a","timestamp":"2024-02-21T22:38:24.529784Z","message":"Deleting namespace forge-e2e-pr-12126: Some(NamespaceStatus { conditions: None, phase: Some(\"Terminating\") })"}
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:400"},"thread_name":"main","hostname":"forge-e2e-pr-12126-1708554039-3c3b2504afd4c095ba9a0c441d3c4253a","timestamp":"2024-02-21T22:38:24.529808Z","message":"aptos-node resources for Forge removed in namespace: forge-e2e-pr-12126"}

failures:
    dag reconfig enable test

test result: FAILED. 0 passed; 1 failed; 0 filtered out

Failed to run tests:
Tests Failed
Error: Tests Failed

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.79/src/error.rs:83:36
   1: aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:618:13
   2: forge::run_forge
             at ./testsuite/forge-cli/src/main.rs:425:11
   3: forge::main
             at ./testsuite/forge-cli/src/main.rs:351:21
   4: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
   5: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:154:18
   6: std::rt::lang_start::{{closure}}
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:167:18
   7: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:284:13
   8: std::panicking::try::do_call
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
   9: std::panicking::try
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  10: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  11: std::rt::lang_start_internal::{{closure}}
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:48
  12: std::panicking::try::do_call
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
  13: std::panicking::try
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  14: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  15: std::rt::lang_start_internal
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:20
  16: std::rt::lang_start
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:166:17
  17: __libc_start_main
  18: _start
Debugging output:
NAME                                   READY   STATUS      RESTARTS   AGE
aptos-node-0-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-0-validator-0               1/1     Running     0          16m
aptos-node-1-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-1-validator-0               1/1     Running     0          16m
aptos-node-10-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-10-validator-0              1/1     Running     0          16m
aptos-node-11-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-11-validator-0              1/1     Running     0          16m
aptos-node-12-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-12-validator-0              1/1     Running     0          16m
aptos-node-13-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-13-validator-0              1/1     Running     0          16m
aptos-node-14-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-14-validator-0              1/1     Running     0          16m
aptos-node-15-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-15-validator-0              1/1     Running     0          16m
aptos-node-16-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-16-validator-0              1/1     Running     0          16m
aptos-node-17-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-17-validator-0              1/1     Running     0          16m
aptos-node-18-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-18-validator-0              1/1     Running     0          16m
aptos-node-19-fullnode-eforge46-0      1/1     Running     0          16m
aptos-node-19-validator-0              1/1     Running     0          16m
aptos-node-2-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-2-validator-0               1/1     Running     0          16m
aptos-node-3-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-3-validator-0               1/1     Running     0          16m
aptos-node-4-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-4-validator-0               1/1     Running     0          16m
aptos-node-5-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-5-validator-0               1/1     Running     0          7m16s
aptos-node-6-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-6-validator-0               1/1     Running     0          16m
aptos-node-7-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-7-validator-0               1/1     Running     0          16m
aptos-node-8-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-8-validator-0               1/1     Running     0          16m
aptos-node-9-fullnode-eforge46-0       1/1     Running     0          16m
aptos-node-9-validator-0               1/1     Running     0          16m
genesis-aptos-genesis-eforge46-x2t74   0/1     Completed   0          17m
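
The status tuple in the timeout error above is long enough that the lagging node is easy to miss. A small sketch of the check it implies: list the nodes whose version is still below the target. The `lagging_nodes` helper is hypothetical (it is not part of Forge), and the tuple layout `(name, version, epoch)` is assumed from the shape of the log line.

```rust
/// Returns the names of nodes whose synced version is below `target_version`.
/// Each status entry is assumed to be `(node name, version, epoch)`.
fn lagging_nodes<'a>(status: &[(&'a str, u64, u64)], target_version: u64) -> Vec<&'a str> {
    status
        .iter()
        // Keep only entries that have not reached the target version yet.
        .filter(|(_, version, _)| *version < target_version)
        .map(|(name, _, _)| *name)
        .collect()
}
```

Applied to the values in the log with the target version 3918759, this returns only `validator-5` (version 3652146, epoch 2). `fullnode-5` at 4357574 trails its peers but is already past the target. In the pod listing, `aptos-node-5-validator-0` is also the only pod with an age of 7m16s instead of 16m.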

@sasha8 sasha8 left a comment

Where do you call prefetch_payload()?
Also, I do not see all the logic for caching the responses for execution.

// TODO: decide if payload should be fetched here or wait until later
let Some(payload) =
self.payload_manager.get_payload_if_exists(node.as_ref())
else {

There is some trade-off here.

@ibalajiarun ibalajiarun mentioned this pull request Feb 22, 2024
task::{block_in_place, JoinHandle},
};
use tokio_retry::strategy::ExponentialBackoff;

#[derive(Clone)]
struct BootstrapBaseState {
dag_store: Arc<DagStore>,
payload_store: Arc<DagPayloadStore>,

I don't see this being used

@@ -199,6 +225,7 @@ impl SyncMode {
async fn run_internal(
self,
dag_rpc_rx: &mut Receiver<Author, IncomingDAGRequest>,
_payload_link_rx: &mut mpsc::UnboundedReceiver<PayloadLinkMsg>,

Why is this not run in sync mode? Won't there be issues if the buffer manager keeps sending messages while nothing reads them? We should probably also decouple it from the DAG tasks.
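
The concern above applies to any unbounded channel: if the receiving side is never polled, every message the sender produces stays queued. A minimal sketch of that behavior, using `std::sync::mpsc` as a stand-in for the tokio channel in the diff. `PayloadLinkMsg` here is a hypothetical placeholder type, and `send_links` and `drain_pending` are illustrative helpers, not functions from the PR.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Hypothetical placeholder for the real message type.
#[derive(Debug)]
struct PayloadLinkMsg(u64);

/// Simulates a producer that keeps sending while the consumer is not reading.
/// The std channel is unbounded, so `send` never blocks and never drops.
fn send_links(tx: &Sender<PayloadLinkMsg>, count: u64) {
    for round in 0..count {
        tx.send(PayloadLinkMsg(round)).expect("receiver is still alive");
    }
}

/// Drains everything currently queued without blocking and reports how many
/// messages had piled up.
fn drain_pending(rx: &Receiver<PayloadLinkMsg>) -> usize {
    rx.try_iter().count()
}
```

Calling `send_links(&tx, 1000)` and then `drain_pending(&rx)` reports 1000 queued messages. One possible mitigation, under the same assumptions, is to drain (or otherwise service) the receiver in sync mode as well.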

@@ -57,18 +53,18 @@ impl<T> Stream for FetchWaiter<T> {
}

pub trait TFetchRequester: Send + Sync {
- fn request_for_node(&self, node: Node) -> anyhow::Result<()>;
+ fn request_for_node(&self, node: NodeMessage) -> anyhow::Result<()>;

Why do we need to change this?

@@ -439,7 +442,7 @@ impl InMemDag {
pub struct DagStore {
dag: RwLock<InMemDag>,
storage: Arc<dyn DAGStorage>,
- payload_manager: Arc<dyn TPayloadManager>,
+ external_payload_manager: Arc<dyn TPayloadManager>,

Why is it called external?

error!("Error deleting expired nodes: {:?}", e);
}
payload_digests.into_iter().filter_map(|d| d).collect()

nit: just call flatten?
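
The nit above relies on `filter_map(|d| d)` over an iterator of `Option`s being equivalent to `flatten()`. A minimal sketch, assuming the digests are `Option` values; the `u64` digest type and both function names are hypothetical stand-ins, not the PR's types.

```rust
/// Collects the present digests using the original `filter_map` form.
fn collect_with_filter_map(payload_digests: Vec<Option<u64>>) -> Vec<u64> {
    payload_digests.into_iter().filter_map(|d| d).collect()
}

/// Same result via `flatten`: `Option<T>` implements `IntoIterator`,
/// yielding one item for `Some` and nothing for `None`.
fn collect_with_flatten(payload_digests: Vec<Option<u64>>) -> Vec<u64> {
    payload_digests.into_iter().flatten().collect()
}
```

Both functions drop the `None` entries and keep the remaining values in order, so the swap is purely cosmetic.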

self.config.mempool_txn_pull_timeout_ms,
))
let mut quorum_store_builder = match (is_dag_enabled, is_quorum_store_enabled) {
(true, true) => unreachable!("not yet supported"),

this is supported, no?
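
If the DAG plus quorum store combination is indeed supported, the tuple match could give it a real arm instead of `unreachable!`. A rough sketch of that shape only: `BuilderChoice`, its variants, and `choose_builder` are hypothetical names, and which builder each combination should actually map to is an assumption rather than something the PR states.

```rust
/// Hypothetical labels for the builder variants; not the PR's types.
#[derive(Debug, PartialEq, Eq)]
enum BuilderChoice {
    DirectMempool,
    QuorumStore,
    DagDirectMempool,
    DagQuorumStore,
}

/// Maps every flag combination to a choice, so no arm has to panic.
fn choose_builder(is_dag_enabled: bool, is_quorum_store_enabled: bool) -> BuilderChoice {
    match (is_dag_enabled, is_quorum_store_enabled) {
        (true, true) => BuilderChoice::DagQuorumStore,
        (true, false) => BuilderChoice::DagDirectMempool,
        (false, true) => BuilderChoice::QuorumStore,
        (false, false) => BuilderChoice::DirectMempool,
    }
}
```

Because the match is exhaustive over both booleans, the compiler rejects the code if a combination is left unhandled, which is a stronger guarantee than a runtime `unreachable!`.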

fn prefetch_payload_data(&self, payload: &Payload, timestamp: u64) {
self.prefetch_payload_data(payload, timestamp);
fn prefetch_dag_payload_data(&self, payload: &DagPayload, timestamp: u64) {
match payload {

This looks weird. Can we convert DagPayload to Payload first and then do a single-line prefetch?
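
One way to read the suggestion above is a `From` conversion, so that the DAG-specific prefetch becomes a one-line delegation. A rough sketch under loud assumptions: the `Payload` and `DagPayload` definitions below are simplified stand-ins with hypothetical variants, `Prefetcher` is a hypothetical holder for the two methods, and the return value (number of batches that would be requested) exists only to make the sketch observable.

```rust
// Hypothetical, simplified stand-ins for the real payload types.
#[derive(Debug, Clone, PartialEq)]
enum Payload {
    DirectMempool(Vec<u64>),
    InQuorumStore(Vec<u64>),
}

#[derive(Debug, Clone, PartialEq)]
enum DagPayload {
    Inline(Vec<u64>),
    Decoupled(Vec<u64>),
}

// Assumed mapping between the two enums; the real correspondence may differ.
impl From<&DagPayload> for Payload {
    fn from(payload: &DagPayload) -> Self {
        match payload {
            DagPayload::Inline(txns) => Payload::DirectMempool(txns.clone()),
            DagPayload::Decoupled(batches) => Payload::InQuorumStore(batches.clone()),
        }
    }
}

struct Prefetcher;

impl Prefetcher {
    /// Returns how many batches would be requested (illustrative only).
    fn prefetch_payload_data(&self, payload: &Payload, _timestamp: u64) -> usize {
        match payload {
            // Inline transactions are already present, so nothing to fetch.
            Payload::DirectMempool(_) => 0,
            Payload::InQuorumStore(batches) => batches.len(),
        }
    }

    /// Convert first, then delegate in a single line.
    fn prefetch_dag_payload_data(&self, payload: &DagPayload, timestamp: u64) -> usize {
        self.prefetch_payload_data(&Payload::from(payload), timestamp)
    }
}
```

The trade-off in this sketch is a clone per conversion; a borrowing view or a shared helper would avoid it if the payloads are large.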

@ibalajiarun ibalajiarun force-pushed the balaji/dag-payload-manager branch from cb5bb85 to af74f3b Compare February 24, 2024 20:19

This issue is stale because it has been open 45 days with no activity. Remove the stale label, comment or push a commit - otherwise this will be closed in 15 days.

@github-actions github-actions bot added the Stale label Apr 17, 2024
@github-actions github-actions bot closed this May 2, 2024
@ibalajiarun ibalajiarun reopened this May 2, 2024
@github-actions github-actions bot removed the Stale label May 3, 2024

This issue is stale because it has been open 45 days with no activity. Remove the stale label, comment or push a commit - otherwise this will be closed in 15 days.

@github-actions github-actions bot added the Stale label Jun 17, 2024
@github-actions github-actions bot closed this Jul 3, 2024