[move-prover] Sharding feature + re-animate Prover.toml #15775

Merged 1 commit on Jan 21, 2025

Contributor

@wrwg wrwg commented Jan 20, 2025

Description

Boogie uses excessive amounts of memory (100GB+) when run as part of proving aptos-framework, which makes CI and local testing fail randomly. This PR introduces a new `--shards` option which, if specified, splits the Boogie job into multiple shards. It also re-animates the use of `Prover.toml` in a package directory, so that such options can be set there.

How Has This Been Tested?

Existing tests

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)


trunk-io bot commented Jan 20, 2025

⏱️ 4h 36m total CI duration on this PR
| Slowest 15 Jobs | Cumulative Duration | Recent Runs |
|---|---|---|
| execution-performance / single-node-performance | 1h 49m | 🟩🟩🟥🟥🟩 (+1 more) |
| execution-performance / test-target-determinator | 31m | 🟩🟩🟩🟩🟩 (+1 more) |
| test-target-determinator | 31m | 🟩🟩🟩🟩🟩 (+1 more) |
| check-dynamic-deps | 23m | 🟩🟩🟩🟩🟩 (+6 more) |
| rust-cargo-deny | 19m | 🟩🟩🟩🟩🟩 (+6 more) |
| fetch-last-released-docker-image-tag | 10m | 🟩🟩🟩🟩🟩 (+1 more) |
| rust-doc-tests | 7m | 🟩 |
| rust-doc-tests | 7m | 🟩 |
| rust-doc-tests | 6m | 🟩 |
| rust-doc-tests | 6m | 🟩 |
| rust-doc-tests | 6m | 🟩 |
| rust-doc-tests | 6m | 🟩 |
| general-lints | 6m | 🟩🟩🟩🟩🟩 (+6 more) |
| semgrep/ci | 4m | 🟩🟩🟩🟩🟩 (+6 more) |
| file_change_determinator | 2m | 🟩🟩🟩🟩🟩 (+6 more) |

🚨 2 jobs on the last run were significantly faster/slower than expected

| Job | Duration | vs 7d avg | Delta |
|---|---|---|---|
| test-target-determinator | 7m | 5m | +39% |
| execution-performance / test-target-determinator | 7m | 5m | +37% |


@wrwg wrwg requested review from fEst1ck, rahxephon89 and vineethk and removed request for davidiw, areshand and movekevin January 20, 2025 20:48
@wrwg wrwg force-pushed the wrwg/prover-shards branch 2 times, most recently from 5bf5802 to afe2d9e Compare January 20, 2025 23:24
@wrwg wrwg force-pushed the wrwg/prover-shards branch from afe2d9e to 22a726c Compare January 20, 2025 23:36
@wrwg wrwg enabled auto-merge (squash) January 20, 2025 23:53

This comment has been minimized.

@wrwg wrwg force-pushed the wrwg/prover-shards branch from 22a726c to 4fc54a1 Compare January 21, 2025 00:38

@wrwg wrwg force-pushed the wrwg/prover-shards branch from 4fc54a1 to 0c22d8f Compare January 21, 2025 01:48

@wrwg wrwg force-pushed the wrwg/prover-shards branch from 0c22d8f to e04c494 Compare January 21, 2025 02:22

@wrwg wrwg force-pushed the wrwg/prover-shards branch from e04c494 to 61cbe98 Compare January 21, 2025 02:57

@wrwg wrwg force-pushed the wrwg/prover-shards branch from 61cbe98 to 35d45b6 Compare January 21, 2025 03:34

Contributor

✅ Forge suite realistic_env_max_load success on 35d45b6f46e4f2fbbc20b2bf6008c9b63d6ad761

two traffics test: inner traffic : committed: 14376.22 txn/s, latency: 2758.54 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 4200 ms), latency samples: 5466080
two traffics test : committed: 99.96 txn/s, latency: 1482.95 ms, (p50: 1400 ms, p70: 1500, p90: 1600 ms, p99: 3000 ms), latency samples: 1800
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.734, avg: 1.495", "ConsensusProposalToOrdered: max: 0.305, avg: 0.293", "ConsensusOrderedToCommit: max: 0.410, avg: 0.401", "ConsensusProposalToCommit: max: 0.714, avg: 0.694"]
Max non-epoch-change gap was: 1 rounds at version 4834473 (avg 0.00) [limit 4], 1.87s no progress at version 4834473 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.65s no progress at version 2918666 (avg 0.65s) [limit 16].
Test Ok

Contributor

✅ Forge suite compat success on 17540fad8e88ab5681f3a91190b9f5d37e53d2ef ==> 35d45b6f46e4f2fbbc20b2bf6008c9b63d6ad761

Compatibility test results for 17540fad8e88ab5681f3a91190b9f5d37e53d2ef ==> 35d45b6f46e4f2fbbc20b2bf6008c9b63d6ad761 (PR)
1. Check liveness of validators at old version: 17540fad8e88ab5681f3a91190b9f5d37e53d2ef
compatibility::simple-validator-upgrade::liveness-check : committed: 10898.97 txn/s, latency: 2910.26 ms, (p50: 2400 ms, p70: 3500, p90: 3900 ms, p99: 12400 ms), latency samples: 373020
2. Upgrading first Validator to new version: 35d45b6f46e4f2fbbc20b2bf6008c9b63d6ad761
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 2972.23 txn/s, latency: 9874.25 ms, (p50: 10600 ms, p70: 11700, p90: 12600 ms, p99: 12800 ms), latency samples: 67400
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 2991.14 txn/s, latency: 11189.78 ms, (p50: 12500 ms, p70: 12700, p90: 12800 ms, p99: 12900 ms), latency samples: 115920
3. Upgrading rest of first batch to new version: 35d45b6f46e4f2fbbc20b2bf6008c9b63d6ad761
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 4358.31 txn/s, latency: 7095.74 ms, (p50: 7800 ms, p70: 8500, p90: 9000 ms, p99: 9200 ms), latency samples: 92020
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 4550.46 txn/s, latency: 7496.88 ms, (p50: 8400 ms, p70: 8400, p90: 8700 ms, p99: 8800 ms), latency samples: 162160
4. upgrading second batch to new version: 35d45b6f46e4f2fbbc20b2bf6008c9b63d6ad761
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 7747.88 txn/s, latency: 3915.34 ms, (p50: 4400 ms, p70: 4600, p90: 5200 ms, p99: 5400 ms), latency samples: 142500
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 2853.91 txn/s, submitted: 2854.03 txn/s, expired: 0.12 txn/s, latency: 4464.00 ms, (p50: 4700 ms, p70: 5100, p90: 5300 ms, p99: 5400 ms), latency samples: 255509
5. check swarm health
Compatibility test for 17540fad8e88ab5681f3a91190b9f5d37e53d2ef ==> 35d45b6f46e4f2fbbc20b2bf6008c9b63d6ad761 passed
Test Ok

@@ -12,14 +12,14 @@ VectorPicture { length: 30720 } 34 0.924 1.081 6900.0
VectorPictureRead { length: 30720 } 34 0.938 1.089 6900.0
SmartTablePicture { length: 30720, num_points_per_txn: 200 } 34 0.972 1.074 42970.1
SmartTablePicture { length: 1048576, num_points_per_txn: 300 } 34 0.960 1.066 73865.4
ResourceGroupsSenderWriteTag { string_length: 1024 } 34 0.898 1.062 16.2
ResourceGroupsSenderWriteTag { string_length: 1024 } 34 0.898 1.062 19.2
Contributor Author

This PR does not touch anything in the execution stack, so this must be from previous PRs. (Perhaps permissioned signers? @igor-aptos)

Contributor

Seems like signer has indeed added noise - here is the recalibration PR - #15778

generated by doing:

./testsuite/single_node_performance_calibration.py --time-interval=3d
./testsuite/single_node_performance_calibration.py --move-e2e --time-interval=3d

note - this particular line didn't regress as much as you saw from the first run - you can see your second run is much closer to the original value. But I assume all the changes create way too much noise anyway.

Contributor

@vineethk vineethk left a comment

Left some comments, otherwise looks good to me.

#[clap(long)]
pub shards: Option<usize>,

/// If there are multiple shards, the shard to which verification shall be narrowed.
Contributor

From looking at other related code, it looks like the shards are numbered 1.. and not 0... We should document this, or perhaps make it numbered 0..?
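
To make the numbering question concrete, here is a minimal sketch of one way to keep a 1-based, user-facing shard number separate from a 0-based internal index. The helper names `shard_index_from_user` and `shard_number_for_display` are hypothetical and are not taken from the PR; the sketch only assumes that users pass shard numbers in `1..=shard_count`.

```rust
/// Converts a user-facing, 1-based shard number into a 0-based index.
/// Returns `None` if the number is 0 or larger than `shard_count`.
/// (Hypothetical helper for illustration, not part of the PR.)
fn shard_index_from_user(shard_number: usize, shard_count: usize) -> Option<usize> {
    (1..=shard_count)
        .contains(&shard_number)
        .then(|| shard_number - 1)
}

/// Converts a 0-based internal index back into the 1-based number
/// that would be shown to the user.
fn shard_number_for_display(shard_index: usize) -> usize {
    shard_index + 1
}
```

Doing the conversion in one place, at the option boundary, would avoid scattering `+ 1` adjustments through the checking code, whichever numbering ends up being documented.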

}
if let Some(shard) = self.for_shard {
// Check whether the shard is included.
if self.options.only_shard.is_some() && self.options.only_shard != Some(shard + 1) {
Contributor

nit: could use is_some_and() instead.
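
For reference, a small sketch of what the suggested rewrite could look like. The two free functions below are hypothetical stand-ins for the condition in the diff (the real code reads `self.options.only_shard`); they assume `only_shard` is an `Option<usize>` holding a 1-based shard number and `shard` is a 0-based index.

```rust
/// The condition as written in the diff: a specific shard was requested
/// and it is not the 1-based number of `shard`.
fn excluded_verbose(only_shard: Option<usize>, shard: usize) -> bool {
    only_shard.is_some() && only_shard != Some(shard + 1)
}

/// The same condition expressed with `Option::is_some_and`
/// (available in stable Rust since 1.70).
fn excluded_concise(only_shard: Option<usize>, shard: usize) -> bool {
    only_shard.is_some_and(|selected| selected != shard + 1)
}
```

Both forms return `false` when no shard is selected and `true` only when a different shard is selected, so the rewrite is purely stylistic.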

return false;
}
// Check whether it is part of the shard.
let mut hasher = DefaultHasher::new();
Contributor

I think this hasher is non-deterministic? Thus, if you decide to shard one by one, a target might fall into different shards each time? Thus, we may want to use a deterministic hasher.
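
One way to sidestep the question is to use a hash whose algorithm is fixed in the code itself. The standard library documents the algorithm behind `DefaultHasher` as unspecified and not to be relied upon across releases, so the sketch below uses a hand-written 64-bit FNV-1a instead. This is an illustration under assumptions, not what the PR does: `fnv1a_64` and `shard_of` are hypothetical names, and hashing a target's name as a string is assumed for simplicity.

```rust
/// 64-bit FNV-1a hash. The constants are the standard FNV offset basis
/// and prime, so the result is the same on every platform and toolchain.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Maps a verification target (identified here by a name, which is an
/// assumption of this sketch) to a 0-based shard index in `0..shard_count`.
fn shard_of(target_name: &str, shard_count: usize) -> usize {
    assert!(shard_count > 0, "shard_count must be positive");
    (fnv1a_64(target_name.as_bytes()) % shard_count as u64) as usize
}
```

With a fixed hash like this, running shards one at a time in separate processes would assign each target to the same shard every time, so no target would be skipped or verified twice.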

Contributor

@wrwg Checking in to see if you missed this before merging.

@wrwg wrwg merged commit 4125bd8 into main Jan 21, 2025
46 checks passed
@wrwg wrwg deleted the wrwg/prover-shards branch January 21, 2025 16:22