Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Quorum Store] improvements to prevent some batches not getting quorum #11629

Merged
merged 7 commits into from
Feb 14, 2024

Conversation

bchocho
Copy link
Contributor

@bchocho bchocho commented Jan 11, 2024

Description

Occasionally, the metrics show batches not getting quorum, in forge and testnet.

  1. There are false positives (didn't actually expire, only in the metrics), due to batches getting late votes, after they have been committed. This is resolved by caching the committed batches until expiration time; only really really late votes would cause false positives.
  2. There are proofs that actually expire without getting quorum. These are written too late into the QS DB; if the entry does not exist in the DB, the vote is discarded, and in some cases more than 33% of votes are actually discarded. This is resolved by writing to the DB in batch_generator, before broadcasting the Batch.

In addition the counters for counting proof votes are cleaned up, so the votes/stake are counted on expire time (as originally intended).

Test Plan

Run realistic_env_max_load_large and observe that no more proofs fail to get quorum, and metrics look better

Copy link

trunk-io bot commented Jan 11, 2024

⏱️ 18h 39m total CI duration on this PR
Job Cumulative Duration Recent Runs
rust-unit-coverage 4h 19m 🟩
rust-unit-tests 3h 9m 🟥🟩🟩🟩 (+3 more)
rust-smoke-coverage 2h 50m 🟩
windows-build 2h 39m 🟩🟥🟩🟩🟩 (+4 more)
forge-e2e-test / forge 1h 26m 🟩🟩🟩🟩🟩 (+1 more)
rust-images / rust-all 1h 16m 🟥🟩🟩🟩 (+3 more)
rust-lints 1h 🟥🟩🟩🟩 (+3 more)
run-tests-main-branch 37m 🟩🟩🟩🟩🟩 (+3 more)
check 32m 🟩🟩🟩🟩🟩 (+4 more)
check-dynamic-deps 22m 🟩🟩🟩🟩🟩 (+4 more)
general-lints 21m 🟩🟩🟩🟩🟩 (+3 more)
semgrep/ci 3m 🟩🟩🟩🟩🟩 (+4 more)
file_change_determinator 2m 🟩🟩🟩🟩🟩 (+4 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+3 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+3 more)
permission-check 25s 🟩🟩🟩🟩🟩 (+3 more)
permission-check 24s 🟩🟩🟩🟩🟩 (+3 more)
permission-check 23s 🟩🟩🟩🟩🟩 (+3 more)
permission-check 21s 🟩🟩🟩🟩🟩 (+3 more)
permission-check 21s 🟩🟩🟩🟩🟩 (+3 more)
determine-docker-build-metadata 18s 🟩🟩🟩🟩🟩 (+3 more)
upload-to-codecov 11s 🟩

🚨 2 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
check-dynamic-deps 3m 2m +63%
rust-images / rust-all 13m 10m +21%

settingsfeedbackdocs ⋅ learn more about trunk.io

@bchocho bchocho added the CICD:run-forge-e2e-perf Run the e2e perf forge only label Jan 11, 2024

This comment has been minimized.

This comment has been minimized.

Copy link

codecov bot commented Jan 11, 2024

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (e34cd18) 69.8% compared to head (a95a912) 69.7%.
Report is 5 commits behind head on main.

❗ Current head a95a912 differs from pull request most recent head 0be0dc7. Consider uploading reports for the commit 0be0dc7 to get more accurate results

Files Patch % Lines
consensus/src/quorum_store/proof_coordinator.rs 0.0% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11629      +/-   ##
==========================================
- Coverage    69.8%    69.7%    -0.2%     
==========================================
  Files        2200     2111      -89     
  Lines      417626   401250   -16376     
==========================================
- Hits       291876   279734   -12142     
+ Misses     125750   121516    -4234     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@bchocho bchocho changed the title debug logs [Quorum Store] improvements to prevent some proofs not getting quorum Jan 12, 2024
@bchocho bchocho force-pushed the brian/debug-occasional-pos-failed branch from 4e407e1 to a703ed6 Compare January 12, 2024 23:20

This comment has been minimized.

This comment has been minimized.

@bchocho bchocho force-pushed the brian/debug-occasional-pos-failed branch from a703ed6 to bc1baa9 Compare January 13, 2024 00:13

This comment has been minimized.

This comment has been minimized.

@bchocho bchocho marked this pull request as ready for review January 13, 2024 00:51
@bchocho bchocho requested a review from igor-aptos January 13, 2024 00:51
@bchocho
Copy link
Contributor Author

bchocho commented Jan 16, 2024

This comment has been minimized.

This comment has been minimized.

@bchocho bchocho changed the title [Quorum Store] improvements to prevent some proofs not getting quorum [Quorum Store] improvements to prevent some batches not getting quorum Jan 16, 2024
for batch in batches.clone().into_iter() {
persist_requests.push(batch.into());
}
self.batch_writer.persist(persist_requests);
Copy link
Contributor

@ibalajiarun ibalajiarun Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you observe any bottlenecks here? This might need to be parallelized or sent to another task, or this can block the loop for a while.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running max_load_large I didn't see any issues

max_batch_expiry_gap_usecs: u64,
validator: &ValidatorVerifier,
) -> anyhow::Result<()> {
if sender != self.signer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use ensure is simpler

if !state.completed {
counters::TIMEOUT_BATCHES_COUNT.inc();
}
Self::update_counters(&state);
if !state.completed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a duplicate if condition as above?

Self::update_counters(&state);
if !state.completed {
info!(
LogSchema::new(LogEvent::ProofOfStoreInit),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the event is ProofOfStoreInit?

existing_proof.remove();
let incremental_proof = existing_proof.get();
if !incremental_proof.completed {
warn!("QS: received commit notification for batch that did not complete: {}, self_voted: {}", digest, incremental_proof.self_voted);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm how can this happen?

@bchocho bchocho force-pushed the brian/debug-occasional-pos-failed branch from ba18903 to 0be0dc7 Compare February 12, 2024 22:55

This comment has been minimized.

This comment has been minimized.

@bchocho bchocho enabled auto-merge (squash) February 14, 2024 20:30

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite compat success on aptos-node-v1.8.3 ==> 0be0dc7b3ebb82647062805c245baa949c7e8bac

Compatibility test results for aptos-node-v1.8.3 ==> 0be0dc7b3ebb82647062805c245baa949c7e8bac (PR)
1. Check liveness of validators at old version: aptos-node-v1.8.3
compatibility::simple-validator-upgrade::liveness-check : committed: 5045 txn/s, latency: 6587 ms, (p50: 6300 ms, p90: 9900 ms, p99: 12900 ms), latency samples: 186680
2. Upgrading first Validator to new version: 0be0dc7b3ebb82647062805c245baa949c7e8bac
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1846 txn/s, latency: 15720 ms, (p50: 18400 ms, p90: 22300 ms, p99: 22600 ms), latency samples: 92320
3. Upgrading rest of first batch to new version: 0be0dc7b3ebb82647062805c245baa949c7e8bac
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1691 txn/s, latency: 16601 ms, (p50: 18900 ms, p90: 22200 ms, p99: 23200 ms), latency samples: 87940
4. upgrading second batch to new version: 0be0dc7b3ebb82647062805c245baa949c7e8bac
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3443 txn/s, latency: 8948 ms, (p50: 9800 ms, p90: 12000 ms, p99: 13000 ms), latency samples: 144620
5. check swarm health
Compatibility test for aptos-node-v1.8.3 ==> 0be0dc7b3ebb82647062805c245baa949c7e8bac passed
Test Ok

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 0be0dc7b3ebb82647062805c245baa949c7e8bac

two traffics test: inner traffic : committed: 6972 txn/s, latency: 5467 ms, (p50: 5100 ms, p90: 7000 ms, p99: 13200 ms), latency samples: 3019000
two traffics test : committed: 100 txn/s, latency: 2196 ms, (p50: 2100 ms, p90: 2400 ms, p99: 8100 ms), latency samples: 1800
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.276, avg: 0.215", "QsPosToProposal: max: 0.233, avg: 0.201", "ConsensusProposalToOrdered: max: 0.615, avg: 0.568", "ConsensusOrderedToCommit: max: 0.482, avg: 0.458", "ConsensusProposalToCommit: max: 1.074, avg: 1.025"]
Max round gap was 1 [limit 4] at version 952978. Max no progress secs was 9.049449 [limit 15] at version 952978.
Test Ok

@bchocho bchocho merged commit e7cc7e0 into main Feb 14, 2024
79 of 80 checks passed
@bchocho bchocho deleted the brian/debug-occasional-pos-failed branch February 14, 2024 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:run-forge-e2e-perf Run the e2e perf forge only
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants