Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consensus: fast-sync mode: Skip dead reckoning #457

Merged
merged 19 commits into from
Oct 3, 2023

Conversation

mitjat
Copy link
Contributor

@mitjat mitjat commented Jun 22, 2023

This PR makes parallel analyzers (which have been possible to instantiate since #456) actually work. To achieve this, it implements "fast-sync" mode for block analyzers:

  • During it, the analyzer avoids producing SQL queries/updates whose effect will be overridden by parsing the genesis (for consensus) or by running item analyzers (for EVM runtimes); this mostly means it disables dead-reckoning.
  • Skipping dead-reckoning and similar updates also prevents the DB constraints from being violated even when blocks are processed out of order.

Getting this to work involved finding and plugging multiple holes. I only spun off two PRs; most other fixes are relatively small in terms of lines of code, so I recommend just reviewing this commit by commit. Explanations for individual commits follow, in order:

  • analyzer/consensus: fast-sync: reduce dead-reckoning, data fetching
  • analyzer/runtime: fast-sync: reduce dead-reckoning
    • This and the previous commit are the core of the PR; see PR-level description
  • analyzer/block: unlock blocks when terminating; do not work on expired blocks
    • Adds more explicit unlocking of any blocks that the analyzer locked but couldn't process before being shut down. I found this useful in local development, when recompiling+running frequently without wiping the db.
  • genesis: Use DELETE b/c TRUNCATE does not respect txs
    • There is no DB tx-level isolation for TRUNCATE. We use TRUNCATE in genesis processing. I don't think we ever run multiple of these in parallel, so this is a "just in case". Our tables are small enough that DELETE is not appreciably slower.
  • genesis: populate runtimes table before runtime_nodes, because the latter references the former`
    • Having the order the other way around can lead to violated foreign keys during genesis processing.
  • debug: Optionally dump genesis JSON
    • Was helpful for debugging node registry stuff, see below
  • genesis: StateToGenesis(height): Fetch nodes separately
    • StateToGenesis() is designed primarily with dump-and-restore in mind. So when it dumps nodes, it ignores everything but validator nodes because others "will reregister immediately [after the dump-restore upgrade] anyway", per Jernej. Our use case is a little different; we need the full state of nodes at height H after having run fast-sync up to H. This commit makes it so we fetch the full set of nodes with a separate RPC, in addition to StateToGenesis().
  • bugfix: analyzer/consensus: Reset nodes voting power
    • Bug in our consensus analyzer. We always fetch the full set of validators from the node. Then we need to update the voting power of the nodes in the RPC response (old code already does that), but we also have to update the voting power of all other validators to 0 (= this fix).
  • consensus/consensus,genesis: ignore expired-on-arrival node registrations
    • oasis-node can return outdated node registrations, both in node events and in StateToGenesis. In both cases, when queried at height H, it can report that a node has registered and the registration will expire at height H'<H. This commit ignores such registrations.

Post-review edit: Here are more changes I made after the review.
I believe they are non-controversial, so I'll (ab)use the LGTM and merge this PR pretty soon. Happy to address review comments in follow-up PRs though; there will be more work.
Additional changes/commits; trivial commits and commits addressing review feedback are skipped:

  • (bugfix) api: do not count accts with 0 balance among token holders
    • What it says. Discovered because fast-sync vs slow-sync produce a different set of accounts with 0 balance (as opposed to never creating DB rows for those accts)
  • analyzer/batch: If batch deadline expires, do not attempt further blocks
    • Mostly for ease of reading logs. Previously, on an expired batch context, the analyzer would try every remaining block, and have each of them fail deeper down the stack because of the expired context. Those failures spammed the logs.
  • analyzer/runtime: fast-sync: do not dead-reckon runtime_accounts.num_txs
  • analyzer/runtime: fast-sync: do not dead-reckon runtime_accounts.gas_for_calling
    • This and previous: These previously dead-reckoned values were causing a bunch of deadlock in Emerald when running with parallelism > 1 because some accts appear in most txs.
  • Remove chain.runtime_{withdraws,deposits} tables
    • They are unused, and will likely remain so. The explorer will use events.
  • bugfix: genesis: correctly insert proposal cancellations
    • Minor, we were inserting 5 empty to represent 4 null strings (a little ugly) and 1 null int (postgres error)

Testing:
e2e from #525 , with and without parallelism. A future test will be deploying this in staging. That's also when I plan to measure real-world speedups, and see how bad the DB contention is. Deployed in staging; indexed entire testnet (and found the "correctly insert proposal cancellations" bug), indexing of mainnet in progress (and prompted the reductions in runtime dead-reckoning).

@mitjat mitjat force-pushed the mitjat/fast-sync-consensus branch 3 times, most recently from 57e64bd to 810ce3c Compare September 26, 2023 00:28
@mitjat mitjat marked this pull request as ready for review September 26, 2023 00:54
@mitjat mitjat force-pushed the mitjat/fast-sync-consensus branch from 810ce3c to 2ce36a6 Compare September 27, 2023 19:45
Copy link
Member

@ptrus ptrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for nicely organizing the commits! Really made the changes easy to follow.

storage/oasis/consensus.go Show resolved Hide resolved
analyzer/consensus/consensus.go Show resolved Hide resolved
analyzer/runtime/runtime.go Outdated Show resolved Hide resolved
analyzer/queries/queries.go Outdated Show resolved Hide resolved
analyzer/consensus/consensus.go Outdated Show resolved Hide resolved
analyzer/consensus/genesis.go Outdated Show resolved Hide resolved
// A new node is registered.
if nodeEvent.IsRegistration && nodeEvent.Expiration >= currentEpoch {
// A new node is registered; the expiration check above is needed because oasis-node sometimes returns
// obsolete registration events, i.e. registrations that are already expired when they are produced.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, weird do you know why/when does this happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No :(
Tadej was vaguely aware that there's some sloppiness around registry data, for efficiency sake; but he was unsure of details.
I'll bring it up during demo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to :/

@mitjat mitjat force-pushed the mitjat/fast-sync-consensus branch 3 times, most recently from 0532bd0 to f585caa Compare October 3, 2023 03:17
@mitjat mitjat force-pushed the mitjat/fast-sync-consensus branch from f585caa to f99705d Compare October 3, 2023 04:21
@mitjat mitjat force-pushed the mitjat/fast-sync-consensus branch from f99705d to 5ae1199 Compare October 3, 2023 05:59
@mitjat
Copy link
Contributor Author

mitjat commented Oct 3, 2023

Thank you @ptrus !
I added some more changes; since they were related, I just piled them on ather than arbitrarily split the work into two PRs. See the update description. I plan to submit Tue morning my time, but can address comments later too.

@mitjat mitjat enabled auto-merge October 3, 2023 22:24
@mitjat mitjat merged commit c0b1933 into main Oct 3, 2023
6 checks passed
@mitjat mitjat deleted the mitjat/fast-sync-consensus branch October 3, 2023 22:29
@csillag csillag mentioned this pull request Dec 20, 2024
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants