
[table info][2/4] add utils for table info backup and restore and redesign the db read #11793

Merged: 5 commits merged into main from jill/backup_restore_utils on Feb 14, 2024

Conversation

@jillxuu (Contributor) commented Jan 26, 2024

This PR:

  1. Adds GCS operator utils.

Overview

[1/4] separate indexer async v2 db from aptosdb: #11799
[2/4] add gcs operational utils to set up for backup and restore to gcs: #11793
[3/4] add epoch based backup logic: #11794
[4/4] add db snapshot restore logic: #11795

Goal

https://www.notion.so/aptoslabs/Internal-Indexer-Deprecation-Next-Steps-824d5af7f16a4ff3aeccc4a4edee2763?pvs=4

  1. Migrate the internal indexer off the AptosDB critical path and move it into its own standalone runtime service.
  2. Still provide the table info mapping to both the API and indexer services.
  3. Improve table info parsing performance to unblock the indexer's perf bottleneck.

Context

This effort is broken down into two parts:

  1. Part 1: [indexer-grpc-table-info] add table info parsing logic to indexer grpc fn #10783. Moves the table info service out of the critical path and converts it to multithreaded processing so requests are handled concurrently.
  2. Part 2 is this PR. We now have the table info service, but it will only be enabled on a handful of FNs, so when a new FN wants to join the network without syncing from genesis, it should be able to restore the db from a cloud service. To provide such a cloud service for others to download the db snapshot from, this PR focuses on two things: backup and restore. Backup is optional in the config, while the restore logic is built into the code.
(Before/after architecture screenshots.)

Detailed Changes

(Screenshot of the detailed changes.)

Tradeoffs

backup based on epoch or transaction version? what frequency?

Pros of backup by version:

  1. We have more control over the backup frequency, since the frequency is tunable.

Cons of backup by version:

  1. Overhead of managing and comparing backed-up versions against the currently processing versions.

Pros of backup by epoch:

  1. The when-to-backup logic is much cleaner and less error prone.

Cons of backup by epoch:

  1. Not very configurable, but still tunable by setting how many epochs behind to allow.

Decided to use epoch: on testnet we have a little over 10k epochs, and dividing total txns by the epoch count gives us on average 70k txns per epoch, which is about the frequency at which we'd like to back up.
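As a sketch, the epoch-based trigger reduces to a single comparison; the names and the `epochs_behind` knob below are illustrative, not the PR's actual code:

```rust
// Minimal sketch of the epoch-based "when to back up" check described above.
// `epochs_behind` is the tunable frequency knob (hypothetical name).
fn should_backup(current_epoch: u64, last_backed_up_epoch: u64, epochs_behind: u64) -> bool {
    current_epoch > last_backed_up_epoch + epochs_behind
}
```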

when to restore

Decided to restore when both conditions are met:

  1. The difference between the next version to be processed and the current version from the ledger is greater than a version_diff.
  2. The time difference between the last restored timestamp in the db and the current timestamp is greater than a RESTORE_TIME_DIFF_SECS.

This is to prevent a fullnode from crashlooping and constantly trying to restore without luck; when the version difference is not that big, we don't need to spam the GCS service and can instead state-sync directly from that close-to-head version.
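To make the gate concrete, here is a minimal sketch of the two-condition check. `version_diff` and `RESTORE_TIME_DIFF_SECS` are the names from the description; the function shape and the threshold value are assumptions for illustration:

```rust
// Assumed threshold for the sketch, not the actual configured value.
const RESTORE_TIME_DIFF_SECS: u64 = 600;

// Restore only when we are far behind AND we haven't just restored,
// so a crashlooping fullnode doesn't hammer GCS.
fn should_restore(
    next_version_to_process: u64,
    ledger_version: u64,
    last_restored_timestamp_secs: u64,
    now_secs: u64,
    version_diff: u64,
) -> bool {
    let version_gap = ledger_version.saturating_sub(next_version_to_process);
    let time_gap = now_secs.saturating_sub(last_restored_timestamp_secs);
    version_gap > version_diff && time_gap > RESTORE_TIME_DIFF_SECS
}
```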

structure of the gcs bucket

I followed a structure similar to the indexer's filestore: we keep a metadata.json file in the bucket to track the chain id and the newest backed-up epoch, and a files folder holds all the epoch-based backups. Each db snapshot is first packed from its folder into a tar file, then gzipped to compress it to the best size possible. Per larry's point, alternative compression such as bzip2 is less performant.
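For illustration, a minimal sketch of the tar-then-gzip step using the `tar` and `flate2` crates; the function name and the archive's internal prefix are assumptions, not the PR's actual code:

```rust
use flate2::{write::GzEncoder, Compression};
use std::fs::File;
use std::path::Path;

// Pack a db snapshot directory into a single `.tar.gz`:
// tar the folder first, then gzip the stream for the best size.
fn compress_snapshot(snapshot_dir: &Path, out_path: &Path) -> std::io::Result<()> {
    let out_file = File::create(out_path)?;
    let encoder = GzEncoder::new(out_file, Compression::default());
    let mut tar_builder = tar::Builder::new(encoder);
    // Store the whole snapshot folder under a stable prefix inside the archive.
    tar_builder.append_dir_all("snapshot", snapshot_dir)?;
    // Flush the tar stream, then finalize the gzip stream.
    tar_builder.into_inner()?.finish()?;
    Ok(())
}
```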

threads

Using a separate thread for backup only; based on past experience, a GCS upload can be as slow as minutes.
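As a sketch of this design, a dedicated backup thread fed by a channel keeps slow uploads off the processing path. `upload_snapshot_to_gcs` is a hypothetical placeholder, not an actual API from this PR:

```rust
use std::sync::mpsc;
use std::thread;

// Spawn the dedicated backup thread; callers send epochs to back up
// over the channel, so a minutes-long upload only blocks this thread.
fn spawn_backup_thread() -> mpsc::Sender<u64> {
    let (tx, rx) = mpsc::channel::<u64>();
    thread::spawn(move || {
        for epoch in rx {
            if let Err(e) = upload_snapshot_to_gcs(epoch) {
                eprintln!("backup of epoch {epoch} failed: {e}");
            }
        }
    });
    tx
}

// Placeholder for the real upload logic.
fn upload_snapshot_to_gcs(_epoch: u64) -> std::io::Result<()> {
    Ok(())
}
```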

gcs pruning

A couple of options we could pursue, since each backup file is a full db backup:

  1. Create another service to constantly clean up the backup files.
  2. Use GCS's own policy to delete files based on time and other conditions.
  3. Programmatically delete old files while uploading.
  4. Constantly write to the same file.

Decided to go with GCS's own policy, with the proper configuration set up in the GCS deployment. The reasoning: deploying and maintaining another service is overhead and costs more money, especially when that service's responsibility is so narrow; writing code for GCS object deletion is not ideal, since mixing writes and deletes means handling a multitude of edge cases; and constantly writing to the same file is definitely not going to work, since GCS has a strict write limit on a single object of once per second.
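For reference, a lifecycle rule of roughly this shape (GCS's documented JSON format, applied with e.g. `gsutil lifecycle set`) handles the deletion; the 30-day age condition here is illustrative, not the actual deployment value:

```json
{
  "rule": [
    {
      "action": { "type": "Delete" },
      "condition": { "age": 30 }
    }
  ]
}
```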

Test Plan

  1. Passes all the written unit tests on file system operations.
  2. Locally tested and verified backup and restore, as well as table info reads.

Concerns

There's a bottleneck on the size of the db snapshot. Currently this db on testnet is around 250MB; given the nature of the db, compression could get it to 50-150MB per snapshot. That's still big enough that uploading to GCS is too slow.

TODO

E2E test

from rustie:

  1. Set up a long-running fullnode in a new k8s project. We can use the same data-staging-us-central1 cluster but a new namespace, like indexer-fullnode-testnet-test or something. This isolates it from everything else while letting us use the same cluster for simplicity.
  2. Set up a job in the same namespace that does the backup to GCS.
  3. Set up a continuous job in aptos-core CI that:
    3.1. Spins up a fullnode based on the latest build, e.g. the latest main nightly.
    3.2. Quits immediately after verifying that the restore was successful.
    3.3. The cost should be manageable, assuming the restore process is quick.

integration test

load test

A couple of things I want to verify with load testing:

  1. With 10 FNs bootstrapping, does restore work for all of them?
  2. If FNs keep crashlooping, is GCS spammed (in terms of egress & ingress)?
  3. When the file gets bigger, will backup still work, and how long could it take?

trunk-io bot commented Jan 26, 2024

⏱️ 1h 25m total CI duration on this PR
| Job | Cumulative Duration | Recent Runs |
| --- | --- | --- |
| rust-unit-tests | 33m | 🟩 |
| windows-build | 26m | 🟩 |
| rust-lints | 9m | 🟩 |
| run-tests-main-branch | 7m | 🟥🟥 |
| check | 4m | 🟩 |
| general-lints | 2m | 🟩 |
| check-dynamic-deps | 2m | 🟩 |
| file_change_determinator | 23s | 🟩🟩 |
| semgrep/ci | 19s | 🟩 |
| file_change_determinator | 10s | 🟩 |
| permission-check | 7s | 🟩🟩 |
| permission-check | 6s | 🟩🟩 |
| permission-check | 6s | 🟩🟩 |
| permission-check | 5s | 🟩🟩 |

🚨 1 job on the last run was significantly faster/slower than expected

| Job | Duration | vs 7d avg | Delta |
| --- | --- | --- | --- |
| windows-build | 26m | 19m | +37% |


@jillxuu jillxuu requested a review from a team January 26, 2024 06:03
@jillxuu jillxuu marked this pull request as ready for review January 26, 2024 06:03
@bowenyang007 (Contributor) commented:
I would say that clean up shouldn't happen right away. Let's onboard the ecosystem before we delete the "old way"

@jillxuu (Contributor, Author) commented Jan 26, 2024

> I would say that clean up shouldn't happen right away. Let's onboard the ecosystem before we delete the "old way"

The old way was not deleted; it remains exactly the same. We're just changing the way we read and write our new db. Updated the PR description to make this more explicit. @bowenyang007

@jillxuu jillxuu changed the title [table info][1/3] add utils for table info backup and restore and redesign the db read [table info][2/4] add utils for table info backup and restore and redesign the db read Jan 26, 2024
@jillxuu jillxuu changed the base branch from main to jill/indexer_async_v2_db January 26, 2024 20:07
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch from 0d4590f to 7a778c7 Compare January 26, 2024 20:08
// SPDX-License-Identifier: Apache-2.0

use flate2::{read::GzDecoder, write::GzEncoder, Compression};
use std::{
Contributor:

note: std operations can block the main thread (like std::thread::sleep) if used within tokio tasks.

Should be fine with std threads.

Contributor Author:

Yep, totally aware of that. In the next PR I'm creating a standalone spawn_blocking thread to separately handle the file operations and GCS upload, so it won't block other concurrent async threads from running.

Contributor:

Oh, my point is that you can use tokio::fs/tokio::io if IO operations are needed in an async context.

spawn_blocking is for CPU-intensive operations.
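A minimal sketch contrasting the two suggestions; the file name and data are illustrative, not from the PR:

```rust
use tokio::io::AsyncWriteExt;

// Plain file IO in an async context: tokio::fs is fine to await directly.
async fn write_metadata(bytes: Vec<u8>) -> std::io::Result<()> {
    let mut file = tokio::fs::File::create("metadata.json").await?;
    file.write_all(&bytes).await?;
    Ok(())
}

// CPU-bound gzip compression: move it off the async worker threads.
async fn compress_off_runtime(raw: Vec<u8>) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        use flate2::{write::GzEncoder, Compression};
        use std::io::Write;
        let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(&raw)?;
        encoder.finish()
    })
    .await
    .expect("spawn_blocking task panicked")
}
```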

@jillxuu jillxuu force-pushed the jill/indexer_async_v2_db branch from 65faadd to bbaedb8 Compare February 8, 2024 00:36
@jillxuu jillxuu requested a review from grao1991 as a code owner February 8, 2024 00:36
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch from 7a778c7 to 7ea0b68 Compare February 8, 2024 01:20
@jillxuu jillxuu force-pushed the jill/indexer_async_v2_db branch from 755af28 to b0b60e9 Compare February 9, 2024 17:56
Base automatically changed from jill/indexer_async_v2_db to main February 12, 2024 20:07
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch 3 times, most recently from b031aa8 to f184c78 Compare February 13, 2024 00:42
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch 2 times, most recently from 98bbcec to b1ca3ee Compare February 13, 2024 00:45
@jillxuu jillxuu requested a review from larry-aptos February 13, 2024 00:49
@msmouse (Contributor) commented Feb 13, 2024

Are we comfortable operating the entire tarball in memory?

@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch 2 times, most recently from 571c74f to 8c3352f Compare February 14, 2024 18:47
@jillxuu jillxuu requested a review from msmouse February 14, 2024 18:48
@jillxuu (Contributor, Author) commented Feb 14, 2024

> Are we comfortable operating the entire tarball in memory?

The new commit addresses the memory concern: we either write to disk directly or use BufRead/BufWrite. It also adds spawn_blocking around these sync IO and fs operations so they don't block async threads in general.
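A minimal sketch of what such a streaming restore path could look like, reading the downloaded archive through a `BufReader` so the tarball is never held in memory as one buffer (names are assumptions, not the PR's actual code; wrap the call in spawn_blocking when invoked from async code):

```rust
use flate2::read::GzDecoder;
use std::fs::File;
use std::io::BufReader;
use std::path::Path;

// Unpack a downloaded `.tar.gz` snapshot straight to disk, streaming
// through buffered reads instead of materializing the whole archive.
fn unpack_snapshot(archive_path: &Path, target_dir: &Path) -> std::io::Result<()> {
    let file = BufReader::new(File::open(archive_path)?);
    let decoder = GzDecoder::new(file);
    let mut archive = tar::Archive::new(decoder);
    archive.unpack(target_dir)
}
```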

@jillxuu jillxuu requested a review from msmouse February 14, 2024 22:00
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch from d7871fa to 2d80a56 Compare February 14, 2024 22:58
@jillxuu jillxuu enabled auto-merge (squash) February 14, 2024 23:00
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch from 2d80a56 to 7e7dfb1 Compare February 14, 2024 23:13


Contributor

✅ Forge suite compat success on aptos-node-v1.8.3 ==> 7e7dfb10e4c50936306ad95d50b438c6fee98d56

Compatibility test results for aptos-node-v1.8.3 ==> 7e7dfb10e4c50936306ad95d50b438c6fee98d56 (PR)
1. Check liveness of validators at old version: aptos-node-v1.8.3
compatibility::simple-validator-upgrade::liveness-check : committed: 4749 txn/s, latency: 6800 ms, (p50: 6100 ms, p90: 9700 ms, p99: 24000 ms), latency samples: 175720
2. Upgrading first Validator to new version: 7e7dfb10e4c50936306ad95d50b438c6fee98d56
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1850 txn/s, latency: 15601 ms, (p50: 18700 ms, p90: 22000 ms, p99: 22300 ms), latency samples: 92500
3. Upgrading rest of first batch to new version: 7e7dfb10e4c50936306ad95d50b438c6fee98d56
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1809 txn/s, latency: 16188 ms, (p50: 18900 ms, p90: 22200 ms, p99: 22400 ms), latency samples: 94100
4. Upgrading second batch to new version: 7e7dfb10e4c50936306ad95d50b438c6fee98d56
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3272 txn/s, latency: 9686 ms, (p50: 9900 ms, p90: 18500 ms, p99: 19800 ms), latency samples: 127640
5. check swarm health
Compatibility test for aptos-node-v1.8.3 ==> 7e7dfb10e4c50936306ad95d50b438c6fee98d56 passed
Test Ok

Contributor

✅ Forge suite realistic_env_max_load success on 7e7dfb10e4c50936306ad95d50b438c6fee98d56

two traffics test: inner traffic : committed: 7101 txn/s, latency: 5353 ms, (p50: 5000 ms, p90: 6900 ms, p99: 12200 ms), latency samples: 3075100
two traffics test : committed: 100 txn/s, latency: 2236 ms, (p50: 2100 ms, p90: 2500 ms, p99: 6200 ms), latency samples: 1820
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.283, avg: 0.210", "QsPosToProposal: max: 0.250, avg: 0.191", "ConsensusProposalToOrdered: max: 0.612, avg: 0.578", "ConsensusOrderedToCommit: max: 0.486, avg: 0.459", "ConsensusProposalToCommit: max: 1.061, avg: 1.036"]
Max round gap was 1 [limit 4] at version 1316410. Max no progress secs was 4.614018 [limit 15] at version 1316410.
Test Ok

@jillxuu jillxuu merged commit 1a99f92 into main Feb 14, 2024
41 of 43 checks passed
@jillxuu jillxuu deleted the jill/backup_restore_utils branch February 14, 2024 23:48
danielxiangzl added a commit that referenced this pull request Feb 16, 2024
* clean error log lines (#12019)

* [table info][2/4] add utils for table info backup and restore and redesign the db read (#11793)

* separate indexer async v2 db from aptosdb

* address comments

* add utils for table info backup and restore and redesign the db read

* address comments to spawn block sync file ops

* address comments

* tests for events and improve event v1 handling (#12012)

* [move-vm] Cache verified modules (#12002)

* [move-vm] Cache verified modules

* fixup! [move-vm] Cache verified modules

* [passkey] Add MAX_BYTES limit for signatures (#11697)

* [passkey] Add MAX_BYTES limit for signatures

* [passkey] Add tracing for AssertionSignature type and fix README

* [passkey] Rebased on latest main, rerun authenticator_regenerate.sh

* Object Code Deployment module with CLI commands generated (#11748)

* [simple] rename RG split in VmChangeSet flag (#12027)

* rename RG split in VmChangeSet flag

old name was stale, when charging was different

* [fuzzing] fixes oss-fuzz FP and fuzz.sh (#12030)

* [fuzzing] fixes oss-fuzz FP and fuzz.sh

* Update Docker images (#12026)

Co-authored-by: sionescu <[email protected]>

* Update release.yaml (#12020)

* Update release.yaml

* enable REFUNDABLE_BYTES

* enable FairnessShuffler

* enable WEBAUTHN_SIGNATURE

* AIP-54 Object Code Deployment release addition

* enable vtxn and jwk consensus

* Update release.yaml

adding aggregators v2 flags, and updating execution onchain config

* add feature flag for zkID (ZK-only mode)

* fix jwk/zkid entries in release yaml 1.10 (#12024)

* update

* update

* Update release.yaml

fix flag name

* Update release.yaml

rename feature

---------

Co-authored-by: aldenhu <[email protected]>
Co-authored-by: hariria <[email protected]>
Co-authored-by: John Chang <[email protected]>
Co-authored-by: danielxiangzl <[email protected]>
Co-authored-by: igor-aptos <[email protected]>
Co-authored-by: Alin Tomescu <[email protected]>
Co-authored-by: zhoujunma <[email protected]>

* Cherry-pick VM changes (#12021)

* [gas] add gas charges for type creation

* [gas-calibration] Add calibration sample

* [move-vm] Implement a per-frame cache for paranoid mode

* fixup! [move-vm] Implement a per-frame cache for paranoid mode

* fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode

* fixup! fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode

* fixup! fixup! fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode

* [gas] add gas charges for dependencies

---------

Co-authored-by: Runtian Zhou <[email protected]>

* trivial doc fix

* [GHA] Upgrade actions/checkout to v4

* jwk ob counters (#12048)

* Revert "[GHA] Upgrade actions/checkout to v4"

This reverts commit 04d078f.

* [CI][indexer] fix the e2e localnet. (#12047)

* fix the e2e localnet.

* fix the e2e localnet.

* bump latest gas feature version to 14

Also be conservative and leave legacy parameters in >14 versions for
now. Need to clean up after REFUNDABLE_BYTES feature is actually enabled
on all networks.

* compat test to be against the testnet tag

* [GHA] Upgrade lint-test.yaml and the dependent actions to checkout@v4

actions/checkout@v4 doesn't behave well if both a workflow and an
invoked action checkout the source code on top of each other.

* [GHA] Update pin for tj-actions/changed-files

* start jwk consensus for google (#12053)

* [consensus] check rpc epoch in epoch_manager (#12018)

* [consensus] check rpc epoch in epoch_manager

* fix gas version (13 is deprecated/cannot be used) (#12064)

* FatalVMError shouldn't create "Delayed materialization code invariant" (#12044)

* Move all visibility checking into AST-level function_checker, simplify that code a bit, and improve diagnostics. (#11948)

* rust changes to move all visibility checking to AST and clean it up a bit
* change `Known attribute ... position` warning to a neater `Attribute .. position` warning
* add FunctionData id_loc to allow pointing at function name in declaration for more concise error messages. abstract messages a bit in function_checker
* add 'inlined from' labels to diagnostics with labels, fix bug in function_checker to enable post-inlining visibility checking

* lint

* fix for small stakes

* assert

---------

Co-authored-by: igor-aptos <[email protected]>
Co-authored-by: jill <[email protected]>
Co-authored-by: George Mitenkov <[email protected]>
Co-authored-by: runtianz <[email protected]>
Co-authored-by: Andrew Hariri <[email protected]>
Co-authored-by: John Chang <[email protected]>
Co-authored-by: Gerardo Di Giacomo <[email protected]>
Co-authored-by: sionescu <[email protected]>
Co-authored-by: Junkil Park <[email protected]>
Co-authored-by: aldenhu <[email protected]>
Co-authored-by: danielxiangzl <[email protected]>
Co-authored-by: Alin Tomescu <[email protected]>
Co-authored-by: Victor Gao <[email protected]>
Co-authored-by: Stelian Ionescu <[email protected]>
Co-authored-by: Stelian Ionescu <[email protected]>
Co-authored-by: larry-aptos <[email protected]>
Co-authored-by: Balaji Arun <[email protected]>
Co-authored-by: Brian R. Murphy <[email protected]>