
[table info][2/4] add utils for table info backup and restore and redesign the db read #11793

Merged: 5 commits merged into main from jill/backup_restore_utils on Feb 14, 2024

Conversation

@jillxuu (Contributor) commented Jan 26, 2024

This PR:

  1. Adds GCS operator utils.

Overview

[1/4] separate indexer async v2 db from aptosdb: #11799
[2/4] add gcs operational utils to set up for backup and restore to gcs: #11793
[3/4] add epoch based backup logic: #11794
[4/4] add db snapshot restore logic: #11795

Goal

https://www.notion.so/aptoslabs/Internal-Indexer-Deprecation-Next-Steps-824d5af7f16a4ff3aeccc4a4edee2763?pvs=4

  1. Migrate the internal indexer off the AptosDB critical path and move it into its own standalone runtime service.
  2. Still provide the table info mapping to both the API and indexer services.
  3. Improve table info parsing performance to unblock the indexer's perf bottleneck.

Context

This effort is broken down into two parts:

  1. Part 1: [indexer-grpc-table-info] add table info parsing logic to indexer grpc fn #10783. Moves the table info service out of the critical path and converts it to multithreaded processing so requests are handled concurrently.
  2. Part 2 is this PR. We now have the table info service, but it will only be enabled on a handful of FNs, so when a new FN wants to join the network without syncing from genesis, it should be able to restore the db from a cloud service. To provide such a cloud service for others to download the db snapshot from, this PR focuses on two things: backup and restore. Backup is optional in the config, while the restore logic is built into the code.
(Before/after architecture screenshots.)

Detailed Changes

(Screenshot of the detailed changes.)

Tradeoffs

backup based on epoch or transaction version? what frequency?

Pros of backup by version:

  1. We have more control over the backup frequency, since the frequency is tunable.

Cons of backup by version:

  1. Overhead of managing and comparing backed-up versions against the currently processing versions.

Pros of backup by epoch:

  1. The when-to-backup logic is much cleaner and less error prone.

Cons of backup by epoch:

  1. Not very configurable, but still tunable by setting how many epochs behind to allow.

Decided to use epoch: on testnet we have a little over 10k epochs, and dividing total txns by the epoch count gives us on average 70k txns per epoch, which is about the frequency at which we'd like to back up.
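As a sketch, the epoch-based trigger reduces to a single comparison; the names and the `epochs_behind` knob below are illustrative, not the PR's actual code:

```rust
// Minimal sketch of the epoch-based "when to back up" check described above.
// `epochs_behind` is the tunable frequency knob (hypothetical name).
fn should_backup(current_epoch: u64, last_backed_up_epoch: u64, epochs_behind: u64) -> bool {
    current_epoch > last_backed_up_epoch + epochs_behind
}
```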

when to restore

Decided to restore when both conditions are met:

  1. The difference between the next version to be processed and the current version from the ledger is greater than a version_diff.
  2. The time difference between the last restored timestamp in the db and the current timestamp is greater than a RESTORE_TIME_DIFF_SECS.

This is to prevent a fullnode from crashlooping and constantly trying to restore without luck; when the version difference is not that big, we don't need to spam the GCS service and can instead state-sync directly from that close-to-head version.
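To make the gate concrete, here is a minimal sketch of the two-condition check. `version_diff` and `RESTORE_TIME_DIFF_SECS` are the names from the description; the function shape and the threshold value are assumptions for illustration:

```rust
// Assumed threshold for the sketch, not the actual configured value.
const RESTORE_TIME_DIFF_SECS: u64 = 600;

// Restore only when we are far behind AND we haven't just restored,
// so a crashlooping fullnode doesn't hammer GCS.
fn should_restore(
    next_version_to_process: u64,
    ledger_version: u64,
    last_restored_timestamp_secs: u64,
    now_secs: u64,
    version_diff: u64,
) -> bool {
    let version_gap = ledger_version.saturating_sub(next_version_to_process);
    let time_gap = now_secs.saturating_sub(last_restored_timestamp_secs);
    version_gap > version_diff && time_gap > RESTORE_TIME_DIFF_SECS
}
```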

structure of the gcs bucket

I followed a structure similar to the indexer's filestore: we keep a metadata.json file in the bucket to track the chain id and the newest backed-up epoch, and a files folder holds all the epoch-based backups. Each db snapshot is first packed from its folder into a tar file, then gzipped to compress it to the best size possible. Per larry's point, alternative compression such as bzip2 is less performant.
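For illustration, a minimal sketch of the tar-then-gzip step using the `tar` and `flate2` crates; the function name and the archive's internal prefix are assumptions, not the PR's actual code:

```rust
use flate2::{write::GzEncoder, Compression};
use std::fs::File;
use std::path::Path;

// Pack a db snapshot directory into a single `.tar.gz`:
// tar the folder first, then gzip the stream for the best size.
fn compress_snapshot(snapshot_dir: &Path, out_path: &Path) -> std::io::Result<()> {
    let out_file = File::create(out_path)?;
    let encoder = GzEncoder::new(out_file, Compression::default());
    let mut tar_builder = tar::Builder::new(encoder);
    // Store the whole snapshot folder under a stable prefix inside the archive.
    tar_builder.append_dir_all("snapshot", snapshot_dir)?;
    // Flush the tar stream, then finalize the gzip stream.
    tar_builder.into_inner()?.finish()?;
    Ok(())
}
```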

threads

Using a separate thread for backup only; based on past experience, a GCS upload can be as slow as minutes.
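As a sketch of this design, a dedicated backup thread fed by a channel keeps slow uploads off the processing path. `upload_snapshot_to_gcs` is a hypothetical placeholder, not an actual API from this PR:

```rust
use std::sync::mpsc;
use std::thread;

// Spawn the dedicated backup thread; callers send epochs to back up
// over the channel, so a minutes-long upload only blocks this thread.
fn spawn_backup_thread() -> mpsc::Sender<u64> {
    let (tx, rx) = mpsc::channel::<u64>();
    thread::spawn(move || {
        for epoch in rx {
            if let Err(e) = upload_snapshot_to_gcs(epoch) {
                eprintln!("backup of epoch {epoch} failed: {e}");
            }
        }
    });
    tx
}

// Placeholder for the real upload logic.
fn upload_snapshot_to_gcs(_epoch: u64) -> std::io::Result<()> {
    Ok(())
}
```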

gcs pruning

A couple of options we could pursue, since each backup file is a full db backup:

  1. Create another service to constantly clean up the backup files.
  2. Use GCS's own policy to delete files based on time and other conditions.
  3. Programmatically delete old files while uploading.
  4. Constantly write to the same file.

Decided to go with GCS's own policy, with the proper configuration set up in the GCS deployment. The reasoning: deploying and maintaining another service is overhead and costs more money, especially when that service's responsibility is so narrow; writing code for GCS object deletion is not ideal, since mixing writes and deletes means handling a multitude of edge cases; and constantly writing to the same file is definitely not going to work, since GCS has a strict write limit on a single object of once per second.
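For reference, a lifecycle rule of roughly this shape (GCS's documented JSON format, applied with e.g. `gsutil lifecycle set`) handles the deletion; the 30-day age condition here is illustrative, not the actual deployment value:

```json
{
  "rule": [
    {
      "action": { "type": "Delete" },
      "condition": { "age": 30 }
    }
  ]
}
```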

Test Plan

  1. Passes all the written unit tests on file system operations.
  2. Locally tested and verified backup and restore, as well as table info reads.

Concerns

There's a bottleneck on the size of the db snapshot. Currently this db on testnet is around 250MB; given the nature of the db, compression could get it to 50-150MB per snapshot. That's still big enough that uploading to GCS is too slow.

TODO

E2E test

from rustie:

  1. Set up a long-running fullnode in a new k8s project. We can use the same data-staging-us-central1 cluster but a new namespace, like indexer-fullnode-testnet-test or something. This isolates it from everything else while letting us use the same cluster for simplicity.
  2. Set up a job in the same namespace that does the backup to GCS.
  3. Set up a continuous job in aptos-core CI that:
    3.1. Spins up a fullnode based on the latest build, e.g. the latest main nightly.
    3.2. Quits immediately after verifying that the restore was successful.
    3.3. The cost should be manageable, assuming the restore process is quick.

integration test

load test

A couple of things I want to verify with load testing:

  1. With 10 FNs bootstrapping, does restore work for all of them?
  2. If FNs keep crashlooping, is GCS spammed (in terms of egress & ingress)?
  3. When the file gets bigger, will backup still work, and how long could it take?

trunk-io bot commented Jan 26, 2024

⏱️ 1h 25m total CI duration on this PR
| Job | Cumulative Duration | Recent Runs |
| --- | --- | --- |
| rust-unit-tests | 33m | 🟩 |
| windows-build | 26m | 🟩 |
| rust-lints | 9m | 🟩 |
| run-tests-main-branch | 7m | 🟥🟥 |
| check | 4m | 🟩 |
| general-lints | 2m | 🟩 |
| check-dynamic-deps | 2m | 🟩 |
| file_change_determinator | 23s | 🟩🟩 |
| semgrep/ci | 19s | 🟩 |
| file_change_determinator | 10s | 🟩 |
| permission-check | 7s | 🟩🟩 |
| permission-check | 6s | 🟩🟩 |
| permission-check | 6s | 🟩🟩 |
| permission-check | 5s | 🟩🟩 |

🚨 1 job on the last run was significantly faster/slower than expected

| Job | Duration | vs 7d avg | Delta |
| --- | --- | --- | --- |
| windows-build | 26m | 19m | +37% |


@jillxuu jillxuu requested a review from a team January 26, 2024 06:03
@jillxuu jillxuu marked this pull request as ready for review January 26, 2024 06:03
@bowenyang007 (Contributor) commented:
I would say that clean up shouldn't happen right away. Let's onboard the ecosystem before we delete the "old way"

@jillxuu (Contributor, Author) commented Jan 26, 2024

> I would say that clean up shouldn't happen right away. Let's onboard the ecosystem before we delete the "old way"

The old way was not deleted; it remains exactly the same. We're just changing the way we read and write our new db. Updated the PR description to make this more explicit. @bowenyang007

@jillxuu jillxuu changed the title [table info][1/3] add utils for table info backup and restore and redesign the db read [table info][2/4] add utils for table info backup and restore and redesign the db read Jan 26, 2024
@jillxuu jillxuu changed the base branch from main to jill/indexer_async_v2_db January 26, 2024 20:07
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch from 0d4590f to 7a778c7 Compare January 26, 2024 20:08
// SPDX-License-Identifier: Apache-2.0

use flate2::{read::GzDecoder, write::GzEncoder, Compression};
use std::{
Contributor:

note: std operations can block the main thread (like std::thread::sleep) if used within tokio tasks.

Should be fine with std threads.

Contributor Author:

Yep, totally aware of that. In the next PR I'm creating a standalone spawn_blocking thread to separately handle the file operations and GCS upload, so it won't block other concurrent async threads from running.

Contributor:

Oh, my point is that you can use tokio::fs/tokio::io if IO operations are needed in an async context.

spawn_blocking is for CPU-intensive operations.
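A minimal sketch contrasting the two suggestions; the file name and data are illustrative, not from the PR:

```rust
use tokio::io::AsyncWriteExt;

// Plain file IO in an async context: tokio::fs is fine to await directly.
async fn write_metadata(bytes: Vec<u8>) -> std::io::Result<()> {
    let mut file = tokio::fs::File::create("metadata.json").await?;
    file.write_all(&bytes).await?;
    Ok(())
}

// CPU-bound gzip compression: move it off the async worker threads.
async fn compress_off_runtime(raw: Vec<u8>) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        use flate2::{write::GzEncoder, Compression};
        use std::io::Write;
        let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(&raw)?;
        encoder.finish()
    })
    .await
    .expect("spawn_blocking task panicked")
}
```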

@jillxuu jillxuu force-pushed the jill/indexer_async_v2_db branch from 65faadd to bbaedb8 Compare February 8, 2024 00:36
@jillxuu jillxuu requested a review from grao1991 as a code owner February 8, 2024 00:36
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch from 7a778c7 to 7ea0b68 Compare February 8, 2024 01:20
@jillxuu jillxuu force-pushed the jill/indexer_async_v2_db branch from 755af28 to b0b60e9 Compare February 9, 2024 17:56
Base automatically changed from jill/indexer_async_v2_db to main February 12, 2024 20:07
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch 3 times, most recently from b031aa8 to f184c78 Compare February 13, 2024 00:42
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch 2 times, most recently from 98bbcec to b1ca3ee Compare February 13, 2024 00:45
@jillxuu jillxuu requested a review from larry-aptos February 13, 2024 00:49
@msmouse (Contributor) commented Feb 13, 2024

Are we comfortable operating the entire tarball in memory?

@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch 2 times, most recently from 571c74f to 8c3352f Compare February 14, 2024 18:47
@jillxuu jillxuu requested a review from msmouse February 14, 2024 18:48
@jillxuu (Contributor, Author) commented Feb 14, 2024

> Are we comfortable operating the entire tarball in memory?

The new commit addresses the memory concern: we either write to disk directly or use BufRead/BufWrite. It also adds spawn_blocking around these sync IO and fs operations so they don't block async threads in general.
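A minimal sketch of what such a streaming restore path could look like, reading the downloaded archive through a `BufReader` so the tarball is never held in memory as one buffer (names are assumptions, not the PR's actual code; wrap the call in spawn_blocking when invoked from async code):

```rust
use flate2::read::GzDecoder;
use std::fs::File;
use std::io::BufReader;
use std::path::Path;

// Unpack a downloaded `.tar.gz` snapshot straight to disk, streaming
// through buffered reads instead of materializing the whole archive.
fn unpack_snapshot(archive_path: &Path, target_dir: &Path) -> std::io::Result<()> {
    let file = BufReader::new(File::open(archive_path)?);
    let decoder = GzDecoder::new(file);
    let mut archive = tar::Archive::new(decoder);
    archive.unpack(target_dir)
}
```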

@jillxuu jillxuu requested a review from msmouse February 14, 2024 22:00
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch from d7871fa to 2d80a56 Compare February 14, 2024 22:58
@jillxuu jillxuu enabled auto-merge (squash) February 14, 2024 23:00
@jillxuu jillxuu force-pushed the jill/backup_restore_utils branch from 2d80a56 to 7e7dfb1 Compare February 14, 2024 23:13


Contributor

✅ Forge suite compat success on aptos-node-v1.8.3 ==> 7e7dfb10e4c50936306ad95d50b438c6fee98d56

Compatibility test results for aptos-node-v1.8.3 ==> 7e7dfb10e4c50936306ad95d50b438c6fee98d56 (PR)
1. Check liveness of validators at old version: aptos-node-v1.8.3
compatibility::simple-validator-upgrade::liveness-check : committed: 4749 txn/s, latency: 6800 ms, (p50: 6100 ms, p90: 9700 ms, p99: 24000 ms), latency samples: 175720
2. Upgrading first Validator to new version: 7e7dfb10e4c50936306ad95d50b438c6fee98d56
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1850 txn/s, latency: 15601 ms, (p50: 18700 ms, p90: 22000 ms, p99: 22300 ms), latency samples: 92500
3. Upgrading rest of first batch to new version: 7e7dfb10e4c50936306ad95d50b438c6fee98d56
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1809 txn/s, latency: 16188 ms, (p50: 18900 ms, p90: 22200 ms, p99: 22400 ms), latency samples: 94100
4. Upgrading second batch to new version: 7e7dfb10e4c50936306ad95d50b438c6fee98d56
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3272 txn/s, latency: 9686 ms, (p50: 9900 ms, p90: 18500 ms, p99: 19800 ms), latency samples: 127640
5. check swarm health
Compatibility test for aptos-node-v1.8.3 ==> 7e7dfb10e4c50936306ad95d50b438c6fee98d56 passed
Test Ok

Contributor

✅ Forge suite realistic_env_max_load success on 7e7dfb10e4c50936306ad95d50b438c6fee98d56

two traffics test: inner traffic : committed: 7101 txn/s, latency: 5353 ms, (p50: 5000 ms, p90: 6900 ms, p99: 12200 ms), latency samples: 3075100
two traffics test : committed: 100 txn/s, latency: 2236 ms, (p50: 2100 ms, p90: 2500 ms, p99: 6200 ms), latency samples: 1820
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.283, avg: 0.210", "QsPosToProposal: max: 0.250, avg: 0.191", "ConsensusProposalToOrdered: max: 0.612, avg: 0.578", "ConsensusOrderedToCommit: max: 0.486, avg: 0.459", "ConsensusProposalToCommit: max: 1.061, avg: 1.036"]
Max round gap was 1 [limit 4] at version 1316410. Max no progress secs was 4.614018 [limit 15] at version 1316410.
Test Ok

@jillxuu jillxuu merged commit 1a99f92 into main Feb 14, 2024
41 of 43 checks passed
@jillxuu jillxuu deleted the jill/backup_restore_utils branch February 14, 2024 23:48
danielxiangzl added a commit that referenced this pull request Feb 16, 2024
* clean error log lines (#12019)

* [table info][2/4] add utils for table info backup and restore and redesign the db read (#11793)

* separate indexer async v2 db from aptosdb

* address comments

* add utils for table info backup and restore and redesign the db read

* address comments to spawn block sync file ops

* address comments

* tests for events and improve event v1 handling (#12012)

* [move-vm] Cache verified modules (#12002)

* [move-vm] Cache verified modules

* fixup! [move-vm] Cache verified modules

* [passkey] Add MAX_BYTES limit for signatures (#11697)

* [passkey] Add MAX_BYTES limit for signatures

* [passkey] Add tracing for AssertionSignature type and fix README

* [passkey] Rebased on latest main, rerun authenticator_regenerate.sh

* Object Code Deployment module with CLI commands generated (#11748)

* [simple] rename RG split in VmChangeSet flag (#12027)

* rename RG split in VmChangeSet flag

old name was stale, when charging was different

* [fuzzing] fixes oss-fuzz FP and fuzz.sh (#12030)

* [fuzzing] fixes oss-fuzz FP and fuzz.sh

* Update Docker images (#12026)

Co-authored-by: sionescu <[email protected]>

* Update release.yaml (#12020)

* Update release.yaml

* enable REFUNDABLE_BYTES

* enable FairnessShuffler

* enable WEBAUTHN_SIGNATURE

* AIP-54 Object Code Deployment release addition

* enable vtxn and jwk consensus

* Update release.yaml

adding aggregators v2 flags, and updating execution onchain config

* add feature flag for zkID (ZK-only mode)

* fix jwk/zkid entries in release yaml 1.10 (#12024)

* update

* update

* Update release.yaml

fix flag name

* Update release.yaml

rename feature

---------

Co-authored-by: aldenhu <[email protected]>
Co-authored-by: hariria <[email protected]>
Co-authored-by: John Chang <[email protected]>
Co-authored-by: danielxiangzl <[email protected]>
Co-authored-by: igor-aptos <[email protected]>
Co-authored-by: Alin Tomescu <[email protected]>
Co-authored-by: zhoujunma <[email protected]>

* Cherry-pick VM changes (#12021)

* [gas] add gas charges for type creation

* [gas-calibration] Add calibration sample

* [move-vm] Implement a per-frame cache for paranoid mode

* fixup! [move-vm] Implement a per-frame cache for paranoid mode

* fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode

* fixup! fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode

* fixup! fixup! fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode

* [gas] add gas charges for dependencies

---------

Co-authored-by: Runtian Zhou <[email protected]>

* trivial doc fix

* [GHA] Upgrade actions/checkout to v4

* jwk ob counters (#12048)

* Revert "[GHA] Upgrade actions/checkout to v4"

This reverts commit 04d078f.

* [CI][indexer] fix the e2e localnet. (#12047)

* fix the e2e localnet.

* fix the e2e localnet.

* bump latest gas feature version to 14

Also be conservative and leave legacy parameters in >14 versions for
now. Need to clean up after REFUNDABLE_BYTES feature is actually enabled
on all networks.

* compat test to be against the testnet tag

* [GHA] Upgrade lint-test.yaml and the dependent actions to checkout@v4

actions/checkout@v4 doesn't behave well if both a workflow and an
invoked action checkout the source code on top of each other.

* [GHA] Update pin for tj-actions/changed-files

* start jwk consensus for google (#12053)

* [consensus] check rpc epoch in epoch_manager (#12018)

* [consensus] check rpc epoch in epoch_manager

* fix gas version (13 is deprecated/cannot be used) (#12064)

* FatalVMError shouldn't create "Delayed materialization code invariant" (#12044)

* Move all visibility checking into AST-level function_checker, simplify that code a bit, and improve diagnostics. (#11948)

* rust changes to move all visibility checking to AST and clean it up a bit
* change `Known attribute ... position` warning to a neater `Attribute .. position` warning
* add FunctionData id_loc to allow pointing at function name in declaration for more concise error messages. abstract messages a bit in function_checker
* add 'inlined from' labels to diagnostics with labels, fix bug in function_checker to enable post-inlining visibility checking

* lint

* fix for small stakes

* assert

---------

Co-authored-by: igor-aptos <[email protected]>
Co-authored-by: jill <[email protected]>
Co-authored-by: George Mitenkov <[email protected]>
Co-authored-by: runtianz <[email protected]>
Co-authored-by: Andrew Hariri <[email protected]>
Co-authored-by: John Chang <[email protected]>
Co-authored-by: Gerardo Di Giacomo <[email protected]>
Co-authored-by: sionescu <[email protected]>
Co-authored-by: Junkil Park <[email protected]>
Co-authored-by: aldenhu <[email protected]>
Co-authored-by: danielxiangzl <[email protected]>
Co-authored-by: Alin Tomescu <[email protected]>
Co-authored-by: Victor Gao <[email protected]>
Co-authored-by: Stelian Ionescu <[email protected]>
Co-authored-by: Stelian Ionescu <[email protected]>
Co-authored-by: larry-aptos <[email protected]>
Co-authored-by: Balaji Arun <[email protected]>
Co-authored-by: Brian R. Murphy <[email protected]>