[table info][2/4] add utils for table info backup and restore and redesign the db read #11793
Conversation
I would say that cleanup shouldn't happen right away. Let's onboard the ecosystem before we delete the "old way".
The old way was not deleted; it remains exactly the same. We're just changing the way we read and write our new DB. Updated the PR description to make this more explicit. @bowenyang007
Force-pushed from 0d4590f to 7a778c7.
// SPDX-License-Identifier: Apache-2.0

use flate2::{read::GzDecoder, write::GzEncoder, Compression};
use std::{
Note: std operations, like std::thread::sleep, can block the executor thread if used within tokio tasks. They should be fine on std threads.
Yep, totally aware of that. In the next PR, I'm creating a standalone spawn_blocking task to separately handle the file operations and GCS upload, so they won't block other concurrent async tasks from running.
Oh, my point is that you can use tokio::fs / tokio::io if IO operations are needed in an async context. spawn_blocking is for CPU-intensive operations.
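To make the distinction concrete, here is a minimal sketch (hypothetical helpers, not code from this PR): tokio::fs for IO inside async tasks, and spawn_blocking reserved for CPU-bound work such as gzip compression.

```rust
use std::io::Write;

use flate2::{write::GzEncoder, Compression};

// IO in async context: tokio::fs yields to the runtime while waiting,
// so other tasks keep making progress on the same worker threads.
async fn read_snapshot(path: &str) -> std::io::Result<Vec<u8>> {
    tokio::fs::read(path).await
}

// CPU-bound work (e.g. compression) goes to the blocking pool so it
// doesn't stall the async executor threads.
async fn gzip_snapshot(bytes: Vec<u8>) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(&bytes)?;
        encoder.finish()
    })
    .await
    .expect("blocking task panicked")
}
```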
Force-pushed from 65faadd to bbaedb8.
Force-pushed from 7a778c7 to 7ea0b68.
Force-pushed from 755af28 to b0b60e9.
Force-pushed from b031aa8 to f184c78.
Force-pushed from 98bbcec to b1ca3ee.
Are we comfortable operating on the entire tarball in memory?
Force-pushed from 571c74f to 8c3352f.
The new commit addressed the memory concern by either writing to disk directly or using BufReader/BufWriter. Also added spawn_blocking when processing these sync IO and fs operations so they don't block the async runtime in general.
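Roughly the pattern described above, as a sketch; the paths, error handling, and use of the tar crate are illustrative rather than the exact fs_ops.rs code:

```rust
use std::{fs::File, io::{BufWriter, Write}};

use flate2::{write::GzEncoder, Compression};

// Stream the snapshot directory into a .tar.gz on disk instead of
// building the whole tarball in memory, and run the sync IO on the
// blocking pool so async tasks aren't stalled.
async fn backup_to_disk(snapshot_dir: String, out_path: String) -> anyhow::Result<()> {
    tokio::task::spawn_blocking(move || -> anyhow::Result<()> {
        let file = BufWriter::new(File::create(&out_path)?);
        let encoder = GzEncoder::new(file, Compression::default());
        let mut tar = tar::Builder::new(encoder);
        // Recursively append the directory; bytes flow straight to disk.
        tar.append_dir_all(".", &snapshot_dir)?;
        tar.into_inner()?.finish()?.flush()?;
        Ok(())
    })
    .await?
}
```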
Force-pushed from d7871fa to 2d80a56.
Force-pushed from 2d80a56 to 7e7dfb1.
✅ Forge suite
✅ Forge suite
* clean error log lines (#12019)
* [table info][2/4] add utils for table info backup and restore and redesign the db read (#11793)
* separate indexer async v2 db from aptosdb
* address comments
* add utils for table info backup and restore and redesign the db read
* address comments to spawn block sync file ops
* address comments
* tests for events and improve event v1 handling (#12012)
* [move-vm] Cache verified modules (#12002)
* [move-vm] Cache verified modules
* fixup! [move-vm] Cache verified modules
* [passkey] Add MAX_BYTES limit for signatures (#11697)
* [passkey] Add MAX_BYTES limit for signatures
* [passkey] Add tracing for AssertionSignature type and fix README
* [passkey] Rebased on latest main, rerun authenticator_regenerate.sh
* Object Code Deployment module with CLI commands generated (#11748)
* [simple] rename RG split in VmChangeSet flag (#12027)
* rename RG split in VmChangeSet flag (old name was stale, when charging was different)
* [fuzzing] fixes oss-fuzz FP and fuzz.sh (#12030)
* [fuzzing] fixes oss-fuzz FP and fuzz.sh
* Update Docker images (#12026) (Co-authored-by: sionescu)
* Update release.yaml (#12020)
* Update release.yaml
* enable REFUNDABLE_BYTES
* enable FairnessShuffler
* enable WEBAUTHN_SIGNATURE
* AIP-54 Object Code Deployment release addition
* enable vtxn and jwk consensus
* Update release.yaml: adding aggregators v2 flags, and updating execution onchain config
* add feature flag for zkID (ZK-only mode)
* fix jwk/zkid entries in release yaml 1.10 (#12024)
* update
* update
* Update release.yaml: fix flag name
* Update release.yaml: rename feature
  (Co-authored-by: aldenhu, hariria, John Chang, danielxiangzl, igor-aptos, Alin Tomescu, zhoujunma)
* Cherry-pick VM changes (#12021)
* [gas] add gas charges for type creation
* [gas-calibration] Add calibration sample
* [move-vm] Implement a per-frame cache for paranoid mode
* fixup! [move-vm] Implement a per-frame cache for paranoid mode
* fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode
* fixup! fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode
* fixup! fixup! fixup! fixup! [move-vm] Implement a per-frame cache for paranoid mode
* [gas] add gas charges for dependencies
  (Co-authored-by: Runtian Zhou)
* trivial doc fix
* [GHA] Upgrade actions/checkout to v4
* jwk ob counters (#12048)
* Revert "[GHA] Upgrade actions/checkout to v4" (reverts commit 04d078f)
* [CI][indexer] fix the e2e localnet. (#12047)
* fix the e2e localnet.
* fix the e2e localnet.
* bump latest gas feature version to 14 (also be conservative and leave legacy parameters in >14 versions for now; need to clean up after the REFUNDABLE_BYTES feature is actually enabled on all networks)
* compat test to be against the testnet tag
* [GHA] Upgrade lint-test.yaml and the dependent actions to checkout@v4 (actions/checkout@v4 doesn't behave well if both a workflow and an invoked action check out the source code on top of each other)
* [GHA] Update pin for tj-actions/changed-files
* start jwk consensus for google (#12053)
* [consensus] check rpc epoch in epoch_manager (#12018)
* [consensus] check rpc epoch in epoch_manager
* fix gas version (13 is deprecated/cannot be used) (#12064)
* FatalVMError shouldn't create "Delayed materialization code invariant" (#12044)
* Move all visibility checking into AST-level function_checker, simplify that code a bit, and improve diagnostics. (#11948)
* rust changes to move all visibility checking to AST and clean it up a bit
* change `Known attribute ... position` warning to a neater `Attribute .. position` warning
* add FunctionData id_loc to allow pointing at function name in declaration for more concise error messages; abstract messages a bit in function_checker
* add 'inlined from' labels to diagnostics with labels, fix bug in function_checker to enable post-inlining visibility checking
* lint
* fix for small stakes
* assert
  (Co-authored-by: igor-aptos, jill, George Mitenkov, runtianz, Andrew Hariri, John Chang, Gerardo Di Giacomo, sionescu, Junkil Park, aldenhu, danielxiangzl, Alin Tomescu, Victor Gao, Stelian Ionescu, larry-aptos, Balaji Arun, Brian R. Murphy)
This PR
Overview
[1/4] separate indexer async v2 db from aptosdb: #11799
[2/4] add gcs operational utils to set up for backup and restore to gcs: #11793
[3/4] add epoch based backup logic: #11794
[4/4] add db snapshot restore logic: #11795
Goal
https://www.notion.so/aptoslabs/Internal-Indexer-Deprecation-Next-Steps-824d5af7f16a4ff3aeccc4a4edee2763?pvs=4
Context
This effort is broken down into two parts:
Detailed Changes
Tradeoffs
backup based on epoch or transaction version? what frequency?
Pros of backup using version:
Cons of backup using version:
Pros of backup using epoch:
Cons of backup using epoch:
Decided to use epoch: on testnet we have a little over 10k epochs, and dividing the total transaction count by that gives roughly 70k txns per epoch on average, which is about the backup frequency we'd like.
when to restore
Decided to restore when both conditions are met:
- the version difference between the local DB and the latest version exceeds version_diff
- enough time has passed since the last restore attempt (RESTORE_TIME_DIFF_SECS)

This is to prevent the fullnode from crashlooping and constantly trying to restore without luck; and when the version difference is not that big, we don't need to spam the GCS service but can instead state sync directly from that close-to-head version.
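A sketch of that gating logic; the constant names and thresholds are hypothetical, not the PR's actual values:

```rust
use std::time::{Duration, Instant};

// Hypothetical thresholds; the real values would come from config.
const VERSION_DIFF_THRESHOLD: u64 = 1_000_000;
const RESTORE_TIME_DIFF_SECS: u64 = 600;

fn should_restore(local_version: u64, latest_version: u64, last_restore: Instant) -> bool {
    let version_diff = latest_version.saturating_sub(local_version);
    // Only restore when we're far behind AND we haven't just tried,
    // so a crashlooping fullnode doesn't spam GCS; small gaps are
    // left to state sync instead.
    version_diff > VERSION_DIFF_THRESHOLD
        && last_restore.elapsed() >= Duration::from_secs(RESTORE_TIME_DIFF_SECS)
}
```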
structure of the gcs bucket
I followed a similar structure to the indexer's filestore: we keep a metadata.json file in the bucket to track the chain ID and the newest backed-up epoch, plus a files folder holding all the epoch-based backups. Each DB snapshot is first tarred from its folder, then gzipped to get the best compression possible. Per Larry's point, alternative compression such as bzip2 is less performant.
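As a sketch, the bucket layout and metadata file might look like the following; the field names are illustrative, since the PR only states that the chain id and the newest backed-up epoch are tracked:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical shape of the bucket's metadata.json.
#[derive(Serialize, Deserialize)]
struct BackupMetadata {
    chain_id: u64,
    latest_backed_up_epoch: u64,
}

// Layout sketch:
//   gs://<bucket>/metadata.json
//   gs://<bucket>/files/<epoch>.tar.gz   (one full db snapshot per epoch)
```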
threads
Using a separate thread for backup only; based on past experience, a GCS upload can be as slow as minutes.
gcs pruning
There are a couple of options we could pursue, given that each backup file is a full DB backup:
Decided to go with GCS's own lifecycle policy, with the proper configuration set up in the GCS deployment. The reasoning: deploying and maintaining another service is overhead and costs more money, especially when that service's responsibility is so narrow; writing our own code for GCS object deletion is not ideal, since we'd be both writing and deleting and would need to handle many edge cases; and constantly rewriting the same file definitely won't work, since GCS has a strict write limit on a single object of once per second.
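For illustration, a GCS lifecycle rule along these lines could handle the pruning; the 30-day retention is an assumed value, not something specified in the PR:

```json
{
  "lifecycle": {
    "rule": [
      {
        "action": { "type": "Delete" },
        "condition": { "age": 30 }
      }
    ]
  }
}
```

Applied once at deployment time (e.g. `gsutil lifecycle set lifecycle.json gs://<backup-bucket>`), so no extra pruning service or deletion code is needed.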
Test Plan
Concerns
There's a bottleneck on the size of the DB snapshot. Currently this DB on testnet is around 250 MB, and given the nature of the data, compression could get it down to 50-150 MB per snapshot. That's still large enough that uploading to GCS is too slow.
TODO
E2E test
from rustie:
1. Spins up a fullnode based on the latest build (the latest main nightly, for instance).
2. We can quit immediately after verifying that the restore was successful.
3. The cost should be manageable, assuming that the restore process is quick.
integration test
load test
A couple of things I want to verify with load testing: