
[table info][4/4] add db snapshot restore logic #11795

Closed
wants to merge 4 commits

Conversation

@jillxuu (Contributor) commented Jan 26, 2024

This PR

  1. add db snapshot restore logic

Overview

[1/4] separate indexer async v2 db from aptosdb : #11799
[2/4] add gcs operational utils to set up for backup and restore to gcs: #11793
[3/4] add epoch based backup logic: #11794
[4/4] add db snapshot restore logic: #11795

Goal

https://www.notion.so/aptoslabs/Internal-Indexer-Deprecation-Next-Steps-824d5af7f16a4ff3aeccc4a4edee2763?pvs=4

  1. Migrate the internal indexer off the AptosDB critical path and move it into its own standalone runtime service.
  2. Still provide the table info mapping to both API and indexer services.
  3. Improve table info parsing performance to unblock the indexer perf bottleneck.

Context

This effort is broken down into two parts:

  1. Part 1: [indexer-grpc-table-info] add table info parsing logic to indexer grpc fn #10783. Moves the table info service out of the critical path and converts it to multithreaded processing so requests are handled concurrently.
  2. Part 2 is this PR. Now that we have the table info service, only a handful of FNs will enable it, so when a new FN wants to join the network without syncing from genesis, it should be able to restore the db from a cloud service. To provide such a cloud service for others to download the db snapshot, this PR focuses on two things: backup and restore. Backup is optional in the config, while the restore logic is always built into the code.
(Screenshots: architecture before vs. after.)

Detailed Changes

(Screenshot: diagram of the detailed changes.)

Tradeoffs

Backup based on epoch or transaction version, and at what frequency?

Pros of backup by version:

  1. More control over backup frequency, since the frequency is directly tunable.

Cons of backup by version:

  1. Overhead of managing and comparing backed-up versions against the currently processed version.

Pros of backup by epoch:

  1. The when-to-backup logic is much cleaner and less error prone.

Cons of backup by epoch:

  1. Not as configurable, though still tunable by setting how many epochs behind to tolerate.

Decided to use epoch: on testnet we have a little over 10k epochs, and dividing total txns by that gives roughly 70k txns per epoch on average, which is about the backup frequency we want (see the sketch below).
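
A minimal sketch of what an epoch-based trigger could look like; the function and parameter names are illustrative, not the PR's actual API:

```rust
/// Illustrative epoch-based backup gate: back up once the chain has advanced
/// `backup_epoch_frequency` epochs past the last backed-up epoch.
fn should_backup(current_epoch: u64, last_backed_up_epoch: u64, backup_epoch_frequency: u64) -> bool {
    current_epoch >= last_backed_up_epoch.saturating_add(backup_epoch_frequency)
}

fn main() {
    assert!(should_backup(105, 100, 5)); // 5 epochs since last backup: back up
    assert!(!should_backup(104, 100, 5)); // not yet
}
```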

when to restore

Decided to restore when both conditions are met (a minimal sketch follows this list):

  1. the difference between the next version to be processed and the current ledger version is greater than a version_diff threshold, and
  2. the time difference between the last restored timestamp in the db and the current timestamp is greater than RESTORE_TIME_DIFF_SECS.

This prevents a fullnode from crashlooping and constantly retrying the restore without luck, and when the version difference is not that big we don't need to spam the GCS service; the node can state sync directly from that close-to-head version.
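
The sketch below shows the shape of that double gate; the threshold values and field names are placeholders, not the constants actually used in the PR:

```rust
// Placeholder thresholds; the real values live in the node config / code.
const VERSION_DIFF: u64 = 1_000_000;
const RESTORE_TIME_DIFF_SECS: u64 = 600;

/// Restore only when the node is far behind AND we haven't just restored.
fn should_restore(
    next_version_to_process: u64,
    ledger_version: u64,
    last_restored_timestamp_secs: u64,
    now_secs: u64,
) -> bool {
    let version_gap = ledger_version.saturating_sub(next_version_to_process);
    let time_gap = now_secs.saturating_sub(last_restored_timestamp_secs);
    version_gap > VERSION_DIFF && time_gap > RESTORE_TIME_DIFF_SECS
}

fn main() {
    // Far behind and last restore long ago -> restore.
    assert!(should_restore(0, 2_000_000, 0, 10_000));
    // Close to head -> just state sync instead of hitting GCS.
    assert!(!should_restore(1_999_000, 2_000_000, 0, 10_000));
}
```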

structure of the gcs bucket

I followed a structure similar to the indexer's filestore: a metadata.json file in the bucket tracks the chain id and the newest backed-up epoch, and a files folder holds all the epoch-based backups. Each db snapshot folder is first packed into a tar file and then gzipped to get the best size possible. Per Larry's point, alternative compression like bzip2 is less performant.
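
A minimal sketch of the tar-then-gzip step, assuming the `tar` and `flate2` crates; the actual helper in the PR may be named and wired differently:

```rust
use flate2::{write::GzEncoder, Compression};
use std::fs::File;

/// Pack the snapshot directory into a single .tar.gz file ready for upload.
fn compress_db_snapshot(db_dir: &str, output_path: &str) -> std::io::Result<()> {
    let tar_gz = File::create(output_path)?;
    let encoder = GzEncoder::new(tar_gz, Compression::best());
    let mut builder = tar::Builder::new(encoder);
    // Archive the whole snapshot folder under a stable top-level name.
    builder.append_dir_all("snapshot", db_dir)?;
    // Finish the tar archive, then flush the gzip stream.
    builder.into_inner()?.finish()?;
    Ok(())
}
```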

threads

Backup runs on its own separate thread, since based on past experience a GCS upload can be as slow as minutes.
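
A sketch of keeping the slow upload off the processing path; `upload_snapshot` here is a stand-in for the real GCS upload call:

```rust
use std::thread;
use std::time::Duration;

// Stand-in for the real (potentially minutes-long) GCS upload.
fn upload_snapshot(epoch: u64) {
    thread::sleep(Duration::from_secs(1));
    println!("uploaded backup for epoch {epoch}");
}

fn main() {
    // Running the upload on its own thread keeps table info processing moving
    // while the backup trickles up to GCS in the background.
    let backup_thread = thread::spawn(|| upload_snapshot(42));

    // ... continue processing transactions on the main path here ...

    backup_thread.join().unwrap();
}
```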

gcs pruning

Since each backup file is a full db backup, there are a couple of options we could pursue:

  1. create another service that constantly cleans up old backup files
  2. use GCS's own lifecycle policy to delete files based on age and other conditions
  3. programmatically delete old files while uploading
  4. constantly write to the same file

Decided to go with GCS's own lifecycle policy, configured properly in the GCS deployment. The reasoning: deploying and maintaining another service is overhead and costs more money, especially when that service's responsibility is so narrow; writing our own GCS object deletion code is not ideal, since mixing writes and deletes means handling many edge cases; and constantly writing to the same file definitely won't work, since GCS has a strict write limit of once per second on a single object.

Test Plan

  1. All written unit tests on file system operations pass.
  2. Locally tested and verified backup and restore, as well as table info reads.

Concerns

There's a bottleneck on the size of the db snapshot. On testnet this db is currently around 250 MB; given the nature of the db, compression gets it down to roughly 50-150 MB per snapshot. That is still big enough that uploading to GCS is slow.

TODO

E2E test

from rustie:

  1. Set up a long-running fullnode in a new k8s project. We can use the same data-staging-us-central1 cluster, but use a new namespace, like indexer-fullnode-testnet-test or something. This isolates it from everything else while still using the same cluster for simplicity.
  2. Set up a job in the same namespace that does the backup to GCS.
  3. Set up a continuous job in aptos-core CI that:
    3.1. Spins up a fullnode based on the latest build. This would be the latest main nightly, for instance.
    3.2. Quits immediately after verifying that the restore was successful.
    3.3. The cost should be manageable, assuming that the restore process is quick.

integration test

load test

A couple of things I want to verify with load testing:

  1. With 10 FNs bootstrapping, does restore work for all of them?
  2. If FNs keep crashlooping, is GCS spammed in terms of egress & ingress?
  3. As the file gets bigger, does backup still work, and how long does it take?


trunk-io bot commented Jan 26, 2024

⏱️ 1h 21m total CI duration on this PR
| Job | Cumulative Duration | Recent Runs |
| --- | --- | --- |
| windows-build | 32m | 🟥🟥 |
| rust-unit-tests | 13m | 🟥🟥 |
| rust-lints | 10m | 🟥🟥 |
| check | 8m | 🟩🟩 |
| run-tests-main-branch | 8m | 🟥🟥 |
| general-lints | 5m | 🟩🟩 |
| check-dynamic-deps | 4m | 🟩🟩 |
| semgrep/ci | 37s | 🟩🟩 |
| file_change_determinator | 26s | 🟩🟩 |
| file_change_determinator | 20s | 🟩🟩 |
| permission-check | 8s | 🟩🟩 |
| permission-check | 6s | 🟩🟩 |
| permission-check | 6s | 🟩🟩 |
| permission-check | 4s | 🟩🟩 |


@jillxuu jillxuu requested a review from a team January 26, 2024 06:04
@jillxuu jillxuu marked this pull request as ready for review January 26, 2024 06:04
},
};

let backup_restore_operator: Arc<GcsBackupRestoreOperator> = Arc::new(

Reviewer comment (Contributor):

why does it need to be an Arc?

// after reading db metadata info and deciding to restore, drop the db so that we could re-open it later
close_db(db);

sleep(Duration::from_millis(DB_OPERATION_INTERVAL_MS));

Reviewer comment (Contributor):

what do you see if you don't sleep?

use async sleep
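
For reference, a minimal illustration of the blocking vs. async sleep distinction the reviewer is pointing at, assuming a tokio runtime (the constant value is a placeholder, not the PR's real setting):

```rust
use std::time::Duration;

// Placeholder value; the real constant is defined in the PR.
const DB_OPERATION_INTERVAL_MS: u64 = 500;

#[tokio::main]
async fn main() {
    // std::thread::sleep would block the executor thread for the whole
    // interval; tokio::time::sleep yields back to the runtime instead.
    tokio::time::sleep(Duration::from_millis(DB_OPERATION_INTERVAL_MS)).await;
    println!("continued after a non-blocking sleep");
}
```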

.await;

// a different path to restore backup db snapshot to, to avoid db corruption
let restore_db_path = node_config

Reviewer comment (Contributor):

probably you wanna clean up the dest folder (deleting it)?

rename_db_folders_and_cleanup(&db_path, &tmp_db_path, &restore_db_path)
.expect("Failed to operate atomic restore in file system.");

sleep(Duration::from_millis(DB_OPERATION_INTERVAL_MS));

Reviewer comment (Contributor):

use async sleep

.expect("Failed to restore snapshot");

// Restore to a different folder and replace the target folder atomically
let tmp_db_path = db_root_path.join("tmp");

Reviewer comment (Contributor):

probably want to clean up target folder
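
For context, a sketch of the restore-to-temp-then-swap pattern that rename_db_folders_and_cleanup is doing, with illustrative names; the PR's actual helper may order the cleanup differently (e.g. also wiping the target folder up front, as suggested):

```rust
use std::{fs, io, path::Path};

/// Move the freshly restored snapshot into the live DB location, keeping the
/// old DB around until the swap has succeeded.
fn swap_in_restored_db(db_path: &Path, tmp_db_path: &Path, restored_db_path: &Path) -> io::Result<()> {
    // Park the live DB in a temporary location first, so a crash mid-swap
    // still leaves a recoverable copy on disk.
    fs::rename(db_path, tmp_db_path)?;
    // Promote the restored snapshot into the live path.
    fs::rename(restored_db_path, db_path)?;
    // Only now delete the old DB contents.
    fs::remove_dir_all(tmp_db_path)?;
    Ok(())
}
```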

}
}

pub fn last_restored_timestamp(self) -> u64 {

Reviewer comment (Contributor):

naming: expect_timestamp() sounds more right

pub mod table_info;

use aptos_schemadb::ColumnFamilyName;

pub const DEFAULT_COLUMN_FAMILY_NAME: ColumnFamilyName = "default";
/// TODO(jill): to be deleted once INDEXER_METADATA_V2_CF_NAME is deployed

Reviewer comment (Contributor):

note: you can't remove a column family from code unless you redo all the DB instances because RocksDB insists all existing CFs be mentioned in the open call. (but you can truncate the CF and rename the variable DEPRECATED_x)
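
To illustrate the constraint, a sketch using the rocksdb crate directly (aptos-core goes through its own schemadb wrapper, and the CF names here are illustrative): every column family that exists on disk must still be listed when opening the DB, even if it is deprecated.

```rust
use rocksdb::{Options, DB};

fn open_indexer_db(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    // RocksDB requires every existing CF to be passed at open time, so the
    // deprecated CF stays in this list (and can be truncated / renamed in
    // code to DEPRECATED_x) rather than being dropped.
    let cfs = [
        "default",
        "indexer_metadata",    // deprecated, but still present on disk
        "indexer_metadata_v2",
        "table_info",
    ];
    DB::open_cf(&opts, path, cfs)
}
```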


This issue is stale because it has been open 45 days with no activity. Remove the stale label, comment or push a commit - otherwise this will be closed in 15 days.

@github-actions github-actions bot added the Stale label Mar 30, 2024
@github-actions github-actions bot closed this Apr 14, 2024
@grao1991 grao1991 reopened this Aug 16, 2024
@github-actions github-actions bot removed the Stale label Aug 17, 2024

github-actions bot commented Oct 2, 2024

This issue is stale because it has been open 45 days with no activity. Remove the stale label, comment or push a commit - otherwise this will be closed in 15 days.

@github-actions github-actions bot added the Stale label Oct 2, 2024
@github-actions github-actions bot closed this Oct 17, 2024