[sui-tool] Introduce formal snapshot restore #13794

Merged (2 commits) on Oct 26, 2023

@williampsmith williampsmith commented Sep 15, 2023

Description

Extend the `sui-tool` snapshot downloader to download, verify, and restore from a formal snapshot. Note that if `--verify` is set to true, only protocol versions where `commit_root_state_digest` is true are eligible, because verification relies on the root state hash commitment at the end of the epoch.

The following tasks are orchestrated:

  • Performing checkpoint summary sync (with verification) to the end of the target epoch via the archival store
  • Downloading all snapshot object refs
  • Checksumming all object refs to verify that there is no discrepancy between the object store manifest and the contents
  • Accumulating all object refs and comparing the result against the consensus checkpoint commitment (root state hash). This protects against restoring from a compromised snapshot and ensures that the state after restore is consistent with the network
  • Downloading the end-of-epoch live object set contents from the snapshot and loading them into the perpetual store
  • Setting other critical state necessary for the node to start up and join the network (create committee store, create epoch start configuration, set checkpoint watermarks, etc.)
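
The checksumming step in the list above can be illustrated with a minimal sketch. This is not the `sui-tool` implementation (which is written in Rust); it assumes SHA-256 digests, a manifest that maps file names to expected digests, and hypothetical file names such as `part_0.ref`, all chosen only for illustration.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_checksum_mismatches(manifest: Dict[str, str], files: Dict[str, bytes]) -> List[str]:
    """Return the names of files that are missing or whose digest differs from the manifest."""
    mismatches = []
    for name, expected in sorted(manifest.items()):
        data = files.get(name)
        if data is None or sha256_hex(data) != expected:
            mismatches.append(name)
    return mismatches


# Hypothetical in-memory stand-ins for downloaded .ref files (assumption).
files = {
    "part_0.ref": b"object refs, partition 0",
    "part_1.ref": b"object refs, partition 1",
}
manifest = {name: sha256_hex(data) for name, data in files.items()}

# Untouched contents match the manifest.
clean = find_checksum_mismatches(manifest, files)

# Simulate a corrupted download of one partition.
corrupted = dict(files)
corrupted["part_1.ref"] = b"tampered"
bad = find_checksum_mismatches(manifest, corrupted)
```

A check like this only detects a mismatch between the manifest and the file contents. It does not by itself show that the snapshot agrees with the network, which is what the accumulation step and root state hash comparison are for.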

Test Plan

  1. Run the following to perform snapshot restore
GCS_SNAPSHOT_SERVICE_ACCOUNT_FILE_PATH=<path> AWS_ARCHIVE_ACCESS_KEY_ID=<key> AWS_ARCHIVE_SECRET_ACCESS_KEY=<key> AWS_ARCHIVE_REGION=us-west-2 sui-tool download-db-snapshot --epoch 125 --genesis /opt/sui/config/genesis.blob --formal --network testnet --path /opt/sui/db/authorities_db/full_node_db --num-parallel-downloads 50
  2. Start up sui-node and observe that the node is able to execute checkpoints successfully and ultimately reconfigure to the next epoch.
[00:07:01] ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 977 out of 977 .ref files done
[03:34:42] ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 10670050/10670050(Checkpoint summary download is complete)
[00:02:36] ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 977 out of 977 ref files checksummed (Checksumming complete)
[02:20:05] ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 977 out of 977 ref files accumulated (Accumulation complete)
[02:20:05] ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 977 out of 977 .obj files done (Objects download complete)
2023-10-11T02:35:15.060244Z  INFO sui_archival::reader: Terminating the manifest sync loop                                           
2023-10-11T02:35:15.060334Z  INFO sui_tool: Formal snapshot state verification complete!
2023-10-11T02:35:15.468872Z  INFO sui_storage::mutex_table: Stopping mutex table cleanup!
2023-10-11T02:35:15.510039Z  INFO sui_storage::mutex_table: Stopping mutex table cleanup!
2023-10-11T02:35:19.644477Z  INFO sui_tool: Successfully restored state from snapshot at end of epoch 125

ubuntu@fullnode-compat-test-03:/opt/sui/db/authorities_db/full_node_db$ systemctl status sui-node
● sui-node.service - Sui Node
     Loaded: loaded (/etc/systemd/system/sui-node.service; disabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-10-11 02:39:26 UTC; 6s ago
   Main PID: 344206 (sui-node)
      Tasks: 110 (limit: 308692)
     Memory: 1.1G (high: 246.0G max: 251.0G swap max: 0B available: 244.8G)
        CPU: 7.651s
     CGroup: /system.slice/sui-node.service
             └─344206 /opt/sui/bin/sui-node --config-path /opt/sui/config/sui-node.yaml

ubuntu@fullnode-compat-test-03:/opt/sui/db/authorities_db/full_node_db$ curl -s http://localhost:9184/metrics | grep 'current_epoch '
current_epoch 126
ubuntu@fullnode-compat-test-03:/opt/sui/db/authorities_db/full_node_db$ curl -s http://localhost:9184/metrics | grep 'last_executed_checkpoint '
last_executed_checkpoint 10678488
ubuntu@fullnode-compat-test-03:/opt/sui/db/authorities_db/full_node_db$ curl -s http://localhost:9184/metrics | grep 'last_executed_checkpoint '
last_executed_checkpoint 10679365

# after some time
ubuntu@fullnode-compat-test-03:/opt/sui/db/authorities_db/full_node_db$ curl -s http://localhost:9184/metrics | grep 'current_epoch '
current_epoch 129
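
The manual `curl | grep` checks above could be scripted along the following lines. This is a hedged sketch: it assumes the two metrics appear as plain unlabeled `name value` lines, as in the output shown, and the helper names `parse_metrics` and `node_is_progressing` are hypothetical.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse unlabeled `name value` lines from Prometheus-style text output."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Ignore labeled series and anything that is not exactly `name value`.
        if len(parts) != 2 or "{" in parts[0]:
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return metrics


def node_is_progressing(before: Dict[str, float], after: Dict[str, float]) -> bool:
    """True if checkpoints advanced and the epoch did not go backwards between two samples."""
    return (
        after["last_executed_checkpoint"] > before["last_executed_checkpoint"]
        and after["current_epoch"] >= before["current_epoch"]
    )


# Two samples taken from the values shown in the test plan output.
first = parse_metrics("current_epoch 126\nlast_executed_checkpoint 10678488\n")
second = parse_metrics("current_epoch 126\nlast_executed_checkpoint 10679365\n")
progressing = node_is_progressing(first, second)
```

In practice the two text samples would come from fetching `http://localhost:9184/metrics` twice with some delay in between, for example with `urllib.request.urlopen`.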

Type of Change (Check all that apply)

  • protocol change
  • user-visible impact
  • breaking change for a client SDKs
  • breaking change for FNs (FN binary must upgrade)
  • breaking change for validators or node operators (must upgrade binaries)
  • breaking change for on-chain data layout
  • necessitate either a data wipe or data migration

Release notes

@williampsmith williampsmith marked this pull request as ready for review September 15, 2023 18:05
@williampsmith williampsmith force-pushed the formal-snapshot-dl branch 2 times, most recently from 8e59d41 to 636a069 Compare October 13, 2023 19:37
## Description

- Optimize checkpoint summary sync + verification
  - Rather than blocking on verification during summary sync, which can
    be slow because it requires syncing in order, sync all checkpoint
    summaries first and then verify them locally.
  - This optimization reduces checkpoint summary sync and verification
    from 3.5 hours to ~20 minutes, as measured against the testnet
    `epoch_125` snapshot
- Optimize state accumulation
  - Compute partial accumulators in parallel, divide-and-conquer style
    (one per file partition), then union them
  - This speeds up accumulation from 3.2 hours to 20 minutes (for the
    same benchmark as above)
- Introduce early termination on snapshot verification failure
- Introduce `verbose` flag, which, when not set, sets log level to `off`
  for cleaner status output
- Factor out snapshot accumulation and object download/bulk-load for
  easier readability
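
The parallel accumulation described above can be sketched as follows. This is an illustration only, not the code in this PR: it swaps in a simple additive multiset hash (the sum of SHA-256 digests modulo 2^256) as a stand-in for the real accumulator, uses Python threads where the actual tool is Rust, and all function names are hypothetical. The property it relies on is that the combine operation is commutative and associative, so per-partition partial results can be computed independently and then unioned.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List

MODULUS = 1 << 256  # stand-in accumulator works modulo 2^256 (assumption)


def accumulate_partition(object_refs: Iterable[bytes]) -> int:
    """Fold one partition's object refs into a partial accumulator."""
    acc = 0
    for ref in object_refs:
        acc = (acc + int.from_bytes(hashlib.sha256(ref).digest(), "big")) % MODULUS
    return acc


def union(partials: Iterable[int]) -> int:
    """Combine partial accumulators; order does not matter."""
    total = 0
    for partial in partials:
        total = (total + partial) % MODULUS
    return total


def accumulate_parallel(partitions: List[List[bytes]], workers: int = 4) -> int:
    """Accumulate each partition on a worker, then union the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return union(pool.map(accumulate_partition, partitions))


def verify_root(partitions: List[List[bytes]], expected_root: int) -> None:
    """Fail immediately on mismatch so no later restore step runs."""
    if accumulate_parallel(partitions) != expected_root:
        raise ValueError("accumulated state does not match the expected root")


# Hypothetical object refs split into 8 partitions.
refs = [f"object-{i}".encode() for i in range(1000)]
partitions = [refs[i::8] for i in range(8)]

sequential = accumulate_partition(refs)
parallel = accumulate_parallel(partitions)
verify_root(partitions, sequential)  # passes silently when the roots agree
```

Raising on the first mismatch mirrors the early termination item in the list: a snapshot that fails verification is rejected before any objects are downloaded or loaded.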

## Test Plan

Ran formal snapshot restore from `sui-tool` and verified improvements

@williampsmith williampsmith merged commit 4306ea6 into main Oct 26, 2023
32 checks passed
@williampsmith williampsmith deleted the formal-snapshot-dl branch October 26, 2023 19:23
jonas-lj pushed a commit to jonas-lj/sui that referenced this pull request Nov 2, 2023