
Exit with an error if mappings can’t be created for the chosen block #3242

Merged — 2 commits merged into main from exit-missing-map-block on Nov 25, 2024

Conversation

teor2345 (Member)

Replaces #3241.

If we can’t create mappings for the block chosen on the command line, we need to exit with an error.

Choosing the genesis block is useless because it has no mappings, and we often don’t have data for it in our store. So this PR changes mapping creation at the genesis block to start at block 1 instead.


Comment on lines 449 to 462
// If the user chooses an object mapping start block we don't have the data for, we can't
// create mappings for it, so the node must exit with an error.
let Some(last_archived_block) = client.block(last_archived_block_hash)? else {
    let mapping = if create_object_mappings {
        "mapping "
    } else {
        ""
    };
    let error = format!(
        "Missing data for {mapping}block {last_archived_block_number} hash {last_archived_block_hash}"
    );
    return Err(sp_blockchain::Error::Application(error.into()));
};
nazar-pc (Member)

What about checking this in initialize_archiver instead? It should be sufficient to query best_block_to_archive there.

teor2345 (Member Author)

This isn’t possible, because find_last_archived_block searches back through the segments until it finds a segment with a last archived block which is less than or equal to best_block_to_archive.

initialize_archiver doesn’t know about this logic, so it can’t check the actual last_archived_block_number found by find_last_archived_block.
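
For illustration, here is a minimal sketch of the backward search described above, with a hypothetical SegmentHeader type and field names standing in for the real subspace code (the names are assumptions, not the actual API):

```rust
/// Hypothetical, simplified stand-in for a segment header; only the field the
/// search cares about is modelled here.
struct SegmentHeader {
    last_archived_block_number: u32,
}

/// Walk segment headers from newest to oldest until one whose last archived block is at or
/// below `best_block_to_archive` is found; archiving resumes from that block.
/// Returns `None` if no segment qualifies (e.g. nothing has been archived yet).
fn find_last_archived_block_number(
    segment_headers: &[SegmentHeader], // assumed to be ordered oldest -> newest
    best_block_to_archive: u32,
) -> Option<u32> {
    segment_headers
        .iter()
        .rev()
        .find(|header| header.last_archived_block_number <= best_block_to_archive)
        .map(|header| header.last_archived_block_number)
}
```

The real find_last_archived_block then fetches that block's data from the client, which is where the missing-data error added in this PR is raised (see the diff hunk above).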

nazar-pc (Member)

Archiving has the invariant that from the tip of the chain (best block) we can traverse back to some block that allows us to initialize the archiver. By overriding best_block_to_archive with the block for which to create object mappings, we constrain where to start.

Looking carefully, we can see the (best_block_to_archive..best_block_number) iteration that steps closer to the tip to find blocks that actually exist (right now tip-100 may not actually exist in the database if we use Snap sync, and with future Snap sync we may have holes in the database).

One edge case that was problematic before was the genesis block, which you handled in #3247, but strictly speaking that was not necessary, since it is a special case of a more general problem. The exact same issue can happen if we prune blocks, since Substrate still never prunes headers (Shamil is working on fixing this in paritytech/polkadot-sdk#6451), so a quick check for block existence using the header in the above-mentioned loop will succeed while the block body is already gone.

So to handle both the genesis block case after Snap sync and this one, all you need to do is add an additional check after the loop that the block body exists. Since the only reason for it not to exist is incorrect user input for the object mapping creation CLI option, we can return a corresponding error. In all other cases node sync already maintains the mentioned invariants, hence there is no corresponding error handling in find_last_archived_block, and why I'm insisting that no changes need to be made there.

Moving the override of best_block_to_archive to after the loop breaks these invariants and simply moves its handling to a different place unnecessarily.
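
Here is a rough sketch of the suggested shape, with BlockClient as a hypothetical stand-in for the real blockchain client (the trait and method names are assumptions made for illustration, not the actual subspace-service API):

```rust
use std::fmt::Display;

/// Hypothetical client interface; only the lookups the check needs are modelled.
trait BlockClient {
    type Hash: Display;
    /// Block hash by number, if the block (header) is known.
    fn hash(&self, number: u32) -> Option<Self::Hash>;
    /// Whether the block body is still stored (headers may outlive bodies under pruning).
    fn has_body(&self, hash: &Self::Hash) -> bool;
}

fn resolve_mapping_start_block<C: BlockClient>(
    client: &C,
    mut best_block_to_archive: u32,
    best_block_number: u32,
    create_object_mappings: bool,
) -> Result<u32, String> {
    // Existing behaviour: step towards the tip until a block we actually have is found
    // (after Snap sync, tip - 100 may simply not be in the database).
    while best_block_to_archive < best_block_number
        && client.hash(best_block_to_archive).is_none()
    {
        best_block_to_archive += 1;
    }

    // Suggested additional check after the loop: the header may exist while the body is
    // already pruned, and the only way to hit that here is bad user input for the object
    // mapping CLI option, so return an error instead of failing later.
    if create_object_mappings {
        let hash = client
            .hash(best_block_to_archive)
            .ok_or_else(|| format!("Missing object mapping block {best_block_to_archive}"))?;
        if !client.has_body(&hash) {
            return Err(format!(
                "Missing body for object mapping block {best_block_to_archive} hash {hash}"
            ));
        }
    }

    Ok(best_block_to_archive)
}
```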

teor2345 (Member Author)

That makes sense, but I'm still concerned there's an edge case where the data for the chosen mapping block is present, but the block at the end of the segment before the chosen block has no data (which is what find_last_archived_block checks and can panic on).

But maybe that will never happen, and if it does we'll find it in testing.

And the user can fix that by only choosing blocks at the end of segments, or by wiping their node.

nazar-pc (Member)

There is actually one edge case here. We constrain pruning of state to 140k blocks because state pruning is decoupled from finalization (in contrast to pruning of block bodies), which means it is possible for the block body to be present while the state is no longer present.

We chose 140k blocks as the most conservative case for 1 segment worth of empty blocks; in practice blocks are typically not empty (they contain votes at least), so this will span more than 1 segment.

To handle this as well, you may add another check for state by doing a runtime API call (it might query the object mapping with an empty list of extrinsics, just to see whether it fails altogether); that will cover 100% of the necessary checks to the best of my knowledge.
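
A sketch of that extra check, with ObjectMappingApi as a hypothetical stand-in for the runtime API (the call name and signature are assumptions, not the real API):

```rust
use std::fmt::Display;

/// Hypothetical runtime-API surface: one call that extracts object mappings for a list of
/// extrinsics at a given block. Called with an empty list, it only exercises state access.
trait ObjectMappingApi {
    type Hash: Display;
    fn extract_block_object_mapping(
        &self,
        block_hash: &Self::Hash,
        extrinsics: Vec<Vec<u8>>,
    ) -> Result<(), String>;
}

/// Make a cheap runtime call at the chosen block; if it fails, the state has most likely
/// been pruned, so surface a user-facing error instead of failing later during archiving.
fn check_state_for_mappings<Api: ObjectMappingApi>(
    runtime_api: &Api,
    block_hash: &Api::Hash,
) -> Result<(), String> {
    runtime_api
        .extract_block_object_mapping(block_hash, Vec::new())
        .map_err(|error| {
            format!("State for object mapping block {block_hash} is unavailable (pruned?): {error}")
        })
}
```

(The actual state check was handled separately; see "Fixed in PR #3252" below.)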

teor2345 (Member Author)

Fixed in PR #3252

teor2345 enabled auto-merge on November 20, 2024 05:02
teor2345 (Member Author)

I’ve been working through the different cases here; it’s a bit trickier than I thought.

Ideally, if we’re doing what the user said on the command line, we should handle it this way:

  • starting at block 0: just initialise an empty archiver like we would if there was no cached state
  • starting anywhere in segment 0: empty archiver again
    • but how can we tell the segment from the block number? Look up segment 1 and compare with the last block number in the parent segment header? (and if there’s no segment 1, just start empty)

In those two cases, we’d never get to the new error case in find_last_archived_block(), because the empty archiver case follows a different code path.

And a block number in any other segment should just work, unless the block data has been pruned. In that case, we can either:

  1. return the new error added in this PR, and let the user decide what to do
  2. start with an empty archiver (which could cost a lot of network and CPU)

I’m not sure if we can do the check for the new error outside find_last_archived_block(), because we’re not starting with the data of the exact block number the user asked for on the command line. We’re starting with the last block in the last segment before that (and a few other conditions).

nazar-pc (Member)

> starting at block 0: just initialise an empty archiver like we would if there was no cached state

This is already happening; the only exception is if the node is started with the default --sync snap instead of --sync full, in which case the genesis state is missing, as described in paritytech/polkadot-sdk#5366.

> starting anywhere in segment 0: empty archiver again

Same as above: it should already work correctly with --sync full, so I don't think we need to change anything.

> start with an empty archiver (which could cost a lot of network and CPU)

We certainly don't want to do any networking for the purposes of returning object mappings; just return an error. The only exception is a fresh node, where we can take advantage of the target block support that Shamil introduced for domain snap sync, but I wouldn't go there unless we have an explicit request for such a feature.

> I’m not sure if we can do the check for the new error outside find_last_archived_block(), because we’re not starting with the data of the exact block number the user asked for on the command line. We’re starting with the last block in the last segment before that (and a few other conditions).

best_block_to_archive is an argument of find_last_archived_block; I see no reason why we can't check it before calling find_last_archived_block, unless I'm missing something.

teor2345 force-pushed the exit-missing-map-block branch from cc29d2b to 87521ee on November 21, 2024 07:16
teor2345 (Member Author) left a review comment:

Here’s where I got to today after checking the code in detail.


teor2345 requested a review from nazar-pc on November 21, 2024 07:17
teor2345 force-pushed the exit-missing-map-block branch from 87521ee to 7ae4ff8 on November 24, 2024 22:51
teor2345 force-pushed the exit-missing-map-block branch from 7ae4ff8 to 47c241b on November 24, 2024 22:53
nazar-pc (Member) left a review:
There is a state check to handle, but overall this looks good to me. Thanks for the patience.

Comment on lines +613 to +615
// There aren't any mappings in the genesis block, so starting there is pointless.
// (And causes errors on restart, because genesis block data is never stored during snap sync.)
best_block_to_archive = best_block_to_archive.min(block_number).max(1);
nazar-pc (Member)

The block is already constrained to be non-zero in create_object_mappings, so I'm not sure how much this is actually needed.

teor2345 (Member Author)

I think it’s still possible due to the saturating subtraction; see my comments in this thread:
#3242 (comment)

I'll follow up with a PR for the state check.
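
For illustration, a minimal sketch (with assumed variable names) of why the clamp still matters even though create_object_mappings rejects block 0:

```rust
fn clamp_mapping_start_block(
    best_block_number: u32,
    confirmation_depth_k: u32,
    requested_block: u32, // already validated to be non-zero by create_object_mappings
) -> u32 {
    // Near genesis the saturating subtraction can still yield 0 ...
    let best_block_to_archive = best_block_number.saturating_sub(confirmation_depth_k);
    // ... so the min with the (non-zero) requested block can also be 0, and the `.max(1)`
    // keeps mapping creation off the genesis block.
    best_block_to_archive.min(requested_block).max(1)
}

fn main() {
    // Best block 50, confirmation depth 100, mappings requested from block 10:
    // without `.max(1)` this would start at the genesis block.
    assert_eq!(clamp_mapping_start_block(50, 100, 10), 1);
    // Far from genesis the clamp is a no-op.
    assert_eq!(clamp_mapping_start_block(10_000, 100, 5_000), 5_000);
}
```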


teor2345 added this pull request to the merge queue on Nov 24, 2024
Merged via the queue into main with commit 54081b6 Nov 25, 2024
8 checks passed
teor2345 deleted the exit-missing-map-block branch on November 25, 2024 00:56
Labels: bug (Something isn't working)
2 participants