Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Apply hotfix for inconsistent head #1639

Closed
wants to merge 6 commits into from

Conversation

paulhauner
Copy link
Member

@paulhauner paulhauner commented Sep 21, 2020

Issue Addressed

Proposed Changes

If we look at the function which persists fork choice and the canonical head to disk:

/// Persists the core `BeaconChain` components (including the head block) and the fork choice.
///
/// ## Notes:
///
/// In this function we first obtain the head, persist fork choice, then persist the head. We
/// do it in this order to ensure that the persisted head is always from a time prior to fork
/// choice.
///
/// We want to ensure that the head never out dates the fork choice to avoid having references
/// to blocks that do not exist in fork choice.
pub fn persist_head_and_fork_choice(&self) -> Result<(), Error> {
let canonical_head_block_root = self
.canonical_head
.try_read_for(HEAD_LOCK_TIMEOUT)
.ok_or_else(|| Error::CanonicalHeadLockTimeout)?
.beacon_block_root;
let persisted_head = PersistedBeaconChain {
canonical_head_block_root,
genesis_block_root: self.genesis_block_root,
ssz_head_tracker: self.head_tracker.to_ssz_container(),
};
let fork_choice_timer = metrics::start_timer(&metrics::PERSIST_FORK_CHOICE);
let fork_choice = self.fork_choice.read();
self.store.put_item(
&Hash256::from_slice(&FORK_CHOICE_DB_KEY),
&PersistedForkChoice {
fork_choice: fork_choice.to_persisted(),
fork_choice_store: fork_choice.fc_store().to_persisted(),
},
)?;
drop(fork_choice);
metrics::stop_timer(fork_choice_timer);
let head_timer = metrics::start_timer(&metrics::PERSIST_HEAD);
self.store
.put_item(&Hash256::from_slice(&BEACON_CHAIN_DB_KEY), &persisted_head)?;
metrics::stop_timer(head_timer);
Ok(())
}

There is a race-condition which might cause the canonical head and fork choice values to be out-of-sync.

I believe this is the cause of #1616. I managed to recreate the issue and produce a database that was unable to sync under the master branch but able to sync with this branch.

These new changes solve the issue by ignoring the persisted canonical_head_block_root value and instead getting fork choice to generate it. This ensures that the canonical head is in-sync with fork choice.

Additional Info

This is hotfix method that leaves some crusty code hanging around. Once this PR is merged (to satisfy the v0.2.x users) we should later update and merge #1638 so we can have a clean fix for the v0.3.x versions.

Copy link
Member

@michaelsproul michaelsproul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just one change needed IMO.

It will be good to revisit this for v0.3.0, and again as part of our work to resolve concurrency issues like #1557

.map_err(|e| format!("DB error when reading genesis state: {:?}", e))?
.ok_or_else(|| "Genesis block not found in store".to_string())?;

let slot_clock = SystemTimeSlotClock::new(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should conjure a SystemTimeSlotClock from thin air here, as I think it will wreak havoc on the tests that use the testing slot clock. Instead I think we should use self.slot_clock, and patch up the existing couple of places where resume_from_db is called on the builder before slot_clock or testing_slot_clock (e.g. in beacon_chain/src/test_utils.rs)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah agree it's going to be bad for testing slot clock. The problem is that there's a chicken-and-egg problem; we need to know the genesis time (i.e., the genesis state) start the slot clock and we need the slot clock in order to learn the canonical head.

Presently those two actions are lumped together in one function and not generic across the slot clocks... I'll see what I can do...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I managed a decent solution (without changing the schema) that solves this. Let me know what you think :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Addresses the chicken-and-egg problem nicely!

@paulhauner paulhauner added the ready-for-review The code is ready for review label Sep 21, 2020
Copy link
Member

@michaelsproul michaelsproul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Approved conditional on the simulator being fixed

@paulhauner
Copy link
Member Author

@michaelsproul please let me know if you're good with 5bb065b :)

Copy link
Member

@michaelsproul michaelsproul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good 👍

@paulhauner
Copy link
Member Author

bors r+

bors bot pushed a commit that referenced this pull request Sep 22, 2020
## Issue Addressed

- Resolves #1616

## Proposed Changes

If we look at the function which persists fork choice and the canonical head to disk:

https://github.com/sigp/lighthouse/blob/1db8daae0c7bb34bf2e05644fa6bf313c2bea98e/beacon_node/beacon_chain/src/beacon_chain.rs#L234-L280

There is a race-condition which might cause the canonical head and fork choice values to be out-of-sync.

I believe this is the cause of #1616. I managed to recreate the issue and produce a database that was unable to sync under the `master` branch but able to sync with this branch.

These new changes solve the issue by ignoring the persisted `canonical_head_block_root` value and instead getting fork choice to generate it. This ensures that the canonical head is in-sync with fork choice.

## Additional Info

This is hotfix method that leaves some crusty code hanging around. Once this PR is merged (to satisfy the v0.2.x users) we should later update and merge #1638 so we can have a clean fix for the v0.3.x versions.
@bors bors bot changed the title Apply hotfix for inconsistent head [Merged by Bors] - Apply hotfix for inconsistent head Sep 22, 2020
@bors bors bot closed this Sep 22, 2020
michaelsproul added a commit that referenced this pull request Sep 28, 2020
Squashed commit of the following:

commit 74ed1be
Author: pawan <[email protected]>
Date:   Tue Sep 22 19:27:14 2020 +0530

    fmt

commit d61f956
Author: pawan <[email protected]>
Date:   Tue Sep 22 19:03:16 2020 +0530

    Fix merge issues

commit 282a973
Merge: 6759bd1 a97ec31
Author: pawan <[email protected]>
Date:   Tue Sep 22 18:30:10 2020 +0530

    Merge branch 'master' into directory-restructure

commit a97ec31
Author: Pawan Dhananjay <[email protected]>
Date:   Tue Sep 22 07:29:34 2020 +0000

    Subscribe to subnets an epoch in advance (#1600)

    ## Issue Addressed

    N/A

    ## Proposed Changes

    Subscibe to subnet an epoch in advance of the attestation slot instead of 4 slots in advance.

commit 7aceff4
Author: Michael Sproul <[email protected]>
Date:   Tue Sep 22 05:40:04 2020 +0000

    Add `safe_sum` and use it in state_processing (#1620)

    ## Issue Addressed

    Closes #1098

    ## Proposed Changes

    Add a `SafeArithIter` trait with a `safe_sum` method, and use it in `state_processing`. This seems to be the only place in `consensus` where it is relevant -- i.e. where we were using `sum` and the integer_arith lint is enabled.

    ## Additional Info

    This PR doesn't include any Clippy linting to prevent `sum` from being called. It seems there is no existing Clippy lint that suits our purpose, but I'm going to look into that and maybe schedule writing one as a lower-priority task.

    This theoretically _is_ a consensus breaking change, but it shouldn't impact Medalla (or any other testnet) because `slashings` shouldn't overflow!

commit 4fca306
Author: Michael Sproul <[email protected]>
Date:   Tue Sep 22 05:40:02 2020 +0000

    Update BLST, add force-adx support (#1595)

    ## Issue Addressed

    Closes #1504
    Closes #1505

    ## Proposed Changes

    * Update `blst` to the latest version, which is more portable and includes finer-grained compilation controls (see below).
    * Detect the case where a binary has been explicitly compiled with ADX support but it's missing at runtime, and report a nicer error than `SIGILL`.

    ## Known Issues

    * None. The previous issue with `make build-aarch64` (supranational/blst#27), has been resolved.

    ## Additional Info

    I think we should tweak our release process and our Docker builds so that we provide two options:

    Binaries:

    * `lighthouse`: compiled with `modern`/`force-adx`, for CPUs 2013 and newer
    * `lighthouse-portable`: compiled with `portable` for older CPUs

    Docker images:

    * `sigp/lighthouse:latest`: multi-arch image with `modern` x86_64 and vanilla aarch64 binary
    * `sigp/lighthouse:latest-portable`: multi-arch image with `portable` builds for x86_64 and aarch64

    And relevant Docker images for the releases (as per #1574 (comment)), tagged `v0.x.y` and `v0.x.y-portable`

commit d85d5a4
Author: Paul Hauner <[email protected]>
Date:   Tue Sep 22 04:45:15 2020 +0000

    Bump to v0.2.11 (#1645)

    ## Issue Addressed

    NA

    ## Proposed Changes

    - Bump version to v0.2.11
    - Run `cargo update`.

    ## Additional Info

    NA

commit bd39cc8
Author: Paul Hauner <[email protected]>
Date:   Tue Sep 22 02:06:10 2020 +0000

    Apply hotfix for inconsistent head (#1639)

    ## Issue Addressed

    - Resolves #1616

    ## Proposed Changes

    If we look at the function which persists fork choice and the canonical head to disk:

    https://github.com/sigp/lighthouse/blob/1db8daae0c7bb34bf2e05644fa6bf313c2bea98e/beacon_node/beacon_chain/src/beacon_chain.rs#L234-L280

    There is a race-condition which might cause the canonical head and fork choice values to be out-of-sync.

    I believe this is the cause of #1616. I managed to recreate the issue and produce a database that was unable to sync under the `master` branch but able to sync with this branch.

    These new changes solve the issue by ignoring the persisted `canonical_head_block_root` value and instead getting fork choice to generate it. This ensures that the canonical head is in-sync with fork choice.

    ## Additional Info

    This is hotfix method that leaves some crusty code hanging around. Once this PR is merged (to satisfy the v0.2.x users) we should later update and merge #1638 so we can have a clean fix for the v0.3.x versions.

commit 14ff385
Author: Pawan Dhananjay <[email protected]>
Date:   Tue Sep 22 01:12:36 2020 +0000

    Add trusted peers (#1640)

    ## Issue Addressed

    Closes #1581

    ## Proposed Changes

    Adds a new cli option for trusted peers who always have the maximum possible score.

commit 5d17eb8
Author: Michael Sproul <[email protected]>
Date:   Mon Sep 21 11:53:53 2020 +0000

    Update LevelDB to v0.8.6, removing patch (#1636)

    Removes our dependency on a fork of LevelDB now that skade/leveldb-sys#17 is merged

commit 1db8daa
Author: Age Manning <[email protected]>
Date:   Mon Sep 21 02:00:38 2020 +0000

    Shift metadata to the global network variables (#1631)

    ## Issue Addressed

    N/A

    ## Proposed Changes

    Shifts the local `metadata` to `network_globals` making it accessible to the HTTP API and other areas of lighthouse.

    ## Additional Info

    N/A

commit 7b97c4a
Author: Pawan Dhananjay <[email protected]>
Date:   Mon Sep 21 01:06:25 2020 +0000

    Snappy additional sanity checks (#1625)

    ## Issue Addressed

    N/A

    ## Proposed Changes

    Adds the following check from the spec

    > A reader SHOULD NOT read more than max_encoded_len(n) bytes after reading the SSZ length-prefix n from the header.

commit 371e1c1
Author: Paul Hauner <[email protected]>
Date:   Fri Sep 18 06:41:29 2020 +0000

    Bump version to v0.2.10 (#1630)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Bump crate version so we can cut a new release with the fix from #1629.

    ## Additional Info

    NA

commit a17f748
Author: Paul Hauner <[email protected]>
Date:   Fri Sep 18 05:14:31 2020 +0000

    Fix bad assumption when checking finalized descendant (#1629)

    ## Issue Addressed

    - Resolves #1616

    ## Proposed Changes

    Fixes a bug where we are unable to read the finalized block from fork choice.

    ## Detail

    I had made an assumption that the finalized block always has a parent root of `None`:

    https://github.com/sigp/lighthouse/blob/e5fc6bab485fa54d7e518b325f4eb9201ba5c6a1/consensus/fork_choice/src/fork_choice.rs#L749-L752

    This was a faulty assumption, we don't set parent *roots* to `None`. Instead we *sometimes* set parent *indices* to `None`, depending if this pruning condition is satisfied:

    https://github.com/sigp/lighthouse/blob/e5fc6bab485fa54d7e518b325f4eb9201ba5c6a1/consensus/proto_array/src/proto_array.rs#L229-L232

    The bug manifested itself like this:

    1. We attempt to get the finalized block from fork choice
    1. We try to check that the block is descendant of the finalized block (note: they're the same block).
    1. We expect the parent root to be `None`, but it's actually the parent root of the finalized root.
    1. We therefore end up checking if the parent of the finalized root is a descendant of itself. (note: it's an *ancestor* not a *descendant*).
    1. We therefore declare that the finalized block is not a descendant of (or eq to) the finalized block. Bad.

    ## Additional Info

    In reflection, I made a poor assumption in the quest to obtain a probably negligible performance gain. The performance gain wasn't worth the risk and we got burnt.

commit 49ab414
Author: Age Manning <[email protected]>
Date:   Fri Sep 18 02:05:36 2020 +0000

    Shift gossipsub validation (#1612)

    ## Issue Addressed

    N/A

    ## Proposed Changes

    This will consider all gossipsub messages that have either the `from`, `seqno` or `signature` field as invalid.

    ## Additional Info

    We should not merge this until all other clients have been sending empty fields for a while.

    See ethereum/consensus-specs#1981 for reference

commit 2074bec
Author: Age Manning <[email protected]>
Date:   Fri Sep 18 02:05:34 2020 +0000

    Gossipsub message id to shortened bytes (#1607)

    ## Issue Addressed

    ethereum/consensus-specs#2044

    ## Proposed Changes

    Shifts the gossipsub message id to use the first 8 bytes of the SHA256 hash of the gossipsub message data field.

    ## Additional Info

    We should merge this in once the spec has been decided on. It will cause issues with gossipsub scoring and gossipsub propagation rates (as we won't receive IWANT) messages from clients that also haven't made this update.

commit e5fc6ba
Author: Michael Sproul <[email protected]>
Date:   Mon Sep 14 10:58:15 2020 +0000

    Remove redundant decompression in process_deposit (#1610)

    ## Issue Addressed

    Closes #1076

    ## Proposed Changes

    Remove an extra unnecessary decompression of the deposit public key from `process_deposit`. The key is decompressed and used to verify the signature in `verify_deposit_signature`, making this initial decompression redundant.

    ## Additional Info

    This is _not_ a consensus-breaking change because keys which previously failed the early decompression check will not be found in the pubkey cache (they are invalid), and will be checked and rejected as part of `verify_deposit_signature`.

commit c9596fc
Author: Age Manning <[email protected]>
Date:   Sun Sep 13 23:58:49 2020 +0000

    Temporary Sync Work-Around (#1615)

    ## Issue Addressed

    #1590

    ## Proposed Changes

    This is a temporary workaround that prevents finalized chain sync from swapping chains. I'm merging this in now until the full solution is ready.

commit c6abc56
Author: Age Manning <[email protected]>
Date:   Fri Sep 11 02:33:36 2020 +0000

    Prevent large step-size parameters (#1583)

    ## Issue Addressed

    Malicious users could request very large block ranges, more than we expect. Although technically legal, we are now quadraticaly weighting large step sizes in the filter. Therefore users may request large skips, but not a large number of blocks, to prevent requests forcing us to do long chain lookups.

    ## Proposed Changes

    Weight the step parameter in the RPC filter and prevent any overflows that effect us in the step parameter.

    ## Additional Info

commit 7f1b936
Author: blacktemplar <[email protected]>
Date:   Fri Sep 11 01:43:15 2020 +0000

    ignore too early / too late attestations instead of penalizing them (#1608)

    ## Issue Addressed

    NA

    ## Proposed Changes

    This ignores attestations that are too early or too late as it is specified in the spec (see https://github.com/ethereum/eth2.0-specs/blob/v0.12.1/specs/phase0/p2p-interface.md#global-topics first subpoint of `beacon_aggregate_and_proof`)

commit 810de2f
Author: Daniel Schonfeld <[email protected]>
Date:   Fri Sep 11 01:43:13 2020 +0000

    Static testnet configs (#1603)

    ## Issue Addressed

    #1431

    ## Proposed Changes

    Added an archived zip file with required files manually

    ## Additional Info

    1) Used zip, instead of tar.gz to add a single dependency instead of two.
    2) I left the download from github code for now, waiting to hear if you'd like it cleaned up or left to be used for some tooling needs.

commit 0525876
Author: Pawan Dhananjay <[email protected]>
Date:   Fri Sep 11 00:52:27 2020 +0000

    Dial cached enr's before making subnet discovery query (#1376)

    ## Issue Addressed

    Closes #1365

    ## Proposed Changes

    Dial peers in the `cached_enrs` who aren't connected, aren't banned and satisfy the subnet predicate before making a subnet discovery query.

commit d79366c
Author: Age Manning <[email protected]>
Date:   Thu Sep 10 04:43:22 2020 +0000

    Prevent printing binary in RPC errors (#1604)

    ## Issue Addressed

    #1566

    ## Proposed Changes

    Prevents printing binary characters in the RPC error response from peers.

commit b19cf02
Author: Age Manning <[email protected]>
Date:   Thu Sep 10 03:51:06 2020 +0000

    Penalise bad peer behaviour (#1602)

    ## Issue Addressed

    #1386

    ## Proposed Changes

    Penalises peers in our scoring system that produce invalid attestations or blocks.

commit dfe5077
Author: Paul Hauner <[email protected]>
Date:   Thu Sep 10 00:24:41 2020 +0000

    Remove references to rust-docs (#1601)

    ## Issue Addressed

    - Resolves #897
    - Resolves #821

    ## Proposed Changes

    Removes references to the rust docs that we're no long maintaining.

    ## Additional Info

    NA

commit 0821e6b
Author: Paul Hauner <[email protected]>
Date:   Wed Sep 9 02:28:35 2020 +0000

    Bump version to v0.2.9 (#1598)

    ## Issue Addressed

    NA

    ## Proposed Changes

    - Bump version tags
    - Run `cargo update`

    ## Additional Info

    NA

commit 6759bd1
Author: pawan <[email protected]>
Date:   Tue Sep 8 12:39:22 2020 +0530

    Add `strict-slashing-protection` flag

commit 1644717
Author: pawan <[email protected]>
Date:   Fri Sep 4 16:46:43 2020 +0530

    Remove `get_default_base_dir`

commit e4783bf
Author: pawan <[email protected]>
Date:   Fri Sep 4 16:24:45 2020 +0530

    Address review comments

commit 89b832c
Author: pawan <[email protected]>
Date:   Thu Sep 3 00:30:53 2020 +0530

    Fix tests

commit 5940a52
Author: pawan <[email protected]>
Date:   Wed Sep 2 21:27:57 2020 +0530

    Print out path information in account manager

commit 5ca1341
Author: pawan <[email protected]>
Date:   Wed Sep 2 20:49:29 2020 +0530

    Use datadir flag for account manager

commit 8c82769
Author: pawan <[email protected]>
Date:   Wed Sep 2 20:10:44 2020 +0530

    Rename BASE_DIR_FLAG to WALLET_DIR_FLAG

commit ad34a0f
Author: pawan <[email protected]>
Date:   Wed Sep 2 18:17:45 2020 +0530

    Add validator-dir cli option

commit 04fd9f7
Merge: 0e3ccba 8718120
Author: pawan <[email protected]>
Date:   Wed Sep 2 14:53:27 2020 +0530

    Merge branch 'master' into directory-restructure

commit 0e3ccba
Author: pawan <[email protected]>
Date:   Wed Sep 2 14:40:36 2020 +0530

    Minor fixes

commit ad22178
Author: pawan <[email protected]>
Date:   Tue Sep 1 19:14:44 2020 +0530

    Migration for ValidatorDefinitions

commit 1022f58
Author: pawan <[email protected]>
Date:   Tue Sep 1 16:02:07 2020 +0530

    Fix lints and udeps

commit 4ddbbfe
Author: pawan <[email protected]>
Date:   Tue Sep 1 12:56:12 2020 +0530

    Remove more hardocoded values

commit 4170e4c
Author: pawan <[email protected]>
Date:   Tue Sep 1 12:50:43 2020 +0530

    Remove some hardcoded values

commit 5c46823
Merge: ae533f3 8301a98
Author: pawan <[email protected]>
Date:   Tue Sep 1 12:02:35 2020 +0530

    Merge branch 'master' into directory-restructure

commit ae533f3
Author: pawan <[email protected]>
Date:   Mon Aug 17 20:47:20 2020 +0530

    Docs fixes

commit fe13b1b
Author: pawan <[email protected]>
Date:   Mon Aug 17 19:58:09 2020 +0530

    lint

commit e4a7e8c
Merge: 01e6166 9a97a0b
Author: pawan <[email protected]>
Date:   Mon Aug 17 19:47:27 2020 +0530

    Merge branch 'master' into directory-restructure

commit 01e6166
Author: pawan <[email protected]>
Date:   Mon Aug 17 19:23:57 2020 +0530

    All binaries get directory info from common::directory crate

commit 8a7c130
Author: pawan <[email protected]>
Date:   Mon Aug 17 17:18:44 2020 +0530

    Add migration code for restructure

commit a4c33cd
Author: pawan <[email protected]>
Date:   Mon Aug 17 15:51:24 2020 +0530

    Fix bug in secrets dir

commit f4b8529
Author: pawan <[email protected]>
Date:   Mon Aug 17 13:05:28 2020 +0530

    Restrucure validator directory paths

commit 54b4ae6
Author: pawan <[email protected]>
Date:   Mon Aug 17 12:20:40 2020 +0530

    Restructure account_manager directory paths

commit 53892af
Author: pawan <[email protected]>
Date:   Thu Aug 13 19:43:44 2020 +0530

    Move get_testnet_dir to clap utils

commit d1d269a
Author: pawan <[email protected]>
Date:   Thu Aug 13 19:33:34 2020 +0530

    Move beacon directory
@paulhauner paulhauner deleted the hotfix-head branch October 5, 2020 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-review The code is ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Beacon node can not sync after a hard reset.
2 participants