Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EpochRecover service event and transaction #420

Merged
merged 163 commits into from
Jul 29, 2024

Conversation

kc1116
Copy link
Contributor

@kc1116 kc1116 commented Apr 12, 2024

This PR updates the FlowEpoch smart contract to support recovering the network while in Epoch Fallback Mode. It adds a new service event EpochRecover which contains the metadata for the recovery epoch. This metadata is generated out of band using the bootstrap utility util epoch efm-recover-tx-args PR and submitted to the contract with the recovery_epoch.cdc transaction. The FlowEpoch contract will end the current epoch, start the recovery epoch and store the metadata for the recovery epoch in storage. This metadata will then be emitted to the network during the next heartbeat interval.

@joshuahannan
Copy link
Member

@kc1116 Can you base this off the stable-cadence branch instead of master since we wouldn't be doing the upgrade until after Cadence 1.0?

@kc1116 kc1116 changed the base branch from master to stable-cadence April 12, 2024 16:46
Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I have limited background about the epoch smart contracts. Though, I feel there are a bunch of important details in this PR that still need to be worked out. Overall, we are filling various fields with values according to the happy path. But I miss explanations why those values we use (e.g. epoch counter, number of views for various phase, etc) are also correct for the unhappy path.

The general bar for us as BFT-engineers is that we can explain for every line of code in our PR, why it needs to be precisely like we wrote it. If that reasoning isn't absolutely trivial, a comment in the code is highly desired. There were multiple places, where I got stuck just not understanding why it is correct to chose a value according to your code.

I would suggest to go over the code again with the mindset:

  • during EFM, the Protocol State controlling the nodes has diverged from the Epoch Smart contract. Algorithmically, they are irreconcilable. The Epoch Recovery is a human intervention to reconcile both states:

    • Due to the prior diversion, the happy path algorithm for moving to the next epoch doesn't apply to either states anymore. That is fine, because we are having humans force-override values via a governance transaction.
    • So here, we are at the point where the Epoch Smart contract received the human override (governance transaction). The human provides some values, others are need to be computed. But how precisely? Keep in mind that the Epoch Smart contract potentially went though multiple Epochs that do not exist in reality.
    • You are writing this code for processing the human override here. For every value relevant to epochs (modified or not) you need to have a argument essentially proving that (i) when the value is processed by the Protocol State, it brings the Protocol State back to the happy path and the resulting state it is compatible with all assumptions and consistency requirements of the Epoch Protocol State and (ii) when the value is processed by the Epoch Smart contract, the new value brings the Epoch Smart contract back to the happy path and the resulting state it is compatible with all assumptions and consistency requirements of the Epoch Smart contract.

    Literally, I would suggest to go over every epoch-related value in this PR. Try to answer questions (i) and (ii) and modify your algorithm until you feel confident that you have a bullet-proof argument. If in doubt, talk to Jordan.

This will take time and we expect it to take time. Don't feel pressured to crank PRs out. Correctness and good documentation is by far a higher priority in our field of work than development speed.

contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
Comment on lines 220 to 221
/// The last view (inclusive) of the staking phase.
pub let stakingEndView: UInt64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above you state that "This struct is a 1:1 copy of the event EpochRecover". Assuming you mean its cadence definition right above, there is an inconsistency because EpochRecover does not have the field stakingEndView

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overlooked unused var removed 381ca6f

contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
Comment on lines 583 to 586
// Compute the target end time for the next epoch
let timingConfig = FlowEpoch.getEpochTimingConfig()
let proposedTargetDuration = timingConfig.duration
let proposedTargetEndTime = timingConfig.getTargetEndTimeForEpoch(FlowEpoch.proposedEpochCounter())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have enough context, but I am worried that we might not be able to use the happy path timing config, depending on what we want the recovery epoch to do.
Even if it is ok to use the happy path EpochTimingConfig, I would like a comment that explains why this is safe and which assumptions we are making (i.e. under which conditions this code would need to change).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right we cannot use the EpochTimingConfig in the unhappy path (epoch recover) because we would have already added epoch extensions that will throw off the expected target end time, we may also want to tweak the duration for the recovery epoch. Once the recovery epoch is finished it is safe for subsequent epochs to use the timing config. I've remove this code and added target duration / end time as a parameter. a4943ec

// Create new Epoch metadata for the next epoch
// with the new values
let newEpochMetadata = EpochMetadata(
counter: FlowEpoch.currentEpochCounter + 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is correct. During EFM, the Protocol State formally remains in the same epoch, which is regularly being extended until we reach an EpochRecovery event. However, the Epoch Smart contract does not know this and behaves as if the epochs where just progressing on the happy path. So very likely, the Epoch smart contract and the Protocol State have a different notion of what the "next" epoch counter should be.

Same suggestion as before:
Even if it is ok to compute the epoch counter like this, I would like a comment that explains why this is safe and which assumptions we are making (i.e. under which conditions this code would need to change).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is safe to use the epoch counter in the smart contract because the counter is not incremented when we enter EFM state. Som, despite adding EpochExtensions the counter will not be incremented until the manual intervention via epoch recover transaction. Added a comment 4c0dcf7

contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
clusterQCs: [],
dkgKeys: dkgPubKeys)

FlowEpoch.saveEpochMetadata(newEpochMetadata)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we save the EpochMetadata and then change its fields? Is that safe? Can we avoid such confusion by maybe moving the saveEpochMetadata call further below, or add a comment (?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

furthermore, aren't we changing parameters for all upcoming epochs (assuming no further human intervention via governance transactions) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not changing the fields after we save the epoch metadata. The epoch metadata is saved and then epoch metadata along with other data (dkg views, cluster qcs etc) is emitted in the EpochRecover event. I've moved it to the end of the func for better clarity cbb2885 . Also once we intervene via the EpochRecover transaction and override the EpochMetadata upcoming epochs will increment the epoch counter as expected and override with the epoch metadata for that epoch. New epoch metadata is created in the upcoming epochs setup phase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not changing the fields after we save the epoch metadata

I think calculateAndSetRewards will actually set the reward amount field, after saving the epoch metadata. But this is following the same pattern as existing code. Maybe there's an opportunity to improve both codepaths.

Comment on lines 606 to 607
// Start a new Epoch, which increments the current epoch counter
FlowEpoch.startNewEpoch()
Copy link
Member

@AlexHentschel AlexHentschel Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Protocol State is the ultimate source of truth, because it controls what the network is actually doing.

Generally, when we process the Epoch Recovery transaction (governance transaction), I am not sure it is correct to start the new epoch right away. We should only start it once we reach the specified first view of the recovery transaction (I think ... but again not sure)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Protocol State is the ultimate source of truth

I don't completely agree with that mindset. It's true that the Protocol State is the source of truth for the network, but in the steady state (ignoring sporks for a moment) its state is populated by trusted service events from the smart contracts. Ultimately the smart contracts are determining t

When the system contract emits the EpochRecover event it can either:

  1. Immediately transition into the RecoveryEpoch
  2. Wait until the start view, then transition into the RecoveryEpoch

Intuitively (2) seems like the better option, but with the context that we are currently already in a bad state (EFM), I'm not fully convinced it is here. Neither are great options, but (1) is less complex and parallels how resetEpoch works.

@kc1116 kc1116 marked this pull request as ready for review April 16, 2024 16:53
@kc1116 kc1116 removed the request for review from durkmurder April 16, 2024 16:53
@kc1116 kc1116 requested a review from turbolent April 29, 2024 16:49
Copy link
Member

@turbolent turbolent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly reviewed the Cadence code from a code quality perspective. I can't say anything in regards to the business logic

contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
Comment on lines 748 to 990
DKGPhase1FinalView: proposedEpochMetadata.startView + self.configurableMetadata.numViewsInStakingAuction + self.configurableMetadata.numViewsInDKGPhase - 1 as UInt64,
DKGPhase2FinalView: proposedEpochMetadata.startView + self.configurableMetadata.numViewsInStakingAuction + (2 as UInt64 * self.configurableMetadata.numViewsInDKGPhase) - 1 as UInt64,
DKGPhase3FinalView: proposedEpochMetadata.startView + self.configurableMetadata.numViewsInStakingAuction + (3 as UInt64 * self.configurableMetadata.numViewsInDKGPhase) - 1 as UInt64,
dkgPhase1FinalView: proposedEpochMetadata.startView + self.configurableMetadata.numViewsInStakingAuction + self.configurableMetadata.numViewsInDKGPhase - 1 as UInt64,
dkgPhase2FinalView: proposedEpochMetadata.startView + self.configurableMetadata.numViewsInStakingAuction + (2 as UInt64 * self.configurableMetadata.numViewsInDKGPhase) - 1 as UInt64,
dkgPhase3FinalView: proposedEpochMetadata.startView + self.configurableMetadata.numViewsInStakingAuction + (3 as UInt64 * self.configurableMetadata.numViewsInDKGPhase) - 1 as UInt64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: Remove unnecessary casts and introduce variables for the repeated long expressions

Copy link
Contributor Author

@kc1116 kc1116 May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
/// gets into an EFM state the recoverEpoch transaction is used to override the current epoch and recover the network without
/// sporking. This service event is processed by protocol participants to update their local protocol state with the new recover
/// epoch data.
access(all) event EpochRecover (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a massive event. Are you sure it's OK to include so much data right into the event?

It also seems like this event has (almost?) the same contents as RecoverEpochMetadata?

nodeIDs: [String])
{
pre {
FlowEpoch.isValidPhaseConfiguration(stakingEndView-startView+1, FlowEpoch.configurableMetadata.numViewsInDKGPhase, endView-startView+1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it is only used here and not added, but maybe improve isValidPhaseConfiguration and add argument labels, it's very hard to see what the arguments passed to the function are, and if they are correct. They all seem to have the same type too, which makes it very easy to mix them up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that commit got lost at some point.

This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
contracts/epochs/FlowEpoch.cdc Show resolved Hide resolved
epochAddress flow.Address,
expectedRecover EpochRecover) {
var emittedEvent EpochRecoverEvent
evtTypeStr := fmt.Sprintf("A.%s.FlowEpoch.EpochRecover", epochAddress.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a common.AddressLocation and common.Location.TypeID(...) instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First set of comments

contracts/epochs/FlowClusterQC.cdc Outdated Show resolved Hide resolved
contracts/epochs/FlowClusterQC.cdc Outdated Show resolved Hide resolved
Comment on lines 120 to 122
dkgPhase1FinalView: UInt64,
dkgPhase2FinalView: UInt64,
dkgPhase3FinalView: UInt64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ I think this could be a breaking change (if not for us, then for clients). I'd suggest not changing the field name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 182 to 184
dkgPhase1FinalView: UInt64,
dkgPhase2FinalView: UInt64,
dkgPhase3FinalView: UInt64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would name these the same as EpochSetup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
Comment on lines 200 to 204
/// Contains metadata about the recovery epoch, this data
/// is stored at the storage path /storage/recoverEpochMetadata.
/// This struct is a 1:1 copy of the event EpochRecover event and is
/// used to populate all the fields of the event when the heartbeat
/// detects a new RecoverEpochMetadata stored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this and just immediately emit the event now, alongside onflow/flow-go#5828

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +584 to +565
/// sanity check all nodes should have a weight > 0
for nodeID in nodeIDs {
assert(
FlowIDTableStaking.NodeInfo(nodeID: nodeID).initialWeight > 0,
message: "all nodes in node ids list for recovery epoch must have a weight > 0"
)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this last week and decided to allow the consensus committee to differ from the set of DKG participants. See onflow/flow-go#6214.

This means that we can strictly require all nodes have weight > 0, as implemented here.

clusterQCs: [],
dkgKeys: dkgPubKeys)

FlowEpoch.saveEpochMetadata(newEpochMetadata)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not changing the fields after we save the epoch metadata

I think calculateAndSetRewards will actually set the reward amount field, after saving the epoch metadata. But this is following the same pattern as existing code. Maybe there's an opportunity to improve both codepaths.

Comment on lines 606 to 607
// Start a new Epoch, which increments the current epoch counter
FlowEpoch.startNewEpoch()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Protocol State is the ultimate source of truth

I don't completely agree with that mindset. It's true that the Protocol State is the source of truth for the network, but in the steady state (ignoring sporks for a moment) its state is populated by trusted service events from the smart contracts. Ultimately the smart contracts are determining t

When the system contract emits the EpochRecover event it can either:

  1. Immediately transition into the RecoveryEpoch
  2. Wait until the start view, then transition into the RecoveryEpoch

Intuitively (2) seems like the better option, but with the context that we are currently already in a bad state (EFM), I'm not fully convinced it is here. Neither are great options, but (1) is less complex and parallels how resetEpoch works.

Comment on lines 647 to 651
/// Calculate rewards for the current epoch
/// and set the payout for the next epoch
FlowEpoch.calculateAndSetRewards()
/// Start a new Epoch, which increments the current epoch counter
FlowEpoch.startNewEpoch()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that the EpochRecover will be rejected by the Protocol State, in which case we would need to send a new recoverEpoch transaction.

We should make sure this function is decoupled from reward payout so we can, if necessary, re-try this process without double-paying rewards.

@@ -258,6 +258,24 @@ access(all) contract FlowClusterQC {
}
}

/// Represents the quorum certificate vote data for a signer
/// of the certificate.
access(all) struct ClusterQCVoteData {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to match the current util command output, I think this structure needs to have the form:

struct ClusterQCVoteData {
  voteData: String // hex-encoded aggregated signature
  voterIDs: [String]
}

@kc1116 kc1116 changed the base branch from stable-cadence to master June 20, 2024 19:38
@kc1116 kc1116 changed the base branch from master to stable-cadence June 20, 2024 19:39
@kc1116 kc1116 changed the base branch from stable-cadence to master June 20, 2024 19:52
@kc1116 kc1116 changed the base branch from master to stable-cadence June 20, 2024 19:52
@kc1116 kc1116 changed the base branch from stable-cadence to master June 20, 2024 20:04
@kc1116 kc1116 changed the base branch from master to stable-cadence June 20, 2024 20:16
joshuahannan and others added 8 commits June 20, 2024 16:22
* update to view functions for stable cadence

* remove unreachable code

* update templates to Stable Cadence

* update to Stable Cadence preview 4

* update for stable cadence

* fix parse error

---------

Co-authored-by: Josh Hannan <[email protected]>
Co-authored-by: Bastian Müller <[email protected]>
@kc1116 kc1116 changed the base branch from master to feature/efm-recovery June 26, 2024 15:17
@kc1116 kc1116 requested a review from jordanschalm July 10, 2024 02:45
Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took another pass - my main change request is to make sure that the expected post-recovery epoch counter that we can compute in the efm-recover-tx-args utility is passed explicitly to the transaction. I think this is something we overlooked previously, but including it would eliminate a class of potential problems.

contracts/epochs/FlowClusterQC.cdc Outdated Show resolved Hide resolved
contracts/epochs/FlowEpoch.cdc Outdated Show resolved Hide resolved
contracts/epochs/FlowEpoch.cdc Show resolved Hide resolved
nodeIDs: [String])
{
pre {
FlowEpoch.isValidPhaseConfiguration(stakingEndView-startView+1, FlowEpoch.configurableMetadata.numViewsInDKGPhase, endView-startView+1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that commit got lost at some point.

This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Comment on lines +584 to +565
/// sanity check all nodes should have a weight > 0
for nodeID in nodeIDs {
assert(
FlowIDTableStaking.NodeInfo(nodeID: nodeID).initialWeight > 0,
message: "all nodes in node ids list for recovery epoch must have a weight > 0"
)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this last week and decided to allow the consensus committee to differ from the set of DKG participants. See onflow/flow-go#6214.

This means that we can strictly require all nodes have weight > 0, as implemented here.

// state with the new Epoch data.
// This transaction should only be used with the output of the bootstrap utility:
// util epoch efm-recover-tx-args
transaction(startView: UInt64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include the epochCounter of the RecoveryEpoch here and use it for safety checks in the recoverNewEpoch implementation.

In util epoch efm-recover-tx-args, we know exactly what the RecoveryEpoch epoch counter must be to be accepted by the Protocol State (current+1). If we don't pass that counter value here, then we're relying on the Protocol/SmartContract counters being synchronized in a particular way for correctness. In EFM, that is not guaranteed.

Currently we set the RecoveryEpoch counter as follows:

  • if initNewEpoch then RecoveryEpoch.counter = currentEpochCounter+1
  • if !initNewEpoch then RecoveryEpoch.counter = currentEpochCounter

I would change this to:

  • pass in recoveryEpochCounter as an argument
  • change initNewEpoch to be unsafeAllowOverwrite

In recoverNewEpoch, check that recoveryEpochCounter == currentEpochCounter+1.

  • If that passes, then continue
  • If that fails, then only continue if unsafeAllowOverwrite is set to true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusterQCs: nil,
dkgKeys: dkgPubKeys,
}
verifyEpochMetadata(t, b, env, expectedMetadata)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also validate that curEpochCounter = startEpochCounter+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expectedRecoverEvent.dkgPhase1FinalView = newStartView + numStakingViews + numDKGViews - 1
expectedRecoverEvent.dkgPhase2FinalView = newStartView + numStakingViews + (2 * numDKGViews) - 1
expectedRecoverEvent.dkgPhase3FinalView = newStartView + numStakingViews + (3 * numDKGViews) - 1
verifyEpochRecover(t, adapter, idTableAddress, expectedRecoverEvent)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also validate:

  • curEpochCounter = startEpochCounter+1
  • check epochMetadata in addition to the EpochRecover event

Copy link
Contributor Author

@kc1116 kc1116 Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// - All nodes in the node ids list have a weight > 0.
/// If the configuration is a valid configuration the the staking auction,
/// qc voting and dkg will be ended depending on the current epoch phase.
access(all) fun recoverEpochPreChecks(startView: UInt64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider renaming this function (or moving the latter half that stops voting, DKG, or the staking auction to a separate function).

To me, "sanity checks" implies that we are validating inputs but not changing state. Currently the function is both validating inputs and modifying state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be access(self) also?

@@ -1516,3 +1517,203 @@ func TestEpochReset(t *testing.T) {
assertEqual(t, CadenceUFix64("249999.99300000"), result)
})
}

func TestEpochRecover(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add documentation for the test here?

Currently, I think we are testing:

  • execute an epoch recover where recoveryEpochCounter == currentEpochCounter+1
  • execute an epoch recover where recoveryEpochCounter == currentEpochCounter

After we explicitly include the recoveryEpochCounter, I would also suggest testing the case where:

  • recoveryEpochCounter != currentEpochCounter+1 and the unsafeAllowOverwrite flag is not set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I follow the last point, if we have 2 separate funcs one to recover new epoch and one to recover current epoch we dont need to unsafeAllowOverwrite we dont pass unsafeAllowOverwrite to any of the FlowEpoch funcs we only use it in the transaction to decide which func recoverNewEpoch or recoverCurrentEpoch to call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we are passing in the recoveryEpochCounter, we must use that value as the epoch counter for the recovery epoch 100% of the time, otherwise we will produce an incompatible recovery epoch state. recoveryEpochCounter is the source of truth.

In the commit which was merged to the feature branch, the logic is:

  1. if unsafeAllowOverwrite is set, then use recoveryEpochCounter as the counter for the recovery epoch
  2. otherwise, ignore the recoveryEpochCounter input and assume that the the smart contract and protocol state epoch counters are synchronized

I think conditional branch 2 is unsafe and we should respect the recoveryEpochCounter codepath in all cases. Sorry if I didn't explain this properly in the previous review. The purpose of including recoveryEpochCounter at all is to make sure it is used as the epoch counter for the recovery epoch being constructed in the transaction, in all circumstances.

@kc1116 kc1116 requested a review from jordanschalm July 29, 2024 04:09

/// Stops epoch components. If the configuration is a valid configuration the staking auction,
/// qc voting and dkg will be ended depending on the current epoch phase.
access(all) fun stopEpochComponents()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

access(self)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// - All nodes in the node ids list have a weight > 0.
/// If the configuration is a valid configuration the the staking auction,
/// qc voting and dkg will be ended depending on the current epoch phase.
access(all) fun recoverEpochPreChecks(startView: UInt64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be access(self) also?

@@ -432,6 +497,234 @@ access(all) contract FlowEpoch {
FlowEpoch.account.storage.load<Bool>(from: /storage/flowAutomaticRewardsEnabled)
FlowEpoch.account.storage.save(enabled, to: /storage/flowAutomaticRewardsEnabled)
}

access(all) fun emitEpochRecoverEvent(epochCounter: UInt64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

access(self)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// does not calculate and set rewards avoiding double paying rewards for the same epoch.
/// This meta data will be emitted in the EpochRecover service event. This function is used
/// within sporks to recover the network from Epoch Fallback Mode (EFM).
access(all) fun recoverCurrentEpoch(startView: UInt64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we shouldn't combine this and recoverNewEpoch into a single function? Might save some lines of code and we could potentially just detect which one we need. I don't know for sure though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that how it initially was but we decided to make them 2 distinct functions so that it is more obvious the difference in behavior.

@kc1116 kc1116 merged commit 3118243 into feature/efm-recovery Jul 29, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants