Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KV Store] ProtocolStateVersionUpgrade Service Event #5428

Merged

Conversation

jordanschalm
Copy link
Member

@jordanschalm jordanschalm commented Feb 21, 2024

This PR adds the service event type for ProtocolStateVersionUpgrade. I have provisionally added the new service event to NodeVersionBeacon contract (see also onflow/flow-core-contracts#411), which has some benefits and drawbacks:

  • Conceptually ProtocolStateVersionUpgrade does something similar to the existing Version Beacon, so it makes sense for their logic to be together
  • The Version Beacon contract already has a Heartbeat resource hooked up to the system chunk, which is necessary to emit service events
  • The existing version beacon code is written in a way that suggests the version it is operating on is global, rather than specifically scoped to the Execution State Machine
    • This, combined with the limitations of upgrading contracts, makes adding ProtocolStateVersionUpgrade cleanly difficult

Other Changes

  • Changes marshalling logic for all service events, to prevent data loss when unmarshalling in some cases using JSON (9741a9a)
  • Removes unused service event fixtures

Change to encode/decode passes tests, still requires cleanup:
- adds encode/decode for new event type
- adds tests for new event type
- changes how decoding is implemented to prevent a bug. Previously we
  would re-encode, then re-decode the entire service event, after first
  decoding it into a weakly typed map. In particular with JSON, this
  could cause the data to change between encode/decode cycles, I think because
  of numeric type sizes.
@codecov-commenter
Copy link

codecov-commenter commented Feb 22, 2024

Codecov Report

Attention: Patch coverage is 58.24742% with 81 lines in your changes are missing coverage. Please review.

Project coverage is 55.96%. Comparing base (b4f1bcf) to head (334aaf8).

Files Patch % Lines
model/convert/service_event.go 56.33% 24 Missing and 7 partials ⚠️
model/flow/service_event.go 53.57% 25 Missing and 1 partial ⚠️
utils/unittest/service_events_fixtures.go 67.34% 14 Missing and 2 partials ⚠️
utils/unittest/fixtures.go 0.00% 5 Missing ⚠️
model/flow/version_beacon.go 62.50% 3 Missing ⚠️
Additional details and impacted files
@@                        Coverage Diff                         @@
##           feature/protocol-state-kvstore    #5428      +/-   ##
==================================================================
- Coverage                           55.97%   55.96%   -0.02%     
==================================================================
  Files                                1018     1018              
  Lines                               97894    98055     +161     
==================================================================
+ Hits                                54800    54872      +72     
- Misses                              38994    39078      +84     
- Partials                             4100     4105       +5     
Flag Coverage Δ
unittests 55.96% <58.24%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jordanschalm jordanschalm changed the title Jordan/5351 upgrade service event [KV Store] ProtocolStateVersionUpgrade Service Event Feb 22, 2024
Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for adding documentation for existing functions. Amazing!

  • Most of my comments are of stylistic nature.
  • Though there us one conceptual aspect that worries me: the convention of when the ProtocolStateVersionUpgrade takes effect. I have added a detailed explanation of my concern in this comment.

fvm/systemcontracts/system_contracts.go Outdated Show resolved Hide resolved
model/convert/service_event.go Outdated Show resolved Hide resolved
model/convert/service_event.go Show resolved Hide resolved
model/convert/service_event.go Show resolved Hide resolved
foundFieldCount++
activeViewValue = cdcEvent.Fields[i]
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a default clause in the switch statement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment the conversion allows for extra fields. If they are unknown, then they are ignored. This simplifies migration (for example, it saved some work and complexity when migrating the EpochSetup event to the FLIP 204 format, which adds fields).

See also https://github.com/onflow/flow-go/pull/5428/files#r1513023070.

Copy link
Member

@AlexHentschel AlexHentschel Mar 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach of "ignoring unexpected fields" is in my opinion a security vulnerability waiting to happen. We know precisely what we are expecting but accept huge set of binary-different messages. We have gone through great length in the networking layer that it de-duplicates binary identical messages and reduces the trust score of peers should they try to spam us with the same (binary) message repeatedly. But we allow to include extra garbage in the encoding, i.e. it becomes trivial for a node to send me 1 million times the same message, all looking different on a binary level to the networking layer ...
The malleability surface in the code base is quite similar in nature in my opinion.

I am ok with continuing the existing approach in this PR. Though a non-unique encoding scheme is just generally not suitable for a BFT system. It creates too many edge cases that we cannot reasonably anticipate and manage in my opinion. I would like us to move away from this as soon as possible.

cc @janezpodhostnik for visibility

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same discussion as below I guess

Copy link
Member Author

@jordanschalm jordanschalm Mar 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree in general, however these methods are specifically for use only on trusted inputs. Independent of whether or not we accept unknown fields, we cannot pass untrusted inputs from the network into these methods -- it would be trivial to crash the node or cause other problems. See this comment:

// CAUTION: This function must only be used for input events computed locally, by an
// Execution or Verification Node; it is not resilient to malicious inputs.

Because of this, I don't think we improve security by being more restrictive about event fields here.

Copy link
Member

@AlexHentschel AlexHentschel Mar 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks 👍

I think I get it now: this conversion is for the Execution Node or Verification node, to convert a service event, which was produced by local computation, from flow.Event (Compact Cadence Format [ccf]) to flow.ServiceEvent. I think the conversion from ccf is generally not BFT, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the conversion from ccf is generally not BFT, right?

I don't think that is specifically true about the CCF format in general. And here we are operating on everything in memory as Cadence Go types: it hasn't been encoded, so CCF doesn't play a role here. The key part is that the inputs to the function are produced strictly by local computation.

model/flow/service_event.go Outdated Show resolved Hide resolved
model/flow/version_beacon.go Outdated Show resolved Hide resolved
model/flow/version_beacon.go Show resolved Hide resolved
model/flow/version_beacon.go Outdated Show resolved Hide resolved
// NOTE: The above implies that, if the `ProtocolStateVersionUpgrade` event is sealed in a
// block with view >= `ActiveView`, then `NewProtocolStateVersion` takes effect immediately.
//
// Otherwise, the node software stops processing blocks, until it is manually updated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side node:

A cmd-tool would be great which allows us to override the protocol version requirement when the node starts up (maybe as simple as --allowIncompatibleProtocol=True). Just in case we mess something up that we (more precisely a supermajority of the node operators) can manually recover the protocol.

If you agree, would you be so kind to create an issue please, so that we don't forget?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should immediately make an issue, but it's worth considering. I will add it as a discussion point to the FLIP/discussion post (at the moment I can't edit the post...).

The reason I don't think we should necessarily pre-emptively do this:

  • Adding a flag that disable startup version checking would be a relatively small change, I think. Not having this in place preemptively would not be a big impediment to recovery.
  • If we do need this, my guess is that it would be necessary but not sufficient. We will need to evaluate how the different versions will interact, and possibly make some changes to the software in one or the other version, before recovering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation. Agree with all of your points.

model/convert/service_event.go Show resolved Hide resolved
// Event portion of a ServiceEvent into a specific typed structure.
// Forwards errors from the underlying marshaller (treat errors as you would from eg. json.Unmarshal)
func unmarshalWrapped[E any](b []byte, marshaller marshallerImpl) (*E, error) {
eventWrapper := serviceEventUnmarshalWrapper[E]{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only usage of the type serviceEventUnmarshalWrapper. It might be easier to read if the type is defined inside this function.

model/flow/service_event.go Show resolved Hide resolved
model/flow/version_beacon.go Show resolved Hide resolved
fvm/systemcontracts/system_contracts.go Outdated Show resolved Hide resolved
Co-authored-by: Alexander Hentschel <[email protected]>
}

versionUpgrade, err := DecodeCadenceValue("ProtocolStateVersionUpgrade payload", payload,
func(cdcEvent cadence.Event) (flow.ProtocolStateVersionUpgrade, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify it a bit by returning a pointer? In the end we still convert to pointer. Not sure if the api of DecodeCadenceValue allows it though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - we can return pointer directly 👍. Done in ec6e878.

fvm/systemcontracts/system_contracts.go Outdated Show resolved Hide resolved
foundFieldCount++
activeViewValue = cdcEvent.Fields[i]
}
}
Copy link
Member

@AlexHentschel AlexHentschel Mar 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach of "ignoring unexpected fields" is in my opinion a security vulnerability waiting to happen. We know precisely what we are expecting but accept huge set of binary-different messages. We have gone through great length in the networking layer that it de-duplicates binary identical messages and reduces the trust score of peers should they try to spam us with the same (binary) message repeatedly. But we allow to include extra garbage in the encoding, i.e. it becomes trivial for a node to send me 1 million times the same message, all looking different on a binary level to the networking layer ...
The malleability surface in the code base is quite similar in nature in my opinion.

I am ok with continuing the existing approach in this PR. Though a non-unique encoding scheme is just generally not suitable for a BFT system. It creates too many edge cases that we cannot reasonably anticipate and manage in my opinion. I would like us to move away from this as soon as possible.

cc @janezpodhostnik for visibility

Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work. Appreciate the detailed documentation, and specifications.

Added a few suggestions to further clarify a few details, but all concerns are addressed. Thank you.

model/flow/version_beacon.go Outdated Show resolved Hide resolved
model/flow/version_beacon.go Outdated Show resolved Hide resolved
model/flow/version_beacon.go Outdated Show resolved Hide resolved
Co-authored-by: Alexander Hentschel <[email protected]>
Copy link
Member

@durkmurder durkmurder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff, looking forward incorporating those changes into upgrade framework.

@jordanschalm jordanschalm merged commit 4d20066 into feature/protocol-state-kvstore Mar 11, 2024
50 of 51 checks passed
@jordanschalm jordanschalm deleted the jordan/5351-upgrade-service-event branch March 11, 2024 21:01
@franklywatson franklywatson linked an issue Apr 26, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Dynamic Protocol State | M2] KV Store upgrade event
6 participants