-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KV Store] ProtocolStateVersionUpgrade
Service Event
#5428
[KV Store] ProtocolStateVersionUpgrade
Service Event
#5428
Conversation
Change to encode/decode passes tests, still requires cleanup: - adds encode/decode for new event type - adds tests for new event type - changes how decoding is implemented to prevent a bug. Previously we would re-encode, then re-decode the entire service event, after first decoding it into a weakly typed map. In particular with JSON, this could cause the data to change between encode/decode cycles, I think because of numeric type sizes.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## feature/protocol-state-kvstore #5428 +/- ##
==================================================================
- Coverage 55.97% 55.96% -0.02%
==================================================================
Files 1018 1018
Lines 97894 98055 +161
==================================================================
+ Hits 54800 54872 +72
- Misses 38994 39078 +84
- Partials 4100 4105 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
ProtocolStateVersionUpgrade
Service Event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for adding documentation for existing functions. Amazing!
- Most of my comments are of stylistic nature.
- Though there us one conceptual aspect that worries me: the convention of when the
ProtocolStateVersionUpgrade
takes effect. I have added a detailed explanation of my concern in this comment.
foundFieldCount++ | ||
activeViewValue = cdcEvent.Fields[i] | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a default
clause in the switch statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment the conversion allows for extra fields. If they are unknown, then they are ignored. This simplifies migration (for example, it saved some work and complexity when migrating the EpochSetup
event to the FLIP 204 format, which adds fields).
See also https://github.com/onflow/flow-go/pull/5428/files#r1513023070.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach of "ignoring unexpected fields" is in my opinion a security vulnerability waiting to happen. We know precisely what we are expecting but accept huge set of binary-different messages. We have gone through great length in the networking layer that it de-duplicates binary identical messages and reduces the trust score of peers should they try to spam us with the same (binary) message repeatedly. But we allow to include extra garbage in the encoding, i.e. it becomes trivial for a node to send me 1 million times the same message, all looking different on a binary level to the networking layer ...
The malleability surface in the code base is quite similar in nature in my opinion.
I am ok with continuing the existing approach in this PR. Though a non-unique encoding scheme is just generally not suitable for a BFT system. It creates too many edge cases that we cannot reasonably anticipate and manage in my opinion. I would like us to move away from this as soon as possible.
cc @janezpodhostnik for visibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same discussion as below I guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agree in general, however these methods are specifically for use only on trusted inputs. Independent of whether or not we accept unknown fields, we cannot pass untrusted inputs from the network into these methods -- it would be trivial to crash the node or cause other problems. See this comment:
// CAUTION: This function must only be used for input events computed locally, by an
// Execution or Verification Node; it is not resilient to malicious inputs.
Because of this, I don't think we improve security by being more restrictive about event fields here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks 👍
I think I get it now: this conversion is for the Execution Node or Verification node, to convert a service event, which was produced by local computation, from flow.Event
(Compact Cadence Format [ccf]) to flow.ServiceEvent
. I think the conversion from ccf is generally not BFT, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the conversion from ccf is generally not BFT, right?
I don't think that is specifically true about the CCF format in general. And here we are operating on everything in memory as Cadence Go types: it hasn't been encoded, so CCF doesn't play a role here. The key part is that the inputs to the function are produced strictly by local computation.
// NOTE: The above implies that, if the `ProtocolStateVersionUpgrade` event is sealed in a | ||
// block with view >= `ActiveView`, then `NewProtocolStateVersion` takes effect immediately. | ||
// | ||
// Otherwise, the node software stops processing blocks, until it is manually updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side node:
A cmd-tool would be great which allows us to override the protocol version requirement when the node starts up (maybe as simple as --allowIncompatibleProtocol=True
). Just in case we mess something up that we (more precisely a supermajority of the node operators) can manually recover the protocol.
If you agree, would you be so kind to create an issue please, so that we don't forget?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should immediately make an issue, but it's worth considering. I will add it as a discussion point to the FLIP/discussion post (at the moment I can't edit the post...).
The reason I don't think we should necessarily pre-emptively do this:
- Adding a flag that disable startup version checking would be a relatively small change, I think. Not having this in place preemptively would not be a big impediment to recovery.
- If we do need this, my guess is that it would be necessary but not sufficient. We will need to evaluate how the different versions will interact, and possibly make some changes to the software in one or the other version, before recovering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the explanation. Agree with all of your points.
model/flow/service_event.go
Outdated
// Event portion of a ServiceEvent into a specific typed structure. | ||
// Forwards errors from the underlying marshaller (treat errors as you would from eg. json.Unmarshal) | ||
func unmarshalWrapped[E any](b []byte, marshaller marshallerImpl) (*E, error) { | ||
eventWrapper := serviceEventUnmarshalWrapper[E]{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only usage of the type serviceEventUnmarshalWrapper
. It might be easier to read if the type is defined inside this function.
Co-authored-by: Alexander Hentschel <[email protected]>
model/convert/service_event.go
Outdated
} | ||
|
||
versionUpgrade, err := DecodeCadenceValue("ProtocolStateVersionUpgrade payload", payload, | ||
func(cdcEvent cadence.Event) (flow.ProtocolStateVersionUpgrade, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify it a bit by returning a pointer? In the end we still convert to pointer. Not sure if the api of DecodeCadenceValue
allows it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - we can return pointer directly 👍. Done in ec6e878.
foundFieldCount++ | ||
activeViewValue = cdcEvent.Fields[i] | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach of "ignoring unexpected fields" is in my opinion a security vulnerability waiting to happen. We know precisely what we are expecting but accept huge set of binary-different messages. We have gone through great length in the networking layer that it de-duplicates binary identical messages and reduces the trust score of peers should they try to spam us with the same (binary) message repeatedly. But we allow to include extra garbage in the encoding, i.e. it becomes trivial for a node to send me 1 million times the same message, all looking different on a binary level to the networking layer ...
The malleability surface in the code base is quite similar in nature in my opinion.
I am ok with continuing the existing approach in this PR. Though a non-unique encoding scheme is just generally not suitable for a BFT system. It creates too many edge cases that we cannot reasonably anticipate and manage in my opinion. I would like us to move away from this as soon as possible.
cc @janezpodhostnik for visibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work. Appreciate the detailed documentation, and specifications.
Added a few suggestions to further clarify a few details, but all concerns are addressed. Thank you.
Co-authored-by: Alexander Hentschel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff, looking forward incorporating those changes into upgrade framework.
4d20066
into
feature/protocol-state-kvstore
This PR adds the service event type for
ProtocolStateVersionUpgrade
. I have provisionally added the new service event toNodeVersionBeacon
contract (see also onflow/flow-core-contracts#411), which has some benefits and drawbacks:ProtocolStateVersionUpgrade
does something similar to the existing Version Beacon, so it makes sense for their logic to be togetherProtocolStateVersionUpgrade
cleanly difficultOther Changes