replace engine.Unit with ComponentManager in Pusher Engine #6747

tim-barry · 2024-11-21T00:53:03Z

Using #4219 as an example. Instead of starting new goroutines or directly processing messages in a blocking way, add messages to a queue that a worker pulls from.

The Pusher engine still currently implements network.Engine rather than network.MessageProcessor.

Using #4219 as an example. Instead of starting new goroutines or directly processing messages in a blocking way, messages are added to a queue that a worker pulls from. The Pusher engine still currently implements network.Engine rather than network.MessageProcessor.

codecov-commenter · 2024-11-21T00:56:32Z

Codecov Report

Attention: Patch coverage is 60.81081% with 29 lines in your changes missing coverage. Please review.

Project coverage is 41.25%. Comparing base (8a3055c) to head (2048b4a).

Files with missing lines	Patch %	Lines
engine/collection/pusher/engine.go	62.50%	21 Missing and 6 partials ⚠️
cmd/collection/main.go	0.00%	1 Missing ⚠️
engine/testutil/mock/nodes.go	0.00%	1 Missing ⚠️

Additional details and impacted files

@@                        Coverage Diff                         @@
##           feature/pusher-engine-refactor    #6747      +/-   ##
==================================================================
- Coverage                           41.26%   41.25%   -0.01%     
==================================================================
  Files                                2061     2061              
  Lines                              182702   182737      +35     
==================================================================
+ Hits                                75384    75386       +2     
- Misses                             101010   101036      +26     
- Partials                             6308     6315       +7

Flag	Coverage Δ
unittests	`41.25% <60.81%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Because the event processing now happens in a worker, any errors raised within it are no longer visible to the caller of Process(). Because the test checked for error status, moved the tests to the same package and call the internal processing function directly.

When using Unit, calling Ready would also start the engine. With ComponentManager, we additionally need to invoke Start.

tim-barry · 2024-11-25T18:25:57Z

engine/collection/pusher/engine.go

+		asSCGMsg := asEngineWrapper.Payload.(*messages.SubmitCollectionGuarantee)
+		originID := asEngineWrapper.OriginID
+
+		_ = e.process(originID, asSCGMsg)


Seems incorrect to fully discard any error, should likely be logged instead?

Yes, we should be handling all possible error cases. In general, all functions which return an error should document all possible expected error cases in their documentation. Callers which observe an error return that they don't know explicitly how to handle safely, should propagate the error further up the stack. Eventually this results in either the component or entire process restarting.

Here is the part of our coding guidelines talking about error handling: https://github.com/onflow/flow-go/blob/master/CodingConventions.md#error-handling. In short, the rule is to prioritize safety over liveness.

Practically, the way we would do this here is using the irrecoverable.SignalerContext. That has a Throw method which will signal to restart the component or process.

I usually propagate errors all the way up the stack to where the SignalerContext is defined (here, that would be in inboundMessageWorker) then do ctx.Throw(err). You can also use irrecoverable.Throw to throw any context.Context.

Change Suggestions:

In general, make sure all functions in the file which return an error document their expected error returns (or the lack of expected errors). Here is an example of error documentation.

Propagate the error value to inboundMessageWorker and Throw unexpected errors at that frame.

agree with your comment that discarding any error is bad. Please take a look at our Coding Conventions, if you haven't already. Tldr:

just discarding any error: very bad

logging error and continuing: still bad (this is the pattern of continuing on best effort basis, which is not a viable approach for high-assurance software systems)

desired:

explicitly check error type and only continue on errors that are confirmed to be benign.

all other errors (outside of the specified benign code path) are fatal

The pusher.Engine is a bit of a simpler case, because it only distributes information to other nodes. So swallowing errors does not undermine safety as it would for many other Engines. BUT, we already had cases where nodes were logging OutOfMemory errors for hours, but the code just dropped those and the node was dysfunctional without the software noticing it ... so even if it is not safety critical, we still generally expect that errors are not swallowed (and neither logged and then just discarded).

There are exceptions to this rule, e.g. if it would be a massive time investment to clear up technical debt and thoroughly documenting possible benign errors. But then, a detailed comment is absolutely required why it is always fine to log and discard any errors in that particular case.

I would suggest to leave error handling for a bit and come back to it after some iterations. After having implemented some of the cleanup and simplifications I suggest further below, I suspect that error handling will become a lot more straightforward.

tim-barry · 2024-11-25T18:28:03Z

engine/collection/pusher/engine.go

-// Done returns a done channel that is closed once the engine has fully stopped.
-func (e *Engine) Done() <-chan struct{} {
-	return e.unit.Done()
+func (e *Engine) processInboundMessages(ctx context.Context) {


Should probably make sure functions have doc comments

Yes, please add a short godoc comment to each function (especially public functions, but ideally private ones too).

added short doc comments in f2f53a8

AlexHentschel · 2024-11-25T22:18:26Z

Issue for the present PR: https://github.com/onflow/flow-go-internal/issues/7018 (it helps reviewers when its linked in the PR description, so they can read up on the context)

jordanschalm

The overall structure of changes looks good. I've added some thoughts related to transitioning to network.MessageProcessor, which we can discuss separately.

In terms of things to address in this PR:

Make sure all functions have minimal documentation, including a description of expected error returns (if the function returns any error).
Make sure invocations of a function that returns expected errors explicitly handles those errors.
Throw unexpected top-level errors to the SignalerContext

jordanschalm · 2024-11-25T22:40:04Z

engine/collection/pusher/engine.go

+		asSCGMsg := asEngineWrapper.Payload.(*messages.SubmitCollectionGuarantee)
+		originID := asEngineWrapper.OriginID
+
+		_ = e.process(originID, asSCGMsg)


Yes, we should be handling all possible error cases. In general, all functions which return an error should document all possible expected error cases in their documentation. Callers which observe an error return that they don't know explicitly how to handle safely, should propagate the error further up the stack. Eventually this results in either the component or entire process restarting.

Here is the part of our coding guidelines talking about error handling: https://github.com/onflow/flow-go/blob/master/CodingConventions.md#error-handling. In short, the rule is to prioritize safety over liveness.

Practically, the way we would do this here is using the irrecoverable.SignalerContext. That has a Throw method which will signal to restart the component or process.

I usually propagate errors all the way up the stack to where the SignalerContext is defined (here, that would be in inboundMessageWorker) then do ctx.Throw(err). You can also use irrecoverable.Throw to throw any context.Context.

Change Suggestions:

In general, make sure all functions in the file which return an error document their expected error returns (or the lack of expected errors). Here is an example of error documentation.

Propagate the error value to inboundMessageWorker and Throw unexpected errors at that frame.

jordanschalm · 2024-11-25T23:11:14Z

engine/collection/pusher/engine.go

+	// TODO length observer metrics
+	inbound, err := fifoqueue.NewFifoQueue(1000)
+	if err != nil {
+		return nil, fmt.Errorf("could not create inbound fifoqueue: %w", err)
+	}


Let's add those length metrics now for completeness. You can use this engine as an example. We'll also need to add a parameter for the mempoolMetrics to the constructor.

addressed in ddb0cb7

jordanschalm · 2024-11-25T23:12:22Z

engine/collection/pusher/engine.go

-// Done returns a done channel that is closed once the engine has fully stopped.
-func (e *Engine) Done() <-chan struct{} {
-	return e.unit.Done()
+func (e *Engine) processInboundMessages(ctx context.Context) {


Yes, please add a short godoc comment to each function (especially public functions, but ideally private ones too).

jordanschalm · 2024-11-25T23:45:20Z

engine/collection/pusher/engine.go

+// TODO convert to network.MessageProcessor
+var _ network.Engine = (*Engine)(nil)
+var _ component.Component = (*Engine)(nil)


It may also be a good exercise to make the change to MessageProcessor here, though likely as a separate PR.

I'll add some thoughts below; I suggest we spend some time discussing this live as well. There is no need to implement any of the comments below before we have a chance to discuss.

The pusher engine is unusual in a couple of ways:

It is using an old framework for internal message-passing, where two components in the same process would pass messages between each other using standard methods (SubmitLocal and ProcessLocal). The meaning of the message was carried by its type (here: messages.SubmitCollectionGuarantee) and the logical structure was similar to that of messages received over the network. Message processing logic needed to potentially handle inputs originating both locally and remotely.

It does not expect to receive any inbound messages from non-local sources. It is send-only from a networking perspective.

Replacing network.Engine with network.MessageProcessor there are a few opportunities for simplifying the implementation:

Currently the internal message-passing is done through SubmitLocal. We can remove SubmitLocal and ProcessLocal and replace them with a context-specific function (and corresponding interface type, for use in the caller).

A reference example is the Compliance interface, which is exposed by the compliance engine for local message-passing.

For this reason, engine.MessageHandler may not be needed in this context. It is better suited for cases where we are handling external messages.

We then do not need to check originIDs and do not need both onSubmitCollectionGuarantee and SubmitCollectionGuarantee methods

Now, this context-specific function is the only way for local messages to enter the pusher engine, which will simplify our network message processing logic 👇

In order to satisfy network.MessageProcessor, we need a Process method, which accepts messages from remote network sources. Since this engine does not expect any remote messages, we should reject all inputs to Process.

In general, when we observe an unexpected message from another node, we should flag it as a possible Byzantine message. In the future, these messages would result in a slashing challenge targeting the originator of the message. For now, we can log a warning with the relevant log tag.

+1. Had a similar thought 👉 my comment

Change to network.MessageProcessor and the new interface for local messages will be part of a separate PR.

jordanschalm · 2024-11-25T23:45:52Z

engine/collection/pusher/engine.go

-// started.
-func (e *Engine) Ready() <-chan struct{} {
-	return e.unit.Ready()
+func (e *Engine) inboundMessageWorker(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) {


Please add a short godoc for this method

jordanschalm · 2024-11-25T23:50:32Z

engine/collection/pusher/engine.go

-	})
+	err := e.messageHandler.Process(e.me.NodeID(), event)
+	if err != nil {
+		engine.LogError(e.log, err)


Suggested change

engine.LogError(e.log, err)

// TODO: do not swallow errors here, remove this function when transitioning to `network.MessageProcessor`

engine.LogError(e.log, err)

jordanschalm · 2024-11-25T23:50:42Z

engine/collection/pusher/engine.go

-	})
+	err := e.messageHandler.Process(originID, event)
+	if err != nil {
+		engine.LogError(e.log, err)


Suggested change

engine.LogError(e.log, err)

// TODO: do not swallow errors here, remove this function when transitioning to `network.MessageProcessor`

engine.LogError(e.log, err)

engine/collection/pusher/engine.go

AlexHentschel

Thanks for the PR. My comments are not a full review (yet). Rather I am aiming at highlighting the areas to pay attention to. When working on an issue, part of the goal is to address the issue, though fixing the issue is frequently not sufficient: one of the major challenges with implementing a high-assurance software system such as Flow is managing intellectual complexity, code complexity and component isolation. When an engineer is required to grok the entire code base or a large portion thereof in order to make any change, then we have lost the ability to extend and maintain the code base on business-relevant time scales.

The Pusher Engine is from a very early time when most engineers knew most of the code. As the code base grows more complex, by adding functionality towards the mature architecture, we need more precision and rigour in the implementation.

We appreciate that it is super hard for new engineers to identify technical debt as such. Therefore, I have provided detailed context why certain aspects are technical debt. As you grow more experienced and familiar with our best practises, it will become easier to spot tech debt. Nevertheless, I want to be clear about expectations: cleaning up tech debt in/around the code you touch as time permits. Always leave the code in a better state than you found it, especially wrt:

documentation in general (fixing outdated goDoc and adding missing docs)
error documentation in particular and error handling
code clarity and removing unnecessary abstraction layers
clearly apparent reasoning for correctness (either by the code itself or through added documentation)
API safety (is it hard to use the API incorrectly?)
test coverage

For quite some time, we will be helping you with spotting areas of tech debt and how to address them. If you see something that looks like tech debt and you don't know how to fix is, please ask - I loved your comment here. I can give you one recommendation: if you don't see anything to be improved at a piece of code (aside from implementing your issue) you should probably ask somebody for input. The code we touch tends to generally be older code (because the newer code often has already the desired features) and the older code is generally in dire need for cleaning up tech debt. Its frequent that engineers spend just as much time, frequently even longer, cleaning up tech debt than working on the actual issue. I think this will also be the case for the Pusher Engine.

engine/collection/pusher/engine.go

AlexHentschel · 2024-11-26T00:57:50Z

engine/collection/pusher/engine.go

+		asSCGMsg := asEngineWrapper.Payload.(*messages.SubmitCollectionGuarantee)
+		originID := asEngineWrapper.OriginID
+
+		_ = e.process(originID, asSCGMsg)


agree with your comment that discarding any error is bad. Please take a look at our Coding Conventions, if you haven't already. Tldr:

just discarding any error: very bad

logging error and continuing: still bad (this is the pattern of continuing on best effort basis, which is not a viable approach for high-assurance software systems)

desired:

explicitly check error type and only continue on errors that are confirmed to be benign.

all other errors (outside of the specified benign code path) are fatal

The pusher.Engine is a bit of a simpler case, because it only distributes information to other nodes. So swallowing errors does not undermine safety as it would for many other Engines. BUT, we already had cases where nodes were logging OutOfMemory errors for hours, but the code just dropped those and the node was dysfunctional without the software noticing it ... so even if it is not safety critical, we still generally expect that errors are not swallowed (and neither logged and then just discarded).

There are exceptions to this rule, e.g. if it would be a massive time investment to clear up technical debt and thoroughly documenting possible benign errors. But then, a detailed comment is absolutely required why it is always fine to log and discard any errors in that particular case.

I would suggest to leave error handling for a bit and come back to it after some iterations. After having implemented some of the cleanup and simplifications I suggest further below, I suspect that error handling will become a lot more straightforward.

AlexHentschel · 2024-11-26T01:04:58Z

engine/collection/pusher/engine.go

+	messageHandler := engine.NewMessageHandler(log, notifier, engine.Pattern{
+		Match: engine.MatchType[*messages.SubmitCollectionGuarantee],
+		Store: &engine.FifoMessageStore{FifoQueue: inbound},
+	})


I would recommend to not use the MessageHandler here for the following reasons:

I think it is good to apply the patter of the MessageHandler at least once by hand ("learning by doing")

The MessageHandler adds a layer of abstraction, if we can do without it that would be preferable in my opinion.

In this particular case, the Engine only processes a single input type SubmitCollectionGuarantee. We have tried it before and the message handler is actually more code compared to just doing the same logic by hand.

Yes, it did end up being less complicated - applied in f66ba69

AlexHentschel · 2024-11-26T01:14:06Z

engine/collection/pusher/engine.go

Please take a look at the core processing logic in the Engine and lets try to remove layers of complexity and indirection where possible:

flow-go/engine/collection/pusher/engine.go

Lines 182 to 191 in b5418dd

func (e *Engine) onSubmitCollectionGuarantee(originID flow.Identifier, req *messages.SubmitCollectionGuarantee) error {

if originID != e.me.NodeID() {

return fmt.Errorf("invalid remote request to submit collection guarantee (from=%x)", originID)

}

return e.SubmitCollectionGuarantee(&req.Guarantee)

}

// SubmitCollectionGuarantee submits the collection guarantee to all consensus nodes.

func (e *Engine) SubmitCollectionGuarantee(guarantee *flow.CollectionGuarantee) error {

Looking at this code raises a few questions:

we are only propagating collections that are originating from this node (👉 code). But the exported method SubmitCollectionGuarantee is not checking this? Why not?

If it is an important check, we should always check it. If we are fine with collections from other nodes being broadcast, then we don't need the check in onSubmitCollectionGuarantee either..

turns out, SubmitCollectionGuarantee is only ever called by the pusher.Engine itself (not even the tests call it).

Lets simplify this: merge SubmitCollectionGuarantee and onSubmitCollectionGuarantee.

Further digging into the code, I found that the originID is just completely useless in this context:

the pusher.Engine is fed only from internal components (so originID is always filled with the node's own identifier)

the pusher.Engine drops messages that are not from itself. Just hypothetically, lets assume that there were collections with originID other than the node's own ID. ❗Note that those messages would be queued first, wasting resources and then we have a different thread discard the Collection after dequeueing. Checks like that should always be applied before queueing irrelevant messages.

What you see here is a pattern we struggle with a lot: the code is very ambiguous and allows a whole bunch of stuff that is never intended to happen in practise. This is an unsuitable design for engineering complex systems such as Flow, as we then have to manually handle undesired cases.

Example 1: messages other than SubmitCollectionGuarantee. In comparison, if we had a method that only accepted SubmitCollectionGuarantee, the compiler would enforce this constraint for us as opposed to us having to write code. Often, people don't bother to manually disallow undesired inputs, resulting in software that is doing something incorrect in case of bugs as opposed to crashing (or the code not compiling in the first place)

Example 2: originID. We employ a generic patter which obfuscates that we don't need an originID in this particular example, because the Engine interface does not differentiate between node-external inputs (can be malicious) vs inputs from other components within its own node (always trusted) and mixes modes of processing (synchronous vs asynchronous).

For those reasons, we deprecated the Engine interface. Lets try to update the pusher and remove the engine interface (for concrete steps towards this, please see my next comment).

For Flow, the APIs should generally be as descriptive and restrictive as possible and make it hard to feed in wrong / mismatching inputs. Writing extra lines of boilerplate code is not a challenge for Flow. In contrast, it is challenging if you have to dig through hundreds or thousands of lines of code because the APIs are too generic and you need to know implementation details to use them correctly.

Partially adressed in b7166f3 and e25cba7: originID has been pretty much removed from the interface, now that Process and Submit (the functions in the Engine interface handling node-external inputs) simply error on any input.

AlexHentschel · 2024-11-26T02:24:06Z

engine/collection/pusher/engine.go

 // SubmitLocal submits an event originating on the local node.
 func (e *Engine) SubmitLocal(event interface{}) {


As an exercise, try to figure out which other components are feeing inputs into the pusher.Engine. We are lucky here because we have a rarely used message type SubmitCollectionGuarantee. But just imagine you would see a log that says no matching processor for message of type flow.Header from origin. With the Engine interface it is way too time intensive to figure out which components inside a node are feeding which other components with data.

Turns out that the Finalizer is the only component that is feeding the pusher.Engine (other than messages arriving from the networking layer). Let's make our lives easier in the future by improving the code:

introduce a dedicated interface GuaranteedCollectionPublisher. It only has one type-safe method
SubmitCollectionGuarantee(guarantee *flow.CollectionGuarantee)
this method just takes the message and puts it into the queue. Then we know that the queue will only ever contain elements of type flow.CollectionGuarantee. (currently, the Finalizer creates the wrapper messages.SubmitCollectionGuarantee around CollectionGuarantee ... and the pusher.Engine unwraps it and throws the wrapper away 😑 ).

change the Finalizer and all layers in between to work with the GuaranteedCollectionPublisher interface. This improves code clarity: currently, engineers would need to know what functionality was backing the generic Engine interface, e.g. here:

flow-go/engine/collection/epochmgr/factories/builder.go

Line 36 in b5418dd

pusher network.Engine,

because the right type of inputs need to be fed into the pusher.Engine. When using the GuaranteedCollectionPublisher interface we have the following benefits:

the type itself carries information about what functionality is behind this interface. Huge improvement of clarity.

its easy to find all places that can feed the Pusher with data, because we work with concrete interfaces and context-specific methods (e.g. SubmitCollectionGuarantee).

the compiler will reject code that feeds inputs with incompatible type into the Pusher

In my opinion, it would improve clarity if we were to discard all inputs from the network.MessageProcessor - conceptually they are messages from different node, which they are already broadcasting. The Pusher is for broadcasting our own messages. But at the moment, inputs from node-internal components as well as messages from other nodes passing through the networking layer share the same code path in the pusher.Enging, which makes it non-trivial to figure out where messages are coming from.

To summarize: one of the main benefits of the Component interface is that it provides lifecycles methods, but no methods for ingesting data - on purpose. This is because all data from components within the node should be passed via context-specific methods. The only exception is the interface network.MessageProcessor, which would still call Process with a generic message type.

Sorry for the lengthy comment. I hope you see how the very ambiguous structure of the Engine makes it really hard to determine from the code what we are expecting and where to drop messages, because its all hidden behind layers of ambiguous interfaces in a huge code base.

Thanks for the detailed comment :)

I added the described SubmitCollectionGuarantee method in f66ba69 , but the interface & changing the Finalizer will be pushed to the next PR.

Inputs from the network.MessageProcessor (via the Process method) are being discarded (e25cba7).

AlexHentschel · 2024-11-26T05:39:05Z

engine/collection/pusher/engine.go

 func New(log zerolog.Logger, net network.EngineRegistry, state protocol.State, engMetrics module.EngineMetrics, colMetrics module.CollectionMetrics, me module.Local, collections storage.Collections, transactions storage.Transactions) (*Engine, error) {
+	// TODO length observer metrics
+	inbound, err := fifoqueue.NewFifoQueue(1000)


I am not sure about the queue capacity we should pick here. I think we should have a good argument (and document it!) of why the number we chose makes sense. If we don't want to go through the trouble of defining a sensible limit, I think we could make this clear by using an unbounded queue (implementation).

Anyway, I think a limit makes sense:

In a nutshell, a collection must reference a block and expires $\texttt{DefaultTransactionExpiry}$ = 600 blocks thereafter

flow-go/model/flow/constants.go

Lines 18 to 22 in a164070

// Let E by the transaction expiry. If a transaction T specifies a reference

// block R with height H, then T may be included in any block B where:

// * R<-*B - meaning B has R as an ancestor, and

// * R.height < B.height <= R.height+E

const DefaultTransactionExpiry = 10 * 60

(its a subtle mechanic, feel free to ignore)

consensus nodes produce about $\rho$ = 1.25 blocks per second

collector clusters produce up to $\varrho$ = 3 collections per second

So very very roughly, between a collection being first proposed (usually referencing the newest known block) to the collection expiring, the collector cluster will produce $\texttt{DefaultTransactionExpiry}\cdot \rho^{-1} \cdot \varrho = 1440$ collections. So if we queued 1500 collections max, then we would be relatively confident that collections have been broadcast before they expire.
The downside is that collections might sit in the queue for quite some time, if several hundreds are before them. That could be an argument of queuing only much fewer collections (e.g. 200), so collection that don't make it out fast are dropped.
Another argument we could make is actually using a LIFO queue. If the queue is nearly empty, there is little difference. However, if there is a large backlog of collections to be broadcast, I think its a valid argument of broadcasting the newest first, at least then, some collections make it in time as opposed to all of them being queued past their expiry. Just some thoughts.

@jordanschalm what do you think?

I think the choice of queue size here is unlikely to have significant impact:

All collection nodes can publish all finalized guarantees.

In the happy path, collection nodes are observing finalized collections at a very consistent rate over tiem. We can broadcast collection guarantees much more quickly than they are finalized, and expect the queue to be empty most of the time.

In the unhappy path where a collection node is finalizing collections more quickly than usual, it is behind and these are old collections -- we don't care about successfully broadcasting them anyway.

So if we queued 1500 collections max, then we would be relatively confident that collections have been broadcast before they expire.

If we actually had 1500 collections waiting in this queue, I would feel much more confident that there was a severe problem with that node's ability to publish messages, than that collections would be broadcasted before expiring. 😀

General thoughts:

I think we should prefer dropping from the queue than making sure we publish every message (choose a relatively small queue size). I'm happy with 200 (~1 minute of collections) or even a bit less.

I agree that a LIFO queue (stack) would be better suited. I don't think we should actually make that change now (premature optimization)

Rename `inboundMessageWorker` and `processInboundMessages` to `outbound` and also propagate errors to the top level of the worker where they can be thrown.

- Partially implement suggestion #6747 (comment) - Make `SubmitCollectionGuarantee` non-exported and rename to `publishCollectionGuarantee` - Add new `SubmitCollectionGuarantee` exported function that just adds to the queue - Remove `messageHandler` field, instead directly add to queue from review: #6747 (comment) - `OriginID`s no longer included in messages in the queue, and therefore not checked by the worker - if necessary they should be checked when Submitting

This reverts commit 0aadc13. Instead of testing the internals, test the exported interface.

…gine-use-componentmanager

Rename queue and add length metrics for it, updating creation sites.

Co-authored-by: Alexander Hentschel <[email protected]>

jordanschalm

Looking good - thank you for the revisions and for noting which changes will be addressed in the follow-up PR.

General suggestions:

Remove unused field colMetrics
There are a couple functions returning errors, that I don't think we will be changing in the follow-up PR. For these, we can include the error documentation here.

engine/collection/pusher/engine.go

jordanschalm · 2024-11-29T21:50:44Z

engine/collection/pusher/engine.go

+func (e *Engine) SubmitCollectionGuarantee(msg *messages.SubmitCollectionGuarantee) {
+	ok := e.queue.Push(msg)


When we were using a generic function (eg. SubmitLocal(message any)), the type wrapper (*messages.SubmitCollectionGuarantee) was useful because it communicated the meaning of the message ("this collection guarantee should be broadcast").

When we have a context-specific method for enqueueing collection guarantees, we no longer need the type wrapper, and can simply pass *CollectionGuarantee directly: the method communicates the meaning of the message, not the argument type.

(Let's apply this change in the follow-up PR.)

jordanschalm · 2024-11-29T22:03:01Z

engine/collection/pusher/engine.go

-
-// SubmitCollectionGuarantee submits the collection guarantee to all consensus nodes.
-func (e *Engine) SubmitCollectionGuarantee(guarantee *flow.CollectionGuarantee) error {
+// publishCollectionGuarantee publishes the collection guarantee to all consensus nodes.


Could you please add error return documentation to this function. If it might return an expected sentinel error (including from sub-calls into Identities() and Publish()), we should list those possible error returns. Otherwise, we should add the line // No errors expected during normal operation.

If you haven't already, take a look at the godoc references for Identities() and Publish() and think about what errors they might return, and which of those are expected vs unexpected in the context of executing publishCollectionGuarantee.

In particular, Identities() does have a listed error return value -- state.ErrUnknownSnapshotReference -- but in the context here, it is not expected. Why is that?

For Identities(), I believe since we're using the Final() (latest finalized in cache) snapshot, we expect to already have a valid snapshot reference / queryable block in the state, so expect Identities() on it to never error.

From what I can tell Publish() only returns an error if the conduit channel has been closed (in which case I would assume we can't continue), or other "benign" errors when the constructed message is invalid in some way (pretty sure this is not expected) or some failure in the underlying libp2p (I assume also not expected).

jordanschalm · 2024-11-29T22:04:40Z

engine/collection/pusher/engine.go

-func (e *Engine) Done() <-chan struct{} {
-	return e.unit.Done()
+// processOutboundMessages processes any available messages from the queue.
+// Only returns when the queue is empty (or the engine is terminated).


Please add error documentation here. Either enumerate possible error types, or add // No errors expected during normal operation.

We don't expect any errors here, because the only sources are wrong-typed items in the queue (shouldn't happen since we only ever insert the correct type) or error in publishCollectionGuarantee, which is not expected.

module/metrics/labels.go

Doc comment changes, metrics naming, and queue length. For reasoning behind chosen queue length, see #6747 (comment) Co-authored-by: Jordan Schalm <[email protected]>

Instead of logging an error, add to the metrics when the queue is full and a message needs to be dropped instead of sent. https://github.com/onflow/flow-go/blob/85913ad09e605fb5234155301bb0d517946a75a5/engine/collection/compliance/engine.go#L139-L146

jordanschalm

🎸

AlexHentschel

Ver nice code 👏 . Only stylistic suggestions (all optional).

engine/collection/pusher/engine.go

- Remove pusher engine implementation of network.Engine - Replace with network.MessageProcessor - See: #6747 (comment) - Remove SubmitCollectionGuarantee message type - Was only used between Finalizer and Pusher engine - New interface passes and stores collection guarantees directly, instead of wrapping and then unwrapping them - See: #6747 (comment) - Add GuaranteedCollectionPublisher interface, implemented by pusher engine - Only used by the Finalizer (and intermediate constructors) - Mocks are generated for it, used in Finalizer unit tests - See: #6747 (comment)

tim-barry requested a review from jordanschalm as a code owner November 21, 2024 00:53

tim-barry requested a review from durkmurder November 21, 2024 00:57

tim-barry marked this pull request as draft November 21, 2024 17:43

tim-barry and others added 4 commits November 21, 2024 10:48

pusher engine test: update positive test

6d3f462

Start pusher engine in mocks

dec0d58

When using Unit, calling Ready would also start the engine. With ComponentManager, we additionally need to invoke Start.

Merge branch 'master' into tim/7018-pusher-engine-use-componentmanager

b5418dd

tim-barry marked this pull request as ready for review November 22, 2024 18:56

tim-barry commented Nov 25, 2024

View reviewed changes

jordanschalm reviewed Nov 26, 2024

View reviewed changes

AlexHentschel reviewed Nov 26, 2024

View reviewed changes

tim-barry mentioned this pull request Nov 26, 2024

Refactor Pusher Engine #6765

Closed

tim-barry added 4 commits November 26, 2024 16:36

Refactor pusher engine: merge function with no callers

b7166f3

Refactor pusher engine: error on non-local messages

e25cba7

Refactor pusher engine: rename and propagate error

f2f53a8

Rename `inboundMessageWorker` and `processInboundMessages` to `outbound` and also propagate errors to the top level of the worker where they can be thrown.

tim-barry changed the base branch from master to feature/pusher-engine-refactor November 27, 2024 17:58

tim-barry and others added 5 commits November 27, 2024 10:02

Revert "Pusher engine test: update negative test"

1c80949

This reverts commit 0aadc13. Instead of testing the internals, test the exported interface.

Refactor pusher engine: (lint) remove unused code

fbed8e7

Merge branch 'feature/pusher-engine-refactor' into tim/7018-pusher-en…

051e6d4

…gine-use-componentmanager

Refactor pusher engine: queue length metrics

ddb0cb7

Rename queue and add length metrics for it, updating creation sites.

Update pusher engine doc comment

17a9a2b

Co-authored-by: Alexander Hentschel <[email protected]>

tim-barry requested review from AlexHentschel and jordanschalm and removed request for durkmurder November 29, 2024 19:13

jordanschalm reviewed Nov 29, 2024

View reviewed changes

tim-barry and others added 4 commits November 29, 2024 16:10

Apply suggestions from code review

aad332c

Doc comment changes, metrics naming, and queue length. For reasoning behind chosen queue length, see #6747 (comment) Co-authored-by: Jordan Schalm <[email protected]>

Pusher engine refactor: remove unused collection metrics

ad04f27

Refactor pusher engine: add error return doc comments

88a7d45

jordanschalm approved these changes Dec 3, 2024

View reviewed changes

AlexHentschel approved these changes Dec 3, 2024

View reviewed changes

engine/collection/pusher/engine.go Outdated Show resolved Hide resolved

engine/collection/pusher/engine.go Show resolved Hide resolved

engine/collection/pusher/engine.go Outdated Show resolved Hide resolved

engine/collection/pusher/engine.go Show resolved Hide resolved

Apply suggestions from code review

2048b4a

tim-barry merged commit 7e4258a into feature/pusher-engine-refactor Dec 4, 2024
55 checks passed

tim-barry deleted the tim/7018-pusher-engine-use-componentmanager branch December 4, 2024 01:12

tim-barry mentioned this pull request Dec 4, 2024

Refactor Pusher Engine with updated interface #6780

Merged

This was referenced Dec 17, 2024

Refactor AsyncUploader to replace Engine.Unit with ComponentManager #6809

Merged

Refactor: Replace some uses of engine.Unit with ComponentManager #6833

Merged

This was referenced Jan 7, 2025

Refactor Consensus Sealing Engine: engine.Unit -> ComponentManager #6852

Open

Refactor Consensus Sealing Core: engine.Unit -> ComponentManager #6853

Open

Refactor Consensus Matching Engine: engine.Unit -> ComponentManager #6854

Closed

	engine.LogError(e.log, err)
	// TODO: do not swallow errors here, remove this function when transitioning to `network.MessageProcessor`
	engine.LogError(e.log, err)

	func (e Engine) onSubmitCollectionGuarantee(originID flow.Identifier, req messages.SubmitCollectionGuarantee) error {
	if originID != e.me.NodeID() {
	return fmt.Errorf("invalid remote request to submit collection guarantee (from=%x)", originID)
	}

	return e.SubmitCollectionGuarantee(&req.Guarantee)
	}

	// SubmitCollectionGuarantee submits the collection guarantee to all consensus nodes.
	func (e Engine) SubmitCollectionGuarantee(guarantee flow.CollectionGuarantee) error {

		// SubmitLocal submits an event originating on the local node.
		func (e *Engine) SubmitLocal(event interface{}) {

	// Let E by the transaction expiry. If a transaction T specifies a reference
	// block R with height H, then T may be included in any block B where:
	// * R<-*B - meaning B has R as an ancestor, and
	// * R.height < B.height <= R.height+E
	const DefaultTransactionExpiry = 10 * 60

		func (e Engine) SubmitCollectionGuarantee(msg messages.SubmitCollectionGuarantee) {
		ok := e.queue.Push(msg)

replace engine.Unit with ComponentManager in Pusher Engine #6747

replace engine.Unit with ComponentManager in Pusher Engine #6747

Conversation

tim-barry commented Nov 21, 2024

codecov-commenter commented Nov 21, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Change Suggestions:

AlexHentschel Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexHentschel commented Nov 25, 2024

jordanschalm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Change Suggestions:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexHentschel left a comment

Choose a reason for hiding this comment

AlexHentschel Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

AlexHentschel Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexHentschel Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexHentschel Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jordanschalm left a comment

Choose a reason for hiding this comment

jordanschalm Nov 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jordanschalm left a comment

Choose a reason for hiding this comment

AlexHentschel left a comment

Choose a reason for hiding this comment

codecov-commenter commented Nov 21, 2024 •

edited

Loading

AlexHentschel Nov 26, 2024 •

edited

Loading

AlexHentschel Nov 26, 2024 •

edited

Loading

AlexHentschel Nov 26, 2024 •

edited

Loading

AlexHentschel Nov 26, 2024 •

edited

Loading

AlexHentschel Nov 26, 2024 •

edited

Loading

jordanschalm Nov 29, 2024 •

edited

Loading