Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace engine.Unit with ComponentManager in Pusher Engine #6747

Conversation

tim-barry
Copy link
Contributor

Using #4219 as an example. Instead of starting new goroutines or directly processing messages in a blocking way, add messages to a queue that a worker pulls from.

The Pusher engine still currently implements network.Engine rather than network.MessageProcessor.

Using #4219 as an example.
Instead of starting new goroutines or directly processing messages
in a blocking way, messages are added to a queue that a worker pulls from.
The Pusher engine still currently implements network.Engine rather than
network.MessageProcessor.
@codecov-commenter
Copy link

codecov-commenter commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 60.81081% with 29 lines in your changes missing coverage. Please review.

Project coverage is 41.25%. Comparing base (8a3055c) to head (2048b4a).

Files with missing lines Patch % Lines
engine/collection/pusher/engine.go 62.50% 21 Missing and 6 partials ⚠️
cmd/collection/main.go 0.00% 1 Missing ⚠️
engine/testutil/mock/nodes.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                        Coverage Diff                         @@
##           feature/pusher-engine-refactor    #6747      +/-   ##
==================================================================
- Coverage                           41.26%   41.25%   -0.01%     
==================================================================
  Files                                2061     2061              
  Lines                              182702   182737      +35     
==================================================================
+ Hits                                75384    75386       +2     
- Misses                             101010   101036      +26     
- Partials                             6308     6315       +7     
Flag Coverage Δ
unittests 41.25% <60.81%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tim-barry tim-barry requested a review from durkmurder November 21, 2024 00:57
@tim-barry tim-barry marked this pull request as draft November 21, 2024 17:43
tim-barry and others added 4 commits November 21, 2024 10:48
Because the event processing now happens in a worker, any errors
raised within it are no longer visible to the caller of Process().
Because the test checked for error status, moved the tests to the
same package and call the internal processing function directly.
When using Unit, calling Ready would also start the engine.
With ComponentManager, we additionally need to invoke Start.
@tim-barry tim-barry marked this pull request as ready for review November 22, 2024 18:56
asSCGMsg := asEngineWrapper.Payload.(*messages.SubmitCollectionGuarantee)
originID := asEngineWrapper.OriginID

_ = e.process(originID, asSCGMsg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems incorrect to fully discard any error, should likely be logged instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should be handling all possible error cases. In general, all functions which return an error should document all possible expected error cases in their documentation. Callers which observe an error return that they don't know explicitly how to handle safely, should propagate the error further up the stack. Eventually this results in either the component or entire process restarting.

Here is the part of our coding guidelines talking about error handling: https://github.com/onflow/flow-go/blob/master/CodingConventions.md#error-handling. In short, the rule is to prioritize safety over liveness.

Practically, the way we would do this here is using the irrecoverable.SignalerContext. That has a Throw method which will signal to restart the component or process.

I usually propagate errors all the way up the stack to where the SignalerContext is defined (here, that would be in inboundMessageWorker) then do ctx.Throw(err). You can also use irrecoverable.Throw to throw any context.Context.

Change Suggestions:

  • In general, make sure all functions in the file which return an error document their expected error returns (or the lack of expected errors). Here is an example of error documentation.
  • Propagate the error value to inboundMessageWorker and Throw unexpected errors at that frame.

Copy link
Member

@AlexHentschel AlexHentschel Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with your comment that discarding any error is bad. Please take a look at our Coding Conventions, if you haven't already. Tldr:

  • just discarding any error: very bad
  • logging error and continuing: still bad (this is the pattern of continuing on best effort basis, which is not a viable approach for high-assurance software systems)
  • desired:
    • explicitly check error type and only continue on errors that are confirmed to be benign.
    • all other errors (outside of the specified benign code path) are fatal

The pusher.Engine is a bit of a simpler case, because it only distributes information to other nodes. So swallowing errors does not undermine safety as it would for many other Engines. BUT, we already had cases where nodes were logging OutOfMemory errors for hours, but the code just dropped those and the node was dysfunctional without the software noticing it ... so even if it is not safety critical, we still generally expect that errors are not swallowed (and neither logged and then just discarded).

There are exceptions to this rule, e.g. if it would be a massive time investment to clear up technical debt and thoroughly documenting possible benign errors. But then, a detailed comment is absolutely required why it is always fine to log and discard any errors in that particular case.

I would suggest to leave error handling for a bit and come back to it after some iterations. After having implemented some of the cleanup and simplifications I suggest further below, I suspect that error handling will become a lot more straightforward.

// Done returns a done channel that is closed once the engine has fully stopped.
func (e *Engine) Done() <-chan struct{} {
return e.unit.Done()
func (e *Engine) processInboundMessages(ctx context.Context) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably make sure functions have doc comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please add a short godoc comment to each function (especially public functions, but ideally private ones too).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added short doc comments in f2f53a8

@AlexHentschel
Copy link
Member

Issue for the present PR: https://github.com/onflow/flow-go-internal/issues/7018 (it helps reviewers when its linked in the PR description, so they can read up on the context)

Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall structure of changes looks good. I've added some thoughts related to transitioning to network.MessageProcessor, which we can discuss separately.

In terms of things to address in this PR:

  • Make sure all functions have minimal documentation, including a description of expected error returns (if the function returns any error).
  • Make sure invocations of a function that returns expected errors explicitly handles those errors.
  • Throw unexpected top-level errors to the SignalerContext

asSCGMsg := asEngineWrapper.Payload.(*messages.SubmitCollectionGuarantee)
originID := asEngineWrapper.OriginID

_ = e.process(originID, asSCGMsg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should be handling all possible error cases. In general, all functions which return an error should document all possible expected error cases in their documentation. Callers which observe an error return that they don't know explicitly how to handle safely, should propagate the error further up the stack. Eventually this results in either the component or entire process restarting.

Here is the part of our coding guidelines talking about error handling: https://github.com/onflow/flow-go/blob/master/CodingConventions.md#error-handling. In short, the rule is to prioritize safety over liveness.

Practically, the way we would do this here is using the irrecoverable.SignalerContext. That has a Throw method which will signal to restart the component or process.

I usually propagate errors all the way up the stack to where the SignalerContext is defined (here, that would be in inboundMessageWorker) then do ctx.Throw(err). You can also use irrecoverable.Throw to throw any context.Context.

Change Suggestions:

  • In general, make sure all functions in the file which return an error document their expected error returns (or the lack of expected errors). Here is an example of error documentation.
  • Propagate the error value to inboundMessageWorker and Throw unexpected errors at that frame.

Comment on lines 54 to 58
// TODO length observer metrics
inbound, err := fifoqueue.NewFifoQueue(1000)
if err != nil {
return nil, fmt.Errorf("could not create inbound fifoqueue: %w", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add those length metrics now for completeness. You can use this engine as an example. We'll also need to add a parameter for the mempoolMetrics to the constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in ddb0cb7

// Done returns a done channel that is closed once the engine has fully stopped.
func (e *Engine) Done() <-chan struct{} {
return e.unit.Done()
func (e *Engine) processInboundMessages(ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please add a short godoc comment to each function (especially public functions, but ideally private ones too).

Comment on lines +49 to +51
// TODO convert to network.MessageProcessor
var _ network.Engine = (*Engine)(nil)
var _ component.Component = (*Engine)(nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may also be a good exercise to make the change to MessageProcessor here, though likely as a separate PR.

I'll add some thoughts below; I suggest we spend some time discussing this live as well. There is no need to implement any of the comments below before we have a chance to discuss.

The pusher engine is unusual in a couple of ways:

  1. It is using an old framework for internal message-passing, where two components in the same process would pass messages between each other using standard methods (SubmitLocal and ProcessLocal). The meaning of the message was carried by its type (here: messages.SubmitCollectionGuarantee) and the logical structure was similar to that of messages received over the network. Message processing logic needed to potentially handle inputs originating both locally and remotely.
  2. It does not expect to receive any inbound messages from non-local sources. It is send-only from a networking perspective.

Replacing network.Engine with network.MessageProcessor there are a few opportunities for simplifying the implementation:

  • Currently the internal message-passing is done through SubmitLocal. We can remove SubmitLocal and ProcessLocal and replace them with a context-specific function (and corresponding interface type, for use in the caller).
    • A reference example is the Compliance interface, which is exposed by the compliance engine for local message-passing.
    • For this reason, engine.MessageHandler may not be needed in this context. It is better suited for cases where we are handling external messages.
    • We then do not need to check originIDs and do not need both onSubmitCollectionGuarantee and SubmitCollectionGuarantee methods
    • Now, this context-specific function is the only way for local messages to enter the pusher engine, which will simplify our network message processing logic 👇
  • In order to satisfy network.MessageProcessor, we need a Process method, which accepts messages from remote network sources. Since this engine does not expect any remote messages, we should reject all inputs to Process.
    • In general, when we observe an unexpected message from another node, we should flag it as a possible Byzantine message. In the future, these messages would result in a slashing challenge targeting the originator of the message. For now, we can log a warning with the relevant log tag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Had a similar thought 👉 my comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to network.MessageProcessor and the new interface for local messages will be part of a separate PR.

// started.
func (e *Engine) Ready() <-chan struct{} {
return e.unit.Ready()
func (e *Engine) inboundMessageWorker(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a short godoc for this method

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

})
err := e.messageHandler.Process(e.me.NodeID(), event)
if err != nil {
engine.LogError(e.log, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
engine.LogError(e.log, err)
// TODO: do not swallow errors here, remove this function when transitioning to `network.MessageProcessor`
engine.LogError(e.log, err)

})
err := e.messageHandler.Process(originID, event)
if err != nil {
engine.LogError(e.log, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
engine.LogError(e.log, err)
// TODO: do not swallow errors here, remove this function when transitioning to `network.MessageProcessor`
engine.LogError(e.log, err)

engine/collection/pusher/engine.go Show resolved Hide resolved
Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. My comments are not a full review (yet). Rather I am aiming at highlighting the areas to pay attention to. When working on an issue, part of the goal is to address the issue, though fixing the issue is frequently not sufficient: one of the major challenges with implementing a high-assurance software system such as Flow is managing intellectual complexity, code complexity and component isolation. When an engineer is required to grok the entire code base or a large portion thereof in order to make any change, then we have lost the ability to extend and maintain the code base on business-relevant time scales.

The Pusher Engine is from a very early time when most engineers knew most of the code. As the code base grows more complex, by adding functionality towards the mature architecture, we need more precision and rigour in the implementation.

We appreciate that it is super hard for new engineers to identify technical debt as such. Therefore, I have provided detailed context why certain aspects are technical debt. As you grow more experienced and familiar with our best practises, it will become easier to spot tech debt. Nevertheless, I want to be clear about expectations: cleaning up tech debt in/around the code you touch as time permits. Always leave the code in a better state than you found it, especially wrt:

  • documentation in general (fixing outdated goDoc and adding missing docs)
  • error documentation in particular and error handling
  • code clarity and removing unnecessary abstraction layers
  • clearly apparent reasoning for correctness (either by the code itself or through added documentation)
  • API safety (is it hard to use the API incorrectly?)
  • test coverage

For quite some time, we will be helping you with spotting areas of tech debt and how to address them. If you see something that looks like tech debt and you don't know how to fix is, please ask - I loved your comment here. I can give you one recommendation: if you don't see anything to be improved at a piece of code (aside from implementing your issue) you should probably ask somebody for input. The code we touch tends to generally be older code (because the newer code often has already the desired features) and the older code is generally in dire need for cleaning up tech debt. Its frequent that engineers spend just as much time, frequently even longer, cleaning up tech debt than working on the actual issue. I think this will also be the case for the Pusher Engine.

engine/collection/pusher/engine.go Outdated Show resolved Hide resolved
asSCGMsg := asEngineWrapper.Payload.(*messages.SubmitCollectionGuarantee)
originID := asEngineWrapper.OriginID

_ = e.process(originID, asSCGMsg)
Copy link
Member

@AlexHentschel AlexHentschel Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with your comment that discarding any error is bad. Please take a look at our Coding Conventions, if you haven't already. Tldr:

  • just discarding any error: very bad
  • logging error and continuing: still bad (this is the pattern of continuing on best effort basis, which is not a viable approach for high-assurance software systems)
  • desired:
    • explicitly check error type and only continue on errors that are confirmed to be benign.
    • all other errors (outside of the specified benign code path) are fatal

The pusher.Engine is a bit of a simpler case, because it only distributes information to other nodes. So swallowing errors does not undermine safety as it would for many other Engines. BUT, we already had cases where nodes were logging OutOfMemory errors for hours, but the code just dropped those and the node was dysfunctional without the software noticing it ... so even if it is not safety critical, we still generally expect that errors are not swallowed (and neither logged and then just discarded).

There are exceptions to this rule, e.g. if it would be a massive time investment to clear up technical debt and thoroughly documenting possible benign errors. But then, a detailed comment is absolutely required why it is always fine to log and discard any errors in that particular case.

I would suggest to leave error handling for a bit and come back to it after some iterations. After having implemented some of the cleanup and simplifications I suggest further below, I suspect that error handling will become a lot more straightforward.

Comment on lines 61 to 64
messageHandler := engine.NewMessageHandler(log, notifier, engine.Pattern{
Match: engine.MatchType[*messages.SubmitCollectionGuarantee],
Store: &engine.FifoMessageStore{FifoQueue: inbound},
})
Copy link
Member

@AlexHentschel AlexHentschel Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to not use the MessageHandler here for the following reasons:

  • I think it is good to apply the patter of the MessageHandler at least once by hand ("learning by doing")
  • The MessageHandler adds a layer of abstraction, if we can do without it that would be preferable in my opinion.
  • In this particular case, the Engine only processes a single input type SubmitCollectionGuarantee. We have tried it before and the message handler is actually more code compared to just doing the same logic by hand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it did end up being less complicated - applied in f66ba69

Copy link
Member

@AlexHentschel AlexHentschel Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at the core processing logic in the Engine and lets try to remove layers of complexity and indirection where possible:

func (e *Engine) onSubmitCollectionGuarantee(originID flow.Identifier, req *messages.SubmitCollectionGuarantee) error {
if originID != e.me.NodeID() {
return fmt.Errorf("invalid remote request to submit collection guarantee (from=%x)", originID)
}
return e.SubmitCollectionGuarantee(&req.Guarantee)
}
// SubmitCollectionGuarantee submits the collection guarantee to all consensus nodes.
func (e *Engine) SubmitCollectionGuarantee(guarantee *flow.CollectionGuarantee) error {

Looking at this code raises a few questions:

  • we are only propagating collections that are originating from this node (👉 code). But the exported method SubmitCollectionGuarantee is not checking this? Why not?

    • If it is an important check, we should always check it. If we are fine with collections from other nodes being broadcast, then we don't need the check in onSubmitCollectionGuarantee either..
    • turns out, SubmitCollectionGuarantee is only ever called by the pusher.Engine itself (not even the tests call it).

    Lets simplify this: merge SubmitCollectionGuarantee and onSubmitCollectionGuarantee.

  • Further digging into the code, I found that the originID is just completely useless in this context:

    • the pusher.Engine is fed only from internal components (so originID is always filled with the node's own identifier)
    • the pusher.Engine drops messages that are not from itself. Just hypothetically, lets assume that there were collections with originID other than the node's own ID. ❗Note that those messages would be queued first, wasting resources and then we have a different thread discard the Collection after dequeueing. Checks like that should always be applied before queueing irrelevant messages.

    What you see here is a pattern we struggle with a lot: the code is very ambiguous and allows a whole bunch of stuff that is never intended to happen in practise. This is an unsuitable design for engineering complex systems such as Flow, as we then have to manually handle undesired cases.

    • Example 1: messages other than SubmitCollectionGuarantee. In comparison, if we had a method that only accepted SubmitCollectionGuarantee, the compiler would enforce this constraint for us as opposed to us having to write code. Often, people don't bother to manually disallow undesired inputs, resulting in software that is doing something incorrect in case of bugs as opposed to crashing (or the code not compiling in the first place)
    • Example 2: originID. We employ a generic patter which obfuscates that we don't need an originID in this particular example, because the Engine interface does not differentiate between node-external inputs (can be malicious) vs inputs from other components within its own node (always trusted) and mixes modes of processing (synchronous vs asynchronous).

    For those reasons, we deprecated the Engine interface. Lets try to update the pusher and remove the engine interface (for concrete steps towards this, please see my next comment).

For Flow, the APIs should generally be as descriptive and restrictive as possible and make it hard to feed in wrong / mismatching inputs. Writing extra lines of boilerplate code is not a challenge for Flow. In contrast, it is challenging if you have to dig through hundreds or thousands of lines of code because the APIs are too generic and you need to know implementation details to use them correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially adressed in b7166f3 and e25cba7: originID has been pretty much removed from the interface, now that Process and Submit (the functions in the Engine interface handling node-external inputs) simply error on any input.

Comment on lines 130 to 131
// SubmitLocal submits an event originating on the local node.
func (e *Engine) SubmitLocal(event interface{}) {
Copy link
Member

@AlexHentschel AlexHentschel Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an exercise, try to figure out which other components are feeing inputs into the pusher.Engine. We are lucky here because we have a rarely used message type SubmitCollectionGuarantee. But just imagine you would see a log that says no matching processor for message of type flow.Header from origin. With the Engine interface it is way too time intensive to figure out which components inside a node are feeding which other components with data.

Turns out that the Finalizer is the only component that is feeding the pusher.Engine (other than messages arriving from the networking layer). Let's make our lives easier in the future by improving the code:

  • introduce a dedicated interface GuaranteedCollectionPublisher. It only has one type-safe method
    SubmitCollectionGuarantee(guarantee *flow.CollectionGuarantee)
    this method just takes the message and puts it into the queue. Then we know that the queue will only ever contain elements of type flow.CollectionGuarantee. (currently, the Finalizer creates the wrapper messages.SubmitCollectionGuarantee around CollectionGuarantee ... and the pusher.Engine unwraps it and throws the wrapper away 😑 ).
  • change the Finalizer and all layers in between to work with the GuaranteedCollectionPublisher interface. This improves code clarity: currently, engineers would need to know what functionality was backing the generic Engine interface, e.g. here: because the right type of inputs need to be fed into the pusher.Engine. When using the GuaranteedCollectionPublisher interface we have the following benefits:
    • the type itself carries information about what functionality is behind this interface. Huge improvement of clarity.
    • its easy to find all places that can feed the Pusher with data, because we work with concrete interfaces and context-specific methods (e.g. SubmitCollectionGuarantee).
    • the compiler will reject code that feeds inputs with incompatible type into the Pusher
  • In my opinion, it would improve clarity if we were to discard all inputs from the network.MessageProcessor - conceptually they are messages from different node, which they are already broadcasting. The Pusher is for broadcasting our own messages. But at the moment, inputs from node-internal components as well as messages from other nodes passing through the networking layer share the same code path in the pusher.Enging, which makes it non-trivial to figure out where messages are coming from.

To summarize: one of the main benefits of the Component interface is that it provides lifecycles methods, but no methods for ingesting data - on purpose. This is because all data from components within the node should be passed via context-specific methods. The only exception is the interface network.MessageProcessor, which would still call Process with a generic message type.

Sorry for the lengthy comment. I hope you see how the very ambiguous structure of the Engine makes it really hard to determine from the code what we are expecting and where to drop messages, because its all hidden behind layers of ambiguous interfaces in a huge code base.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed comment :)

I added the described SubmitCollectionGuarantee method in f66ba69 , but the interface & changing the Finalizer will be pushed to the next PR.

Inputs from the network.MessageProcessor (via the Process method) are being discarded (e25cba7).

func New(log zerolog.Logger, net network.EngineRegistry, state protocol.State, engMetrics module.EngineMetrics, colMetrics module.CollectionMetrics, me module.Local, collections storage.Collections, transactions storage.Transactions) (*Engine, error) {
// TODO length observer metrics
inbound, err := fifoqueue.NewFifoQueue(1000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the queue capacity we should pick here. I think we should have a good argument (and document it!) of why the number we chose makes sense. If we don't want to go through the trouble of defining a sensible limit, I think we could make this clear by using an unbounded queue (implementation).

Anyway, I think a limit makes sense:

  • In a nutshell, a collection must reference a block and expires $\texttt{DefaultTransactionExpiry}$ = 600 blocks thereafter
    // Let E by the transaction expiry. If a transaction T specifies a reference
    // block R with height H, then T may be included in any block B where:
    // * R<-*B - meaning B has R as an ancestor, and
    // * R.height < B.height <= R.height+E
    const DefaultTransactionExpiry = 10 * 60
    (its a subtle mechanic, feel free to ignore)
  • consensus nodes produce about $\rho$ = 1.25 blocks per second
  • collector clusters produce up to $\varrho$ = 3 collections per second

So very very roughly, between a collection being first proposed (usually referencing the newest known block) to the collection expiring, the collector cluster will produce $\texttt{DefaultTransactionExpiry}\cdot \rho^{-1} \cdot \varrho = 1440$ collections. So if we queued 1500 collections max, then we would be relatively confident that collections have been broadcast before they expire.
The downside is that collections might sit in the queue for quite some time, if several hundreds are before them. That could be an argument of queuing only much fewer collections (e.g. 200), so collection that don't make it out fast are dropped.
Another argument we could make is actually using a LIFO queue. If the queue is nearly empty, there is little difference. However, if there is a large backlog of collections to be broadcast, I think its a valid argument of broadcasting the newest first, at least then, some collections make it in time as opposed to all of them being queued past their expiry. Just some thoughts.

@jordanschalm what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the choice of queue size here is unlikely to have significant impact:

  • All collection nodes can publish all finalized guarantees.
  • In the happy path, collection nodes are observing finalized collections at a very consistent rate over tiem. We can broadcast collection guarantees much more quickly than they are finalized, and expect the queue to be empty most of the time.
  • In the unhappy path where a collection node is finalizing collections more quickly than usual, it is behind and these are old collections -- we don't care about successfully broadcasting them anyway.

So if we queued 1500 collections max, then we would be relatively confident that collections have been broadcast before they expire.

If we actually had 1500 collections waiting in this queue, I would feel much more confident that there was a severe problem with that node's ability to publish messages, than that collections would be broadcasted before expiring. 😀

General thoughts:

  • I think we should prefer dropping from the queue than making sure we publish every message (choose a relatively small queue size). I'm happy with 200 (~1 minute of collections) or even a bit less.
  • I agree that a LIFO queue (stack) would be better suited. I don't think we should actually make that change now (premature optimization)

Rename `inboundMessageWorker` and `processInboundMessages` to `outbound`
and also propagate errors to the top level of the worker
where they can be thrown.
- Partially implement suggestion #6747 (comment)
  - Make `SubmitCollectionGuarantee` non-exported and rename to `publishCollectionGuarantee`
  - Add new `SubmitCollectionGuarantee` exported function that just adds to the queue
- Remove `messageHandler` field, instead directly add to queue
  from review: #6747 (comment)
- `OriginID`s no longer included in messages in the queue, and therefore not
   checked by the worker - if necessary they should be checked when Submitting
@tim-barry tim-barry changed the base branch from master to feature/pusher-engine-refactor November 27, 2024 17:58
tim-barry and others added 5 commits November 27, 2024 10:02
This reverts commit 0aadc13.
Instead of testing the internals, test the exported interface.
Rename queue and add length metrics for it, updating creation sites.
Co-authored-by: Alexander Hentschel <[email protected]>
@tim-barry tim-barry requested review from AlexHentschel and jordanschalm and removed request for durkmurder November 29, 2024 19:13
Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good - thank you for the revisions and for noting which changes will be addressed in the follow-up PR.

General suggestions:

  • Remove unused field colMetrics
  • There are a couple functions returning errors, that I don't think we will be changing in the follow-up PR. For these, we can include the error documentation here.

engine/collection/pusher/engine.go Outdated Show resolved Hide resolved
engine/collection/pusher/engine.go Outdated Show resolved Hide resolved
engine/collection/pusher/engine.go Outdated Show resolved Hide resolved
engine/collection/pusher/engine.go Outdated Show resolved Hide resolved
Comment on lines 184 to 185
func (e *Engine) SubmitCollectionGuarantee(msg *messages.SubmitCollectionGuarantee) {
ok := e.queue.Push(msg)
Copy link
Member

@jordanschalm jordanschalm Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we were using a generic function (eg. SubmitLocal(message any)), the type wrapper (*messages.SubmitCollectionGuarantee) was useful because it communicated the meaning of the message ("this collection guarantee should be broadcast").

When we have a context-specific method for enqueueing collection guarantees, we no longer need the type wrapper, and can simply pass *CollectionGuarantee directly: the method communicates the meaning of the message, not the argument type.

(Let's apply this change in the follow-up PR.)


// SubmitCollectionGuarantee submits the collection guarantee to all consensus nodes.
func (e *Engine) SubmitCollectionGuarantee(guarantee *flow.CollectionGuarantee) error {
// publishCollectionGuarantee publishes the collection guarantee to all consensus nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add error return documentation to this function. If it might return an expected sentinel error (including from sub-calls into Identities() and Publish()), we should list those possible error returns. Otherwise, we should add the line // No errors expected during normal operation.

If you haven't already, take a look at the godoc references for Identities() and Publish() and think about what errors they might return, and which of those are expected vs unexpected in the context of executing publishCollectionGuarantee.

In particular, Identities() does have a listed error return value -- state.ErrUnknownSnapshotReference -- but in the context here, it is not expected. Why is that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Identities(), I believe since we're using the Final() (latest finalized in cache) snapshot, we expect to already have a valid snapshot reference / queryable block in the state, so expect Identities() on it to never error.

From what I can tell Publish() only returns an error if the conduit channel has been closed (in which case I would assume we can't continue), or other "benign" errors when the constructed message is invalid in some way (pretty sure this is not expected) or some failure in the underlying libp2p (I assume also not expected).

func (e *Engine) Done() <-chan struct{} {
return e.unit.Done()
// processOutboundMessages processes any available messages from the queue.
// Only returns when the queue is empty (or the engine is terminated).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add error documentation here. Either enumerate possible error types, or add // No errors expected during normal operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't expect any errors here, because the only sources are wrong-typed items in the queue (shouldn't happen since we only ever insert the correct type) or error in publishCollectionGuarantee, which is not expected.

module/metrics/labels.go Outdated Show resolved Hide resolved
tim-barry and others added 4 commits November 29, 2024 16:10
Doc comment changes, metrics naming, and queue length.
For reasoning behind chosen queue length, see #6747 (comment)

Co-authored-by: Jordan Schalm <[email protected]>
Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎸

Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ver nice code 👏 . Only stylistic suggestions (all optional).

engine/collection/pusher/engine.go Outdated Show resolved Hide resolved
engine/collection/pusher/engine.go Show resolved Hide resolved
engine/collection/pusher/engine.go Outdated Show resolved Hide resolved
engine/collection/pusher/engine.go Show resolved Hide resolved
@tim-barry tim-barry merged commit 7e4258a into feature/pusher-engine-refactor Dec 4, 2024
55 checks passed
@tim-barry tim-barry deleted the tim/7018-pusher-engine-use-componentmanager branch December 4, 2024 01:12
tim-barry added a commit that referenced this pull request Dec 4, 2024
- Remove pusher engine implementation of network.Engine
  - Replace with network.MessageProcessor
  - See: #6747 (comment)
- Remove SubmitCollectionGuarantee message type
  - Was only used between Finalizer and Pusher engine
  - New interface passes and stores collection guarantees directly,
    instead of wrapping and then unwrapping them
  - See: #6747 (comment)
- Add GuaranteedCollectionPublisher interface, implemented by pusher engine
  - Only used by the Finalizer (and intermediate constructors)
  - Mocks are generated for it, used in Finalizer unit tests
  - See: #6747 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants