This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

Remove workers from committee #670

Merged
arun-koshy merged 10 commits into main from worker on Aug 23, 2022

Conversation

arun-koshy
Contributor

@arun-koshy arun-koshy commented Aug 3, 2022

TLDR: This PR moves the worker information out of the Committee struct and into its own WorkerCache struct.

This is part of a larger effort to change how worker discoverability and communication occur within NW. Discussion can be found here for more details.

The worker information is bootstrapped from a new workers.json file into memory in a WorkerCache struct. This is mostly to keep the changes from the current behavior as minimal as possible; how the worker information is bootstrapped may end up differing from how the committee is bootstrapped. The end goal is that the worker information is always retrieved from the primary first, until that primary is unreachable or an epoch change occurs, so the only information that has to be provided during the bootstrapping phase is the worker information of the primary that is being started.
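
For illustration, here is a minimal sketch of loading a workers.json file into an in-memory cache. The WorkerInfo/WorkerCache shapes, the string-keyed maps, and the file layout are assumptions made for the example, not the actual Narwhal config schema:

```rust
use std::collections::BTreeMap;

use serde::Deserialize; // assumes the serde and serde_json crates

// Illustrative types only; the real WorkerCache may carry more fields
// (e.g. separate transaction and worker-to-worker addresses).
#[derive(Deserialize, Debug)]
struct WorkerInfo {
    // Network address the worker listens on.
    worker_address: String,
}

#[derive(Deserialize, Debug)]
struct WorkerCache {
    // authority public key (as a string) -> worker id -> worker info
    workers: BTreeMap<String, BTreeMap<u32, WorkerInfo>>,
}

// Bootstrap the cache from disk once at startup, as the description outlines.
fn load_worker_cache(path: &str) -> Result<WorkerCache, Box<dyn std::error::Error>> {
    let contents = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&contents)?)
}
```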

This PR refactors every place where workers were retrieved from the Committee struct so that they are now retrieved from the WorkerCache. In upcoming PRs, the code to update the WorkerCache will be added. The protocol for this, as shared initially by @huitseeker, will be as follows (a rough sketch follows the list):

  1. The worker W_A from primary A comes to primary B and asks to be assigned a peer.
  2. Primary B answers with the networking information of the worker W_B it chooses to assign to worker W_A.
  3. Worker W_A memorizes in its WorkerCache who this peer W_B is for that authority, and uses it by default for all further requests (which replaces the by-index committee lookups currently going on).
  4. Workers can (from a type PoV) answer any request they receive from workers with a "not able to serve you" error. When that happens, the recipient of the error repeats steps 1-3.
  5. On an epoch change the WorkerCache will be cleared and steps 1-3 will naturally be repeated.
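
Here is a rough sketch of the cache behaviour implied by steps 1-5, assuming a simple map from authority to assigned peer; PeerWorkerCache, ask_primary_for_peer, and the type aliases are illustrative stand-ins, not Narwhal APIs:

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the real key and address types.
type PublicKey = String;
type WorkerAddress = String;

struct PeerWorkerCache {
    assigned: HashMap<PublicKey, WorkerAddress>,
}

impl PeerWorkerCache {
    // Steps 1-3: look up the remembered peer for an authority, falling back to
    // asking that authority's primary when no entry exists yet.
    fn peer_for(
        &mut self,
        authority: &PublicKey,
        ask_primary_for_peer: impl Fn(&PublicKey) -> WorkerAddress,
    ) -> WorkerAddress {
        self.assigned
            .entry(authority.clone())
            // The primary of `authority` picks one of its workers for us.
            .or_insert_with(|| ask_primary_for_peer(authority))
            // Remember the assignment and reuse it for later requests.
            .clone()
    }

    // Steps 4-5: drop an entry after a "not able to serve you" error, or clear
    // everything on an epoch change, so the next call re-runs steps 1-3.
    fn forget(&mut self, authority: &PublicKey) {
        self.assigned.remove(authority);
    }
}
```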

What has been tested

  • Unit/Integration Tests
  • Local & Remote Benchmark
  • Client Demo
  • Docker configurations
  • Sui Integration

Future PRs:

  • Update WorkerCache by contacting primaries
  • Evaluate what changes are required for reconfiguration

@arun-koshy arun-koshy marked this pull request as ready for review August 4, 2022 04:33
Contributor

@akichidis akichidis left a comment


Changes look good @arun-koshy, thanks for doing this! I would like to have one more look at it. A couple of things:

  • The changes will break the docker compose auto-generate function, e.g. the gen.committee.py script. Those should be modified as well - if not as part of this PR, let's open an issue to address it in a separate PR.
  • What's the intention regarding the WorkerCache? Will this be the way to retrieve the workers across the code? What are the cases where the cache will be populated? Any sort of direction - a high-level description - would help so we can understand what design decisions we are going to make for the epic.

@arun-koshy
Contributor Author

@akichidis thanks for taking a look at the PR!

The changes will break the docker compose auto-generate function, e.g. the gen.committee.py script. Those should be modified as well - if not as part of this PR, let's open an issue to address it in a separate PR.

I have modified the Docker files and added @allan-bailey for review of said changes. Please let me know if you can think of anything else I missed testing.

What's the intention regarding the WorkerCache? Will this be the way to retrieve the workers across the code? What are the cases where the cache will be populated? Any sort of direction - a high-level description - would help so we can understand what design decisions we are going to make for the epic.

I've edited the description of the PR with some more information. Let me know if we still need more and we can sync offline to discuss this further.

@arun-koshy arun-koshy requested a review from allan-bailey August 9, 2022 00:09
@akichidis
Contributor


Thanks @arun-koshy for the extra info about this. So it seems to me that the WorkerCache will play a few roles in the code base and, depending on the component that is using it (either a primary or a worker), it might even hold different information? For example, a worker that contacts a peer primary to be assigned to a worker will hold in its cache something like "my worker peers" info, whereas in the primary the WorkerCache will hold the networking info of its own workers. If the above separation is valid then I would probably look into splitting the semantics of the cache. That being said, I don't have a strong argument at this stage, so I'll wait to see the upcoming PR(s).

@akichidis
Contributor

@arun-koshy I see that there are probably quite a few changes in main, could you please rebase? I am happy to approve but let's rebase first (and probably have @allan-bailey review the docker changes).

@arun-koshy
Contributor Author

So it seems to me that the WorkerCache will play a few roles in the code base and, depending on the component that is using it (either a primary or a worker), it might even hold different information? For example, a worker that contacts a peer primary to be assigned to a worker will hold in its cache something like "my worker peers" info, whereas in the primary the WorkerCache will hold the networking info of its own workers. If the above separation is valid then I would probably look into splitting the semantics of the cache.

This PR introduced a WorkerCache that would be shared as in the diagram below.

[Diagram: WorkerCache-SharedWorkerCache.drawio]

I think what you are describing would look something like the diagram below?

[Diagram: WorkerCache-SplitWorkerCache.drawio]

While the separation is valid, I think the biggest issue with this route is the complexity it adds for reconfiguration. I think we need to keep the SharedWorkerCache instance so that a reconfiguration signal can be sent to one place and have the cache updated for every component that needs to access it. This PR essentially duplicates exactly how the SharedCommittee works for the SharedWorkerCache. However, I think even that might introduce some additional complexity for reconfiguration that needs to be discussed further.
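
For reference, a tiny sketch of the shared-handle idea being described: one handle is cloned into every component, and a single swap updates what all of them see. Whether the actual SharedWorkerCache is ArcSwap-based (as assumed here) or guarded by a lock is an assumption of this sketch, not a claim about the real code:

```rust
use std::sync::Arc;

use arc_swap::ArcSwap; // assumes the arc-swap crate

// Hypothetical stand-in for the real WorkerCache struct.
#[derive(Default)]
struct WorkerCache {
    // authority -> worker addresses, elided for the sketch
}

// One shared handle, cloned into every component that needs worker info.
type SharedWorkerCache = Arc<ArcSwap<WorkerCache>>;

fn new_shared(cache: WorkerCache) -> SharedWorkerCache {
    Arc::new(ArcSwap::from(Arc::new(cache)))
}

fn on_reconfigure(shared: &SharedWorkerCache, new_cache: WorkerCache) {
    // One place performs the swap...
    shared.store(Arc::new(new_cache));
}

fn some_component(shared: &SharedWorkerCache) {
    // ...and every component holding a clone of the handle sees the new value
    // on its next load, without being individually notified.
    let _current = shared.load();
}
```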

cc @asonnino

@asonnino
Contributor

Nice! Will this plan also work when we start scaling by running workers on machines separate from the primary? We will then have to boot workers without the assistance of the primary (as the primary will be on another machine).

@akichidis
Contributor

While the separation is valid, I think the biggest issue with this route is the complexity it adds for reconfiguration. I think we need to keep the SharedWorkerCache instance so that a reconfiguration signal can be sent to one place and have the cache updated for every component that needs to access it. This PR essentially duplicates exactly how the SharedCommittee works for the SharedWorkerCache. However, I think even that might introduce some additional complexity for reconfiguration that needs to be discussed further.

Thanks for the reply and the awesome design graphs! Exactly, I was thinking something closer to the second design you posted, i.e. the one with the MyWorkerCache and the PeerWorkerCache. Additionally, the MyWorkerCache (the primary's) may also need to hold the mapping of which "other" workers have been allocated to "my" workers (so we can keep track, do some load balancing, etc.). I mean, I see different concerns between the cache of a primary and that of the workers here.

Regarding the SharedWorkerCache, I definitely see the value in it. What I want to clarify though is that the cache will be shared between the components within a node (e.g. a primary) itself and not across nodes (the primary and its own workers), right? Meaning that network calls between a primary and its own workers will be needed to update the cache entries accordingly, to allow us to deploy the nodes independently. Is that the plan?

Again, I don't want to distract you from your design/plans, but those are thoughts that came to me while going through the PR.

Just one thing regarding the SharedCommittee: I believe that's something we eventually want to get away from, as @asonnino has introduced functionality to support updating both for a NewEpoch and an UpdateCommittee (only when networking information changes). I'm not saying that we need to follow this approach for what we need here, but just making sure you are aware of the SharedCommittee plans.

Contributor

@huitseeker huitseeker left a comment


Thanks @arun-koshy for the epic PR! It's a meaningful step towards making the worker information more dynamic. I think it is too soon, in this specific PR, to ponder how worker bootstrap will work when we vary the number of workers - an otherwise interesting question. I left a few comments inline. But it seems we have two main matters at hand:

  1. how does this work with reconfiguration?
  2. where does each worker's information live (as discussed, with diagrams, above between @arun-koshy and @akichidis)?

I think it's worthwhile to discuss 1. because it actually enlightens 2. To crack this nut, I think we need to look at signaling as it is today, and at the invariants we'll want to rely on tomorrow:

Signaling

In a nutshell here's the approach to state updates taken by this PR:

I think we need to keep the SharedWorkerCache instance such that a reconfiguration signal can be sent to one place and have this cache be updated for every component that would need to access it. This PR essentially duplicated exactly how the SharedCommittee works for the SharedWorkerCache.

Except the SharedCommittee was a pipe dream of mine (introduced in #295 and updated in #450) that never quite worked because it was insufficient: @asonnino's PRs (#489 #626) astutely noticed that in some cases you need to perform more actions than just update the Committee in case of a reconfig/epoch change, and introduced a watch-channel based logic. The starting point of this signal is the primary's StateHandler, which feeds into all the primary components (look for the rx_reconfigure channels being polled in our select loops), and lands on the worker at the Synchronizer, which itself propagates the update onwards in the worker:
https://gist.github.com/huitseeker/149a4f9fe805bc4f13446cc971607f65

(Notice the update of the address -> client tables in the worker.)
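
To make the signaling path concrete, here is a generic sketch of the pattern described above: a tokio watch receiver polled in a component's select loop. The Reconfigure enum and the branch contents are illustrative assumptions, not the actual ReconfigureNotification handling:

```rust
use tokio::sync::watch;

// Illustrative message type; the real one is richer (e.g. carries the new Committee).
#[derive(Clone)]
enum Reconfigure {
    NewEpoch(u64),
    Shutdown,
}

// Each component keeps an rx_reconfigure receiver and polls it alongside its
// regular work - this is the "rx_reconfigure channels being polled in our
// select loops" pattern.
async fn component_loop(mut rx_reconfigure: watch::Receiver<Reconfigure>) {
    loop {
        tokio::select! {
            // ... other branches handling the component's normal messages ...
            result = rx_reconfigure.changed() => {
                if result.is_err() {
                    return; // the sender (e.g. the StateHandler) went away
                }
                let message = rx_reconfigure.borrow().clone();
                match message {
                    // Refresh epoch-dependent state (committee, caches, ...).
                    Reconfigure::NewEpoch(_epoch) => {}
                    Reconfigure::Shutdown => return,
                }
            }
        }
    }
}
```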

Invariants

  • everyone must have a fresh copy of all the primaries' public keys at all times (and making sure this stays fresh is the point of reconfiguration),
  • we can assume that at all times the primary knows about its workers, that the workers know about their primaries, and that this invariant survives epoch changes if the primary does,
  • eventually, we want the reconfiguration signal to only contain information about the primaries - the dynamic discoverability of workers by contacting their primary should take care of the updates.

What to do about it?

OK, so where is the current PR potentially incorrect? Everywhere we have a WorkerCache that's not a SharedWorkerCache:
https://gist.github.com/huitseeker/6a57822b2e6fdb8b2d2b07fd1e538f99
(some of those may be illegitimate: do we need and use every WorkerCache above?)

As discussed offline and above, I think we can do one of the following:

  • I don't think it's an emergency to have a separate data structure to store the workers of one's own primary as part of this PR. But it's useful to armor the cleanup function in the {primary, worker} network so that it can't delete those specific workers.
  • add the WorkerCache information to the PrimaryReconfiguration messages, and have the processing of that message update the cache. The advantage is that this is straightforward to implement; the disadvantage is that most of this code is code we would later replace with the dynamic dereferencing of workers at their primary.
  • lean into the singleton approach (that singleton /may/ be the SharedWorkerCache), make sure that the singleton value is updated in the two places where it must be (one in the primary, such as the StateHandler, one in the worker, downstream of the Synchronizer), and then make sure everyone uses it correctly: access to a worker checks that the primary is in the current (updated) Committee, then hits the singleton, and pops the address (a rough sketch of that access pattern follows this list). The advantage is that when we switch to dynamically dereferencing workers, the code that changes is then encapsulated behind the singleton.
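
As an illustration of the third option's access pattern (committee check first, then the cache), here is a sketch with placeholder types; none of these names are claimed to match the real primary/worker code:

```rust
use std::collections::BTreeMap;

// All names here are illustrative stand-ins, not the actual Narwhal types.
type PublicKey = String;
type Multiaddr = String;

struct Committee {
    authorities: BTreeMap<PublicKey, ()>,
}

struct WorkerCache {
    workers: BTreeMap<PublicKey, BTreeMap<u32, Multiaddr>>,
}

// Validate the authority against the current committee first, and only then
// consult the (singleton) worker cache for the address.
fn worker_address(
    committee: &Committee,
    cache: &WorkerCache,
    authority: &PublicKey,
    worker_id: u32,
) -> Option<Multiaddr> {
    if !committee.authorities.contains_key(authority) {
        // Stale or unknown authority: don't trust any cached worker entry.
        return None;
    }
    cache
        .workers
        .get(authority)
        .and_then(|workers| workers.get(&worker_id))
        .cloned()
}
```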

Files with review threads (resolved): Docker/README.md, test_utils/src/lib.rs, node/src/generate_format.rs, config/src/lib.rs, types/src/primary.rs
@huitseeker
Contributor

One additional thing: it would be great to extend the snapshot tests that signal whether we break our configuration files, adding one that checks whether we are breaking the new worker info configuration file you're introducing.

@arun-koshy
Contributor Author

Thank you for the review! @akichidis @asonnino @huitseeker 🙇🏽

@huitseeker has thoroughly answered the questions raised about the approach we should take for where the worker information will live and how it will be managed. The approach I have taken is to lean into the singleton approach for the WorkerCache, and in doing so I have simplified where and how reconfiguration should happen for the WorkerCache.

Pending items for future PRs

I don't think it's an emergency to have a separate data structure to store the workers of one's own primary as part of this PR. But it's useful to armor the cleanup function in the {primary, worker} network so that it can't delete those specific workers.

I have added TODOs to address this in the network layer in a separate PR.

Contributor

@huitseeker huitseeker left a comment


Thanks for the humongous amount of work @arun-koshy! I love the attention to detail, and the fixes to our benches & configs.

This LGTM (modulo a rebase)! This is a breaking change; @arun-koshy please scout when this gets shipped to Sui (through a NW pointer update there) and announce it in the appropriate channels (this in particular will need edits to prod config files).

I left comments inline, here are the main ones:

  • I think we should log more profusely when, as we process an epoch change, we insert a primary for which we have no worker information (a rough sketch of such a check follows this list). We discussed some worker info arriving with the reconfiguration message (i.e. ReconfigureNotification::NewEpoch should be extended to contain more than the Committee). Right now, it does not. This would break reconfiguration if we were really changing worker info through it. I think it's OK for now (dynamic worker discoverability will make this issue disappear) if we can have visibility on this, plus an issue to track fixing it. /cc @asonnino (the relevant files are primary/state_handler.rs and worker/synchronizer.rs)
  • the network_diff logic is brittle and should be replaced with something that does per-interface diffs, then unifies them. We should open an issue to track this.
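
As a sketch of the kind of logging the first bullet asks for - the function name and the cache shape are hypothetical, not the actual state_handler.rs or synchronizer.rs code:

```rust
use std::collections::BTreeMap;

use tracing::warn; // assumes the tracing crate already used in the codebase

type PublicKey = String;

// On an epoch change, loudly flag any authority in the new committee for which
// we hold no worker information, since requests to its workers would fail.
fn warn_on_missing_worker_info(
    new_committee_keys: &[PublicKey],
    worker_cache: &BTreeMap<PublicKey, BTreeMap<u32, String>>,
) {
    for authority in new_committee_keys {
        if !worker_cache.contains_key(authority) {
            warn!(
                "no worker information for authority {} after epoch change; \
                 requests to its workers will fail until worker info is updated",
                authority
            );
        }
    }
}
```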

Files with review threads: Docker/scripts/gen.workers.py, config/src/lib.rs, node/src/generate_format.rs, primary/src/state_handler.rs, worker/src/synchronizer.rs, worker/src/quorum_waiter.rs, primary/src/block_waiter.rs
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 22, 2022
In the hope of fixing the benchmarks
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 22, 2022
In the hope of fixing the benchmarks
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 23, 2022
In the hope of fixing the benchmarks
@arun-koshy arun-koshy merged commit 3b1a090 into main Aug 23, 2022
@arun-koshy arun-koshy deleted the worker branch August 23, 2022 07:30
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
* Remove workers from committee

* Fix local benchmark, client demo & tests

* Fix remote benchmark

* revert settings.json

* Fix docker compose configurations

* rebase

* Use singleton cache & update reconfiguration

* Update todo comments

* Fix license check

* Address review comments