This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

Remove workers from committee #670

Merged
arun-koshy merged 10 commits into main from worker on Aug 23, 2022

Conversation

arun-koshy
Contributor

@arun-koshy arun-koshy commented Aug 3, 2022

TLDR: This PR moves the worker information out of the Committee struct and into its own WorkerCache struct.

This is part of a larger effort to change how worker discoverability and communication occur within NW. Discussion can be found here for more details.

The worker information is bootstrapped from a new workers.json file into memory in a WorkerCache struct. This is mostly to keep the changes from the current behavior as minimal as possible; how the worker information is bootstrapped may end up differing from how the committee is bootstrapped. The end goal is that the worker information is always retrieved from the primary first, until that primary is unreachable or an epoch change occurs, so the only information that has to be provided during the bootstrapping phase is the worker information of the primary that is being started.
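
For illustration, here is a minimal sketch of loading a workers.json file into an in-memory cache. The WorkerInfo/WorkerCache shapes, the string-keyed maps, and the file layout are assumptions made for the example, not the actual Narwhal config schema:

```rust
use std::collections::BTreeMap;

use serde::Deserialize; // assumes the serde and serde_json crates

// Illustrative types only; the real WorkerCache may carry more fields
// (e.g. separate transaction and worker-to-worker addresses).
#[derive(Deserialize, Debug)]
struct WorkerInfo {
    // Network address the worker listens on.
    worker_address: String,
}

#[derive(Deserialize, Debug)]
struct WorkerCache {
    // authority public key (as a string) -> worker id -> worker info
    workers: BTreeMap<String, BTreeMap<u32, WorkerInfo>>,
}

// Bootstrap the cache from disk once at startup, as the description outlines.
fn load_worker_cache(path: &str) -> Result<WorkerCache, Box<dyn std::error::Error>> {
    let contents = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&contents)?)
}
```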

This PR refactors every place where workers were retrieved from the Committee struct so that they are now retrieved from the WorkerCache. In upcoming PRs, the code to update the WorkerCache will be added. The protocol for this, as shared initially by @huitseeker, will be as follows (a rough sketch follows the list):

  1. The worker W_A from primary A comes to primary B and asks to be assigned a peer.
  2. Primary B answers with the networking information of the worker W_B it chooses to assign to worker W_A.
  3. Worker W_A memorizes in its WorkerCache who this peer W_B is for that authority, and uses it by default for all further requests (which replaces the by-index committee lookups currently going on).
  4. Workers can (from a type PoV) answer any request they receive from workers with a "not able to serve you" error. When that happens, the recipient of the error repeats steps 1-3.
  5. On an epoch change the WorkerCache will be cleared and steps 1-3 will naturally be repeated.
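
Here is a rough sketch of the cache behaviour implied by steps 1-5, assuming a simple map from authority to assigned peer; PeerWorkerCache, ask_primary_for_peer, and the type aliases are illustrative stand-ins, not Narwhal APIs:

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the real key and address types.
type PublicKey = String;
type WorkerAddress = String;

struct PeerWorkerCache {
    assigned: HashMap<PublicKey, WorkerAddress>,
}

impl PeerWorkerCache {
    // Steps 1-3: look up the remembered peer for an authority, falling back to
    // asking that authority's primary when no entry exists yet.
    fn peer_for(
        &mut self,
        authority: &PublicKey,
        ask_primary_for_peer: impl Fn(&PublicKey) -> WorkerAddress,
    ) -> WorkerAddress {
        self.assigned
            .entry(authority.clone())
            // The primary of `authority` picks one of its workers for us.
            .or_insert_with(|| ask_primary_for_peer(authority))
            // Remember the assignment and reuse it for later requests.
            .clone()
    }

    // Steps 4-5: drop an entry after a "not able to serve you" error, or clear
    // everything on an epoch change, so the next call re-runs steps 1-3.
    fn forget(&mut self, authority: &PublicKey) {
        self.assigned.remove(authority);
    }
}
```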

What has been tested

  • Unit/Integration Tests
  • Local & Remote Benchmark
  • Client Demo
  • Docker configurations
  • Sui Integration

Future PRs:

  • Update WorkerCache by contacting primaries
  • Evaluate what changes are required for reconfiguration

@arun-koshy arun-koshy marked this pull request as ready for review August 4, 2022 04:33
Contributor

@akichidis akichidis left a comment


Changes look good @arun-koshy, thanks for doing this! I would like to have one more look at it. A couple of things:

  • The changes will break the docker compose auto-generate function, e.g. the gen.committee.py script. Those should be modified as well - if not as part of this PR, let's open an issue to address it in a separate PR.
  • What's the intention regarding the WorkerCache? Will this be the way to retrieve the workers across the code? What are the cases where the cache will be populated? Any sort of direction - a high-level description - would help so we can understand what design decisions we are going to make for the epic.

@arun-koshy
Contributor Author

@akichidis thanks for taking a look at the PR!

The changes will break the docker compose auto-generate function, e.g. the gen.committee.py script. Those should be modified as well - if not as part of this PR, let's open an issue to address it in a separate PR.

I have modified the Docker files and added @allan-bailey for review of said changes. Please let me know if you can think of anything else I missed testing.

What's the intention regarding the WorkerCache? Will this be the way to retrieve the workers across the code? What are the cases where the cache will be populated? Any sort of direction - a high-level description - would help so we can understand what design decisions we are going to make for the epic.

I've edited the description of the PR with some more information. Let me know if we still need more and we can sync offline to discuss this further.

@arun-koshy arun-koshy requested a review from allan-bailey August 9, 2022 00:09
@akichidis
Contributor


Thanks @arun-koshy for the extra info about this. So it seems to me that the WorkerCache will play a few roles in the code base and, depending on the component that is using it (either a primary or a worker), it might even hold different information? For example, a worker that contacts a peer primary to be assigned to a worker will hold in its cache something like "my worker peers" info, whereas in the primary the WorkerCache will hold the networking info of its own workers. If the above separation is valid then I would probably look into splitting the semantics of the cache. That being said, I don't have a strong argument at this stage, so I'll wait to see the upcoming PR(s).

@akichidis
Contributor

@arun-koshy I see that there are probably quite a few changes in main, could you please rebase? I am happy to approve but let's rebase first (and probably have @allan-bailey review the docker changes).

@arun-koshy
Contributor Author

So it seems to me that the WorkerCache will play a few roles in the code base and, depending on the component that is using it (either a primary or a worker), it might even hold different information? For example, a worker that contacts a peer primary to be assigned to a worker will hold in its cache something like "my worker peers" info, whereas in the primary the WorkerCache will hold the networking info of its own workers. If the above separation is valid then I would probably look into splitting the semantics of the cache.

This PR introduced a WorkerCache that would be shared as in the diagram below.

[Diagram: WorkerCache-SharedWorkerCache.drawio]

I think what you are describing would look something like the diagram below?

[Diagram: WorkerCache-SplitWorkerCache.drawio]

While the separation is valid, I think the biggest issue with this route is the complexity it adds for reconfiguration. I think we need to keep the SharedWorkerCache instance so that a reconfiguration signal can be sent to one place and have the cache updated for every component that needs to access it. This PR essentially duplicates exactly how the SharedCommittee works for the SharedWorkerCache. However, I think even that might introduce some additional complexity for reconfiguration that needs to be discussed further.
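
For reference, a tiny sketch of the shared-handle idea being described: one handle is cloned into every component, and a single swap updates what all of them see. Whether the actual SharedWorkerCache is ArcSwap-based (as assumed here) or guarded by a lock is an assumption of this sketch, not a claim about the real code:

```rust
use std::sync::Arc;

use arc_swap::ArcSwap; // assumes the arc-swap crate

// Hypothetical stand-in for the real WorkerCache struct.
#[derive(Default)]
struct WorkerCache {
    // authority -> worker addresses, elided for the sketch
}

// One shared handle, cloned into every component that needs worker info.
type SharedWorkerCache = Arc<ArcSwap<WorkerCache>>;

fn new_shared(cache: WorkerCache) -> SharedWorkerCache {
    Arc::new(ArcSwap::from(Arc::new(cache)))
}

fn on_reconfigure(shared: &SharedWorkerCache, new_cache: WorkerCache) {
    // One place performs the swap...
    shared.store(Arc::new(new_cache));
}

fn some_component(shared: &SharedWorkerCache) {
    // ...and every component holding a clone of the handle sees the new value
    // on its next load, without being individually notified.
    let _current = shared.load();
}
```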

cc @asonnino

@asonnino
Contributor

Nice! Will this plan also work when we start scaling by running workers on machines separate from the primary? We will then have to boot workers without the assistance of the primary (as the primary will be on another machine).

@akichidis
Contributor

While the separation is valid, I think the biggest issue with this route is the complexity it adds for reconfiguration. I think we need to keep the SharedWorkerCache instance so that a reconfiguration signal can be sent to one place and have the cache updated for every component that needs to access it. This PR essentially duplicates exactly how the SharedCommittee works for the SharedWorkerCache. However, I think even that might introduce some additional complexity for reconfiguration that needs to be discussed further.

Thanks for the reply and the awesome design graphs! Exactly, I was thinking something closer to the second design you posted, i.e. the one with the MyWorkerCache and the PeerWorkerCache. Additionally, the MyWorkerCache (the primary's) may also need to hold the mapping of which "other" workers have been allocated to "my" workers (so we can keep track, do some load balancing, etc.). I mean, I see different concerns between the cache of a primary and that of the workers here.

Regarding the SharedWorkerCache, I definitely see the value in it. What I want to clarify though is that the cache will be shared between the components within a node (e.g. a primary) itself and not across nodes (the primary and its own workers), right? Meaning that network calls between a primary and its own workers will be needed to update the cache entries accordingly, to allow us to deploy the nodes independently. Is that the plan?

Again, I don't want to distract you from your design/plans, but those are thoughts that came to me while going through the PR.

Just one thing regarding the SharedCommittee: I believe that's something we eventually want to get away from, as @asonnino has introduced functionality to support updating both for a NewEpoch and an UpdateCommittee (only when networking information changes). I'm not saying that we need to follow this approach for what we need here, but just making sure you are aware of the SharedCommittee plans.

Contributor

@huitseeker huitseeker left a comment


Thanks @arun-koshy for the epic PR! It's a meaningful step towards making the worker information more dynamic. I think it is too soon, in this specific PR, to ponder how worker bootstrap will work when we vary the number of workers - an otherwise interesting question. I left a few comments inline. But it seems we have two main matters at hand:

  1. how does this work with reconfiguration?
  2. where does each worker's information live (as discussed, with diagrams, above between @arun-koshy and @akichidis)?

I think it's worthwhile to discuss 1. because it actually enlightens 2. To crack this nut, I think we need to look at signaling as it is today, and at the invariants we'll want to rely on tomorrow:

Signaling

In a nutshell here's the approach to state updates taken by this PR:

I think we need to keep the SharedWorkerCache instance such that a reconfiguration signal can be sent to one place and have this cache be updated for every component that would need to access it. This PR essentially duplicated exactly how the SharedCommittee works for the SharedWorkerCache.

Except the SharedCommittee was a pipe dream of mine (introduced in #295 and updated in #450) that never quite worked because it was insufficient: @asonnino's PRs (#489 #626) astutely noticed that in some cases you need to perform more actions than just update the Committee in case of a reconfig/epoch change, and introduced a watch-channel based logic. The starting point of this signal is the primary's StateHandler, which feeds into all the primary components (look for the rx_reconfigure channels being polled in our select loops), and lands on the worker at the Synchronizer, which itself propagates the update onwards in the worker:
https://gist.github.com/huitseeker/149a4f9fe805bc4f13446cc971607f65

(Notice the update of the address -> client tables in the worker.)
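
To make the signaling path concrete, here is a generic sketch of the pattern described above: a tokio watch receiver polled in a component's select loop. The Reconfigure enum and the branch contents are illustrative assumptions, not the actual ReconfigureNotification handling:

```rust
use tokio::sync::watch;

// Illustrative message type; the real one is richer (e.g. carries the new Committee).
#[derive(Clone)]
enum Reconfigure {
    NewEpoch(u64),
    Shutdown,
}

// Each component keeps an rx_reconfigure receiver and polls it alongside its
// regular work - this is the "rx_reconfigure channels being polled in our
// select loops" pattern.
async fn component_loop(mut rx_reconfigure: watch::Receiver<Reconfigure>) {
    loop {
        tokio::select! {
            // ... other branches handling the component's normal messages ...
            result = rx_reconfigure.changed() => {
                if result.is_err() {
                    return; // the sender (e.g. the StateHandler) went away
                }
                let message = rx_reconfigure.borrow().clone();
                match message {
                    // Refresh epoch-dependent state (committee, caches, ...).
                    Reconfigure::NewEpoch(_epoch) => {}
                    Reconfigure::Shutdown => return,
                }
            }
        }
    }
}
```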

Invariants

  • everyone must have a fresh copy of all the primaries' public keys at all times (and making sure this stays fresh is the point of reconfiguration),
  • we can assume that at all times the primary knows about its workers, that the workers know about their primaries, and that this invariant survives epoch changes if the primary does,
  • eventually, we want the reconfiguration signal to only contain information about the primaries - the dynamic discoverability of workers by contacting their primary should take care of the updates.

What to do about it?

OK, so where is the current PR potentially incorrect? Everywhere we have a WorkerCache that's not a SharedWorkerCache:
https://gist.github.com/huitseeker/6a57822b2e6fdb8b2d2b07fd1e538f99
(some of those may be illegitimate: do we need and use every WorkerCache above?)

As discussed offline and above, I think we can do one of the following:

  • I don't think it's an emergency to have a separate data structure to store the workers of one's own primary as part of this PR. But it's useful to armor the cleanup function in the {primary, worker} network so that it can't delete those specific workers.
  • add the WorkerCache information to the PrimaryReconfiguration messages, and have the processing of that message update the cache. The advantage is that this is straightforward to implement; the disadvantage is that most of this code is code we would later replace with the dynamic dereferencing of workers at their primary.
  • lean into the singleton approach (that singleton /may/ be the SharedWorkerCache), make sure that the singleton value is updated in the two places where it must be (one in the primary, such as the StateHandler, one in the worker, downstream of the Synchronizer), and then make sure everyone uses it correctly: access to a worker checks that the primary is in the current (updated) Committee, then hits the singleton, and pops the address (a rough sketch of that access pattern follows this list). The advantage is that when we switch to dynamically dereferencing workers, the code that changes is then encapsulated behind the singleton.
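
As an illustration of the third option's access pattern (committee check first, then the cache), here is a sketch with placeholder types; none of these names are claimed to match the real primary/worker code:

```rust
use std::collections::BTreeMap;

// All names here are illustrative stand-ins, not the actual Narwhal types.
type PublicKey = String;
type Multiaddr = String;

struct Committee {
    authorities: BTreeMap<PublicKey, ()>,
}

struct WorkerCache {
    workers: BTreeMap<PublicKey, BTreeMap<u32, Multiaddr>>,
}

// Validate the authority against the current committee first, and only then
// consult the (singleton) worker cache for the address.
fn worker_address(
    committee: &Committee,
    cache: &WorkerCache,
    authority: &PublicKey,
    worker_id: u32,
) -> Option<Multiaddr> {
    if !committee.authorities.contains_key(authority) {
        // Stale or unknown authority: don't trust any cached worker entry.
        return None;
    }
    cache
        .workers
        .get(authority)
        .and_then(|workers| workers.get(&worker_id))
        .cloned()
}
```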

Files with review threads (resolved): Docker/README.md, test_utils/src/lib.rs, node/src/generate_format.rs, config/src/lib.rs, types/src/primary.rs
@huitseeker
Contributor

One additional thing: it would be great to extend the snapshot tests that signal whether we break our configuration files, adding one that checks whether we are breaking the new worker info configuration file you're introducing.

@arun-koshy
Contributor Author

Thank you for the review! @akichidis @asonnino @huitseeker 🙇🏽

@huitseeker has thoroughly answered the questions raised about the approach we should take for where the worker information will live and how it will be managed. The approach I have taken is to lean into the singleton approach for the WorkerCache, and in doing so I have simplified where and how reconfiguration should happen for the WorkerCache.

Pending items for future PRs

I don't think it's an emergency to have a separate data structure to store the workers of one's own primary as part of this PR. But it's useful to armor the cleanup function in the {primary, worker} network so that it can't delete those specific workers.

I have added TODOs to address this in the network layer in a separate PR.

Contributor

@huitseeker huitseeker left a comment


Thanks for the humongous amount of work @arun-koshy! I love the attention to detail, and the fixes to our benches & configs.

This LGTM (modulo a rebase)! This is a breaking change; @arun-koshy please scout when this gets shipped to Sui (through a NW pointer update there) and announce it in the appropriate channels (this in particular will need edits to prod config files).

I left comments inline, here are the main ones:

  • I think we should log more profusely when, as we process an epoch change, we insert a primary for which we have no worker information (a rough sketch of such a check follows this list). We discussed some worker info arriving with the reconfiguration message (i.e. ReconfigureNotification::NewEpoch should be extended to contain more than the Committee). Right now, it does not. This would break reconfiguration if we were really changing worker info through it. I think it's OK for now (dynamic worker discoverability will make this issue disappear) if we can have visibility on this, plus an issue to track fixing it. /cc @asonnino (the relevant files are primary/state_handler.rs and worker/synchronizer.rs)
  • the network_diff logic is brittle and should be replaced with something that does per-interface diffs, then unifies them. We should open an issue to track this.
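
As a sketch of the kind of logging the first bullet asks for - the function name and the cache shape are hypothetical, not the actual state_handler.rs or synchronizer.rs code:

```rust
use std::collections::BTreeMap;

use tracing::warn; // assumes the tracing crate already used in the codebase

type PublicKey = String;

// On an epoch change, loudly flag any authority in the new committee for which
// we hold no worker information, since requests to its workers would fail.
fn warn_on_missing_worker_info(
    new_committee_keys: &[PublicKey],
    worker_cache: &BTreeMap<PublicKey, BTreeMap<u32, String>>,
) {
    for authority in new_committee_keys {
        if !worker_cache.contains_key(authority) {
            warn!(
                "no worker information for authority {} after epoch change; \
                 requests to its workers will fail until worker info is updated",
                authority
            );
        }
    }
}
```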

Files with review threads: Docker/scripts/gen.workers.py, config/src/lib.rs, node/src/generate_format.rs, primary/src/state_handler.rs, worker/src/synchronizer.rs, worker/src/quorum_waiter.rs, primary/src/block_waiter.rs
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 22, 2022
In the hope of fixing the benchmarks
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 22, 2022
In the hope of fixing the benchmarks
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 23, 2022
In the hope of fixing the benchmarks
@arun-koshy arun-koshy merged commit 3b1a090 into main Aug 23, 2022
@arun-koshy arun-koshy deleted the worker branch August 23, 2022 07:30
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
* Remove workers from committee

* Fix local benchmark, client demo & tests

* Fix remote benchmark

* revert settings.json

* Fix docker compose configurations

* rebase

* Use singleton cache & update reconfiguration

* Update todo comments

* Fix license check

* Address review comments