Add option `process_distribution: :active` which rebalances processes on node joining / leaving (#164)
Conversation
…o force redistribution of processes between existing nodes in the cluster according to the configured distribution strategy.
…orks with members (Horde.DynamicSupervisor.Member) and not just Erlang nodes. Still need to verify that after the rebalancing operation all the pieces fall where they should fall
… and verifies that all of the pids which should have been redistributed to node 2 (n2) are actually present on that node according to the node's own state
…d start_child are both susceptible to the error patterns returned by proxy_to_node, as both return a call to this function under certain circumstances. The @spec needs to reflect this so that type-checking doesn't throw errors
oh fooey. Looks like I accidentally left a (single) compiler warning in the build. Not sure exactly how to update a PR on GitHub because I've never done it before. I'll look into it. In any case, not such a big deal.
…ory of my fucking life
Alright, I get how this works now; added a commit to fix the compiler warning (unused var) and now Circle is giving me formatting errors. Will also fix these post-haste!
...all checks passed!! 💯🥇
Hi @sikanrong, thanks for the PR! I'm going to read through it and then I'll let it marinade for a few days. This is a (relatively) big feature, and I want to make sure we're happy with it before we hit merge.
Hey @derekkraan, I totally understand. You marinade it up 👍

There was a lot working against me to get to this implementation. I mean at first, I had written this feature in the code for my own application using […]

The implementation could be a bit cleaner, but you would have to change the […] Specifically, it really would be much better if the process […]

So idk; this PR is by no means feature-complete, and the implementation could be better, but not without making some deeper structural changes to Horde.

In any case, this is a major missing feature; Horde is unusable in my project without it. Without this you're just totally unable to create a truly elastic Horde cluster, because when you add new nodes there's no way to reassign existing processes to them.
Also it could lead to more elegant patterns if the […]
hey @derekkraan; it's been a week so I just thought I'd chime in and perhaps try to start a dialogue. I'm not sure how you're feeling about the implementation, but I just wanted to say that if you were willing to be quite descriptive in telling me how you'd ideally like this feature to be implemented, my team and I would be more than happy to put in the necessary work to make it happen.

I suppose I just feel like the best thing would be for us to coordinate so that we can end up with the best possible solution, and put that in the hands of the community as fast as possible. I wonder if you'd maybe like to be in more direct correspondence so we could talk about the best way forward?

One thing I think could really help rebalance/1 a lot (in general) might be the conservation of the child_spec IDs (instead of randomizing them as per the current Horde implementation).

Anyway idk, maybe there's a better forum for discussing the deeper Horde implementation details? Or perhaps you'd like to talk one-on-one? Just let me know :)
Hi @sikanrong,

The problem is that I am myself not quite certain what I would accept in this PR. There are some initial questions to consider: […]
I think GitHub issues is a good place to talk. Keep in mind that I have a day job, and can't make myself available 24/7 for this project. I'm also going to ping @beardedeagle in here because he was asking about this before and might have some insights to share with us.
Alright, so I've given this quite a bit of thought, and surely what I'm about to write is going to be a few pages long, so apologies in advance for that...

So first off @derekkraan, I totally get it; I also have a day job where we're using Horde in a commercial project, so ensuring that it'll work for everything we'll need it to do is my day job. I hope this explains why I'm so enthusiastic about figuring out the rebalance logic in the best possible way. While I'm aware that I could've used another, more production-ready solution, I just really liked the Horde base implementation (we're also using your […]).

I'll try to respond to your enumerated points with my own enumerated points, but probably there will be some things that don't really fit into the pattern.
So for this part, I think probably the best solution would be basically to just pass a different exit signal to the process in […]. Evidently for this to work we would have to flag a given process as "shutting down for redistribution" so that it isn't simply restarted in place on the same node (a rough sketch of this idea follows this comment).
As for when this should be triggered, I would think any time the membership has changed, either by using […]. Currently you use […].

Also (as a final thought), I think additionally that redistribution should be something that the user can access via the API and trigger on-demand if needed.
Everything should be configurable! We should make no assumptions about how the user may want to use the tool. That being said, rebalance should definitely be configurable as well. However, I don't think that the DistributionStrategy is a good place for the rebalance logic to go, in the sense that redistribution depends too much on the (distributed) state in […]. However, we could use the […].

Note: for the same reason I find it hard to imagine a "pluggable" solution. I mean, we already have […]
In terms of the current implementation: […]
Anyway, tell me what you think of all that when you get a chance, and @beardedeagle should chime in with any tips as well! If all of this looks fine, I'd be happy to update my code to make it conform to these new design principles.
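Below is a rough, hypothetical sketch of the exit-signal idea from the first point above (the module name, function names, and exact exit reason are illustrative stand-ins, not the PR's code): give redistribution its own exit reason so the supervising process can tell a deliberate hand-off apart from a crash and skip the local restart.

```elixir
# Hypothetical sketch, not Horde's internals: a distinct exit reason lets
# the supervisor skip restarting a child that is being handed off.
defmodule RedistributeSketch do
  @redistribute_reason {:shutdown, :process_redistribution}

  @doc "Ask a child to shut down because it is being moved to another node."
  def stop_for_redistribution(child_pid) when is_pid(child_pid) do
    Process.exit(child_pid, @redistribute_reason)
  end

  @doc "Decide whether an exited child should be restarted on this node."
  def restart_locally?(@redistribute_reason), do: false
  def restart_locally?(:normal), do: false
  def restart_locally?(_crash_reason), do: true
end
```

The `{:shutdown, :process_redistribution}` reason used here mirrors the one the PR eventually settles on further down the thread.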
Hi @sikanrong,

Thanks for your message and your enthusiasm. So the reason we randomize child ids is because this is what […]

I agree generally with your proposed way forward. I think there will be some tricky things in […]

The only part I'm somewhat unclear on is how we can best make this pluggable and/or configurable. But perhaps we can start with just enabling / disabling with a flag (default to off) and go from there. Let me know if you have any other questions before you get started.
@derekkraan alright, great; thanks so much for reading over my changes and considering them! I'll update the code sometime later this week and then we can have another look at it and see if there are any other tweaks that need to be made.
…ssed in pull request 164. This gets rid of handle_dead_nodes and replaces it with a more general redistribute method which is automatically executed when the cluster configuration is changed. Updates the tests for redistribute (which pass), as well as maintaining the manual DynamicSupervisor.redistribute/1 method. Not quite done, as it must be made configurable and the standalone redistribute method must also have an accompanying test. As well, one other test is now failing and I must find out why...
…distribution has its own triggering mechanism which will only trigger when a node transitions into :alive, :dead, or :redistribute state(s). This makes it so that it will effectively wait for graceful process shutdown before restarting those processes on their new nodes. This is either triggered via update_process or update_member info when a horde member node is marked :dead. Now the only remaining problem is that sometimes the DynamicSupervisorImpl genserver is too busy to respond to the :horde_shutting_down call
…can be more specific about exactly which state changes should trigger the redistribution, instead of just taking the final indicated state and trying to base it only on that. Like now I can say things like 'when a node transitions from :uninitialized to :alive or from :shutting_down to :down, do a rebalancing'
…down test; consolidating the choose_node logic into a private function which takes a Horde.DynamicSupervisorImpl.Member and correctly prepares it for distribution such that the distribution is effectively deterministic and no matter how the nodes are (re-)distributed they will always be (re-)distributed deterministically.
…ute_on callback and associated events. Disable the :up redistribution for the graceful shutdown test; otherwise the redistributor would fire when the :horde_2_graceful node joins the cluster (as it should). All tests passing; also fixed remaining compilation errors and removed debug output
…own, then we should also gracefully NOT restart those processes which have been gracefully shut down
…s to :dead are valid reasons for redistribution
… pass any termination reason to ProcessesSupervisor.terminate_child_by_id and have that propagate through the supervisor logic until it gets to monitor_children/exit_child. Uses this to add a :redistribute exit signal when a process is shut down by horde for redistribution. Also adding tests for this
…s such that I can add more tests regarding all sorts of redistribute related stuff without repeating too much code.
…e as well as redistribute_processes. Also ensure that the tests properly clean up after themselves. Despite not all currently passing, the test suite is complete.
…ce is responsible for the looping instead of doing it with pattern matching and various definitions
…s the worst fucking thing I've ever done in 15 years of programming, I think. In the end the solution makes perfect sense. If (during redistribution) the node tries to start the processes that it should be running, only to find that those PIDs are already running (on another node), then it will monitor those processes and wait for them to gracefully terminate before adding those children on the destination node. All in all, a very robust system, allowing for the nodes to take as long as they need to
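A loose, self-contained illustration of the "wait for the old owner to finish" idea this commit describes (hypothetical module and function names, not the PR's code): monitor the pid that is still running elsewhere and only start the replacement once its :DOWN message arrives.

```elixir
defmodule WaitForHandoff do
  @doc """
  Monitor a pid that is still running (possibly on another node) and invoke
  start_fun only after it terminates, so the child is never running twice.
  """
  def start_after_terminate(old_pid, start_fun, timeout \\ 30_000)
      when is_pid(old_pid) and is_function(start_fun, 0) do
    ref = Process.monitor(old_pid)

    receive do
      {:DOWN, ^ref, :process, _pid, _reason} ->
        start_fun.()
    after
      timeout ->
        Process.demonitor(ref, [:flush])
        {:error, :handoff_timeout}
    end
  end
end
```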
…efore doing the final assertion - it allows for n2 to catch up and spin-up its processes
…acking on the member status (which is hard to manage) it's now going to be its own key in the CRDT which is set with a Kernel.make_ref() and then that ref is stored in the state after redistribution such that it only triggers redistribution if the refs don't match. This makes it so that all connected nodes who receive the message trigger redistribution
@derekkraan w00t! It's done. I'm really happy with the way this turned out; now it really feels like a part of the library that was meant to be there from the beginning. The end result is really very elegant. It conforms to the principles we discussed: […]
Hi @sikanrong,

I haven't had a chance to look at the actual code yet, but I do have a comment. The […]. So if we take away the option to redistribute on […].

Of course, if you have arguments for why it should work the way you have done it in your PR, then I'm open to hearing those. The rest looks good (in principle); I'll take a look at the actual code once we can agree on the right way to proceed with the above issue.

Derek
hey @derekkraan; your comments make sense. I also thought that maybe […]

I can have this updated by tomorrow
…ove the redistribute_on method from the distribution strategy. Instead, establishes a much simpler boolean configuration key :redistribute_on_alive which determines (as the name suggests) whether or not Horde will automatically redistribute processes when it detects a new node has been added to the cluster. There is no longer any configuration option which makes it so that Horde will NOT redistribute processes on node :down (or on graceful shutdown).
@derekkraan - alright, those changes were really easy to make; I'm excited for you to review the code! If there are any further changes or tweaks to be made, that's no problem at all, just let me know.
Hi Alex,

I just took a look at the code. I am unsure of the approach taken to track processes that should be supervised by the current node. For some background information, the way we track processes in the CRDT is with a key […]

The approach I would prefer is: […]
Does this make sense? This way there is no requirement to coordinate redistribution. It also means that the user can't trigger redistribution globally, but I'm not sure this is a big problem. We could remove that option or just have it trigger a redistribution locally (so the user would be responsible for coordinating redistribution across the entire cluster if that's what she wants).

So basically, nodes terminate and "release" processes when they detect that they should no longer be the owners, and when updates come in, check to see if you should claim an unclaimed process. I think this will necessitate a small change in […]

I also want to check with you about the terminology. People are used to swarm's "process handoff"; is "redistribute" what we want to call it? Think about it and let me know what you think.

Hope this feedback is useful. Let me know if you have any questions.

Cheers,
Derek
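A hypothetical sketch of the "release, then claim" flow described above; the CRDT is modelled here as a plain map of child_id => {owner_or_nil, child_spec}, and none of these names are Horde's actual API.

```elixir
# Hypothetical sketch (not Horde's actual code or CRDT API) of the
# release-then-claim flow: an owner clears itself out of entries it should
# no longer own, and other members later claim entries with a nil owner.
defmodule HandoffSketch do
  @doc """
  Release every child this member should no longer own by clearing the
  owner field; another member will notice the nil owner and claim it.
  """
  def release_children(crdt, me, choose_owner) when is_function(choose_owner, 1) do
    Map.new(crdt, fn
      {id, {^me, spec}} ->
        if choose_owner.(id) == me, do: {id, {me, spec}}, else: {id, {nil, spec}}

      other ->
        other
    end)
  end

  @doc """
  Claim any unowned child that the distribution function assigns to this
  member. Returns the updated map and the child specs to start locally.
  """
  def claim_children(crdt, me, choose_owner) when is_function(choose_owner, 1) do
    Enum.reduce(crdt, {crdt, []}, fn
      {id, {nil, spec}}, {acc, to_start} ->
        if choose_owner.(id) == me do
          {Map.put(acc, id, {me, spec}), [spec | to_start]}
        else
          {acc, to_start}
        end

      _owned, acc ->
        acc
    end)
  end
end
```

The point is that each step only needs local information plus the shared map, so no member ever has to coordinate a cluster-wide redistribution step.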
Hey @derekkraan ! All of this is really great feedback; it's all very well-explained. Your comments make perfect sense. The preferred approach is crystal clear; I can see how this will simplify the code and make for a more robust system. I think I can probably tweak the implementation to conform to those ideas in the next two or three days.
It's possible that I'll have some more questions during development, so please watch this space!
…iples laid out by @derekkraan such that each Horde.DynamicSupervisor is as stateless as possible, such that redistribution really doesn't have to be coordinated between nodes. Rather, the whole process is now one of nodes 'relinquishing' processes to the greater Horde, which are later 'picked up' by the nodes that they should be on, as those new nodes intercept the message that a (child) process no longer has a (parent) node. This 'relinquish' message is the very same CRDT message used by the GracefulShutdownManager to indicate to member nodes that they need to pick up those parent-less children.
@derekkraan alright, it's all done! Implementation ended up being faster than expected; I'm really happy with the changes made. It makes each Horde node as stateless as possible, and avoids the need for the nodes to do any coordination for process redistribution! Perfect 👍

Now each node will simply "relinquish" its processes (to the Horde!), and then as the other nodes receive the signal that a given process is without a parent node, each node will check to see if it should own that process, and if it finds that to be true it will start that process up via […]. I went with […]

Tell me what you think of the new changes! As ever, I'm still very open to tweaks, changes, and improvements.
Hi @sikanrong,
Thanks for all the work on this PR. I think we are getting closer! Please see my individual comments for the necessary changes.
Hey @derekkraan, it's a holiday here today, but I'll spend some time later in the day looking over these change requests and implementing them (and/or responding to them). I definitely did run Horde through dialyzer (it's part of the CircleCI integration tests too) and I don't know why it didn't pick up on the […]
…f the code around process-termination and signaling child-relinquish to the Horde. This is now implemented according to @derekkraan's proposed method of just calling Process.exit(child_pid, {:shutdown, :process_redistribution}) and then intercepting that message in ProcessesSupervisor.maybe_restart_child/5, which in turn notifies the DynamicSupervisorImpl that it should send out the process-without-parent message to the rest of the horde via the CRDT. Also moves away from using Application config and instead uses an options-based implementation where :process_redistribution is one of :active or :passive. Restructuring the tests to use this new active/passive init-options format. Resets the default behavior to :passive
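For readers skimming the timeline, a minimal usage sketch of the option described in this commit (the application and supervisor names are made up; the rest follows the commit's description of :process_redistribution being either :active or :passive, with :passive as the default):

```elixir
# Start a Horde.DynamicSupervisor with active redistribution enabled:
# :active rebalances running children when cluster membership changes,
# :passive (the default) leaves them where they are.
children = [
  {Horde.DynamicSupervisor,
   name: MyApp.DistributedSupervisor,
   strategy: :one_for_one,
   process_redistribution: :active}
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
```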
…hich for some reason doesn't compile when I run mix compile on my local machine
Hey again @derekkraan; so I've made the requested changes and in general everything went totally fine. The single snag that I ran into has to do with some nuance from Elixir's […]

The bold emphasis is mine. So what it's trying to describe here is that when you set […], a GenServer will only handle exit signals from its parent process automatically; exit signals from any other process arrive as regular {:EXIT, pid, reason} messages. The key words here are "parent process". In Horde, that parent process would be […], so a child GenServer that traps exits would otherwise need a clause like this:

```elixir
@impl true
def handle_info({:EXIT, _pid, reason}, state) do
  {:stop, reason, state}
end
```

If you don't do this, it will error about receiving an unexpected :EXIT message. So to avoid exactly this: instead of calling […]
Oh, also, I don't understand how to mark your change requests as "done". I thought I was just supposed to hit "resolve conversation" as I closed out each item, but it still says "1 review requesting changes" and I don't seem to be able to mark it as "finished" anywhere. Maybe you can help me understand what GitHub wants me to do there (if I'm supposed to do anything?).
…ly called from the ProcessesSupervisor process which eliminates the need for the ugly handle_info exit passthrough
…d_exit_signal to a call (from a cast) which seems to work this time
alright @derekkraan, I made a few commits to straighten out some implementation snags that were bothering me (detailed in the much-edited comment above), but now I'm calling this officially "done"! The final version is perfect 😃 I'm super happy with it.
@sikanrong, this is looking great. I am very happy with the structure of this feature and how it has turned out, and there are a lot of little things you were able to clean up along the way. Sending the exit signal through the parent process (the supervisor) is a great idea; I'm glad you thought of that.
Hey @derekkraan I just wanted to thank you for your kind words and for making yourself available to work with me on this and review my code. Many thanks! I've really enjoyed this and I hope you have too; I'm stoked that this is a part of Horde now :) Since we're using a lot of other Kraan-brand™️ libraries in our projects maybe we can have more collaborations like this in the future 😃
So the use-case is:
If you want to use Horde to dynamically add members to your DynamicSupervisor cluster at runtime (which in itself works just fine), there's no way in the library to manually trigger a re-distribution of processes so that they spread to the newly-added member(s) in the cluster according to your chosen `:distribution_strategy`. As a Horde user, all you're really left with is killing and re-starting all of your processes and hoping that they land on other nodes.

To solve this issue I've written `DynamicSupervisor.rebalance/1`. Basically what this does is run each child spec through the `distribution_strategy.choose_node/2` function, detect which processes should be running on other nodes, and specifically terminate/restart those particular processes such that they'll spin up on the new node where they should be according to the new cluster configuration.

In the future perhaps this could take more options, so that we specifically find processes that should be on a particular node or list of nodes (e.g. the newly-added members of the cluster) and only re-assign those processes... For now, it's just a blunt uniform redistribution tool. In any case, I know this has been really useful for me and I imagine there are other users that are looking to do the same thing in a clean way.
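As a rough illustration of the pass described above (not the PR's implementation; the data shapes and names are hypothetical), the core of such a rebalance boils down to re-running the strategy's choose_node/2 for every child and collecting the ones whose chosen member differs from the member currently running them:

```elixir
# Hypothetical sketch of the rebalance pass: ask the distribution strategy
# where each child should live and return the ones that need to move.
defmodule RebalanceSketch do
  @doc """
  children: a list of {child_spec, current_member} pairs
  members: the current cluster members
  choose_node: a 2-arity function standing in for the strategy's choose_node/2
  """
  def children_to_move(children, members, choose_node)
      when is_list(children) and is_function(choose_node, 2) do
    Enum.flat_map(children, fn {child_spec, current_member} ->
      case choose_node.(child_spec.id, members) do
        {:ok, chosen} when chosen != current_member ->
          # This child is on the wrong member: terminate it there and let it
          # be restarted on `chosen`.
          [{child_spec, current_member, chosen}]

        _ ->
          []
      end
    end)
  end
end
```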
I've also added a (very) comprehensive test for the new `rebalance/1` function in `dynamic_supervisor_test.exs`, and ensured that no type-checking (dialyzer) errors are thrown from the new code.

...I hope you like it! I'm obviously a big fan of Horde, and my team is using it in a big project 👍