CaseClauseError #103
@maennchen We'll investigate, thanks for reporting!
Strange. A handoff request is received for a process that is neither registered at the tracker's node itself nor at the originating node. I think this is where the CaseClauseError originates. Maybe we should just ignore such events?
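To make the "ignore such events" option concrete, here is a purely illustrative sketch (not Swarm's actual tracker code; handle_handoff/3 and the tuples it matches are made up) of the missing-clause failure mode and of what a catch-all clause would do:

```elixir
defmodule HandoffSketch do
  require Logger

  # `registration` stands in for whatever the tracker's registry lookup
  # returned for the handed-off name; the shapes below are invented.
  def handle_handoff(name, registration, from_node) do
    case registration do
      {:registered, owner_node} when owner_node == node() ->
        # Expected case 1: the process is registered at this node.
        {:resume_here, name}

      {:registered, ^from_node} ->
        # Expected case 2: it is registered at the node sending the handoff.
        {:take_over, name}

      other ->
        # Without this clause, any other registration state (e.g. the name
        # owned by a third node after rapid churn) raises CaseClauseError.
        # "Ignoring such events" means matching and dropping them here.
        Logger.warn("ignoring unexpected handoff for #{inspect(name)}: #{inspect(other)}")
        :ignored
    end
  end
end
```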
I've updated the reproduction repo to the newest quantum version: peek-travel/quantum_swarm#1
Just to make sure that the registration is implemented correctly: in this supervisor the workers are started using Swarm: https://github.com/quantum-elixir/quantum-core/blob/master/lib/quantum/supervisor.ex#L88-L97
Forgive me if this isn't related, but I tried a few more tests with Swarm, taking quantum out of the equation. Please see this branch: https://github.com/peek-travel/quantum_swarm/blob/without-quantum/apps/quantum_swarm/lib/quantum_swarm/ponger.ex I run this in docker-compose, starting with 1 node and quickly scaling to 5 nodes (causing the process to be handed off to another node). Once that settles down, I quickly scale back to 1 node. The process ends up not running anywhere. This is similar to the symptoms I see with quantum, but without any errors (logs below). Am I maybe just doing something wrong with my GenServer implementation?
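For reference, the handoff-aware GenServer pattern from Swarm's README looks roughly like the sketch below (module name and state are placeholders, not the actual ponger code); a process registered through Swarm's MFA-based register_name needs handlers along these lines to carry its state across topology changes:

```elixir
defmodule PongerSketch do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  def init(_opts), do: {:ok, %{count: 0}}

  # Swarm asks for the state before moving the process to another node.
  # Replying :restart instead would start fresh on the new node.
  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, {:resume, state}, state}
  end

  # The replacement process on the new node receives the handed-off state.
  def handle_cast({:swarm, :end_handoff, handoff_state}, _state) do
    {:noreply, handoff_state}
  end

  # After a netsplit heals, two copies may exist; keep one state and carry on.
  def handle_cast({:swarm, :resolve_conflict, _other_state}, state) do
    {:noreply, state}
  end

  # Swarm tells the now-redundant copy to shut down.
  def handle_info({:swarm, :die}, state) do
    {:stop, :shutdown, state}
  end
end
```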
Hey 👋 We are using quantum in our project and we are affected by this issue when scaling down our cluster. @doughsay Looking at your example, I'd say that's the expected behaviour, since when you scale down the application receives a SIGTERM.
@ejscunha Could I solve this by setting
@maennchen If you are talking about the processes you mentioned that are started in the Supervisor init, it won't make a difference, because those processes will be linked to the supervisor. A possible solution is that the functions passed to Swarm to start the processes call Process.flag(:trap_exit, true).
@ejscunha So a worker would have to implement something like this?

```elixir
defmodule Acem.Worker do
  use GenServer

  # ...

  def init(opts) do
    # Trap exits so terminate/2 is called when the supervisor shuts the worker down.
    Process.flag(:trap_exit, true)
    # ...
  end

  # ...

  def terminate(reason, %{name: name} = state) do
    # Hand the state off so the process can be resumed on another node.
    handoff_state = {:resume, some_stuff}
    Swarm.Tracker.handoff(name, handoff_state)
    reason
  end

  # ...
end
```

Could you specify further how you would build the supervisor side of this?
@maennchen Yes, something like that. An example of how the supervisor could register and start its child through Swarm:

```elixir
defmodule Quantum.Supervisor do
  # ...

  def init(opts) do
    # ...
    if global do
      child_spec = TaskRegistry.child_spec(task_registry_opts)
      # Let Swarm pick the node; it starts the process by calling back
      # into start_child_swarm/1 there.
      Swarm.register_name(task_registry_name, __MODULE__, :start_child_swarm, [child_spec], 15_000)
      # ...
    end
    # ...
  end

  def start_child_swarm(child_spec) do
    # <supervisor_name> stands for the supervisor's registered name.
    Supervisor.start_child(<supervisor_name>, child_spec)
  end
end
```
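One detail worth keeping in mind with this approach (an observation, not something stated above): the MFA handed to Swarm is expected to return {:ok, pid}, while Supervisor.start_child/2 may also return {:ok, pid, info} or {:error, {:already_started, pid}}. A minimal sketch of a start_child_swarm/1 that normalizes those results, assuming a supervisor registered under the placeholder name MySupervisor:

```elixir
defmodule SupervisorSketch do
  # MySupervisor is a placeholder for the actual supervisor's registered name.
  def start_child_swarm(child_spec) do
    case Supervisor.start_child(MySupervisor, child_spec) do
      {:ok, pid} -> {:ok, pid}
      {:ok, pid, _info} -> {:ok, pid}
      # One way to handle a child that is already running locally:
      # hand its pid back to Swarm instead of failing the registration.
      {:error, {:already_started, pid}} -> {:ok, pid}
      other -> other
    end
  end
end
```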
@ejscunha And how would that help with stopping on SIGTERM?
@maennchen Since your processes would now be trapping exits, when the Supervisor tries to terminate its children due to a SIGTERM, their terminate/2 callbacks will be invoked and they get the chance to hand their state off via Swarm before shutting down.
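A related prerequisite (not stated above, but standard OTP behaviour): terminate/2 only gets a chance to run on shutdown if the worker traps exits and its child spec gives it a shutdown timeout rather than :brutal_kill. A sketch with an illustrative timeout, reusing the Acem.Worker name from the earlier example:

```elixir
children = [
  %{
    id: Acem.Worker,
    # Assumes Acem.Worker.start_link/1 exists, as in the sketch above.
    start: {Acem.Worker, :start_link, [[]]},
    # Give the worker up to 10 seconds to run terminate/2 and hand off its
    # state on shutdown; :brutal_kill would skip terminate/2 entirely.
    shutdown: 10_000
  }
]

Supervisor.start_link(children, strategy: :one_for_one)
```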
@arjan Perhaps this occurs when a third-party node performs the handoff? In other words, due to this change (#83)? In that case, the handoff request comes not from the tracker node or the registered owner, but from an external node which has decided to trigger the handoff for some reason. It's not clear to me whether that is what is happening here, but the only other possibility is a divergence in the registries due to churn, where the same process has been moved more than once and we're handling a handoff request for one of the earlier moves. I haven't confirmed whether that's possible, but it is the only other thing I can think of.
@bitwalker That seems a plausible explanation. It receives the request from a node it doesn't expect it from. However, I'm not seeing quantum calling this Tracker.handoff function, which would be required to move a process to a node where it (according to the other nodes) doesn't belong. And come to think of it, #83 was merged after this bug was reported, so it cannot be the cause...
I guess the implication then is that it must be due to the latter of the two possibilities I described: at a high enough rate of churn, events may get processed in a different order than we expect (i.e. we're handling a handoff event when the registration has recently changed but is not yet fully synced with reality; put another way, the handoff is coming from a process which is not yet associated with the name on the node receiving the handoff). We'll need to evaluate the right way to deal with this, as I'm not confident saying "X is the solution" without a pretty thorough review. My guess is that we may want to proceed as if the handoff is correct and assume that a subsequent event will address the discrepancy (i.e. we're catching up to the real state of the world), but I'm not sure how best to do that.
Yes, either we accept the inconsistent request or we drop it... both could have unintended implications. It would be best to create a reproducible situation. @maennchen Since this issue was filed, #94 was merged, which included #85, which made quite a few functional changes in the tracker's CRDT. Could you try to reproduce this CaseClauseError with the latest master? Just in case...
I updated the repro repo to use swarm master: https://github.com/peek-travel/quantum_swarm
I still can't get it to be stable when scaling up or down in docker-compose. I get many kinds of errors; a lot of them look like they're coming from quantum, but this one looks like it's coming from Swarm, though it doesn't really tell us much...
I'd encourage anyone who has docker and docker-compose installed to clone that repo and just do random scaling up and down and see if it's not just me seeing those errors.
Thanks, I'll have a look.
Yes, I've seen this issue now as well. The pinger sometimes stops responding after scaling up. I'm seeing timeout errors while calling the tracker, which crash the caller, sometimes terminating the node:
I tried to incorporate the suggested changes from above (handing off in terminate, starting via Swarm.register_name). You can see the commit here: https://github.com/jshmrtn/quantum-elixir/commit/41fd06c42a948c0160bf68f16690bc646dc81772
I don't really see what is happening, but the application is now crashing on scale down.
Is there something more I should try to fix the problem?
We recently switched to swarm in the quantum library.
We're seeing two errors when we do aggressive up/down scaling.
These errors may very well be errors on the quantum side, but I'm not able to pin down the problem.
The issue is tracked on quantum's side here: quantum-elixir/quantum-core#374
Help with this issue is very welcome.