CaseClauseError #103

Open
maennchen opened this issue Sep 6, 2018 · 21 comments
Comments

@maennchen

We recently switched to swarm in the quantum library.

We're seeing two errors when we do aggressive up/down scaling.

These may very well be errors on quantum's side, but I'm not able to determine the cause.

web_5  | 19:33:01.556 [error] [swarm on [email protected]] [tracker:handle_handoff] ** (CaseClauseError) no case clause matching: {:entry, QuantumSwarm.Scheduler.ExecutionBroadcaster, #PID<32732.1428.0>, #Reference<0.3516506690.689963012.197623>, %{mfa: {Quantum.ExecutionBroadcaster, :start_link, [%Quantum.ExecutionBroadcaster.StartOpts{debug_logging: true, job_broadcaster_reference: {:via, :swarm, QuantumSwarm.Scheduler.JobBroadcaster}, name: QuantumSwarm.Scheduler.ExecutionBroadcaster, scheduler: QuantumSwarm.Scheduler, storage: Quantum.Storage.Noop}]}}, {{0, 3}, {1, 3}}}

The issue is tracked here on quantums side: quantum-elixir/quantum-core#374

Help with this issue is very welcome.

@bitwalker
Owner

@maennchen We'll investigate, thanks for reporting!

@arjan
Collaborator

arjan commented Sep 7, 2018

Strange. A handoff request is received for a process that is neither registered at the tracker's node itself nor at the originating node.

I think this is where the CaseClauseError originates. Maybe we should just ignore such events?
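To illustrate the idea, here is a minimal sketch of what "ignoring such events" could look like; the function, registry shape, and clause patterns are assumptions for illustration, not Swarm's actual tracker code:

defmodule HandoffSketch do
  # Illustrative only: a `case` that matches just the registrations it expects
  # raises CaseClauseError for anything else; a catch-all clause turns an
  # unexpected entry into a no-op instead of a crash.
  def handle_handoff(registry, name, handoff_state) do
    case Map.get(registry, name) do
      nil ->
        # Name not registered here: accept the handoff and start tracking it.
        {:start, name, handoff_state}

      {:entry, ^name, pid, _ref, _meta} when node(pid) == node() ->
        # Registered to a local pid: resume it with the handed-off state.
        {:resume, pid, handoff_state}

      _other ->
        # Registered to a pid we did not expect (neither local nor from the
        # originating node): ignore the event, as suggested above.
        :ignore
    end
  end
end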

@maennchen
Author

I've updated the reproduction repo to the newest quantum version: peek-travel/quantum_swarm#1

@maennchen
Author

Just to make sure the registration is implemented correctly: in this supervisor the workers are started via swarm: https://github.com/quantum-elixir/quantum-core/blob/master/lib/quantum/supervisor.ex#L88-L97

@doughsay

Forgive me if this isn't related, but I tried a few more tests with swarm, taking quantum out of the equation.

Please see this branch: https://github.com/peek-travel/quantum_swarm/blob/without-quantum/apps/quantum_swarm/lib/quantum_swarm/ponger.ex

I run this in docker-compose, starting with 1 node initially and quickly scaling to 5 nodes (causing the process to be handed off to another node). Once that settles down, I quickly scale back to 1 node. The process ends up not running anywhere. This is similar to the symptoms I see with quantum, but without any errors.
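For reference, the Ponger in that branch follows Swarm's documented handoff message contract; a minimal sketch of its shape (the counter, interval, and output strings are reconstructed from the logs below; the real implementation is in the linked file):

defmodule QuantumSwarm.Ponger do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, [])

  def init(_args) do
    # Emit a PONG periodically, counting how many were sent.
    Process.send_after(self(), :pong, 2_000)
    {:ok, 0}
  end

  # Swarm asks for the state to hand off before moving the process.
  def handle_call({:swarm, :begin_handoff}, _from, counter) do
    IO.puts("------> BEGIN HANDOFF! counter=#{counter}")
    {:reply, {:resume, counter}, counter}
  end

  # The receiving node starts a fresh process and delivers the handed-off state.
  def handle_cast({:swarm, :end_handoff, counter}, _old_counter) do
    {:noreply, counter}
  end

  # Two nodes registered the name concurrently; keep one of the states.
  def handle_cast({:swarm, :resolve_conflict, their_counter}, my_counter) do
    IO.puts("------> RESOLVE CONFLICT! my_counter=#{my_counter} their_counter=#{their_counter}")
    {:noreply, max(my_counter, their_counter)}
  end

  def handle_info(:pong, counter) do
    IO.puts("------> PONG! #{counter}")
    Process.send_after(self(), :pong, 2_000)
    {:noreply, counter + 1}
  end

  # Swarm tells the old process to shut down once the handoff is complete.
  def handle_info({:swarm, :die}, counter) do
    IO.puts("------> DIE! counter=#{counter}")
    {:stop, :shutdown, counter}
  end
end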

Am I maybe just doing something wrong in my GenServer implementation? Logs below:

web_1  | 00:30:31.425 [info] [swarm on [email protected]] [tracker:init] started
web_1  | 00:30:36.429 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_1  | 00:30:36.429 [info] [swarm on [email protected]] [tracker:cluster_wait] no connected nodes, proceeding without sync
web_1  | 00:30:36.429 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_1  | 00:30:36.429 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on [email protected]
web_1  | 00:30:36.429 [debug] [swarm on [email protected]] [tracker:do_track] started ThePonger on [email protected]
web_1  | 00:30:36.434 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_1  | ------> PONG! 0
web_1  | ------> PONG! 1
web_1  | ------> PONG! 2
web_1  | 00:30:46.439 [warn] [libcluster:dns] unable to connect to :"[email protected]"
web_1  | 00:30:46.441 [warn] [libcluster:dns] unable to connect to :"[email protected]"
web_1  | 00:30:46.442 [warn] [libcluster:dns] unable to connect to :"[email protected]"
web_1  | 00:30:46.445 [warn] [libcluster:dns] unable to connect to :"[email protected]"
web_5  | 00:30:46.794 [info] [swarm on [email protected]] [tracker:init] started
web_4  | 00:30:46.808 [info] [swarm on [email protected]] [tracker:init] started
web_3  | 00:30:47.188 [info] [swarm on [email protected]] [tracker:init] started
web_2  | 00:30:47.321 [info] [swarm on [email protected]] [tracker:init] started
web_1  | ------> PONG! 3
web_1  | ------> PONG! 4
web_1  | 00:30:51.443 [info] [libcluster:dns] connected to :"[email protected]"
web_1  | 00:30:51.444 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.444 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.444 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_1  | 00:30:51.445 [info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]"]
web_1  | 00:30:51.445 [info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
web_5  | 00:30:51.447 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_5  | 00:30:51.447 [info] [swarm on [email protected]] [tracker:cluster_wait] syncing with [email protected]
web_5  | 00:30:51.449 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
web_5  | 00:30:51.449 [debug] [swarm on [email protected]] [tracker:sync_registry] local tracker is missing ThePonger, adding to registry
web_5  | 00:30:51.449 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
web_5  | 00:30:51.449 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_5  | 00:30:51.449 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_5  | 00:30:51.449 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on remote node [email protected]
web_5  | 00:30:51.451 [debug] [swarm on [email protected]] [tracker:start_pid_remotely] ThePonger already registered to #PID<32705.1368.0> on [email protected], registering locally
web_1  | 00:30:51.448 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
web_1  | 00:30:51.448 [info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
web_1  | 00:30:51.448 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_1  | 00:30:51.450 [debug] [swarm on [email protected]] [tracker:handle_call] :"[email protected]" is registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_1  | 00:30:51.450 [debug] [swarm on [email protected]] [tracker:do_track] found ThePonger already registered on [email protected]
web_1  | 00:30:51.453 [info] [libcluster:dns] connected to :"[email protected]"
web_4  | 00:30:51.455 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.455 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.457 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_1  | 00:30:51.457 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:30:51.486 [info] [libcluster:dns] connected to :"[email protected]"
web_1  | 00:30:51.487 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.487 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_1  | 00:30:51.487 [debug] [swarm on [email protected]] [tracker:handle_topology_change] #PID<0.1368.0> belongs on [email protected]
web_1  | ------> BEGIN HANDOFF! counter=5
web_3  | 00:30:51.487 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.490 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32705.1368.0>
web_5  | 00:30:51.499 [info] [libcluster:dns] connected to :"[email protected]"
web_5  | 00:30:51.499 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.500 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_5  | 00:30:51.500 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:30:51.488 [debug] [swarm on [email protected]] [tracker:handle_topology_change] ThePonger has requested to be resumed
web_1  | 00:30:51.493 [debug] [swarm on [email protected]] [tracker:handle_topology_change] sending handoff for ThePonger to [email protected]
web_1  | 00:30:51.493 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | ------> DIE! counter=5
web_5  | 00:30:51.509 [info] [libcluster:dns] connected to :"[email protected]"
web_3  | 00:30:51.512 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.514 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.514 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_5  | 00:30:51.514 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_2  | 00:30:51.516 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_4  | 00:30:51.501 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.516 [info] [libcluster:dns] connected to :"[email protected]"
web_1  | 00:30:51.524 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.524 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_1  | 00:30:51.525 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_5  | 00:30:51.531 [info] [libcluster:dns] connected to :"[email protected]"
web_5  | 00:30:51.532 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_2  | 00:30:51.534 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.533 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_5  | 00:30:51.533 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_5  | 00:30:51.545 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_4  | 00:30:51.725 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_3  | 00:30:51.725 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_3  | 00:30:51.739 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_2  | 00:30:51.739 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_4  | 00:30:51.750 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_2  | 00:30:51.751 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_4  | 00:30:51.789 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_4  | 00:30:51.789 [info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]
web_4  | 00:30:51.789 [info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
web_3  | 00:30:51.790 [info] [swarm on [email protected]] [tracker:cluster_wait] pending sync request from [email protected]
web_3  | 00:30:52.169 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_3  | 00:30:52.169 [info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]
web_3  | 00:30:52.169 [info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
web_2  | 00:30:52.170 [info] [swarm on [email protected]] [tracker:cluster_wait] pending sync request from [email protected]
web_2  | 00:30:52.300 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_2  | 00:30:52.300 [info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]
web_2  | 00:30:52.300 [info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
web_5  | 00:30:52.301 [debug] [swarm on [email protected]] [tracker:handle_cast] received sync request from [email protected]
web_2  | 00:30:52.301 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
web_2  | 00:30:52.301 [info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
web_2  | 00:30:52.301 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] clearing pending sync request for [email protected]
web_2  | 00:30:52.304 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
web_2  | 00:30:52.304 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
web_2  | 00:30:52.304 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_2  | 00:30:52.304 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_2  | 00:30:52.304 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on remote node [email protected]
web_1  | 00:30:52.312 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32717.1389.0>) locally
web_5  | 00:30:52.302 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
web_5  | 00:30:52.302 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
web_5  | 00:30:52.302 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_5  | 00:30:52.311 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32708.1389.0>) locally
web_4  | 00:30:52.307 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
web_4  | 00:30:52.307 [info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
web_4  | 00:30:52.307 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_4  | 00:30:52.307 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_4  | 00:30:52.308 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on remote node [email protected]
web_4  | 00:30:52.308 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32705.1368.0>
web_4  | 00:30:52.313 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32707.1389.0>) locally
web_4  | 00:30:52.314 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32707.1389.0>) locally
web_2  | 00:30:52.311 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32707.1389.0>) locally
web_2  | 00:30:52.314 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32707.1389.0>) locally
web_3  | 00:30:52.303 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
web_3  | 00:30:52.303 [info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
web_3  | 00:30:52.303 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] clearing pending sync request for [email protected]
web_3  | 00:30:52.309 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
web_3  | 00:30:52.309 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
web_3  | 00:30:52.309 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_3  | 00:30:52.309 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_3  | 00:30:52.309 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on [email protected]
web_3  | 00:30:52.309 [debug] [swarm on [email protected]] [tracker:do_track] started ThePonger on [email protected]
web_5  | 00:30:52.316 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32708.1389.0>) locally
web_3  | 00:30:52.313 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32705.1368.0>
web_3  | ------> RESOLVE CONFLICT! my_counter=5 their_counter=0
web_1  | 00:30:52.319 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32717.1389.0>) locally
web_3  | 00:30:52.320 [debug] [swarm on [email protected]] [tracker:handle_call] :"[email protected]" is registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_3  | 00:30:52.320 [debug] [swarm on [email protected]] [tracker:do_track] found ThePonger already registered on [email protected]
web_3  | 00:30:52.320 [debug] [swarm on [email protected]] [tracker:handle_call] :"[email protected]" is registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_3  | 00:30:52.320 [debug] [swarm on [email protected]] [tracker:do_track] found ThePonger already registered on [email protected]
web_4  | 00:30:52.322 [debug] [swarm on [email protected]] [tracker:start_pid_remotely] ThePonger already registered to #PID<32707.1389.0> on [email protected], registering locally
web_2  | 00:30:52.333 [debug] [swarm on [email protected]] [tracker:start_pid_remotely] ThePonger already registered to #PID<32707.1389.0> on [email protected], registering locally
web_3  | 00:30:52.346 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_2  | 00:30:52.427 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_4  | 00:30:52.580 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_3  | ------> PONG! 5
web_3  | ------> PONG! 6
web_5  | 00:31:00.686 [info] SIGTERM received - shutting down
web_5  |
web_3  | 00:31:00.686 [info] SIGTERM received - shutting down
web_3  |
web_2  | 00:31:00.688 [info] SIGTERM received - shutting down
web_2  |
web_4  | 00:31:00.700 [info] SIGTERM received - shutting down
web_4  |
web_4  | 00:31:00.709 [debug] [swarm on [email protected]] [tracker:handle_monitor] ThePonger is down: :shutdown
web_4  | 00:31:00.716 [warn] [swarm on [email protected]] [tracker:broadcast_event] broadcast of event ({:untrack, #PID<32707.1389.0>}) was not recevied by [:"[email protected]", :"[email protected]"]
web_4  | 00:31:00.716 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32707.1389.0>
web_4  | 00:31:00.717 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32707.1389.0>
web_1  | 00:31:00.708 [debug] [swarm on [email protected]] [tracker:handle_monitor] ThePonger is down: :shutdown
web_1  | 00:31:00.714 [warn] [swarm on [email protected]] [tracker:broadcast_event] broadcast of event ({:untrack, #PID<32717.1389.0>}) was not recevied by [:"[email protected]", :"[email protected]"]
web_1  | 00:31:00.714 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32717.1389.0>
web_1  | 00:31:00.714 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32717.1389.0>
web_5  | 00:31:00.713 [debug] [swarm on [email protected]] [tracker:handle_monitor] ThePonger is down: :shutdown
web_5  | 00:31:00.776 [warn] [swarm on [email protected]] [tracker:broadcast_event] broadcast of event ({:untrack, #PID<32708.1389.0>}) was not recevied by [:"[email protected]", :"[email protected]"]
web_5  | 00:31:00.776 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32708.1389.0>
web_5  | 00:31:00.776 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32708.1389.0>
web_1  | 00:31:01.713 [info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
web_1  | 00:31:01.713 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
web_1  | 00:31:01.713 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:31:01.769 [info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
web_1  | 00:31:01.769 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
web_1  | 00:31:01.769 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:31:01.777 [info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
web_1  | 00:31:01.777 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
web_1  | 00:31:01.777 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:31:01.784 [info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
web_1  | 00:31:01.784 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
web_1  | 00:31:01.784 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
quantum_swarm_umbrella_web_2 exited with code 0
quantum_swarm_umbrella_web_3 exited with code 0
quantum_swarm_umbrella_web_4 exited with code 0
quantum_swarm_umbrella_web_5 exited with code 0

@ejscunha

Hey 👋

We are using quantum in our project and we are being affected by this issue when scaling down our cluster.

@doughsay Looking at your example, I'd say that's the expected behaviour: when you scale down, the application receives a SIGTERM signal, which makes all started applications gracefully terminate their supervision trees. When the GenServer spawned through Swarm receives the exit signal, it exits with reason :shutdown, and Swarm stops tracking the process because it exited normally.

@maennchen
Author

@ejscunha Could I solve this by setting restart to permanent again? I think I set it to transient to make it work with swarm.
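For context, the restart strategy in question is the restart: key of the worker's child spec; an illustrative spec only, with a placeholder module rather than quantum's actual spec:

%{
  id: MyApp.Worker,
  start: {MyApp.Worker, :start_link, []},
  # :permanent -> always restarted by the supervisor
  # :transient -> restarted only if it terminates abnormally
  # :temporary -> never restarted
  restart: :transient
}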

@ejscunha

@maennchen If you are talking about the processes started in the Supervisor init, the restart parameter won't make any difference: those processes are linked to Swarm.Tracker (since the start_link function is called by Swarm), not to the Quantum.Supervisor.

A possible solution: the functions passed to Swarm to start the processes should call Supervisor.start_child/2, so that the processes are linked to a Supervisor while still being registered with Swarm. Additionally, the spawned processes should trap exits and define a terminate callback that hands the process off with Swarm.Tracker.handoff/2 when it is exiting due to a SIGTERM signal. This solution is not perfect and has plenty of potential issues, but I think it can solve this one.

@maennchen
Author

maennchen commented Sep 13, 2018

@ejscunha So a worker would have to implement something like this?

defmodule Acem.Worker do
  use GenServer

  # ...

  def init(opts) do
    # Trap exits so terminate/2 runs when the supervisor shuts this process down
    Process.flag(:trap_exit, true)
    # ...
    {:ok, opts}
  end

  # ...

  def terminate(_reason, %{name: name} = state) do
    handoff_state = {:resume, state}
    Swarm.Tracker.handoff(name, handoff_state)
    :ok
  end

  # ...
end

Could you specify further how you would build the Supervisor.start_child/2 part?

@ejscunha

ejscunha commented Sep 13, 2018

@maennchen Yes, something like that.

An example on how Supervisor.start_child/2 could be used:

defmodule Quantum.Supervisor do
  def init(opts) do
    # ...
    if global do
      child_spec = TaskRegistry.child_spec(task_registry_opts)
      Swarm.register_name(task_registry_name, __MODULE__, :start_child_swarm, [child_spec], 15_000)
      # ...
    end
    # ...
  end

  def start_child_swarm(child_spec) do
    Supervisor.start_child(<supervisor_name>, child_spec)
  end
end

@maennchen
Author

@ejscunha And how would that help with the shutdown on SIGTERM?

@ejscunha

@maennchen Since your processes would now be trapping exits, when the Supervisor tries to terminate its children due to a SIGTERM signal, they won't exit immediately with the reason :shutdown (which would make Swarm stop tracking them); instead, we would hand them off to another node.
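One related detail worth noting (my addition, assuming the worker sits under a regular Supervisor as proposed above): when a child traps exits, the supervisor sends it an exit signal with reason :shutdown and waits for the shutdown value in its child spec before killing it, so that value bounds how long terminate/2 (and therefore the handoff) is allowed to take:

%{
  id: MyApp.Worker,
  start: {MyApp.Worker, :start_link, []},
  restart: :transient,
  # Milliseconds the supervisor waits for terminate/2 to finish the handoff
  # before killing the process; the default for workers is 5_000.
  shutdown: 10_000
}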

@bitwalker
Owner

@arjan Perhaps this occurs when the handoff is performed from a third-party node? In other words, due to this change? In that case, the handoff request comes not from the tracker node or the registered owner, but from an external node which has decided to trigger the handoff for some reason.

It's not clear to me if that is what is happening here, but the only other possibility is a divergence in the registries due to churn, where the same process has been moved more than once, and we're handling a handoff request for one of the earlier moves. I haven't confirmed if that's possible, but it is the only other thing I can think of.

@arjan
Collaborator

arjan commented Sep 14, 2018

@bitwalker That seems a plausible explanation: it receives the request from a node it doesn't expect it from. However, I'm not seeing quantum call this Tracker.handoff function, which is required to move a process to a node where (according to the other nodes) it doesn't belong. And come to think of it, #83 was merged after this bug was reported, so it cannot be the cause...

@bitwalker
Owner

I guess the implication then is that it must be due to the latter of the two possibilities I described - at a high enough rate of churn, events may get processed in an order different than we expect (i.e. we're handling a handoff event when the registration has been recently changed, but is not fully synced with reality, or put another way, the handoff is coming from a process which is not yet associated with the name on the node receiving the handoff).

We'll need to evaluate the right way to deal with this, as I'm not confident saying "X is the solution" without a pretty thorough review. My guess is that we may want to proceed as if the handoff is correct, and assume that a subsequent event will address the discrepancy (i.e. we're catching up to the real state of the world), but I'm not sure how best to do that.

@arjan
Collaborator

arjan commented Sep 16, 2018

Yes, either we accept the inconsistent request or we drop it; both could have unintended implications. It would be best to create a reproducible situation.

@maennchen Since this issue was filed, #94 was merged, which included #85, which made quite a few functional changes in the tracker's CRDT. Could you try to reproduce this CaseClauseError with the latest master? Just in case...

@doughsay

I updated the repro repo to use swarm master: https://github.com/peek-travel/quantum_swarm

I still can't get it to be stable when scaling up or down in docker-compose. I get many kinds of errors; a lot of them look like they're coming from Quantum, but this one looks like it's coming from swarm, and it doesn't really tell us much...

web_1  | 01:21:56.555 [error] [swarm on [email protected]] [tracker:start_pid_remotely] ** (exit) exited in: :gen_statem.call({Swarm.Tracker, :"[email protected]"}, {:track, QuantumSwarm.Scheduler.TaskRegistry, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: QuantumSwarm.Scheduler.TaskRegistry}]}}}, :infinity)
web_1  |     ** (EXIT) shutdown
web_1  |     (stdlib) gen.erl:177: :gen.do_call/4
web_1  |     (stdlib) gen_statem.erl:598: :gen_statem.call_dirty/4
web_1  |     (swarm) lib/swarm/tracker/tracker.ex:1331: Swarm.Tracker.start_pid_remotely/6
web_1  |     (elixir) lib/task/supervised.ex:89: Task.Supervised.do_apply/2
web_1  |     (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

I'd encourage anyone who has docker and docker-compose installed to clone that repo, do some random scaling up and down, and see whether it's just me seeing these errors.

@arjan
Collaborator

arjan commented Sep 17, 2018

Thanks, I'll have a look.

@arjan
Collaborator

arjan commented Sep 17, 2018

Yes, I've seen this issue now as well. The ponger sometimes stops responding after scaling up.

I'm seeing timeout errors when calling the tracker, which crash the caller and sometimes terminate the node:

web_2 | {"Kernel pid terminated",application_controller,"{application_start_failure,quantum_swarm,{{shutdown,{failed_to_start_child,'Elixir.QuantumSwarm.Scheduler',{timeout,{gen_statem,call,['Elixir.Swarm.Tracker',{track,'Elixir.QuantumSwarm.Scheduler.TaskRegistry',#{mfa => {'Elixir.Quantum.TaskRegistry',start_link,[#{'__struct__' => 'Elixir.Quantum.TaskRegistry.StartOpts',name => 'Elixir.QuantumSwarm.Scheduler.TaskRegistry'}]}}},15000]}}}},{'Elixir.QuantumSwarm.Application',start,[normal,[]]}}}"}

@maennchen
Author

I tried to incorporate the suggested changes from above (handing off in terminate, starting via Supervisor.start_child).

You can see the commit here: https://github.com/jshmrtn/quantum-elixir/commit/41fd06c42a948c0160bf68f16690bc646dc81772

I don't really see what is happening, but the application is now crashing on scale down.

@maennchen
Author

Is there something more I should try to fix the problem?
