CaseClauseError #103

Open
maennchen opened this issue Sep 6, 2018 · 21 comments
Comments

@maennchen

We recently switched to swarm in the quantum library.

We're seeing two errors when we do aggressive up/down scaling.

These may very well be errors on quantum's side, but I'm not able to determine the cause.

web_5  | 19:33:01.556 [error] [swarm on [email protected]] [tracker:handle_handoff] ** (CaseClauseError) no case clause matching: {:entry, QuantumSwarm.Scheduler.ExecutionBroadcaster, #PID<32732.1428.0>, #Reference<0.3516506690.689963012.197623>, %{mfa: {Quantum.ExecutionBroadcaster, :start_link, [%Quantum.ExecutionBroadcaster.StartOpts{debug_logging: true, job_broadcaster_reference: {:via, :swarm, QuantumSwarm.Scheduler.JobBroadcaster}, name: QuantumSwarm.Scheduler.ExecutionBroadcaster, scheduler: QuantumSwarm.Scheduler, storage: Quantum.Storage.Noop}]}}, {{0, 3}, {1, 3}}}

The issue is tracked here on quantums side: quantum-elixir/quantum-core#374

Help with this issue is very welcome.

@bitwalker
Owner

@maennchen We'll investigate, thanks for reporting!

@arjan
Collaborator

arjan commented Sep 7, 2018

Strange. A handoff request is received for a process that is neither registered at the tracker's node itself nor at the originating node.

I think this is where the CaseClauseError originates. Maybe we should just ignore such events?
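To illustrate the idea, here is a minimal sketch of what "ignoring such events" could look like; the function, registry shape, and clause patterns are assumptions for illustration, not Swarm's actual tracker code:

defmodule HandoffSketch do
  # Illustrative only: a `case` that matches just the registrations it expects
  # raises CaseClauseError for anything else; a catch-all clause turns an
  # unexpected entry into a no-op instead of a crash.
  def handle_handoff(registry, name, handoff_state) do
    case Map.get(registry, name) do
      nil ->
        # Name not registered here: accept the handoff and start tracking it.
        {:start, name, handoff_state}

      {:entry, ^name, pid, _ref, _meta} when node(pid) == node() ->
        # Registered to a local pid: resume it with the handed-off state.
        {:resume, pid, handoff_state}

      _other ->
        # Registered to a pid we did not expect (neither local nor from the
        # originating node): ignore the event, as suggested above.
        :ignore
    end
  end
end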

@maennchen
Author

I've updated the reproduction repo to the newest quantum version: peek-travel/quantum_swarm#1

@maennchen
Author

Just to make sure the registration is implemented correctly: in this supervisor the workers are started via swarm: https://github.com/quantum-elixir/quantum-core/blob/master/lib/quantum/supervisor.ex#L88-L97

@doughsay

Forgive me if this isn't related, but I tried a few more tests with swarm, taking quantum out of the equation.

Please see this branch: https://github.com/peek-travel/quantum_swarm/blob/without-quantum/apps/quantum_swarm/lib/quantum_swarm/ponger.ex

I run this in docker-compose, starting with 1 node initially and quickly scaling to 5 nodes (causing the process to be handed off to another node). Once that settles down, I quickly scale back to 1 node. The process ends up not running anywhere. This is similar to the symptoms I see with quantum, but without any errors.
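For reference, the Ponger in that branch follows Swarm's documented handoff message contract; a minimal sketch of its shape (the counter, interval, and output strings are reconstructed from the logs below; the real implementation is in the linked file):

defmodule QuantumSwarm.Ponger do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, [])

  def init(_args) do
    # Emit a PONG periodically, counting how many were sent.
    Process.send_after(self(), :pong, 2_000)
    {:ok, 0}
  end

  # Swarm asks for the state to hand off before moving the process.
  def handle_call({:swarm, :begin_handoff}, _from, counter) do
    IO.puts("------> BEGIN HANDOFF! counter=#{counter}")
    {:reply, {:resume, counter}, counter}
  end

  # The receiving node starts a fresh process and delivers the handed-off state.
  def handle_cast({:swarm, :end_handoff, counter}, _old_counter) do
    {:noreply, counter}
  end

  # Two nodes registered the name concurrently; keep one of the states.
  def handle_cast({:swarm, :resolve_conflict, their_counter}, my_counter) do
    IO.puts("------> RESOLVE CONFLICT! my_counter=#{my_counter} their_counter=#{their_counter}")
    {:noreply, max(my_counter, their_counter)}
  end

  def handle_info(:pong, counter) do
    IO.puts("------> PONG! #{counter}")
    Process.send_after(self(), :pong, 2_000)
    {:noreply, counter + 1}
  end

  # Swarm tells the old process to shut down once the handoff is complete.
  def handle_info({:swarm, :die}, counter) do
    IO.puts("------> DIE! counter=#{counter}")
    {:stop, :shutdown, counter}
  end
end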

Am I maybe just doing something wrong in my GenServer implementation? Logs below:

web_1  | 00:30:31.425 [info] [swarm on [email protected]] [tracker:init] started
web_1  | 00:30:36.429 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_1  | 00:30:36.429 [info] [swarm on [email protected]] [tracker:cluster_wait] no connected nodes, proceeding without sync
web_1  | 00:30:36.429 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_1  | 00:30:36.429 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on [email protected]
web_1  | 00:30:36.429 [debug] [swarm on [email protected]] [tracker:do_track] started ThePonger on [email protected]
web_1  | 00:30:36.434 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_1  | ------> PONG! 0
web_1  | ------> PONG! 1
web_1  | ------> PONG! 2
web_1  | 00:30:46.439 [warn] [libcluster:dns] unable to connect to :"[email protected]"
web_1  | 00:30:46.441 [warn] [libcluster:dns] unable to connect to :"[email protected]"
web_1  | 00:30:46.442 [warn] [libcluster:dns] unable to connect to :"[email protected]"
web_1  | 00:30:46.445 [warn] [libcluster:dns] unable to connect to :"[email protected]"
web_5  | 00:30:46.794 [info] [swarm on [email protected]] [tracker:init] started
web_4  | 00:30:46.808 [info] [swarm on [email protected]] [tracker:init] started
web_3  | 00:30:47.188 [info] [swarm on [email protected]] [tracker:init] started
web_2  | 00:30:47.321 [info] [swarm on [email protected]] [tracker:init] started
web_1  | ------> PONG! 3
web_1  | ------> PONG! 4
web_1  | 00:30:51.443 [info] [libcluster:dns] connected to :"[email protected]"
web_1  | 00:30:51.444 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.444 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.444 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_1  | 00:30:51.445 [info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]"]
web_1  | 00:30:51.445 [info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
web_5  | 00:30:51.447 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_5  | 00:30:51.447 [info] [swarm on [email protected]] [tracker:cluster_wait] syncing with [email protected]
web_5  | 00:30:51.449 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
web_5  | 00:30:51.449 [debug] [swarm on [email protected]] [tracker:sync_registry] local tracker is missing ThePonger, adding to registry
web_5  | 00:30:51.449 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
web_5  | 00:30:51.449 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_5  | 00:30:51.449 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_5  | 00:30:51.449 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on remote node [email protected]
web_5  | 00:30:51.451 [debug] [swarm on [email protected]] [tracker:start_pid_remotely] ThePonger already registered to #PID<32705.1368.0> on [email protected], registering locally
web_1  | 00:30:51.448 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
web_1  | 00:30:51.448 [info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
web_1  | 00:30:51.448 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_1  | 00:30:51.450 [debug] [swarm on [email protected]] [tracker:handle_call] :"[email protected]" is registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_1  | 00:30:51.450 [debug] [swarm on [email protected]] [tracker:do_track] found ThePonger already registered on [email protected]
web_1  | 00:30:51.453 [info] [libcluster:dns] connected to :"[email protected]"
web_4  | 00:30:51.455 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.455 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.457 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_1  | 00:30:51.457 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:30:51.486 [info] [libcluster:dns] connected to :"[email protected]"
web_1  | 00:30:51.487 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.487 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_1  | 00:30:51.487 [debug] [swarm on [email protected]] [tracker:handle_topology_change] #PID<0.1368.0> belongs on [email protected]
web_1  | ------> BEGIN HANDOFF! counter=5
web_3  | 00:30:51.487 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.490 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32705.1368.0>
web_5  | 00:30:51.499 [info] [libcluster:dns] connected to :"[email protected]"
web_5  | 00:30:51.499 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.500 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_5  | 00:30:51.500 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:30:51.488 [debug] [swarm on [email protected]] [tracker:handle_topology_change] ThePonger has requested to be resumed
web_1  | 00:30:51.493 [debug] [swarm on [email protected]] [tracker:handle_topology_change] sending handoff for ThePonger to [email protected]
web_1  | 00:30:51.493 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | ------> DIE! counter=5
web_5  | 00:30:51.509 [info] [libcluster:dns] connected to :"[email protected]"
web_3  | 00:30:51.512 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.514 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.514 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_5  | 00:30:51.514 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_2  | 00:30:51.516 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_4  | 00:30:51.501 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.516 [info] [libcluster:dns] connected to :"[email protected]"
web_1  | 00:30:51.524 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_1  | 00:30:51.524 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_1  | 00:30:51.525 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_5  | 00:30:51.531 [info] [libcluster:dns] connected to :"[email protected]"
web_5  | 00:30:51.532 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_2  | 00:30:51.534 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_5  | 00:30:51.533 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodeup for [email protected])
web_5  | 00:30:51.533 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_5  | 00:30:51.545 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_4  | 00:30:51.725 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_3  | 00:30:51.725 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_3  | 00:30:51.739 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_2  | 00:30:51.739 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_4  | 00:30:51.750 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_2  | 00:30:51.751 [info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
web_4  | 00:30:51.789 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_4  | 00:30:51.789 [info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]
web_4  | 00:30:51.789 [info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
web_3  | 00:30:51.790 [info] [swarm on [email protected]] [tracker:cluster_wait] pending sync request from [email protected]
web_3  | 00:30:52.169 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_3  | 00:30:52.169 [info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]
web_3  | 00:30:52.169 [info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
web_2  | 00:30:52.170 [info] [swarm on [email protected]] [tracker:cluster_wait] pending sync request from [email protected]
web_2  | 00:30:52.300 [info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
web_2  | 00:30:52.300 [info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]
web_2  | 00:30:52.300 [info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
web_5  | 00:30:52.301 [debug] [swarm on [email protected]] [tracker:handle_cast] received sync request from [email protected]
web_2  | 00:30:52.301 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
web_2  | 00:30:52.301 [info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
web_2  | 00:30:52.301 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] clearing pending sync request for [email protected]
web_2  | 00:30:52.304 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
web_2  | 00:30:52.304 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
web_2  | 00:30:52.304 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_2  | 00:30:52.304 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_2  | 00:30:52.304 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on remote node [email protected]
web_1  | 00:30:52.312 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32717.1389.0>) locally
web_5  | 00:30:52.302 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
web_5  | 00:30:52.302 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
web_5  | 00:30:52.302 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_5  | 00:30:52.311 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32708.1389.0>) locally
web_4  | 00:30:52.307 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
web_4  | 00:30:52.307 [info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
web_4  | 00:30:52.307 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_4  | 00:30:52.307 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_4  | 00:30:52.308 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on remote node [email protected]
web_4  | 00:30:52.308 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32705.1368.0>
web_4  | 00:30:52.313 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32707.1389.0>) locally
web_4  | 00:30:52.314 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32707.1389.0>) locally
web_2  | 00:30:52.311 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32707.1389.0>) locally
web_2  | 00:30:52.314 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32707.1389.0>) locally
web_3  | 00:30:52.303 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
web_3  | 00:30:52.303 [info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
web_3  | 00:30:52.303 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] clearing pending sync request for [email protected]
web_3  | 00:30:52.309 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
web_3  | 00:30:52.309 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
web_3  | 00:30:52.309 [info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
web_3  | 00:30:52.309 [debug] [swarm on [email protected]] [tracker:handle_call] registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_3  | 00:30:52.309 [debug] [swarm on [email protected]] [tracker:do_track] starting ThePonger on [email protected]
web_3  | 00:30:52.309 [debug] [swarm on [email protected]] [tracker:do_track] started ThePonger on [email protected]
web_5  | 00:30:52.316 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32708.1389.0>) locally
web_3  | 00:30:52.313 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32705.1368.0>
web_3  | ------> RESOLVE CONFLICT! my_counter=5 their_counter=0
web_1  | 00:30:52.319 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replicating registration for ThePonger (#PID<32717.1389.0>) locally
web_3  | 00:30:52.320 [debug] [swarm on [email protected]] [tracker:handle_call] :"[email protected]" is registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_3  | 00:30:52.320 [debug] [swarm on [email protected]] [tracker:do_track] found ThePonger already registered on [email protected]
web_3  | 00:30:52.320 [debug] [swarm on [email protected]] [tracker:handle_call] :"[email protected]" is registering ThePonger as process started by Elixir.QuantumSwarm.Ponger.start_link/0 with args []
web_3  | 00:30:52.320 [debug] [swarm on [email protected]] [tracker:do_track] found ThePonger already registered on [email protected]
web_4  | 00:30:52.322 [debug] [swarm on [email protected]] [tracker:start_pid_remotely] ThePonger already registered to #PID<32707.1389.0> on [email protected], registering locally
web_2  | 00:30:52.333 [debug] [swarm on [email protected]] [tracker:start_pid_remotely] ThePonger already registered to #PID<32707.1389.0> on [email protected], registering locally
web_3  | 00:30:52.346 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_2  | 00:30:52.427 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_4  | 00:30:52.580 [info] Running QuantumSwarmWeb.Endpoint with Cowboy using http://:::80
web_3  | ------> PONG! 5
web_3  | ------> PONG! 6
web_5  | 00:31:00.686 [info] SIGTERM received - shutting down
web_5  |
web_3  | 00:31:00.686 [info] SIGTERM received - shutting down
web_3  |
web_2  | 00:31:00.688 [info] SIGTERM received - shutting down
web_2  |
web_4  | 00:31:00.700 [info] SIGTERM received - shutting down
web_4  |
web_4  | 00:31:00.709 [debug] [swarm on [email protected]] [tracker:handle_monitor] ThePonger is down: :shutdown
web_4  | 00:31:00.716 [warn] [swarm on [email protected]] [tracker:broadcast_event] broadcast of event ({:untrack, #PID<32707.1389.0>}) was not recevied by [:"[email protected]", :"[email protected]"]
web_4  | 00:31:00.716 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32707.1389.0>
web_4  | 00:31:00.717 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32707.1389.0>
web_1  | 00:31:00.708 [debug] [swarm on [email protected]] [tracker:handle_monitor] ThePonger is down: :shutdown
web_1  | 00:31:00.714 [warn] [swarm on [email protected]] [tracker:broadcast_event] broadcast of event ({:untrack, #PID<32717.1389.0>}) was not recevied by [:"[email protected]", :"[email protected]"]
web_1  | 00:31:00.714 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32717.1389.0>
web_1  | 00:31:00.714 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32717.1389.0>
web_5  | 00:31:00.713 [debug] [swarm on [email protected]] [tracker:handle_monitor] ThePonger is down: :shutdown
web_5  | 00:31:00.776 [warn] [swarm on [email protected]] [tracker:broadcast_event] broadcast of event ({:untrack, #PID<32708.1389.0>}) was not recevied by [:"[email protected]", :"[email protected]"]
web_5  | 00:31:00.776 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32708.1389.0>
web_5  | 00:31:00.776 [debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<32708.1389.0>
web_1  | 00:31:01.713 [info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
web_1  | 00:31:01.713 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
web_1  | 00:31:01.713 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:31:01.769 [info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
web_1  | 00:31:01.769 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
web_1  | 00:31:01.769 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:31:01.777 [info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
web_1  | 00:31:01.777 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
web_1  | 00:31:01.777 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
web_1  | 00:31:01.784 [info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
web_1  | 00:31:01.784 [debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
web_1  | 00:31:01.784 [info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
quantum_swarm_umbrella_web_2 exited with code 0
quantum_swarm_umbrella_web_3 exited with code 0
quantum_swarm_umbrella_web_4 exited with code 0
quantum_swarm_umbrella_web_5 exited with code 0

@ejscunha

Hey 👋

We are using quantum in our project and we are being affected by this issue when scaling down our cluster.

@doughsay Looking at your example, I'd say that's the expected behaviour: when you scale down, the application receives a SIGTERM signal, which makes all started applications gracefully terminate their supervision trees. When the GenServer spawned through Swarm receives the exit signal, it exits with reason :shutdown, and Swarm stops tracking the process because it exited normally.

@maennchen
Author

@ejscunha Could I solve this by setting restart to permanent again? I think I set it to transient to make it work with swarm.
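For context, the restart strategy in question is the restart: key of the worker's child spec; an illustrative spec only, with a placeholder module rather than quantum's actual spec:

%{
  id: MyApp.Worker,
  start: {MyApp.Worker, :start_link, []},
  # :permanent -> always restarted by the supervisor
  # :transient -> restarted only if it terminates abnormally
  # :temporary -> never restarted
  restart: :transient
}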

@ejscunha

@maennchen If you are talking about the processes started in the Supervisor init, the restart parameter won't make any difference: those processes are linked to Swarm.Tracker (since the start_link function is called by Swarm), not to the Quantum.Supervisor.

A possible solution: the functions passed to Swarm to start the processes should call Supervisor.start_child/2, so that the processes are linked to a Supervisor while still being registered with Swarm. Additionally, the spawned processes should trap exits and define a terminate callback that hands the process off with Swarm.Tracker.handoff/2 when it is exiting due to a SIGTERM signal. This solution is not perfect and has plenty of potential issues, but I think it can solve this one.

@maennchen
Author

maennchen commented Sep 13, 2018

@ejscunha So a worker would have to implement something like this?

defmodule Acem.Worker do
  use GenServer

  # ...

  def init(opts) do
    # Trap exits so terminate/2 runs when the supervisor shuts this process down
    Process.flag(:trap_exit, true)
    # ...
    {:ok, opts}
  end

  # ...

  def terminate(_reason, %{name: name} = state) do
    handoff_state = {:resume, state}
    Swarm.Tracker.handoff(name, handoff_state)
    :ok
  end

  # ...
end

Could you specify further how you would build the Supervisor.start_child/2 part?

@ejscunha

ejscunha commented Sep 13, 2018

@maennchen Yes, something like that.

An example on how Supervisor.start_child/2 could be used:

defmodule Quantum.Supervisor do
  def init(opts) do
    # ...
    if global do
      child_spec = TaskRegistry.child_spec(task_registry_opts)
      Swarm.register_name(task_registry_name, __MODULE__, :start_child_swarm, [child_spec], 15_000)
      # ...
    end
    # ...
  end

  def start_child_swarm(child_spec) do
    Supervisor.start_child(<supervisor_name>, child_spec)
  end
end

@maennchen
Author

@ejscunha And how would that help with the shutdown on SIGTERM?

@ejscunha

@maennchen Since your processes would now be trapping exits, when the Supervisor tries to terminate its children due to a SIGTERM signal, they won't exit immediately with the reason :shutdown (which would make Swarm stop tracking them); instead, we would hand them off to another node.
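One related detail worth noting (my addition, assuming the worker sits under a regular Supervisor as proposed above): when a child traps exits, the supervisor sends it an exit signal with reason :shutdown and waits for the shutdown value in its child spec before killing it, so that value bounds how long terminate/2 (and therefore the handoff) is allowed to take:

%{
  id: MyApp.Worker,
  start: {MyApp.Worker, :start_link, []},
  restart: :transient,
  # Milliseconds the supervisor waits for terminate/2 to finish the handoff
  # before killing the process; the default for workers is 5_000.
  shutdown: 10_000
}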

@bitwalker
Owner

@arjan Perhaps this occurs when the handoff is performed from a third-party node? In other words, due to this change? In that case, the handoff request comes not from the tracker node or the registered owner, but from an external node which has decided to trigger the handoff for some reason.

It's not clear to me if that is what is happening here, but the only other possibility is a divergence in the registries due to churn, where the same process has been moved more than once, and we're handling a handoff request for one of the earlier moves. I haven't confirmed if that's possible, but it is the only other thing I can think of.

@arjan
Collaborator

arjan commented Sep 14, 2018

@bitwalker That seems a plausible explanation: it receives the request from a node it doesn't expect it from. However, I'm not seeing quantum call this Tracker.handoff function, which is required to move a process to a node where (according to the other nodes) it doesn't belong. And come to think of it, #83 was merged after this bug was reported, so it cannot be the cause...

@bitwalker
Owner

I guess the implication then is that it must be due to the latter of the two possibilities I described - at a high enough rate of churn, events may get processed in an order different than we expect (i.e. we're handling a handoff event when the registration has been recently changed, but is not fully synced with reality, or put another way, the handoff is coming from a process which is not yet associated with the name on the node receiving the handoff).

We'll need to evaluate the right way to deal with this, as I'm not confident saying "X is the solution" without a pretty thorough review. My guess is that we may want to proceed as if the handoff is correct, and assume that a subsequent event will address the discrepancy (i.e. we're catching up to the real state of the world), but I'm not sure how best to do that.

@arjan
Collaborator

arjan commented Sep 16, 2018

Yes, either we accept the inconsistent request or we drop it; both could have unintended implications. It would be best to create a reproducible situation.

@maennchen Since this issue was filed, #94 was merged, which included #85, which made quite a few functional changes in the tracker's CRDT. Could you try to reproduce this CaseClauseError with the latest master? Just in case...

@doughsay

I updated the repro repo to use swarm master: https://github.com/peek-travel/quantum_swarm

I still can't get it to be stable when scaling up or down in docker-compose. I get many kinds of errors; a lot of them look like they're coming from Quantum, but this one looks like it's coming from swarm, and it doesn't really tell us much...

web_1  | 01:21:56.555 [error] [swarm on [email protected]] [tracker:start_pid_remotely] ** (exit) exited in: :gen_statem.call({Swarm.Tracker, :"[email protected]"}, {:track, QuantumSwarm.Scheduler.TaskRegistry, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: QuantumSwarm.Scheduler.TaskRegistry}]}}}, :infinity)
web_1  |     ** (EXIT) shutdown
web_1  |     (stdlib) gen.erl:177: :gen.do_call/4
web_1  |     (stdlib) gen_statem.erl:598: :gen_statem.call_dirty/4
web_1  |     (swarm) lib/swarm/tracker/tracker.ex:1331: Swarm.Tracker.start_pid_remotely/6
web_1  |     (elixir) lib/task/supervised.ex:89: Task.Supervised.do_apply/2
web_1  |     (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

I'd encourage anyone who has docker and docker-compose installed to clone that repo, do some random scaling up and down, and see whether it's just me seeing these errors.

@arjan
Collaborator

arjan commented Sep 17, 2018

Thanks, I'll have a look.

@arjan
Collaborator

arjan commented Sep 17, 2018

Yes, I've seen this issue now as well. The ponger sometimes stops responding after scaling up.

I'm seeing timeout errors when calling the tracker, which crash the caller and sometimes terminate the node:

web_2 | {"Kernel pid terminated",application_controller,"{application_start_failure,quantum_swarm,{{shutdown,{failed_to_start_child,'Elixir.QuantumSwarm.Scheduler',{timeout,{gen_statem,call,['Elixir.Swarm.Tracker',{track,'Elixir.QuantumSwarm.Scheduler.TaskRegistry',#{mfa => {'Elixir.Quantum.TaskRegistry',start_link,[#{'__struct__' => 'Elixir.Quantum.TaskRegistry.StartOpts',name => 'Elixir.QuantumSwarm.Scheduler.TaskRegistry'}]}}},15000]}}}},{'Elixir.QuantumSwarm.Application',start,[normal,[]]}}}"}

@maennchen
Author

I tried to incorporate the suggested changes from above (handing off in terminate, starting via Supervisor.start_child).

You can see the commit here: https://github.com/jshmrtn/quantum-elixir/commit/41fd06c42a948c0160bf68f16690bc646dc81772

I don't really see what is happening, but the application is now crashing on scale down.

@maennchen
Author

Is there something more I should try to fix the problem?
