Timeouts on stop job #434

kstaken · 2017-06-21T15:52:03Z

When a job is stopped and the workers take some time to exit, you get the following error.

curl -XPOST localhost:5678/ex/56dea3e4-820c-4508-a946-5704ff273675/_stop
{
    "error": 500,
    "message": "Could not stop job, error: Error communicating with node: localhost, Could not send msg: cluster:job:stop, data: {\"ex_id\":\"56dea3e4-820c-4508-a946-5704ff273675\",\"_msgID\":\"SJQofzdQb\"}"
}

This appears to be the result of a timeout but I think there are several issues here.

1. The error message is misleading. If this is a timeout, it should say so explicitly.
2. The default timeout is probably too short. What is the impact of increasing it to something like 5m?
3. The job does appear to stop and all workers exit; however, it is not labeled as stopped and remains in its prior state. Immediately re-running _stop works and sets the state to stopped, even though the job no longer appears to be running.
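
To illustrate the first point, here is a minimal Python sketch of a wait that reports a timeout explicitly instead of a generic send failure. It is not the project's code: `wait_for_ack`, `StopTimeoutError`, and the use of a `queue.Queue` as the acknowledgement channel are assumptions made purely for illustration.

```python
import queue


class StopTimeoutError(Exception):
    """Raised when no acknowledgement arrives within the allowed time."""


def wait_for_ack(acks, msg_name, node, timeout_s):
    """Wait for one acknowledgement on `acks` (a queue.Queue).

    Hypothetical helper: returns the acknowledgement if one arrives in time,
    otherwise raises an error that names the timeout, the node, and the message.
    """
    try:
        return acks.get(timeout=timeout_s)
    except queue.Empty:
        # Say plainly that this was a timeout, and how long we waited.
        raise StopTimeoutError(
            f"Timed out after {timeout_s}s waiting for node '{node}' "
            f"to acknowledge '{msg_name}'"
        ) from None
```

With an error like this, the caller can tell a slow shutdown apart from a real communication failure.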


jsnoble · 2017-07-05T20:26:11Z

This is mainly resolved and linked to #436. The error message was changed and the timeout is now 5 minutes. I left that code as is, since we need the guarantee that the job has actually stopped before marking it as such.
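
The guarantee described here can be sketched as follows: the status changes only once every worker has exited, and a timeout leaves it untouched. This is a hypothetical Python illustration, not the actual change; `stop_and_mark`, the `job` dict, and the `workers_alive` callback are made-up names, and only the 5-minute default comes from the comment above.

```python
import time


def stop_and_mark(job, workers_alive, timeout_s=300.0, poll_s=0.5,
                  clock=time.monotonic, sleep=time.sleep):
    """Mark `job` as stopped only after all workers have exited.

    `workers_alive` is a hypothetical callback returning the number of
    workers still running. Returns True if the job was marked stopped,
    False if the timeout expired first (the status is then left as it was).
    """
    deadline = clock() + timeout_s
    while workers_alive() > 0:
        if clock() >= deadline:
            # Workers are still running: do not claim the job is stopped.
            return False
        sleep(poll_s)
    job["status"] = "stopped"
    return True
```

The trade-off is the one raised in the original report: if the workers exit just after the deadline, the job keeps its prior status until something checks again.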

erik-stephens · 2017-07-14T01:09:06Z

I've been seeing this issue consistently. On first call to _stop, the slicer stops but the job stays running and the response is Request timed out (30s). The 2nd call to _stop returns successfully and stops the job.

My job includes a Kafka reader with wait:30s. Changing to wait:10s & interval:10s did not have an impact - still timed out after 30s.
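
Since a second _stop call succeeds, one client-side workaround is to retry once after a failure. The Python sketch below assumes the endpoint shown in the original report (`/ex/<ex_id>/_stop`); `post_stop` and `stop_with_retry` are hypothetical helpers, and the 60-second client timeout and 1-second pause are arbitrary choices.

```python
import time
import urllib.request


def post_stop(base_url, ex_id, timeout_s=60.0):
    """POST to the _stop endpoint and return the HTTP status code.

    urlopen raises urllib.error.HTTPError (an OSError subclass) for
    4xx/5xx responses, so the 500 from the report surfaces as an exception.
    """
    req = urllib.request.Request(f"{base_url}/ex/{ex_id}/_stop", method="POST")
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


def stop_with_retry(send_stop, attempts=2, pause_s=1.0, sleep=time.sleep):
    """Call `send_stop` up to `attempts` times, retrying on OSError.

    OSError covers HTTP errors, connection errors, and socket timeouts.
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return send_stop()
        except OSError as exc:
            last_error = exc
            if attempt + 1 < attempts:
                sleep(pause_s)
    raise last_error
```

A call would look like `stop_with_retry(lambda: post_stop("http://localhost:5678", ex_id))`. This only papers over the symptom; the job still stays in its prior state after the first call.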

fixed stop job timeout issues resolves #434

jsnoble closed this as completed Jul 5, 2017

erik-stephens reopened this Jul 14, 2017

kstaken added the bug label Jul 28, 2017

erik-stephens self-assigned this Aug 7, 2017

kstaken added the priority:urgent label Aug 23, 2017

godber closed this as completed in 42aaa83 Sep 5, 2017

godber added a commit that referenced this issue Sep 5, 2017

Merge pull request #522 from jsnoble/stop_job_timeouts

138e17a

fixed stop job timeout issues resolves #434
