Timeouts on stop job #434

Closed
kstaken opened this issue Jun 21, 2017 · 2 comments

@kstaken
Member

kstaken commented Jun 21, 2017

When a job is stopped and the workers take some time to exit, you get the following error.

curl -XPOST localhost:5678/ex/56dea3e4-820c-4508-a946-5704ff273675/_stop
{
    "error": 500,
    "message": "Could not stop job, error: Error communicating with node: localhost, Could not send msg: cluster:job:stop, data: {\"ex_id\":\"56dea3e4-820c-4508-a946-5704ff273675\",\"_msgID\":\"SJQofzdQb\"}"
}

This appears to be the result of a timeout, but I think there are several issues here.

  1. The error message is misleading. If this is a timeout it should probably be more explicit about it.
  2. The default timeout is probably too short. What would be the impact of increasing it to something like 5m?
  3. The job does appear to stop and all workers exit; however, it is not labeled as stopped and remains in its prior state. Immediately re-running _stop works and sets the state to stopped, even though the job no longer appears to be running.
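
To illustrate point 1, here is a minimal sketch of a send-and-wait helper that reports a timeout explicitly instead of a generic "Could not send msg" error. It is not the project's code: `send_and_wait`, `StopTimeoutError`, and the `send(msg, reply)` transport callback are hypothetical names used only for illustration.

```python
import queue


class StopTimeoutError(Exception):
    """Raised when a node does not acknowledge a message in time."""


def send_and_wait(send, msg, timeout_s):
    """Send msg and wait up to timeout_s seconds for a reply.

    `send(msg, reply)` is a hypothetical transport callback; it is
    expected to call `reply(payload)` once the node acknowledges msg.
    """
    replies = queue.Queue()
    send(msg, replies.put)
    try:
        # Block until the node replies or the timeout elapses.
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        # Name the cause (a timeout) and the message that went unanswered.
        raise StopTimeoutError(
            f"Timed out after {timeout_s}s waiting for a reply to "
            f"{msg['message']} (ex_id: {msg['ex_id']})"
        ) from None
```

With an error like this, the caller can tell a slow shutdown apart from a real communication failure.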
@jsnoble
Member

jsnoble commented Jul 5, 2017

This is mainly resolved and linked to #436. The error message was changed and the timeout is now 5 minutes. I left that code as is, since we need the guarantee that the job has actually stopped before marking it as such.
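
The guarantee described above can be sketched roughly as follows: the status is only set to stopped once every worker has exited, and it is left untouched if the wait times out. This is an assumption-laden illustration, not the actual implementation; `stop_and_mark`, `workers_alive`, and `set_status` are hypothetical names.

```python
import time


def stop_and_mark(workers_alive, set_status, timeout_s, poll_s=0.1):
    """Mark the job stopped only after all workers have exited.

    `workers_alive()` (hypothetical) returns the number of workers still
    running; `set_status(name)` (hypothetical) records the job state.
    Returns True if the job was marked stopped, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while workers_alive() > 0:
        if time.monotonic() >= deadline:
            # Workers are still exiting: leave the prior state in place.
            return False
        time.sleep(poll_s)
    set_status("stopped")
    return True
```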

@jsnoble jsnoble closed this as completed Jul 5, 2017
@erik-stephens
Contributor

I've been seeing this issue consistently. On the first call to _stop, the slicer stops but the job stays running, and the response is "Request timed out (30s)". The second call to _stop returns successfully and stops the job.

My job includes a Kafka reader with wait:30s. Changing to wait:10s and interval:10s had no impact; it still timed out after 30s.
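
As a stopgap on the client side, the "call _stop twice" behaviour can be wrapped in a small retry helper. This is only a sketch: `stop_with_retry` and `post_stop` are hypothetical names, and it assumes error responses carry an "error" key, as in the response quoted at the top of this issue.

```python
import time


def stop_with_retry(post_stop, ex_id, attempts=2, delay_s=1.0):
    """Call post_stop(ex_id) until it succeeds or attempts run out.

    `post_stop` is a hypothetical callable that POSTs to
    /ex/<ex_id>/_stop and returns the parsed JSON body as a dict.
    Assumption: a body containing an "error" key means the call failed.
    """
    body = None
    for attempt in range(attempts):
        body = post_stop(ex_id)
        if "error" not in body:
            return body
        if attempt < attempts - 1:
            # Give the workers a moment to finish exiting before retrying.
            time.sleep(delay_s)
    # Every attempt failed: return the last error body to the caller.
    return body
```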

@erik-stephens erik-stephens reopened this Jul 14, 2017
@kstaken kstaken added the bug label Jul 28, 2017
@erik-stephens erik-stephens self-assigned this Aug 7, 2017
@godber godber closed this as completed in 42aaa83 Sep 5, 2017
godber added a commit that referenced this issue Sep 5, 2017
fixed stop job timeout issues resolves #434

3 participants