Allow job to override shutdown_timeout - k8s #942
Comments
This won't have the intended effect in all cases. If the requested shutdown timeout is greater than the worker shutdown timeout specified in the Teraslice configuration, the worker will still shut down within its configured timeout, since it does not know about the different value. If the requested timeout is less than the configured timeout, the worker could potentially shut down non-gracefully, but that risk is likely acceptable.
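To make this caveat concrete, here is a minimal illustrative sketch (not Teraslice code; the function name and arguments are hypothetical):

```typescript
// Illustrative only: the worker never learns a job-level timeout, so the
// window it actually gets is bounded by its own configured shutdown_timeout.
// A requested value larger than the configured one has no effect; a smaller
// one can kill the pod before in-flight slices finish (non-graceful shutdown).
function effectiveShutdownMs(configuredMs: number, requestedMs: number): number {
    return Math.min(configuredMs, requestedMs);
}

// e.g. configured 300000 (5 min), requested 600000 -> still 300000
console.log(effectiveShutdownMs(300_000, 600_000));
// e.g. configured 300000, requested 30000 -> pod killed at 30s, possibly mid-slice
console.log(effectiveShutdownMs(300_000, 30_000));
```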
The behavior you state is fine. The primary use case for the …
It's probably a good idea to call out these caveats in the documentation where we talk about using the …
I am going to hold off on this request for now. It seems that setting this runs into kubernetes/kubernetes#25055. This doesn't mean it's impossible, just that I have to delete pods directly now. The more code I write to wrangle k8s objects, the more I think I should be using custom controllers/resources.
It has occurred to me that, at the very least, the job definition should be able to override the shutdown_timeout.
One of the problems this issue is meant to address is that, when stopping Teraslice jobs, we often have to wait the full timeout period for reasons that can't be explained by the slice completion time (e.g. it takes five minutes to stop a worker whose slices take 30s to complete). It has recently occurred to me that the root cause of this is that the execution controller might shut down before the workers have completed their slices and sent their statistics back to the execution controller.

See this: Since there is interaction between the workers and execution controllers at the service level that takes place after shutdown is requested ... my attempts to just delete these resources (and hence pods) in such an uncontrolled manner seem to be a bad idea. The execution controller and its service must only be deleted after the last worker has finished and exited cleanly.

I think that can cause these long stop issues, with the workers timing out waiting to hear back from the already dead execution controller.
We have still seen a 5 minute worker shutdown timeout even with my controlled job shutdown code merged in #2074. We're going to have to look at the kafka asset, I think.
I guess we could leave this issue open if we really want jobs to override the shutdown timeout. But that just kind of plasters over whatever the real shutdown problem is without addressing the root cause.
When POSTing to _stop a job, one can provide a timeout as shown here:
https://github.com/terascope/teraslice/blob/master/docs/api.md#post-v1jobsjobid_stop
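For illustration, a hedged sketch of such a request, assuming a Teraslice master on localhost:5678 and that the timeout is passed as a query parameter in milliseconds (per the linked docs; not verified here):

```typescript
// Sketch: stop a job and request a specific shutdown timeout.
// Assumes a Teraslice master at localhost:5678 and a `timeout` query
// parameter in milliseconds, per the API doc linked above.
async function stopJob(jobId: string, timeoutMs: number): Promise<unknown> {
    const res = await fetch(
        `http://localhost:5678/v1/jobs/${jobId}/_stop?timeout=${timeoutMs}`,
        { method: 'POST' }
    );
    return res.json();
}

// e.g. ask the job to stop within 2 minutes
stopJob('example-job-id', 120_000).then(console.log).catch(console.error);
```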
This is not applied in the case of Kubernetes clustering because I had forgotten about this feature of the API. I suspect it is possible to pass this value through and override the default, which gets set based on cluster configuration here:
https://github.com/terascope/teraslice/blob/v0.46.0/packages/teraslice/lib/cluster/services/cluster/backends/kubernetes/deployments/worker.hbs#L65
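As a rough, hypothetical sketch of how that default ends up on the pod (the function and variable names are illustrative, not the actual worker.hbs template variables):

```typescript
// Illustrative sketch: the cluster-level shutdown timeout becomes the pod's
// terminationGracePeriodSeconds when the worker Deployment is generated.
// Assumes the Teraslice timeout is configured in milliseconds; Kubernetes
// expects whole seconds.
function buildWorkerPodSpec(shutdownTimeoutMs: number) {
    return {
        terminationGracePeriodSeconds: Math.ceil(shutdownTimeoutMs / 1000),
        containers: [
            // ... worker container definition elided ...
        ],
    };
}

// e.g. a 60000 ms shutdown_timeout -> terminationGracePeriodSeconds: 60
console.log(buildWorkerPodSpec(60_000));
```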
I suspect the value in the deployment can be overridden by calls to the API, because kubectl delete has a --grace-period option. So I suspect it should be possible to pass a grace-period option of some sort to the delete calls, like this one:
https://github.com/terascope/teraslice/blob/v0.46.0/packages/teraslice/lib/cluster/services/cluster/backends/kubernetes/k8s.js#L200-L201
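A hedged sketch of that idea, assuming a delete wrapper that accepts Kubernetes delete options; the K8sClient interface below is illustrative, not the actual k8s.js API:

```typescript
// Sketch only: on a pod delete, Kubernetes honors DeleteOptions.gracePeriodSeconds,
// overriding the pod spec's terminationGracePeriodSeconds. Whether that propagates
// cleanly when deleting higher-level objects like Deployments is the complication
// noted in the comments above (kubernetes/kubernetes#25055).
interface K8sClient {
    delete(name: string, objType: string, options?: object): Promise<unknown>;
}

async function deleteWithGracePeriod(
    k8s: K8sClient,
    name: string,
    objType: string,
    gracePeriodSeconds: number
): Promise<unknown> {
    // Hypothetical: forward the caller's timeout (in seconds) as the
    // delete's grace period instead of relying on the deployment default.
    return k8s.delete(name, objType, { gracePeriodSeconds });
}
```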