
Allow job to override shutdown_timeout - k8s #942

Open
godber opened this issue Jan 4, 2019 · 8 comments
Labels
bug k8s Applies to Teraslice in kubernetes cluster mode only. pkg/teraslice priority:medium


godber commented Jan 4, 2019

When POSTing to _stop a job, one can provide a timeout as shown here:
https://github.com/terascope/teraslice/blob/master/docs/api.md#post-v1jobsjobid_stop

This is not applied in the case of kubernetes clustering because I had forgotten about this feature of the API. I suspect it is possible to pass this value through and override the default which gets set based on cluster configuration here:

https://github.com/terascope/teraslice/blob/v0.46.0/packages/teraslice/lib/cluster/services/cluster/backends/kubernetes/deployments/worker.hbs#L65

I suspect the value in the deployment is overridden by calls to the API because kubectl delete has the following option:

--grace-period=-1: Period of time in seconds given to the resource to terminate gracefully. Ignored if negative.

So I suspect it should be possible to pass a grace-period option of some sort to the delete calls, like this one:

https://github.com/terascope/teraslice/blob/v0.46.0/packages/teraslice/lib/cluster/services/cluster/backends/kubernetes/k8s.js#L200-L201
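If the API can thread the stop timeout through, the backend's delete call would need to translate it into the `gracePeriodSeconds` that Kubernetes delete options accept. A minimal sketch of that translation, assuming a hypothetical helper and a millisecond convention on the Teraslice side (this is not the actual k8s.js code):

```javascript
// Hypothetical helper: translate a Teraslice stop timeout (milliseconds)
// into the gracePeriodSeconds field Kubernetes delete options expect (seconds).
function buildDeleteOptions(timeoutMs) {
    if (timeoutMs == null || timeoutMs < 0) {
        // Absent or negative means "use the pod's own terminationGracePeriodSeconds",
        // mirroring kubectl's "Ignored if negative" behavior.
        return {};
    }
    return { gracePeriodSeconds: Math.ceil(timeoutMs / 1000) };
}

console.log(buildDeleteOptions(30000)); // { gracePeriodSeconds: 30 }
console.log(buildDeleteOptions(-1));    // {}
```

The rounding direction is deliberate in this sketch: rounding up avoids accidentally shortening the grace period for sub-second remainders.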

@godber godber added bug k8s Applies to Teraslice in kubernetes cluster mode only. labels Jan 4, 2019
@peterdemartini (Contributor) commented:
This won't have the intended effect in all cases. If the stop timeout supplied through the API is greater than the worker shutdown timeout specified in the Teraslice configuration, the worker will still shut down within the configured timeout, since it does not know about the different value. If the API timeout is less than the configured timeout, the worker could potentially shut down non-gracefully, but that risk is likely acceptable.


godber commented Jan 5, 2019

The behavior you state is fine. The primary use case for the timeout supplied through the API is to shorten the timeout period ... e.g. the user doesn't want to wait the default timeout period for some reason. So it's OK if the worker doesn't know, at least for the intended use case. A non-graceful shutdown should be considered a possible consequence of using timeout, yes. For that matter, any shutdown has the potential to be non-graceful if any of the timeouts are shorter than the slice processing time.
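The interplay described in these two comments can be sketched as follows (all names here are hypothetical, for illustration only): the worker only knows its configured `shutdown_timeout`, so the pod dies at whichever deadline expires first, and the shutdown is only graceful if the in-flight slice finishes before that.

```javascript
// Sketch of the timeout interplay: the effective deadline is the shorter
// of the configured worker timeout and the API-supplied stop timeout.
function effectiveShutdown(configuredTimeoutMs, apiTimeoutMs, sliceTimeMs) {
    const deadline = Math.min(configuredTimeoutMs, apiTimeoutMs);
    return {
        stopsAfterMs: Math.min(deadline, sliceTimeMs),
        graceful: sliceTimeMs <= deadline, // slice finished before the deadline
    };
}

// Default 5 min configured timeout, API asks for 30s, slice takes 2 min:
// the pod is killed at 30s and the in-flight slice is lost.
console.log(effectiveShutdown(300000, 30000, 120000));
```

This makes the intended use case concrete: shortening the wait is safe to offer, as long as the data-loss caveat is documented.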


godber commented Jan 5, 2019

It's probably a good idea to call out these caveats in the documentation where we talk about using the timeout parameter. Specifically, call out the potential for data loss on slices whose processing time exceeds that timeout.


godber commented Feb 7, 2019

I am going to hold on this request for now. It seems that setting gracePeriodSeconds on a deployment doesn't propagate to the pods by design:

kubernetes/kubernetes#25055
kubernetes/kubernetes#24964

This doesn't mean it's impossible, just that I have to delete pods now. The more code I write to wrangle k8s objects the more I think I should be using custom controllers/resources.

@godber godber added the priority:hold Work on this issue is on hold. label Feb 25, 2019

godber commented May 1, 2019

It has occurred to me that at the very least, the job definition should be able to override the shutdownTimeout. This could be carefully chosen by the author of the job to improve (shorten) shutdown times. The default of 5min is chosen out of an abundance of caution, but if a job is just a simple copy from one system to another, and slice completion times are known to be short, then the job should be able to override the cluster default.
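A sketch of the precedence this suggests, assuming a hypothetical `shutdown_timeout` field on the job configuration and the 5-minute cluster default (the field name and resolution logic are assumptions, not existing Teraslice behavior):

```javascript
// The cautious cluster-wide default: 5 minutes, in milliseconds.
const CLUSTER_DEFAULT_MS = 5 * 60 * 1000;

// Hypothetical precedence: a valid job-level shutdown_timeout, when present,
// overrides the cluster default baked into the worker deployment.
function resolveShutdownTimeout(jobConfig) {
    const jobValue = jobConfig && jobConfig.shutdown_timeout;
    return Number.isFinite(jobValue) && jobValue >= 0 ? jobValue : CLUSTER_DEFAULT_MS;
}

console.log(resolveShutdownTimeout({ shutdown_timeout: 30000 })); // 30000
console.log(resolveShutdownTimeout({}));                          // 300000
```

Validating the job value (finite, non-negative) matters here, since a bogus value silently falling through to the default is safer than a crash during shutdown.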

@godber godber added priority:medium and removed priority:hold Work on this issue is on hold. labels May 1, 2019
@godber godber changed the title Honor the timeout value specified when stop is called via API - k8s Allow job to override shutdown_timeout - k8s Jun 12, 2019
@kstaken kstaken added this to the v1.0 milestone Aug 6, 2019

godber commented Feb 26, 2020

One of the problems this issue is meant to address is that when stopping teraslice jobs we often have to wait this full timeout period for reasons that can't be explained by the slice completion time (e.g. it takes five minutes to stop a worker whose slices take 30s to complete).

It has recently occurred to me that the root cause of this is that the execution controller might shut down before the workers have completed their slices and sent their statistics back to it. See this:

https://github.com/terascope/teraslice/blob/master/packages/teraslice/lib/cluster/services/cluster/backends/kubernetes/k8s.js#L236-L240

Since there is interaction between the workers and execution controllers at the service level that takes place after shutdown is requested ... my attempts to just delete these resources (and hence pods) in such an uncontrolled manner seem to be a bad idea. The execution controller and its service must only be deleted after the last worker has finished and exited cleanly.

I think that can cause these long stop issues, as the workers time out waiting to hear back from the already-dead execution controller.
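The ordering constraint described above can be sketched like this (the client methods are hypothetical, not the real k8s.js API): stop the workers first, wait for them to exit cleanly, and only then delete the execution controller and its service.

```javascript
// Sketch of an ordered teardown. Deleting the execution controller before
// the workers have reported back is what leaves workers waiting out their
// full shutdown timeout against a dead peer.
async function stopExecution(k8s, exId) {
    await k8s.deleteWorkers(exId);              // 1. ask workers to stop
    await k8s.waitForWorkersToExit(exId);       // 2. wait for clean worker exit
    await k8s.deleteExecutionController(exId);  // 3. only now remove the controller
    await k8s.deleteService(exId);              // 4. and its service
}
```

Anything that deletes these resources in bulk, without awaiting step 2, reintroduces the uncontrolled teardown described in the comment.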


godber commented Aug 4, 2020

We have still seen a 5 minute worker shutdown timeout even with my controlled job shutdown code merged in #2074.

We're going to have to look at the kafka asset, I think.


godber commented Aug 4, 2020

I guess we could leave this issue open if we really want jobs to override the shutdown timeout. But that just kind of plasters over whatever the real shutdown problem is without addressing the root cause.
