Update design with more info and remove irrelevant portions.
Dhilip authored Jul 14, 2017 (1 parent 5eb24e4, commit 9e81de3)
Showing 1 changed file with 117 additions and 89 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,7 +5,7 @@ [email protected] | |
March 2017 | ||
|
||
## Prerequisite | ||
This is a continuation of [initContainers](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md) | ||
Understanding of [initContainers](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md) | ||
|
||
## Motivation | ||
Since the introduction of StatefulSet and DaemonSet, PODs that consist of stateful containers are now fully supported by | ||
|
@@ -18,6 +18,7 @@ set of Containers to be executed at the time of POD termination. The defined con | |
in the specified order. It will bring destructor() capability to Pods. | ||
|
||
## Use Cases | ||
[5-minute use case presentation video from a recent SIG Apps meeting](https://youtu.be/s1oZ00_JA00?t=2692) | ||
|
||
### Cleanup: | ||
Most stateful workloads require certain Cleanup activity before or after the appContainers are terminated such as | ||
|
@@ -45,45 +46,46 @@ disrupt the service intermittently. With deferContainers, this handover or re-el | |
* Traditional relational databases typically support a reasonable shutdown sequence, for instance, Oracle has 4 types of | ||
shutdown such as NORMAL, IMMEDIATE, TRANSACTIONAL and ABORT. ‘deferContainers’ will allow us to program and wait for such | ||
complex shutdown scenarios. | ||
* In future when kuberenetes supports Virtual Machine runtime (eg: hyperContainer) for better isolation, we should shutdown | ||
the VMs instead of killing them abruptly. ‘deferContainers’ could help us run such shutdwon commands. | ||
* In future when kubernetes supports Virtual Machine runtime (eg: hyperContainer) for better isolation, we should shutdown | ||
the VMs instead of killing them abruptly. ‘deferContainers’ could help us run such shutdown commands. | ||
|
||
## Limitation with the current system | ||
Container pre-stop hooks are not sufficient for all termination cases: | ||
* They cannot easily coordinate complex conditions across containers in multi-container pods | ||
The container pre-stop lifecycle hook is not sufficient for all termination cases: | ||
* It is container-specific, not pod-specific | ||
* They cannot easily coordinate complex termination conditions across containers in multi-container pods | ||
* They can only function with code in the image or code in a shared volume, which would have to be statically linked | ||
(not a common pattern in wide use) | ||
* They do not work across kubelet restarts. | ||
* Termination waits for the entire grace period even if the pre-stop hook finishes earlier. | ||
* Failed termination steps are not restarted. | ||
* Complex termination scripts are impractical because there is no logging support. | ||
|
||
## Design Requirements | ||
Most of the requirements are very similar to initContainers. They are replicated and modified as necessary. | ||
* deferContainers should be able to: | ||
* Use the same volume (PV) as appContainers such that it can | ||
* Perform cleanup of shared volume, such as delete several temp directories. | ||
* Delete unwanted files before a final sync is initiated. | ||
* Update Configuration files about the changes in the distributed system so that the next pod getting attached to this PV | ||
will benefit from it. (like a new leader/master etc). | ||
* Update Configuration files about the changes in the distributed system so that the next pod getting attached to this PV will benefit from it. (like a new leader/master etc). | ||
* Delete secrets or security related files before the pod de-couples from the PV. | ||
* Delay the termination of application containers until operations are complete | ||
* De-Register the pod with other components of the system | ||
* Program termination sequence for cases where TerminationGracePeriod will be hard to predict beforehand. | ||
|
||
* Reduce coupling: | ||
* Between application images, eliminating the need to customize those images for Kubernetes generally or specific roles | ||
* Inside of images, by specializing which containers perform which tasks (install git into init container, use filesystem | ||
contents in web container) | ||
* Between termination steps, by supporting multiple sequential cleanup containers | ||
* Pre-Exit and Post-Exit workflow | ||
* Post - specify that the controller can continue to delete the appContainers but wait for deferContainer’s to complete | ||
its execution before marking as the Pod is deleted. | ||
* Pre - specify that the controller should wait until deferContainers completes its execution first and then proceed to | ||
delete the appContainers if needed | ||
* deferContainers should allow us to specify | ||
* if certain containers in deferContainer list need to be re-started on failures. | ||
* GracePeriod | ||
* It should be possible to mention overall terminationGracePeriod for defercontainers. | ||
* Run-once and run-forever pods should be able to use deferContainers | ||
* Pre-Exit | ||
* deferContainers should act as a pre-exit trigger and be called when the application is about to be deleted. | ||
* Restart on failure | ||
* If a deferContainer fails during execution, it should be restarted automatically. | ||
* GracePeriod behaviour | ||
* It should be possible to specify an overall terminationGracePeriod for deferContainers. If the termination sequence completes before the overall grace period expires, the pod should be deleted without waiting further. | ||
* Reduce complexity | ||
* It should be possible to use a generic container as a deferContainer. | ||
* A deferContainer should be independently invokable, i.e. it should not require code in the same image as the appContainers. | ||
* deferContainer images that are not already on the node will be pre-populated while the application is running. | ||
|
||
## Design | ||
The proposed pod spec would look like the following. | ||
|
@@ -107,93 +109,61 @@ pod: | |
- name: defer-container2 | ||
... | ||
``` | ||
Most of the design elements are similar to initContainers such as below | ||
The API will look like the following: | ||
``` | ||
// PodSpec is a description of a pod. | ||
type PodSpec struct { | ||
. | ||
. | ||
. | ||
|
||
InitContainers []Container `json:"initContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,20,rep,name=initContainers"` | ||
// List of containers belonging to the pod. | ||
// Containers cannot currently be added or removed. | ||
// There must be at least one container in a Pod. | ||
// Cannot be updated. | ||
// +patchMergeKey=name | ||
// +patchStrategy=merge | ||
Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"` | ||
// List of termination containers, which are executed during the TerminationGracePeriod of the pod | ||
// +patchMergeKey=name | ||
// +patchStrategy=merge | ||
DeferContainers []Container `json:"deferContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,26,rep,name=deferContainers"` | ||
. | ||
. | ||
. | ||
} | ||
``` | ||
* Will have 0...N containers and will be executed in sequence (specified order) | ||
* Restart policy could be either ‘restartAlways’ or ‘restartNever’ | ||
* The restart policy for deferContainers is ‘onFailure’ | ||
* Adding a new phase ‘Terminating’ in the Pod lifecycle | ||
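The ordering rule above can be illustrated with a minimal sketch. It assumes heavily simplified, hypothetical `Container` and `PodSpec` types (not the real `v1` types) and a hypothetical helper `deferOrder`; it only shows that the proposed `deferContainers` list decodes like the existing container lists and that its order is the execution order.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Container is a simplified, hypothetical stand-in for the real v1.Container.
type Container struct {
	Name    string   `json:"name"`
	Image   string   `json:"image,omitempty"`
	Command []string `json:"command,omitempty"`
}

// PodSpec keeps only the three container lists discussed in this proposal.
type PodSpec struct {
	InitContainers  []Container `json:"initContainers,omitempty"`
	Containers      []Container `json:"containers"`
	DeferContainers []Container `json:"deferContainers,omitempty"`
}

// deferOrder decodes a pod spec and returns the deferContainer names in the
// order they were specified, which is the order they would be executed in.
func deferOrder(raw []byte) ([]string, error) {
	var spec PodSpec
	if err := json.Unmarshal(raw, &spec); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(spec.DeferContainers))
	for _, c := range spec.DeferContainers {
		names = append(names, c.Name)
	}
	return names, nil
}

func main() {
	raw := []byte(`{
		"containers": [{"name": "app", "image": "my-app"}],
		"deferContainers": [
			{"name": "defer-container1", "image": "my-utils"},
			{"name": "defer-container2", "image": "my-utils"}
		]
	}`)
	names, err := deferOrder(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(names)
}
```

A spec without `deferContainers` decodes to an empty list, matching the "0...N containers" rule.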
|
||
### Terminating Phase / Defer Phase | ||
* A POD reaches terminating phase when it is about to be removed from Kubernetes Cluster | ||
* During this phase, the appContainers are not restarted if they get terminated/killed | ||
* deferContainers will be executed one-after-the-other in the same sequence as they were specified in the POD Spec. From the above | ||
pod spec example execution sequence will be defer-Container1, then defer-Container2, …, etc., | ||
* we will either move-on from a failed deferContainer or restart it depending upon its restart policy, the default behavior will be | ||
restartNever | ||
* If a particular deferContainer fails, it will be restarted until it succeeds. | ||
* If the user deletes a pod with `kubectl delete pod foo --grace-period=0 --force`, deferContainers will not be executed. | ||
``` | ||
Example status output when a pod is being terminated. | ||
NAME READY STATUS RESTARTS AGE | ||
foo-0 2/2 Defer:0/4 0 7m | ||
``` | ||
* Failure of one or all deferContainers will not trigger a POD restart. | ||
* If deferContainers are configured Pre-Stop hooks will not be executed. | ||
|
||
### TerminationGracePeriod | ||
* It takes default value (30 seconds as of today), explicitly mentioning this flag overrides the default value. | ||
* To retain backward compatibility PreStopHooks will be started (if configured) for all the containers. | ||
* Then deferContainer will start to execute one after the other (without waiting for preStop hooks to complete) | ||
* Currently, when a Pod's PreStopHook does not finish within the given GracePeriod, the kubelet provides an extension of 2 seconds; we should retain that property for deferContainers too. | ||
* When the configured GracePeriod expires (with the additional 2 second graceperiod), it will kill all the running app containers (if not deleted already) | ||
* Then deferContainer will start to execute one after the other in the specified order | ||
* If a particular deferContainer fails, it will be restarted until it succeeds or the grace period is exhausted. | ||
* When the configured grace period expires, all the containers (appContainers), including the current deferContainer, will be terminated. | ||
* The currently executing deferContainer is killed and no further deferContainers (if any) are executed. | ||
* deferContainers are timebound by TerminationGracePeriod | ||
For some termination scenarios the running time of a deferContainer is not easy to predict ahead of time, for example a file copy or a file upload, which depend on disk speed and internet bandwidth respectively. Depending on the community's feedback, we can consider the approaches below in the future. They would provide a slightly elastic terminationGracePeriod mechanism. | ||
#### Solution 1 `deferContainerGracePeriodExtension: True` (a possible future improvement to deferContainer) | ||
We could introduce a new flag in the podSpec, `deferContainerGracePeriodExtension`. | ||
This would allow deferContainers to run beyond the TerminationGracePeriod if a liveness probe is configured for those containers. | ||
After the TerminationGracePeriod expires, every time the liveness probe succeeds the termination grace period is extended by `Probe.PeriodSeconds`. | ||
``` | ||
60s 60s 1 kubelet, kube-node-3 spec.initContainers{initialization} Normal Pulling pulling image "initConteiner" | ||
55s 55s 1 kubelet, kube-node-3 spec.initContainers{initialization} Normal Started Started container with id 8627e1fa6df3be2b2ff976e5ef46bb06dd768fb06e938b27c3171cf1e0c79932 | ||
50s 50s 1 kubelet, kube-node-3 spec.containers{AppContainer} Normal Started Started container with id a3ff2f84a7ee9a1d8d0aebae9b69096eb189d3390f509506351c1e9a987a0674 | ||
45s 45s 1 kubelet, kube-node-3 spec.deferContainers{Termination} Normal Pulling pulling image "cleanup-container" | ||
40s 40s 1 kubelet, kube-node-3 spec.deferContainers{Termination} Normal Started Started container with id 8627e1fa6df3be2b2ff976e5ef46bb06dd768fb06e938b27c3171cf1e0c79933 | ||
10s 10s 1 kubelet, kube-node-3 spec.deferContainers{Termination} Normal extendGracepriod deferContainer still running gracePeriodExtend for 2 more seconds | ||
8s 8s 1 kubelet, kube-node-3 spec.deferContainers{Termination} Normal extendGracepriod deferContainer still running gracePeriodExtend for 2 more seconds | ||
6s 6s 1 kubelet, kube-node-3 spec.deferContainers{Termination} Normal extendGracepriod deferContainer still running gracePeriodExtend for 2 more seconds | ||
``` | ||
If this flag is set, it is recommended that all deferContainers are configured with a livenessProbe for predictable behaviour. | ||
#### Solution 2 (a possible future improvement to deferContainers) | ||
We could introduce two new flags in the pod spec, `deferContainerGracePeriodExtensionInterval: S seconds` and `deferContainerGracePeriodExtensionCount: N Integer`, so that after the TerminationGracePeriod expires the grace period keeps being extended by 'S' seconds, up to 'N' times. These settings would be common to all deferContainers and simpler for pod designers to configure. | ||
* deferContainers are time bound by TerminationGracePeriod | ||
* If all the deferContainers completed execution well ahead of TerminationGracePeriod, then we should | ||
|
||
In either case, `--force` and `--grace-period=0` must be supplied to delete the pod instantly. | ||
### PrePopulate Heavy deferContainers | ||
Even though it is extremely rare for someone to actually use a heavy image for a deferContainer, there might be some user scenarios for it. Pulling such heavy images might introduce unexpected delay in the termination sequence. A new flag, `prePullDeferImages: true`, will be introduced in the podspec to instruct the kubelet to pull all deferContainer images once the Pod reaches the 'running' state. | ||
### Pre/Post Termination triggers | ||
All deferContainers behave like a preExit trigger; it should be easy to program deferContainers so that they perform both | ||
pre- and post-termination tasks. | ||
```yaml | ||
pod: | ||
spec: | ||
terminationGracePeriod: 60 | ||
initContainers: ... | ||
containers: ... | ||
deferContainers: | ||
# pre-exit operations | ||
- name: remove-shard | ||
image: dbUtils | ||
command: ["/bin/sh", "-c", "remove-shard --name=${POD_NAME}"] | ||
- name: wait-for-rebalance | ||
image: dbUtils | ||
command: ["/bin/sh", "-c", "while true; do sleep 1; if key_rebalance_complete.sh; then exit 0; fi; done"] | ||
- name: kill-db | ||
image: dbUtils | ||
command: ["/bin/sh", "-c", "shutdown-db --name=${POD_NAME}"] | ||
# post-exit operation | ||
- name: clean-update | ||
image: dbUtils | ||
command: ["/bin/sh", "-c", "./disk_cleanup.sh"] | ||
... | ||
``` | ||
### Short running Job / Pod | ||
For PODs that run to completion and exit gracefully on their own, any configured deferContainers act as PostExit triggers. | ||
An internal flag should record whether the deferContainers have already been called. | ||
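One way to picture the "already called" flag is a run-once guard. The sketch below is only an illustration under assumptions: the `podTermination` type and `runDeferOnce` method are hypothetical names, and an in-memory `sync.Once` would not survive a kubelet restart, so a real implementation would need to persist the flag somewhere.

```go
package main

import (
	"fmt"
	"sync"
)

// podTermination is a hypothetical per-pod record that remembers whether the
// deferContainers have already been run for this pod.
type podTermination struct {
	once sync.Once
	runs int // number of times the defer sequence actually executed
}

// runDeferOnce executes run at most once, no matter how many termination
// paths (graceful exit of the pod, explicit delete) call it.
func (p *podTermination) runDeferOnce(run func()) {
	p.once.Do(func() {
		p.runs++
		run()
	})
}

func main() {
	var p podTermination
	p.runDeferOnce(func() { fmt.Println("running deferContainers") }) // pod exited by itself
	p.runDeferOnce(func() { fmt.Println("running deferContainers") }) // later delete request
	fmt.Println("executions:", p.runs)
}
```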
### PrePopulate deferContainers Images | ||
By default, all the deferContainer images will be pulled (if not available) when the POD reaches ‘running’ stage. | ||
|
||
## Implementation Plan | ||
The development and release lifecycle of this feature will follow that of other Kubernetes experimental features. This will be originally | ||
|
@@ -220,7 +190,7 @@ pod: | |
- name: rm-tmpdir | ||
image: my-utils | ||
command: ["/bin/sh", "-c", "./disk_cleanup.sh"] | ||
#contact and inform a thirdparty system about this pods termination, such as reducing a refernce counter (if one is maintained) | ||
# contact and inform a third-party system about this pod's termination, such as decrementing a reference counter (if one is maintained) | ||
- name: ref-counter | ||
image: my-utils | ||
command: ["/bin/sh", "-c", "./decrementRefCount.sh"] | ||
|
@@ -229,7 +199,7 @@ pod: | |
|
||
### Master-slave / Leader-follower statefulset down-size / scale down the replicas | ||
The scripts `selectaMaster.sh` and `reConfSlaves.sh` below should be designed in such a way that even if a terminating pod is a slave it | ||
shoudnt affect the cluster. This will fit controllers such as Statefulset because they gaurentee only one pod goes down at once. | ||
should not affect the cluster. This will fit controllers such as Statefulset because they guarantee only one pod goes down at once. | ||
```yaml | ||
pod: | ||
spec: | ||
|
@@ -270,12 +240,70 @@ pod: | |
... | ||
``` | ||
|
||
## Kubelet Changes | ||
* The images are pre-pulled in SyncPod() as step 7 when the pod phase is ‘Running’. | ||
* killPodWithSyncResult() is blocking; deferContainer execution is implemented inside this function. | ||
* killPod and killPodWithSyncResult need access to pullSecrets and podStatus, so these two values are now propagated to them. | ||
* A new method, `WaitForContainer(containerID string) error`, is added to the ContainerManager interface so that a container can be started and blocked on during termination. | ||
|
||
A simple pseudocode implementation: | ||
```go | ||
func killContainersWithSyncResult() { | ||
runDeferContainers() | ||
|
||
for _, container := range runningPod.Containers { | ||
|
||
// If deferContainers are configured, skip preStopHook() | ||
go killContainer(pod, container.ID, container.Name) | ||
|
||
} | ||
// Wait for all the containers to be killed | ||
} | ||
``` | ||
And runDeferContainers will be implemented as | ||
```go | ||
func runDeferContainers() { | ||
for _, container := range pod.Spec.DeferContainers { | ||
|
||
m.startContainerAndWait(podSandboxID, podSandboxConfig, container) | ||
|
||
// Wait for the container to finish or for time.After(GracePeriod) to fire | ||
|
||
} | ||
} | ||
``` | ||
SyncPod() has a new step 7 that pre-pulls deferContainer images if they are not already available: | ||
```go | ||
func SyncPod() { | ||
|
||
// Step 1: Compute sandbox and container changes. | ||
|
||
// Step 2: Kill the pod if the sandbox has changed. | ||
|
||
// Step 3: kill any running containers in this pod which are not to keep. | ||
|
||
// Step 4: Create a sandbox for the pod if necessary. | ||
|
||
// Step 5: start init containers. | ||
|
||
// Step 6: start containers in podContainerChanges.ContainersToStart. | ||
|
||
// Step 7: If the Pod is in the Running phase, pre-populate deferContainer images. | ||
|
||
if pod.Status.Phase == v1.PodRunning { | ||
|
||
prePullDeferContainerImages() | ||
|
||
} | ||
} | ||
``` | ||
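The decision made in step 7 can be sketched in isolation. The helper below is hypothetical (`imagesToPrePull` is not a kubelet function) and represents the node's image cache as a simple map; it only selects which deferContainer images would still need pulling.

```go
package main

import "fmt"

// imagesToPrePull returns the deferContainer images that should be pulled
// ahead of time. It returns nothing unless the pod phase is "Running", and it
// skips images already present on the node as well as duplicates.
func imagesToPrePull(phase string, deferImages []string, present map[string]bool) []string {
	if phase != "Running" {
		return nil
	}
	seen := make(map[string]bool)
	var missing []string
	for _, img := range deferImages {
		if present[img] || seen[img] {
			continue
		}
		seen[img] = true
		missing = append(missing, img)
	}
	return missing
}

func main() {
	present := map[string]bool{"my-utils": true}
	deferImages := []string{"my-utils", "cleanup-container", "cleanup-container"}
	fmt.Println(imagesToPrePull("Running", deferImages, present))
	fmt.Println(imagesToPrePull("Pending", deferImages, present))
}
```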
|
||
## Caveat | ||
This Design primarily focuses on handling graceful termination cases. If the Node running a deferContainer configured pod | ||
crashes abruptly, then this design does not guarantee that cleanup was performed gracefully. This still requires community | ||
feedback on how such scenarios are handled and how important it is for deferContainers to handle that situation. | ||
|
||
## Reference | ||
[Community Request](https://github.com/kubernetes/kubernetes/issues/35183) | ||
|
||
[Places for hooks](https://github.com/kubernetes/kubernetes/issues/35183) | ||
[WIP PR](https://github.com/kubernetes/kubernetes/pull/47422) | ||
[Use Case Slides](https://docs.google.com/presentation/d/12WEEWQh8ffiLyqh8F60PgRvQn3mfdC2rx3E8biZm3oM/edit?usp=sharing) |