# Defer Containers

[email protected]
March 2017

## Prerequisite
An understanding of [initContainers](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md) is assumed.

## Motivation
Since the introduction of StatefulSet and DaemonSet, PODs that consist of stateful containers are now fully supported by Kubernetes. This proposal introduces ‘deferContainers’: a set of Containers to be executed at the time of POD termination. The defined containers will be executed in the specified order. It will bring destructor() capability to Pods.

## Use Cases
[5-minute use-case presentation video from a recent SIG-Apps meeting](https://youtu.be/s1oZ00_JA00?t=2692)

### Cleanup:
Most stateful workloads require certain cleanup activity before or after the appContainers are terminated, such as removing temporary files from shared volumes.

… disrupt the service intermittently. With deferContainers, this handover or re-election can be performed gracefully during termination.
* Traditional relational databases typically support a reasonable shutdown sequence; for instance, Oracle has four shutdown modes: NORMAL, IMMEDIATE, TRANSACTIONAL and ABORT. ‘deferContainers’ will allow us to program and wait for such complex shutdown scenarios (see the sketch after this list).
* In the future, when kubernetes supports a Virtual Machine runtime (e.g. hyperContainer) for better isolation, we should shut down the VMs instead of killing them abruptly. ‘deferContainers’ could help us run such shutdown commands.
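
As an illustration, a staged database shutdown could be expressed as a sequence of deferContainers. This is only a sketch: the `my-database` and `db-utils` images and the `shutdown-db` and `wait-for-db-down.sh` commands are hypothetical placeholders, not part of this proposal.
```yaml
pod:
  spec:
    terminationGracePeriod: 120
    containers:
    - name: database
      image: my-database
    deferContainers:
    # Attempt a transactional shutdown first; fall back to an
    # immediate shutdown if it does not succeed.
    - name: staged-shutdown
      image: db-utils
      command: ["/bin/sh", "-c", "shutdown-db --mode=TRANSACTIONAL || shutdown-db --mode=IMMEDIATE"]
    # Block until the database reports it is fully down.
    - name: wait-for-shutdown
      image: db-utils
      command: ["/bin/sh", "-c", "wait-for-db-down.sh"]
```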

## Limitations of the current system
The container pre-stop lifecycle hook is not sufficient for all termination cases (see the sketch after this list):
* It is container-specific, not pod-specific.
* It cannot easily coordinate complex termination conditions across containers in multi-container pods.
* It can only run code present in the image or in a shared volume, which would have to be statically linked (not a common pattern in wide use).
* It does not work across a kubelet restart.
* The pod waits for the entire grace period even if the pre-stop hook finishes earlier.
* It will not restart failed termination steps.
* It cannot host complex termination scripts, as there is no logging support.
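
For comparison, today's per-container pre-stop hook looks like the sketch below (`my-app` and `graceful-shutdown.sh` are placeholders). The hook can only run code already present in the app image and has no pod-level ordering, restart, or logging semantics:
```yaml
pod:
  spec:
    containers:
    - name: app
      image: my-app
      lifecycle:
        preStop:
          # Runs inside the app's own image; cannot coordinate with
          # other containers and is not restarted on failure.
          exec:
            command: ["/bin/sh", "-c", "./graceful-shutdown.sh"]
```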

## Design Requirements
Most of the requirements are very similar to those for initContainers. They are replicated and modified as necessary.
* deferContainers should be able to:
  * Use the same volume (PV) as the appContainers, so that they can:
    * Perform cleanup of a shared volume, such as deleting several temp directories.
    * Delete unwanted files before a final sync is initiated.
    * Update configuration files about changes in the distributed system so that the next pod attached to this PV will benefit from them (like a new leader/master, etc.).
    * Delete secrets or security-related files before the pod decouples from the PV.
  * Delay the termination of application containers until operations are complete.
  * De-register the pod with other components of the system.
  * Program a termination sequence for cases where the TerminationGracePeriod is hard to predict beforehand.

* Reduce coupling:
  * Between application images, eliminating the need to customize those images for Kubernetes generally or for specific roles.
  * Inside of images, by specializing which containers perform which tasks (install git into an init container, use the filesystem contents in a web container).
  * Between termination steps, by supporting multiple sequential cleanup containers.
* Run-once and run-forever pods should be able to use deferContainers.
* Pre-Exit
  * deferContainers should act as a pre-exit trigger, invoked when the application is about to be deleted.
* Restart on failure
  * If a certain deferContainer fails during execution, it should be automatically restarted.
* GracePeriod behaviour
  * It should be possible to specify an overall terminationGracePeriod for deferContainers; if the termination sequence completes before the overall grace period, the pod should be deleted without waiting further.
* Reduce complexity:
  * It should be possible to use a generic container as a deferContainer.
  * A deferContainer should be independently invokable, i.e. it should not require code in the same image as the appContainers.
  * deferContainer images that are not already on the node will be pre-populated while the application is executing.

## Design
The proposed pod spec would look like the one below.
```yaml
pod:
  spec:
    terminationGracePeriod: 60
    initContainers: ...
    containers: ...
    deferContainers:
    - name: defer-container1
      ...
    - name: defer-container2
      ...
```
The API will look like this:
```
// PodSpec is a description of a pod.
type PodSpec struct {
.
.
.

InitContainers []Container `json:"initContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,20,rep,name=initContainers"`
// List of containers belonging to the pod.
// Containers cannot currently be added or removed.
// There must be at least one container in a Pod.
// Cannot be updated.
// +patchMergeKey=name
// +patchStrategy=merge
Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
// List of termination containers; these will be executed during the TerminationGracePeriod of the pod.
// +patchMergeKey=name
// +patchStrategy=merge
DeferContainers []Container `json:"deferContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,26,rep,name=deferContainers"`
.
.
.
}
```
* A pod will have 0...N deferContainers, and they will be executed in sequence (in the specified order).
* The restart policy for deferContainers is ‘onFailure’.
* A new phase, ‘Terminating’, is added to the Pod lifecycle.

### Terminating Phase / Defer Phase
* A POD reaches the terminating phase when it is about to be removed from the Kubernetes cluster.
* During this phase, the appContainers are not restarted if they get terminated/killed.
* deferContainers will be executed one after the other, in the same sequence as specified in the POD spec. In the pod spec example above, the execution sequence will be defer-container1, then defer-container2, etc.
* If a particular deferContainer fails, it will be restarted until it succeeds.
* If the user specifies `kubectl delete pod foo --grace-period=0 --force` to delete a pod, deferContainers will not be executed.
Example status output when a pod is being terminated:
```
NAME READY STATUS RESTARTS AGE
foo-0 2/2 Defer:0/4 0 7m
```
* Failure of one or all deferContainers will not trigger a POD restart.
* If deferContainers are configured, pre-stop hooks will not be executed.

### TerminationGracePeriod
* It takes the default value (30 seconds as of today); explicitly specifying this flag overrides the default value.
* deferContainers will start to execute one after the other, in the specified order.
* If a particular deferContainer fails, it will be restarted until it succeeds or the grace period is exhausted.
* When the configured grace period expires, all the containers (appContainers), including the currently executing deferContainer, will be terminated.
* It will kill the currently executing deferContainer, and no further deferContainers will be executed (if there are any).
* deferContainers are time-bound by the TerminationGracePeriod.

For termination scenarios where the running time of a deferContainer is not easy to predict ahead of time, such as a file copy or file upload (which depend on disk speed and network bandwidth respectively), we can consider the possible approaches below in the future, depending on community feedback. These would provide a slightly elastic terminationGracePeriod mechanism.
#### Solution 1: `deferContainerGracePeriodExtension: true` (a possible future improvement to deferContainers)
We could introduce a new flag in the podSpec, `deferContainerGracePeriodExtension`.
This will allow deferContainers to run beyond the TerminationGracePeriod if a liveness probe is configured for those containers.
After the TerminationGracePeriod expires, every time the liveness probe succeeds the termination grace period will be extended by `Probe.PeriodSeconds`.
```
60s   60s   1   kubelet, kube-node-3   spec.initContainers{initialization}   Normal   Pulling             pulling image "initContainer"
55s   55s   1   kubelet, kube-node-3   spec.initContainers{initialization}   Normal   Started             Started container with id 8627e1fa6df3be2b2ff976e5ef46bb06dd768fb06e938b27c3171cf1e0c79932
50s   50s   1   kubelet, kube-node-3   spec.containers{AppContainer}         Normal   Started             Started container with id a3ff2f84a7ee9a1d8d0aebae9b69096eb189d3390f509506351c1e9a987a0674
45s   45s   1   kubelet, kube-node-3   spec.deferContainers{Termination}     Normal   Pulling             pulling image "cleanup-container"
40s   40s   1   kubelet, kube-node-3   spec.deferContainers{Termination}     Normal   Started             Started container with id 8627e1fa6df3be2b2ff976e5ef46bb06dd768fb06e938b27c3171cf1e0c79933
10s   10s   1   kubelet, kube-node-3   spec.deferContainers{Termination}     Normal   extendGracePeriod   deferContainer still running; grace period extended for 2 more seconds
8s    8s    1   kubelet, kube-node-3   spec.deferContainers{Termination}     Normal   extendGracePeriod   deferContainer still running; grace period extended for 2 more seconds
6s    6s    1   kubelet, kube-node-3   spec.deferContainers{Termination}     Normal   extendGracePeriod   deferContainer still running; grace period extended for 2 more seconds
```
If this flag is set, it is recommended that all the deferContainers be configured with a liveness probe for predictable behaviour.
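A sketch of a pod opting in to this proposed flag; `deferContainerGracePeriodExtension` does not exist today, and the `my-utils` image with `upload-results.sh` / `upload-in-progress.sh` scripts is hypothetical:
```yaml
pod:
  spec:
    terminationGracePeriod: 60
    deferContainerGracePeriodExtension: true
    containers: ...
    deferContainers:
    - name: upload-results
      image: my-utils
      command: ["/bin/sh", "-c", "./upload-results.sh"]
      # As long as this probe keeps succeeding after the grace period
      # expires, termination is extended by Probe.PeriodSeconds.
      livenessProbe:
        exec:
          command: ["/bin/sh", "-c", "./upload-in-progress.sh"]
        periodSeconds: 2
```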
#### Solution 2 (a possible future improvement to deferContainers)
We could introduce two new flags, `deferContainerGracePeriodExtensionInterval: S seconds` and `deferContainerGracePeriodExtensionCount: N`, in the pod spec, such that after the TerminationGracePeriod we keep extending the grace period every 'S' seconds, retrying up to 'N' times. This would be common to all deferContainers and simpler for pod designers to reason about.
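Under this alternative, a sketch would look like the following; both flags are proposed here, not existing fields:
```yaml
pod:
  spec:
    terminationGracePeriod: 60
    # After the grace period expires, extend it by 10 seconds at a
    # time, at most 6 times (up to 60 extra seconds in total).
    deferContainerGracePeriodExtensionInterval: 10
    deferContainerGracePeriodExtensionCount: 6
    containers: ...
    deferContainers: ...
```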

In either case, to delete the pod instantly, `--force` and `--grace-period=0` should be supplied.
### PrePopulate Heavy deferContainers
Even though it is extremely rare for someone to actually use a heavy image for a deferContainer, there might be some user scenarios for this. Pulling such heavy images might introduce an unexpected delay in the termination sequence. A new flag, `prePullDeferImages: true`, will be introduced in the podSpec that instructs kubelet to pull all the deferContainer images once the Pod reaches the 'running' state.
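A sketch of the proposed flag in use; `heavy-cleanup-image` is a placeholder for a large image:
```yaml
pod:
  spec:
    # Ask kubelet to pull all deferContainer images as soon as the
    # Pod reaches the 'running' state, so that termination is not
    # delayed by an image pull.
    prePullDeferImages: true
    containers: ...
    deferContainers:
    - name: heavy-cleanup
      image: heavy-cleanup-image
      ...
```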
### Pre/Post Termination triggers
All the deferContainers behave like a pre-exit trigger; it should be easy to program deferContainers in such a way that they perform both pre- and post-termination tasks.
```yaml
pod:
  spec:
    terminationGracePeriod: 60
    initContainers: ...
    containers: ...
    deferContainers:
    #pre-exit operations
    - name: remove-shard
      image: dbUtils
      command: ["/bin/sh", "-c", "remove-shard --name=${POD_NAME}"]
    - name: wait-for-rebalance
      image: dbUtils
      command: ["/bin/sh", "-c", "while true; do sleep 1; if key_rebalance_complete.sh; then exit 0; fi; done"]
    - name: kill-db
      image: dbUtils
      command: ["/bin/sh", "-c", "shutdown-db --name=${POD_NAME}"]
    #post-exit operation
    - name: clean-update
      image: dbUtils
      command: ["/bin/sh", "-c", "./disk_cleanup.sh"]
    ...
```
### Short running Job / Pod
For PODs which run and exit gracefully by themselves, configured deferContainers will act as post-exit triggers; an internal flag should indicate whether deferContainers have already been invoked.
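For example, a run-once pod might look like the sketch below; when `batch-task` exits on its own, the deferContainers run once as post-exit triggers (`run-batch.sh` and `report-done.sh` are hypothetical):
```yaml
pod:
  spec:
    restartPolicy: Never
    containers:
    - name: batch-task
      image: my-utils
      command: ["/bin/sh", "-c", "./run-batch.sh"]
    deferContainers:
    # Runs after the main container exits by itself; the internal
    # flag prevents it from being invoked again during pod deletion.
    - name: report-done
      image: my-utils
      command: ["/bin/sh", "-c", "./report-done.sh"]
```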

## Implementation Plan
Development and release lifecycle of this feature will follow other Kubernetes experimental features. This will be originally…

```yaml
pod:
  spec:
    ...
    deferContainers:
    - name: rm-tmpdir
      image: my-utils
      command: ["/bin/sh", "-c", "./disk_cleanup.sh"]
    #contact and inform a thirdparty system about this pod's termination, such as reducing a reference counter (if one is maintained)
    - name: ref-counter
      image: my-utils
      command: ["/bin/sh", "-c", "./decrementRefCount.sh"]
    ...
```

### Master-slave / Leader-follower StatefulSet down-size / scale-down of replicas
The scripts `selectaMaster.sh` and `reConfSlaves.sh` below should be designed in such a way that even if the terminating pod is a slave it should not affect the cluster. This fits controllers such as StatefulSet because they guarantee that only one pod goes down at a time.
```yaml
pod:
  spec:
    ...
```

## Kubelet Changes
* The images are pre-pulled in SyncPod() as a new step 7 when the pod phase is ‘Running’.
* killPodWithSyncResult() is blocking; deferContainers execution is implemented inside this function.
* killPod and killPodWithSyncResult need access to pullSecrets and podStatus, so these two have been propagated.
* A new method, WaitForContainer(containerID string) error, is added to the ContainerManager interface so that we can start a container and block on it during termination.

A simple pseudo-code implementation:
```go
func killContainersWithSyncResult() {
	runDeferContainers()

	for _, container := range runningPod.Containers {
		// If deferContainers are configured, skip preStopHook().
		go killContainer(pod, container.ID, container.Name)
	}

	// Wait for all the containers to be killed.
}
```
And runDeferContainers would be implemented as:
```go
func runDeferContainers() {
	for _, container := range pod.Spec.DeferContainers {
		// Start the container and block until it finishes,
		// or until time.After(GracePeriod) fires.
		m.startContainerAndWait(podSandboxID, podSandboxConfig, container)
	}
}
```
SyncPod gets a new step 7 to pre-pull deferContainer images if they are not available:
```go
func SyncPod() {
	// Step 1: Compute sandbox and container changes.
	// Step 2: Kill the pod if the sandbox has changed.
	// Step 3: Kill any running containers in this pod which are not to keep.
	// Step 4: Create a sandbox for the pod if necessary.
	// Step 5: Start init containers.
	// Step 6: Start containers in podContainerChanges.ContainersToStart.

	// Step 7: If the pod is in the running phase, pre-populate deferContainer images.
	if pod.Status.Phase == v1.PodRunning {
		prePullDeferContainerImages()
	}
}
```

## Caveat
This design primarily focuses on handling graceful termination cases. If the node running a pod with deferContainers configured crashes abruptly, this design does not guarantee that cleanup is performed gracefully. This still requires community feedback on how such scenarios should be handled and how important it is for deferContainers to handle that situation.

## Reference
[Community Request: Places for hooks](https://github.com/kubernetes/kubernetes/issues/35183)

[WIP PR](https://github.com/kubernetes/kubernetes/pull/47422)

[UseCase Slides](https://docs.google.com/presentation/d/12WEEWQh8ffiLyqh8F60PgRvQn3mfdC2rx3E8biZm3oM/edit?usp=sharing)
