This repository has been archived by the owner on Dec 5, 2017. It is now read-only.

loss of data in k8s persistent volumes when used in k8sm setup #798

Open
ravilr opened this issue Mar 30, 2016 · 1 comment

ravilr commented Mar 30, 2016

@jdef
The k8s kubelet sets up volume mounts in a directory configured by the --root-dir flag and bind-mounts them into docker containers. The kubelet also runs Kubelet.cleanupOrphanedVolumes() in its sync loop to clean up/unmount any volume mounts left behind on the kubelet host by killed/finished pods.
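
For context, here is a minimal Go sketch (not the actual kubelet code; the helper name is illustrative) of the per-pod volume layout the kubelet creates under --root-dir and then bind-mounts into containers:

```go
// Sketch only: illustrates the directory layout the kubelet uses under
// --root-dir for pod volume mounts. The helper name is illustrative; the
// real logic lives in the kubelet itself.
package main

import (
	"fmt"
	"path/filepath"
)

// podVolumeDir returns <rootDir>/pods/<podUID>/volumes/<pluginName>/<volumeName>,
// the directory the kubelet bind-mounts into the pod's containers.
func podVolumeDir(rootDir, podUID, pluginName, volumeName string) string {
	return filepath.Join(rootDir, "pods", podUID, "volumes", pluginName, volumeName)
}

func main() {
	// With the default --root-dir=/var/lib/kubelet this prints:
	// /var/lib/kubelet/pods/1234-abcd/volumes/kubernetes.io~nfs/my-pv
	fmt.Println(podVolumeDir("/var/lib/kubelet", "1234-abcd", "kubernetes.io~nfs", "my-pv"))
}
```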

In the case of k8sm, the kubelet's RootDirectory is set to the Mesos executor's sandbox dir. This is overridable using the scheduler's --kubelet-root-dir flag, but that doesn't work because the executor uses the same dir to set up its static-pod-config dir, and the executor fails to come up with an error if it finds an already-existing static-pods dir.
The issue we are seeing with the kubelet executor using the sandbox dir itself as the kubelet root-dir is this: whenever the executor id (or slave id) changes due to an executor restart, slave restart, or framework upgrade, the kubelet doesn't get a chance to properly clean up the orphaned volumes mounted on the host. In the case of persistent volumes, the kubelet's pod volume dirs would still point at mounted filesystems. Then the Mesos slave's gc_delay setting kicks in and tries to clean up the old executors' sandbox dirs, which leads to rm'ing the persistent volume dirs. The end result: all data backed by the persistent volumes is gone.
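
To illustrate the failure mode, here is a Linux-only Go sketch (requires root; a local bind mount stands in for the NFS-backed persistent volume, and the temp dirs are stand-ins, not real paths) showing how recursively removing a sandbox that still contains a live mount deletes the backing data:

```go
// Illustration only (Linux, requires root): shows why GC'ing a sandbox that
// still contains a live bind mount destroys the backing data. A local bind
// mount stands in for the NFS-backed persistent volume.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	backing, _ := os.MkdirTemp("", "pv-backing") // stands in for the NFS export
	sandbox, _ := os.MkdirTemp("", "sandbox")    // stands in for the executor sandbox
	os.WriteFile(filepath.Join(backing, "data.txt"), []byte("precious"), 0o644)

	mnt := filepath.Join(sandbox, "pods", "uid", "volumes", "kubernetes.io~nfs", "pv")
	os.MkdirAll(mnt, 0o755)
	if err := syscall.Mount(backing, mnt, "", syscall.MS_BIND, ""); err != nil {
		fmt.Println("mount failed (run as root):", err)
		return
	}

	// Simulate the Mesos slave GC removing the old sandbox. RemoveAll recurses
	// *into* the still-mounted volume and deletes the backing files; it only
	// errors when it finally tries to remove the busy mountpoint itself.
	err := os.RemoveAll(sandbox)
	fmt.Println("RemoveAll error:", err)

	_, statErr := os.Stat(filepath.Join(backing, "data.txt"))
	fmt.Println("backing data.txt still exists?", statErr == nil) // false: the data is gone
	syscall.Unmount(mnt, 0)
}
```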

I think the static-pods dir should use the Mesos sandbox dir instead of kubelet.RootDirectory; then one could set --kubelet-root-dir to a static path on the slave host. There is still no guarantee that a slave gets assigned a kubelet executor task again, which means the kubelet volume dirs might be left mounted forever. But at least they won't be deleted inadvertently by the mesos-slave GC.

We are experiencing this in our k8sm cluster, which uses NFS-backed persistent volumes.


jdef commented Apr 14, 2016

First thoughts about reasonable defaults (because executors really shouldn't write anything outside of their container):

rootDir={sandbox}/root
staticPods={sandbox}/static
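
A minimal sketch of how an executor might derive such defaults, assuming the MESOS_SANDBOX environment variable that the Mesos slave sets for executors (helper names are illustrative, not existing code):

```go
// Sketch: derive the kubelet rootDir and static-pods dir as distinct
// subdirectories of the executor's sandbox. Names are illustrative.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func defaultDirs() (rootDir, staticPodsDir string) {
	sandbox := os.Getenv("MESOS_SANDBOX") // set by the Mesos slave for executors
	if sandbox == "" {
		sandbox, _ = os.Getwd() // executors start in their sandbox by default
	}
	return filepath.Join(sandbox, "root"), filepath.Join(sandbox, "static")
}

func main() {
	root, static := defaultDirs()
	fmt.Println("rootDir:   ", root)
	fmt.Println("staticPods:", static)
}
```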

And then if you want to override rootDir to point to some location on the host, outside of the sandbox, you could do that. Although I'm not convinced that's a great idea, I can certainly sympathize with data loss! Running an executor this way (with rootDir outside the sandbox) is prone to mount resource leaks, as you've pointed out, among other problems: there may be old pod directories that are never cleaned up (and those may also contain mounts). How would we ever, responsibly, GC these? On executor startup (which might not happen for a while, depending on offers)?
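
Purely as a sketch of what such a GC pass could look like (hypothetical helper, simplified mount detection and pod-liveness check, not a proposal for the actual implementation):

```go
// Sketch only: a hypothetical GC pass over a host-level rootDir that unmounts
// and removes pod directories no longer owned by a live pod. Error handling,
// mount detection, and "is this pod live?" are deliberately simplified.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// gcOrphanedPodDirs unmounts any volume dirs under rootDir/pods/<uid>/volumes
// for pods not in livePods, then removes the pod directory.
func gcOrphanedPodDirs(rootDir string, livePods map[string]bool) error {
	podsDir := filepath.Join(rootDir, "pods")
	entries, err := os.ReadDir(podsDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if livePods[e.Name()] {
			continue
		}
		podDir := filepath.Join(podsDir, e.Name())
		// Unmount every volume dir first so RemoveAll cannot reach backing data.
		vols, _ := filepath.Glob(filepath.Join(podDir, "volumes", "*", "*"))
		for _, v := range vols {
			_ = syscall.Unmount(v, 0) // ignore "not mounted" errors in this sketch
		}
		if err := os.RemoveAll(podDir); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Hypothetical host-level root dir; live pod UIDs would come from the
	// kubelet's current pod list.
	if err := gcOrphanedPodDirs("/var/lib/kubelet-mesos", map[string]bool{"live-pod-uid": true}); err != nil {
		fmt.Println("gc:", err)
	}
}
```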

We're actually thinking through a related problem right now with respect to https://issues.apache.org/jira/browse/MESOS-5013: what's the best way to GC external volume mount points in a way that's compatible with Mesos slave recovery?

A better solution might come in the form of a custom k8sm runtime implementation for kubelet that allows kubelet to properly contain the pod containers that it launches: kubelet could run in its own mount namespace, and pods would be realized in containers that inherit the requisite rootDir volume mounts from the kubelet's mount namespace. This is non-trivial.

Another solution might be to write a custom Mesos isolator module that adds GC for volume mounts created within a kubelet-executor container. This is also non-trivial.
