Skip to content

Commit

Permalink
Merge pull request #1222 from garden-io/fix-nfs-failures
Browse files Browse the repository at this point in the history
fix(k8s): avoid issues with NFS provisioner on node/pod eviction
  • Loading branch information
eysi09 authored Oct 11, 2019
2 parents 698af0b + f745619 commit db365b2
Show file tree
Hide file tree
Showing 26 changed files with 795 additions and 41 deletions.
15 changes: 10 additions & 5 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -211,10 +211,11 @@ jobs:
command: ./garden-service/bin/garden init --root garden-service/test/e2e --logger-type basic
- run:
name: Run e2e tests
command: ./garden-service/bin/garden test e2e-tests --root garden-service/test/e2e --logger-type basic
# overriding CIRCLE_BUILD_NUM to avoid conflict with other tests
command: CIRCLE_BUILD_NUM=$CIRCLE_BUILD_NUM-e2e ./garden-service/bin/garden test e2e-tests --root garden-service/test/e2e --logger-type basic
- run:
name: Cleanup
command: kubectl delete --wait=false $(kubectl get ns -o name | grep $CIRCLE_BUILD_NUM) || true
command: CIRCLE_BUILD_NUM=$CIRCLE_BUILD_NUM-e2e kubectl delete --wait=false $(kubectl get ns -o name | grep testing-$CIRCLE_BUILD_NUM) || true
when: always
test-dashboard:
<<: *node-config
Expand Down Expand Up @@ -312,10 +313,11 @@ jobs:
- run: sudo apt-get update && sudo apt-get -y install rsync
- run:
name: Deploy demo-project with binary
command: garden-service/dist/linux-amd64/garden deploy --root examples/demo-project --env testing --logger-type basic
# overriding CIRCLE_BUILD_NUM to avoid conflict with other tests
command: CIRCLE_BUILD_NUM=$CIRCLE_BUILD_NUM-dist garden-service/dist/linux-amd64/garden deploy --root examples/demo-project --env testing --logger-type basic
- run:
name: Cleanup
command: kubectl delete --wait=false $(kubectl get ns -o name | grep $CIRCLE_BUILD_NUM) || true
command: CIRCLE_BUILD_NUM=$CIRCLE_BUILD_NUM-dist kubectl delete --wait=false $(kubectl get ns -o name | grep testing-$CIRCLE_BUILD_NUM) || true
when: always
test-minikube:
machine:
Expand Down Expand Up @@ -375,12 +377,15 @@ jobs:
gcloud --quiet config set compute/zone $env:GOOGLE_COMPUTE_ZONE
gcloud --quiet container clusters get-credentials $env:GOOGLE_CLUSTER_ID --zone $env:GOOGLE_COMPUTE_ZONE
gcloud --quiet auth configure-docker
- run:
name: Override build ID
command: $env:CIRCLE_BUILD_NUM = "$env:CIRCLE_BUILD_NUM-win"
- run:
name: Deploy demo-project
command: .\garden-service\dist\windows-amd64\garden.exe deploy --root .\examples\demo-project\ --logger-type basic --env testing
- run:
name: Cleanup
command: (kubectl delete --wait=false $(kubectl get ns -o name | grep $env:CIRCLE_BUILD_NUM)) -or $true
command: (kubectl delete namespace --wait=false demo-project-testing-$env:CIRCLE_BUILD_NUM demo-project-testing-$env:CIRCLE_BUILD_NUM--metadata) -or $true
when: always

workflows:
Expand Down
33 changes: 29 additions & 4 deletions docs/reference/providers/kubernetes.md
Original file line number Diff line number Diff line change
Expand Up @@ -536,9 +536,9 @@ Storage parameters to set for the in-cluster builder, container registry and cod
These are all shared cluster-wide across all users and builds, so they should be resourced accordingly,
factoring in how many concurrent builds you expect and how large your images and build contexts tend to be.

| Type | Required | Default |
| -------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `object` | No | `{"builder":{"size":20480,"storageClass":null},"registry":{"size":20480,"storageClass":null},"sync":{"size":10240,"storageClass":null}}` |
| Type | Required | Default |
| -------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `object` | No | `{"builder":{"size":20480,"storageClass":null},"nfs":{"storageClass":null},"registry":{"size":20480,"storageClass":null},"sync":{"size":10240,"storageClass":null}}` |

### `providers[].storage.builder`

Expand Down Expand Up @@ -572,6 +572,29 @@ Storage class to use for the volume.
| -------- | -------- | ------- |
| `string` | No | `null` |

### `providers[].storage.nfs`

[providers](#providers) > [storage](#providersstorage) > nfs

Storage parameters for the NFS provisioner, which we automatically create for the sync volume, _unless_
you specify a `storageClass` for the sync volume. See the below `sync` parameter for more.

Only applies when `buildMode` is set to `cluster-docker` or `kaniko`, ignored otherwise.

| Type | Required | Default |
| -------- | -------- | ----------------------- |
| `object` | No | `{"storageClass":null}` |

### `providers[].storage.nfs.storageClass`

[providers](#providers) > [storage](#providersstorage) > [nfs](#providersstoragenfs) > storageClass

Storage class to use as backing storage for NFS .

| Type | Required | Default |
| -------- | -------- | ------- |
| `string` | No | `null` |

### `providers[].storage.registry`

[providers](#providers) > [storage](#providersstorage) > registry
Expand Down Expand Up @@ -613,7 +636,7 @@ Storage parameters for the code sync volume, which build contexts are synced to
in-cluster builds.

Important: The storage class configured here has to support _ReadWriteMany_ access.
If you don't specify a storage class, Garden creates an NFS provisioner and provisions an ephemeral
If you don't specify a storage class, Garden creates an NFS provisioner and provisions an
NFS volume for the sync data volume.

Only applies when `buildMode` is set to `cluster-docker` or `kaniko`, ignored otherwise.
Expand Down Expand Up @@ -936,6 +959,8 @@ providers:
builder:
size: 20480
storageClass: null
nfs:
storageClass: null
registry:
size: 20480
storageClass: null
Expand Down
33 changes: 29 additions & 4 deletions docs/reference/providers/local-kubernetes.md
Original file line number Diff line number Diff line change
Expand Up @@ -536,9 +536,9 @@ Storage parameters to set for the in-cluster builder, container registry and cod
These are all shared cluster-wide across all users and builds, so they should be resourced accordingly,
factoring in how many concurrent builds you expect and how large your images and build contexts tend to be.

| Type | Required | Default |
| -------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `object` | No | `{"builder":{"size":20480,"storageClass":null},"registry":{"size":20480,"storageClass":null},"sync":{"size":10240,"storageClass":null}}` |
| Type | Required | Default |
| -------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `object` | No | `{"builder":{"size":20480,"storageClass":null},"nfs":{"storageClass":null},"registry":{"size":20480,"storageClass":null},"sync":{"size":10240,"storageClass":null}}` |

### `providers[].storage.builder`

Expand Down Expand Up @@ -572,6 +572,29 @@ Storage class to use for the volume.
| -------- | -------- | ------- |
| `string` | No | `null` |

### `providers[].storage.nfs`

[providers](#providers) > [storage](#providersstorage) > nfs

Storage parameters for the NFS provisioner, which we automatically create for the sync volume, _unless_
you specify a `storageClass` for the sync volume. See the below `sync` parameter for more.

Only applies when `buildMode` is set to `cluster-docker` or `kaniko`, ignored otherwise.

| Type | Required | Default |
| -------- | -------- | ----------------------- |
| `object` | No | `{"storageClass":null}` |

### `providers[].storage.nfs.storageClass`

[providers](#providers) > [storage](#providersstorage) > [nfs](#providersstoragenfs) > storageClass

Storage class to use as backing storage for NFS .

| Type | Required | Default |
| -------- | -------- | ------- |
| `string` | No | `null` |

### `providers[].storage.registry`

[providers](#providers) > [storage](#providersstorage) > registry
Expand Down Expand Up @@ -613,7 +636,7 @@ Storage parameters for the code sync volume, which build contexts are synced to
in-cluster builds.

Important: The storage class configured here has to support _ReadWriteMany_ access.
If you don't specify a storage class, Garden creates an NFS provisioner and provisions an ephemeral
If you don't specify a storage class, Garden creates an NFS provisioner and provisions an
NFS volume for the sync data volume.

Only applies when `buildMode` is set to `cluster-docker` or `kaniko`, ignored otherwise.
Expand Down Expand Up @@ -837,6 +860,8 @@ providers:
builder:
size: 20480
storageClass: null
nfs:
storageClass: null
registry:
size: 20480
storageClass: null
Expand Down
15 changes: 15 additions & 0 deletions garden-service/src/plugins/kubernetes/commands/cluster-init.ts
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
import { PluginCommand } from "../../../types/plugin/command"
import { prepareSystem, getEnvironmentStatus } from "../init"
import chalk from "chalk"
import { helm } from "../helm/helm-cli"
import { KubernetesPluginContext } from "../config"

export const clusterInit: PluginCommand = {
name: "cluster-init",
Expand All @@ -34,6 +36,19 @@ export const clusterInit: PluginCommand = {
})
}

const k8sCtx = ctx as KubernetesPluginContext

log.info("Cleaning up old resources...")

try {
await helm({
ctx: k8sCtx,
log,
namespace: "garden-system",
args: ["delete", "--purge", "garden-nfs-provisioner"],
})
} catch (_) { }

log.info(chalk.green("\nDone!"))

return { result }
Expand Down
24 changes: 21 additions & 3 deletions garden-service/src/plugins/kubernetes/config.ts
Original file line number Diff line number Diff line change
Expand Up @@ -45,12 +45,13 @@ interface KubernetesResources {
}

interface KubernetesStorageSpec {
size: number
size?: number
storageClass: string | null
}

interface KubernetesStorage {
builder: KubernetesStorageSpec
nfs: KubernetesStorageSpec
registry: KubernetesStorageSpec
sync: KubernetesStorageSpec
}
Expand Down Expand Up @@ -125,6 +126,9 @@ export const defaultStorage: KubernetesStorage = {
size: 20 * 1024,
storageClass: null,
},
nfs: {
storageClass: null,
},
registry: {
size: 20 * 1024,
storageClass: null,
Expand Down Expand Up @@ -176,7 +180,7 @@ const storageSchema = (defaults: KubernetesStorageSpec) => joi.object()
.description("Volume size in megabytes."),
storageClass: joi.string()
.allow(null)
.default(null)
.default(defaults.storageClass)
.description("Storage class to use for the volume."),
})
.default(defaults)
Expand Down Expand Up @@ -312,6 +316,20 @@ export const kubernetesConfigBase = providerConfigBaseSchema
Only applies when \`buildMode\` is set to \`cluster-docker\`, ignored otherwise.
`),
nfs: joi.object()
.keys({
storageClass: joi.string()
.allow(null)
.default(null)
.description("Storage class to use as backing storage for NFS ."),
})
.default({ storageClass: null })
.description(dedent`
Storage parameters for the NFS provisioner, which we automatically create for the sync volume, _unless_
you specify a \`storageClass\` for the sync volume. See the below \`sync\` parameter for more.
Only applies when \`buildMode\` is set to \`cluster-docker\` or \`kaniko\`, ignored otherwise.
`),
registry: storageSchema(defaultStorage.registry)
.description(dedent`
Storage parameters for the in-cluster Docker registry volume. Built images are stored here, so that they
Expand All @@ -325,7 +343,7 @@ export const kubernetesConfigBase = providerConfigBaseSchema
in-cluster builds.
Important: The storage class configured here has to support _ReadWriteMany_ access.
If you don't specify a storage class, Garden creates an NFS provisioner and provisions an ephemeral
If you don't specify a storage class, Garden creates an NFS provisioner and provisions an
NFS volume for the sync data volume.
Only applies when \`buildMode\` is set to \`cluster-docker\` or \`kaniko\`, ignored otherwise.
Expand Down
47 changes: 34 additions & 13 deletions garden-service/src/plugins/kubernetes/init.ts
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@ import chalk from "chalk"
import { deline } from "../../util/string"
import { combineStates, ServiceStatusMap } from "../../types/service"

const nfsStorageClass = "garden-system-nfs"
// Note: We need to increment a version number here if we ever make breaking changes to the NFS provisioner StatefulSet
const nfsStorageClass = "garden-system-nfs-v2"

/**
* Performs the following actions to check environment status:
Expand Down Expand Up @@ -162,7 +163,11 @@ export async function prepareSystem(
// in the prepareEnvironment handler, instead of flagging as not ready here. This avoids blocking users where
// there's variance in configuration between users of the same cluster, that often doesn't affect usage.
if (!clusterInit && remoteCluster) {
if (combinedState === "outdated" && !serviceStates.includes("missing")) {
if (
combinedState === "outdated" &&
!serviceStates.includes("missing") &&
!(ctx.command && ctx.command.name === "plugins" && ctx.command.args.command === "cluster-init")
) {
log.warn({
symbol: "warning",
msg: chalk.yellow(deline`
Expand All @@ -178,12 +183,6 @@ export async function prepareSystem(
// We require manual init if we're installing any system services to remote clusters, to avoid conflicts
// between users or unnecessary work.
if (!clusterInit && remoteCluster && !systemReady) {
// Special-case so that this doesn't error when attempting to run the cluster init
const initCommandName = `plugins ${ctx.provider.name} cluster-init`
if (ctx.command && ctx.command.name === initCommandName) {
return {}
}

throw new KubernetesError(deline`
One or more cluster-wide system services are missing or not ready. You need to run
\`garden --env=${ctx.environmentName} plugins kubernetes cluster-init\`
Expand All @@ -198,16 +197,31 @@ export async function prepareSystem(
const sysProvider = await sysGarden.resolveProvider(k8sCtx.provider.name)
const sysCtx = <KubernetesPluginContext>await sysGarden.getPluginContext(sysProvider)

await sysGarden.clearBuilds()

await installTiller({ ctx: sysCtx, provider: sysCtx.provider, log, force })

// We need to install the NFS provisioner separately, so that we can optionally install it
// FIXME: when we've added an `enabled` field, we should get rid of this special case
if (systemServiceNames.includes("nfs-provisioner")) {
await prepareSystemServices({
log,
sysGarden,
namespace: systemNamespace,
force,
ctx: k8sCtx,
serviceNames: ["nfs-provisioner"],
})
}

// Install system services
await prepareSystemServices({
log,
sysGarden,
namespace: systemNamespace,
force,
ctx: k8sCtx,
serviceNames: systemServiceNames,
serviceNames: systemServiceNames.filter(name => name !== "nfs-provisioner"),
})

sysGarden.log.setSuccess()
Expand All @@ -231,6 +245,8 @@ export async function cleanupEnvironment({ ctx, log }: CleanupEnvironmentParams)
}

export function getKubernetesSystemVariables(config: KubernetesConfig) {
const syncStorageClass = config.storage.sync.storageClass || nfsStorageClass

return {
"namespace": systemNamespace,

Expand All @@ -241,22 +257,27 @@ export function getKubernetesSystemVariables(config: KubernetesConfig) {
"builder-limits-memory": megabytesToString(config.resources.builder.limits.memory),
"builder-requests-cpu": millicpuToString(config.resources.builder.requests.cpu),
"builder-requests-memory": megabytesToString(config.resources.builder.requests.memory),
"builder-storage-size": megabytesToString(config.storage.builder.size),
"builder-storage-size": megabytesToString(config.storage.builder.size!),
"builder-storage-class": config.storage.builder.storageClass,

// We only use NFS for the build-sync volume, so we allocate the space we need for that plus 1GB for margin.
"nfs-storage-size": megabytesToString(config.storage.sync.size! + 1024),
"nfs-storage-class": config.storage.nfs.storageClass,

"registry-limits-cpu": millicpuToString(config.resources.registry.limits.cpu),
"registry-limits-memory": megabytesToString(config.resources.registry.limits.memory),
"registry-requests-cpu": millicpuToString(config.resources.registry.requests.cpu),
"registry-requests-memory": megabytesToString(config.resources.registry.requests.memory),
"registry-storage-size": megabytesToString(config.storage.registry.size),
"registry-storage-size": megabytesToString(config.storage.registry.size!),
"registry-storage-class": config.storage.registry.storageClass,

"sync-limits-cpu": millicpuToString(config.resources.sync.limits.cpu),
"sync-limits-memory": megabytesToString(config.resources.sync.limits.memory),
"sync-requests-cpu": millicpuToString(config.resources.sync.requests.cpu),
"sync-requests-memory": megabytesToString(config.resources.sync.requests.memory),
"sync-storage-size": megabytesToString(config.storage.sync.size),
"sync-storage-class": config.storage.sync.storageClass || nfsStorageClass,
"sync-storage-size": megabytesToString(config.storage.sync.size!),
"sync-storage-class": syncStorageClass,
"sync-volume-name": `garden-sync-${syncStorageClass}`,
}
}

Expand Down
8 changes: 6 additions & 2 deletions garden-service/static/kubernetes/system/build-sync/garden.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,10 @@ type: helm
name: build-sync
description: Sync service for receiving build context ahead of in-cluster builds
releaseName: garden-build-sync
dependencies:
- nfs-provisioner
# Note: We special-case this dependency in the cluster-init procedure,
# to be fixed when the `enabled` field is implemented
# dependencies:
# - nfs-provisioner
values:
resources:
limits:
Expand All @@ -16,3 +18,5 @@ values:
storage:
request: ${var.sync-storage-size}
storageClass: ${var.sync-storage-class}
pvc:
name: ${var.sync-volume-name}
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ spec:
volumes:
- name: garden-build-sync
persistentVolumeClaim:
claimName: garden-build-sync
claimName: {{ .Values.pvc.name }}
containers:
- name: sync
image: "gardendev/rsync:0.1"
Expand Down
Loading

0 comments on commit db365b2

Please sign in to comment.