Transforming BOSH concepts to Kubernetes

Open Questions

  1. How do we rename things going from one version to the next?

Missing Features

  1. Canary support in QuarksStatefulSets
  2. Missing support for the allow_executions flag in bpm configs

High-level Direction

  • releases are defined in the usual way (a releases block), but the information given is used to build a reference for a docker image
  • each instance group is transformed to a QuarksStatefulSet or a QuarksJob
  • each BOSH Job corresponds to one or more containers in the Pod template defined in the QuarksStatefulSet or the QuarksJob; there's one container for each process defined in the BPM information of each BOSH Job
  • "explicit" variables are generated using QuarksSecrets
  • for rendering of BOSH Job Templates, please read this document
  • we have a concept of Desired Manifests
  • all communication happens through Kubernetes Services, which have deterministic DNS Addresses; you can read more about these here

Deployment Lifecycle

Please read the documentation for the BOSHDeployment controller.

Example Deployment Manifest Conversion Details

---
# The name of the deployment. Replace the name with the name of the BOSHDeployment resource
# It's used to namespace resources created for this deployment.
# Based on docs [1], names should be less than 253 characters. We should limit this to
# characters in the operator, to make sure that with any suffix, we won't go beyond the limit.
name: "foo"
# Not used by the cf-operator.
# A warning is printed in the logs if this is present.
director_uuid: "bar"
# A hash of director features. We could use this to control operator features as well.
features:
  # Enable variables to be regenerated by the config server (e.g. CredHub) when the variable options change. Default false.
  # In the cf-operator, if a QuarksSecret is changed (e.g. a new domain is added to a cert),
  # the value will be automatically updated.
  # The operator won't be able to control this behavior.
  # A warning is printed in the logs if this is present.
  converge_variables: true
  # Randomizes AZs for left over instances that cannot be distributed equally between AZs.
  # Not currently used. It's likely that we'll be able to support this.
  randomize_az_placement: false
  # Enables or disables returning of DNS addresses in links.
  # In Kubernetes we always use DNS addresses.
  # An error should be returned if this value is set to false.
  use_dns_addresses: true
# A list of all releases used in this deployment.
# Required.
# Each release's image reference is constructed from this information like this:
# <url>/<name>:<stemcell.os>-<stemcell.version>-<version>
releases:
  # Name of a release used in the deployment.
- name: "capi-release"
  # The version of the release to be used.
  # "latest" is not supported by the cf-operator. An error is thrown if "latest" is used.
  version: "1.0"
  # Required for the operator. Link to the registry and organization containing the image.
  url: "docker.io/cloudfoundry"
  # Not used by the cf-operator.
  # Integrity of the image itself is handled by the
  # container runtime and the image registry.
  sha1: "332ac15609b220a3fdf5efad0e0aa069d8235788"
  # Required by the operator
  stemcell:
    # OS of the stemcell used by the release. Used to construct the image name.
    os: "opensuse"
    # Version of the OS of the stemcell used by the release.
    version: "42.3"
  # Only used by the cf-operator.
  # A secret is created with the credentials [2], used by the pods
  # that reference this release.
  credentials:
    username: "foo"
    password: "secret"
# Not used by the cf-operator.
# A warning is logged if this is set
stemcells: []
# Specifies how updates are handled
# The cf-operator uses some of these settings.
update:
  # The number of pods to deploy in the new version of a QuarksStatefulSet.
  # Once canaries are running, deployment can continue.
  # TODO: Support for canaries needs implementation in QuarksStatefulSet.
  canaries: 2
  # Time to wait for canary pods to be ready in a new version of a QuarksStatefulSet
  canary_watch_time: 100
  # The maximum number of non-canary instances to update in parallel for a QuarksStatefulSet.
  # TODO: Support for this needs to be implemented in the controller.
  max_in_flight: 2
  # TODO: is there a need for this in QuarksStatefulSet (in a readiness Probe?)
  update_watch_time: 0
  # Not used in cf-operator.
  # All instance groups are deployed at the same time.
  # If set to true, a warning is logged.
  serial: false
  # Not used in cf-operator.
  # If set, a warning is logged.
  vm_strategy: ""
# Each instance group is converted into a QuarksStatefulSet or a QuarksJob
instance_groups:
  # Used to name the QuarksStatefulSet or QuarksJob
- name: "api-az1"
  # Support for AZs is implemented in the QuarksStatefulSet
  azs: ["az1"]
  # Number of replicas for the StatefulSets in a QuarksStatefulSet
  # If this instance group defines a QuarksJob, this value must be 1. An error is thrown otherwise
  instances: 3
  # Each job results in a rendered bpm.yml file.
  # BPM information is required - the deployment fails if it's missing.
  # Each job has one or more processes (defined in bpm.yml), and each process corresponds to a container of a pod in a StatefulSet or Job
  jobs:
    # It's used to name the container
  - name: "cloud_controller_ng"
    # The name of a release that must exist in the releases block.
    # If it doesn't exist in the releases block, an error is thrown.
    # The docker image used for the container is resolved using this release name.
    release: "capi-release"
    # Used by the cf-operator to calculate links before rendering templates.
    # All resources in the cf-operator are deterministic (IP addresses are not used),
    # So they can be calculated before template rendering occurs.
    consumes: {}
    # Same as the consumes block above.
    provides: {}
    # Defines all properties, used to render job templates.
    # Job templates are rendered as Secrets, and then mounted into pod containers.
    # If a property is changed, the operator runs rendering in a QuarksJob, and the
    # template's secret is (re)generated.
    # All properties are input to this QuarksJob that does rendering.
    # Some properties can reference variables, which can be generated. The cf-operator
    # collects values for all properties before starting the rendering process.
    properties:
      domain: "mycf.com"
      admin_password: "((adminpass))"
      # Extra information specific to the cf-operator
      quarks:
        run:
          # Hints for pod replica count
          scaling:
            min: 1
            max: 3
            ha: 2
          # Extra capabilities required by the containers of this job
          capabilities: []
          # Memory used by each container. Overrides info from vm_resources.
          memory: 128
          # Number of vCPUs used by each container. Overrides info from vm_resources.
          virtual-cpus: 2
          # Healthcheck information for the containers in this job.
          healthcheck:
            some_process_name:
              readiness:
                exec:
                  command:
                  - "curl --silent --fail --head http://${HOSTNAME}:8080/health"
        # List of ports to be opened up for this job.
        ports:
        - name: "health-port"
          protocol: "TCP"
          internal: 8080
  # Not used by the cf-operator.
  # A warning is logged if this is set.
  vm_type: ""
  # Not used by the cf-operator.
  # A warning is logged if this is set.
  vm_extensions: []
  # Used by the cf-operator to limit the resources used by a container in a pod
  vm_resources:
    # Number of vCPUs used by a container
    cpu: 4
    # Memory used by a container
    ram: 1024
    # Used for PVC sizes if `ephemeralAsPVC` is set to true
    ephemeral_disk_size: 4096
  # Not used by the cf-operator.
  # A warning is logged if this is set.
  stemcell: ""
  # Size of the volume attached to a pod container.
  persistent_disk: 4096
  # This must be the name of a StorageClass used by the cf-operator to create volumes.
  persistent_disk_type: "default"
  # Not used by the cf-operator.
  # A warning is logged if this key is set.
  networks:
    # Not used by the cf-operator
    - name: "foo"
      # Not used by the cf-operator
      static_ips: []
      # Not used by the cf-operator
      default: []
  # Specific update settings for this instance group. Use this to override global job update settings on a per-instance-group basis.
  update: {}
  # TODO: understand how instance group renames can occur in a QuarksStatefulSet or QuarksJob
  migrated_from:
  - cloud_controller
  # This is the key that controls how an instance group is treated by the cf-operator.
  # If lifecycle is "service", a QuarksStatefulSet is created for the instance group.
  # Otherwise, if it's "errand", a QuarksJob is created. As with normal BOSH, errands have a
  # manual trigger, so QuarksJobs have to support this (manual triggers).
  # In Kubernetes we also need errands that can run on a trigger. These are not supported by BOSH.
  # The lifecycle for such a QuarksJob is "auto-errand".
  # Manual triggers are supported by QuarksJobs
  lifecycle: "service"
  # Deprecated - the cf-operator does not support this key.
  # An error is thrown if this is set.
  properties: {}
  # Usually used for BOSH Agent configuration.
  # We can use this hash to control how the operator generates resources, however
  # none of the settings used by the Agent are supported by the operator.
  env:
    # Not used by the cf-operator.
    # A warning is logged if this is set.
    persistent_disk_fs: "ext4"
    # Not used by the cf-operator.
    # A warning is logged if this is set.
    persistent_disk_mount_options: []
    # Not used by the cf-operator.
    # A warning is logged if this is set.
    bosh:
      # Not used by the cf-operator.
      # A warning is logged if this is set.
      password: "foo"
      # Not used by the cf-operator.
      # A warning is logged if this is set.
      keep_root_password: false
      # Not used by the cf-operator.
      # A warning is logged if this is set.
      remove_dev_tools: false
      # Not used by the cf-operator.
      # A warning is logged if this is set.
      remove_static_libraries: false
      # Not used by the cf-operator.
      # A warning is logged if this is set.
      swap_size: 100
      # Not used by the cf-operator.
      # A warning is logged if this is set.
      ipv6:
        # Not used by the cf-operator.
        # A warning is logged if this is set.
        enable: false
      # Not used by the cf-operator.
      # A warning is logged if this is set.
      job_dir:
        # Not used by the cf-operator.
        # A warning is logged if this is set.
        tmpfs: false
        # Not used by the cf-operator.
        # A warning is logged if this is set.
        tmpfs_size: "0m"
      agent:
        # Not used by the cf-operator.
        # A warning is logged if this is set.
        tmpfs: false
        # Used by the cf-operator to set kubernetes-specific information
        # for the resources representing this instance group.
        settings:
          # Affinity information for this instance group's pod.
          # These definitions are merged directly into the pod's definition.
          # The structure is the same as the one used by Kube [3].
          affinity: {}
          # Labels to add to the resources representing the instance group
          labels: {}
          # Annotations to add to the resources representing the instance group
          annotations: {}
          # disable_log_sidecar is an option to disable log sidecar
          disable_log_sidecar: false
          # serviceAccountName is the name of the ServiceAccount to use to run this pod.
          serviceAccountName: kubecf
          # automountServiceAccountToken indicates whether a service account token should be automatically mounted
          automountServiceAccountToken: false
          # ImagePullSecrets is an optional list of references to secrets to use for pulling any of the images.
          # This field in PodSpec can be automated by setting the imagePullSecrets in a serviceAccount.
          imagePullSecrets: []
          # Tolerations and taints are a concept defined in kubernetes to repel pods from nodes. [4]
          tolerations: []
          # If this is set to true, the operator will define a PersistentVolumeClaim template
          # on the QuarksStatefulSet of the instance group, and it will use that PVC for all volume
          # mounts for ephemeral disks
          ephemeralAsPVC: false
          # This sets the backoffLimit for the jobs running errands. If not set, it will use the Kube default which is 6.
          # https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#handling-pod-and-container-failures
          jobBackoffLimit: 6
          # An array of disks to be mounted on the containers
          disks:
            # A PersistentVolumeClaim to be used as a template in the StatefulSet of the instance group.
          - pvc:
              name: foo
              storageClassName: persistent
            # Volume definition to be included in the pod.
            volume:
              name: extravolume
              emptyDir: {}
            # Volume mounts to be set on the containers that match the job and process set in "filters".
            volumeMount:
              name: extravolume
              mountPath: /var/vcap/data/rep
            # Filters to identify on which containers to apply the volume mounts.
            filters:
              job_name: "cflinuxfs3-rootfs-setup"
              process_name: "test-server"
# Each addon job is added to the desired manifest before it's persisted
# Not all placement rules are supported, see below for more details.
addons:
  # The name of the addon is not used by the operator.
  # TODO: investigate whether it's useful to set this in an annotation of the instance group sts/pod
- name: foo
  # All jobs are added to instance groups based on placement rules before the desired manifest is persisted
  jobs:
  - name: metron
    release: loggregator-release
    properties:
      loggregator:
        metron:
          log_level: debug
  include:
    # Supported
    stemcell:
    - os: opensuse
    # Not supported, addons are used per-deployment
    deployments: []
    # Supported
    jobs:
    - name: cloud_controller_ng
      release: capi-release
    # Supported
    instance_groups:
    - api
    - diego-cell
    # Not supported
    networks: []
    # Not supported
    teams: []
  # The same matchers are supported as the "include" key
  exclude: {}
# Deprecated - the cf-operator does not support this key.
# An error is thrown if this is set.
properties: {}
# For each variable, the cf-operator creates QuarksSecrets
# As with normal BOSH, variables are referenced by job properties.
# Each variable's generated secret is mounted in the container that renders each job's
# templates. They are then used by the rendering process.
# This means that the operator needs to look at the job's properties, and parse any references
# to variables, so it knows what it needs to mount.
variables:
    # Unique name used to identify a variable. Used to name the QuarksSecret
  - name: "adminpass"
    # As with normal BOSH, supported types are certificate, password, rsa, and ssh.
    type: "password"
    # Specifies generation options
    options: {is_ca: true, common_name: "some-ca"}
# Tags are transformed into annotations for the resources created
# by this deployment.
tags:
  maintainer: "Philip J. Fry"

BPM

In a BOSH release, some jobs have BPM configuration in templates/bpm.yml.erb. Each process specified in the BPM configuration runs in its own Kubernetes container as part of a Pod.

The following subsections describe the mapping of BPM configuration into containers.

Entrypoint & Environment Variables

| Bosh | Kube Pod Container |
| --- | --- |
| executable | command |
| args | args |
| env | env |
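
As a rough illustration of the mapping above, the sketch below turns one BPM process entry into a container definition. The function name, the use of the process name as the container name, and the sample values are assumptions made for this example; it is not the operator's code.

```python
def bpm_process_to_container(process: dict) -> dict:
    """Map a BPM process entry to a Kubernetes-style container dict."""
    container = {
        # Assumption: the container is named after the BPM process.
        "name": process["name"],
        # BPM "executable" becomes the container "command".
        "command": [process["executable"]],
    }
    if process.get("args"):
        container["args"] = list(process["args"])
    if process.get("env"):
        # BPM env is a hash; a container expects a list of name/value pairs
        # whose values are strings.
        container["env"] = [
            {"name": key, "value": str(value)}
            for key, value in sorted(process["env"].items())
        ]
    return container


# Hypothetical process entry, for illustration only.
example = bpm_process_to_container(
    {
        "name": "test-server",
        "executable": "/bin/server",
        "args": ["--port", "8080"],
        "env": {"PORT": 8080},
    }
)
print(example)
```
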

Resources

| Bosh | Kube Pod Container |
| --- | --- |
| workdir | workingDir. Not implemented yet. |
| hooks | initContainers and container hooks. Not implemented yet. |
| process.capabilities | container.SecurityContext.Capabilities. |
| limits | container.Resources.Limits. Not implemented yet. |
| ephemeral_disk | emptyDir volumes by default, but can be PersistentVolumeClaims if ephemeralAsPVC is set on the bosh.agent.settings. |
| persistent_disk | PersistentVolumeClaims. Not yet implemented. |
| additional_volumes | emptyDir. Paths under /var/vcap/store are currently ignored. |
| unsafe.unrestricted_volumes | emptyDir. Paths under /var/vcap/store are currently ignored. |
| unsafe.privileged | container.SecurityContext.Privileged. |

Health checks

BPM doesn't provide information for health checks and relies on monit instead. CF-Operator provides health checks via the quarks property key in the deployment manifest.

In Kubernetes, we use liveness and readiness probes for healthchecks.

Hooks

BPM supports pre_start hooks. CF-Operator will convert those to additional init containers.

Misc

In addition, there are configuration variables that are not available in BOSH but are required for scaling in a Kubernetes environment.

| Job spec in Manifest | Kube Pod Container | Description |
| --- | --- | --- |
| properties.quarks.bpm.processes[n].requests.cpu | container.Resources.Requests.cpu | Guaranteed CPU |
| properties.quarks.bpm.processes[n].requests.memory | container.Resources.Requests.memory | Guaranteed memory |

Conversion Details

Calculation of docker image location for releases

Release image tags are immutable. A release image location is composed of multiple elements:

  • docker registry URL
  • organization and repository
  • stemcell name and version
  • fissile version
  • the release name and version

Release image locations always have to be resolved in the context of an instance group/job because they depend on the stemcell that is being used.

A typical release image location could look like hub.docker.com/cfcontainerization/cflinuxfs3-release:opensuse-15.0-28.g837c5b3-30.263-7.0.0_233.gde0accd0-0.62.0.

The different elements are taken from different places in the manifest. Given this excerpt from a BOSH deployment manifest:

stemcells:
- alias: default
  os: opensuse-42.3
  version: 28.g837c5b3-30.263-7.0.0_234.gcd7d1132
instance_groups:
- name: diego-cell
  stemcell: default
  jobs:
  - name: cflinuxfs3-rootfs-setup
    release: cflinuxfs3
releases:
- name: cflinuxfs3
  version: 0.62.0
  url: hub.docker.com/cfcontainerization
  sha1: 6466c44827c3493645ca34b084e7c21de23272b4
  stemcell:
    os: opensuse-15.0
    version: 28.g837c5b3-30.263-7.0.0_233.gde0accd0

The stemcell information (name, stemcell version, and fissile version) is taken from the stemcells entry that matches the instance group's stemcell alias. The registry URL including the organization, the release name, and the version come from the releases entry that's referenced from the job.

Note:

Releases can optionally specify a separate stemcell section, in which case the information from the instance group stemcell is overridden.
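
The resolution described in this section can be sketched as follows. This is a minimal illustration that assumes the `<url>/<name>:<stemcell.os>-<stemcell.version>-<version>` pattern from the manifest comments and uses the release name exactly as given; the class and function names are made up for the example and are not taken from the operator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stemcell:
    os: str
    version: str


@dataclass(frozen=True)
class Release:
    name: str
    version: str
    url: str
    stemcell: Optional[Stemcell] = None  # optional per-release override


def image_location(release: Release, group_stemcell: Optional[Stemcell]) -> str:
    """Build <url>/<name>:<stemcell.os>-<stemcell.version>-<version>."""
    # A stemcell section on the release overrides the instance group's stemcell.
    stemcell = release.stemcell or group_stemcell
    if stemcell is None:
        raise ValueError(f"no stemcell information for release {release.name!r}")
    if release.version == "latest":
        raise ValueError('"latest" is not a supported release version')
    return (
        f"{release.url}/{release.name}:"
        f"{stemcell.os}-{stemcell.version}-{release.version}"
    )


# Values copied from the manifest excerpt above.
group = Stemcell(os="opensuse-42.3", version="28.g837c5b3-30.263-7.0.0_234.gcd7d1132")
release = Release(
    name="cflinuxfs3",
    version="0.62.0",
    url="hub.docker.com/cfcontainerization",
    stemcell=Stemcell(os="opensuse-15.0", version="28.g837c5b3-30.263-7.0.0_233.gde0accd0"),
)
print(image_location(release, group))
```

Because the release in the excerpt carries its own stemcell section, the sketch uses opensuse-15.0 rather than the instance group's opensuse-42.3.
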

Variables to Quarks Secrets

For each Explicit BOSH Variable (with a definition in the variables section in the deployment manifest), the cf-operator creates a QuarksSecret. The QuarksSecret is meant to generate the value required by the variable.

The name of the QuarksSecret is calculated like this:

var-<VARIABLE_NAME>

The name of the final generated Secret (the secretName key of the QuarksSecret) is calculated the same way.

Overriding generated variables

The user can also specify overrides for generated secrets using the vars key in the BOSHDeployment spec.

These map explicit variable names to secret names.

Each secret must contain the usual keys used in explicit variables (see here for more details).

You can find an example here.

Instance Groups to Quarks StatefulSets and Jobs

BOSH Services vs BOSH Errands

BOSH Services are converted to QuarksStatefulSets and Services.

BOSH Errands are converted to QuarksJobs with trigger.strategy: manually.

BOSH Auto-Errands (supported only by the operator) are converted to QuarksJobs with trigger.strategy: once.
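
The three rules above amount to a small lookup table. The sketch below is illustrative only: the function name is hypothetical, and the strategy strings are copied from the text above rather than from the QuarksJob API.

```python
from typing import Optional, Tuple

# lifecycle -> (resource kind, trigger strategy), as listed above.
_LIFECYCLES = {
    "service": ("QuarksStatefulSet", None),
    "errand": ("QuarksJob", "manually"),
    "auto-errand": ("QuarksJob", "once"),
}


def resource_for(lifecycle: str) -> Tuple[str, Optional[str]]:
    """Return the resource kind and trigger strategy for an instance group."""
    if lifecycle not in _LIFECYCLES:
        raise ValueError(f"unsupported lifecycle: {lifecycle!r}")
    return _LIFECYCLES[lifecycle]


for name in ("service", "errand", "auto-errand"):
    print(name, "->", resource_for(name))
```
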

Miscellaneous

Dealing with AZs

QuarksStatefulSets support AZs. You can learn more about this in the docs.

Support for active/passive pod replicas

QuarksStatefulSets support active/passive pod replicas. You can learn more about this in the docs.

Ephemeral Disks

We use an emptyDir for ephemeral disks. You can learn more from the official docs.

If the setting bosh.agent.settings.ephemeralAsPVC is set to true, the operator will use PersistentVolumeClaims instead. This option should be used for jobs that make assumptions about ephemeral disk mounts (like this garden job), or when the size limit for the disk is critical. If vm_resources.ephemeral_disk_size is set, the PVC size will be set to this value. If it's not set, the operator will try to use persistent_disk as the size. If this is not set either, the operator will use a default of 10GB.
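
The size fallback can be sketched like this. The function and constant names are hypothetical, and expressing all sizes in MB (with the 10GB default written as 10240) is an assumption made for the example.

```python
from typing import Optional

# Assumption: sizes are in MB, so the 10GB default becomes 10 * 1024.
DEFAULT_EPHEMERAL_SIZE_MB = 10 * 1024


def ephemeral_volume(
    ephemeral_as_pvc: bool,
    ephemeral_disk_size: Optional[int] = None,
    persistent_disk: Optional[int] = None,
) -> dict:
    """Pick the volume type and size for an instance group's ephemeral disk."""
    if not ephemeral_as_pvc:
        # Default behaviour: a plain emptyDir volume.
        return {"type": "emptyDir"}
    if ephemeral_disk_size is not None:
        size = ephemeral_disk_size  # vm_resources.ephemeral_disk_size wins
    elif persistent_disk is not None:
        size = persistent_disk  # next, fall back to persistent_disk
    else:
        size = DEFAULT_EPHEMERAL_SIZE_MB  # last resort: the default
    return {"type": "PersistentVolumeClaim", "size": size}


print(ephemeral_volume(True, ephemeral_disk_size=4096, persistent_disk=2048))
```
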

Credentials for Docker Registries

Providing credentials for private registries is supported by Kubernetes. Please read the official docs.

Running manual errands

BOSH makes use of errands, which are manually triggered. We support manual triggers - you can learn more in the QuarksJob docs.

Readiness and Liveness Probes

When the deployment manifest declares health check information for jobs, via the quarks section, we configure those in Kubernetes.

The probes are defined per BPM process.

Example:

instance_groups:
- name: "api-az1"
  jobs:
  - name: "cloud_controller_ng"
    properties:
      quarks:
        run:
          healthcheck:
            bpm-process-name:
              readiness:
              liveness:

Both keys contain information that is used as-is for the container that matches the process name.
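
A minimal sketch of that lookup, assuming the healthcheck section has already been parsed into a dictionary. The function name is hypothetical; readinessProbe and livenessProbe are the standard Kubernetes container fields.

```python
def probes_for(healthcheck: dict, process_name: str) -> dict:
    """Return the probe fields for the container running the given BPM process."""
    entry = healthcheck.get(process_name) or {}
    probes = {}
    # The probe definitions are copied unchanged from the manifest.
    if entry.get("readiness"):
        probes["readinessProbe"] = entry["readiness"]
    if entry.get("liveness"):
        probes["livenessProbe"] = entry["liveness"]
    return probes


# Healthcheck section shaped like the one in the example manifest.
healthcheck = {
    "some_process_name": {
        "readiness": {
            "exec": {
                "command": [
                    "curl --silent --fail --head http://${HOSTNAME}:8080/health"
                ]
            }
        }
    }
}
print(probes_for(healthcheck, "some_process_name"))
```

Processes without an entry in the healthcheck section simply get no probes in this sketch.
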

Persistent Disks

When a BOSH deployment manifest declares persistent disks on instance groups, we provide a persistent volume to the containers of a pod in /var/vcap/store. You can learn more about BOSH Persistent Disks in the BOSH Official Docs.

These volumes are mounted on each container that's part of the instance group.

If no storage class is specified with the persistent_disk_type key in the manifest, the implementation uses the default storage class.

Manual ("implicit") variables

BOSH deployment manifests support two different types of variables, implicit and explicit ones.

"Explicit" variables are declared in the variables section of the manifest and are generated automatically before the interpolation step.

"Implicit" variables just appear in the document within double parentheses without any declaration. The user has to provide these variables as a secret before creating the BOSH deployment. The secret name has to follow the scheme

var-<variable-name>

By default the variable content is expected in the value key, e.g.

((system-domain))
---
apiVersion: v1
kind: Secret
metadata:
  name: var-system-domain
type: Opaque
stringData:
  value: example.com

It is also possible to specify the key name after a / separator, e.g.

((ssl/ca))
---
apiVersion: v1
kind: Secret
metadata:
  name: var-ssl
type: Opaque
stringData:
  ca: ...
  cert: ...
  key: ...
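
To make the naming scheme concrete, the following sketch extracts variable references from a piece of manifest text and derives the secret name and key for each. The regular expression and the function name are assumptions for illustration; the operator's own parsing may differ.

```python
import re
from typing import List, Tuple

# Matches ((name)) and ((name/key)) references.
_VARIABLE = re.compile(r"\(\(([^()]+)\)\)")


def secret_refs(text: str) -> List[Tuple[str, str]]:
    """Return (secret name, key) pairs for every variable reference in text."""
    refs = []
    for match in _VARIABLE.finditer(text):
        name, _, key = match.group(1).strip().partition("/")
        # Without an explicit key, the content is expected under "value".
        refs.append((f"var-{name}", key or "value"))
    return refs


print(secret_refs("domain: ((system-domain))"))
print(secret_refs("ca: ((ssl/ca))"))
```
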

Pre_render_scripts

Similar to the patch scripts in SCF v1, the cf-operator allows the user to execute a custom script during runtime of the job container for a specific instance_group. Patching during runtime is useful for a variety of reasons, and users can specify such scripts via the quarks.pre_render_scripts key.

Keep in mind that each script should belong to a type, to avoid running all scripts as a whole. Currently supported types are:

  • quarks.pre_render_scripts.bpm
  • quarks.pre_render_scripts.ig_resolver
  • quarks.pre_render_scripts.jobs

This allows you to run anything, by specifying a list of commands/scripts to execute. For example:

instance_groups:
- name: redis-slave
  instances: 2
  lifecycle: errand
  azs: [z1, z2]
  jobs:
  - name: redis-server
    release: redis
    properties:
      quarks:
        pre_render_scripts:
          bpm:
          - |
            touch /tmp

BOSH DNS

The BOSH DNS addon is implemented using a separate DNS server (coredns). For each BOSHDeployment that enables this addon, an additional DNS server is created within the namespace. This DNS server rewrites all BOSH DNS requests to standard k8s queries (e.g. api.service.cf.internal -> api.<namespace>.svc.cluster.local) and forwards them to the k8s DNS server. All pods created from the BOSHDeployment are configured to use this DNS server.

Additionally, headless services are created based on the specified aliases. The following alias

  - domain: blobstore.service.cf.internal
    targets:
    - deployment: cf
      domain: bosh
      instance_group: singleton-blobstore
      network: default
      query: '*'

will create a headless service with the name blobstore instead of singleton-blobstore.

For migration purposes, the DNS service also rewrites all previous headless service names (e.g. singleton-blobstore is rewritten to blobstore.<namespace>.svc.cluster.local).
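
The rewrite rules described in this section can be sketched as a lookup table. This is only an illustration: it assumes the service name is the first label of the alias domain (which matches both examples above), the function names are hypothetical, and the namespace "kubecf" is a placeholder.

```python
from typing import Dict, List, Tuple


def build_rewrites(aliases: List[Tuple[str, str]], namespace: str) -> Dict[str, str]:
    """Map alias domains and old headless service names to cluster DNS names.

    Each alias is an (alias domain, instance group) pair.
    """
    rewrites = {}
    for domain, instance_group in aliases:
        # Assumption: the service is named after the first label of the alias.
        service = domain.split(".", 1)[0]
        target = f"{service}.{namespace}.svc.cluster.local"
        rewrites[domain] = target
        # Migration: the previous headless service name resolves to the same target.
        rewrites[instance_group] = target
    return rewrites


def resolve(query: str, rewrites: Dict[str, str]) -> str:
    """Rewrite a query if a rule matches; otherwise pass it through unchanged."""
    return rewrites.get(query, query)


rules = build_rewrites(
    [("blobstore.service.cf.internal", "singleton-blobstore")], namespace="kubecf"
)
print(resolve("blobstore.service.cf.internal", rules))
print(resolve("singleton-blobstore", rules))
```
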

Flow

[Figure: flow diagram]

Naming Conventions

After creating a BOSHDeployment named nats-deployment, with one Instance Group, the following resources should exist:

  • BOSHDeployment

    nats-deployment
    
  • QuarksJob

    ig
    dm
    
  • QuarksSecret

    var-nats-password
    
  • QuarksStatefulSet

    nats
    
  • Secrets

    bpm.nats-v1
    ig-resolved.nats-v1
    var-nats-password
    with-ops
    desired-manifest-v1
    
  • StatefulSets

    nats
    
  • Pods

    nats-0
    nats-1
    
  • Services

    nats
    nats-0
    nats-1
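
The per-instance-group names in the list above follow a regular pattern. The sketch below generalizes from this single example, which is an assumption; the function name is hypothetical, and resources that are not derived from the instance group name are left out.

```python
def expected_names(instance_group: str, replicas: int, version: int = 1) -> dict:
    """List the resource names derived from one instance group."""
    # Pods are numbered by their ordinal, starting at 0.
    pods = [f"{instance_group}-{index}" for index in range(replicas)]
    return {
        "QuarksStatefulSet": [instance_group],
        "StatefulSets": [instance_group],
        "Pods": pods,
        # One service for the group plus one per pod.
        "Services": [instance_group] + pods,
        # Versioned secrets that carry the instance group name.
        "Secrets": [
            f"bpm.{instance_group}-v{version}",
            f"ig-resolved.{instance_group}-v{version}",
        ],
    }


for kind, names in expected_names("nats", replicas=2).items():
    print(kind, names)
```
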