---
layout: docs
page_title: Upgrade Guides
description: Specific versions of Nomad may have additional information about the upgrade process beyond the standard flow.
---

Upgrade Guides

The upgrading page covers the details of doing a standard upgrade. However, specific versions of Nomad may have more details provided for their upgrades as a result of new features or changed behavior. This page is used to document those details separately from the standard upgrade flow.

Nomad 1.0.11 and 1.1.5 Enterprise

Audit log file names

Audit log file naming now matches the standard log file naming introduced in 1.0.10 and 1.1.4. The audit log currently being written will no longer have a timestamp appended.

Nomad 1.0.10 and 1.1.4

Log file names

The log_file configuration option was not being fully respected, as the generated filename would include a timestamp. After upgrade, the active log file will always be the value defined in log_file, with timestamped files being created during log rotation.
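
As a rough illustration, an agent configuration along these lines keeps the active log at a fixed path while rotation produces timestamped files alongside it. The path and rotation values are assumptions chosen for the example, not recommendations.

```hcl
# Agent configuration sketch; the path and rotation settings are illustrative.
log_file             = "/var/log/nomad/nomad.log" # the active log file is always this path
log_rotate_duration  = "24h"                      # rotate roughly once per day
log_rotate_max_files = 7                          # keep at most 7 rotated (timestamped) files
```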

Nomad 1.0.9 and 1.1.3

Namespace in Job Run and Plan APIs

The Job Run and Plan APIs now respect the ?namespace=... query parameter over the namespace specified in the job itself. This matches the precedence of region and fixes a bug where the -namespace flag was not respected for the nomad run and nomad plan commands.

Users of api.Client who want the namespace in the job itself to be respected must ensure the Config.Namespace field is unset.

Docker Driver

1.1.3 only

Starting in Nomad 1.1.2, task groups with network.mode = "bridge" generated a hosts file in Docker containers. This generated hosts file was bind-mounted from the task directory to /etc/hosts within the task. In Nomad 1.1.3 the source for the bind mount was moved to the allocation directory so that it is shared between all tasks in an allocation.

Please note that this change may prevent extra_hosts values from being properly set in each task when there are multiple tasks within the same group. When using extra_hosts with Consul Connect in bridge network mode, you should set the hosts values in the sidecar_task.config block instead.
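
A minimal sketch of that arrangement might look like the following. The group name, service name, port, and host entry are hypothetical; only the placement of extra_hosts inside sidecar_task.config is the point of the example.

```hcl
group "api" {
  network {
    mode = "bridge"
  }

  service {
    name = "api"  # hypothetical service name
    port = "9001" # hypothetical port

    connect {
      sidecar_service {}

      # Set the hosts entries on the sidecar task rather than on each task,
      # since /etc/hosts is now shared by every task in the allocation.
      sidecar_task {
        config {
          extra_hosts = ["legacy-db.example.internal:10.0.0.10"] # hypothetical entry
        }
      }
    }
  }
}
```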

Nomad 1.1.0

Enterprise licenses

Nomad Enterprise licenses are no longer stored in raft or synced between servers. Nomad Enterprise servers will not start without a license. There is no longer a six hour evaluation period when running Nomad Enterprise. Before upgrading, you must provide each server with a license on disk or in its environment (see the Enterprise licensing documentation for details).
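
For the on-disk option, a server configuration could be sketched as below. The file location is an assumption; any path readable by the Nomad server process should work.

```hcl
server {
  enabled = true

  # Hypothetical location of the license file on this server.
  license_path = "/etc/nomad.d/license.hclic"
}
```

For the environment option, the NOMAD_LICENSE or NOMAD_LICENSE_PATH environment variable can be set for the server process instead.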

The nomad license put command has been removed.

The nomad license get command is no longer forwarded to the Nomad leader, and will return the license from the specific server being contacted.

Click here to get a trial license for Nomad Enterprise.

Agent Metrics API

The Nomad agent metrics API now respects the prometheus_metrics configuration value. If this value is set to false, which is the default value, calling /v1/metrics?format=prometheus will now result in a response error.
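
Agents that are scraped by Prometheus would therefore need the option enabled explicitly. A minimal sketch of the relevant agent configuration:

```hcl
telemetry {
  # Without this, /v1/metrics?format=prometheus returns an error in 1.1.0 and later.
  prometheus_metrics = true
}
```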

CSI volumes

The volume specification for CSI volumes has been updated to support volume creation. The access_mode and attachment_mode fields have been moved to a capability block that can be repeated. Existing registered volumes will be automatically modified the next time that a volume claim is updated. Volume specification files for new volumes should be updated to the format described in the volume create and volume register commands.
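
A volume specification in the newer format might look roughly like this. The volume ID, name, and plugin ID are hypothetical, and the capability values shown are just one common combination.

```hcl
# Volume specification sketch for the volume create / volume register commands.
id        = "prod-db"  # hypothetical volume ID
name      = "prod-db"  # hypothetical display name
type      = "csi"
plugin_id = "ebs-prod" # hypothetical CSI plugin ID

# access_mode and attachment_mode now live in a repeatable capability block.
capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}
```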

The volume block has an access_mode and attachment_mode field that are required for CSI volumes. Jobs that use CSI volumes should be updated with these fields.
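
On the job side, the updated volume block could be sketched as follows. The group, task, volume source, image, and mount path are placeholders; the two added fields should match one of the capabilities registered for the volume.

```hcl
group "db" {
  volume "data" {
    type   = "csi"
    source = "prod-db" # hypothetical volume ID

    # Required for CSI volumes as of Nomad 1.1.0.
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }

  task "postgres" {
    driver = "docker"

    config {
      image = "postgres:13" # placeholder image
    }

    volume_mount {
      volume      = "data"
      destination = "/var/lib/postgresql/data" # placeholder path
    }
  }
}
```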

Connect native tasks

Connect native tasks running in host networking mode will now have CONSUL_HTTP_ADDR set automatically. Before this was only the case for bridge networking. If an operator already explicitly set CONSUL_HTTP_ADDR then it will not get overridden.

Linux capabilities in exec/java

Following the security remediation in Nomad versions 0.12.12, 1.0.5, and 1.1.0-rc1, the exec and java task drivers no longer enable the following Linux capabilities by default:

AUDIT_CONTROL  AUDIT_READ  BLOCK_SUSPEND  DAC_READ_SEARCH  IPC_LOCK  IPC_OWNER  LEASE
LINUX_IMMUTABLE  MAC_ADMIN  MAC_OVERRIDE  NET_ADMIN  NET_BROADCAST  NET_RAW  SYS_ADMIN
SYS_BOOT  SYSLOG  SYS_MODULE  SYS_NICE  SYS_PACCT  SYS_PTRACE  SYS_RAWIO  SYS_RESOURCE
SYS_TIME  SYS_TTY_CONFIG  WAKE_ALARM

The capabilities now enabled by default are modeled after Docker's default Linux capabilities (excluding NET_RAW):

AUDIT_WRITE  CHOWN  DAC_OVERRIDE  FOWNER  FSETID  KILL  MKNOD  NET_BIND_SERVICE
SETFCAP  SETGID  SETPCAP  SETUID  SYS_CHROOT

A new allow_caps plugin configuration parameter for exec and java task drivers can be used to restrict the set of capabilities allowed for use by tasks.
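
As a sketch, a client that wants to permit only a small set of capabilities for exec tasks could configure the plugin like this. The particular capability list is an arbitrary example.

```hcl
plugin "exec" {
  config {
    # Tasks on this client may only use capabilities from this list.
    allow_caps = ["chown", "net_bind_service", "net_raw"]
  }
}
```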

Tasks using the exec or java task drivers can add or remove desired Linux capabilities using the cap_add and cap_drop task configuration options.
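
A task-level sketch, assuming the client's allow_caps setting permits net_raw, might look like this. The task name, command, and arguments are placeholders.

```hcl
task "ping" {
  driver = "exec"

  config {
    command = "/bin/ping"                   # placeholder command
    args    = ["-c", "1", "198.51.100.10"]  # placeholder arguments

    cap_add  = ["net_raw"] # must also be permitted by the plugin's allow_caps
    cap_drop = ["mknod"]   # drop a default capability this task does not need
  }
}
```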

iptables

Nomad now appends its iptables rules to the NOMAD-ADMIN chain instead of inserting them as the first rule. This allows better control for user-defined iptables rules, but users who currently append their own rules should verify that those rules still end up in the correct order.

Nomad 1.1.0-rc1, 1.0.5, 0.12.12

Nomad versions 1.1.0-rc1, 1.0.5 and 0.12.12 change the behavior of the docker, exec, and java task drivers so that the CAP_NET_RAW Linux capability is disabled by default. This is one of the Linux capabilities that Docker itself enables by default, because it allows the generation of ICMP packets, which the common ping utility uses for network diagnostics. When used by groups in bridge networking mode, the CAP_NET_RAW capability also exposes tasks to ARP spoofing, enabling DoS and MITM attacks against other tasks running in bridge networking on the same host. Operators should weigh the potential impact of an upgrade on their applications against the security consequences inherent in CAP_NET_RAW. Typical applications using TCP- or UDP-based networking should not be affected.

This is the sole change for Nomad 1.0.5 and 0.12.12, intended to provide better task network isolation by default.

Users of the docker driver can restore the previous behavior by configuring the allow_caps driver configuration option to explicitly enable the CAP_NET_RAW capability.

plugin "docker" {
  config {
    allow_caps = [
      "CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "MKNOD",
      "SETGID", "SETUID", "SETFCAP", "SETPCAP", "NET_BIND_SERVICE",
      "SYS_CHROOT", "KILL", "AUDIT_WRITE", "NET_RAW",
    ]
  }
}

An upcoming version of Nomad will include similar configuration options for the exec and java task drivers.

This change is limited to docker, exec, and java driver plugins. It does not affect the Nomad server. This only affects Nomad clients running Linux, with tasks using bridge networking and one of these task drivers, or third-party plugins which relied on the shared Nomad executor library.

Upgrading a Nomad client to 1.0.5 or 0.12.12 will not restart existing tasks. As such, processes from existing docker, exec, or java tasks will need to be manually restarted (using alloc stop or another mechanism) in order to be fully isolated.

Nomad 1.0.3, 0.12.10

Nomad versions 1.0.3 and 0.12.10 change the behavior of the exec and java drivers so that tasks are isolated in their own PID and IPC namespaces. As a result, the process launched by these drivers will be PID 1 in the namespace. This has significant impact on the treatment of a process by the Linux kernel. Furthermore, tasks in the same allocation will no longer be able to coordinate using signals, SystemV IPC objects, or POSIX message queues. Operators should weigh potential impact of an upgrade on their applications against the security consequences inherent in using the host namespaces.

This is the sole change for Nomad 1.0.3, intended to provide better process isolation by default. An upcoming version of Nomad will include options for configuring this behavior.

This change is limited to the exec and java driver plugins. It does not affect the Nomad server. This only affects Nomad clients running on Linux, using the exec or java drivers or third-party driver plugins which relied on the shared Nomad executor library.

Upgrading a Nomad client to 1.0.3 or 0.12.10 will not restart existing tasks. As such, processes from existing exec/java tasks will need to be manually restarted (using alloc stop or another mechanism) in order to be fully isolated.

Nomad 1.0.2

Dynamic secrets trigger template changes on client restart

Nomad 1.0.2 changed the behavior of template change_mode triggers when a client node restarts. In Nomad 1.0.1 and earlier, the first rendering of a template after a client restart would not trigger the change_mode. For dynamic secrets such as the Vault PKI secrets engine, this resulted in the secret being updated but not restarting or signalling the task. When the secret's lease expired at some later time, the task workload might fail because of the stale secret. For example, a web server's SSL certificate would be expired and browsers would be unable to connect.

In Nomad 1.0.2, when a client node is restarted any task with Vault secrets that are generated or have expired will have its change_mode triggered. If change_mode = "restart" this will result in the task being restarted, to avoid the task failing unexpectedly at some point in the future. This change only impacts tasks using dynamic Vault secrets engines such as PKI, or when secrets are rotated. Secrets that don't change in Vault will not trigger a change_mode on client restart.
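
The kind of task affected can be sketched as follows. The Vault policy, PKI mount, role name, and common name are all hypothetical, and the task's driver config is omitted for brevity.

```hcl
task "web" {
  driver = "docker"

  vault {
    policies = ["pki-web"] # hypothetical policy name
  }

  template {
    # Hypothetical PKI mount and role; each render issues a fresh certificate.
    data = <<EOT
{{ with secret "pki/issue/web" "common_name=web.example.com" }}
{{ .Data.certificate }}
{{ end }}
EOT

    destination = "secrets/cert.pem"

    # As of 1.0.2 this is also triggered after a client restart
    # when the secret has been regenerated or has expired.
    change_mode = "restart"
  }
}
```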

Nomad 1.0.1

Envoy worker threads

Nomad v1.0.0 changed the default behavior around the number of worker threads created by Envoy when used as a sidecar for Consul Connect. In Nomad v1.0.1, the same default setting of --concurrency=1 is set for Envoy when used as a Connect gateway. As before, the meta.connect.proxy_concurrency property can be set in client configuration to override the default value.
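
A client configuration that overrides the default could be sketched like this. The value 2 is an arbitrary example.

```hcl
client {
  meta {
    # Passed to Envoy as --concurrency for Connect sidecars and gateways.
    "connect.proxy_concurrency" = "2"
  }
}
```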

Nomad 1.0.0

HCL2 for Job specification

Nomad v1.0.0 adopts HCL2 for parsing the job spec. HCL2 extends HCL with richer expressions and better support for reuse, but applies a stricter schema to HCL blocks (a.k.a. stanzas). See the HCL documentation for more details.

Signal used when stopping Docker tasks

When stopping tasks running with the Docker task driver, Nomad documents that a SIGTERM will be issued (unless configured with kill_signal). However, recent versions of Nomad would issue SIGINT instead. Starting with Nomad v1.0.0, SIGTERM is once again sent by default when stopping Docker tasks.
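
Workloads that came to depend on receiving SIGINT can request it explicitly. A minimal sketch, with a placeholder task name and image:

```hcl
task "server" {
  driver = "docker"

  # Explicitly request SIGINT instead of the restored SIGTERM default.
  kill_signal = "SIGINT"

  config {
    image = "example/server:1.0" # placeholder image
  }
}
```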

Deprecated metrics have been removed

Nomad v0.7.0 added support for tagged metrics and deprecated untagged metrics. There was support for configuring backwards-compatible metrics. This support has been removed with v1.0.0, and all metrics will be emitted with tags.

Null characters in region, datacenter, job name/ID, task group name, and task names

Starting with Nomad v1.0.0, jobs will fail validation if any of the following contain a null character: the job ID or name, the task group name, or the task name. Any jobs containing null characters in these fields should be modified before upgrading to v1.0.0. Similarly, client and server config validation will prohibit either the region or the datacenter from containing null characters.

EC2 CPU characteristics may be different

Starting with Nomad v1.0.0, the AWS fingerprinter uses data derived from the official AWS EC2 API to determine default CPU performance characteristics, including core count and core speed. This data should be accurate for each instance type per region. Previously, Nomad used a hand-made lookup table that was not region aware and may have contained inaccurate or incomplete data. As part of this change, the AWS fingerprinter no longer sets the cpu.modelname attribute.

As before, cpu_total_compute can be used to override the discovered CPU resources available to the Nomad client.
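
If the newly fingerprinted values are not what an operator expects, the override could be sketched as below. The figure is an assumption for the example (for instance, 4 cores at 2000 MHz).

```hcl
client {
  # Total compute in MHz advertised by this client, overriding the fingerprinted value.
  cpu_total_compute = 8000
}
```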

Inclusive language

Starting with Nomad v1.0.0, the terms blacklist and whitelist have been deprecated from client configuration and driver configuration. The existing configuration values are permitted but will be removed in a future version of Nomad. The specific configuration values replaced are:

  • Client driver.blacklist is replaced with driver.denylist.

  • Client driver.whitelist is replaced with driver.allowlist.

  • Client env.blacklist is replaced with env.denylist.

  • Client fingerprint.blacklist is replaced with fingerprint.denylist.

  • Client fingerprint.whitelist is replaced with fingerprint.allowlist.

  • Client user.blacklist is replaced with user.denylist.

  • Client template.function_blacklist is replaced with template.function_denylist.

  • Docker driver docker.caps.whitelist is replaced with docker.caps.allowlist.
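
For client options, the rename could look like the following sketch. The option values are illustrative only.

```hcl
client {
  options = {
    # Previously "driver.whitelist"
    "driver.allowlist" = "docker,exec"

    # Previously "fingerprint.blacklist"
    "fingerprint.denylist" = "env_aws,env_gce"

    # Previously "user.blacklist"
    "user.denylist" = "root"
  }
}
```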

Consul Connect

Nomad 1.0's Consul Connect integration works best with Consul 1.9 or later. The ideal upgrade path is:

  1. Create a new Nomad client image with Nomad 1.0 and Consul 1.9 or later.
  2. Add new hosts based on the image.
  3. Drain and shut down old Nomad client nodes.

While in-place upgrades and older versions of Consul are supported by Nomad 1.0, Envoy proxies will drop and stop accepting connections while the Nomad agent is restarting. Nomad 1.0 with Consul 1.9 does not have this limitation.

Envoy proxy versions

Nomad v1.0.0 changes the behavior around the selection of the Envoy version used for Connect sidecar proxies. Previously, Nomad always defaulted to Envoy v1.11.2 if neither the meta.connect.sidecar_image parameter nor the sidecar_task stanza was explicitly configured. Likewise, the same version of Envoy would be used for Connect ingress gateways if meta.connect.gateway_image was unset. Starting with Nomad v1.0.0, each Nomad Client will query Consul for a list of supported Envoy versions. Nomad will make use of the latest version of Envoy supported by the Consul agent when launching Envoy as a Connect sidecar proxy. If the version of the Consul agent is older than v1.7.8, v1.8.4, or v1.9.0, Nomad will fall back to the v1.11.2 version of Envoy. As before, if the meta.connect.sidecar_image parameter, the meta.connect.gateway_image parameter, or the sidecar_task stanza is set, those settings take precedence.

When upgrading Nomad Clients from a previous version to v1.0.0 and above, it is recommended to also upgrade the Consul agents to v1.7.8, v1.8.4, or v1.9.0 or newer. Upgrading Nomad and Consul to versions that support the new behavior while also doing a full node drain at the time of the upgrade for each node will ensure Connect workloads are properly rescheduled onto nodes in such a way that the Nomad Clients, Consul agents, and Envoy sidecar tasks maintain compatibility with one another.

Envoy worker threads

Nomad v1.0.0 changes the default behavior around the number of worker threads created by the Envoy sidecar proxy when using Consul Connect. Previously, the Envoy --concurrency argument was left unset, which caused Envoy to spawn as many worker threads as logical cores available on the CPU. The --concurrency value now defaults to 1 and can be configured by setting the meta.connect.proxy_concurrency property in client configuration.

Nomad 0.12.8

Docker volume mounts

Nomad 0.12.8 includes security fixes for the handling of Docker volume mounts:

  • The docker.volumes.enabled flag now defaults to false as documented.

  • Docker driver mounts of type "volume" (but not "bind") were not sandboxed and could mount arbitrary locations from the client host. The docker.volumes.enabled configuration will now disable Docker mounts with type "volume" when set to false (the default).

This change impacts jobs that use Docker mounts with type "volume", as shown below. Such a job will fail when placed unless docker.volumes.enabled = true.

mounts = [
  {
    type     = "volume"
    target   = "/path/in/container"
    source   = "docker_volume"
    volume_options = {
      driver_config = {
        name = "local"
        options = [
          {
            device = "/"
            o      = "ro,bind"
            type   = "ext4"
          }
        ]
      }
    }
  }
]

Nomad 0.12.6

Artifact and Template Paths

Nomad 0.12.6 includes security fixes for privilege escalation vulnerabilities in handling of job template and artifact stanzas:

  • The template.source and template.destination fields are now protected by the file sandbox introduced in 0.9.6. These paths are now restricted to fall inside the task directory by default. An operator can opt-out of this protection with the template.disable_file_sandbox field in the client configuration.

  • The paths for template.source, template.destination, and artifact.destination are validated on job submission to ensure the paths do not escape the file sandbox. It was possible to use interpolation to bypass this validation. The client now interpolates the paths before checking if they are in the file sandbox.

~> Warning: Due to a bug in Nomad v0.12.6, the template.destination and artifact.destination paths do not support absolute paths, including the interpolated NOMAD_SECRETS_DIR, NOMAD_TASK_DIR, and NOMAD_ALLOC_DIR variables. This bug is fixed in v0.12.9. To work around the bug, use a relative path.
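
The workaround can be sketched as follows: the destinations are written relative to the task directory instead of using an absolute or interpolated path. The file names, template contents, and artifact URL are placeholders.

```hcl
template {
  data = "listen_port = 8080" # placeholder template contents

  # Relative path; avoid "${NOMAD_TASK_DIR}/app.conf" on the affected versions.
  destination = "local/app.conf"
}

artifact {
  source = "https://example.com/app.tar.gz" # placeholder URL

  # Relative path; avoid absolute or interpolated directories here as well.
  destination = "local/app"
}
```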

Nomad 0.12.0

mbits and Task Network Resource deprecation

Starting in Nomad 0.12.0, the mbits field of the network resource block has been deprecated and is no longer considered when making scheduling decisions. This is in part because we felt that mbits didn't accurately account for network bandwidth as a resource.

Additionally, the use of the network block inside of a task's resource block is also deprecated. Users are advised to move their network block to the group block. Recent networking features have only been added to group-based network configuration. If any use case or feature which was available with the task network resource is not fulfilled by group network configuration, please open an issue detailing the missing capability.

Additionally, the docker driver's port_map configuration is deprecated in favor of the ports field.
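
A migrated job might look roughly like the sketch below, with the network block at the group level and the docker ports field replacing port_map. The group, task, port label, port number, and image are placeholders.

```hcl
group "web" {
  # Network configuration now lives on the group, not in the task's resources block.
  network {
    port "http" {
      to = 8080 # placeholder container port
    }
  }

  task "server" {
    driver = "docker"

    config {
      image = "example/web:1.0" # placeholder image
      ports = ["http"]          # replaces the deprecated port_map
    }
  }
}
```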

Enterprise Licensing

Enterprise binaries for Nomad are now publicly available via releases.hashicorp.com. By default all enterprise features are enabled for 6 hours. During that time enterprise users should apply their license with the nomad license put ... command.

Once the 6 hour demonstration period expires, Nomad will shut down. If restarted, Nomad will shut down again after a very short time unless a valid license is applied.

~> Warning: Due to a bug in Nomad v0.12.0, existing clusters that are upgraded will not have 6 hours to apply a license. The minimal grace period should be sufficient to apply a valid license, but enterprise users are encouraged to delay upgrading until Nomad v0.12.1 is released and fixes the issue.

Docker access host filesystem

Nomad 0.12.0 disables Docker tasks' access to the host filesystem by default. Prior to Nomad 0.12, Docker tasks could mount and then manipulate any host file, which could pose a security risk.

Operators now must explicitly allow tasks to access the host filesystem. Host volumes provide fine-grained access to individual paths.
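
As an illustration of the host volume approach, a client could expose a single directory like this. The volume name and path are assumptions for the example.

```hcl
client {
  # Exposes only this directory to jobs that request the "certs" host volume.
  host_volume "certs" {
    path      = "/etc/ssl/certs" # hypothetical host path
    read_only = true
  }
}
```

A job would then reference the volume by name in a group-level volume block with type = "host" and mount it into the task with volume_mount.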

To restore pre-0.12.0 behavior, you can enable Docker volumes to allow binding host paths by adding the following to the Nomad client configuration file:

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
  }
}

QEMU images

Nomad 0.12.0 restricts the paths that QEMU tasks can load an image from. A QEMU task may still download an image to the allocation directory and load it from there, but images outside the allocation directories must be explicitly allowed by operators in the client agent configuration file.

For example, you may allow loading QEMU images from /mnt/qemu-images by adding the following to the agent configuration file:

plugin "qemu" {
  config {
    image_paths = ["/mnt/qemu-images"]
  }
}

Nomad 0.11.7

Docker volume mounts

Nomad 0.11.7 includes a security fix for the handling of Docker volume mounts. Docker driver mounts of type "volume" (but not "bind") were not sandboxed and could mount arbitrary locations from the client host. The docker.volumes.enabled configuration will now disable Docker mounts with type "volume" when set to false.

This change impacts jobs that use Docker mounts with type "volume", as shown below. Such a job will fail when placed unless docker.volumes.enabled = true.

mounts = [
  {
    type     = "volume"
    target   = "/path/in/container"
    source   = "docker_volume"
    volume_options = {
      driver_config = {
        name = "local"
        options = [
          {
            device = "/"
            o      = "ro,bind"
            type   = "ext4"
          }
        ]
      }
    }
  }
]

Nomad 0.11.5

Artifact and Template Paths

Nomad 0.11.5 includes backported security fixes for privilege escalation vulnerabilities in handling of job template and artifact stanzas:

  • The template.source and template.destination fields are now protected by the file sandbox introduced in 0.9.6. These paths are now restricted to fall inside the task directory by default. An operator can opt-out of this protection with the template.disable_file_sandbox field in the client configuration.
  • The paths for template.source, template.destination, and artifact.destination are validated on job submission to ensure the paths do not escape the file sandbox. It was possible to use interpolation to bypass this validation. The client now interpolates the paths before checking if they are in the file sandbox.

~> Warning: Due to a bug in Nomad v0.11.5, the template.destination and artifact.destination paths do not support absolute paths, including the interpolated NOMAD_SECRETS_DIR, NOMAD_TASK_DIR, and NOMAD_ALLOC_DIR variables. This bug is fixed in v0.11.6. To work around the bug, use a relative path.

Nomad 0.11.3

Nomad 0.11.3 fixes a critical bug causing the Nomad agent to become unresponsive. The issue is due to a Go 1.14.1 runtime bug and affects Nomad 0.11.1 and 0.11.2.

Nomad 0.11.2

Scheduler Scoring Changes

Prior to Nomad 0.11.2, the scheduler algorithm used a node's reserved resources incorrectly during scoring. As a result, scoring was biased in favor of nodes with reserved resources over nodes without reserved resources.

Placements will be more correct but slightly different in v0.11.2 vs earlier versions of Nomad. Operators do not need to take any actions as the impact of the bug fix will only minimally affect scoring.

Feasibility (whether a node is capable of running a job at all) is not affected.

Periodic Jobs and Daylight Saving Time

Nomad 0.11.2 fixed a long-standing bug affecting periodic jobs that are scheduled to run during Daylight Saving Time transitions.

Nomad 0.11.2 provides more clearly defined behavior: Nomad evaluates the cron expression with respect to the specified time zone during the transition. A 2:30am nightly job with the America/New_York time zone will not run on the day daylight saving time starts; similarly, a 1:30am nightly job will run twice on the day daylight saving time ends. See the Daylight Saving Time documentation for details.
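
The 2:30am example corresponds to a periodic block along these lines. The job name is a placeholder and the rest of the job is omitted.

```hcl
job "nightly-report" {
  type = "batch"

  periodic {
    # 2:30am every day, evaluated in the given time zone. This launch is
    # skipped on the day daylight saving time starts in that zone.
    cron      = "30 2 * * *"
    time_zone = "America/New_York"
  }
}
```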

Nomad 0.11.0

client.template: vault_grace deprecation

Nomad 0.11.0 updates consul-template to v0.24.1. This library deprecates the vault_grace option for templating included in Nomad. The feature has been ignored since Vault 0.5 and as long as you are running a more recent version of Vault, you can safely remove vault_grace from your Nomad jobs.

Rkt Task Driver Removed

The rkt task driver has been deprecated and removed from Nomad. While the code is available in an external repository, https://github.com/hashicorp/nomad-driver-rkt, it will not be maintained as rkt is no longer being developed upstream. We encourage all rkt users to find a new task driver as soon as possible.

Nomad 0.10.8

Docker volume mounts

Nomad 0.10.8 includes a security fix for the handling of Docker volume mounts. Docker driver mounts of type "volume" (but not "bind") were not sandboxed and could mount arbitrary locations from the client host. The docker.volumes.enabled configuration will now disable Docker mounts with type "volume" when set to false.

This change impacts jobs that use Docker mounts with type "volume", as shown below. Such a job will fail when placed unless docker.volumes.enabled = true.

mounts = [
  {
    type     = "volume"
    target   = "/path/in/container"
    source   = "docker_volume"
    volume_options = {
      driver_config = {
        name = "local"
        options = [
          {
            device = "/"
            o      = "ro,bind"
            type   = "ext4"
          }
        ]
      }
    }
  }
]

Nomad 0.10.6

Artifact and Template Paths

Nomad 0.10.6 includes backported security fixes for privilege escalation vulnerabilities in handling of job template and artifact stanzas:

  • The template.source and template.destination fields are now protected by the file sandbox introduced in 0.9.6. These paths are now restricted to fall inside the task directory by default. An operator can opt-out of this protection with the template.disable_file_sandbox field in the client configuration.

  • The paths for template.source, template.destination, and artifact.destination are validated on job submission to ensure the paths do not escape the file sandbox. It was possible to use interpolation to bypass this validation. The client now interpolates the paths before checking if they are in the file sandbox.

~> Warning: Due to a bug in Nomad v0.10.6, the template.destination and artifact.destination paths do not support absolute paths, including the interpolated NOMAD_SECRETS_DIR, NOMAD_TASK_DIR, and NOMAD_ALLOC_DIR variables. This bug is fixed in v0.10.7. To work around the bug, use a relative path.

Nomad 0.10.4

Same-Node Scheduling Penalty Removed

Nomad 0.10.4 includes a fix to the scheduler that removes the same-node penalty for allocations that have not previously failed. In earlier versions of Nomad, the node where an allocation was running was penalized from receiving updated versions of that allocation, resulting in a higher chance of the allocation being placed on a new node. This was changed so that the penalty only applies to nodes where the previous allocation has failed or been rescheduled, to reduce the risk of correlated failures on a host. Scheduling weighs a number of factors, but this change should reduce movement of allocations that are being updated from a healthy state. You can view the placement metrics for an allocation with nomad alloc status -verbose.

Additional Environment Variable Filtering

Nomad will by default prevent certain environment variables set in the client process from being passed along into launched tasks. The CONSUL_HTTP_TOKEN environment variable has been added to the default list. More information can be found in the env.blacklist configuration.

Nomad 0.10.3

mTLS Certificate Validation

Nomad 0.10.3 includes a fix for a privilege escalation vulnerability in validating TLS certificates for RPC with mTLS. Nomad RPC endpoints validated that TLS client certificates had not expired and were signed by the same CA as the Nomad node, but did not correctly check the certificate's name for the role and region as described in the Securing Nomad with TLS guide. This allows trusted operators with a client certificate signed by the CA to send RPC calls as a Nomad client or server node, bypassing access control and accessing any secrets available to a client.

Nomad clusters configured for mTLS following the Securing Nomad with TLS guide or the Vault PKI Secrets Engine Integration guide should already have certificates that will pass validation. Before upgrading to Nomad 0.10.3, operators using mTLS with verify_server_hostname = true should confirm that the common name or SAN of all Nomad client node certs is client.<region>.nomad, and that the common name or SAN of all Nomad server node certs is server.<region>.nomad.
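
For reference, a server agent's tls block under that setup could be sketched as follows. The file paths are assumptions; what matters is that the certificate's common name or SAN matches the role and region, for example server.global.nomad in the default "global" region.

```hcl
tls {
  http = true
  rpc  = true

  # Hypothetical file locations.
  ca_file   = "/etc/nomad.d/tls/nomad-ca.pem"
  cert_file = "/etc/nomad.d/tls/server.pem" # CN or SAN: server.<region>.nomad
  key_file  = "/etc/nomad.d/tls/server-key.pem"

  verify_server_hostname = true
}
```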

Connection Limits Added

Nomad 0.10.3 introduces the limits agent configuration parameters for mitigating denial of service attacks from users who are not authenticated via mTLS. The default limits stanza is:

limits {
  https_handshake_timeout   = "5s"
  http_max_conns_per_client = 100
  rpc_handshake_timeout     = "5s"
  rpc_max_conns_per_client  = 100
}

If your Nomad agent's endpoints are protected from unauthenticated users via other mechanisms these limits may be safely disabled by setting them to 0.
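
Disabling all four limits could be sketched like this, assuming the agent's endpoints really are protected by some other mechanism:

```hcl
limits {
  # A value of 0 disables each limit.
  https_handshake_timeout   = "0"
  http_max_conns_per_client = 0
  rpc_handshake_timeout     = "0"
  rpc_max_conns_per_client  = 0
}
```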

However the defaults were chosen to be safe for a wide variety of Nomad deployments and may protect against accidental abuses of the Nomad API that could cause unintended resource usage.

Nomad 0.10.2

Preemption Panic Fixed

Nomad 0.9.7 and 0.10.2 fix a server crashing bug present in scheduler preemption since 0.9.0. Users unable to immediately upgrade Nomad can disable preemption to avoid the panic.

Dangling Docker Container Cleanup

Nomad 0.10.2 addresses an issue occurring in heavily loaded clients, where containers are started without being properly managed by Nomad. Nomad 0.10.2 introduced a reaper that detects and kills such containers.

Operators may opt to run the reaper in dry-run mode or disable it through client configuration.
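
A sketch of the docker plugin configuration for a dry run is shown below. The period and grace values are illustrative assumptions.

```hcl
plugin "docker" {
  config {
    gc {
      dangling_containers {
        enabled = true  # set to false to turn the reaper off entirely
        dry_run = true  # only log the containers that would be killed

        period         = "5m" # how often the reaper runs (illustrative)
        creation_grace = "5m" # ignore containers younger than this (illustrative)
      }
    }
  }
}
```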

For more information, see Docker Dangling containers.

Nomad 0.10.0

Deployments

Nomad 0.10 enables rolling deployments for service jobs by default and adds a default update stanza when a service job is created or updated. This does not affect jobs with an update stanza.

In pre-0.10 releases, when updating a service job without an update stanza, all existing allocations are stopped while new allocations start up, and this may cause a service degradation or an outage. You can regain this behavior and disable deployments by setting max_parallel to 0.
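
In job specification terms, opting out might look like this sketch. The job name is a placeholder and the groups are omitted.

```hcl
job "web" {
  type = "service"

  # Disables deployments, restoring the pre-0.10 stop-and-replace behavior.
  update {
    max_parallel = 0
  }
}
```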

For more information, see update stanza.

Nomad 0.9.5

Template Rendering

Nomad 0.9.5 includes security fixes for privilege escalation vulnerabilities in handling of job template stanzas:

  • The client host's environment variables are now cleaned before rendering the template. If a template includes the env function, the job should include an env stanza to allow access to the variable in the template.

  • The plugin function is no longer permitted by default and will raise an error if used in a template. Operators can opt in to permitting this function with the new template.function_blacklist field in the client configuration.

  • The file function has been changed to restrict paths to fall inside the task directory by default. Paths that used the NOMAD_TASK_DIR environment variable to prefix file paths should work unchanged. Relative paths or symlinks that point outside the task directory will raise an error. An operator can opt-out of this protection with the new template.disable_file_sandbox field in the client configuration.
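
For operators who decide to opt out, the two client fields named above could be set as in the sketch below. Doing so removes the protections described in this section, so it is shown only to illustrate where the fields live.

```hcl
client {
  template {
    # An empty list re-permits the plugin function, which is blocked by default.
    function_blacklist = []

    # Allows the file function to read paths outside the task directory.
    disable_file_sandbox = true
  }
}
```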

Nomad 0.9.0

Preemption

Nomad 0.9 adds preemption support for system jobs. If a system job is submitted that has a higher priority than other running jobs on the node, and the node does not have capacity remaining, Nomad may preempt those lower priority allocations to place the system job. See preemption for more details.

Task Driver Plugins

All task drivers have become plugins in Nomad 0.9.0. There are two user visible differences between 0.8 and 0.9 drivers:

  • LXC is now community supported and distributed independently.

  • Task driver config stanzas are no longer validated by the nomad job validate command. This is a regression that will be fixed in a future release.

There is a new method for client driver configuration options, but existing client.options settings are supported in 0.9. See plugin configuration for details.

LXC

LXC is now an external plugin and must be installed separately. See the LXC driver's documentation for details.

Structured Logging

Nomad 0.9.0 switches to structured logging. Any log processing on the pre-0.9 log output will need to be updated to match the structured output.

Structured log lines have the format:

# <Timestamp> [<Level>] <Component>: <Message>: <KeyN>=<ValueN> ...

2019-01-29T05:52:09.221Z [INFO ] client.plugin: starting plugin manager: plugin-type=device

Values containing whitespace will be quoted:

... starting plugin: task=redis args="[/opt/gopath/bin/nomad logmon]"

HCL2 Transition

Nomad 0.9.0 begins a transition to HCL2, the next version of the HashiCorp configuration language. While Nomad has begun integrating HCL2, users will need to continue to use HCL1 in Nomad 0.9.0 as the transition is incomplete.

If you interpolate variables in your task.config containing consecutive dots in their name, you will need to change your job specification to use the env map. See the following example:

env {
  # Note the multiple consecutive dots
  image...version = "3.2"

  # Valid in both v0.8 and v0.9
  image.version = "3.2"
}

# v0.8 task config stanza:
task {
  driver = "docker"
  config {
    image = "redis:${image...version}"
  }
}

# v0.9 task config stanza:
task {
  driver = "docker"
  config {
    image = "redis:${env["image...version"]}"
  }
}

This only affects users who interpolate unusual variables with multiple consecutive dots in their task config stanza. All other interpolation is unchanged.

Since HCL2 uses dotted object notation for interpolation, users should transition away from variable names with multiple consecutive dots.

Downgrading clients

Due to the large refactor of the Nomad client in 0.9, downgrading to a previous version of the client after upgrading it to Nomad 0.9 is not supported. To downgrade safely, users should erase the Nomad client's data directory.

port_map Environment Variable Changes

Before Nomad 0.9.0, ports mapped via a task driver's port_map stanza could be interpolated via the NOMAD_PORT_<label> environment variables.

However, in Nomad 0.9.0 no parameters in a driver's config stanza, including its port_map, are available for interpolation. This means {{ env NOMAD_PORT_<label> }} in a template stanza or HTTP_PORT = "${NOMAD_PORT_http}" in an env stanza will now interpolate the host ports, not the container's.

Nomad 0.10 introduced Task Group Networking, which natively supports port mapping without relying on task driver specific port_map fields. The to field on group network port stanzas will be interpolated properly. Please see the network stanza documentation for details.
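The group-level alternative can be sketched as follows, assuming Nomad 0.10 or later. The group, task, port label, image, and port number are all hypothetical; only the to field and the NOMAD_PORT_<label> variable come from the text above.

```hcl
group "web" {
  network {
    mode = "bridge"

    # "http" is a hypothetical port label. The "to" field names the port
    # inside the task's network namespace that the host port maps to.
    port "http" {
      to = 8080
    }
  }

  task "server" {
    driver = "docker"

    config {
      image = "example/web:1.0" # hypothetical image
    }

    env {
      # Interpolated from the group network port, not a driver port_map.
      HTTP_PORT = "${NOMAD_PORT_http}"
    }
  }
}
```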

Nomad 0.8.0

Raft Protocol Version Compatibility

When upgrading to Nomad 0.8.0 from a version lower than 0.7.0, users will need to set the raft_protocol option in their server stanza to 1 in order to maintain backwards compatibility with the old servers during the upgrade. After the servers have been migrated to version 0.8.0, raft_protocol can be moved up to 2 and the servers restarted to match the default.

The Raft protocol must be stepped up in this way; only adjacent version numbers are compatible (for example, version 1 cannot talk to version 3). Here is a table of the Raft Protocol versions supported by each Nomad version:

| Version         | Supported Raft Protocols |
| --------------- | ------------------------ |
| 0.6 and earlier | 0                        |
| 0.7             | 1                        |
| 0.8 and later   | 1, 2, 3                  |

In order to enable all Autopilot features, all servers in a Nomad cluster must be running with Raft protocol version 3 or later.
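The stepped upgrade described in this section can be sketched as a server stanza that is edited in two passes. This is a minimal illustration using only the raft_protocol option named above; the rest of the server configuration is omitted.

```hcl
server {
  enabled = true

  # Pass 1: while servers older than 0.8.0 remain in the cluster, pin the
  # protocol to 1 so the new servers stay compatible with them.
  raft_protocol = 1

  # Pass 2: once every server runs 0.8.0, raise the value to 2 and restart
  # the servers. Move to 3 in a later pass if Autopilot features are needed.
  # raft_protocol = 2
}
```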

Upgrading to Raft Protocol 3

This section provides details on upgrading to Raft Protocol 3 in Nomad 0.8 and higher. Raft protocol version 3 requires Nomad running 0.8.0 or newer on all servers in order to work. See Raft Protocol Version Compatibility for more details. Also, the format of peers.json used for outage recovery is different when running with the latest Raft protocol. See Manual Recovery Using peers.json for a description of the required format.

Please note that the Raft protocol is different from Nomad's internal protocol as shown in commands like nomad server members. To see the version of the Raft protocol in use on each server, use the nomad operator raft list-peers command.

The easiest way to upgrade servers is to have each server leave the cluster, upgrade its raft_protocol version in the server stanza, and then add it back. Make sure the new server joins successfully and that the cluster is stable before rolling the upgrade forward to the next server. It's also possible to stand up a new set of servers, and then slowly stand down each of the older servers in a similar fashion.

When using Raft protocol version 3, servers are identified by their node-id instead of their IP address when Nomad makes changes to its internal Raft quorum configuration. This means that once a cluster has been upgraded with servers all running Raft protocol version 3, it will no longer allow servers running any older Raft protocol versions to be added. If running a single Nomad server, restarting it in-place will result in that server not being able to elect itself as a leader. To avoid this, either set the Raft protocol back to 2, or use Manual Recovery Using peers.json to map the server to its node ID in the Raft quorum configuration.

Node Draining Improvements

Node draining via the node drain command or the drain API has been substantially changed in Nomad 0.8. In Nomad 0.7.1 and earlier, draining a node would immediately stop all allocations on the node being drained. Nomad 0.8 now supports a migrate stanza in job specifications to control how many allocations may be migrated at once. Existing jobs that do not specify a migrate stanza will use the default.

The drain command now blocks until the drain completes. To get the Nomad 0.7.1 and earlier drain behavior, use the command: nomad node drain -enable -force -detach <node-id>

See the migrate stanza documentation and Decommissioning Nodes guide for details.
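For orientation, a migrate stanza inside a task group might look like the sketch below. The parameter names are taken from the migrate stanza documentation rather than from this page, and the group name and values are illustrative assumptions; consult that documentation for the authoritative list and defaults.

```hcl
group "web" {
  migrate {
    # How many allocations of this group may be migrated at once.
    max_parallel = 1

    # How allocation health is determined before the drain continues.
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }
}
```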

Periods in Environment Variable Names No Longer Escaped

Applications which expect periods in environment variable names to be replaced with underscores must be updated.

In Nomad 0.7, periods (.) in environment variable names were replaced with an underscore in both the env and template stanzas.

In Nomad 0.8, periods are not replaced and will be included in environment variables verbatim.

For example the following stanza:

env {
  registry.consul.addr = "${NOMAD_IP_http}:8500"
}

In Nomad 0.7, this variable would be exposed to the task as registry_consul_addr=127.0.0.1:8500. In Nomad 0.8, it will now appear exactly as specified: registry.consul.addr=127.0.0.1:8500.

Client APIs Unavailable on Older Nodes

Because Nomad 0.8 uses a new RPC mechanism to route node-specific APIs like nomad alloc fs through servers to the node, 0.8 CLIs cannot use these commands against clients older than 0.8.

To access these commands on older clients either continue to use a pre-0.8 version of the CLI, or upgrade all clients to 0.8.

CLI Command Changes

Nomad 0.8 has changed the organization of CLI commands to be based on subcommands. An example of this change is the change from nomad alloc-status to nomad alloc status. All commands have been made to be backwards compatible, but operators should update any usage of the old style commands to the new style as the old style will be deprecated in future versions of Nomad.

RPC Advertise Address

The advertised RPC address is now used only to advertise the RPC address of servers to client nodes. Server-to-server communication is done using the advertised Serf address. Existing clusters should not be affected, but the advertised RPC address may need to be updated to allow clients to connect over a NAT.

Nomad 0.6.0

Default advertise address changes

When no advertise address was specified and Nomad's bind_addr was loopback or 0.0.0.0, Nomad attempted to resolve the local hostname to use as an advertise address.

Many hosts cannot properly resolve their hostname, so Nomad 0.6 defaults advertise to the first private IP on the host (e.g. 10.1.2.3).

If you manually configure advertise addresses, no changes are necessary.
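A manually configured advertise block, which sidesteps the new default entirely, might look like this sketch. The address reuses the 10.1.2.3 example from above and should be replaced with an IP that other agents can actually reach.

```hcl
bind_addr = "0.0.0.0"

# With explicit advertise addresses, Nomad does not need to pick a default.
advertise {
  http = "10.1.2.3"
  rpc  = "10.1.2.3"
  serf = "10.1.2.3"
}
```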

Nomad Clients

The change to the default advertised IP also affects clients that do not specify which network_interface to use. If you have several routable IPs, it is advised to configure the client's network interface such that tasks bind to the correct address.
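Pinning the client to a specific interface can be sketched as follows. The interface name eth1 is a hypothetical placeholder; use the interface that carries the address tasks should bind to.

```hcl
client {
  enabled = true

  # Hypothetical interface name, chosen for illustration only.
  network_interface = "eth1"
}
```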

Nomad 0.5.5

Docker load changes

Nomad 0.5.5 has a backward incompatible change in the docker driver's configuration. Prior to 0.5.5, the load configuration option accepted a list of images to load; in 0.5.5 it has been changed to a single string. No functionality was changed: even if more than one item was specified prior to 0.5.5, only the first item was used.
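The change to the load option amounts to the following difference in the docker driver's config block. The image and archive names are hypothetical placeholders.

```hcl
# Before 0.5.5: load accepted a list, but only the first item was used.
config {
  image = "example-image:1.0"
  load  = ["example-image.tar"]
}

# 0.5.5 and later: load is a single string.
config {
  image = "example-image:1.0"
  load  = "example-image.tar"
}
```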

To do a zero-downtime deploy with jobs that use the load option:

  • Upgrade servers to version 0.5.5 or later.

  • Deploy new client nodes on the same version as the servers.

  • Resubmit jobs with the load option fixed and a constraint to only run on version 0.5.5 or later:

    constraint {
      attribute = "${attr.nomad.version}"
      operator  = "version"
      value     = ">= 0.5.5"
    }
  • Drain and shut down old client nodes.

Validation changes

Due to internal job serialization and validation changes, you may run into issues using 0.5.5 command line tools such as nomad run and nomad validate with 0.5.4 or earlier agents.

It is recommended you upgrade agents before or alongside your command line tools.

Nomad 0.4.0

Nomad 0.4.0 has backward incompatible changes in the logic for Consul deregistration. When a Task which was started by Nomad v0.3.x is uncleanly shut down, the Nomad 0.4 Client will no longer clean up any stale services. If an in-place upgrade of the Nomad client to 0.4 prevents the Task from gracefully shutting down and deregistering its Consul-registered services, the Nomad Client will not clean up the remaining Consul services registered with the 0.3 Executor.

We recommend draining a node before upgrading to 0.4.0 and then re-enabling the node once the upgrade is complete.

Nomad 0.3.1

Nomad 0.3.1 removes artifact downloading from driver configurations and makes artifacts a first-class element of the task. As such, jobs will have to be rewritten in the proper format and resubmitted to Nomad. Nomad clients will properly re-attach to existing tasks, but job definitions must be updated before they can be dispatched to clients running 0.3.1.
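A rough sketch of the rewritten format is shown below, assuming an artifact block with a source field at the task level. The task name, URL, and command are hypothetical, and the exact fields supported in 0.3.1 should be checked against the artifact documentation for that release.

```hcl
task "example" {
  driver = "exec"

  # Artifacts are declared on the task itself, not in the driver config.
  artifact {
    source = "https://example.com/example-binary" # hypothetical URL
  }

  config {
    # Hypothetical command; the real path depends on where the client
    # places the downloaded artifact.
    command = "example-binary"
  }
}
```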

Nomad 0.3.0

Nomad 0.3.0 has made several substantial changes to job files, including a new log block and variable interpretation syntax (${var}), a modified restart policy syntax, and minimum resources for tasks as well as validation. These changes require a slight change to the default upgrade flow.

After upgrading the version of the servers, all previously submitted jobs must be resubmitted with the updated job syntax using a Nomad 0.3.0 binary.

  • All instances of $var must be converted to the new syntax of ${var}

  • All tasks must provide their required resources for CPU, memory and disk as well as required network usage if ports are required by the task.

  • Restart policies must be updated to indicate whether it is desired for the task to restart on failure or to fail using mode = "delay" or mode = "fail" respectively.

  • Service names that include periods will fail validation. To fix, remove any periods from the service name before running the job.
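The requirements above can be sketched together in one job fragment. All names and values here are illustrative assumptions, and the field names inside restart and resources are recalled from the job specification of that era rather than taken from this page, so verify them against the 0.3.0 documentation before relying on them.

```hcl
group "example" {
  # Restart policies must state the mode explicitly: "delay" or "fail".
  restart {
    attempts = 3
    interval = "10m"
    delay    = "25s"
    mode     = "delay"
  }

  task "server" {
    driver = "docker"

    # Interpolation uses ${var}, not the older $var form.
    constraint {
      attribute = "${attr.kernel.name}"
      value     = "linux"
    }

    config {
      image = "example/server:1.0" # hypothetical image
    }

    # Service names must not contain periods.
    service {
      name = "example-server"
      port = "http"
    }

    # CPU, memory, and disk are required, plus network when ports are used.
    resources {
      cpu    = 500
      memory = 256
      disk   = 300

      network {
        mbits = 10
        port "http" {}
      }
    }
  }
}
```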

After updating the Servers and job files, Nomad Clients can be upgraded by first draining the node so no tasks are running on it. This can be verified by running nomad node status <node-id> and confirming that there are no tasks in the running state. Once that is done, the client can be killed, the data_dir should be deleted, and then Nomad 0.3.0 can be launched.