feat: [DC-761] Update nomad v0.10.4 #11
Merged
Conversation
alloc.Job may be stale as well, so we need to migrate it. This does cost extra cycles, but the overhead should be negligible.
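A minimal Go sketch of that migration, using simplified stand-ins for Nomad's structs (the type and field names here are illustrative, not the actual patch):

```go
package main

// Simplified stand-ins for Nomad's structs; the real types live in
// nomad/structs and carry many more fields.
type Job struct{ Version uint64 }
type Allocation struct{ Job *Job }

// maybeMigrateAllocJob swaps in the freshly fetched job when the
// alloc's embedded copy is stale. The version check costs a few
// extra cycles per alloc, but the overhead should be negligible.
func maybeMigrateAllocJob(alloc *Allocation, latest *Job) {
	if alloc.Job == nil || alloc.Job.Version < latest.Version {
		alloc.Job = latest
	}
}

func main() {
	alloc := &Allocation{Job: &Job{Version: 1}}
	maybeMigrateAllocJob(alloc, &Job{Version: 2})
}
```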
This changeset is part of the work to improve our E2E provisioning process to allow our upgrade tests:
* Move more of the setup into the AMI image creation so it's a little more obvious to provisioning config authors which bits are essential to deploying a specific version of Nomad.
* Make the service file update do a systemd daemon-reload so that we can update an already-running cluster with the same script we use to deploy it initially (see the sketch below).
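A hedged Go sketch of the daemon-reload step as the provisioning framework might run it on a node; the function name and command sequencing are assumptions, not the actual deploy script:

```go
package main

import (
	"fmt"
	"os/exec"
)

// reloadAndRestart re-reads systemd unit files so an updated
// nomad.service takes effect, then restarts the agent. It assumes a
// systemd host and sufficient privileges; the real script may
// sequence this differently.
func reloadAndRestart() error {
	for _, args := range [][]string{
		{"systemctl", "daemon-reload"},
		{"systemctl", "restart", "nomad"},
	} {
		if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v failed: %v: %s", args, err, out)
		}
	}
	return nil
}

func main() {
	if err := reloadAndRestart(); err != nil {
		fmt.Println(err)
	}
}
```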
This builds on API changes in hashicorp#6017 and hashicorp#6021 to conditionally turn off the “Run Job” button based on the current token’s capabilities, or the capabilities of the anonymous policy if no token is present. If you try to visit the job-run route directly, it redirects to the job list.
Group service checks cannot interpolate task fields, because the task fields are not available at the time the script check hook is created for the group service. When f31482a was merged, this e2e test began failing because we now correctly match the script check ID to the service ID, which revealed that this jobspec was invalid.
I originally planned to add component documentation, but as this dragged on and I found that JSDoc-to-Markdown sometimes needed hand-tuning, I decided to skip it and focus on replicating what was already present in Freestyle. Adding documentation is a finite task that can be revisited in the future. My goal was to migrate everything from Freestyle with as few changes as possible. Some adaptations I found necessary:
* the DelayedArray and DelayedTruth utilities that delay component rendering until slightly after initial render, because without them:
  * charts were rendering with zero width
  * the JSON viewer was rendering with empty content
* Storybook in Ember renders components in a routerless/controllerless context by default, so some component stories needed changes:
  * table pagination/sorting stories need access to query params, which necessitates some reaching into Ember internals to start routing and dynamically generate a Storybook route/controller to render components into
  * some stories have a faux controller as part of their Storybook context that hosts setInterval-linked dynamic computed properties
* some jiggery-pokery with anchor tags:
  * inert href='#' had to become href='javascript:;'
  * links that are actually meant to navigate need target='_parent' so they don't navigate inside the Storybook iframe

Maybe some of these could be addressed by fixes in ember-cli-storybook, but I'm wary of digging around in there any more than I already have, as I've lost a lot of time to Storybook confusion and frustration already 😞

The STORYBOOK=true environment variable tweaks some environment settings to get things working as expected in the Storybook context.

I chose to:
* use angle bracket invocation within stories rather than have to migrate them soon after having moved to Storybook
* keep Freestyle around for now for its palette and typeface components
The e2e framework instantiates clients for Nomad/Consul, but the provisioning of the actual Nomad cluster is left to Terraform. The Terraform provisioning process uses `remote-exec` to deploy specific versions of Nomad so that we don't have to bake an AMI every time we want to test a new version. But Terraform treats the resulting instances as immutable, so we can't use the same tooling to update the version of Nomad in-place. This is a prerequisite for upgrade testing.

This changeset extends the e2e framework to provide the option of deploying Nomad (and, in the future, Consul/Vault) with specific versions to running infrastructure. This initial implementation is focused on deploying to a single cluster via `ssh` (because that's our current need), but it provides interfaces to hook the test run at the start of the run, the start of each suite, or the start of a given test case.

Terraform work includes:
* provides Terraform output, written to JSON, that the framework uses to configure provisioning via `terraform output provisioning` (see the sketch below)
* provides Terraform output that can be used by test operators to configure their shell via `$(terraform output environment)`
* drops `remote-exec` provisioning steps from Terraform
* makes changes to the deployment scripts to ensure they can be run multiple times w/ different versions against the same host
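For illustration, a minimal Go sketch of how a framework could consume that JSON output; the `ProvisioningTargets` shape and its field names are hypothetical, not the framework's actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// ProvisioningTargets is a hypothetical shape for the JSON the
// framework reads; the real schema in the e2e framework may differ.
type ProvisioningTargets struct {
	Servers []string `json:"servers"`
	Clients []string `json:"clients"`
}

// loadProvisioning shells out to Terraform and decodes the named
// output, mirroring how a framework could configure provisioning.
func loadProvisioning() (*ProvisioningTargets, error) {
	out, err := exec.Command("terraform", "output", "-json", "provisioning").Output()
	if err != nil {
		return nil, fmt.Errorf("terraform output: %w", err)
	}
	var targets ProvisioningTargets
	if err := json.Unmarshal(out, &targets); err != nil {
		return nil, fmt.Errorf("decoding provisioning output: %w", err)
	}
	return &targets, nil
}

func main() {
	targets, err := loadProvisioning()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("servers:", targets.Servers, "clients:", targets.Clients)
}
```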
Fixes a bug introduced in 0aa58b9 where we're writing a test file to a taskdir-interpolated location, which works when we `alloc exec` but not in the jobspec for a group script check. This changeset also makes the test safe to run multiple times by namespacing the file with the alloc ID, which has the added bonus of exercising our alloc interpolation code for group script checks.
If an existing system allocation is running and the node it's running on is marked as ineligible, subsequent plan/apply operations return an RPC error instead of a more helpful plan result. This change logs the error and appends a failedTGAlloc for the placement (sketched below).
…ineligible Return FailedTGAlloc metric instead of no node err
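A minimal sketch of that failure path, using simplified stand-ins for the scheduler's types; `recordIneligibleNode` is a hypothetical helper for illustration, not the actual patch:

```go
package main

import (
	"errors"
	"log"
	"os"
)

// Simplified stand-ins for the scheduler's types; the real scheduler
// uses structs.AllocMetric and richer logging.
type AllocMetric struct{ NodesFiltered int }

type SystemScheduler struct {
	logger         *log.Logger
	failedTGAllocs map[string]*AllocMetric
}

// recordIneligibleNode logs the condition and records a failed
// task-group allocation, so the plan result reports the placement
// failure instead of surfacing a bare RPC error.
func (s *SystemScheduler) recordIneligibleNode(tgName string, err error) {
	s.logger.Printf("[WARN] sched.system: %v", err)
	if s.failedTGAllocs == nil {
		s.failedTGAllocs = map[string]*AllocMetric{}
	}
	s.failedTGAllocs[tgName] = &AllocMetric{NodesFiltered: 1}
}

func main() {
	s := &SystemScheduler{logger: log.New(os.Stderr, "", log.LstdFlags)}
	s.recordIneligibleNode("web", errors.New("node is ineligible"))
}
```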
Fixes a deadlock in leadership handling if leadership flapped.

Raft propagates leadership transitions to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so the channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader` until the channel is consumed[2] and does not move on to executing follower-related logic (in `raft.runFollower`). While Raft's `runLeader` defer function blocks, Raft cannot process any other Raft operations. For example, the `run{Leader|Follower}` methods consume `raft.applyCh`, and while the runLeader defer is blocked, all Raft log applications and config lookups will block indefinitely.

Sadly, `leaderLoop` and `establishLeadership` make a few Raft calls! `establishLeadership` attempts to auto-create the autopilot/scheduler config [3], and `leaderLoop` attempts to check the Raft configuration [4]. All of these calls occur without a timeout. Thus, if leadership flapped quickly while `leaderLoop`/`establishLeadership` was invoked and hit any of these Raft calls, the Raft handlers _deadlock_ forever.

Depending on how many times it flapped and where exactly we get stuck, I suspect it's possible to get into the following case:
* Agent metrics/stats HTTP and RPC calls hang as they check raft.Configurations
* raft.State remains in the Leader state, and the server attempts to handle RPC calls (e.g. node/alloc updates), and these hang as well

As we create goroutines per RPC call, the number of goroutines grows over time and may trigger out-of-memory errors in addition to missed updates.

[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193
[2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436
[3] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L198-L202
[4] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L877
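For illustration, a minimal Go sketch of the "buffered or aggressively consumed" pattern the Raft docs call for; `leadershipPump` is a hypothetical helper, not Nomad's actual fix:

```go
package main

import "fmt"

// leadershipPump forwards leadership transitions without ever
// leaving Raft's writer blocked: it is always ready to receive from
// notifyCh, and rapid flapping collapses to the most recent state
// before the (potentially slow) leader loop sees it.
func leadershipPump(notifyCh <-chan bool, out chan<- bool) {
	var pending *bool
	for {
		if pending == nil {
			v, ok := <-notifyCh
			if !ok {
				return
			}
			pending = &v
		}
		select {
		case v, ok := <-notifyCh:
			if !ok {
				return
			}
			pending = &v // overwrite: keep only the latest transition
		case out <- *pending:
			pending = nil
		}
	}
}

func main() {
	notifyCh := make(chan bool, 1) // buffered, as the Raft docs advise
	out := make(chan bool)
	go leadershipPump(notifyCh, out)
	notifyCh <- true
	notifyCh <- false // flap: the pump may deliver only the latest state
	fmt.Println("leader:", <-out)
}
```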
website: add ‘intro to nomad’ video to /intro
Update ecs.html.md
Update configuring-tasks.html.md
System sched e2e
PR hashicorp#6065 was intended to be backported to v0.9.6 to fix issue hashicorp#6223. However, it appears not to have been backported:
* https://github.com/hashicorp/nomad/blob/v0.9.6/client/allocrunner/taskrunner/task_runner.go#L1349-L1351
* https://github.com/hashicorp/nomad/blob/v0.9.7/client/allocrunner/taskrunner/task_runner.go#L1349-L1351

The fix was included in v0.10.0:
* https://github.com/hashicorp/nomad/blob/v0.10.0/client/allocrunner/taskrunner/task_runner.go#L1363-L1370
docs: hashicorp#6065 shipped in v0.10.0, not v0.9.6
e2e: wait 2m rather than 10s after disabling consul acls
…stent gutter menu
…ver-buttons UI: Explicit transparent bg on popover actions
…ot-full-width UI: Override the max-width on mobile to avoid losing space due to non-existent gutter menu
Note that as of 0.10.4, Nomad Windows binaries will be signed. [ci skip]
changelog: Windows binaries being signed
change log for remote pprof endpoints
The Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same name. Note that consul-template uses CONSUL_TOKEN, which Nomad also uses, so be careful to preserve any references to that in the consul-template context (see the sketch below).
nomad: unset consul token on job register
command: use consistent CONSUL_HTTP_TOKEN name
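A minimal sketch of that resolution order; `consulToken` is a hypothetical helper, and the fallback precedence is an assumption rather than Nomad's documented behavior:

```go
package main

import (
	"fmt"
	"os"
)

// consulToken resolves the Consul ACL token the way the Consul CLI
// does, preferring CONSUL_HTTP_TOKEN. The CONSUL_TOKEN fallback is
// illustrative of the consul-template context mentioned above, not
// necessarily Nomad's exact precedence.
func consulToken() string {
	if tok := os.Getenv("CONSUL_HTTP_TOKEN"); tok != "" {
		return tok
	}
	return os.Getenv("CONSUL_TOKEN")
}

func main() {
	fmt.Println("token set:", consulToken() != "")
}
```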
2794 files changed... I will start now and finish when the Corona passes.
Manicqin approved these changes Mar 16, 2020
Update master to 0.10.4