-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Consul Connect with Consul ACLs enabled #6701
Comments
As leader, check to see if a ClusterID has already been generated and replicated through the raft log. If one has not been generated, generate a UUID and push it through. If a ClusterID has been generated, its value is applied into the state store. The value is not actually used anywhere in this changeset, but is a prerequisite for gh-6701.
As leader, check to see if a ClusterID has already been generated and replicated through the raft log. If one has not been generated, generate a UUID and push it through. If a ClusterID has been generated, its value is applied into the state store. The value is not actually used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and push it through. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for gh-6701.
* actually always canonicalize alloc.Job alloc.Job may be stale as well and need to migrate it. It does cost extra cycles but should be negligible. * e2e: improve reusability of provisioning scripts (hashicorp#6942) This changeset is part of the work to improve our E2E provisioning process to allow our upgrade tests: * Move more of the setup into the AMI image creation so it's a little more obvious to provisioning config authors which bits are essential to deploying a specific version of Nomad. * Make the service file update do a systemd daemon-reload so that we can update an already-running cluster with the same script we use to deploy it initially. * Avoid unnecessary golang version reference * add a script to update golang version * Update golang to 1.12.15 * Update ecs.html.md * Update configuring-tasks.html.md * ui: Change Run Job availability based on ACLs (hashicorp#5944) This builds on API changes in hashicorp#6017 and hashicorp#6021 to conditionally turn off the “Run Job” button based on the current token’s capabilities, or the capabilities of the anonymous policy if no token is present. If you try to visit the job-run route directly, it redirects to the job list. * Update changelog * e2e: use valid jobspec for group check test (hashicorp#6967) Group service checks cannot interpolate task fields, because the task fields are not available at the time the script check hook is created for the group service. When f31482a was merged this e2e test began failing because we are now correctly matching the script check ID to the service ID, which revealed this jobspec was invalid. * UI: Migrate to Storybook (hashicorp#6507) I originally planned to add component documentation, but as this dragged on and I found that JSDoc-to-Markdown sometimes needed hand-tuning, I decided to skip it and focus on replicating what was already present in Freestyle. Adding documentation is a finite task that can be revisited in the future. My goal was to migrate everything from Freestyle with as few changes as possible. Some adaptations that I found necessary: • the DelayedArray and DelayedTruth utilities that delay component rendering until slightly after initial render because without them: ◦ charts were rendering with zero width ◦ the JSON viewer was rendering with empty content • Storybook in Ember renders components in a routerless/controllerless context by default, so some component stories needed changes: ◦ table pagination/sorting stories access to query params, which necessitates some reaching into Ember internals to start routing and dynamically generate a Storybook route/controller to render components into ◦ some stories have a faux controller as part of their Storybook context that hosts setInterval-linked dynamic computed properties • some jiggery-pokery with anchor tags ◦ inert href='#' had to become href='javascript:; ◦ links that are actually meant to navigate need target='_parent' so they don’t navigate inside the Storybook iframe Maybe some of these could be addressed by fixes in ember-cli-storybook but I’m wary of digging around in there any more than I already have, as I’ve lost a lot of time to Storybook confusion and frustrations already 😞 The STORYBOOK=true environment variable tweaks some environment settings to get things working as expected in the Storybook context. I chose to: • use angle bracket invocation within stories rather than have to migrate them soon after having moved to Storybook • keep Freestyle around for now for its palette and typeface components * e2e: update framework to allow deploying Nomad (hashicorp#6969) The e2e framework instantiates clients for Nomad/Consul but the provisioning of the actual Nomad cluster is left to Terraform. The Terraform provisioning process uses `remote-exec` to deploy specific versions of Nomad so that we don't have to bake an AMI every time we want to test a new version. But Terraform treats the resulting instances as immutable, so we can't use the same tooling to update the version of Nomad in-place. This is a prerequisite for upgrade testing. This changeset extends the e2e framework to provide the option of deploying Nomad (and, in the future, Consul/Vault) with specific versions to running infrastructure. This initial implementation is focused on deploying to a single cluster via `ssh` (because that's our current need), but provides interfaces to hook the test run at the start of the run, the start of each suite, or the start of a given test case. Terraform work includes: * provides Terraform output that written to JSON used by the framework to configure provisioning via `terraform output provisioning`. * provides Terraform output that can be used by test operators to configure their shell via `$(terraform output environment)` * drops `remote-exec` provisioning steps from Terraform * makes changes to the deployment scripts to ensure they can be run multiple times w/ different versions against the same host. * e2e: ensure group script check tests interpolation (hashicorp#6972) Fixes a bug introduced in 0aa58b9 where we're writing a test file to a taskdir-interpolated location, which works when we `alloc exec` but not in the jobspec for a group script check. This changeset also makes the test safe to run multiple times by namespacing the file with the alloc ID, which has the added bonus of exercising our alloc interpolation code for group script checks. * Return FailedTGAlloc metric instead of no node err If an existing system allocation is running and the node its running on is marked as ineligible, subsequent plan/applys return an RPC error instead of a more helpful plan result. This change logs the error, and appends a failedTGAlloc for the placement. * update changelog * extract leader step function * Handle Nomad leadership flapping Fixes a deadlock in leadership handling if leadership flapped. Raft propagates leadership transition to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader` until the channel is consumed[1] and does not move on to executing follower related logic (in `raft.runFollower`). While Raft `runLeader` defer function blocks, raft cannot process any other raft operations. For example, `run{Leader|Follower}` methods consume `raft.applyCh`, and while runLeader defer is blocked, all raft log applications or config lookup will block indefinitely. Sadly, `leaderLoop` and `establishLeader` makes few Raft calls! `establishLeader` attempts to auto-create autopilot/scheduler config [3]; and `leaderLoop` attempts to check raft configuration [4]. All of these calls occur without a timeout. Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is invoked and hit any of these Raft calls, Raft handler _deadlock_ forever. Depending on how many times it flapped and where exactly we get stuck, I suspect it's possible to get in the following case: * Agent metrics/stats http and RPC calls hang as they check raft.Configurations * raft.State remains in Leader state, and server attempts to handle RPC calls (e.g. node/alloc updates) and these hang as well As we create goroutines per RPC call, the number of goroutines grow over time and may trigger a out of memory errors in addition to missed updates. [1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193 [2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436 [3] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L198-L202 [4] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L877 * e2e: document e2e provisioning process (hashicorp#6976) * Add the digital marketing team as the code owners for the website dir * Mock the eligibility endpoint in mirage * Implement eligibility toggling in the data layer * Add isMigrating property to the allocation model * Mock the drain endpoint * drain and forceDrain adapter methods * Update drain methods to properly wrap DrainSpec params * cancelDrain adapter method * Reformat the client detail page to use the two-row header design * Add tooltip to the eligibility control * Update the underlying node model when toggling eligibility in mirage * Eligibility toggling behavior * PopoverMenu component * Update the dropdown styles to be more similar to button styles * Multiline modifier for tooltips * More form styles as needed for the drain form * Initial layout of the drain options popover * Let dropdowns assume their full width * Add triggerClass support to the popover menu * Factor out the drain popover and implement its behaviors * Extract the duration parsing into a util * Test coverage for the parse duration util * Refactor parseDuration to support multi-character units * Polish for the drain popover * Stub out all the markup for the new drain strategy view * Fill in the drain strategy ribbon values * Fill out the metrics and time since values in the drain summary * Drain complete notification * Drain stop and update and notifications * Modifiers to the two-step-button * Make outline buttons have a solid white background * Force drain button in the drain info box * New toggle component * Swap the eligiblity checkbox out for a toggle * Toggle bugs: focus and multiline alignment * Switch drain popover checkboxes for toggles * Clear all notifications when resetting the controller * Model the notification pattern as a page object component * Update the client detail page object * Integration tests for the toggle component * PopoverMenu integration tests * Update existing tests * New test coverage for the drain capabilities * Stack the popover menu under the subnav * Use qunit-dom where applicable * Increase the size and spacing of the toggle component * Remove superfluous information from the client details ribbon * Tweak vertical spacing of headings * Update client detail test given change to the compositeStatus property * Replace custom parse-duration implementation with an existing lib * fix comment * consul: add support for canary meta * website: add canary meta to api docs * docs: add Go versioning policy * consul: fix var name from rebase * docs: reseting bootstrap doesn't invalidate token * consul: fix var name from rebase * Update website/source/guides/security/acl.html.markdown Co-Authored-By: Tim Gross <[email protected]> * e2e: packer builds should not be public (hashicorp#6998) * docs: tweaks * include test and address review comments * handle channel close signal Always deliver last value then send close signal. * tweak leadership flapping log messages * tests: defer closing shutdownCh * client: canonicalize alloc.Job on restore There is a case for always canonicalizing alloc.Job field when canonicalizing the alloc. I'm less certain of implications though, and the job canonicalize hasn't changed for a long time. Here, we special case client restore from database as it's probably the most relevant part. When receiving an alloc from RPC, the data should be fresh enough. * Support customizing full scheduler config * tests: run_for is already a string * canary_meta will be part of 0.10.3 (not 0.10.2) I assume this is just an oversight. I tried adding the `canary_meta` stanza to an existing v0.10.2 setup (Nomad v0.10.2 (0d2d6e3) and it did show the error message: ``` * group: 'ggg', task: 'tttt', invalid key: canary_meta ``` * use golang 1.12.16 * Allow nomad monitor command to lookup server UUID Allows addressing servers with nomad monitor using the servers name or ID. Also unifies logic for addressing servers for client_agent_endpoint commands and makes addressing logic region aware. rpc getServer test * fix tests, update changelog * e2e: add a -suite flag to e2e.Framework This change allows for providing the -suite=<Name> flag when running the e2e framework. If set, only the matching e2e/Framework.TestSuite.Component will be run, and all ther suites will be skipped. * Document default_scheduler_config option * document docker's disable_log_collection flag * batch mahmood's changelog entries [ci skip] * incorporate review feedback * core: add limits to unauthorized connections Introduce limits to prevent unauthorized users from exhausting all ephemeral ports on agents: * `{https,rpc}_handshake_timeout` * `{http,rpc}_max_conns_per_client` The handshake timeout closes connections that have not completed the TLS handshake by the deadline (5s by default). For RPC connections this timeout also separately applies to first byte being read so RPC connections with TLS enabled have `rpc_handshake_time * 2` as their deadline. The connection limit per client prevents a single remote TCP peer from exhausting all ephemeral ports. The default is 100, but can be lowered to a minimum of 26. Since streaming RPC connections create a new TCP connection (until MultiplexV2 is used), 20 connections are reserved for Raft and non-streaming RPCs to prevent connection exhaustion due to streaming RPCs. All limits are configurable and may be disabled by setting them to `0`. This also includes a fix that closes connections that attempt to create TLS RPC connections recursively. While only users with valid mTLS certificates could perform such an operation, it was added as a safeguard to prevent programming errors before they could cause resource exhaustion. * docs: document limits Taken more or less verbatim from Consul. * Merge pull request hashicorp#160 from hashicorp/b-mtls-hostname server: validate role and region for RPC w/ mTLS * docs: bump 0.10.2 -> 0.10.3 * docs: add v0.10.3 release to changelog * Add an ability for client permissions * Refactor ability tests to use a setup hook for ability lookup * Enable the eligibility toggle conditionally based on acls * Refetch all ACL things when the token changes * New disabled buttons story * Disabled button styles * Disable options for popover and drain-popover * hclfmt a test jobspec (hashicorp#7011) * Update disabled 'Run Job' button to use standard disabled style * Add an explanatory tooltip to the unauthorized node drain popover * Fix token referencing from the token controller, as well as resetting * Handle the case where ACLs aren't enabled in abilities * Account for disabled ACLs in ability tests * Acceptance test for disabled node write controls * Use secret ID for NOMAD_TOKEN Use secret ID for NOMAD_TOKEN as the accessor ID doesn't seem to work. I tried with a local micro cluster following the tutorials, and if I do: ```console $ export NOMAD_TOKEN=85310d07-9afa-ef53-0933-0c043cd673c7 ``` Using the accessor ID as in this example, I get an error: ``` Error querying jobs: Unexpected response code: 403 (ACL token not found) ``` But when using the secret ID in that env var it seems to work correctly. * Pass stats interval colleciton to executor This fixes a bug where executor based drivers emit stats every second, regardless of user configuration. When serializing the Stats request across grpc, the nomad agent dropped the Interval value, and then executor uses 1s as a default value. * changelog * Some fixes to connection pooling Pick up some fixes from Consul: * If a stream returns an EOF error, clear session from cache/pool and start a new one. * Close the codec when closing StreamClient * Allow for an icon within the node status light * Add an icon inside the node status light * Assign icons to node statuses * New node initializing icon * Redo the node-status-light CSS to be icon-based * Add an animation for the initializing state * Call out the 'down' status too, since it's a pretty bad one * command, docs: create and document consul token configuration for connect acls (hashicorpgh-6716) This change provides an initial pass at setting up the configuration necessary to enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert` commands (similar to Vault tokens). These values are not actually used yet in this changeset. * nomad: ensure a unique ClusterID exists when leader (hashicorpgh-6702) Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for hashicorpgh-6701. * client: enable nomad client to request and set SI tokens for tasks When a job is configured with Consul Connect aware tasks (i.e. sidecar), the Nomad Client should be able to request from Consul (through Nomad Server) Service Identity tokens specific to those tasks. * nomad: proxy requests for Service Identity tokens between Clients and Consul Nomad jobs may be configured with a TaskGroup which contains a Service definition that is Consul Connect enabled. These service definitions end up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In the case where Consul ACLs are enabled, a Service Identity token is required for these tasks to run & connect, etc. This changeset enables the Nomad Server to recieve RPC requests for the derivation of SI tokens on behalf of instances of Consul Connect using Tasks. Those tokens are then relayed back to the requesting Client, which then injects the tokens in the secrets directory of the Task. * client: enable envoy bootstrap hook to set SI token When creating the envoy bootstrap configuration, we should append the "-token=<token>" argument in the case where the sidsHook placed the token in the secrets directory. * nomad: fixup token policy validation * nomad: handle SI token revocations concurrently Be able to revoke SI token accessors concurrently, and also ratelimit the requests being made to Consul for the various ACL API uses. * agent: re-enable the server in dev mode * client: remove unused indirection for referencing consul executable Was thinking about using the testing pattern where you create executable shell scripts as test resources which "mock" the process a bit of code is meant to fork+exec. Turns out that wasn't really necessary in this case. * client: skip task SI token file load failure if testing as root The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when running as a non-priveleged user, since it deliberately tries to read an un-readable file to simulate a failure loading the SI token file. * comments: cleanup some leftover debug comments and such * nomad,client: apply smaller PR suggestions Apply smaller suggestions like doc strings, variable names, etc. Co-Authored-By: Nick Ethier <[email protected]> Co-Authored-By: Michael Schurter <[email protected]> * nomad,client: apply more comment/style PR tweaks * client: set context timeout around SI token derivation The derivation of an SI token needs to be safegaurded by a context timeout, otherwise an unresponsive Consul could cause the siHook to block forever on Prestart. * client: manage TR kill from parent on SI token derivation failure Re-orient the management of the tr.kill to happen in the parent of the spawned goroutine that is doing the actual token derivation. This makes the code a little more straightforward, making it easier to reason about not leaking the worker goroutine. * nomad: fix leftover missed refactoring in consul policy checking * nomad: make TaskGroup.UsesConnect helper a public helper * client: PR cleanup - shadow context variable * client: PR cleanup - improved logging around kill task in SIDS hook * client: additional test cases around failures in SIDS hook * tests: skip some SIDS hook tests if running tests as root * e2e: e2e test for connect with consul acls Provide script for managing Consul ACLs on a TF provisioned cluster for e2e testing. Script can be used to 'enable' or 'disable' Consul ACLs, and automatically takes care of the bootstrapping process if necessary. The bootstrapping process takes a long time, so we may need to extend the overall e2e timeout (20 minutes seems fine). Introduces basic tests for Consul Connect with ACLs. * e2e: remove forgotten unused field from new struct * e2e: do not use eventually when waiting for allocs This test is causing panics. Unlike the other similar tests, this one is using require.Eventually which is doing something bad, and this change replaces it with a for-loop like the other tests. Failure: === RUN TestE2E/Connect === RUN TestE2E/Connect/*connect.ConnectE2ETest === RUN TestE2E/Connect/*connect.ConnectE2ETest/TestConnectDemo === RUN TestE2E/Connect/*connect.ConnectE2ETest/TestMultiServiceConnect === RUN TestE2E/Connect/*connect.ConnectClientStateE2ETest panic: Fail in goroutine after TestE2E/Connect/*connect.ConnectE2ETest has completed goroutine 38 [running]: testing.(*common).Fail(0xc000656500) /opt/google/go/src/testing/testing.go:565 +0x11e testing.(*common).Fail(0xc000656100) /opt/google/go/src/testing/testing.go:559 +0x96 testing.(*common).FailNow(0xc000656100) /opt/google/go/src/testing/testing.go:587 +0x2b testing.(*common).Fatalf(0xc000656100, 0x1512f90, 0x10, 0xc000675f88, 0x1, 0x1) /opt/google/go/src/testing/testing.go:672 +0x91 github.com/hashicorp/nomad/e2e/connect.(*ConnectE2ETest).TestMultiServiceConnect.func1(0x0) /home/shoenig/go/src/github.com/hashicorp/nomad/e2e/connect/multi_service.go:72 +0x296 github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc0004962a0, 0xc0002338f0) /home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x27 created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually /home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1493 +0x272 FAIL github.com/hashicorp/nomad/e2e 21.427s * e2e: uncomment test case that is not broken * e2e: use hclfmt on consul acls policy config files * e2e: agent token was only being set for server0 * e2e: remove redundant extra API call for getting allocs * e2e: setup consul ACLs a little more correctly * tests: set consul token for nomad client for testing SIDS TR hook * nomad: min cluster version for connect ACLs is now v0.10.4 * nomad: remove unused default schedular variable This is from a merge conflict resolution that went the wrong direction. I assumed the block had been added, but really it had been removed. Now, it is removed once again. * docs: update chanagelog to mention connect with acls * nomad/docs: increment version number to 0.10.4 * sentinel: copy jobs to prevent mutation It's unclear whether Sentinel code can mutate values passed to the eval, so ensure it cannot by copying the job. * ignore computed diffs if node is ineligible test flakey, add temp sleeps for debugging fix computed class * make diffSystemAllocsForNode aware of eligibility diffSystemAllocs -> diffSystemAllocsForNode, this function is only used for diffing system allocations, but lacked awareness of eligible nodes and the node ID that the allocation was going to be placed. This change now ignores a change if its existing allocation is on an ineligible node. For a new allocation, it also checks tainted and ineligible nodes in the same function instead of nil-ing out the diff after computation in diffSystemAllocs * add test for node eligibility * comment for filtering reason * update changelog * vagrant: disable audio interference Avoid Vagrant/virtualbox interferring with host audio when the VM boots. * prehook: fix enterprise repo remote value * dev: Tweaks to cluster dev scripts Consolidate all nomad data dir in a single root `/tmp/nomad-dev-cluster`. Eases clean up. Allow running script from any path - don't require devs to cd into `dev/cluster` directory first. Also, block while nomad processes are running and prapogate SIGTERM/SIGINT to nomad processes to shutdown. * e2e: remove leftover debug println statement * run "make hclfmt" * make: emit explanation for /api isolation Emit a slightly helpful message when /api depends on nomad internal packages. * pool: Clear connection before releasing This to be consistent with other connection clean up handler as well as consul's https://github.com/hashicorp/consul/blob/v1.6.3/agent/pool/pool.go#L468-L479 . * Fix panic when monitoring a local client node Fixes a panic when accessing a.agent.Server() when agent is a client instead. This pr removes a redundant ACL check since ACLs are validated at the RPC layer. It also nil checks the agent server and uses Client() when appropriate. * agent Profile req nil check s.agent.Server() clean up logic and tests * update changelog * docs: fix misspelling * keep placed canaries aligned with alloc status * nomad state store must be modified through raft, rm local state change * add state store test to ensure PlacedCanaries is updated * docs: add link & reorg hashicorp#6690 in changelog * docs: fix typo, ordering, & style in changelog * e2e: turn no-ACLs connect tests back on Also cleanup more missed debugging things >.> * e2e: improve provisioning defaults and documentation (hashicorp#7062) This changeset improves the ergonomics of running the Nomad e2e test provisioning process by defaulting to a blank `nomad_sha` in the Terraform configuration. By default, a user will now need to pass in one of the Nomad version flags. But they won't have to manually edit the `provisioning.json` file for the common case of deploying a released version of Nomad, and won't need to put dummy values for `nomad_sha`. Includes general documentation improvements. * e2e: rename linux runner to avoid implicit build tag (hashicorp#7070) Go implicitly treats files ending with `_linux.go` as build tagged for Linux only. This broke the e2e provisioning framework on macOS once we tried importing it into the `e2e/consulacls` module. * e2e: wait 2m rather than 10s after disabling consul acls Pretty sure Consul / Nomad clients are often not ready yet after the ConsulACLs test disables ACLs, by the time the next test starts running. Running locally things tend to work, but in TeamCity this seems to be a recurring problem. However, when running locally sometimes I do see that the "show status" step after disabling ACLs, some nodes are still initializing, suggesting we're right on the border of not waiting long enough nomad node status ID DC Name Class Drain Eligibility Status 0e4dfce2 dc1 EC2AMAZ-JB3NF9P <none> false eligible ready 6b90aa06 dc2 ip-172-31-16-225 <none> false eligible ready 7068558a dc2 ip-172-31-20-143 <none> false eligible ready e0ae3c5c dc1 ip-172-31-25-165 <none> false eligible ready 15b59ed6 dc1 ip-172-31-23-199 <none> false eligible initializing Going to try waiting a full 2 minutes after disabling ACLs, hopefully that will help things Just Work. In the future, we should probably be parsing the output of the status checks and actually confirming all nodes are ready. Even better, maybe that's something shipyard will have built-in. * add e2e test for system sched ineligible nodes * get test passing, new util func to wait for not pending * clean up * rm unused field * fix check * simplify job, better error * docs: hashicorp#6065 shipped in v0.10.0, not v0.9.6 PR hashicorp#6065 was intended to be backported to v0.9.6 to fix issue hashicorp#6223. However it appears to have not been backported: * https://github.com/hashicorp/nomad/blob/v0.9.6/client/allocrunner/taskrunner/task_runner.go#L1349-L1351 * https://github.com/hashicorp/nomad/blob/v0.9.7/client/allocrunner/taskrunner/task_runner.go#L1349-L1351 The fix was included in v0.10.0: * https://github.com/hashicorp/nomad/blob/v0.10.0/client/allocrunner/taskrunner/task_runner.go#L1363-L1370 * e2e: add --quiet flag to s3 copy to reduce log spam (hashicorp#7085) * Explicit transparent bg on popover actions * Override the max-width on mobile to avoid losing space due to non-existent gutter menu * changelog windows binaries being signed Note that 0.10.4, nomad windows binaries will be signed. [ci skip] * change log for remote pprof endpoints * nomad: unset consul token on job register * nomad: assert consul token is unset on job register in tests * command: use consistent CONSUL_HTTP_TOKEN name Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same. Note that consul-template uses CONSUL_TOKEN, which Nomad also uses, so be careful to preserve any reference to that in the consul-template context. * docs: update changelog mentioning consul token passthrough * release: prep 0.10.4 * Generate files for 0.10.4 release * Release v0.10.4 Co-authored-by: Mahmood Ali <[email protected]> Co-authored-by: Tim Gross <[email protected]> Co-authored-by: Tim Higgison <[email protected]> Co-authored-by: Buck Doyle <[email protected]> Co-authored-by: Drew Bailey <[email protected]> Co-authored-by: Charlie Voiselle <[email protected]> Co-authored-by: Michael Schurter <[email protected]> Co-authored-by: Michael Lange <[email protected]> Co-authored-by: Nick Ethier <[email protected]> Co-authored-by: Tim Gross <[email protected]> Co-authored-by: Shantanu Gadgil <[email protected]> Co-authored-by: Seth Hoenig <[email protected]> Co-authored-by: Sebastián Ramírez <[email protected]> Co-authored-by: Nomad Release bot <[email protected]>
Is there documentation about how to use nomad with consul-connect when ACL's are enabled yet? |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Implement support for Consul Connect with Consul ACLs enabled.
This will be largely centered around Consul Service Identity (SI) Tokens being automatically managed per Nomad Task.
The text was updated successfully, but these errors were encountered: