
support for Nomad - Consul Connect integration with Consul ACLs #6905

Merged: 28 commits into dev-connect-acls from f-connect-acl-cleanup, Jan 31, 2020

Conversation

@shoenig (Member) commented Jan 7, 2020

This PR adds support for using Nomad's Consul Connect Integration when Consul ACLs are enabled.

Incorporates:

  • #6701 - overall connect acls support
  • #6717 - nomad server mint & revoke SI tokens
  • #6718 - nomad client request & inject SI tokens
  • #6789 - nomad client set SI token for envoy bootstrap

We still need e2e tests covering the following cases:

  • when Consul is absent
  • when Consul is present, with ACLs disabled
  • when Consul is present, with ACLs in default:allow mode
  • when Consul is present, with ACLs in default:deny mode
  • when the built-in envoy support is the sidecar
  • when a self-provided Task is the sidecar

Currently, this PR has Nomad manually parsing the Consul ACL policies attached to Consul tokens in order to perform authorization. This seems necessary for now, as Consul does not provide an endpoint with any kind of "dry-run" functionality for validating authorization without actually executing changes. The Vault integration does something similar. One major caveat is the lack of support for Consul Namespaces, which are rolling out in the current Consul beta. Not sure if this is a show-stopper.
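For illustration, here is a minimal sketch of the kind of check that parsing implies, assuming the token's policy rules have already been fetched and decoded into a hypothetical ConsulPolicy struct (the names here are illustrative, not the actual types in nomad/consul_policy.go):

```go
package consulpolicy

import "strings"

// ConsulPolicy is a hypothetical decoded form of a Consul ACL policy's rules.
type ConsulPolicy struct {
	Services        []ServiceRule // service "name" { policy = "..." } blocks
	ServicePrefixes []ServiceRule // service_prefix "prefix" { ... } blocks
}

// ServiceRule captures one service or service_prefix rule.
type ServiceRule struct {
	Name   string // exact service name, or prefix for ServicePrefixes
	Policy string // "read" or "write"
}

// AllowsServiceWrite reports whether the decoded policy grants service:write
// on the given service name, which is what registering a Connect-enabled
// service (and its sidecar) requires.
func (cp *ConsulPolicy) AllowsServiceWrite(serviceName string) bool {
	for _, rule := range cp.Services {
		if rule.Name == serviceName && rule.Policy == "write" {
			return true
		}
	}
	for _, rule := range cp.ServicePrefixes {
		if strings.HasPrefix(serviceName, rule.Name) && rule.Policy == "write" {
			return true
		}
	}
	return false
}
```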

This change does not incorporate the background orphaned SI token reconciliation described in #6719. It should be very uncommon for SI tokens to be left in that state (known by Consul but unknown by Nomad): it requires a race where a Nomad Server requests the token to be created, Consul creates the token, and the requesting Nomad Server crashes in the interim. In other failure scenarios the Nomad Server already retries the revocation of the SI token in the background until Consul says the token is gone. The missing functionality will cover the case where the Nomad Server crashes, forgets the token exists, and must reconcile with Consul to purge any token that is identified with this Nomad cluster's ClusterID but is not known by this Nomad cluster. This will be added in an upcoming release.
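For context, a sketch of what that reconciliation pass could look like against the Consul ACL API (the knownAccessors set and the convention of embedding the ClusterID in the token description are assumptions for illustration):

```go
package reconcile

import (
	"log"
	"strings"

	consulapi "github.com/hashicorp/consul/api"
)

// purgeOrphanedSITokens revokes tokens that Consul knows about but Nomad does
// not: tokens whose description marks them as minted by this Nomad cluster
// (via clusterID) and whose accessor is absent from Nomad's tracked set.
func purgeOrphanedSITokens(client *consulapi.Client, clusterID string, knownAccessors map[string]bool) error {
	tokens, _, err := client.ACL().TokenList(nil)
	if err != nil {
		return err
	}
	for _, token := range tokens {
		if !strings.Contains(token.Description, clusterID) {
			continue // not minted by this Nomad cluster
		}
		if knownAccessors[token.AccessorID] {
			continue // still tracked by Nomad; leave it alone
		}
		// Orphaned: revoke it. Failures are logged and retried on the
		// next reconciliation pass.
		if _, err := client.ACL().TokenDelete(token.AccessorID, nil); err != nil {
			log.Printf("failed to revoke orphaned SI token %s: %v", token.AccessorID, err)
		}
	}
	return nil
}
```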

Review threads (all resolved):

  • client/allocrunner/taskrunner/envoybootstrap_hook.go
  • client/allocrunner/taskrunner/sids_hook.go
  • client/allocrunner/taskrunner/task_runner_test.go
  • nomad/consul_policy.go
  • nomad/job_endpoint.go
  • nomad/leader.go
shoenig and others added 14 commits January 24, 2020 09:10
When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request Service Identity tokens
specific to those tasks from Consul (through the Nomad Server).
… Consul

Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect proxy task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to receive RPC requests for the derivation of SI tokens on behalf of tasks
using Consul Connect. Those tokens are then relayed back to the requesting
Client, which injects them into the secrets directory of the task.
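As a rough illustration of that RPC shape (the field names are assumptions, not the exact types this changeset adds):

```go
package structs

// DeriveSITokenRequest is a hypothetical shape for the Client -> Server RPC
// described above: a client asks the server to mint SI tokens for the
// Connect-enabled tasks of one of its allocations.
type DeriveSITokenRequest struct {
	NodeID   string   // the client node making the request
	SecretID string   // the node's secret, proving the request is legitimate
	AllocID  string   // the allocation the tasks belong to
	Tasks    []string // Connect-enabled tasks needing SI tokens
}

// DeriveSITokenResponse carries the minted tokens back to the client, which
// writes each one into the corresponding task's secrets directory.
type DeriveSITokenResponse struct {
	Tokens map[string]string // task name -> SI token secret
	Error  string            // set when derivation failed for the whole request
}
```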
When creating the envoy bootstrap configuration, we should append
the "-token=<token>" argument in the case where the sidsHook placed
the token in the secrets directory.
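A minimal sketch of that conditional argument, assuming a hypothetical "si_token" file name in the secrets directory:

```go
package envoy

import (
	"io/ioutil"
	"os"
	"path/filepath"
)

// buildBootstrapArgs appends -token=<token> only when the sids hook actually
// wrote a token file; with ACLs disabled the file never exists and the
// argument is omitted.
func buildBootstrapArgs(sidecarForID, secretsDir string) ([]string, error) {
	args := []string{"connect", "envoy", "-bootstrap", "-sidecar-for", sidecarForID}

	tokenBytes, err := ioutil.ReadFile(filepath.Join(secretsDir, "si_token"))
	if os.IsNotExist(err) {
		return args, nil // no SI token; ACLs presumably disabled
	}
	if err != nil {
		return nil, err // e.g. an unreadable file
	}
	return append(args, "-token="+string(tokenBytes)), nil
}
```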
Be able to revoke SI token accessors concurrently, and also
rate-limit the requests being made to Consul for the various
ACL API uses.
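A sketch of that shape using golang.org/x/time/rate (the rate value and the revoke callback are assumptions):

```go
package tokens

import (
	"context"
	"log"
	"sync"

	"golang.org/x/time/rate"
)

// revokeAll revokes SI token accessors concurrently while rate-limiting the
// underlying Consul ACL API calls so a large batch cannot flood Consul.
func revokeAll(ctx context.Context, accessors []string, revoke func(context.Context, string) error) {
	limiter := rate.NewLimiter(rate.Limit(5), 1) // at most ~5 Consul calls/sec (assumed)
	var wg sync.WaitGroup
	for _, accessor := range accessors {
		wg.Add(1)
		go func(accessor string) {
			defer wg.Done()
			if err := limiter.Wait(ctx); err != nil {
				return // ctx canceled while waiting for a rate-limit slot
			}
			if err := revoke(ctx, accessor); err != nil {
				log.Printf("revoking accessor %s failed; retrying in background: %v", accessor, err)
			}
		}(accessor)
	}
	wg.Wait()
}
```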
Was thinking about using the testing pattern where you create executable
shell scripts as test resources which "mock" the process a bit of code
is meant to fork+exec. Turns out that wasn't really necessary in this case.
The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when
running as a non-privileged user, since it deliberately tries to read
an unreadable file to simulate a failure loading the SI token file.
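The setup that implies looks roughly like this (the test and file names are assumed):

```go
package taskrunner

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestMaybeLoadSIToken_UnreadableFile(t *testing.T) {
	if os.Geteuid() == 0 {
		t.Skip("requires a non-privileged user; root can read anything")
	}
	r := require.New(t)

	dir, err := ioutil.TempDir("", "sids-")
	r.NoError(err)
	defer os.RemoveAll(dir)

	// Write the token file, then strip read permission to force the failure.
	tokenFile := filepath.Join(dir, "si_token")
	r.NoError(ioutil.WriteFile(tokenFile, []byte("secret"), 0200))

	_, err = ioutil.ReadFile(tokenFile)
	r.Error(err) // simulated failure loading the SI token file
}
```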
Apply smaller suggestions like doc strings, variable names, etc.

Co-Authored-By: Nick Ethier <[email protected]>
Co-Authored-By: Michael Schurter <[email protected]>
The derivation of an SI token needs to be safeguarded by a context
timeout; otherwise an unresponsive Consul could cause the siHook
to block forever on Prestart.
Re-orient the management of the tr.kill to happen in the parent of
the spawned goroutine that is doing the actual token derivation. This
makes the code a little more straightforward, making it easier to
reason about not leaking the worker goroutine.
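The resulting pattern looks roughly like this (helper names assumed):

```go
package sids

import (
	"context"
	"time"
)

type tokenResult struct {
	token string
	err   error
}

// deriveWithTimeout runs the blocking derivation in a worker goroutine while
// the parent owns the timeout and the decision to give up. The buffered
// channel guarantees the worker's send never blocks, so it cannot be leaked.
func deriveWithTimeout(ctx context.Context, timeout time.Duration, derive func(context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	results := make(chan tokenResult, 1)
	go func() {
		token, err := derive(ctx)
		results <- tokenResult{token: token, err: err}
	}()

	select {
	case r := <-results:
		return r.token, r.err
	case <-ctx.Done():
		// The parent gives up (and can kill the task); the worker exits
		// once derive observes cancellation, its send landing in the buffer.
		return "", ctx.Err()
	}
}
```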
@nickethier (Member) left a comment:

Some minor nits/questions. Overall this looks good!


```go
func (h *envoyBootstrapHook) writeConfig(filename, config string) error {
	if err := ioutil.WriteFile(filename, []byte(config), 0440); err != nil {
		_ = os.Remove(filename)
```
Member:

Suggested change:

```diff
- _ = os.Remove(filename)
+ os.Remove(filename)
```

```go
	return nil
}

func (_ *envoyBootstrapHook) retry(ctx context.Context) bool {
```
Member:

Is this function used anywhere?

Using an unnamed receiver is often a smell, imo, that you don't need it. This logic is simple enough that, if you are using it, I'd just inline the select and maybe make a const out of the timeout.

Member Author:

👍

"-sidecar-for", id,
)

cmd := exec.CommandContext(ctx, "consul", bootstrapArgs...)
Member:

I started thinking that maybe we want to use a different context here. This command should execute almost instantly; if it does block, it will block prestart until the task is restarted. Thoughts?

Member Author:

Are you suggesting we could wrap the context with a timeout, e.g.

```go
ctx, cancel = context.WithTimeout(ctx, 3*time.Second)
cmd := exec.CommandContext(ctx, "consul", bootstrapArgs...)
```

That would prevent the exec from hanging forever and also let the TR cancel it. If so, I'd like to pull the whole body of the retry loop into a function so we can make use of defer for both the ctx's cancel and the fd.Close.

@schmichael (Member) commented Jan 30, 2020:

The loop is because this code depends-on and runs-asynchronously-with group service registration in the alloc runner hook. Since our Consul service client does not block until a service is successfully registered, we have no way of knowing whether this will succeed on the first try.

I don't think this command blocks if the expected service doesn't exist. I think it just errors. Hence the retries.

A conservative timeout (3+ seconds) just to ensure it doesn't block forever for some unexpected reason doesn't seem like a bad idea though.

Ideally this code would die someday when Consul provides an API or library for generating the bootstrap config.
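For reference, wrapping the exec as suggested might look like this (the duration is an assumption):

```go
package envoy

import (
	"context"
	"os/exec"
	"time"
)

// execBootstrap runs `consul connect envoy -bootstrap ...` under a
// conservative timeout so an unexpected hang cannot block prestart forever.
func execBootstrap(ctx context.Context, bootstrapArgs []string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel() // also releases the timer if the command finishes early

	cmd := exec.CommandContext(ctx, "consul", bootstrapArgs...)
	return cmd.CombinedOutput()
}
```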

```go
}

// Enforce that the operator has necessary Consul ACL permissions
for _, tg := range args.Job.ConnectTasks() {
```
Member:

I think this might be a good candidate to be a validating admission controller.

Member Author:

I was thinking the same thing; created #7020 to pull this and some other blocks out.
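A sketch of what such a validating admission controller might look like (the interface shape and types here are assumptions, not Nomad's actual plumbing; see #7020 for the real follow-up):

```go
package admission

import "fmt"

// Job and TaskGroup are simplified stand-ins for Nomad's structs types.
type Job struct{ TaskGroups []*TaskGroup }

type TaskGroup struct {
	Name         string
	UsesConnect  bool
	ServiceNames []string
}

// aclChecker abstracts "does this Consul token allow service:write on name".
type aclChecker interface {
	AllowsServiceWrite(consulToken, service string) (bool, error)
}

// connectACLValidator rejects Connect-enabled jobs when the submitter's
// Consul token lacks service:write on the groups' services.
type connectACLValidator struct{ acl aclChecker }

func (v *connectACLValidator) Validate(job *Job, consulToken string) error {
	for _, tg := range job.TaskGroups {
		if !tg.UsesConnect {
			continue
		}
		for _, service := range tg.ServiceNames {
			ok, err := v.acl.AllowsServiceWrite(consulToken, service)
			if err != nil {
				return err
			}
			if !ok {
				return fmt.Errorf("group %q: consul token lacks service:write on %q", tg.Name, service)
			}
		}
	}
	return nil
}
```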

```go
	return nil
}

func tgUsesConnect(tg *structs.TaskGroup) bool {
```
Member:

I'd be okay with creating a func (*TaskGroup) UsesConnect() bool helper in the structs package.
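Such a helper might look like the following (treating the field shapes as illustrative rather than Nomad's exact structs):

```go
// UsesConnect reports whether any service in the group is Connect-enabled.
func (tg *TaskGroup) UsesConnect() bool {
	for _, service := range tg.Services {
		if service.Connect != nil && service.Connect.SidecarService != nil {
			return true
		}
	}
	return false
}
```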

```go
	// not even in the task group
	return false
}
// todo(shoenig): TBD what Kind does a native task have?
```
Member:

We will probably have to designate an additional kind for native tasks that users must set if they want Nomad to derive an SI token for them. We only identify connect native at the group service level today.

```go
// derive an SI token until a token is successfully created, or ctx is signaled
// done.
func (h *sidsHook) deriveSIToken(ctx context.Context) (string, error) {
	ctx2, cancel := context.WithTimeout(ctx, h.derivationTimeout)
```
Member:

It's idiomatic to shadow the original ctx variable:

Suggested change:

```diff
- ctx2, cancel := context.WithTimeout(ctx, h.derivationTimeout)
+ ctx, cancel := context.WithTimeout(ctx, h.derivationTimeout)
```

```go
}

func (h *sidsHook) kill(ctx context.Context, err error) {
	_ = h.lifecycle.Kill(
```
@schmichael (Member) commented Jan 15, 2020:

We should probably log any errors here. Even though kill errors are already logged in task runner, those log messages won't include why the task was being killed. I think we only have that context here.
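A sketch of logging the reason before delegating (the logger field and the exact event construction are assumptions):

```go
// kill logs why the task is being killed before delegating, since the task
// runner's own kill logging won't include this context.
func (h *sidsHook) kill(ctx context.Context, reason error) {
	h.logger.Error("failed to derive SI token, killing task", "error", reason)
	if err := h.lifecycle.Kill(ctx,
		structs.NewTaskEvent(structs.TaskKilling).
			SetFailsTask().
			SetDisplayMessage(reason.Error()),
	); err != nil {
		h.logger.Error("failed to kill task", "error", err)
	}
}
```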


```go
	stop := !backoff(ctx, 1000)
	r.True(stop)
}
```
Member:

I don't see many error path tests. Perhaps add one where the secrets directory doesn't exist, so the file write should fail.
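Sketched out, that error-path test could be (names assumed):

```go
package taskrunner

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestSIDSHook_writeToken_MissingSecretsDir(t *testing.T) {
	r := require.New(t)

	base, err := ioutil.TempDir("", "sids-")
	r.NoError(err)
	defer os.RemoveAll(base)

	// Deliberately point at a secrets directory that was never created.
	missing := filepath.Join(base, "secrets")
	err = ioutil.WriteFile(filepath.Join(missing, "si_token"), []byte("secret"), 0440)
	r.Error(err) // parent directory does not exist, so the write must fail
}
```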

"-sidecar-for", id,
)

cmd := exec.CommandContext(ctx, "consul", bootstrapArgs...)
Copy link
Member

@schmichael schmichael Jan 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop is because this code depends-on and runs-asynchronously-with group service registration in the alloc runner hook. Since our Consul service client does not block until a service is successfully registered we have no way of knowing whether this will succeed on the first try.

I don't think this command blocks if the expected service doesn't exist. I think it just errors. Hence the retries.

A conservative timeout (3+ seconds) just to ensure it doesn't block forever for some unexpected reason doesn't seem like a bad idea though.

Ideally this code would die someday when Consul provides an API or library for generating the bootstrap config.

Provide script for managing Consul ACLs on a TF provisioned cluster for
e2e testing. Script can be used to 'enable' or 'disable' Consul ACLs,
and automatically takes care of the bootstrapping process if necessary.

The bootstrapping process takes a long time, so we may need to
extend the overall e2e timeout (20 minutes seems fine).

Introduces basic tests for Consul Connect with ACLs.
This test is causing panics. Unlike the other similar tests, this
one is using require.Eventually which is doing something bad, and
this change replaces it with a for-loop like the other tests.

Failure:

```
=== RUN   TestE2E/Connect
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest/TestConnectDemo
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest/TestMultiServiceConnect
=== RUN   TestE2E/Connect/*connect.ConnectClientStateE2ETest
panic: Fail in goroutine after TestE2E/Connect/*connect.ConnectE2ETest has completed

goroutine 38 [running]:
testing.(*common).Fail(0xc000656500)
	/opt/google/go/src/testing/testing.go:565 +0x11e
testing.(*common).Fail(0xc000656100)
	/opt/google/go/src/testing/testing.go:559 +0x96
testing.(*common).FailNow(0xc000656100)
	/opt/google/go/src/testing/testing.go:587 +0x2b
testing.(*common).Fatalf(0xc000656100, 0x1512f90, 0x10, 0xc000675f88, 0x1, 0x1)
	/opt/google/go/src/testing/testing.go:672 +0x91
github.com/hashicorp/nomad/e2e/connect.(*ConnectE2ETest).TestMultiServiceConnect.func1(0x0)
	/home/shoenig/go/src/github.com/hashicorp/nomad/e2e/connect/multi_service.go:72 +0x296
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc0004962a0, 0xc0002338f0)
	/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x27
created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually
	/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1493 +0x272
FAIL	github.com/hashicorp/nomad/e2e	21.427s
```
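The replacement pattern amounts to polling in a plain for-loop and only failing from the test's own goroutine (the durations and helper name are assumptions):

```go
package connect

import (
	"testing"
	"time"
)

// waitFor polls cond until it reports success, failing the test from the
// test's own goroutine. Unlike require.Eventually with a condition func that
// asserts, no assertion can fire from a spawned goroutine after the subtest
// has already completed.
func waitFor(t *testing.T, cond func() (bool, error)) {
	t.Helper()
	deadline := time.Now().Add(30 * time.Second)
	var lastErr error
	for time.Now().Before(deadline) {
		ok, err := cond()
		if ok {
			return
		}
		lastErr = err
		time.Sleep(time.Second)
	}
	t.Fatalf("timed out waiting for condition: %v", lastErr)
}
```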
e2e test for connect with consul acls
@shoenig merged this pull request into dev-connect-acls on Jan 31, 2020
@shoenig deleted the f-connect-acl-cleanup branch on January 31, 2020 01:18