support for Nomad - Consul Connect integration with Consul ACLs #6905
When a job is configured with Consul Connect aware tasks (i.e. sidecar), the Nomad Client should be able to request from Consul (through Nomad Server) Service Identity tokens specific to those tasks.
Consul-aware Nomad jobs may be configured with a TaskGroup which contains a Service definition that is Consul Connect enabled. These service definitions end up establishing a Consul Connect proxy task (e.g. envoy, by default). In the case where Consul ACLs are enabled, a Service Identity token is required for these tasks to run and connect. This changeset enables the Nomad Server to receive RPC requests for the derivation of SI tokens on behalf of instances of Consul Connect-using tasks. Those tokens are then relayed back to the requesting Client, which injects them into the secrets directory of the task.
When creating the envoy bootstrap configuration, we should append the "-token=<token>" argument in the case where the sidsHook placed the token in the secrets directory.
Be able to revoke SI token accessors concurrently, and also ratelimit the requests being made to Consul for the various ACL API uses.
Was thinking about using the testing pattern where you create executable shell scripts as test resources which "mock" the process a bit of code is meant to fork+exec. Turns out that wasn't really necessary in this case.
The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when running as a non-privileged user, since it deliberately tries to read an unreadable file to simulate a failure loading the SI token file.
Apply smaller suggestions like doc strings, variable names, etc. Co-Authored-By: Nick Ethier <[email protected]> Co-Authored-By: Michael Schurter <[email protected]>
The derivation of an SI token needs to be safeguarded by a context timeout, otherwise an unresponsive Consul could cause the siHook to block forever on Prestart.
Re-orient the management of the tr.kill to happen in the parent of the spawned goroutine that is doing the actual token derivation. This makes the code a little more straightforward, making it easier to reason about not leaking the worker goroutine.
Some minor nits/questions. Overall this looks good!
func (h *envoyBootstrapHook) writeConfig(filename, config string) error {
	if err := ioutil.WriteFile(filename, []byte(config), 0440); err != nil {
		_ = os.Remove(filename)
Suggested change:
-	_ = os.Remove(filename)
+	os.Remove(filename)
	return nil
}

func (_ *envoyBootstrapHook) retry(ctx context.Context) bool {
Is this function used anywhere?
Using an uninitialized receiver is often a smell imo that you don't need it. This logic is simple enough that if you are using it I'd just inline the select and maybe make a const out of the timeout.
👍
"-sidecar-for", id, | ||
) | ||
|
||
cmd := exec.CommandContext(ctx, "consul", bootstrapArgs...) |
I started thinking that maybe we want to use a different context here. This command should execute rather instantly. If it does block, it will block prestart until the task is restarted. Thoughts?
Are you suggesting we could wrap the context with a timeout, e.g.

ctx, cancel = context.WithTimeout(ctx, 3*time.Second)
cmd := exec.CommandContext(ctx, "consul", bootstrapArgs...)

That would prevent the exec from hanging forever and also let the TR cancel it. If so, I'd like to pull the whole body of the retry loop into a function so we can make use of defer for both the ctx's cancel and the fd.Close.
The loop is because this code depends-on and runs-asynchronously-with group service registration in the alloc runner hook. Since our Consul service client does not block until a service is successfully registered we have no way of knowing whether this will succeed on the first try.
I don't think this command blocks if the expected service doesn't exist. I think it just errors. Hence the retries.
A conservative timeout (3+ seconds) just to ensure it doesn't block forever for some unexpected reason doesn't seem like a bad idea though.
Ideally this code would die someday when Consul provides an API or library for generating the bootstrap config.
}

// Enforce that the operator has necessary Consul ACL permissions
for _, tg := range args.Job.ConnectTasks() {
I think this might be a good candidate to be a validating admission controller.
I was thinking the same thing, created #7020 to pull this and some other blocks out.
nomad/node_endpoint.go (outdated):

	return nil
}

func tgUsesConnect(tg *structs.TaskGroup) bool {
I'd be okay with creating a func (*TaskGroup) UsesConnect() bool helper in the structs package.
	// not even in the task group
	return false
}
// todo(shoenig): TBD what Kind does a native task have?
We will probably have to designate an additional kind for native tasks that users must set if they want Nomad to derive an SI token for them. We only identify connect native at the group service level today.
// derive an SI token until a token is successfully created, or ctx is signaled
// done.
func (h *sidsHook) deriveSIToken(ctx context.Context) (string, error) {
	ctx2, cancel := context.WithTimeout(ctx, h.derivationTimeout)
It's idiomatic to shadow the original ctx variable:
-	ctx2, cancel := context.WithTimeout(ctx, h.derivationTimeout)
+	ctx, cancel := context.WithTimeout(ctx, h.derivationTimeout)
}

func (h *sidsHook) kill(ctx context.Context, err error) {
	_ = h.lifecycle.Kill(
We should probably log any errors here. Even though kill errors are already logged in task runner, those log messages won't include why the task was being killed. I think we only have that context here.
stop := !backoff(ctx, 1000)
r.True(stop)
}
I don't see many error path tests. Perhaps add one where the secrets directory doesn't exist so the file write should fail.
"-sidecar-for", id, | ||
) | ||
|
||
cmd := exec.CommandContext(ctx, "consul", bootstrapArgs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop is because this code depends-on and runs-asynchronously-with group service registration in the alloc runner hook. Since our Consul service client does not block until a service is successfully registered we have no way of knowing whether this will succeed on the first try.
I don't think this command blocks if the expected service doesn't exist. I think it just errors. Hence the retries.
A conservative timeout (3+ seconds) just to ensure it doesn't block forever for some unexpected reason doesn't seem like a bad idea though.
Ideally this code would die someday when Consul provides an API or library for generating the bootstrap config.
Provide script for managing Consul ACLs on a TF provisioned cluster for e2e testing. Script can be used to 'enable' or 'disable' Consul ACLs, and automatically takes care of the bootstrapping process if necessary. The bootstrapping process takes a long time, so we may need to extend the overall e2e timeout (20 minutes seems fine). Introduces basic tests for Consul Connect with ACLs.
This test is causing panics. Unlike the other similar tests, this one is using require.Eventually which is doing something bad, and this change replaces it with a for-loop like the other tests. Failure:

=== RUN TestE2E/Connect
=== RUN TestE2E/Connect/*connect.ConnectE2ETest
=== RUN TestE2E/Connect/*connect.ConnectE2ETest/TestConnectDemo
=== RUN TestE2E/Connect/*connect.ConnectE2ETest/TestMultiServiceConnect
=== RUN TestE2E/Connect/*connect.ConnectClientStateE2ETest
panic: Fail in goroutine after TestE2E/Connect/*connect.ConnectE2ETest has completed

goroutine 38 [running]:
testing.(*common).Fail(0xc000656500)
	/opt/google/go/src/testing/testing.go:565 +0x11e
testing.(*common).Fail(0xc000656100)
	/opt/google/go/src/testing/testing.go:559 +0x96
testing.(*common).FailNow(0xc000656100)
	/opt/google/go/src/testing/testing.go:587 +0x2b
testing.(*common).Fatalf(0xc000656100, 0x1512f90, 0x10, 0xc000675f88, 0x1, 0x1)
	/opt/google/go/src/testing/testing.go:672 +0x91
github.com/hashicorp/nomad/e2e/connect.(*ConnectE2ETest).TestMultiServiceConnect.func1(0x0)
	/home/shoenig/go/src/github.com/hashicorp/nomad/e2e/connect/multi_service.go:72 +0x296
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc0004962a0, 0xc0002338f0)
	/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x27
created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually
	/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1493 +0x272
FAIL github.com/hashicorp/nomad/e2e 21.427s
e2e test for connect with consul acls
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR adds support for using Nomad's Consul Connect Integration when Consul ACLs are enabled.
Incorporates
#6701 - overall connect acls support
#6717 - nomad server mint & revoke SI tokens
#6718 - nomad client request & inject SI tokens
#6789 - nomad client set SI token for envoy bootstrap
We still need e2e tests which cover the cases:
- default:allow mode
- default:deny mode

Currently, this PR has Nomad manually parsing Consul ACL policies on Consul tokens for doing authorization. This seems necessary for now, as Consul does not provide an endpoint with a kind of "dry-run" functionality for authorization validation without actually executing changes. The Vault integration does something similar. One major caveat is the lack of support for Consul Namespaces, which is rolling out in the current Consul beta. Not sure if this is a show-stopper.
This change does not incorporate the background orphaned-SI-token reconciliation described in #6719. It should be very uncommon for SI tokens to be left in that state (known by Consul but unknown by Nomad): it requires a race where Nomad Server requests the token to be created, Consul creates the token, and the requesting Nomad Server crashes in the interim. In other failure scenarios Nomad Server already retries the revocation of the SI token in the background until Consul says the token is gone. The missing functionality covers the case where Nomad Server crashes, forgets the token exists, and must reconcile with Consul to purge any token identified with this Nomad cluster's ClusterID but not known by this Nomad cluster. This will be added in an upcoming release.