Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task API via Unix Domain Socket #15864

Merged
merged 18 commits into from
Feb 6, 2023
Merged

Task API via Unix Domain Socket #15864

merged 18 commits into from
Feb 6, 2023

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Jan 25, 2023

The goal of this PR is to provide universal workload access to Nomad's HTTP API via Unix Domain Sockets (UDS).

Dynamic Node Metadata #15844 is one potential use case.

Inspired by e90807e

TODO

  • docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
  • Unit tests for auth middleware
  • Caching for auth middleware
  • Rate limiting on negative lookups for auth middleware

tgross added a commit that referenced this pull request Jan 25, 2023
This changeset allows Workload Identities to authenticate to all the RPCs that
support HTTP API endpoints, for use with PR #15864.

* Extends the work done for pre-forwarding authentication to all RPCs that
  support a HTTP API endpoint.
* Consolidates the auth helpers used by the CSI, Service Registration, and Node
  endpoints that are currently used to support both tokens and client secrets.

Intentionally excluded from this changeset:
* The Variables endpoint still has custom handling because of the implicit
  policies. Ideally we'll figure out an efficient way to resolve those into real
  policies and then we can get rid of that custom handling.
* The RPCs that don't currently support auth tokens (i.e. those that don't
  support HTTP endpoints) have not been updated with the new pre-forwarding auth
  We'll be doing this under a separate PR to support RPC rate metrics.
Comment on lines +1013 to +1031
// Initialize builtin API server here for use in the client, but it won't
// accept connections until the HTTP servers are created.
a.builtinServer = newBuiltinAPI()
conf.APIListenerRegistrar = a.builtinServer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? The implemenation of builtinAPI is clever and unfamiliar, which is usually a bad thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr I would love a better way of doing this.

Is this necessary?

Kind of? Basically giving the Client access to the Agent HTTP server creates a circular dependency that needs solving somehow:

  1. NewAgent creates Client
  2. agent.Command creates HTTP servers which accept the Agent as a dependency

That worked out fine when the Client was oblivious to the existence of the HTTP API.

When we needed access to the HTTP API for template support we used bufconn to create a Dialer connected to a Listener, returned the Listener to Agent, and Agent would slap the HTTP API on it. bufconn blocks Dials until the Listener accepts them.

Unix Domain Sockets (UDS) are listeners themselves though, so we couldn't just reuse the Dialer (unless we wanted to create 1 Accept goroutine per Task and 2 proxy goroutines for every UDS connection to proxy reads and writes from the UDS to the bufconn).

So I more or less copied bufconn except instead of blocking a Dial attempt on its corresponding Accept, I block a Register(listener) attempt on the HTTP Server existing.

schmichael why don't you just swap the order we start the api and client?

I think that could work. The Agent.{Client,Server} getters for the API to access the client and server would need locks and setters since they would be nil until NewAgent returned them. The dependencies that the API has on Agent itself are harder to untangle: Agent.RPC( being one of the harder bits. Do we do a bufconn/builtinServer style block RPCs from the HTTP API until the Agent is created? We don't really gain much from swapping the order if we don't also detangle the circular dependencies. So it gets very complicated especially because both Client and Server have their own RPC implementations and methods of routing requests around and the HTTP API just grabs one and uses it.

@schmichael schmichael marked this pull request as ready for review February 4, 2023 01:15
@schmichael
Copy link
Member Author

Marked as ready for review despite some outstanding TODOs. All of them are fairly orthogonal to this and may be best suited to followup PRs anyway.

@shoenig
Copy link
Contributor

shoenig commented Feb 6, 2023

@schmichael can you look into these test failures?

=== RUN   TestAllocRunner_Restore_RunningTerminal
    alloc_runner_unix_test.go:130: 
        	Error Trace:	/home/runner/work/nomad/nomad/client/allocrunner/alloc_runner_unix_test.go:130
        	Error:      	"[{remove 16566603-2d92-a388-a6ac-becf18a65b32 web 2023-02-04 01:29:21.8589[730](https://github.com/hashicorp/nomad/actions/runs/4089214192/jobs/7051705048#step:4:731)08 +0000 UTC m=+0.094414412} {remove 16566603-2d92-a388-a6ac-becf18a65b32 group-web 2023-02-04 01:29:22.860232985 +0000 UTC m=+1.095674389} {remove 16566603-2d92-a388-a6ac-becf18a65b32 group-web 2023-02-04 01:29:22.860298886 +0000 UTC m=+1.095740290}]" should have 2 item(s), but has 3
# github.com/hashicorp/nomad/command/agent/consul_test [github.com/hashicorp/nomad/command/agent/consul.test]
Error: command/agent/consul/int_test.go:168:3: unknown field APIListenerRegistrar in struct literal of type taskrunner.Config

return nil
}

if err := os.Chmod(udsPath, 0o777); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our logic here for why this is safe in the context of the task dir is sound, but it's also going to result in us getting dinged by third-party auditors armed with sophisticated source code analysis tools like grep 😀

Can we leave a comment above here about why it's ok in hopes of reducing the amount of noise we get from those discussions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does it need to be world writable? Or can we apply the same logic made in https://github.com/hashicorp/nomad/blob/main/helper/users/lookup.go#L39

And does it need to be executable? some random sockets on my machine are srw-rw-rw-

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. I was going to say well we require auth anyway so who cares but...

These are the sorts of decisions we have to live with forever, and respecting Task.User when its set (as nomad_token does too) seems like a valuable road to go down.

So I followed #15755 and made this attempt to chown the socket to Task.User with 0o600 perms. Like #15755 it falls back to 0o666 if that fails (which it always will on Windows or as a non-root user).

client/config/config.go Outdated Show resolved Hide resolved
e2e/workload_id/input/api-auth.nomad.hcl Show resolved Hide resolved
client/allocrunner/taskrunner/api_hook.go Show resolved Hide resolved
return nil
}

if err := os.Chmod(udsPath, 0o777); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does it need to be world writable? Or can we apply the same logic made in https://github.com/hashicorp/nomad/blob/main/helper/users/lookup.go#L39

And does it need to be executable? some random sockets on my machine are srw-rw-rw-

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@apollo13
Copy link
Contributor

Hi there, I realize I am kinda late to the party here -- sorry for that. I was wondering if it is a good idea to expose the API sockets to tasks by default. I realize that the access to it requires authentication, but not all tasks are trusted and a pre-auth vulnerability in the API would allow "malicious" tasks to exploit this. From an security POV I'd very much prefer this to be opt in (it would keep the attack surface smaller).

@schmichael
Copy link
Member Author

That's a valid concern @apollo13! Mind opening a new issue? We discussed the ability to opt out internally but couldn't decide who should control it: Client Agent, jobspec, namespace, ...some combination of those to allow defaults in one place to be overridden in another place?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants