Task API via Unix Domain Socket #15864

schmichael · 2023-01-25T01:20:58Z

The goal of this PR is to provide universal workload access to Nomad's HTTP API via Unix Domain Sockets (UDS).

Dynamic Node Metadata #15844 is one potential use case.

Inspired by e90807e

TODO

docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
Unit tests for auth middleware
Caching for auth middleware
Rate limiting on negative lookups for auth middleware

This changeset allows Workload Identities to authenticate to all the RPCs that support HTTP API endpoints, for use with PR #15864.

* Extends the work done for pre-forwarding authentication to all RPCs that support an HTTP API endpoint.
* Consolidates the auth helpers used by the CSI, Service Registration, and Node endpoints that are currently used to support both tokens and client secrets.

Intentionally excluded from this changeset:

* The Variables endpoint still has custom handling because of the implicit policies. Ideally we'll figure out an efficient way to resolve those into real policies and then we can get rid of that custom handling.
* The RPCs that don't currently support auth tokens (i.e. those that don't support HTTP endpoints) have not been updated with the new pre-forwarding auth. We'll be doing this under a separate PR to support RPC rate metrics.

client/allocrunner/taskrunner/task_runner.go

shoenig · 2023-01-30T16:24:07Z

command/agent/agent.go

+	// Initialize builtin API server here for use in the client, but it won't
+	// accept connections until the HTTP servers are created.
+	a.builtinServer = newBuiltinAPI()
+	conf.APIListenerRegistrar = a.builtinServer


Is this necessary? The implementation of builtinAPI is clever and unfamiliar, which is usually a bad thing.

tl;dr I would love a better way of doing this.

Is this necessary?

Kind of? Basically giving the Client access to the Agent HTTP server creates a circular dependency that needs solving somehow:

NewAgent creates Client

agent.Command creates HTTP servers which accept the Agent as a dependency

That worked out fine when the Client was oblivious to the existence of the HTTP API.

When we needed access to the HTTP API for template support we used bufconn to create a Dialer connected to a Listener, returned the Listener to Agent, and Agent would slap the HTTP API on it. bufconn blocks Dials until the Listener accepts them.

Unix Domain Sockets (UDS) are listeners themselves though, so we couldn't just reuse the Dialer (unless we wanted to create 1 Accept goroutine per Task and 2 proxy goroutines for every UDS connection to proxy reads and writes from the UDS to the bufconn).

So I more or less copied bufconn except instead of blocking a Dial attempt on its corresponding Accept, I block a Register(listener) attempt on the HTTP Server existing.

schmichael why don't you just swap the order we start the api and client?

I think that could work. The Agent.{Client,Server} getters that the API uses to access the client and server would need locks and setters, since they would be nil until NewAgent returned them. The dependencies that the API has on Agent itself are harder to untangle, Agent.RPC being one of the harder bits. Would we block RPCs from the HTTP API, bufconn/builtinServer style, until the Agent is created? We don't really gain much from swapping the order if we don't also untangle the circular dependencies. It gets very complicated, especially because both Client and Server have their own RPC implementations and methods of routing requests around, and the HTTP API just grabs one and uses it.

client/allocrunner/taskrunner/api_hook.go

schmichael · 2023-02-04T01:16:27Z

Marked as ready for review despite some outstanding TODOs. All of them are fairly orthogonal to this and may be best suited to followup PRs anyway.

shoenig · 2023-02-06T14:09:14Z

@schmichael can you look into these test failures?

=== RUN   TestAllocRunner_Restore_RunningTerminal
    alloc_runner_unix_test.go:130: 
        	Error Trace:	/home/runner/work/nomad/nomad/client/allocrunner/alloc_runner_unix_test.go:130
        	Error:      	"[{remove 16566603-2d92-a388-a6ac-becf18a65b32 web 2023-02-04 01:29:21.858973008 +0000 UTC m=+0.094414412} {remove 16566603-2d92-a388-a6ac-becf18a65b32 group-web 2023-02-04 01:29:22.860232985 +0000 UTC m=+1.095674389} {remove 16566603-2d92-a388-a6ac-becf18a65b32 group-web 2023-02-04 01:29:22.860298886 +0000 UTC m=+1.095740290}]" should have 2 item(s), but has 3

# github.com/hashicorp/nomad/command/agent/consul_test [github.com/hashicorp/nomad/command/agent/consul.test]
Error: command/agent/consul/int_test.go:168:3: unknown field APIListenerRegistrar in struct literal of type taskrunner.Config

tgross · 2023-02-06T14:25:04Z

client/allocrunner/taskrunner/api_hook.go

+		return nil
+	}
+
+	if err := os.Chmod(udsPath, 0o777); err != nil {


I think our logic here for why this is safe in the context of the task dir is sound, but it's also going to result in us getting dinged by third-party auditors armed with sophisticated source code analysis tools like grep 😀

Can we leave a comment above here about why it's ok in hopes of reducing the amount of noise we get from those discussions?

Also, does it need to be world writable? Or can we apply the same logic made in https://github.com/hashicorp/nomad/blob/main/helper/users/lookup.go#L39

And does it need to be executable? some random sockets on my machine are srw-rw-rw-

Good points. I was going to say well we require auth anyway so who cares but...

These are the sorts of decisions we have to live with forever, and respecting Task.User when it's set (as nomad_token does too) seems like a valuable road to go down.

So I followed #15755 and made this attempt to chown the socket to Task.User with 0o600 perms. Like #15755 it falls back to 0o666 if that fails (which it always will on Windows or as a non-root user).

client/config/config.go

e2e/workload_id/input/api-auth.nomad.hcl

client/allocrunner/taskrunner/api_hook.go

shoenig · 2023-02-06T14:58:40Z

client/allocrunner/taskrunner/api_hook.go

+		return nil
+	}
+
+	if err := os.Chmod(udsPath, 0o777); err != nil {


Also, does it need to be world writable? Or can we apply the same logic made in https://github.com/hashicorp/nomad/blob/main/helper/users/lookup.go#L39

And does it need to be executable? some random sockets on my machine are srw-rw-rw-

Co-authored-by: Seth Hoenig <[email protected]>

tgross

LGTM

apollo13 · 2023-03-10T20:15:46Z

Hi there, I realize I am kinda late to the party here -- sorry for that. I was wondering if it is a good idea to expose the API sockets to tasks by default. I realize that access to them requires authentication, but not all tasks are trusted and a pre-auth vulnerability in the API would allow "malicious" tasks to exploit this. From a security POV I'd very much prefer this to be opt-in (it would keep the attack surface smaller).

schmichael · 2023-03-10T21:21:05Z

That's a valid concern @apollo13! Mind opening a new issue? We discussed the ability to opt out internally but couldn't decide who should control it: Client Agent, jobspec, namespace, ...some combination of those to allow defaults in one place to be overridden in another place?

This was referenced Jan 25, 2023

WI: allow workloads to use RPCs associated with HTTP API #15870

Merged

Workload Identity #15614

Closed

tgross reviewed Jan 25, 2023

client/allocrunner/taskrunner/task_runner.go (outdated)

shoenig reviewed Jan 30, 2023

shoenig reviewed Jan 31, 2023

client/allocrunner/taskrunner/api_hook.go (outdated)

schmichael force-pushed the f-api-uds branch from 0cf307f to 439fdf3 Compare February 2, 2023 19:04

wip task api via unix domain socket

7e51b69

schmichael force-pushed the f-api-uds branch from 439fdf3 to 7e51b69 Compare February 2, 2023 19:08

make whoami always validate jwts; fix race between api serve & context

228aef9

cleanup socket handling code

3f6b5e1

shorten path and add test

f75c7d4

windows will be the end of me

6c71f18

stub out task api in tests

e41683f

fix panic on task restart

b80763b

schmichael added 2 commits February 3, 2023 16:08

must make task api uds world writable

f222ad1

task api tests

ee0e423

remove redundant return

a0188c5

schmichael added 2 commits February 3, 2023 16:58

fix panic in consul integration test

66b68e8

changelog

49ab5cb

revert initial token hack

a5adab9

cleanup this mess

4114588

schmichael marked this pull request as ready for review February 4, 2023 01:15

tgross reviewed Feb 6, 2023

shoenig reviewed Feb 6, 2023

schmichael and others added 4 commits February 6, 2023 10:29

hclfmt

ba34c9d

opportunistically use least privileges for task api socket

fc8f5b8

Fix typo in client/config/config.go

e137a8c

Co-authored-by: Seth Hoenig <[email protected]>

fix test

e9d02db

tgross approved these changes Feb 6, 2023

schmichael merged commit 9bab96e into main Feb 6, 2023

schmichael deleted the f-api-uds branch February 6, 2023 19:31

jrasell mentioned this pull request Feb 7, 2023

agent: fix agent HTTP server audit event implementation access. #16076

Merged

tgross mentioned this pull request Feb 13, 2023

"Your IP is issuing too many concurrent connections" with server UI behind proxy #15471

Closed

tgross mentioned this pull request Feb 22, 2023

Nomad v1.5.0-beta.1 panic: SetServer called twice. #16239

Closed

apollo13 mentioned this pull request Mar 11, 2023

[nomad-1.5] Nomad API socket exposed by default leading to a larger attack surface. #16436

Open
