Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connect: configure envoy to support multiple sidecars in the same alloc #6816

Merged
merged 4 commits into from
Jan 10, 2020

Conversation

nickethier
Copy link
Member

There were two configuration issues with envoy preventing multiple proxy sidecars to run in the same alloc. The first was the locking of shared memory used by envoy for hot reloading. Since Connect doesn't require hot reloading this feature is disabled with the --disable-hot-restart flag at runtime.

The second issue was a collision with the admin port of the envoy proxy. By default, Consul generates a bootstrap config with the admin api bound to localhost:19000. Since sidecars share the loopback interfaces, only one could start. To solve this we simply add the index of the task to 19000 to derive a unique port number. Note this only works for non host netns allocs, and a more robust solution will eventually be needed to support connect on the host netns.

@nickethier nickethier added this to the 0.10.3 milestone Dec 6, 2019
@@ -18,6 +18,8 @@ import (

var _ interfaces.TaskPrestartHook = &envoyBootstrapHook{}

const envoyBaseAdminPort = 19000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this be exposed as an env var? we would want to scrape data from envoy admin port for all our allocs, and need to dynamically discover these endpoints

@@ -76,19 +78,25 @@ func (h *envoyBootstrapHook) Prestart(ctx context.Context, req *interfaces.TaskP
// the host netns.
grpcAddr := "unix://" + allocdir.AllocGRPCSocket

// Envoy runs an administrative API on the loopback interface. If multiple sidecars
// are running, the bind addresses need to have unique ports.
// TODO: support running in host netns, using freeport to find available port
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's something clever we can do to let the OS pick the port, and then lookup what port got picked after the sidecar process launch. Even the updated freeport is a fingers-crossed kind of thing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's remove this TODO as I don't think freeport would be a significant improvement. Since this is only on localhost within a netns the only potential conflict is with other services using localhost in the netns (likely the task the sidecar is proxying for). We know the port we're proxying to and can validate it won't conflict.

That leaves the only remaining potential conflict as internal-only ports on other tasks in the netns which should be rare. I think if we document this it will hopefully prevent errors.

Does Envoy exit and log with a reasonable error message when this condition is hit so at least people stand a chance of diagnosing it for themselves?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes envoy will log a bind error to stderr.

However this todo was specifically for when we support connect without network namespaces (host ns). We won't be able to use the current alloc local incrementing port as all allocs in host networking mode would be sharing the loopback.

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's write a test asserting 2 sidecars in the same alloc works. This is the only blocker.

@@ -76,19 +78,25 @@ func (h *envoyBootstrapHook) Prestart(ctx context.Context, req *interfaces.TaskP
// the host netns.
grpcAddr := "unix://" + allocdir.AllocGRPCSocket

// Envoy runs an administrative API on the loopback interface. If multiple sidecars
// are running, the bind addresses need to have unique ports.
// TODO: support running in host netns, using freeport to find available port
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's remove this TODO as I don't think freeport would be a significant improvement. Since this is only on localhost within a netns the only potential conflict is with other services using localhost in the netns (likely the task the sidecar is proxying for). We know the port we're proxying to and can validate it won't conflict.

That leaves the only remaining potential conflict as internal-only ports on other tasks in the netns which should be rare. I think if we document this it will hopefully prevent errors.

Does Envoy exit and log with a reasonable error message when this condition is hit so at least people stand a chance of diagnosing it for themselves?

}
}
return fmt.Sprintf("localhost:%d", port)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this in the job admission controller we can set the port as a group meta variable and interpolate it into the command. This would allow @jippi to interpolate it into the environment (or templates or command args) of other tasks in the same group.

We could also validate the port doesn't conflict with the port specified on the connect service (or local_service_port) or local_bind_ports for upstreams.

That's an awful lot more work, so we could write a test, get this merged, and then migrate to the admission controller in 0.11.x or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and injected an env var NOMAD_ENVOY_ADMIN_ADDR_<service> in from the hook env, let me know what you think. I don't love the idea of polluting the alloc environment with another variable but this seems reasonable until we can rework it into the admission controller?

@DhashS
Copy link

DhashS commented Jan 5, 2020

We (@sibblegp) have also tried this, since we thought that #6646 was the blocker. We've been using the jobfile in that issue to test

Repeated port binding to 19000 and passing the --base-id parameter always equal to the other allocs were some things we found.

This PR adds support for non-constant binding to port 19000, but I don't think it addresses the base-id parameter.

For reference, one proxy launches and either fails with unable to bind domain socket with id=0 (see --base-id option) or a repeated bind to port 19000

@nickethier
Copy link
Member Author

@DhashS the --base-id option is used for hot-reloading purposes. Since we never reload the config (consul configures envoy through its APIs over the admin port) we actually don't need hot reloading so this PR disables it and should address the bind error.

Have you had a chance to try this branch with the fix to see if it works?

@nickethier nickethier requested a review from schmichael January 7, 2020 03:02
@sibblegp
Copy link

sibblegp commented Jan 8, 2020

@nickethier This branch does appear to solve our issue. Looking forward to 0.10.3

// TODO: support running in host netns, using freeport to find available port
envoyAdminBind := buildEnvoyAdminBind(h.alloc, req.Task.Name)
resp.Env = map[string]string{
helper.CleanEnvVar(envoyAdminBindEnvPrefix+serviceName, '_'): envoyAdminBind,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, but in the future let's try to return things as variables that people can explicitly interpolate into their environment using an env {} stanza. Unfortunately this takes hook changes, so this is fine for today.

break
}
}
return fmt.Sprintf("localhost:%d", port)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the addr instead of just the port? Is there potential for this to bite us if we make the bind address/interface configurable someday? Seems like we can't setup a test today that would break in the future.

That being said I can't think of a reason you'd bind this to anything but localhost, so this is fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah since this can expose tls certificates the more restricted the better IMO

@@ -25,6 +25,7 @@ var (
"args": []interface{}{
"-c", structs.EnvoyBootstrapPath,
"-l", "${meta.connect.log_level}",
"--disable-hot-restart",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, we should probably comment this. The git blame will be useful, so nbd

@nickethier nickethier merged commit 4b6f9e8 into master Jan 10, 2020
@nickethier nickethier deleted the b-multiple-envoy branch January 10, 2020 04:25
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants