-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connect: configure envoy to support multiple sidecars in the same alloc #6816
Conversation
@@ -18,6 +18,8 @@ import ( | |||
|
|||
var _ interfaces.TaskPrestartHook = &envoyBootstrapHook{} | |||
|
|||
const envoyBaseAdminPort = 19000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this be exposed as an env var? we would want to scrape data from envoy admin port for all our allocs, and need to dynamically discover these endpoints
@@ -76,19 +78,25 @@ func (h *envoyBootstrapHook) Prestart(ctx context.Context, req *interfaces.TaskP | |||
// the host netns. | |||
grpcAddr := "unix://" + allocdir.AllocGRPCSocket | |||
|
|||
// Envoy runs an administrative API on the loopback interface. If multiple sidecars | |||
// are running, the bind addresses need to have unique ports. | |||
// TODO: support running in host netns, using freeport to find available port |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's something clever we can do to let the OS pick the port, and then lookup what port got picked after the sidecar process launch. Even the updated freeport is a fingers-crossed kind of thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's remove this TODO as I don't think freeport would be a significant improvement. Since this is only on localhost within a netns the only potential conflict is with other services using localhost in the netns (likely the task the sidecar is proxying for). We know the port we're proxying to and can validate it won't conflict.
That leaves the only remaining potential conflict as internal-only ports on other tasks in the netns which should be rare. I think if we document this it will hopefully prevent errors.
Does Envoy exit and log with a reasonable error message when this condition is hit so at least people stand a chance of diagnosing it for themselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes envoy will log a bind error to stderr.
However this todo was specifically for when we support connect without network namespaces (host ns). We won't be able to use the current alloc local incrementing port as all allocs in host networking mode would be sharing the loopback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's write a test asserting 2 sidecars in the same alloc works. This is the only blocker.
@@ -76,19 +78,25 @@ func (h *envoyBootstrapHook) Prestart(ctx context.Context, req *interfaces.TaskP | |||
// the host netns. | |||
grpcAddr := "unix://" + allocdir.AllocGRPCSocket | |||
|
|||
// Envoy runs an administrative API on the loopback interface. If multiple sidecars | |||
// are running, the bind addresses need to have unique ports. | |||
// TODO: support running in host netns, using freeport to find available port |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's remove this TODO as I don't think freeport would be a significant improvement. Since this is only on localhost within a netns the only potential conflict is with other services using localhost in the netns (likely the task the sidecar is proxying for). We know the port we're proxying to and can validate it won't conflict.
That leaves the only remaining potential conflict as internal-only ports on other tasks in the netns which should be rare. I think if we document this it will hopefully prevent errors.
Does Envoy exit and log with a reasonable error message when this condition is hit so at least people stand a chance of diagnosing it for themselves?
} | ||
} | ||
return fmt.Sprintf("localhost:%d", port) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do this in the job admission controller we can set the port as a group meta variable and interpolate it into the command. This would allow @jippi to interpolate it into the environment (or templates or command args) of other tasks in the same group.
We could also validate the port doesn't conflict with the port specified on the connect service (or local_service_port
) or local_bind_ports
for upstreams.
That's an awful lot more work, so we could write a test, get this merged, and then migrate to the admission controller in 0.11.x or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and injected an env var NOMAD_ENVOY_ADMIN_ADDR_<service>
in from the hook env, let me know what you think. I don't love the idea of polluting the alloc environment with another variable but this seems reasonable until we can rework it into the admission controller?
We (@sibblegp) have also tried this, since we thought that #6646 was the blocker. We've been using the jobfile in that issue to test Repeated port binding to 19000 and passing the --base-id parameter always equal to the other allocs were some things we found. This PR adds support for non-constant binding to port 19000, but I don't think it addresses the base-id parameter. For reference, one proxy launches and either fails with |
@DhashS the Have you had a chance to try this branch with the fix to see if it works? |
8376669
to
8bca0e0
Compare
@nickethier This branch does appear to solve our issue. Looking forward to 0.10.3 |
// TODO: support running in host netns, using freeport to find available port | ||
envoyAdminBind := buildEnvoyAdminBind(h.alloc, req.Task.Name) | ||
resp.Env = map[string]string{ | ||
helper.CleanEnvVar(envoyAdminBindEnvPrefix+serviceName, '_'): envoyAdminBind, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now, but in the future let's try to return things as variables that people can explicitly interpolate into their environment using an env {}
stanza. Unfortunately this takes hook changes, so this is fine for today.
break | ||
} | ||
} | ||
return fmt.Sprintf("localhost:%d", port) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the addr instead of just the port? Is there potential for this to bite us if we make the bind address/interface configurable someday? Seems like we can't setup a test today that would break in the future.
That being said I can't think of a reason you'd bind this to anything but localhost, so this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah since this can expose tls certificates the more restricted the better IMO
@@ -25,6 +25,7 @@ var ( | |||
"args": []interface{}{ | |||
"-c", structs.EnvoyBootstrapPath, | |||
"-l", "${meta.connect.log_level}", | |||
"--disable-hot-restart", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, we should probably comment this. The git blame will be useful, so nbd
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
There were two configuration issues with envoy preventing multiple proxy sidecars to run in the same alloc. The first was the locking of shared memory used by envoy for hot reloading. Since Connect doesn't require hot reloading this feature is disabled with the
--disable-hot-restart
flag at runtime.The second issue was a collision with the admin port of the envoy proxy. By default, Consul generates a bootstrap config with the admin api bound to
localhost:19000
. Since sidecars share the loopback interfaces, only one could start. To solve this we simply add the index of the task to 19000 to derive a unique port number. Note this only works for non host netns allocs, and a more robust solution will eventually be needed to support connect on the host netns.