Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

orchestratord TLS for balancerd and environmentd #30444

Conversation

alex-hunt-materialize
Copy link
Contributor

@alex-hunt-materialize alex-hunt-materialize commented Nov 12, 2024

Add support in orchestratord to configure TLS for balancerd and environmentd.
This does not yet configure TLS for clusterd or the console, nor does it configure MTLS.

Motivation

Tips for reviewer

The cert-manager CRDs were generated with Kopium, using src/k8s/src/gen/generate-kopium.sh in the cloud repo. The exact output in this PR is based on some changes to that script in this draft PR.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@alex-hunt-materialize alex-hunt-materialize force-pushed the orchestratord_internal_tls_toggle branch 19 times, most recently from 8bb8961 to 2874f38 Compare November 19, 2024 13:42
@alex-hunt-materialize alex-hunt-materialize force-pushed the orchestratord_internal_tls_toggle branch 2 times, most recently from 76d791d to 899769a Compare November 20, 2024 11:23
@alex-hunt-materialize alex-hunt-materialize marked this pull request as ready for review November 20, 2024 12:15
@alex-hunt-materialize alex-hunt-materialize requested a review from a team as a code owner November 20, 2024 12:15
Copy link
Contributor

@pH14 pH14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense -- will take another look when we're closer to merging, but the plumbing seems right to me for envd+balancerd

misc/helm-charts/testing/environmentd.yaml Outdated Show resolved Hide resolved
@@ -156,7 +187,8 @@ fn create_console_deployment_object(
..Default::default()
}];

if config.enable_tls {
if mz.spec.console_external_certificate_spec.is_some() {
// TODO define volumes, volume_mounts, and any arg changes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to wire in the console certs in this PR? or leave for a follow up

Copy link
Contributor Author

@alex-hunt-materialize alex-hunt-materialize Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to leave it for a follow up. I don't know how the console works, and didn't want to block this on me figuring that out.

Console TLS support is not required for parity with the existing SaaS offering, which was the goal of this PR.

@alex-hunt-materialize alex-hunt-materialize marked this pull request as draft November 26, 2024 13:25
@alex-hunt-materialize alex-hunt-materialize force-pushed the orchestratord_internal_tls_toggle branch 2 times, most recently from fea4772 to 864e8cd Compare November 26, 2024 14:09
@alex-hunt-materialize alex-hunt-materialize marked this pull request as ready for review November 26, 2024 15:45
Comment the cert specs in the testing environmentd.yaml.
Having them present would require cert-manager in testing clusters,
which is not guaranteed, and doesn't have documentation.
Copy link
Contributor

@bobbyiliev bobbyiliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notting one thing down while testing this out in different scenarios, I noticed so far is that the console container would crash if the balancerd is not ready yet:

2024/11/28 18:31:37 [emerg] 1#1: host not found in upstream "mzghw4549ri2-balancerd.materialize-environment.svc.cluster.local" in /etc/nginx/conf.d/default.conf:28
nginx: [emerg] host not found in upstream "mzghw4549ri2-balancerd.materialize-environment.svc.cluster.local" in /etc/nginx/conf.d/default.conf:28

First time I'm seeing this, though not sure if this is related to this PR specifically or if it has always been there but might be best to wait for the console creation until the balancerd is ready.

Not a blocker for this PR but might be an issue for the nighty tests that @def- has set up.

@bobbyiliev
Copy link
Contributor

Not sure if we are too worried about this edge case but, one more thing that I noticed is that if you were to deploy Materialize with no TLS configured, and then try to set TLS, upgrade the operator and the CRDs, the console pods do not get recreated during the instance rolling upgrade, so they end up not being able to connect to balancerd:

127.0.0.1 - - [28/Nov/2024:20:08:46 +0000] "POST /api/sql?options=.[..] HTTP/1.1" 502 559 "http://localhost:8080/regions/local-flexible-deployment/shell" "Mozilla/5.0 [..]" "-"

After manually deleting and recreating the console pods it works well.

Copy link
Contributor

@bobbyiliev bobbyiliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tested this on a fresh new EKS cluster and besides the two comments above, everything seems to be working as expected!

Wrote a simple guide while testing this here.

@alex-hunt-materialize
Copy link
Contributor Author

Not sure if we are too worried about this edge case but, one more thing that I noticed is that if you were to deploy Materialize with no TLS configured, and then try to set TLS, upgrade the operator and the CRDs, the console pods do not get recreated during the instance rolling upgrade, so they end up not being able to connect to balancerd:

127.0.0.1 - - [28/Nov/2024:20:08:46 +0000] "POST /api/sql?options=.[..] HTTP/1.1" 502 559 "http://localhost:8080/regions/local-flexible-deployment/shell" "Mozilla/5.0 [..]" "-"

After manually deleting and recreating the console pods it works well.

I'm going to look into this, but I don't think it needs to block this PR. We might have an order of operations issue in orchestratord.

@alex-hunt-materialize alex-hunt-materialize merged commit 7af0a66 into MaterializeInc:main Dec 2, 2024
79 checks passed
@alex-hunt-materialize alex-hunt-materialize deleted the orchestratord_internal_tls_toggle branch December 2, 2024 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants