data/aws: use nlbs instead of elbs #594
Conversation
These variables and outputs aren't used anywhere and can be removed.
The existing version would sometimes panic when creating NLBs and was also missing a few options specific to NLBs. There are probably many other fixes and improvements that come with the newer provider.
This gives the installer the ability to tear down NLBs.
We've noticed an elevated rate of installation failures recently. The root cause appears to be 50-90 seconds of latency added to traffic going through the internal ELB on port 49500. This was causing Ignition's connections to time out, resulting in the machines never provisioning. AWS's NLBs don't seem to have this high latency, so we've decided to move over to them instead. With the move to NLBs, we also get the ability to add an individual health check for each port instead of just a single health check per load balancer. NLBs are also cheaper.

This commit drops support for ingress and the console. Since the console and router aren't currently configured correctly, nobody should notice that this is gone. It was easier to drop support in this commit than to keep plumbing it through the existing implementation knowing that it was going to have to change in the future. Once the router has a strategy for ingress, we'll re-add this functionality using the new NLBs.

This also drops support for the `<cluster-name>-k8s` DNS entry. We aren't aware of any consumers, and it was going to be tedious to keep this working.
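To make the per-port health check point concrete, here is a minimal, hypothetical Terraform sketch of the NLB shape described above. The resource names, subnet/VPC IDs, and the pairing of ports 6443 and 49500 are illustrative only and are not copied from the installer's actual modules:

    # Hypothetical sketch: one network load balancer where each forwarded
    # port gets its own target group, and therefore its own health check.

    resource "aws_lb" "api_internal" {
      name               = "example-api-int"
      load_balancer_type = "network"
      internal           = true
      subnets            = ["subnet-00000000000000000"]
    }

    resource "aws_lb_target_group" "api" {
      name     = "example-api"
      port     = 6443
      protocol = "TCP"
      vpc_id   = "vpc-00000000000000000"

      health_check {
        port     = 6443
        protocol = "TCP"
      }
    }

    resource "aws_lb_target_group" "services" {
      name     = "example-services"
      port     = 49500
      protocol = "TCP"
      vpc_id   = "vpc-00000000000000000"

      health_check {
        port     = 49500
        protocol = "TCP"
      }
    }

    resource "aws_lb_listener" "api" {
      load_balancer_arn = "${aws_lb.api_internal.arn}"
      port              = 6443
      protocol          = "TCP"

      default_action {
        type             = "forward"
        target_group_arn = "${aws_lb_target_group.api.arn}"
      }
    }

    resource "aws_lb_listener" "services" {
      load_balancer_arn = "${aws_lb.api_internal.arn}"
      port              = 49500
      protocol          = "TCP"

      default_action {
        type             = "forward"
        target_group_arn = "${aws_lb_target_group.services.arn}"
      }
    }

With a classic ELB, a single health check gates the whole load balancer; with the shape above, an unhealthy machine-config target on 49500 doesn't affect routing for the API on 6443, and vice versa.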
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhinavdahiya, crawford. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold
Waiting for a limit bump on NLBs from AWS. Also, @ironcladlou, does the removal of the ingress LBs block your work? We can add an ALB for router traffic if needed.
This change doesn't impact ingress today; routers are exposed by ELBs provisioned by the k8s service controller, and we haven't gotten to external app router DNS setup yet. Thanks for the heads up!
Still waiting on AWS, but the current limit is 20 NLBs, which gets us 10 concurrent clusters. I think that's good enough for now. @ironcladlou Okay, I'm going to go ahead and cancel the hold.
/hold cancel
The listeners may be private or public, but as far as the bootstrap and master modules are concerned, these are just lists of ARNs that need attaching. This partially unwinds changes from 16dfbb3 (data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594), before which the bootstrap module used a unified list of load balancers while the master module made private/public distinctions. While I'm touching variables.tf, I've alphabetized the dns_server_ip and kubeconfig_content entries as well.
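A rough sketch of the unified-list idea, under the assumption that the attaching happens via target-group attachments; the variable and resource names here are hypothetical, not the installer's:

    # The module only sees a flat list of target-group ARNs to attach to;
    # it doesn't care whether the listeners in front are private or public.
    variable "target_group_arns" {
      type        = "list"
      default     = []
      description = "ARNs of the target groups the instance should be attached to."
    }

    variable "instance_id" {
      type        = "string"
      description = "ID of the instance to register."
    }

    resource "aws_lb_target_group_attachment" "attachments" {
      count            = "${length(var.target_group_arns)}"
      target_group_arn = "${element(var.target_group_arns, count.index)}"
      target_id        = "${var.instance_id}"
    }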
It's the default. The old parameter was deprecated before hashicorp/terraform-provider-aws@1cc81c92 (docs/data-source/aws_region: Remove now deprecated current argument, 2018-02-09, v1.9.0), and we're pinning 1.39.0 since ac5aeed (data/aws: bump aws provider, 2018-11-01, openshift#594). Changing this avoids:

    $ openshift-install --log-level=debug create cluster
    ...
    - Downloading plugin for provider "aws" (1.39.0)...
    ...
    Warning: module.vpc.data.aws_region.current: "current": [DEPRECATED] Defaults to current provider region if no other filtering is enabled
    ...
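In Terraform terms, the change is just dropping the deprecated argument; a minimal sketch (the data source name "current" is arbitrary):

    # Deprecated shape that triggers the warning above:
    # data "aws_region" "current" {
    #   current = true
    # }

    # With no filter arguments, the data source already defaults to the
    # provider's region, so the argument can simply be dropped:
    data "aws_region" "current" {}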
As suggested by Dani Comnea [1]. When we switched to network load balancers in 16dfbb3 (data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594), we replaced things like:

    resource "aws_elb" "api_internal" {
      ...
      health_check {
        healthy_threshold   = 2
        unhealthy_threshold = 2
        timeout             = 3
        target              = "SSL:6443"
        interval            = 5
      }
      ...
    }

with:

    resource "aws_lb_target_group" "api_internal" {
      ...
      health_check {
        healthy_threshold   = 3
        unhealthy_threshold = 3
        interval            = 10
        port                = 6443
        protocol            = "TCP"
      }
    }

This resulted in logs like [2]:

    [core@ip-10-0-11-88 ~]$ sudo crictl ps
    CONTAINER ID    IMAGE                                                                                                                                  CREATED         STATE     NAME                    ATTEMPT
    1bf4870ea6eea   registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago   Running   machine-config-server   0
    [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
    I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
    I1215 20:23:07.088554       1 api.go:54] launching server
    I1215 20:23:07.088571       1 api.go:54] launching server
    2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
    2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
    2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
    ...

when the health check opens a TCP connection (in this case to the machine-config server on 49500) and then hangs up without completing the TLS handshake.

Network load balancers [3,4] do not have an analog to the classic load balancers' SSL protocol [5,6,7], so we're using HTTPS. There's some discussion in [8] about the best way to perform unauthenticated liveness checks on the Kubernetes API server that suggests 401s are possible in some configurations. Checking against a recent cluster:

    $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
    curl: (60) Peer's Certificate issuer is not recognized.
    More details here: http://curl.haxx.se/docs/sslcerts.html

    curl performs SSL certificate verification by default, using a "bundle"
    of Certificate Authority (CA) public keys (CA certs). If the default
    bundle file isn't adequate, you can specify an alternate file
    using the --cacert option.
    If this HTTPS server uses a certificate signed by a CA represented in
    the bundle, the certificate verification probably failed due to a
    problem with the certificate (it might be expired, or the name might
    not match the domain name in the URL).
    If you'd like to turn off curl's verification of the certificate, use
    the -k (or --insecure) option.
    $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
    HTTP/1.1 200 OK
    Cache-Control: no-store
    Date: Sun, 16 Dec 2018 06:18:23 GMT
    Content-Length: 2
    Content-Type: text/plain; charset=utf-8

    ok

I don't know if the network load balancer health checks care about certificate validity or not. I guess we'll see how CI testing handles this.

Ignition is only exposed inside the cluster, and checking that from a master node:

    [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
    curl: (60) Peer's Certificate issuer is not recognized.
    More details here: http://curl.haxx.se/docs/sslcerts.html

    curl performs SSL certificate verification by default, using a "bundle"
    of Certificate Authority (CA) public keys (CA certs). If the default
    bundle file isn't adequate, you can specify an alternate file
    using the --cacert option.
    If this HTTPS server uses a certificate signed by a CA represented in
    the bundle, the certificate verification probably failed due to a
    problem with the certificate (it might be expired, or the name might
    not match the domain name in the URL).
    If you'd like to turn off curl's verification of the certificate, use
    the -k (or --insecure) option.
    [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
    HTTP/1.1 404 Not Found
    Content-Type: text/plain; charset=utf-8
    X-Content-Type-Options: nosniff
    Date: Sun, 16 Dec 2018 06:30:14 GMT
    Content-Length: 19

    404 page not found

So we're checking the new /healthz from openshift/machine-config-operator@d0a7ae21 (server: Add /healthz, 2019-01-04, openshift/machine-config-operator#267) instead.

Unfortunately, setting matcher [9] is not allowed for network load balancers (e.g. see [10,11]). Setting it leads to errors like:

    ERROR * module.vpc.aws_lb_target_group.api_internal: 1 error occurred:
    ERROR * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol
    ERROR   status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c

So I've left it unset here, and we'll just hope the 401s don't start happening.

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
[9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher
[10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612
[11]: hashicorp/terraform-provider-aws#2708 (comment)
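Putting the pieces of that commit message together, the resulting health check would look roughly like the sketch below. Resource names, the VPC ID, and the exact threshold/interval values are illustrative rather than copied from the installer:

    resource "aws_lb_target_group" "api_internal" {
      name     = "example-api-int"
      port     = 6443
      protocol = "TCP"
      vpc_id   = "vpc-00000000000000000"

      health_check {
        healthy_threshold   = 3
        unhealthy_threshold = 3
        interval            = 10
        port                = 6443
        protocol            = "HTTPS"
        path                = "/healthz"
        # No `matcher` here: custom matchers are rejected for target groups
        # attached to network load balancers, as the error quoted above shows.
      }
    }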
These were changed from two to three in 16dfbb3 (data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594):

    $ git show 16dfbb3 | grep _threshold | sort | uniq
    -    healthy_threshold = 2
    -    # healthy_threshold = 2
    +    healthy_threshold = 3
    -    unhealthy_threshold = 2
    -    # unhealthy_threshold = 2
    +    unhealthy_threshold = 3

Alex doesn't remember intentionally making that change, and the lower thresholds will hopefully help mitigate some issues with continued connection attempts to API servers which are in the process of shutting down.
It is not valid for network load balancers [1]. We've been setting this in our Terraform config since 0a96415 (modules/aws: configure ELB idle timeout, coreos/tectonic-installer#725) when we were using classic load balancers and should have dropped it in 16dfbb3 (data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594). [1]: hashicorp/terraform-provider-aws@d25a227#diff-f4b0dbdc7e3eede6ba70cd286c834f37R78
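For contrast, a hypothetical excerpt of the old classic-ELB shape where idle_timeout was meaningful (the name, subnet, ports, and the 3600-second value are illustrative); on an aws_lb with load_balancer_type = "network", the attribute does not apply, so it is simply omitted:

    resource "aws_elb" "api_internal" {
      name         = "example-api-int"
      internal     = true
      subnets      = ["subnet-00000000000000000"]
      idle_timeout = 3600  # meaningful for classic/application load balancers only

      listener {
        instance_port     = 6443
        instance_protocol = "tcp"
        lb_port           = 6443
        lb_protocol       = "tcp"
      }
    }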