Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-14998: Only use RoleARN for Route53 API #951

Merged
merged 1 commit into from
Jun 27, 2023

Conversation

gcs278
Copy link
Contributor

@gcs278 gcs278 commented Jun 20, 2023

To support Shared VPC, we split the DNS client into public and private providers, the private using the RoleARN (Account A) and the public using the default (Account B). However, the RoleARN only provides API access for Account A's Route53 service, not the ability to describe Account B's ELBs. The ingress-operator DNS Logic uses the AWS API to describe the ELBs in order to determine the Zone ID. This fix isolates the RoleARN to only be used with Route53 API services.

pkg/dns/aws/dns.go: Create a separate Route53 session object that uses the RoleARN when provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 20, 2023
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-14998, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

To support Shared VPC, we split the DNS client into public and private providers, the private using the RoleARN (Account A) and the public using the default (Account B). However, the RoleARN only provides API access for Account A's Route53 service, not the ability to describe Account B's ELBs. This fix isolates the RoleARN to only be used with Route53 API services.

pkg/dns/aws/dns.go: Create a separate Route53 session object that uses the RoleARN when provided.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 20, 2023
@openshift-ci openshift-ci bot requested review from miheer and rfredette June 20, 2023 20:16
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-14998, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

To support Shared VPC, we split the DNS client into public and private providers, the private using the RoleARN (Account A) and the public using the default (Account B). However, the RoleARN only provides API access for Account A's Route53 service, not the ability to describe Account B's ELBs. The ingress-operator DNS Logic uses the AWS API to describe the ELBs in order to determine the Zone ID. This fix isolates the RoleARN to only be used with Route53 API services.

pkg/dns/aws/dns.go: Create a separate Route53 session object that uses the RoleARN when provided.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -157,6 +154,14 @@ func NewProvider(config Config, operatorReleaseVersion string) (*Provider, error
}
}

// When RoleARN is provided, make a copy of the Route53 session and configure it to use RoleARN.
// RoleARN is intended to only provide access to another account's Route53 service, not for ELBs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// RoleARN is intended to only provide access to another account's Route53 service, not for ELBs.
// RoleARN is intended to only provide access to another account's Route 53 service, not for ELBs.

Can you add a similar comment to the RoleARN field in the Config struct definition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// RoleARN is intended to only provide access to another account's Route53 service, not for ELBs.
sessRoute53 := sess
if config.RoleARN != "" {
sessRoute53 := sess.Copy(sess.Config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need to specify sess.Config.

// Copy creates and returns a copy of the current Session, copying the config
// and handlers. If any additional configs are provided they will be merged
// on top of the Session's copied config.

Suggested change
sessRoute53 := sess.Copy(sess.Config)
sessRoute53 := sess.Copy()

I'm not certain, but you might be able to hang this config on r53Config below instead:

	r53Config := aws.NewConfig()
	// When RoleARN is provided, make a copy of the Route53 session and configure it to use RoleARN.
	// RoleARN is intended to only provide access to another account's Route53 service, not for ELBs.
	if config.RoleARN != "" {
		r53Config.WithCredentials(stscreds.NewCredentials(sess, config.RoleARN))
	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done & will keep pattern in my back pocket.

Copy link
Contributor Author

@gcs278 gcs278 Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miciah Turns out, it appears that i need sess.Config, otherwise, it shared vpc Route53 creation breaks completely.

Added sessRoute53 := sess.Copy(sess.Config) back, and in my smoke testing, it worked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I tested again, and it didn't work. I think I might of ran an invalid test, possibly the service hanging around while I created a new ingresscontroller caused me to get a false positive by somehow re-using the existing service...not sure, will update when I find what works.

@Miciah
Copy link
Contributor

Miciah commented Jun 20, 2023

/approve
/lgtm

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 20, 2023
@gcs278
Copy link
Contributor Author

gcs278 commented Jun 20, 2023

/hold
Asking QE to help verify this fix CC: @yunjiang29

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2023
@gcs278
Copy link
Contributor Author

gcs278 commented Jun 20, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 20, 2023
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-14998, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from melvinjoseph86 June 20, 2023 20:52
@Miciah
Copy link
Contributor

Miciah commented Jun 21, 2023

e2e-aws-ovn failed because Undiagnosed panic detected in pod failed:

{  pods/openshift-ovn-kubernetes_ovnkube-master-tqzxx_ovnkube-master_previous.log.gz:E0620 21:33:48.429234       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(0x1dd8820), concrete:(*runtime._type)(0x1f01740), asserted:(*runtime._type)(0x20f42e0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Pod)}

I've filed OCPBUGS-15227 for this issue.
/test e2e-aws-ovn

@Miciah
Copy link
Contributor

Miciah commented Jun 21, 2023

e2e-aws-ovn-serial failed because [sig-network][Feature:EgressIP][apigroup:operator.openshift.io] [external-targets][apigroup:user.openshift.io][apigroup:security.openshift.io] pods should have the assigned EgressIPs and EgressIPs can be deleted and recreated failed:

{  fail [github.com/openshift/origin/test/extended/networking/egressip.go:798]: Timed out after 180.003s.
Expected
    <bool>: false
to be true
Ginkgo exit error 1: exit with code 1}

This looks like a timeout that might have been caused by general cluster instability (many other tests were reported as flaky in the same CI job run). Maybe it was a fluke.
/test e2e-aws-ovn-serial

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 21, 2023

CI Looks pretty bad today, but no errors attributed to this PR.

Must-gather failure could not be located when the pod was terminated
/test e2e-aws-operator
Undiagnosed panic detected in pod
/test e2e-aws-ovn
Same [sig-network][Feature:EgressIP][apigroup:operator.openshift.io] [external-targets][apigroup:user.openshift.io][apigroup:security.openshift.io] pods should have the assigned EgressIPs and EgressIPs can be deleted and recreated [Skipped:azure][apigroup:route.openshift.io] [Serial] [Suite:openshift/conformance/serial] issue as before
/test e2e-aws-ovn-serial
Must gather failure? "e2e-aws-ovn-upgrade" pod "e2e-aws-ovn-upgrade-gather-audit-logs" failed: pod pending for more than 30m0s: containers have not started in 30m0.000152102s: cp-entrypoint-wrapper, sidecar, test
/test e2e-aws-ovn-upgrade
Cluster install failed
/test e2e-azure-operator
Disruption failure
/test e2e-azure-ovn
Known TestInternalLoadBalancer failure
/test e2e-gcp-operator

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2023
@gcs278
Copy link
Contributor Author

gcs278 commented Jun 22, 2023

/hold
still testing the solution

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 23, 2023

TestRouterCompressionOperation failure. search.ci says this is a new flake.

router_compression_test.go:177: failed to observe deployment completion: timed out waiting for deployment router-default to complete rollout: failed to observe deployment rollout complete; deployment specifies 2 replicas and has generation 14; last observed 2 updated, 1 available, 3 total replicas, with observed generation 14

I've never seen this one before, but since this update is specific to DNS, I don't think it's related.
/test e2e-aws-operator

Cluster resource quota should control resource limits across namespaces: seems unrelated
/test e2e-aws-ovn

`pods should successfully create sandboxes by adding pod to networ: seems unrelated
/test e2e-aws-ovn-single-node

// RoleARN is intended to only provide access to another account's Route 53 service, not for ELBs.
sessRoute53 := sess
if config.RoleARN != "" {
sessRoute53 = sess.Copy()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miciah - I made a silly mistake that broke everything. I had sessRoute53 := sess.Copy() instead of sessRoute53 = sess.Copy().

The variable was shadowed in the scope of the if statement, and didn't modify the sessRoute53 in the scope we needed. Everything works as expected now.

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 26, 2023

/hold cancel
Both myself and QE has done some early confirmation of this fix, and it looks good so far.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2023
@gcs278
Copy link
Contributor Author

gcs278 commented Jun 26, 2023

must gather failure
/test e2e-aws-operator

pathological event should not see excessive Back-off restarting failed containers
/test e2e-aws-ovn-single-node

Distruption testing failure
/test e2e-gcp-ovn

TestUpgradeControlPlane/EnsureNoCrashingPods
/test e2e-hypershift

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 26, 2023

unable to create pod
/test e2e-hypershift

@@ -16,7 +17,6 @@ import (
"github.com/aws/aws-sdk-go/aws/arn"
"github.com/aws/aws-sdk-go/aws/awserr"
"github.com/aws/aws-sdk-go/aws/client/metadata"
"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "github.com/aws/aws-sdk-go/aws/credentials/stscreds" import was moved erroneously

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, fixed.

To support Shared VPC, we split the DNS client into public and private
providers, the private using the RoleARN (Account A) and the public
using the default (Account B). However, the RoleARN only provides API
access for Account A's Route53 service, not the ability to describe
Account B's ELBs. This fix isolates the RoleARN to only be used with
Route53 API services.

`pkg/dns/aws/dns.go`: Create a separate Route53 session object that
uses the RoleARN when provided.
@gcs278
Copy link
Contributor Author

gcs278 commented Jun 26, 2023

Must gather failure
/test e2e-aws-operator

TestAllowedSourceRanges and TestInternalLoadBalancer failures. I'm assigned the bug https://issues.redhat.com/browse/OCPBUGS-13106
/test e2e-gcp-operator

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster should report telemetry [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
/test e2e-gcp-ovn

@melvinjoseph86
Copy link

There was no issues with regression on the public and private cluster created with the premerged image, hence marking as verified

@Miciah
Copy link
Contributor

Miciah commented Jun 27, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 27, 2023
@gcs278
Copy link
Contributor Author

gcs278 commented Jun 27, 2023

error: timed out waiting for the condition on hostedclusters/8e677dc7f4fe7c93f4f1-mgmt
test e2e-hypershift

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 27, 2023

error: timed out waiting for the condition on hostedclusters/8e677dc7f4fe7c93f4f1-mgmt
/test e2e-hypershift

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD ecb3786 and 2 for PR HEAD 047bd98 in total

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 27, 2023

The openshift-monitoring prometheus-adapter pods [apigroup:monitoring.coreos.com] should be scheduled on different nodes [Suite:openshift/conformance/parallel]
/test e2e-aws-ovn

Lots of e2e test failures in e2e-azure-operator, that's a bit of a first, but it seems something went wrong with the installation
/test e2e-azure-operator

error: timed out waiting for the condition on hostedclusters/1870669922af87d68c6e-mgmt again
/test e2e-hypershift

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 27, 2023

Waiting for cluster to become available: Is hypershift e2e permafailing?
/test e2e-hypershift

@gcs278
Copy link
Contributor Author

gcs278 commented Jun 27, 2023

Same issue, 3rd time...
/test e2e-hypershift

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 27, 2023

@gcs278: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit d319e46 into openshift:master Jun 27, 2023
@openshift-ci-robot
Copy link
Contributor

@gcs278: Jira Issue OCPBUGS-14998: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-14998 has been moved to the MODIFIED state.

In response to this:

To support Shared VPC, we split the DNS client into public and private providers, the private using the RoleARN (Account A) and the public using the default (Account B). However, the RoleARN only provides API access for Account A's Route53 service, not the ability to describe Account B's ELBs. The ingress-operator DNS Logic uses the AWS API to describe the ELBs in order to determine the Zone ID. This fix isolates the RoleARN to only be used with Route53 API services.

pkg/dns/aws/dns.go: Create a separate Route53 session object that uses the RoleARN when provided.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants