OCPBUGS-48780: Fix IBMCloud DNS Propagation Issues in E2E #1164

Merged
merged 2 commits into from
Jan 31, 2025

Conversation

gcs278
Contributor

@gcs278 gcs278 commented Oct 29, 2024

Fix IBMCloud DNS resolution issues in our E2E tests with two fixes:

Fix 1:

Extend the timeout to 10 minutes. This fix defines the timeout as a documented constant, ensuring that all DNS resolution logic is updated to reference it.

Testing showed that new IBMCloud DNS records were resolving within 7 minutes for external (e.g., test runner cluster) queries. Setting the timeout to 10 minutes provides a reasonable buffer to accommodate DNS propagation across all platforms.

Fix 2:

During testing, we found that IBMCloud's DNS resolution works well from outside the cluster (e.g., the test runner cluster). However, internal DNS queries within the test cluster trigger an unchangeable ~30-minute negative caching TTL.

This fix introduces an internal warmup period for IBMCloud clusters to mitigate the negative caching issue. Only one test, TestUnmanagedDNSToManagedDNSInternalIngressController, requires this workaround.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 29, 2024
Contributor

openshift-ci bot commented Oct 29, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gcs278
Contributor Author

gcs278 commented Oct 29, 2024

/test e2e-ibmcloud-operator

@gcs278 gcs278 force-pushed the ibmcloud-e2e-dns-fix branch from 66cbc36 to c029765 Compare October 29, 2024 04:05
@gcs278
Contributor Author

gcs278 commented Oct 29, 2024

/test e2e-ibmcloud-operator

@gcs278 gcs278 force-pushed the ibmcloud-e2e-dns-fix branch from c029765 to 133382a Compare October 29, 2024 17:03
@gcs278
Contributor Author

gcs278 commented Oct 29, 2024

/test e2e-ibmcloud-operator

@gcs278 gcs278 force-pushed the ibmcloud-e2e-dns-fix branch 2 times, most recently from afee45c to 34c2e67 Compare October 29, 2024 19:48
@gcs278
Contributor Author

gcs278 commented Oct 29, 2024

/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Oct 29, 2024

To make sure I didn't break anything:
/test e2e-aws-operator

@gcs278
Contributor Author

gcs278 commented Oct 30, 2024

Success!
/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Nov 11, 2024

I've temporarily added an availability check for IngressController in TestScopeChange to confirm that it would catch the issue fixed in #1133. Once I've confirmed that it fails here, I'll remove it from this PR and move it to #1133.
/test e2e-ibmcloud-operator

@gcs278 gcs278 force-pushed the ibmcloud-e2e-dns-fix branch from 4f8b44e to 34c2e67 Compare November 12, 2024 14:28
@gcs278
Contributor Author

gcs278 commented Nov 12, 2024

Yup, it failed, that's a good thing 👍 Added the TestScopeChange updates to #1133 and removed from here.

@gcs278 gcs278 force-pushed the ibmcloud-e2e-dns-fix branch from 34c2e67 to 9f6e396 Compare November 12, 2024 15:05
@gcs278
Contributor Author

gcs278 commented Nov 12, 2024

/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Nov 13, 2024

success, but failed on deprovisioning:
/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Nov 13, 2024

Failure with:

=== NAME  TestAll/parallel/TestMTLSWithCRLs/multiple-intermediate-ca
    client_tls_test.go:1079: failed to get hostname of route "echo-pod-8ccjt"

Looks like a flake in which the router didn't admit the route within 1 minute. That's odd, but I doubt it's specific to IBMCloud.
/test e2e-ibmcloud-operator

@gcs278 gcs278 changed the title [WIP] Fix IBMCloud DNS Propagation Issues OCPBUGS-42045: Fix IBMCloud DNS Propagation Issues Nov 13, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Nov 13, 2024
@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-42045, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Resolves IBMCloud DNS names inside the cluster because they don't propagate to the test runner clusters in a reasonable time.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Nov 13, 2024
@openshift-ci openshift-ci bot requested a review from lihongan November 13, 2024 16:10
@gcs278 gcs278 changed the title OCPBUGS-42045: Fix IBMCloud DNS Propagation Issues [WIP] OCPBUGS-42045: Fix IBMCloud DNS Propagation Issues Nov 13, 2024
@gcs278 gcs278 force-pushed the ibmcloud-e2e-dns-fix branch from 9f6e396 to 941bff1 Compare November 14, 2024 01:28
@gcs278
Contributor Author

gcs278 commented Nov 14, 2024

Okay, DNS failed again. Increased the warmup period to 5 minutes, which is what I found in #1132 (comment)
/test e2e-ibmcloud-operator

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 29, 2025
@gcs278 gcs278 changed the title [WIP] OCPBUGS-48780: Fix IBMCloud DNS Propagation Issues OCPBUGS-48780: Extend DNS Resolution Timeout to 10 Minutes Jan 29, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2025
@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-48780, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Fix IBMCloud DNS resolution issues by extending the timeout to 10 minutes. This fix defines the timeout as a documented constant, ensuring that all DNS resolution logic is updated to reference it.

Testing showed that new IBMCloud DNS records were resolving within 7 minutes. Setting the timeout to 10 minutes provides a reasonable buffer to accommodate DNS propagation across all platforms.


@gcs278
Contributor Author

gcs278 commented Jan 29, 2025

Refactored to only increase the DNS Resolution Timeout to 10 minutes:

/test e2e-ibmcloud-operator

@gcs278 gcs278 changed the title OCPBUGS-48780: Extend DNS Resolution Timeout to 10 Minutes OCPBUGS-48780: Fix IBMCloud DNS Propagation Issues Jan 29, 2025
@gcs278 gcs278 force-pushed the ibmcloud-e2e-dns-fix branch from e50c9bd to 3bfdce3 Compare January 29, 2025 23:25
@gcs278 gcs278 changed the title OCPBUGS-48780: Fix IBMCloud DNS Propagation Issues OCPBUGS-48780: Fix IBMCloud DNS Propagation Issues in E2E Jan 29, 2025
Fix IBMCloud DNS resolution issues in our E2E tests by extending the
timeout to 10 minutes. This fix defines the timeout as a documented
constant, ensuring that all DNS resolution logic is updated to
reference it.

Testing showed that new IBMCloud DNS records were resolving within
7 minutes for external (e.g., test runner cluster) queries. Setting
the timeout to 10 minutes provides a reasonable buffer to accommodate
DNS propagation across all platforms.

During testing, we found that IBMCloud's DNS resolution works well from
outside the cluster (e.g., the test runner cluster). However, internal
DNS queries within the test cluster trigger an unchangeable
~30-minute negative caching TTL.

This E2E test fix introduces an internal warmup period for IBMCloud
clusters to mitigate the negative caching issue. Only one test,
TestUnmanagedDNSToManagedDNSInternalIngressController, requires this
workaround.
@gcs278 gcs278 force-pushed the ibmcloud-e2e-dns-fix branch from 3bfdce3 to 7ee110b Compare January 29, 2025 23:27
@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

darn I forgot to run it:

/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

/retest

@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

success round 1 (hasn't finished, but I see the e2e jobs passed)

/test e2e-ibmcloud-operator

@alebedev87
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 30, 2025
@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

Success, round 2:
/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

installation failure, not related:
/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

installation failure, not related:
/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

installation failure, terraform is timing out, I guess I should have stopped while I was ahead earlier...

/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

installation failure, not related:
/test e2e-ibmcloud-operator

@gcs278
Contributor Author

gcs278 commented Jan 30, 2025

As far as DNS Propagation is concerned, the last run was a success, but the failure seems like it's a dns-related flake (but real bug) for IBMCloud, so I filed https://issues.redhat.com/browse/OCPBUGS-49684 (maybe a win for adding this E2E job?).

/test e2e-ibmcloud-operator

Contributor

openshift-ci bot commented Jan 31, 2025

@gcs278: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gcs278
Contributor Author

gcs278 commented Jan 31, 2025

4 successes, looks good to me.

This is an E2E fix for our pre-submit job, which is optional and does not run by default, so there is no risk that it will hold up CI for 4.19.

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Jan 31, 2025
@gcs278
Contributor Author

gcs278 commented Jan 31, 2025

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit ce5dace into openshift:master Jan 31, 2025
20 checks passed
@openshift-ci-robot
Contributor

@gcs278: Jira Issue OCPBUGS-48780: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-48780 has been moved to the MODIFIED state.

In response to this:

Fix IBMCloud DNS resolution issues in our E2E tests with two fixes:

Fix 1:

Extend the timeout to 10 minutes. This fix defines the timeout as a documented constant, ensuring that all DNS resolution logic is updated to reference it.

Testing showed that new IBMCloud DNS records were resolving within 7 minutes for external (e.g., test runner cluster) queries. Setting the timeout to 10 minutes provides a reasonable buffer to accommodate DNS propagation across all platforms.

Fix 2:

During testing, we found that IBMCloud's DNS resolution works well from outside the cluster (e.g., the test runner cluster). However, internal DNS queries within the test cluster trigger an unchangeable ~30-minute negative caching TTL.

This fix introduces an internal warmup period for IBMCloud clusters to mitigate the negative caching issue. Only one test, TestUnmanagedDNSToManagedDNSInternalIngressController, requires this workaround.


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.19.0-202501311738.p0.gce5dace.assembly.stream.el9.
All builds following this will include this PR.

Labels
acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/severity-low: Referenced Jira bug's severity is low for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.