External-DNS delete and recreate constantly all DNS records #992
Comments
@jhohertz I've seen that your PR was merged to fix this issue, and some people reported that the issue has been fixed. However, this is still happening with the Cloudflare provider. Do you have any idea? Thanks! |
This issue is happening to me as well on both v0.5.12 and v0.5.13 with Cloudflare. |
@eduardolundgren I think v0.5.3 does not have this bug and it was introduced in v0.5.4. If you were able to confirm this that would be great. |
Just realized that the issue in my case was due to two clusters with the same owner id value: one external-dns instance was removing the entries being added by the other. I can confirm it's working on both v0.5.3 and v0.5.13 versions. Thank you. |
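To illustrate the owner-id collision described above, here is a minimal Go sketch. It is not external-dns code: the `record` type and `staleRecords` function are hypothetical, and the owner ID is assumed to be stored alongside each record (external-dns keeps it in a companion TXT record). The point is only that two instances sharing one owner ID each see the other's records as their own stale entries.

```go
package main

import "fmt"

// record is a hypothetical, simplified DNS record tagged with the ID of the
// external-dns instance that created it.
type record struct {
	Name  string
	Owner string
}

// staleRecords returns the names an instance with the given owner ID would
// delete: records it believes it owns that are missing from its desired set.
func staleRecords(ownerID string, existing []record, desired map[string]bool) []string {
	var stale []string
	for _, r := range existing {
		if r.Owner != ownerID {
			continue // owned by another instance: leave it alone
		}
		if !desired[r.Name] {
			stale = append(stale, r.Name)
		}
	}
	return stale
}

func main() {
	existing := []record{
		{Name: "a.example.com", Owner: "default"}, // created by cluster A
		{Name: "b.example.com", Owner: "default"}, // created by cluster B
	}
	desiredA := map[string]bool{"a.example.com": true} // cluster A's ingresses

	// Shared owner ID: cluster A wants to delete cluster B's record.
	fmt.Println(staleRecords("default", existing, desiredA)) // prints [b.example.com]

	// Distinct owner IDs: cluster A ignores cluster B's record.
	existing[1].Owner = "cluster-b"
	fmt.Println(staleRecords("default", existing, desiredA)) // prints []
}
```

With distinct owner IDs per cluster, each instance only ever touches the records it created.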
@eduardolundgren do you use proxied entries with v0.5.13 ? Thanks! |
This appears to be caused by #970 which added support for multiple target addresses for the CloudFlare provider. As nta mentions in #970, CloudFlare does not support "sets" of targets in basic DNS, but rather a single record for each target. For example, for a 2 node cluster, the CloudFlare provider's Records() returns:
However, the desired endpoints generated by Endpoints() have a set of targets for each name.
This confuses the planner, resulting in the calculation of a plan that updates records even when no changes are necessary:
The fix appears to be to group records by name and type. See pull request #1034. Post pull request, given the following Records() and the desired Endpoints() above:
The following plan is returned:
|
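The grouping idea from the comment above can be sketched in Go as follows. This is an illustration only, not the code from pull request #1034: `flatRecord`, `endpoint`, and `groupByNameAndType` are hypothetical names, and the addresses are documentation placeholders. It collapses one-record-per-target results into one endpoint per name and type, so they can be compared with desired endpoints that carry a set of targets.

```go
package main

import (
	"fmt"
	"sort"
)

// flatRecord is a hypothetical single-target record, as a provider that
// stores one record per target might return it.
type flatRecord struct {
	Name, Type, Target string
}

// endpoint is a hypothetical multi-target endpoint.
type endpoint struct {
	Name, Type string
	Targets    []string
}

// groupByNameAndType merges records sharing a name and type into a single
// endpoint with sorted targets, preserving first-seen order of the groups.
func groupByNameAndType(records []flatRecord) []endpoint {
	type key struct{ name, typ string }
	grouped := map[key][]string{}
	var order []key
	for _, r := range records {
		k := key{r.Name, r.Type}
		if _, seen := grouped[k]; !seen {
			order = append(order, k)
		}
		grouped[k] = append(grouped[k], r.Target)
	}
	out := make([]endpoint, 0, len(order))
	for _, k := range order {
		targets := grouped[k]
		sort.Strings(targets) // stable order so comparisons are deterministic
		out = append(out, endpoint{Name: k.name, Type: k.typ, Targets: targets})
	}
	return out
}

func main() {
	records := []flatRecord{
		{"app.example.com", "A", "192.0.2.2"},
		{"app.example.com", "A", "192.0.2.1"},
	}
	for _, e := range groupByNameAndType(records) {
		fmt.Println(e.Name, e.Type, e.Targets)
	}
}
```

As a later comment in this thread points out, grouping like this loses information when a provider-specific property (such as Cloudflare's proxied flag) differs per target, so the sketch only covers the simple case.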
@shasderias I built and tested your branch. I thought it would fix the update loop for Cloudflare, but it appears the issue is still there. From my own experience, the issue was introduced in v0.5.4. Note: I only tested proxied entries. Here's how I tested your branch:
Here are the logs:
Any thoughts? |
There might be more than one issue here then. Can you set up a go dev environment? If so, edit controller/controller.go as per this gist (adds 3 pretty print statements and imports the pretty print package). Then set your environment variables (CF_API_EMAIL, CF_API_PASSWORD, etc.) and run:
Remove any sensitive information and paste the output. Not sure I can help as I'm no expert, but it should help narrow down the issue. |
@shasderias here are the logs
|
@MiniJerome Ah. Mystery solved. I hope. Relevant lines from log: Current:
Desired:
When a record is proxied, the TTL is automatically managed by CloudFlare - the TTL cannot be changed from the control panel. However, I believe you have the following annotation in your ingress, and this causes external-dns to repeatedly try, and fail, to set the TTL for the proxied record to 120, causing the constant updating of records you are experiencing:
Could you check if you have said annotation, and if so, try removing it and see if that solves your problem? A fix on external-dns's end would be one or more of the following:
|
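The mismatch described above can be pictured with a small Go sketch. It is not the external-dns planner: `record`, `needsUpdateNaive`, `needsUpdateProxiedAware`, and the `automaticTTL` value are all assumptions made for illustration. A strict TTL comparison never converges when the provider overrides the TTL of a proxied record, while a comparison that skips the TTL for proxied records does.

```go
package main

import "fmt"

// automaticTTL is an assumed placeholder for "TTL managed by the provider".
const automaticTTL = 1

// record is a hypothetical, simplified DNS record.
type record struct {
	Name    string
	TTL     int
	Proxied bool
}

// needsUpdateNaive compares every field, including a TTL the provider may
// refuse to store for proxied records.
func needsUpdateNaive(current, desired record) bool {
	return current.TTL != desired.TTL || current.Proxied != desired.Proxied
}

// needsUpdateProxiedAware ignores the TTL when the desired record is
// proxied, since the requested TTL cannot take effect there.
func needsUpdateProxiedAware(current, desired record) bool {
	if current.Proxied != desired.Proxied {
		return true
	}
	if desired.Proxied {
		return false
	}
	return current.TTL != desired.TTL
}

func main() {
	current := record{Name: "app.example.com", TTL: automaticTTL, Proxied: true}
	desired := record{Name: "app.example.com", TTL: 120, Proxied: true} // TTL annotation set to 120

	fmt.Println(needsUpdateNaive(current, desired))        // prints true on every sync
	fmt.Println(needsUpdateProxiedAware(current, desired)) // prints false
}
```

Removing the TTL annotation, as suggested above, avoids the mismatch from the user's side without any change to the comparison.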
@shasderias I confirm. Good job! I updated 1/2 of my apps, removing the annotation. We can see that only the ingress with the TTL annotation is still in the update loop.
Thank you for your help. Much appreciated. |
I can confirm that the behaviour is related to the TTL; I was able to reproduce it on AWS with Route53. I'm going to dive a bit deeper into that and update as soon as I have a full idea of how to solve the problem. /assign @Raffo |
So I managed to replicate this in AWS Route53. Behind the scenes, ExternalDNS creates an ALIAS record that does not allow setting a TTL:
I suspect that when we read the TTL we don't find it and try to set it again. I think this is not a bug as the user is trying to do something that literally cannot be done and I'm pretty confident that it is the same behaviour that is happening in cloudflare. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
I had the same problem with Azure DNS using an up to date release (sha256:27c516624d71636b6642f8d4089df728ae4667d866c39ec1bcd8c03cbed732dd) following the basic installation described in azure.md. Records were deleted and recreated every minute. I noticed that one of the two external-dns pods tried to mount a non-existing external-dns-token secret for whatever reason. The other pod, which had mounted the correct secret, was constantly deleting and recreating records. Weirdly, after deleting the pod with the wrong secret and having it recreated, the endless delete/create cycle ended. |
We experienced the same issue after upgrading external-dns from v0.5.17 to v0.5.18. Because of this bug, we hit Cloudflare API throttling, which caused major downtime on our side. After looking into the Cloudflare provider submitChanges logic, we noticed that all records are being deleted and re-created instead of being updated via the UpdateDNSRecord call provided by the cloudflare-go library. |
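The difference between updating in place and deleting then re-creating can be sketched as a diff that classifies each record. This is a hedged illustration, not the submitChanges logic and not the cloudflare-go API: `record`, `changes`, and `diff` are hypothetical names. Records that already match produce no change at all, and records that differ are routed to a single update rather than a delete followed by a create.

```go
package main

import "fmt"

// record is a hypothetical, simplified DNS record keyed by name and type.
type record struct {
	Name, Type, Content string
}

// changes holds the operations needed to move from current to desired.
type changes struct {
	Create, Update, Delete []record
}

// diff classifies records: missing ones are created, differing ones are
// updated in place, and ones no longer wanted are deleted.
func diff(current, desired []record) changes {
	type key struct{ name, typ string }
	existing := make(map[key]record, len(current))
	for _, r := range current {
		existing[key{r.Name, r.Type}] = r
	}

	var c changes
	wanted := make(map[key]bool, len(desired))
	for _, d := range desired {
		k := key{d.Name, d.Type}
		wanted[k] = true
		cur, ok := existing[k]
		switch {
		case !ok:
			c.Create = append(c.Create, d)
		case cur != d:
			c.Update = append(c.Update, d) // one call, no gap in resolution
		}
	}
	for _, r := range current {
		if !wanted[key{r.Name, r.Type}] {
			c.Delete = append(c.Delete, r)
		}
	}
	return c
}

func main() {
	current := []record{
		{"a.example.com", "A", "192.0.2.1"},
		{"old.example.com", "A", "192.0.2.9"},
	}
	desired := []record{
		{"a.example.com", "A", "192.0.2.2"},
		{"new.example.com", "A", "192.0.2.3"},
	}
	c := diff(current, desired)
	fmt.Printf("create=%d update=%d delete=%d\n", len(c.Create), len(c.Update), len(c.Delete))
	// prints create=1 update=1 delete=1
}
```

When current and desired are identical, the change set is empty, so a sync makes no API calls and cannot contribute to throttling.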
@etiennetremel seeing the same - latest good chart at https://github.com/helm/charts/tree/master/stable/external-dns is |
We saw the same on v0.5.18 r4 with Cloudflare. Reverting to v0.5.17 fixed the issue. |
I confirm, revert to |
Also encountering this on the |
Grouping doesn't work well and is causing issues because ProviderSpecificProperty can be per-target, e.g. the proxied field for Cloudflare. |
@MiniJerome please re-open, this bug is still present. |
Same issue on Azure: external-dns updates A + TXT records every minute, because ApplyChanges() is called every minute.
|
@MiniJerome Just updated to the latest version and it's definitely not fixed.
Using cloudflare without the "proxied" option and no custom ttl annotations defined. |
Has anybody seen major random downtimes with DNS not being resolved at all? I have problems with that. Not sure if this can be the cause of the problem. I have also opened an issue with Cloudflare support. |
@brpaz that's exactly the issue I had. The rollback to v0.5.17 mentioned above fixed it. |
@brpaz I am also having the same issue with the latest version. Check the external-dns logs to be sure |
Hi @brpaz, Yes, I'm having that issue with the RFC2136 provider. I suspect our DNS team doesn't like it when records are deleted then updated every minute. |
Is the TTL annotation actually required for this problem to happen? In my case with the RFC2136 (issue #1596), we're seeing issues with no TTL annotation present at all, similar to what @jaygorrell is seeing. The RFC2136 provider sets a global minimum TTL using the |
In my case I am not using the TTL annotation. |
In #1596, I believe the issue might be that external-dns sees a mismatch between what it's trying to delete and what is actually present. I believe external-dns is looking for a record that looks like this:
But in actuality, the TTL is '60' not '0':
I noticed this because the constant removals/additions only happen for old records that have an older TTL and thus don't match the expected pattern. The new records seem to have a TTL that matches what's expected. So, this means I have a path forward to manually clean up the stale records. However, the underlying behavior is still there and will require occasional cleanup in the future. |
Reopening. |
Like @jaygorrell said, I temporarily downgraded from v0.7.1 to v0.5.17 and it fixed it. I'll stay on v0.5.17 until the upcoming fix. |
This should be fixed in 0.7.2, could you confirm? |
In my case - fixed. |
It seems to work well with 0.7.2 - thanks 🎉 I tested using the cloudflare provider. |
works for me. |
Seems fixed for me as well |
OK, I'll consider this fixed. If not, please create another issue with steps to reproduce, or ideally a test in cloudflare_test.go or other affected. The tests for cloudflare provider are really easy to write :) /close |
@sheerun: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I'm experiencing this with the latest image: v0.7.4 |
Nevermind. I added these two options and the problem went away
|
We have the same behaviour with 0.7.6 and 0.8.0 versions on AWS Route53 |
This is required especially if you have multiple clusters on the same account managing the same zone (also pay attention to the different regions!). |
It still doesn't work; records are changing all the time. |
I noticed it happens to me on records without subdomains on Cloudflare. Example: I have ingress:
hosts:
- host: example.com
And on Cloudflare there's:
But not:
I tried to create it manually, but Cloudflare automatically converts it to:
Because it removes the plain domain. Could this be the cause? That's the only difference between the ones working and the ones causing the infinite update loop. |
Context
External-DNS constantly deletes and re-creates DNS records every minute, according to the interval value.
I expected External-DNS to create, update, or delete DNS records only if there is a change to the Ingress.
This is blocking because it causes downtime between deletion and creation that can last up to 30 seconds, every minute.
Note: Same behavior with External-dns v0.5.12 and v0.5.9.
External-DNS logs
External-DNS Deployment
Cloudflare Logs
Similar issue
#883