Provider 3.27.0 azurerm_cdn_frontdoor_route and azurerm_cdn_frontdoor_custom_domain_association and custom domains #18844

Closed
1 task done
slime-uk opened this issue Oct 18, 2022 · 20 comments
Assignees
Labels
bug question service/cdn upstream/microsoft Indicates that there's an upstream issue blocking this issue/PR v/3.x

Comments

@slime-uk

slime-uk commented Oct 18, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

1.1.9

AzureRM Provider Version

3.27.0

Affected Resource(s)/Data Source(s)

azurerm_cdn_frontdoor_route azurerm_cdn_frontdoor_custom_domain_association azurerm_cdn_frontdoor_custom_domain

Terraform Configuration Files

I have previously opened an issue where all this is provided - so please see existing issue #18656

Debug Output/Panic Output

I have previously opened an issue where all this is provided - so please see existing issue #18656

Expected Behaviour

We expect to be able to create all the needed AzFD resources (profile, origin groups, origins and routes, ideally not linked to the default AzFD endpoint/domain, although the portal does not allow this) and then associate them with AzFD custom domains as and when needed using the new azurerm_cdn_frontdoor_custom_domain_association resource. We'd also like to delete AzFD custom domains and have Terraform know the correct order of deletion (i.e. first remove the custom domain from any route associations).

Actual Behaviour

This does work in 3.27.0 (see my comments on my issue #18656) but only if you

  1. create the routes after the custom domains, associate the custom domains in the azurerm_cdn_frontdoor_route resource, and also set link_to_default_domain = true; and then
  2. use the new azurerm_cdn_frontdoor_custom_domain_association resource to also set the associations from custom domains to routes.

If you then try and remove a custom domain later, Terraform appears to understand the order and deletes successfully.

However, if you do the above but set link_to_default_domain = false, creation works fine, but on deletion of an AzFD custom domain we get an error that the domain is still associated with a route by partner ID - see the later comments on #18656

Steps to Reproduce

See #18656

Important Factoids

No response

References

#18656

@WodansSon
Collaborator

WodansSon commented Oct 18, 2022

@slime-uk, thank you for opening this issue. Front Door is an incredibly complex resource with lots of moving parts and is very difficult to grok, so I feel your pain. I have already fixed the Terraform crash that you encountered; it was caused by the API returning the route resource IDs with a lower-cased resourcegroup segment instead of the resourceGroup casing that Terraform expects.

The purpose of the new azurerm_cdn_frontdoor_custom_domain_association resource is not to actually associate the custom domain with the route. That is controlled by the route's cdn_frontdoor_custom_domain_ids field. The new resource's sole purpose is to remove the association of the custom domain to the route to allow for the deletion to occur. The API has a rule that a route cannot be created, with the exception of the initial creation of the route, unless it is associated with either:

a.) A custom domain or
b.) linked to the default domain (e.g. endpoint)

There is nothing I can do about that since that is the design of the API. What the new resource does on delete is update the route's cdn_frontdoor_custom_domain_ids field to remove the custom domain from that list. This disassociates the custom domain from that route so that the deletion of the custom domain can complete without error. If the new resource detects that the custom domain is the last custom domain associated with a given route, it will toggle the link_to_default_domain field on the route to true to make sure the route is in a valid state according to the two requirements of the API I mentioned above.
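
As a rough illustration of the relationship described above, a minimal sketch might look like the following. The `example` labels, the host name, and the referenced profile, endpoint, origin group and origin are placeholders assumed to be defined elsewhere in the configuration; they are not taken from the reporter's setup.

```hcl
# Hypothetical custom domain; names and host name are placeholders.
resource "azurerm_cdn_frontdoor_custom_domain" "example" {
  name                     = "example-domain"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.example.id
  host_name                = "www.example.com"

  tls {
    certificate_type    = "ManagedCertificate"
    minimum_tls_version = "TLS12"
  }
}

# The route is what actually associates the custom domain,
# through cdn_frontdoor_custom_domain_ids.
resource "azurerm_cdn_frontdoor_route" "example" {
  name                          = "example-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.example.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.example.id
  cdn_frontdoor_origin_ids      = [azurerm_cdn_frontdoor_origin.example.id]
  patterns_to_match             = ["/*"]
  supported_protocols           = ["Http", "Https"]

  cdn_frontdoor_custom_domain_ids = [azurerm_cdn_frontdoor_custom_domain.example.id]

  # Only valid while at least one custom domain is associated with the route.
  link_to_default_domain = false
}

# The association resource exists so that, on destroy, the custom domain
# is removed from the listed routes before the domain itself is deleted.
resource "azurerm_cdn_frontdoor_custom_domain_association" "example" {
  cdn_frontdoor_custom_domain_id = azurerm_cdn_frontdoor_custom_domain.example.id
  cdn_frontdoor_route_ids        = [azurerm_cdn_frontdoor_route.example.id]
}
```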

The last issue you mentioned:

> However, if you do the above but set link_to_default_domain = false, although the creation works fine, on deletion of an AzFD custom domain, we get an error that the domain is still associated to a route by partner id

I have not encountered this personally. I will spend some time today attempting to reproduce the issue, but setting link_to_default_domain to false is not valid unless the route is associated with a custom domain, so I am not sure what is going on there.

I hope this has answered a few of your questions. Feel free to reply with any follow-up questions you may have. 🚀

@WodansSon
Collaborator

To help explain how these resources relate to each other, I have created this image, which I hope shows visually both the relationships and their direction...

image

So now, knowing the association direction, the relationship and the deployment order, if you want to decouple your custom domain you will need to do a few extra steps. I understand this is not terribly intuitive, but given how the API design and Terraform are at odds with each other I saw no other way around it. 🙁 To do this you will first have to update your configuration file: remove the reference to the custom domain ID from your route's cdn_frontdoor_custom_domain_ids field and remove the azurerm_cdn_frontdoor_custom_domain_association resource code block completely. Leave the azurerm_cdn_frontdoor_custom_domain resource code block alone; it does not have to change, as it does not reference either the association resource or the route resource.

At this point if you run plan you will see that the azurerm_cdn_frontdoor_custom_domain_association resource will be marked to be destroyed, the route resource will be marked as update in place and the custom domain resource will be unchanged and will not show up in the plan at all.

If you apply this change, the azurerm_cdn_frontdoor_custom_domain_association resource will be destroyed first. When it is destroyed, it removes its custom domain reference from all of its referenced route resource(s), effectively "un-associating" that custom domain from the route. Once the azurerm_cdn_frontdoor_custom_domain_association resource has updated the route resource's cdn_frontdoor_custom_domain_ids field, it removes itself from the state file and is gone. Next the route resource will attempt to apply the changes from the plan, which is now a no-op since the destruction of the azurerm_cdn_frontdoor_custom_domain_association resource already performed the operation for the route resource.

Once that apply completes you will have a totally un-associated custom domain resource which you can then associate with another route or whatever you need to do to it as it will not be associated at that point.
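
The edited configuration described above could be sketched roughly as follows. This assumes a single route whose only custom domain is being decoupled; the `example` labels, host name, and the referenced profile, endpoint, origin group and origin are placeholders, not the reporter's actual configuration.

```hcl
# Unchanged: the custom domain block stays exactly as it was.
resource "azurerm_cdn_frontdoor_custom_domain" "example" {
  name                     = "example-domain"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.example.id
  host_name                = "www.example.com"

  tls {
    certificate_type    = "ManagedCertificate"
    minimum_tls_version = "TLS12"
  }
}

# Updated in place: the custom domain ID is no longer listed.
resource "azurerm_cdn_frontdoor_route" "example" {
  name                          = "example-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.example.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.example.id
  cdn_frontdoor_origin_ids      = [azurerm_cdn_frontdoor_origin.example.id]
  patterns_to_match             = ["/*"]
  supported_protocols           = ["Http", "Https"]

  # With no custom domains left, the route has to be linked to the
  # default domain to remain valid for the API.
  link_to_default_domain = true
}

# Removed from the configuration entirely (shown commented out for reference):
# resource "azurerm_cdn_frontdoor_custom_domain_association" "example" { ... }
```

With a change like this, a plan would be expected to show the association marked for destruction, the route updated in place, and the custom domain absent from the plan.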

I hope this helps make a little bit of sense in regards to the behavior you have been seeing... I don't know about you but I am a visual learner and I have to see it before it clicks in my head personally. 🙂

@slime-uk
Author

slime-uk commented Oct 19, 2022

> @slime-uk, thank you for opening this issue. Front Door is an incredibly complex resource with lots of moving parts and is very difficult to grok, so I feel your pain. I have already fixed the Terraform crash that you encountered; it was caused by the API returning the route resource IDs with a lower-cased resourcegroup segment instead of the resourceGroup casing that Terraform expects.

Thanks!

> The purpose of the new azurerm_cdn_frontdoor_custom_domain_association resource is not to actually associate the custom domain with the route. That is controlled by the route's cdn_frontdoor_custom_domain_ids field. The new resource's sole purpose is to remove the association of the custom domain to the route to allow for the deletion to occur. The API has a rule that a route cannot be created, with the exception of the initial creation of the route, unless it is associated with either:
>
> a.) A custom domain or b.) linked to the default domain (e.g. endpoint)
>
> There is nothing I can do about that since that is the design of the API. What the new resource does on delete is update the route's cdn_frontdoor_custom_domain_ids field to remove the custom domain from that list. This disassociates the custom domain from that route so that the deletion of the custom domain can complete without error. If the new resource detects that the custom domain is the last custom domain associated with a given route, it will toggle the link_to_default_domain field on the route to true to make sure the route is in a valid state according to the two requirements of the API I mentioned above.

Yes - thanks for all your hard work. I do appreciate that a route must be associated with a CD or the AzFD default domain (the portal makes this clear). Thanks!

However, the logic you describe, where removing the last CD for a given route flips link_to_default_domain = true so that the route can stay and the CD can be removed, is really good and exactly what we need. Awesome! But in my experience, this is the bit that is maybe not working for us? I will try again to be sure, but I created 2 routes with link_to_default_domain = false and associated them with 2 custom domains when I defined the routes. I also used the new resource to associate the CD to the route (although as you say this is not really an association but a disassociation!). That all worked fine :)

I then tried to remove one of the custom domains, which would have impacted both routes. The new resource was destroyed (as you say), and the portal showed no more associations, but TF then went on to destroy the actual custom domain and failed with the message "Error: waiting for the deletion of Front Door Custom Domain: (XXX / Profile Name "YYY" etc.): Code="BadRequest" Message=ErrorMessage Host with Id: XX for tenant Id: YYY is still referenced by partners: ZZZZ""

Here's the TF apply that errored - no op was performed on the route but the new resource was destroyed as you say it should have been - but it then tried and failed to delete the CD even though portal shows no route association. In portal the CD was then showing "failed" as provisioning state:

image

Although - I guess this would not have been the final custom domain - just the 2nd one. Any ideas on this?

The same scenario seems to work exactly as you say - but only if when I create the 2 routes, I set link_to_default_domain = true which is not what we really want!

Here's the exact same scenario but the routes were defined as link_to_default_domain = true. Note the "no op" on the route did appear to do something! That's the only difference I can see between this successful apply and the above?

image

Again - so many thanks for all the hard work. AzFD is a bit complicated!!

@WodansSon
Collaborator

WodansSon commented Oct 19, 2022

@slime-uk, that is interesting I have not encountered that issue... but just to make sure I understand your configuration, is the below similar to how you have your environment deployed?

image

If the above image is correct that would mean that you are attempting to delete the cronus.wodans-son.com custom domain, which is associated with both the cronus-route and the fabrikam-route, correct?

Now if that is also correct is the below example what your azurerm_cdn_frontdoor_custom_domain_association resource looks like?

resource "azurerm_cdn_frontdoor_custom_domain_association" "cronus" {
  cdn_frontdoor_custom_domain_id = azurerm_cdn_frontdoor_custom_domain.cronus.id
  cdn_frontdoor_route_ids        = [azurerm_cdn_frontdoor_route.cronus.id, azurerm_cdn_frontdoor_route.fabrikam.id]
}
See Notes and Additional conditions
NOTE: In this scenario the azurerm_cdn_frontdoor_custom_domain_association resource for the cronus.wodans-son.com custom domain should reference both of the routes that it is associated with (e.g. azurerm_cdn_frontdoor_route.cronus.id and azurerm_cdn_frontdoor_route.fabrikam.id).
ADDITIONAL: For every azurerm_cdn_frontdoor_custom_domain resource in your configuration file you should also have a corresponding azurerm_cdn_frontdoor_custom_domain_association resource that references all of the azurerm_cdn_frontdoor_route resource(s) associated with that azurerm_cdn_frontdoor_custom_domain. Otherwise you will receive the service-side error "This resource is still associated with a route. Please delete the association with the route first before deleting this resource."

If all of the above matches your situation, then to delete the custom domain and leave the route in place you will need to do the following steps:

  1. Modify your azurerm_cdn_frontdoor_route for the cronus-route to remove the cronus.wodans-son.com custom domain reference from the cdn_frontdoor_custom_domain_ids field and remove the link_to_default_domain field.

  2. Modify your azurerm_cdn_frontdoor_route for the fabrikam-route to remove the cronus.wodans-son.com custom domain reference from the cdn_frontdoor_custom_domain_ids field. You can leave your link_to_default_domain field set to false.

  3. Remove the azurerm_cdn_frontdoor_custom_domain_association code block from your configuration that controls the reference between the cronus.wodans-son.com custom domain and the routes (e.g. cronus-route and fabrikam-route).

  4. Remove the azurerm_cdn_frontdoor_custom_domain code block for the cronus.wodans-son.com custom domain.

  5. If defined, remove the CNAME and TXT DNS records associated with the cronus.wodans-son.com custom domain.

Once the configuration modifications are complete you can apply the new configuration which will disassociate both of the routes from the cronus.wodans-son.com custom domain, delete the cronus.wodans-son.com custom domain and associated DNS entries (if defined). Your environment should then look like this:

image

This is the desired end state you are looking for, correct? If you are still receiving that error, I would start looking for other Front Door instances that are currently deployed and may have routes still associated with the cronus.wodans-son.com custom domain, which I suspect might be the root cause of the error you are seeing.
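
For illustration, the two route blocks after steps 1 to 4 might look roughly like the sketch below. It assumes that fabrikam-route keeps a second custom domain (labelled `fabrikam` here, which is hypothetical) and that the endpoint, origin groups and origins exist elsewhere under the placeholder names shown.

```hcl
# cronus-route: no custom domains remain, so cdn_frontdoor_custom_domain_ids
# and link_to_default_domain are both removed (step 1). The route then falls
# back to being linked to the default domain.
resource "azurerm_cdn_frontdoor_route" "cronus" {
  name                          = "cronus-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.example.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.cronus.id
  cdn_frontdoor_origin_ids      = [azurerm_cdn_frontdoor_origin.cronus.id]
  patterns_to_match             = ["/cronus/*"] # placeholder pattern
  supported_protocols           = ["Http", "Https"]
}

# fabrikam-route: only the cronus domain reference is removed (step 2).
# link_to_default_domain can stay false because another custom domain
# (assumed here) is still associated with the route.
resource "azurerm_cdn_frontdoor_route" "fabrikam" {
  name                          = "fabrikam-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.example.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.fabrikam.id
  cdn_frontdoor_origin_ids      = [azurerm_cdn_frontdoor_origin.fabrikam.id]
  patterns_to_match             = ["/fabrikam/*"] # placeholder pattern
  supported_protocols           = ["Http", "Https"]

  cdn_frontdoor_custom_domain_ids = [azurerm_cdn_frontdoor_custom_domain.fabrikam.id]
  link_to_default_domain          = false
}

# Steps 3 and 4: the azurerm_cdn_frontdoor_custom_domain_association "cronus"
# and azurerm_cdn_frontdoor_custom_domain "cronus" blocks are deleted from
# the configuration.
```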

@slime-uk
Author

Hi,

Thanks once again for the detailed reply.

When I got the error I'm reporting, our configuration was almost like that yes.

To be 100% clear, we have two AKS clusters and an AzFD route to each (say central US AKS and westeurope AKS). Then we have a number of custom domains and it's possible with our TFC variable map values to map to each route for each custom domain. As you have seen, in our configuration we use for_each for all this.

So I started with this - example TFC vars:

routes = ["centralus", "westeurope"]
doms = ["dom1", "dom2"]
# Used by AzFD route
routes_to_doms = {
   "centralus" = ["dom1"]
   "westeurope" = ["dom1", "dom2"]
}
# Used by resource azurerm_cdn_frontdoor_custom_domain_association
doms_to_routes = {
   "dom1" = ["centralus", "westeurope"]
   "dom2" = ["westeurope"]
}

And, here's what it looked like in AzFD portal:
image

So, at the time of error, I had dom1 mapped to both routes, and dom2 only mapped to westeurope route.
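
A rough sketch of how maps like these could drive `for_each` is shown below. The actual configuration lives in #18656 and may differ; the resource labels, the placeholder host names and patterns, and the referenced profile, endpoint, origin groups and origins are all assumptions made for illustration.

```hcl
variable "routes" {
  type = list(string)
}

variable "doms" {
  type = list(string)
}

variable "routes_to_doms" {
  type = map(list(string))
}

variable "doms_to_routes" {
  type = map(list(string))
}

resource "azurerm_cdn_frontdoor_custom_domain" "dom" {
  for_each = toset(var.doms)

  name                     = each.key
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.example.id
  host_name                = "${each.key}.example.com" # placeholder host name

  tls {
    certificate_type    = "ManagedCertificate"
    minimum_tls_version = "TLS12"
  }
}

resource "azurerm_cdn_frontdoor_route" "route" {
  for_each = toset(var.routes)

  name                          = each.key
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.example.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.aks[each.key].id
  cdn_frontdoor_origin_ids      = [azurerm_cdn_frontdoor_origin.aks[each.key].id]
  patterns_to_match             = ["/${each.key}/*"] # placeholder pattern
  supported_protocols           = ["Http", "Https"]
  link_to_default_domain        = false

  # Route -> custom domains, taken from routes_to_doms.
  cdn_frontdoor_custom_domain_ids = [
    for d in var.routes_to_doms[each.key] : azurerm_cdn_frontdoor_custom_domain.dom[d].id
  ]
}

resource "azurerm_cdn_frontdoor_custom_domain_association" "assoc" {
  for_each = var.doms_to_routes

  # Custom domain -> routes, taken from doms_to_routes.
  cdn_frontdoor_custom_domain_id = azurerm_cdn_frontdoor_custom_domain.dom[each.key].id
  cdn_frontdoor_route_ids = [
    for r in each.value : azurerm_cdn_frontdoor_route.route[r].id
  ]
}
```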

I then changed the TFC var values and removed dom2 from all 3 controlling maps/lists, as per:

routes = ["centralus", "westeurope"]
doms = ["dom1"]
# Used by AzFD route
routes_to_doms = {
   "centralus" = ["dom1"]
   "westeurope" = ["dom1"]
}
# Used by resource azurerm_cdn_frontdoor_custom_domain_association
doms_to_routes = {
   "dom1" = ["centralus", "westeurope"]
}
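
Because a domain has to be removed from three places here, one option (not something the thread itself proposes) is to derive the domain-to-route map from the route-to-domain map, so that the two cannot drift apart. A minimal sketch, using the variable names from the example above:

```hcl
variable "doms" {
  type = list(string)
}

variable "routes_to_doms" {
  type = map(list(string))
}

locals {
  # For each domain, collect every route whose domain list contains it.
  doms_to_routes = {
    for d in var.doms :
    d => [for r, ds in var.routes_to_doms : r if contains(ds, d)]
  }
}
```

With the values shown above, `local.doms_to_routes` would evaluate to `{ dom1 = ["centralus", "westeurope"] }`, and it could be used as the `for_each` input for the association resource in place of a hand-maintained map.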

The TFC run (attached above) did remove the new azurerm_cdn_frontdoor_custom_domain_association resource and said it needed a no op change to the routes (where it appeared to indeed do nothing). It then continued and tried to delete the dom2 custom domain and failed and left it in the "failed" state in the portal with the error "referenced by partner id...". Looking in the portal, there were no route associations (and as far as I remember, the DNS auth and CNAME recs had also been removed). The TF apply screenshot confirms.

If I only change 1 thing in my configuration (and that's to change the route definition (for all routes) to link_to_default_domain = true) and re-do the exact same thing (starting with a new clean AzFD), then again TF creates fine, and I can remove dom2 just by adjusting TFC map values fine - with no error. Awesome work! Sadly, we don't really want default domain linked at all (unless a route has no custom domains associated any longer - then that's fine and understood as a limitation in Azure). The only difference I can spot in the TFC execution is the no op did seem to change the route resource. This was removing dom2 and not the final domain for any route.

I will try again, just to be sure it wasn't Azure being funny! Let me see if I can reproduce it today.

@slime-uk
Author

slime-uk commented Oct 20, 2022

Well - I'm so glad I tried again as it all worked fine today. I'm still on Terraform 1.1.9/AzureRM provider 3.27.0.

The difference today is that the route change (no op), again did actually do something - allowing the no longer required dom2 to be removed successfully. No idea why the other day it failed.

Just to confirm, today I tore AzFD down completely, and then re-created with the same 2 domains, again one associated to both routes, and the other only associated to WE.

I then removed dom2 from our TFC var maps and re-ran. This time it worked!

FYI - the "no op" on the route did something today (just as I saw when I tried this the other day but only with link_to_default_domain = true). Today it also worked with link_to_default_domain = false.

While I was watching the apply, I really expected it to fail again, as it seemed to be doing exactly the same as the other day - here you can see it's attempting to delete the domain, but no op has been performed on the route yet:
image

But, it just worked resulting in this:
image

So, assuming this is reliable (I'm just a little concerned after the error the other day, which left AzFD in a weird state), it appears to be working exactly as it should now!

@slime-uk
Author

slime-uk commented Oct 20, 2022

I tried one more thing, and that was to delete the final custom domain. In AzFD Manager, I saw it flip the routes to the AzFD default domain as you said and then try to delete the final custom domain. It failed again with that same error:

image

And left us with AzFD in a squiffy state again:
image

@WodansSon
Collaborator

@slime-uk, I have engaged the FrontDoor service team to try to get to the bottom of why you are seeing these issues since I am not able to repro your issue locally on my dev box.

@slime-uk
Author

Thanks - FYI - once AzFD is in this state (a custom domain in a failed state), I then have to blow away the entire AzFD to recover. Before that, I tried to get TF to add a new custom domain but of course it tries again first to remove the old (failed state) domain and fails again. No easy way out without deleting the AzFD in the portal. You can't delete the failed CD in the portal either - same error. The only solution I have found is to delete the entire AzFD.

@WodansSon
Collaborator

@slime-uk, the service team is saying they "cannot find any custom domain/endpoint name/host name contains “skgdom1” from our kusto query" Can you give us anything more identifiable so we can search our internal logs to see what is happening?

@slime-uk
Author

Sadly, I just tore it all down again! Happy to try and recreate - or can they look through logs even though it's gone? If so, can I PM you the full details of the domain?

@WodansSon
Collaborator

@slime-uk, it doesn't matter if it is gone or not... just send me the info and I will pass that on to the service team...

@WodansSon
Collaborator

@slime-uk, also as a side note, have you looked at your activity log in the portal to see if it can add any more context as to why you are getting this error? Make sure you click the JSON tab on the right side of the UI to get all of the details of the error in the log.

@hashicorp hashicorp deleted a comment from slime-uk Oct 21, 2022
@WodansSon
Collaborator

@slime-uk, thank you for that info... I have removed the comment, but I believe I have found your calls in our internal service trace logs... currently working with service team to track down the root cause thank you. 🙂

@WodansSon
Collaborator

@slime-uk... this is starting to look like a race condition in the service code... from the logs, it shows that the domains are still associated even though the UI shows that they are not... 😭

@slime-uk
Author

slime-uk commented Oct 21, 2022

Makes sense - it seems hit and miss as per the above. One day it seems to delete the associations completely, and another day maybe it doesn't but thinks it has, and then tries to delete the custom domain too early.

@WodansSon
Collaborator

WodansSon commented Oct 22, 2022

@slime-uk, the service team just confirmed that this is in fact caused by a race condition in the service and that they are prioritizing a fix for this issue.

Per the service team:

"This is a race condition between Rule and Route concurrent operations. In this case, a delete rule operation updated the associated routes using a stale config right after the route update operation to remove Custom Domain completed. Rest client layer default retry logic will cause rule to continue retry the CP call with a stale config."

This means that the delete operation will retry over and over again until it eventually times out. That explains the error you were seeing, and also why I was not able to reproduce it: my repro did not include a rule for the route.
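
For context, a route "with a rule" would be one that has a rule set attached, roughly as in the sketch below. This is only an illustration of that kind of configuration, not the reporter's actual setup and not a confirmed reproduction of the race; all names, the header values, and the referenced profile, endpoint, origin group and origin are placeholders.

```hcl
resource "azurerm_cdn_frontdoor_rule_set" "example" {
  name                     = "examplerules"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.example.id
}

# Hypothetical rule; the action is arbitrary and only there to make
# the rule set non-empty.
resource "azurerm_cdn_frontdoor_rule" "example" {
  depends_on = [
    azurerm_cdn_frontdoor_origin_group.example,
    azurerm_cdn_frontdoor_origin.example,
  ]

  name                      = "examplerule"
  cdn_frontdoor_rule_set_id = azurerm_cdn_frontdoor_rule_set.example.id
  order                     = 1

  actions {
    response_header_action {
      header_action = "Overwrite"
      header_name   = "X-Example"
      value         = "true"
    }
  }
}

resource "azurerm_cdn_frontdoor_route" "example" {
  name                          = "example-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.example.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.example.id
  cdn_frontdoor_origin_ids      = [azurerm_cdn_frontdoor_origin.example.id]
  patterns_to_match             = ["/*"]
  supported_protocols           = ["Http", "Https"]

  # Attaching the rule set is what ties rule operations to this route.
  cdn_frontdoor_rule_set_ids = [azurerm_cdn_frontdoor_rule_set.example.id]
}
```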

@WodansSon WodansSon added the bug label Oct 24, 2022
@WodansSon WodansSon added the upstream/microsoft Indicates that there's an upstream issue blocking this issue/PR label Oct 24, 2022
@WodansSon
Collaborator

@slime-uk, if it is OK with you I would like to close this issue as this is not an issue with the Terraform resources, but rather an upstream issue with the actual service.

@slime-uk
Author

Sure! Thank you for looking into it and raising it with the service team. That was over-and-above effort from you, and I thank you. I will close with this comment. Thanks again.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 25, 2022