Provisioning state failed from private ACR with User managed identity #1233

Open
1 of 3 tasks
nextdarius opened this issue Jul 15, 2024 · 26 comments
Labels
Backlog Issue has been validated and logged in our backlog for future work bug Something isn't working CLI Related to CLI

Comments

@nextdarius

This issue is a: (mark with an x)

  • bug report -> please search issues before submitting
  • documentation issue or request
  • regression (a behavior that used to work and stopped in a new release)

Issue description

I have a private ACR with admin access disabled, from which I pull images for my Azure Container App. I created a user-managed identity, granted it AcrPull, and assigned it to the Container App. When I try to update the revision of my container app using the AZ CLI (OIDC login), I simply receive "provisioningState": "failed" without any additional information. I checked both ContainerAppSystemLogs_CL and ContainerAppConsoleLogs_CL but could not find anything.

As soon as I enable admin access on the ACR, everything works normally and I can see logs (creation of the new revision, deprovisioning of the old one, etc.).

Doing this from Portal with the same user managed identity is OK as well.

Steps to reproduce

  1. Use Az CLI with OIDC authentication
  2. Prepare an ACR without admin access
  3. Prepare a User Identity with AcrPull for previous ACR created
  4. Assign the user identity to the container app
  5. Perform an az containerapp update with an image from the ACR
  6. Receive provisioning state failed

Expected behavior
A new revision to be created

Actual behavior
Provisioning state failed without any information or logs in any tables.

@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs: triage 🔍 Pending a first pass to read, tag, and assign label Jul 15, 2024
@anthonychu anthonychu added bug Something isn't working CLI Related to CLI labels Jul 15, 2024
@redging-very-well

I'm facing the exact same issue.

Incidentally, I've also tried setting the registry on the container app to my managed identity, but this also fails:

az containerapp registry set -n example -g $RG --server $ACR.azurecr.io --identity $ID_NAME
User identity /subscriptions/<subid>/resourcegroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/mi-acr-puller is already assigned to containerapp
- Running ..Failed to provision revision for container app 'example'. Error details: The following field(s) are either invalid or missing. Field 'configuration.Registries<acr>.azurecr.io.Identity' is invalid with details: 'Invalid value: "mic-acr-puller": Managed Identity does not exist';..

(I've blocked out the sub and rg intentionally - those are correctly populated with the expected sub and rg in the console output.)

@simonjj simonjj added Backlog Issue has been validated and logged in our backlog for future work and removed Needs: triage 🔍 Pending a first pass to read, tag, and assign labels Jul 17, 2024
@simonjj
Collaborator

simonjj commented Jul 17, 2024

Thank you for raising this, @nextdarius and @redging-very-well. We've labeled this as Backlog. If this is of high priority, please go ahead and raise a support ticket, and feel free to mention this issue in your ticket.

@simongottschlag

simongottschlag commented Jul 19, 2024

I was trying this out in my lab and noticed the same issue. In my case, I have an Azure Firewall that blocks all traffic to the internet. I saw that it was blocking traffic to two different FQDNs for login (the normal one and a region-specific one).

After opening that traffic it started working for me.

These two were needed for me:

login.microsoftonline.com
swedencentral.login.microsoft.com
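
For anyone checking their own firewall rules, the two endpoints above can be generated and probed with a small script. This is only a sketch: `login_fqdns` and `check_reachable` are hypothetical helper names, and the `<region>.login.microsoft.com` pattern is an assumption generalized from the single `swedencentral` example in this comment.

```shell
# login_fqdns <region>: print the login endpoints that had to be allowed.
# Assumption: the regional endpoint follows "<region>.login.microsoft.com".
login_fqdns() {
  printf '%s\n' "login.microsoftonline.com" "$1.login.microsoft.com"
}

# check_reachable <fqdn>: succeed if an HTTPS connection can be opened.
# Needs curl and outbound network access, so it is defined here but not run.
check_reachable() {
  curl --silent --output /dev/null --max-time 5 "https://$1/"
}
```

A possible use is `for h in $(login_fqdns swedencentral); do check_reachable "$h" || echo "blocked: $h"; done`. The result is only meaningful when run from a host whose traffic passes through the same firewall as the Container Apps environment.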

@redging-very-well

redging-very-well commented Aug 9, 2024

I've figured out that you can set the registry if you specify the --identity parameter as a fully qualified ID.

e.g.

FQID=$(az identity show -n ${identityName} -g ${RG} --query id --output tsv)
az containerapp registry set -n example -g $RG --server $ACR.azurecr.io --identity $FQID

@nextdarius
Author

Thanks @redging-very-well, I confirm as well that this does the trick!

We're using Terraform to create the resources and discovered in the meantime that adding the registry block solves it as well.

However, it's still very hard to troubleshoot a deployment failure in such a case because, from what I experienced, there's no information at all. Also, the fact that the az CLI does not throw an error when the deployment fails is not ideal for CI/CD.
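
One way to make a CI job fail on this kind of silent failure is to read the provisioning state back after the update and turn it into an exit code. The sketch below is an illustration, not something posted in the thread: `provisioning_ok` and `check_app` are hypothetical helper names, and it assumes that `properties.provisioningState` from `az containerapp show` reflects the failed update.

```shell
# provisioning_ok <state>: return 0 only for a succeeded state.
# The comparison is case-insensitive because the issue quotes "failed" in
# lowercase while other output may use "Failed".
provisioning_ok() {
  state_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$state_lc" in
    succeeded) return 0 ;;
    *) echo "provisioningState is '${1:-<empty>}', failing the job" >&2
       return 1 ;;
  esac
}

# check_app <app-name> <resource-group>: hypothetical CI step.
# Defined but not run here because it needs a logged-in az session.
check_app() {
  state=$(az containerapp show -n "$1" -g "$2" \
    --query properties.provisioningState -o tsv) || return 1
  provisioning_ok "$state"
}
```

In a pipeline this could run as `check_app example "$RG"` directly after `az containerapp update`, so a non-zero exit stops the job.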

@redging-very-well

@nextdarius glad that helped!

I totally agree - the container app deployment experience isn't great. It would be good if there was a way to wait for a deployment to succeed, as is possible with tools like helm.
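
A polling loop can approximate that kind of wait in the meantime. The following is a minimal sketch under stated assumptions: `wait_for_state` and `get_app_state` are hypothetical names, the terminal states are assumed to be Succeeded, Failed, and Canceled, and the app-level `provisioningState` is assumed to be a good enough signal for the deployment.

```shell
# wait_for_state <timeout_seconds> <interval_seconds> <command...>
# Runs <command...> repeatedly until it prints a terminal state.
# Returns 0 on succeeded, 1 on failed/canceled, 2 on timeout.
wait_for_state() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  state=""
  while [ "$elapsed" -le "$timeout" ]; do
    state=$("$@" | tr '[:upper:]' '[:lower:]')
    case "$state" in
      succeeded) return 0 ;;
      failed|canceled) echo "terminal state: $state" >&2; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s (last state: ${state:-unknown})" >&2
  return 2
}

# get_app_state <app-name> <resource-group>: hypothetical state getter.
# Defined but not run here because it needs a logged-in az session.
get_app_state() {
  az containerapp show -n "$1" -g "$2" \
    --query properties.provisioningState -o tsv
}
```

For example, `wait_for_state 300 10 get_app_state example "$RG"` would poll every 10 seconds for up to 5 minutes. Passing the getter as a command keeps the loop independent of the Azure CLI, so it can be exercised with a stub.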

@Greedygre

Hi @nextdarius
What error did you get at the step "Perform an az containerapp update with an image from the ACR"?
Based on the steps listed, you didn't run the command that assigns the user identity to the registry, as follows:
az containerapp registry set -n example -g $RG --server $ACR.azurecr.io --identity $IDENTITY_RESOURCE_ID

@Greedygre

Greedygre commented Aug 19, 2024

Hi @redging-very-well
A user-defined identity must be given as a resource ID. Did you get this error when you passed the user-defined identity's name? Thanks.


az containerapp registry set -h

Command
    az containerapp registry set : Add or update a container registry's details.

Arguments
    --identity          : The managed identity with which to authenticate to the Azure Container
                          Registry (instead of username/password). Use 'system' for a system-defined
                          identity or a resource id for a user-defined identity. The managed
                          identity should have been assigned acrpull permissions on the ACR before
                          deployment (use 'az role assignment create --role acrpull ...').

@Greedygre

I will add a friendlier error message to az containerapp registry set for the case where the --identity value is neither system nor the resource ID of a user-defined identity.
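
Until such a message ships, a script can run the same check before calling the CLI. This is a rough sketch, not the CLI's actual validation: `classify_identity` is a hypothetical helper, and the pattern is based only on the resource ID shape shown earlier in this thread.

```shell
# classify_identity <value>: print "system", "resource-id", or "invalid"
# for a value intended for --identity. Matching is case-insensitive because
# the thread shows both "resourcegroups" and mixed-case provider names.
classify_identity() {
  lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lowered" in
    system)
      echo system ;;
    /subscriptions/*/resourcegroups/*/providers/microsoft.managedidentity/userassignedidentities/?*)
      echo resource-id ;;
    *)
      echo invalid ;;
  esac
}
```

A bare name such as `mi-acr-puller` is classified as `invalid`, which is the case that produced the "Managed Identity does not exist" error above.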

@FilippTrigub

Similar issue here:

  • container app deployed with image A
  • image A was automatically deleted by a task
  • container app now refuses to deploy new image, cannot be stopped, cannot be modified in any way

Currently my only solution is to destroy it and create a new one. Not great.

@jonathan-vogel-siemens

I can confirm this sometimes happens, also somehow in conjunction with Terraform. The only reliable solution I have found is to destroy and redeploy. The Container App is rendered completely unusable and refuses to locate the managed identity in any way.

@Greedygre

Steps to reproduce

  1. Use Az CLI with OIDC authentication
  2. Prepare an ACR without admin access
  3. Prepare a User Identity with AcrPull for previous ACR created
  4. Assign the user identity to the container app
  5. Perform an az containerapp update with an image from the ACR
  6. Receive provisioning state failed

For this issue, before step 5, we need to assign the user identity to the registry:

az containerapp registry set -n example -g $RG --server $ACR.azurecr.io --identity $IDENTITY_RESOURCE_ID

Then we can perform an az containerapp update with an image from the ACR.

For an easier way to run az containerapp registry set and az containerapp update in one command, you can try the following with containerapp extension version >= 1.0.0b4:
az containerapp up --image {} --registry-identity {$Identity-resource-id}

To install or upgrade the extension:
az extension add -n containerapp --upgrade
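
The ordering described above (registry set first, then update) can be captured in a small wrapper. This is a sketch only: `run` and `deploy_image` are hypothetical helpers, and the wrapper defaults to a dry run that prints the commands instead of executing them.

```shell
# run <command...>: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# deploy_image <app> <resource-group> <acr-name> <identity-resource-id> <image>
# Binds the user-assigned identity to the registry before updating the image,
# and stops if the first step fails.
deploy_image() {
  run az containerapp registry set -n "$1" -g "$2" \
    --server "$3.azurecr.io" --identity "$4" || return 1
  run az containerapp update -n "$1" -g "$2" --image "$5"
}
```

Calling `deploy_image example "$RG" "$ACR" "$IDENTITY_RESOURCE_ID" "$ACR.azurecr.io/app:tag"` prints both commands in order; exporting `DRY_RUN=0` would execute them against a logged-in az session.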

@Greedygre

Greedygre commented Jan 14, 2025

Hi @FilippTrigub @jonathan-vogel-siemens
Which version are you using? Could you share the output of az version?
There was an issue where provisioning state failed was returned without any reason; it has been fixed since azure-cli version 2.66.0.

Thanks!
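
A script can check for the fixed version before deploying. The sketch below assumes GNU `sort -V` is available; `version_ge` and `cli_version` are hypothetical helper names, and 2.66.0 is the version quoted in this comment.

```shell
# version_ge <a> <b>: succeed when dotted version a is >= b.
# Relies on "sort -V" (version sort), which is an assumption about the host.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cli_version: print the installed azure-cli version.
# Defined but not run here because it needs the az CLI to be installed.
cli_version() {
  az version --query '"azure-cli"' -o tsv
}
```

For example, `version_ge "$(cli_version)" 2.66.0 || echo "azure-cli is older than 2.66.0" >&2` warns when the CLI predates the fix.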

@FilippTrigub

I'm managing the app with Terraform 1.9.8 and az cli 2.67.0.

I'm fairly certain the problem described above occurs equally when trying to deploy a new revision manually via the UI. The app can't handle deprovisioning of revisions whose images are no longer available in the ACR.

@Greedygre

Greedygre commented Jan 14, 2025

What do you mean by "image A was automatically deleted by a task"? Do you mean that image A existed in the ACR and was then deleted by design? Can you give me your container app name, the timestamp when the error happened, and the region where the container app is deployed? Thanks.

@AurimasNav

AurimasNav commented Jan 14, 2025

Azure support suggested authenticating to ACR using admin credentials to avoid encountering this bug.

@FilippTrigub

@Greedygre I don't have one at hand, unfortunately. I mean that this bug occurs automatically if the image of the active revision of the container app has been deleted from the ACR.

It is of course obvious that the app should crash if the image cannot be pulled. The problem is however that the app becomes locked in the ProvisioningState failed state and cannot be simply redeployed with a new image.

@Greedygre

Greedygre commented Jan 15, 2025

Hi @FilippTrigub
I cannot repro this issue in Single revision mode with the following steps.

  • container app deployed with image A
    my step: az containerapp create -n {} -g {} --environment {} --registry-server {my-acr}.azurecr.io --image {my-acr}.azurecr.io/k8se/quickstart:testversion2
  • image A was automatically deleted by a task
  • container app now refuses to deploy new image, cannot be stopped, cannot be modified in any way
    my step: az containerapp update -n {} -g {} --image {my-acr}.azurecr.io/k8se/quickstart:latest

  1. Could you tell me more about the container app's revision mode and the environment type it uses?
  2. Even if you have deleted the container app, can you give me its name, the environment name, the region, and the date you hit this issue? I can search for more detail from that to help reproduce it.
  3. Does your private ACR have admin access enabled?
  4. Did you use a user-assigned identity with AcrPull on that ACR?

Thanks!

@jonathan-vogel-siemens

@Greedygre did you make sure the container is scaled to zero before deleting the image from ACR? Then after it is deleted, the revision should try to activate again. Also not sure, but the container apps service might cache images for some time.

@FilippTrigub

@Greedygre

As @jonathan-vogel-siemens points out, you have to scale the app to 0, then delete the underlying image in the ACR, then restart the app. The container will attempt pulling the image, will not be able to, and move into the locked state.

  1. Containerapp is in Single mode. Env type does not matter (happens in Consumption and D4).
  2. I really can't, because I just destroy the whole app every time that happens, and it only happens if I don't deploy daily. It probably happened over the holidays last time, but I do not know whether it was recorded in system logs and, if so, how.
  3. Yes.
  4. Yes, it's still on user identity. Need to update that.

@Greedygre

Hi @FilippTrigub

For "container app now refuses to deploy new image, cannot be stopped, cannot be modified in any way", I tried the following steps:

  1. Create a container app from ACR image A using a user-assigned identity with AcrPull (Single mode).
  2. Wait for the container app to scale to 0.
  3. Delete image A.
  4. Try to stop the container app. This fails with the error: The Container App failed to stop.: Failed to stop container app 'xxxx'. Error details: The following field(s) are either invalid or missing. Field 'template.containers.xxxx.image' is invalid with details: 'Invalid value: "xxxx.azurecr.io/k8se3/quickstart:testversion3": GET https:: MANIFEST_UNKNOWN: manifest tagged by "testversion3" is not found; map[Tag:testversion3]';.. This looks like a real issue, and we are investigating.

But if I update the container app with an image C (the main point being that image C differs from the image returned by az containerapp show), the update succeeds.

This is because updating a container app with an image that does not change the template (that is, the same image that az containerapp show returns) does not create a new revision, so it looks stuck and nothing happens.

To make sure we are talking about the same case: did you try with a different image C at that time, or did you always try the same image that az containerapp show returns when the issue happens?
What exactly happens when it "refuses to deploy new image" and "cannot be modified in any way"?
Thank you!

@FilippTrigub

@Greedygre

My apps are updated via CICD, so yes, I am fairly certain the image was new.

I tried to deploy a new revision with a new image with terraform and manually, without success.

I encountered the issue yesterday when updating an old frontend deploy for my prod. The error produced by the workflow was

ERROR: Failed to provision revision for container app 'frontend-app-production'. Error details: The following field(s) are either invalid or missing. Field 'template.containers.frontend-app-production.image' is invalid with details: 'Invalid value: "snaacr.azurecr.io/frontend-app:2025-01-05-20-19-57-5fd403db-prod": GET https:: MANIFEST_UNKNOWN: manifest tagged by "2025-01-05-20-19-57-5fd403db-prod" is not found; map[Tag:2025-01-05-20-19-57-5fd403db-prod]

This occurred on 16.01.25 at 11:22 AM GMT+1.

I am using azure/container-apps-deploy-action@v2 to deploy.

Happy to provide you with more details. Please indicate what you would need.

In the meantime I have rewritten our purge scripts so that this doesn't occur anymore.
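
The purge script itself was not posted, so the following is only a guess at what such a guard could look like: it drops the tag currently in use from the list of purge candidates. `image_tag` and `filter_purge_candidates` are hypothetical helpers, and digest references (`@sha256:...`) are not handled.

```shell
# image_tag <image-reference>: print the tag of "registry/repo:tag".
# Assumption: a reference without a tag means "latest".
image_tag() {
  name=${1##*/}                 # drop the registry and repository path
  case "$name" in
    *:*) printf '%s\n' "${name##*:}" ;;
    *)   printf 'latest\n' ;;
  esac
}

# filter_purge_candidates <tag-in-use>: read candidate tags on stdin, one per
# line, and print only the tags that differ from the one in use.
filter_purge_candidates() {
  grep -F -x -v -- "$1" || true   # grep exits 1 when nothing is left
}
```

The image in use could be read with `az containerapp show -n <app> -g <rg> --query "properties.template.containers[0].image" -o tsv`, passed through `image_tag`, and then used to filter the purge list before any tag is deleted.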

@Greedygre

Hi @FilippTrigub

Thanks for your help!

The MANIFEST_UNKNOWN error you posted occurs because the image used to update the container app cannot be found in the ACR. I think the image with tag 2025-01-05-20-19-57-5fd403db-prod had already been deleted at that time. (The error occurred at 2025-01-15 10:21:53.2176640, and the image tag is 10 days older.)

I also found that the same error occurred on 2024-12-09 with image tag 2024-11-29-10-15-25-e41cb531-prod, which is also 10 days older. (I guess that image tag had been deleted too.)

Please check the logic of the task that cleans up image tags, and make sure the image exists when you use it to update the container app.

As for not being able to stop the container app, this looks like a real issue. I can repro it, and we are investigating.

Thanks!
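
The "make sure the image exists" advice can be turned into a pre-deployment check. This is a sketch under assumptions: `acr_name_of`, `repo_and_tag`, `image_exists`, and `safe_update` are hypothetical helpers, the registry is assumed to be an `<name>.azurecr.io` ACR, and the caller is assumed to have permission to read repository metadata.

```shell
# acr_name_of <image-reference>: "myacr.azurecr.io/repo:tag" -> "myacr"
acr_name_of() {
  host=${1%%/*}
  printf '%s\n' "${host%%.*}"
}

# repo_and_tag <image-reference>: "myacr.azurecr.io/repo:tag" -> "repo:tag"
repo_and_tag() {
  printf '%s\n' "${1#*/}"
}

# image_exists <image-reference>: succeed if the tag is present in the ACR.
# Defined but not run here because it needs a logged-in az session.
image_exists() {
  az acr repository show --name "$(acr_name_of "$1")" \
    --image "$(repo_and_tag "$1")" --output none
}

# safe_update <app> <resource-group> <image-reference>
# Refuses to update the container app when the image cannot be found.
safe_update() {
  image_exists "$3" || { echo "image not found: $3" >&2; return 1; }
  az containerapp update -n "$1" -g "$2" --image "$3"
}
```

This only guards the update path; it does not help with an app that is already stuck because the image of its active revision was deleted.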

@FilippTrigub

This is exactly what happened. Glad to hear it can be reproduced.

@arielmoraes

ERROR: Failed to provision revision for container app 'frontend-app-production'. Error details: The following field(s) are either invalid or missing. Field 'template.containers.frontend-app-production.image' is invalid with details: 'Invalid value: "snaacr.azurecr.io/frontend-app:2025-01-05-20-19-57-5fd403db-prod": GET https:: MANIFEST_UNKNOWN: manifest tagged by "2025-01-05-20-19-57-5fd403db-prod" is not found; map[Tag:2025-01-05-20-19-57-5fd403db-prod]

+1. When deploying the app using VS, some fields are set, and even after deleting and recreating the app I can't add user-managed identities or custom domains, because it's targeting another repo that does not exist.

@AurimasNav

AurimasNav commented Jan 30, 2025

According to the Azure support case we had open, the underlying issue with managed identity authentication to ACR is fixed and these scenarios should no longer occur. There is no easy way to fix apps that are already stuck in the failed state, though.


10 participants