Using gMSA on multiple containers simultaneously causes Domain Trust Relationship to fail #405
Comments
Hi, thanks for bringing this issue to our attention. First, I have to give credit where credit is due: this is so well written up! Thank you for providing a very clear description of the current and expected behavior. Second, a quick question: is there a reason why all the containers in this cluster have the same gMSA? |
We actually don't use the same gMSA for all the containers in the cluster. Different types of application containers run with different gMSAs. The problem arises when there are multiple instances (replicas) of the same application, such as an application that needs to be highly available. During my testing I also found that it does not have to be replicas of the same container image/deployment; different containers running as the same gMSA will also run into this issue. Multiple containers running as the same gMSA can't be avoided for these purposes - without them we can't distribute our workload or promise high availability. |
@ntrappe-msft has there been an internal confirmation of this bug, and any discussions on a fix? This issue severely limits the ability to scale Windows containers and use AD authentication because of the direct relationship between the number of containers and domain controllers. |
Hi, thank you for your patience! We know this is blocking you right now and we're working hard to make sure it's resolved as soon as possible. We've reached out to the gMSA team to get more context on the problem and some troubleshooting suggestions. |
The gMSA team is still doing their investigation but they can confirm that this is unexpected and unusual behavior. We may ask for some logs in the future if it would help them diagnose the root cause. |
Hi, could you give us a few follow-up details?
|
Hi Nicole @ntrappe-msft
Process Isolation
Correct
Microsoft Windows Server 2022 Standard (Core), with October CU applied
Sharing some more data from our experiments, in case it helps the team to troubleshoot the issue:
This issue has been severely restricting usage of Windows Containers at scale :( |
🔖 ADO 47828389 |
While we appreciate that the Containers team is still looking into this issue, I wanted to share some insights into just how difficult this problem seemingly is to work around. In order to prevent requests landing on "bad" containers, I was trying to write a custom ASP.NET Core health check that could query the status of the container's trust relationship and mark the service as unhealthy when domain trust fails. What seemed to be a very straightforward temporary fix/compromise for our problems turned out to be a complex anomaly:
My guess for why the usual means of troubleshooting gMSA/trust problems are not working for us is that there was probably an attempt to fix a VERY SIMILAR problem for containers in Server 2019:
Since we do not understand how this was achieved, we have again reached a dead end and are desperately hoping the Containers team is able to solve our gMSA-Containers-At-Scale problem |
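For context, the "usual means" of checking trust from inside a container are the standard secure-channel probes, roughly like the sketch below (this is not the author's health-check code, `contoso.com` is a placeholder domain, and, as the comment notes, these probes did not behave as expected in this scenario):

```powershell
# Ask Windows whether the (gMSA) account's secure channel to the domain is healthy.
nltest /sc_query:contoso.com      # reports the DC in use and the channel status
Test-ComputerSecureChannel        # returns $true when the secure channel is intact
```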
Thanks for the additional details. We've had a number of comments from internal and external teams struggling with the same issue. Our support team is still working to find a workaround that they can publish. |
Support team is still working on this. We'll make sure we also update our "troubleshoot gMSAs" documentation when we can address the problem. |
We're also running into this issue. We're using Windows Server 2019 container images; however, there are no multiple container instances running with the same gMSA, yet we still get the same error about trust. Update:
|
Hello @ntrappe-msft - is the Containers team in touch with the gMSA/CCG group? Our support engineers informed us that we are the only ones who have reported this issue, but based on your confirmation in #405 (comment), and judging from the reactions on this issue, it is clear there are many users who have run into this exact problem.
@israelvaldez, see my above comment. I would think it is worth highlighting this problem to Microsoft Support from your end as well, so that it is obvious, without any doubt, that multiple customers face this and it can be appropriately prioritized (if not already). |
Hi @ntrappe-msft, we are also experiencing the same issue, with our gMSA containers intermittently losing trust with our domain and needing to be restarted. Wondering if Microsoft has any update on this issue. We have multiple container instances running the same app and using the gMSA. Interestingly, even though each of them has its own unique hostname defined, the log shows it connecting to the DC using the gMSA name as the MachineName. Host/domain/DC names replaced with **. EventID: 5720 |
@avin3sh you are definitely not the only one experiencing this issue. There are a number of internal teams who would like to increase the severity of this issue and the attention towards it. I'm crossing my fingers that we'll have a positive update soon. But it does help us if more people comment on this thread highlighting that they too are encountering this problem. |
This is a huge issue for us at Broadcom, with multiple Fortune 100 customers wanting this feature in one of our products and thousands of workloads being blocked from being migrated off VMs to containers. |
In my scenario, I created a new gMSA other than the one I was using (one that was not being used in multiple pods) and I was able to work around this problem. |
The workaround is appreciated, but we would like to see Microsoft fix this issue directly so that customers do not need to significantly redesign their environments. |
This issue has been fortunate enough not to get the attention of the auto-reminder bots so far, but I am afraid they will be here any time soon. I see this has finally been assigned; does that mean a fix is in the works? |
Please do not close this issue until the underlying technical problem has been resolved.
On Jun 3, 2024, at 3:01 PM, microsoft-github-policy-service[bot] wrote:
This issue has been open for 30 days with no updates.
@riyapatel-ms, please provide an update or close this issue.
|
We have started seeing a new issue with the newer nanoserver images. As a result, all our new containers are failing to serve ANY incoming kerberized requests! This is no longer intermittent. This is no longer about the number of containers running simultaneously with a gMSA. This is a straight-up fatal error rendering the container pretty much unusable. Now, one would think "downgrading" to an older image would be the answer. To summarize,
This issue desperately needs a fix. It's almost as if you can't use Windows Containers for any of your gMSA and Active Directory use cases anymore! |
We are also facing a similar issue with the usage of gMSA when scaling Windows containers. We also provide a hostname at container creation, but due to gMSA the containers identify themselves by the gMSA name. This leads to mismatches in our backend, which tries to keep track of incoming traffic; it gets heavily confused when all requests come from the same "machine". Of course, as long as I only have one container running that makes use of the one gMSA, I am all good; the moment I scale, it crashes. (Fun fact: the product that gets confused is also from Microsoft :P) So I'm also curious what will happen with this :) Ultimately, this is what kills me (from here). Can't it put the container hostname as a suffix or so? :D |
It appears we may be facing a similar issue, "The trust relationship between the primary domain and the trusted domain failed", on our AKS cluster. Is this being worked on? |
Quick question on the environment in which you folks are seeing this issue: is NETBIOS enabled in your environment? NETBIOS uses ports 137, 138, and 139, with 139 being used for Netlogon. I have tested this with a customer (who was kind enough to validate their environment) for whom a deployment with multiple pods worked normally. This customer has NETBIOS disabled, and port 139 between the pods/AKS cluster and the Domain Controllers is blocked. I'm not saying this is a fix, but I wanted to check if others see this error even with NETBIOS disabled or the port blocked. |
From what I have found (I can do a more thorough test later), NETBIOS is disabled on the container host's primary interface and on the HNS management vNIC (we use Calico in VXLAN mode). However, the vNICs for individual pods show NETBIOS as enabled. We haven't done anything to block traffic on port 139. Do you suggest we perform a test after disabling NETBIOS on the pod vNICs as well, AND blocking port 139? I am not sure how to configure this within the CNI, but perhaps I can write a script to disable NETBIOS by making a registry change after the container's network has come up, unless you have a script handy that you could share. BTW, just to reiterate the severity from my earlier comment #405 (comment): nanoserver images after March 2024 have made this problem worse. Earlier the issue was intermittent and dependent on some environmental factors, but March 2024+ nanoserver images are causing 100% failures. |
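A script along the lines described above, assuming the standard NetBT registry toggle is applied to every interface (including the per-pod vNICs) on the container host; this is a sketch, not a validated CNI integration:

```powershell
# Set NetbiosOptions = 2 (disable NetBIOS over TCP/IP) on all interfaces known to NetBT,
# which includes the vNICs created for pods.
$ifRoot = 'HKLM:\SYSTEM\CurrentControlSet\Services\NetBT\Parameters\Interfaces'
Get-ChildItem $ifRoot | ForEach-Object {
    Set-ItemProperty -Path $_.PSPath -Name 'NetbiosOptions' -Value 2
}
```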
Thanks @avin3sh for the note. No need for a fancy script or worrying from the cluster/pod side - if you block port 139 at the network/NSG level, this should help validate. Again, I'm asking here as a validation, we haven't been able to narrow it down yet, but we have customers running multiple containers simultaneously with no errors and I noticed they have NETBIOS disabled AND port 139 blocked. As for the Nano Server issue, can you please clarify: The issue happens even if you launch just one container? You're saying gMSA is not working on Nano Server at all? |
Thank you so much for clarifying. I will share my observation after blocking traffic on port 139.
We have a bunch of ASP.NET services. We use the Negotiate/Kerberos authentication middleware. If I use an ASP.NET nanoserver image that is using a Windows build from March 2024 or earlier, everything works. So essentially our web services are not able to authenticate using Negotiate when using any image from April or later. This does not happen if I launch just one container, but it happens 100% of the time if there are multiple containers. I think I haven't seen this behavior in the beefier Windows Server image, but I can't say for sure as we don't generally use it due to its large size. I have also seen varying behavior depending on the container user. |
@avin3sh a little off topic, but you may want to look at my project that can seamlessly translate tokens from JWT to Kerberos and vice versa. It's often used as a sidecar and it doesn't require the container to be domain joined: it uses the Kerberos.NET library under the covers, which is a managed implementation instead of relying on SSPI. |
@vrapolinario I tried this with Port 139 blocked like so (for TCP, UDP, Inbound and Outbound):
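The rules themselves aren't shown above; on the container host they would presumably look something like this (an assumption, not the commenter's exact rules):

```powershell
# Block NetBIOS session traffic (port 139) for TCP and UDP, inbound and outbound.
foreach ($proto in 'TCP', 'UDP') {
    New-NetFirewallRule -DisplayName "Block 139 $proto In"  -Direction Inbound  -Protocol $proto -LocalPort 139  -Action Block
    New-NetFirewallRule -DisplayName "Block 139 $proto Out" -Direction Outbound -Protocol $proto -RemotePort 139 -Action Block
}
```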
But the problem persisted. Any chance the customer who tried this had a large number of domain controllers in their environment? We have seen that as long as your Deployment replica count is less than or equal to the number of domain controllers in the environment, you typically don't run into this issue. |
We are happy to collaborate with you to test out various scenarios, experimental patches, etc. We already have a Microsoft Support case ongoing (@ntrappe-msft may be familiar), but it hasn't moved in several months - if you want to take a look at our case, we're more than willing to validate any suggestions that you may have for this problem. |
I believe I'm aware of the internal support case and I have reached out to the team with this note as well. They are now running some internal tests, but I haven't heard back from them. The main thing I wanted you all here to please evaluate is whether your environment is for some reason using NETBIOS. The fact that some of you reported the DCs seeing the same hostname from the pod requests, with a character limit of 15, tells me there's some NETBIOS communication happening. https://learn.microsoft.com/en-us/windows-server/identity/ad-ds/manage/dc-locator-changes By default, DNS should be in use, so if you only see 15 characters in the hostnames going to the DCs, that tells me something is off. By disabling NETBIOS or blocking port 139, you can quickly check if this helps solve the issue you are seeing. |
Thank you for the validation. We actually ran the same test last night, but I didn't have time to reply here. I can confirm that blocking TCP 139 won't solve the problem. Microsoft still recommends moving away from NETBIOS unless you need it for compatibility, but this is not the issue here. We're still investigating this internally and will report back. As for the Nano Server image issue, can I ask you to please open a separate issue here so we can investigate? These seem like two separate problems that are unrelated. The fact that you can't make the Nano Server image work at all indicates a different root cause. |
@vrapolinario I have created a new issue, #537, with the exact steps to reproduce the bug. It's a simple ASP.NET Core web app using minimal APIs with Kerberos enabled. Given that the error message is related to the domain trust failure, and that it does not happen when using NTLM but only with Kerberos, I strongly feel it may be related to the larger gMSA issue being discussed here, but I will wait for your analysis. |
I've followed this guide on a new cluster: https://learn.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/gmsa-aks-ps-module It results in the same error: 1786 (0x6FA) ERROR_NO_TRUST_LSA_SECRET. This is with an AD server running Windows Server 2016 and an AKS cluster with Windows Server 2019 nodes with gMSA enabled. |
This issue has been open for 30 days with no updates. |
Sharing an update from the internal team. We've released a new version of the Windows gMSA webhook. One of the new changes is creating a random hostname for the gMSA, which should help get around the random domain trust failures. To use this:
Thanks to @jsturtevant, @zylxjtu, and @AbelHu for this release. |
Thank you @ntrappe-msft and everyone who worked on releasing the workaround. While I am still evaluating it, and there are some error scenarios that no longer occur after deploying the newer gmsa-webhook, I still see a few inconsistencies. #537 is still an issue unless I use a nanoserver image released in March 2024 or prior. All the recent Windows Server 2022 nanoserver images don't work with NTAuth + ASP.NET Core unless I change the container user. |
Describe the bug
When running multiple containers simultaneously using the same gMSA, on either the same host or different hosts, one or more containers lose their domain trust relationship, leading to various issues including LsaLookup and Negotiate auth failures. This especially happens when the count of containers is equal to or greater than the count of domain controllers in the environment. However, it is also possible to run into this issue when the count of containers is less than the count of domain controllers, provided two or more containers attempt to talk to the same domain controller.
To Reproduce
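The original command block is not preserved here; a minimal sketch of the kind of `docker run` invocation the steps below refer to, assuming a PowerShell base image and a loop that exercises LSA lookups against `contoso\someobj` (that loop is an assumption, not the author's exact command; PowerShell syntax on the host):

```powershell
# Sketch only: run this at roughly the same time on two or more hosts,
# and/or several times on the same host, all with the same credential spec.
# The inner loop repeatedly translates a domain account to its SID, which
# forces an LSA lookup (and hence DC traffic) on every iteration.
docker run --rm -it `
  --security-opt "credentialspec=file://gmsa-credspec.json" `
  --hostname <gMSAName> `
  <image> `
  pwsh -Command "while (1) { ([System.Security.Principal.NTAccount]'contoso\someobj').Translate([System.Security.Principal.SecurityIdentifier]); Start-Sleep -Seconds 1 }"
```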
1. Replace `contoso\someobj` above with the SAM account name of an actual object. Replace `<gMSAName>` with an actual gMSA, `file://gmsa-credspec.json` with an actual gMSA Credential Spec file, and `<image>` with the container image.
2. Run the `docker run ...` command, simultaneously on two or more hosts and/or several times on the same host.
3. Monitor the output of all the containers; eventually one or more containers will start throwing the following error message. This usually happens within the first few seconds of the container starting, assuming the `docker run ...` in (2) above was run simultaneously on different hosts. If it does not happen, repeat (2) until it does.
4. While a running container is throwing the above error message in its output, `exec` into it and try performing some domain operation; that will fail as well.
Expected behavior
Running gMSAs on multiple Windows containers has been officially supported since at least Windows Server 2019. Running a gMSA on multiple containers simultaneously should not cause the trust relationship to fail.
Configuration:
Additional context
While the reproducer uses a PowerShell base image to demonstrate the bug, we had originally run into this issue in an ASP.NET Core web application while performing negotiate authentication.
The container image in the reproducer purposefully disables the LSA lookup cache by setting `LsaLookupCacheMaxSize` to `0` to simplify the example.
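For reference, a sketch of the registry change that disables this cache (the exact mechanism used in the reproducer image is not shown here):

```powershell
# Disable the LSA lookup cache so every name/SID translation has to go to a domain controller.
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa' `
  -Name 'LsaLookupCacheMaxSize' -Value 0 -PropertyType DWord -Force
```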
If you were to observe the traffic of a container that has run into this issue, the packet capture will indicate a lot of DCERPC/RPC_NETLOGON failure messages. You may also observe packets reporting `nca_s_fault_sec_pkg_error`.
Sometimes the container may "autorecover". This is purely a chance event. When this happens, you can see RPC_NETLOGON packets in the network capture. Typically the container recovers its domain trust relationship only when the NETLOGON happens through a different domain controller than the one the container had earlier communicated with.
It is also possible to re-establish the domain trust relationship of a failing container by running the following command in the failing container (the runtime user should be `ContainerAdministrator` or should have administrator privileges).
If the above command does not succeed, you may have to run it more than once. When the command succeeds, more often than not, all the affected containers, and not just the current container, "recover".
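The command itself does not appear above; a secure-channel verify/reset along these lines is the usual approach and matches the note about administrator rights (an assumption, not necessarily the author's exact command; `contoso.com` is a placeholder domain):

```powershell
# Run elevated (e.g. as ContainerAdministrator) inside the failing container.
nltest /sc_verify:contoso.com    # check the secure channel to the domain
nltest /sc_reset:contoso.com     # force the secure channel to be re-established
# PowerShell alternative to the reset:
Test-ComputerSecureChannel -Repair
```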
As mentioned in the bug description, it is very easy to run into this issue when the count of containers is more than the number of domain controllers in the environment but that is not the only scenario.
`docker run ...` is not the only way to run into this issue. It can also be reproduced on an orchestration platform like Kubernetes, by setting the `replicas` count of the Deployment to N+1 or by using the scaling feature; see the sketch below.
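As a concrete illustration of the orchestrator path, assuming a Windows Deployment that already references a gMSA credential spec (names are placeholders):

```powershell
# Scale the Deployment past the number of domain controllers and watch the pod logs
# for trust-relationship errors.
kubectl scale deployment <gmsa-app> --replicas=5
kubectl logs --selector app=<gmsa-app> --prefix --follow
```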