
[BUG] Service bus refreshing AAD tokens a lot #22991

Closed
Mortana89 opened this issue Jul 30, 2021 · 24 comments
Labels: bug · Client · customer-reported · needs-team-attention · Service Bus

Comments

@Mortana89

Describe the bug
We got a message that our Application Insights daily quota on our QA environment has been reached.
Upon investigation I noticed there are A LOT of AAD requests happening under the bonnet from the Service Bus SDK.
I don't know if this is normal behavior; I would expect the token to be cached?

Expected behavior
The token should only be refreshed when it has expired. Is it cached?

Actual behavior (include Exception or Stack Trace)
This is a screenshot of our Application Insights instance, showing the requests from the past 30 minutes:
[screenshot: AAD token requests over the past 30 minutes]
This is on a VMSS running 23 ASP.NET services that use Azure Service Bus.

Environment:

  • Azure.Messaging.ServiceBus v7.1.2
  • ASP.NET Core 3.1, Azure Service Fabric 8.0
ghost added the needs-triage, customer-reported, and question labels on Jul 30, 2021
jsquire added the Azure.Identity, Client, and needs-team-attention labels on Jul 30, 2021
ghost removed the needs-triage label on Jul 30, 2021
@jsquire (Member) commented Jul 30, 2021

//fyi: @JoshLove-msft

@jsquire (Member) commented Jul 30, 2021

Hi @Mortana89. Thank you for reaching out and we regret that you're experiencing difficulties. Because the credential is self-managing, any refresh requests would be triggered by it and not the Service Bus client library. I've looped in our Azure.Identity experts, as they're best able to assist and offer insight around the TokenCredential types.

@christothes (Member)

The ServiceBusClient appears not to be doing any token caching. @JoshLove-msft - it looks like only ServiceBusAdministrationClient uses BearerTokenAuthenticationPolicy, which would cache tokens. Is there a reason it isn't used in the other clients?

@jsquire (Member) commented Jul 30, 2021

@christothes: The AMQP connection is persistent and authorization isn't sent with every network request. The token is only sent once when a connection/link is established and then not again until a background timer refreshes it slightly before it expires.

(example)
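For illustration, a minimal sketch of that timer-based pattern: the token is acquired once when the link is opened, sent as part of link authorization, and re-acquired by a background timer shortly before it expires. This is hypothetical code, not the SDK's actual implementation; the type name, ten-minute refresh buffer, and put-token placeholder are assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;

// Hypothetical sketch: the token is sent once when the AMQP link is opened and
// re-acquired by a background timer shortly before it expires.
internal sealed class LinkAuthorizationRefresher : IAsyncDisposable
{
    private readonly TokenCredential _credential;
    private readonly string[] _scopes = { "https://servicebus.azure.net/.default" };
    private Timer _refreshTimer;

    public LinkAuthorizationRefresher(TokenCredential credential) => _credential = credential;

    // Called once when the link is established.
    public async Task AuthorizeAsync(CancellationToken cancellationToken)
    {
        AccessToken token = await _credential.GetTokenAsync(
            new TokenRequestContext(_scopes), cancellationToken).ConfigureAwait(false);

        SendTokenOverLink(token);

        // Schedule a refresh shortly before the token expires (10-minute buffer assumed).
        TimeSpan dueTime = token.ExpiresOn - DateTimeOffset.UtcNow - TimeSpan.FromMinutes(10);
        _refreshTimer?.Dispose();
        _refreshTimer = new Timer(
            async _ => await AuthorizeAsync(CancellationToken.None).ConfigureAwait(false),
            state: null,
            dueTime: dueTime < TimeSpan.Zero ? TimeSpan.Zero : dueTime,
            period: Timeout.InfiniteTimeSpan);
    }

    private static void SendTokenOverLink(AccessToken token)
    {
        // Placeholder for the AMQP put-token operation performed on the link.
    }

    public ValueTask DisposeAsync()
    {
        _refreshTimer?.Dispose();
        return default;
    }
}
```

The point is that, as long as the link stays open, the credential is only consulted on the timer, not per operation.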

@christothes (Member)

It sounds like we need to investigate this from the Service Bus perspective to determine why GetToken is being called so much. The credential won't ever call it on its own.
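A quick way to confirm how often GetToken is actually being invoked is to wrap the credential before handing it to ServiceBusClient. The wrapper below is a hypothetical diagnostic aid, not part of the SDK:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;

// Hypothetical diagnostic wrapper: counts every GetToken/GetTokenAsync call
// the Service Bus client makes against the underlying credential.
public sealed class CountingCredential : TokenCredential
{
    private readonly TokenCredential _inner;
    private long _calls;

    public CountingCredential(TokenCredential inner) => _inner = inner;

    public long Calls => Interlocked.Read(ref _calls);

    public override AccessToken GetToken(
        TokenRequestContext requestContext, CancellationToken cancellationToken)
    {
        Interlocked.Increment(ref _calls);
        return _inner.GetToken(requestContext, cancellationToken);
    }

    public override async ValueTask<AccessToken> GetTokenAsync(
        TokenRequestContext requestContext, CancellationToken cancellationToken)
    {
        Interlocked.Increment(ref _calls);
        return await _inner.GetTokenAsync(requestContext, cancellationToken).ConfigureAwait(false);
    }
}
```

Passing `new CountingCredential(new DefaultAzureCredential())` into `ServiceBusClient` and logging `Calls` periodically would show whether the volume seen in Application Insights really maps one-to-one to GetToken invocations.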

@Mortana89 (Author)

I see this is happening per session? We have many unique session IDs, as we use sessions to get FIFO per entity ID (like customer ID). Could this perhaps be the issue? Shouldn't the AMQP connection be shared, or am I reading the code wrong? 🧐

@jsquire (Member) commented Jul 30, 2021

My understanding was that the credential was expiration-aware and would only perform an active request when GetToken was called and the token was within an interval close to its expiration. To clarify, are we saying that's not the case?

Assuming my understanding is flawed, then yes... this would likely indicate an issue where the AMQP connection/link is either dropping frequently or Service Bus is needing to create new links for some reason.

@jsquire (Member) commented Jul 30, 2021

@JoshLove-msft: Is Service Bus opening a new link for each session? If so, and @christothes confirms the token makes an active request for every GetToken call, it sounds like we'll need to add another layer of caching around it.

@christothes (Member)

> My understanding was that the credential was expiration-aware and would only perform an active request when GetToken was called and the token was within an interval close to its expiration. To clarify, are we saying that's not the case?

It's actually BearerTokenAuthenticationPolicy that typically handles token expiration and caching. I don't believe any of our credentials handle expiration automatically, although some handle caching via the underlying MSAL client.
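For reference, this is roughly how an HTTP-based client such as ServiceBusAdministrationClient picks up that caching behavior from Azure.Core. This is a sketch only; the scope string is the standard Service Bus scope and the exact internal wiring of the administration client may differ:

```csharp
using Azure.Core.Pipeline;
using Azure.Identity;
using Azure.Messaging.ServiceBus.Administration;

var credential = new DefaultAzureCredential();

// BearerTokenAuthenticationPolicy caches the acquired token and only calls
// TokenCredential.GetToken again when the cached token is close to expiring.
var authPolicy = new BearerTokenAuthenticationPolicy(
    credential, "https://servicebus.azure.net/.default");

// Every HTTP request sent through this pipeline reuses the cached token.
HttpPipeline pipeline = HttpPipelineBuilder.Build(
    new ServiceBusAdministrationClientOptions(), authPolicy);
```

The AMQP-based clients never route their token requests through such a policy, which is why the caching has to live elsewhere for them.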

@jsquire (Member) commented Jul 30, 2021

@christothes. Thanks. It seems that we've got a gap based on our misunderstanding. That puts this firmly back in Service Bus territory.

@jsquire (Member) commented Jul 30, 2021

> Shouldn't the AMQP connection be shared, or am I reading the code wrong? 🧐

Thanks, @Mortana89. That is helpful context; the connection is shared, but I suspect that each session needs a dedicated link. Since authorization exists at the link level, if I'm correct, and you're moving through sessions, then the client will be closing/opening links frequently - effectively working around the timer-based mechanism that manages refreshes and caching.
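To make the scenario concrete, here is an illustrative usage pattern along those lines (placeholder names; not taken from the reporter's gist): each accepted session gets its own receive link, so cycling through many session IDs keeps opening and closing links, and each new link authorization triggered a fresh GetToken call before caching was added.

```csharp
using Azure.Identity;
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient(
    "<namespace>.servicebus.windows.net", new DefaultAzureCredential());

// One session per entity (e.g. per customer ID) to get FIFO per entity.
string[] sessionIds = { "customer-001", "customer-002", "customer-003" };

foreach (string sessionId in sessionIds)
{
    // Accepting a session opens a dedicated AMQP receive link for it.
    ServiceBusSessionReceiver receiver =
        await client.AcceptSessionAsync("<queue-name>", sessionId);

    ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync();
    if (message != null)
    {
        await receiver.CompleteMessageAsync(message);
    }

    // Closing the receiver tears the session link down again.
    await receiver.DisposeAsync();
}
```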

@JoshLove-msft (Member)

> @JoshLove-msft: Is Service Bus opening a new link for each session? If so, and @christothes confirms the token makes an active request for every GetToken call, it sounds like we'll need to add another layer of caching around it.

Yes, each unique session would have its own receive link.

@JoshLove-msft (Member)

I don't think this is specific to sessions, or even Service Bus for that matter (should apply to Event Hubs as well), but it is just more likely to be noticed when using and switching between many sessions. The root cause seems to be that we do not cache across links for the same AMQP connection.
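One possible shape for that shared caching layer, as a hypothetical sketch only (the type name, five-minute refresh buffer, and locking strategy are assumptions, not necessarily what the eventual fix looks like): cache the token at the scope of the shared AMQP connection so every link created on it reuses the token until shortly before expiry.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;

// Hypothetical sketch: a token cache scoped to the shared AMQP connection.
// All links created on that connection would call GetTokenAsync here instead
// of hitting the TokenCredential directly.
internal sealed class ConnectionScopedTokenCache
{
    private static readonly TimeSpan RefreshBuffer = TimeSpan.FromMinutes(5); // assumption

    private readonly TokenCredential _credential;
    private readonly string[] _scopes;
    private readonly SemaphoreSlim _refreshLock = new SemaphoreSlim(1, 1);
    private AccessToken? _cached;

    public ConnectionScopedTokenCache(TokenCredential credential, string[] scopes)
    {
        _credential = credential;
        _scopes = scopes;
    }

    public async ValueTask<AccessToken> GetTokenAsync(CancellationToken cancellationToken)
    {
        // Fast path: reuse the cached token while it is not close to expiring.
        if (_cached is AccessToken cached && cached.ExpiresOn > DateTimeOffset.UtcNow + RefreshBuffer)
        {
            return cached;
        }

        await _refreshLock.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            // Re-check after acquiring the lock; another link may have refreshed it.
            if (_cached is AccessToken current && current.ExpiresOn > DateTimeOffset.UtcNow + RefreshBuffer)
            {
                return current;
            }

            AccessToken fresh = await _credential.GetTokenAsync(
                new TokenRequestContext(_scopes), cancellationToken).ConfigureAwait(false);
            _cached = fresh;
            return fresh;
        }
        finally
        {
            _refreshLock.Release();
        }
    }
}
```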

@jsquire (Member) commented Jul 30, 2021

> The root cause seems to be that we do not cache across links for the same AMQP connection.

Agreed; it's pretty clear that we do need to look at a different approach to caching in Service Bus since the link use is expected. For Event Hubs, I'm not so sure. We use links much less frequently and, though it was unintentional, the current pattern does force expiration and renewal to be staggered, so it may not be completely undesirable.

@JoshLove-msft (Member) commented Jul 30, 2021

> The root cause seems to be that we do not cache across links for the same AMQP connection.

> Agreed; it's pretty clear that we do need to look at a different approach to caching in Service Bus if the link use is expected. That said, if Service Bus is not closing/opening links frequently as an expected side-effect of sessions, then something else seems to be off and we should understand why link use is unstable.

Yep, this is certainly expected for some use cases with sessions. My point was just that we may want to fix this across the board rather than special casing sessions.
Imagine someone is just creating a ton of EH producer clients with the same connection - should we reuse tokens in that case?

@jsquire (Member) commented Jul 30, 2021

> Yep, this is certainly expected for some use cases with sessions.

Sorry; I overlooked your earlier comment that said exactly that. 😄

> My point was just that we may want to fix this across the board rather than special casing sessions.

Agreed.

> Imagine someone is just creating a ton of EH producer clients with the same connection - should we reuse tokens in that case?

Likely; though that's an unusual scenario since each Event Hubs client type has its own connection by default.

@JoshLove-msft (Member)

> Likely; though that's an unusual scenario since each Event Hubs client type has its own connection by default.

Fair point, but from a maintenance perspective I would sleep easier at night if we had a common strategy here between both services - it would help surface bugs quicker and avoid forking our behavior.

@Mortana89 (Author)

For what it's worth, I've collected the relevant pieces of code (skipped some stuff); perhaps it has to do with how we're doing things, so I thought I'd share:
https://gist.github.com/Mortana89/407ccef8ffc429989a097391be660270

@jsquire (Member) commented Jul 30, 2021

> Fair point, but from a maintenance perspective I would sleep easier at night if we had a common strategy here between both services - it would help surface bugs quicker and avoid forking our behavior.

Also fair. It should be a small surface area and low-risk fix. I'll take a peek at it early next week for both.

@jsquire (Member) commented Jul 30, 2021

> For what it's worth, I've collected the relevant pieces of code (skipped some stuff); perhaps it has to do with how we're doing things, so I thought I'd share:
> https://gist.github.com/Mortana89/407ccef8ffc429989a097391be660270

Thanks, @Mortana89!

jsquire assigned jsquire and unassigned JoshLove-msft on Jul 30, 2021
jsquire added the bug label and removed the question label on Jul 30, 2021
jsquire added this to the [2021] August milestone on Jul 30, 2021
@Mortana89 (Author)

Hi, is there any ETA for this?
It's blocking our production release: it causes our instances to process at most 5 unique sessions per second in total, while we have 20+ instances each easily processing 100 sessions in parallel. As a result, most messages end up in the dead-letter queue once the max delivery count of 10 is reached.
As a last resort we could revert to connection string authentication, but I'd like to avoid that if possible for compliance reasons.

Thanks!
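(For completeness, the connection-string fallback mentioned above is just the alternate ServiceBusClient constructor; placeholder value, shown only for contrast with the TokenCredential-based constructor:)

```csharp
using Azure.Messaging.ServiceBus;

// Shared-access-key auth via connection string (placeholder value), which
// bypasses AAD token acquisition entirely.
var client = new ServiceBusClient(
    "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>");
```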

@jsquire (Member) commented Aug 6, 2021

Hi @Mortana89. We've got an approach that we're comfortable with and the changes are in-progress. Because this impacts a critical path of the SDK, we'll need to do a full validation pass of long-running stress tests before releasing.

This won't be complete in time for our August release cycle. For Service Bus, I'll defer to @JoshLove-msft on potential release timing. For Event Hubs, these changes will be part of our September release.

@JoshLove-msft (Member)

> This won't be complete in time for our August release cycle. For Service Bus, I'll defer to @JoshLove-msft on potential release timing. For Event Hubs, these changes will be part of our September release.

I think we will do the same for Service Bus.

@jsquire (Member) commented Aug 16, 2021

Closing this out; the fix has been merged in and will be included in the next release cycle.

jsquire closed this as completed on Aug 16, 2021
github-actions bot locked the conversation and limited it to collaborators on Mar 27, 2023