-
-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Need to "re-assume the IAM role when re-authenticating" to avoid "Session too short" errors on MSK #731
Comments
We are using franz-go v1.11.0. We are working on upgrading to and testing with franz-go v1.16.1. |
For context, this is the chunk of code: For comparison, the Kafka code for determining lifetime is here: https://github.com/apache/kafka/blob/bef83ce89bb6a869cf884b37d5b5f86d8f243488/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslClientAuthenticator.java#L683-L700 Kafka's code blindly chooses between 85% and 95% of the returned lifetime, meaning if there is 1s left to the lifetime, Kafka will reauthenticate after ~900ms. If there are 10s left, Kafka reauthenticates after ~9s. franz-go reauths based on the lifetime minus some "pessimism". For large lifetimes, between 2% and 5% of the lifetime is used -- meaning if the lifetime is 10s, franz-go will reauth after ~9.65s. For small lifetimes, a minimum pessimism of 1s is used (or 1.1x the request latency, if it is larger) -- meaning if the lifetime is 1s, franz-go reauthenticates immediately after one request goes through. In practice, the behavior should be roughly the same. I don't know what Amazon is doing on their side, but the error you're seeing is new behavior by Amazon considering the code path in franz-go that is running into this hasn't changed in 2yr. |
I'm not convinced that something necessarily changed on the AWS side in the last two years because we have been seeing these "Session too short" errors for well over a year. We just didn't realize they were impactful and causing failures because our fallback mechanism was kicking in. If I understand correctly, AWS does reject sessions even before they are expired. You can see the details here: #249 (comment). I'm not sure if that "100ms" value is still used or if they have increased that number. I will have them review your response here and either post on their behalf or have them respond if they are willing. I admit this seems more like a bug or inconsistency on the AWS side, but AWS has made it clear that "This behavior is not considered to be a bug by the MSK internal team, it is expected behavior. To avoid the session too short error it is suggested to re-assume the IAM role when re-authenticating." |
If I'm not mistaken, re-assuming the IAM role is done automatically if you follow the example here: franz-go/examples/sasl/aws_msk_iam/main.go Lines 40 to 51 in 2bc29e6
Credentials are never cached, so any time the client internally goes through the SASL flow, whatever is returned from the Authenticate function (which is the credential loading: Lines 82 to 99 in 2bc29e6
Basically what it sounds like is that AWS is returning that SASL credentials can be used for maybe 1s, but there's enough delay in going through and issuing a request that when the request is finally issued and received by the broker, the SASL lifetime is <100ms left and the request is rejected. If the above^^ is the case, then there really is nothing that can be done by the client. My original implementation was to refuse SASL credentials with a <5s lifetime specifically because we can't trust network RTT during latency spikes, but I removed that due to #249. It's unreasonable for Amazon to a 1s lifetime and then expect a request to make it to the broker within that 1s. There's also literally no downside for Amazon to just return a new session whenever the old one has less than, say, 5s left. |
I'm going to close this for now, but let me know if AWS gets back or if you have an idea of what to improve within the client. FWIW it behaves pretty closely to Kafka now, so I'd expect these same errors on the Java client? |
Thanks. We are still waiting to hear back from AWS on this. I will update here once I hear from them. |
AWS provided the following direction. "We recommend you to ... use different session name during re-authentication when using your go library. Please be cautious that you are not are not creating too many sessions as that may impact their CPU usage." We updated the closure passed to aws.ManagedStreamingIAM to cycle through 5 different session names as it authenticates. This has completely resolved the "Session too short" issue as far as we can tell. Thanks for your patience and quick responses! They were a huge help in getting to the right solution. |
Hey @saholman, my team is facing the same issue Thanks in advance! |
We are regularly unable to produce to our MSK Kafka cluster because we are getting back "Session too short: SASL_AUTHENTICATION_FAILED: SASL Authentication failed" errors returned from the ProduceSync method. @twmb I know you clarified in #249 that this was because AWS was returning the same session, even if it had <2s left. We asked AWS about this and got the following response:
"Regarding your question about if MSK will return a session that is expected to expire within 2 seconds, the internal team has confirmed that yes, if the same credentials are used to authenticate then the current session will be returned even if it is about to expire. If you are re-assuming the IAM role while authenticating this should not happen."
Can franz-go be updated to "re-assume the IAM role while authenticating" so that MSK clients don’t run into regular "Session too short" errors?
Thanks!
The text was updated successfully, but these errors were encountered: