SPARK-41415/SPARK-42090 Backport to 3.3 #39634
### What changes were proposed in this pull request?
Add the ability to retry SASL requests. A metric to track SASL retries will be added soon.

### Why are the changes needed?
We are seeing increased SASL timeouts internally, and this change would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests, and tested on a cluster to ensure the retries are being triggered correctly.

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

### What changes were proposed in this pull request?
This PR introduces a SASL retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though the IOException at step 2 is retried (resulting in an increment of retryCount), the retryCount would be cleared at step 4.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A new test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
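
The difference between the flag and the counter can be seen by replaying the four-step scenario above. The following is a minimal sketch, not the actual RetryingBlockTransferor code: the class `SaslRetryCountSketch`, the `Failure` enum, and both method names are hypothetical, and it assumes every failure is retried and the maximum retry limit is never reached.

```java
import java.util.Arrays;
import java.util.List;

public class SaslRetryCountSketch {
    /** Failure kinds in the scenario; the names are illustrative only. */
    enum Failure { SASL_TIMEOUT, IO_EXCEPTION }

    /** Flag-based bookkeeping: a non-SASL failure after a SASL timeout clears retryCount. */
    static int retryCountWithFlag(List<Failure> failures) {
        int retryCount = 0;
        boolean saslTimeoutSeen = false;
        for (Failure f : failures) {
            boolean isSaslTimeout = (f == Failure.SASL_TIMEOUT);
            if (!isSaslTimeout && saslTimeoutSeen) {
                retryCount = 0;          // also discards earlier IOException retries
                saslTimeoutSeen = false;
            }
            if (isSaslTimeout) {
                saslTimeoutSeen = true;
            }
            retryCount++;                // assumption: every failure is retried
        }
        return retryCount;
    }

    /** Counter-based bookkeeping: only the SASL timeout retries are subtracted. */
    static int retryCountWithCounter(List<Failure> failures) {
        int retryCount = 0;
        int saslRetryCount = 0;
        for (Failure f : failures) {
            boolean isSaslTimeout = (f == Failure.SASL_TIMEOUT);
            if (!isSaslTimeout && saslRetryCount > 0) {
                retryCount -= saslRetryCount;  // undo only the SASL increments
                saslRetryCount = 0;
            }
            if (isSaslTimeout) {
                saslRetryCount++;
            }
            retryCount++;                // assumption: every failure is retried
        }
        return retryCount;
    }

    public static void main(String[] args) {
        List<Failure> scenario = Arrays.asList(
            Failure.SASL_TIMEOUT, Failure.IO_EXCEPTION,
            Failure.SASL_TIMEOUT, Failure.IO_EXCEPTION);
        System.out.println("flag:    " + retryCountWithFlag(scenario));
        System.out.println("counter: " + retryCountWithCounter(scenario));
    }
}
```

In this sketch the flag-based version ends with a retry count of 1, because the reset at step 4 drops the IOException retry from step 2, while the counter-based version ends with 2, one for each retried IOException.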
@mridulm @otterc @tedyu @dongjoon-hyun backport into 3.3
Can we backport one by one?
Are we sure we want to backport one by one? Asking because the 2nd backport fixes a corner case that the 1st one does not cover. Ideally, I feel like they should be backported together. WDYT @dongjoon-hyun?
One-by-one is clearer. I'm sure I want to keep it that way, @akpatnam25.
Also please write