SPARK-41415/SPARK-42090 Backport to 3.2 #39632
Closed
What changes were proposed in this pull request?
Add the ability to retry SASL requests. A metric to track SASL retries will be added soon.
Why are the changes needed?
We are seeing increased SASL timeouts internally, and this change would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures decrease significantly.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests, and tested on a cluster to ensure the retries are being triggered correctly.
Closes #38959 from akpatnam25/SPARK-41415.
Authored-by: Aravind Patnam [email protected]
Signed-off-by: Mridul Muralidharan <mridulgmail.com>
What changes were proposed in this pull request?
This PR introduces a SASL retry count in RetryingBlockTransferor.
Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:
Even though the IOException at #2 is retried (which increments retryCount), retryCount would be cleared at step #4.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.
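The difference between the two approaches can be sketched with a small simulation. This is a simplified illustration, not the actual RetryingBlockTransferor code: the class `SaslRetrySketch`, the `Failure` enum, and both method names are hypothetical. The failure sequence (SASL timeout, IOException, SASL timeout, IOException) is an assumption chosen to be consistent with the step numbers mentioned above, and the max-retries check is left out.

```java
import java.util.List;

// Hypothetical, simplified model of the retry bookkeeping; not Spark's real code.
final class SaslRetrySketch {
  enum Failure { SASL_TIMEOUT, IO }

  // Old approach: a boolean flag clears retryCount entirely on the next
  // non-SASL failure, which also discards earlier IOException retries.
  static int retryCountWithBoolean(List<Failure> failures) {
    int retryCount = 0;
    boolean saslTimeoutSeen = false;
    for (Failure f : failures) {
      if (f != Failure.SASL_TIMEOUT && saslTimeoutSeen) {
        retryCount = 0;
        saslTimeoutSeen = false;
      }
      retryCount++; // every failure in this sketch is retried
      if (f == Failure.SASL_TIMEOUT) {
        saslTimeoutSeen = true;
      }
    }
    return retryCount;
  }

  // New approach: count SASL retries separately and subtract only those.
  static int retryCountWithCounter(List<Failure> failures) {
    int retryCount = 0;
    int saslRetryCount = 0;
    for (Failure f : failures) {
      if (f != Failure.SASL_TIMEOUT && saslRetryCount > 0) {
        retryCount -= saslRetryCount;
        saslRetryCount = 0;
      }
      retryCount++; // every failure in this sketch is retried
      if (f == Failure.SASL_TIMEOUT) {
        saslRetryCount++;
      }
    }
    return retryCount;
  }

  public static void main(String[] args) {
    // Assumed sequence: SASL timeout, IOException, SASL timeout, IOException.
    List<Failure> seq = List.of(
        Failure.SASL_TIMEOUT, Failure.IO, Failure.SASL_TIMEOUT, Failure.IO);
    System.out.println("boolean flag: " + retryCountWithBoolean(seq));
    System.out.println("counter:      " + retryCountWithCounter(seq));
  }
}
```

With this assumed sequence, the boolean version ends with a retryCount of 1 even though two IOExceptions were retried, while the counter version ends with 2, since only the SASL timeout retries are subtracted.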
Does this PR introduce any user-facing change?
No
How was this patch tested?
A new test is added, courtesy of Mridul.
Closes #39611 from tedyu/sasl-cnt.
Authored-by: Ted Yu [email protected]
Signed-off-by: Mridul Muralidharan <mridulgmail.com>