-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retry mechanism for neural search inference. #91
Add retry mechanism for neural search inference. #91
Conversation
Signed-off-by: Zan Niu <[email protected]>
Signed-off-by: Zan Niu <[email protected]>
private void inferenceSentencesWithRetry( | ||
@NonNull final List<String> targetResponseFilters, | ||
@NonNull final String modelId, | ||
@NonNull final List<String> inputText, | ||
final int retryTime, | ||
@NonNull final ActionListener<List<List<Float>>> listener |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove @nonnull from private functions. Given that it is private function we can be sure that the inputs shouldn't be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
final int nodeNotConnectedExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeNotConnectedException.class); | ||
final int nodeDisconnectExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeDisconnectedException.class); | ||
if ((nodeDisconnectExceptionIndex != -1 || nodeNotConnectedExceptionIndex != -1) && retryTime < MAX_RETRY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should encapsulate this in a different method which can give a boolean response if we should do a retry or not. This will allow us in future to add more retryable exceptions and will also allow us to evolve the conditions for doing the retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to a new method.
@@ -114,6 +119,28 @@ public void testInferenceSentences_whenExceptionFromMLClient_thenFailure() { | |||
Mockito.verifyNoMoreInteractions(resultListener); | |||
} | |||
|
|||
public void testInferenceSentences_whenNodeNotConnectedException_thenRetry() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add these test cases also:
- Test case which tests when all the retries are finished.
- Test cases which test when these retryable exceptions are not thrown we should get a failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two cases covered:
- check the method
testInferenceSentences_whenNodeNotConnectedException_thenRetry_3Times
and theMockito.verify(client, times(4))
stands for the client method has been invoked for 4 times in total. - Add a new method
testInferenceSentences_whenNotConnectionException_thenNoRetry
and the verify has time of 1, meaning method only run once.
Signed-off-by: Zan Niu <[email protected]>
private boolean shouldRetry(Exception e, int retryTime) { | ||
final int nodeNotConnectedExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeNotConnectedException.class); | ||
final int nodeDisconnectExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeDisconnectedException.class); | ||
return (nodeDisconnectExceptionIndex != -1 || nodeNotConnectedExceptionIndex != -1) && retryTime < MAX_RETRY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest 1 more simplification here. It will allow us just to add new Exceptions in future to the list doing the retry.
private boolean shouldRetry(Exception e, int retryTime) { | |
final int nodeNotConnectedExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeNotConnectedException.class); | |
final int nodeDisconnectExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeDisconnectedException.class); | |
return (nodeDisconnectExceptionIndex != -1 || nodeNotConnectedExceptionIndex != -1) && retryTime < MAX_RETRY; | |
private boolean shouldRetry(Exception e, int retryTime) { | |
return (RetryUtil.containsException(INFERENCE_API_RETRYABLE_EXCEPTION_LIST, e)) && retryTime < MAX_RETRY; | |
} | |
class RetryUtil { | |
public static boolean containsException(@NonNull final List<Class<?>> exceptionClassesList, final Exception ex) { | |
// call below function to check if any exception is present and then return true. | |
} | |
private static boolean containsException(final Exception ex, final ClassClass<?> exceptionClazz) { | |
return ExceptionUtils.indexOfThrowable(ex, exceptionClazz) != -1; | |
} | |
} | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, we have analyzed all the exceptions if they're retryable and it doesn't seem to have more exceptions could be retried, otherwise the retry may cause cluster level instability, I wouldn't suggest go too far and over design for this for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zane-neo This logic would retry if a model is not loaded, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, we have analyzed all the exceptions if they're retryable and it doesn't seem to have more exceptions could be retried, otherwise the retry may cause cluster level instability, I wouldn't suggest go too far and over design for this for now.
So you are saying there will be no case in future where we need add exception? Thats very strange. I would recommend lets build retry mechanism extensible.
Also the suggested code is just an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zane-neo This logic would retry if a model is not loaded, correct?
No, this logic only retries the network issue: NodeNotConnectedException and NodeDisconnectedException. When a model is not loaded the requests won't be dispatched to that node at the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, we have analyzed all the exceptions if they're retryable and it doesn't seem to have more exceptions could be retried, otherwise the retry may cause cluster level instability, I wouldn't suggest go too far and over design for this for now.
So you are saying there will be no case in future where we need add exception? Thats very strange. I would recommend lets build retry mechanism extensible.
Also the suggested code is just an example.
Are you talking about business level retries? These should be retried at underlying layers to avoid network calls. But I can do the refactor to the potential business retries at client side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about business level retries?
No.
I am not saying to refractor any code at client side. I am saying just refractor the code at the Neural Search plugin side as suggested in my older comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retries would be useful for ephemeral issues, correct? So which exceptions would we expect to be resolved on failure like this?
Signed-off-by: Zan Niu <[email protected]>
Signed-off-by: Zan Niu <[email protected]>
Two exceptions we're going to handle are: NodeNotConnectedException and NodeDisconnectedException |
Signed-off-by: Zan Niu <[email protected]>
Add basic retry mechanism for neural search inference Signed-off-by: Zan Niu <[email protected]> (cherry picked from commit b207176)
Add basic retry mechanism for neural search inference Signed-off-by: Zan Niu <[email protected]> (cherry picked from commit b207176)
Add basic retry mechanism for neural search inference (cherry picked from commit b207176) Signed-off-by: John Mazanec <[email protected]>
Add basic retry mechanism for neural search inference (cherry picked from commit b207176) Signed-off-by: John Mazanec <[email protected]>
Add basic retry mechanism for neural search inference Signed-off-by: Zan Niu <[email protected]>
Description
This feature is an enhancement to neural-search and the purpose is to improve the availability of the opensearch ml cluster. The core idea is add retry for inference operation, once a request is been sending or have sent to a ml instance which is not available now, rather than simply fail this request we use retry to improve the success rate of requests.
Issues Resolved
#92
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.