Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry mechanism for neural search inference. #91

Merged
merged 6 commits into from
Jan 10, 2023

Conversation

zane-neo
Copy link
Collaborator

@zane-neo zane-neo commented Dec 27, 2022

Description

This feature is an enhancement to neural-search and the purpose is to improve the availability of the opensearch ml cluster. The core idea is add retry for inference operation, once a request is been sending or have sent to a ml instance which is not available now, rather than simply fail this request we use retry to improve the success rate of requests.

Issues Resolved

#92

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@zane-neo zane-neo requested a review from a team December 27, 2022 02:57
@zane-neo zane-neo changed the title Neural search retry Add retry mechanism for neural search inference. Dec 27, 2022
Comment on lines 110 to 115
private void inferenceSentencesWithRetry(
@NonNull final List<String> targetResponseFilters,
@NonNull final String modelId,
@NonNull final List<String> inputText,
final int retryTime,
@NonNull final ActionListener<List<List<Float>>> listener
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove @nonnull from private functions. Given that it is private function we can be sure that the inputs shouldn't be null.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment on lines 123 to 125
final int nodeNotConnectedExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeNotConnectedException.class);
final int nodeDisconnectExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeDisconnectedException.class);
if ((nodeDisconnectExceptionIndex != -1 || nodeNotConnectedExceptionIndex != -1) && retryTime < MAX_RETRY) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should encapsulate this in a different method which can give a boolean response if we should do a retry or not. This will allow us in future to add more retryable exceptions and will also allow us to evolve the conditions for doing the retry.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to a new method.

@@ -114,6 +119,28 @@ public void testInferenceSentences_whenExceptionFromMLClient_thenFailure() {
Mockito.verifyNoMoreInteractions(resultListener);
}

public void testInferenceSentences_whenNodeNotConnectedException_thenRetry() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add these test cases also:

  1. Test case which tests when all the retries are finished.
  2. Test cases which test when these retryable exceptions are not thrown we should get a failure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two cases covered:

  1. check the method testInferenceSentences_whenNodeNotConnectedException_thenRetry_3Times and the Mockito.verify(client, times(4)) stands for the client method has been invoked for 4 times in total.
  2. Add a new method testInferenceSentences_whenNotConnectionException_thenNoRetry and the verify has time of 1, meaning method only run once.

Comment on lines 132 to 135
private boolean shouldRetry(Exception e, int retryTime) {
final int nodeNotConnectedExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeNotConnectedException.class);
final int nodeDisconnectExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeDisconnectedException.class);
return (nodeDisconnectExceptionIndex != -1 || nodeNotConnectedExceptionIndex != -1) && retryTime < MAX_RETRY;
Copy link
Collaborator

@navneet1v navneet1v Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest 1 more simplification here. It will allow us just to add new Exceptions in future to the list doing the retry.

Suggested change
private boolean shouldRetry(Exception e, int retryTime) {
final int nodeNotConnectedExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeNotConnectedException.class);
final int nodeDisconnectExceptionIndex = ExceptionUtils.indexOfThrowable(e, NodeDisconnectedException.class);
return (nodeDisconnectExceptionIndex != -1 || nodeNotConnectedExceptionIndex != -1) && retryTime < MAX_RETRY;
private boolean shouldRetry(Exception e, int retryTime) {
return (RetryUtil.containsException(INFERENCE_API_RETRYABLE_EXCEPTION_LIST, e)) && retryTime < MAX_RETRY;
}
class RetryUtil {
public static boolean containsException(@NonNull final List<Class<?>> exceptionClassesList, final Exception ex) {
// call below function to check if any exception is present and then return true.
}
private static boolean containsException(final Exception ex, final ClassClass<?> exceptionClazz) {
return ExceptionUtils.indexOfThrowable(ex, exceptionClazz) != -1;
}
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, we have analyzed all the exceptions if they're retryable and it doesn't seem to have more exceptions could be retried, otherwise the retry may cause cluster level instability, I wouldn't suggest go too far and over design for this for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zane-neo This logic would retry if a model is not loaded, correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, we have analyzed all the exceptions if they're retryable and it doesn't seem to have more exceptions could be retried, otherwise the retry may cause cluster level instability, I wouldn't suggest go too far and over design for this for now.

So you are saying there will be no case in future where we need add exception? Thats very strange. I would recommend lets build retry mechanism extensible.

Also the suggested code is just an example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zane-neo This logic would retry if a model is not loaded, correct?

No, this logic only retries the network issue: NodeNotConnectedException and NodeDisconnectedException. When a model is not loaded the requests won't be dispatched to that node at the beginning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, we have analyzed all the exceptions if they're retryable and it doesn't seem to have more exceptions could be retried, otherwise the retry may cause cluster level instability, I wouldn't suggest go too far and over design for this for now.

So you are saying there will be no case in future where we need add exception? Thats very strange. I would recommend lets build retry mechanism extensible.

Also the suggested code is just an example.

Are you talking about business level retries? These should be retried at underlying layers to avoid network calls. But I can do the refactor to the potential business retries at client side.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you talking about business level retries?

No.

I am not saying to refractor any code at client side. I am saying just refractor the code at the Neural Search plugin side as suggested in my older comment.

Copy link
Member

@jmazanec15 jmazanec15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retries would be useful for ephemeral issues, correct? So which exceptions would we expect to be resolved on failure like this?

@zane-neo
Copy link
Collaborator Author

zane-neo commented Jan 4, 2023

Retries would be useful for ephemeral issues, correct? So which exceptions would we expect to be resolved on failure like this?

Two exceptions we're going to handle are: NodeNotConnectedException and NodeDisconnectedException

@jmazanec15 jmazanec15 added Enhancements Increases software capabilities beyond original client specifications backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.5 labels Jan 10, 2023
@jmazanec15 jmazanec15 merged commit b207176 into opensearch-project:main Jan 10, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 10, 2023
Add basic retry mechanism for neural search inference

Signed-off-by: Zan Niu <[email protected]>
(cherry picked from commit b207176)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 10, 2023
Add basic retry mechanism for neural search inference

Signed-off-by: Zan Niu <[email protected]>
(cherry picked from commit b207176)
jmazanec15 pushed a commit that referenced this pull request Jan 10, 2023
Add basic retry mechanism for neural search inference

(cherry picked from commit b207176)
Signed-off-by: John Mazanec <[email protected]>
jmazanec15 pushed a commit that referenced this pull request Jan 10, 2023
Add basic retry mechanism for neural search inference

(cherry picked from commit b207176)
Signed-off-by: John Mazanec <[email protected]>
navneet1v pushed a commit to navneet1v/neural-search that referenced this pull request Feb 24, 2023
Add basic retry mechanism for neural search inference

Signed-off-by: Zan Niu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.5 Enhancements Increases software capabilities beyond original client specifications
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants