Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check and handle negative SearchRequestStats #16569

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dzane17
Copy link
Contributor

@dzane17 dzane17 commented Nov 5, 2024

Description

We identified a bug where negative search stats are causing an exception when processed using VLong in the following line:

https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/search/stats/SearchStats.java#L89

out.writeVLong(current);

This issue occurs because VLong does not handle negative values, resulting in an exception.

Solution

We have implemented a temporary fix that adds a check for negative values before calling writeVLong. If a negative value is encountered, a warning is logged, and 0 is written instead of the negative value:

            if (current < 0) {
                logger.warn("current is negative: {}", current);
                out.writeVLong(0);
            } else {
                out.writeVLong(current);
            }

This fix addresses the immediate problem but does not solve the root cause. We will continue to investigate why search stats are negative in the first place.

Related Issues

#16598

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dzane17 dzane17 changed the title Add check and for negative SearchRequestStats Add check and handle negative SearchRequestStats Nov 5, 2024
@dzane17 dzane17 added the backport 2.x Backport to 2.x branch label Nov 5, 2024
Copy link
Contributor

github-actions bot commented Nov 6, 2024

❌ Gradle check result for e3db948: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jed326
Copy link
Collaborator

jed326 commented Nov 7, 2024

Thanks @dzane17 , this looks pretty interesting. Are you tracking the investigation of this in another GitHub issue? I'm curious which search phase specifically you're seeing this one since having negative time values anywhere is quite concerning.

@jed326
Copy link
Collaborator

jed326 commented Nov 7, 2024

In particular this block of code comes to mind, where for scroll requests fetch phase can be performed before query phase is completed

public void executeFetchPhase(
InternalScrollSearchRequest request,
SearchShardTask task,
ActionListener<ScrollQueryFetchSearchResult> listener
) {
final LegacyReaderContext readerContext = (LegacyReaderContext) findReaderContext(request.contextId(), request);
final Releasable markAsUsed;
try {
markAsUsed = readerContext.markAsUsed(getScrollKeepAlive(request.scroll()));
} catch (Exception e) {
// We need to release the reader context of the scroll when we hit any exception (here the keep_alive can be too large)
freeReaderContext(readerContext.id());
throw e;
}
runAsync(getExecutor(readerContext.indexShard()), () -> {
final ShardSearchRequest shardSearchRequest = readerContext.getShardSearchRequest(null);
try (
SearchContext searchContext = createContext(readerContext, shardSearchRequest, task, false);
SearchOperationListenerExecutor executor = new SearchOperationListenerExecutor(searchContext)
) {
searchContext.assignRescoreDocIds(readerContext.getRescoreDocIds(null));
searchContext.searcher().setAggregatedDfs(readerContext.getAggregatedDfs(null));
processScroll(request, readerContext, searchContext);
queryPhase.execute(searchContext);
final long afterQueryTime = executor.success();
QueryFetchSearchResult fetchSearchResult = executeFetchPhase(readerContext, searchContext, afterQueryTime);
return new ScrollQueryFetchSearchResult(fetchSearchResult, searchContext.shardTarget());
} catch (Exception e) {
assert TransportActions.isShardNotAvailableException(e) == false : new AssertionError(e);
logger.trace("Fetch phase failed", e);
// we handle the failure in the failure listener below
throw e;
} finally {
taskResourceTrackingService.writeTaskResourceUsage(task, clusterService.localNode().getId());
}
}, wrapFailureListener(listener, readerContext, markAsUsed));
}

and since the current is decremented on phase end there might be an issue where we are decrementing before incrementing.

Copy link
Contributor

github-actions bot commented Nov 7, 2024

✅ Gradle check result for b72b74d: SUCCESS

Copy link

codecov bot commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.10%. Comparing base (5909e1a) to head (b72b74d).

Files with missing lines Patch % Lines
...org/opensearch/index/search/stats/SearchStats.java 70.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16569      +/-   ##
============================================
+ Coverage     72.08%   72.10%   +0.02%     
+ Complexity    65099    65079      -20     
============================================
  Files          5315     5315              
  Lines        303572   303581       +9     
  Branches      43925    43927       +2     
============================================
+ Hits         218817   218892      +75     
+ Misses        66861    66730     -131     
- Partials      17894    17959      +65     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dzane17
Copy link
Contributor Author

dzane17 commented Nov 8, 2024

@jed326 Here is the github issue: #16598

Actually, current is a request counter. We have not yet seen the time statistic go negative. In either case though it is concerning.

Regarding this code block in SearchService, we will need to check exactly where in the search path it is invoked. The coordinator phase listener hooks (onPhaseStart, onPhaseEnd, onPhaseFailure), where search stats are updated, are in AbstractSearchAsyncAction.

Copy link
Collaborator

@jainankitk jainankitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should have generic way of ignoring negative values (writeUnsignedLong), especially for stat type of use cases instead of handling it every place separately. Missing any occurrence can result in node drops which is never desirable IMO. @jed326 - Thoughts?

@jed326
Copy link
Collaborator

jed326 commented Nov 8, 2024

I am wondering if we should have generic way of ignoring negative values (writeUnsignedLong), especially for stat type of use cases instead of handling it every place separately. Missing any occurrence can result in node drops which is never desirable IMO. @jed326 - Thoughts?

@jainankitk I'm actually a little apprehensive about ignoring the negative values at all. This PR is adding negative checking to both count and timeInMillis, both of which should not be possible to be negative. These numbers being negative is indicative of some other underlying problem, or at the very least indicates we have an incorrect assumption about the code flow somewhere. By ignoring the negative values we wouldn't be able to catch any such issues in the future as there wouldn't be any symptoms for any such issues.

Copy link
Contributor

github-actions bot commented Nov 8, 2024

❌ Gradle check result for e48ffba: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@dbwiddis
Copy link
Member

These numbers being negative is indicative of some other underlying problem, or at the very least indicates we have an incorrect assumption about the code flow somewhere. By ignoring the negative values we wouldn't be able to catch any such issues in the future as there wouldn't be any symptoms for any such issues.

+1

I suspect a situation where both failure and success code are called (with rare failures) but wasn't able to find it in about an hour of searching.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch stalled Issues that have stalled
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants