-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EQL: add support for partial shard results #116388
EQL: add support for partial shard results #116388
Conversation
Hi @luigidellaquila, I've created a changelog YAML for you. |
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @luigidellaquila. It looks promising from my point of view. On the technical side of things - a review from someone in the ES team who is more familiar with partial results would help greatly.
I would also look into some tests that use CCS. I recently learned that skip_unavailable
setting for a CCS cluster is a flavor of "unavailable shards" situation.
Also, some tests with Security enabled would help as well.
If true (default), the sequences are calculated on the available shards If false, in case of shard failures, the query does not return any results, but does not throw errors either
…h_results' into eql/allow_partial_search_results
@elasticmachine update branch |
merge conflict between base and head |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thank you so much for working on this addition. I think this now sets us up well on our end to decide how we want to expose this functionality on our end.
@yctercero @approksiu thanks for the off-line conversations and for confirming that the PR covers your needs. @astefan @javanna please let me know if you have further remarks of if we can proceed. |
…h_results' into eql/allow_partial_search_results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a first round of comments. It looks good.
In PartialSearchResultsIT or in toml files or in yml files I think we can do better in terms of covering more scenarios (involving optional fields for example). TumblingWindow specifically and SampleIterator are complex enough to trigger an ActionListener to return in different use cases.
if (allowPartialResultsInBody) { | ||
builder.field("allow_partial_search_results", String.valueOf(allowPartialSearchResults)); | ||
if (allowPartialSequenceResults != null) { | ||
builder.field("allow_partial_sequence_results", String.valueOf(allowPartialSequenceResults)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if an user provides only allow_partial_sequece_results=true
or uses allow_partial_search_results=false
and allow_partial_sequece_results=true
at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow_partial_sequece_results
alone (or with allow_partial_search_results=false
) has no effect.
There are randomized tests for this in PartialSearchResultsIT , but probably I can extend them a bit, randomly setting allow_partial_search_results=false
@@ -88,6 +88,53 @@ request that targets only `bar*` still returns an error. | |||
+ | |||
Defaults to `true`. | |||
|
|||
`allow_partial_search_results`:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ES search API supports partial results for async search as well, if I read the documentation correctly. Is there something stopping us for doing the same with EQL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think when _search docs mention partial results at the beginning of the page, they refer to something slightly different, ie. the first part of a response that is still being calculated.
Search results can also be partial because of missing shards, that is the same we have in EQL, regardless of the request being sync or async.
This said, we definitely need some tests for async EQL queries with allow_partial_search_results=true
. I'm adding them.
builder.endObject(); | ||
|
||
Request request = new Request("POST", "/" + index + "/_eql/search"); | ||
Boolean ccsMinimizeRoundtrips = ccsMinimizeRoundtrips(); | ||
if (ccsMinimizeRoundtrips != null) { | ||
request.addParameter("ccs_minimize_roundtrips", ccsMinimizeRoundtrips.toString()); | ||
} | ||
if (allowPartialSearchResults != null) { | ||
if (allowPartialResultsInBody == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allowPartialResultsInBody
value seems to suggest that "partial results" settings can be used either as request parameter or request body parameter. I think we should test the case when the user provides:
- request parameter values AND request body parameters (both of them or just one)
- either one of them
In other words, randomize any combination of request parameters presence and values in the two places where we allow them.
@@ -27,9 +27,23 @@ public EqlMissingEventsSpecTestCase( | |||
List<long[]> eventIds, | |||
String[] joinKeys, | |||
Integer size, | |||
Integer maxSamplesPerKey | |||
Integer maxSamplesPerKey, | |||
Boolean allowPartialSearchResults, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if all the existent tests in EQL that have no idea about partial results (all the tests based on BaseEqlSpecTestCase before this PR) would completely randomize these two new settings. All of them should behave completely the same because we have no "broken" shard setup for them which means that irrespective of the values provided, the tests should pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some randomization already, but I'm making it more complete.
allow_partial_search_results = true | ||
allow_partial_sequence_results = true | ||
expected_event_ids = [1, 2] | ||
expect_shard_failures = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add here more "failing shards" tests in other scenarios? (optional fields, samples, samples with optional fields, sequences with optional fields, sequences with missing events...)
|
||
public BasicQueryClient(EqlSession eqlSession) { | ||
this.cfg = eqlSession.configuration(); | ||
this.client = eqlSession.client(); | ||
this.indices = cfg.indices(); | ||
this.fetchFields = cfg.fetchFields(); | ||
this.allowPartialSearchResults = eqlSession.configuration().allowPartialSearchResults(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.allowPartialSearchResults = eqlSession.configuration().allowPartialSearchResults(); | |
this.allowPartialSearchResults = cfg.allowPartialSearchResults(); |
if (completed.isEmpty()) { | ||
listener.onResponse(new EmptyPayload(Type.SEQUENCE, timeTook())); | ||
if (completed.isEmpty() || (allowPartialSequenceResults == false && shardFailures.isEmpty() == false)) { | ||
listener.onResponse(new EmptyPayload(Type.SEQUENCE, timeTook(), shardFailures.values().toArray(new ShardSearchFailure[0]))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you already know the size of the shardfailures, why not use it to its most optimal way:
listener.onResponse(new EmptyPayload(Type.SEQUENCE, timeTook(), shardFailures.values().toArray(new ShardSearchFailure[0]))); | |
listener.onResponse(new EmptyPayload(Type.SEQUENCE, timeTook(), shardFailures.values().toArray(new ShardSearchFailure[shardFailures.size()]))); |
@@ -918,4 +948,13 @@ public KeyAndOrdinal next() { | |||
}; | |||
}; | |||
} | |||
|
|||
private void addShardFailures(SearchResponse r) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen this exact method in SampleIterator as well. Maybe extract it somewhere and re-use it?
false, | ||
Setting.Property.NodeScope, | ||
Setting.Property.Dynamic | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @javanna here. If there is no need for this, let's not make it more complicated than it should be. If there will be a desire in the future for this, we can add it.
The basic principle of "it's easy to put stuff in, harder to take it out" (think about BWC) applies here.
Same principle for having this as both request parameter and request body parameter: it complicates the logic unnecessarily and confuses users with a lot of options to think and reason about. I realize that ES query DSL has this in the same way (both request parameter and request body parameter) and I am not against this that much, but unless there is an actual need and request I wouldn't confuse users.
null, | ||
matcher, | ||
Collections.emptyList(), | ||
randomBoolean(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome 👍
@@ -41,6 +41,16 @@ | |||
"type": "time", | |||
"description": "Update the time interval in which the results (partial or final) for this search will be available", | |||
"default": "5d" | |||
}, | |||
"allow_partial_search_results": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefan to allow yaml tests to accept these as query parameters I had to change the API spec, and I noticed that it contains only a small subset of the allowed (and documented) parameters.
Should we update it, maybe in a followup PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this LGTM. I've left some comments around testing, mostly. And also one related to TumblingWindow where the two settings related to sequences are not clear to me where they are properly set.
@@ -2432,7 +2432,7 @@ public static void afterClass() throws Exception { | |||
/** | |||
* After the cluster is stopped, there are a few netty threads that can linger, so we make sure we don't leak any tasks on them. | |||
*/ | |||
static void awaitGlobalNettyThreadsFinish() throws Exception { | |||
public static void awaitGlobalNettyThreadsFinish() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't find an obvious reason for this change. Can you shed some light, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a leftover, I don't need it anymore. Reverting
builder.field("allow_partial_sequence_results", String.valueOf(allowPartialSequenceResults)); | ||
} | ||
} else { | ||
// these will be overwritten by the query params, that have higher priority than the body params |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is confusing. Maybe it would be more clear is you call a request param as "path param" and a request body param as "query param", like the documentation does.
builder.field("allow_partial_sequence_results", randomBoolean()); | ||
} | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, add a comment here explaining the intent, ie spec tests that have no special setup for "partial results" support which should always pass should be ignorant of the presence of these params.
request.addParameter("allow_partial_sequence_results", String.valueOf(allowPartialSequenceResults)); | ||
} | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, regarding a comment explaining the intent of this else
branch.
} else { | ||
assertNull(shardFailures); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If expectShardFailures == null
I think you should add assertNull(shardFailures)
.
@@ -147,6 +155,7 @@ private void advance(ActionListener<Payload> listener) { | |||
|
|||
private void queryForCompositeAggPage(ActionListener<Payload> listener, final SampleQueryRequest request) { | |||
client.query(request, listener.delegateFailureAndWrap((delegate, r) -> { | |||
addShardFailures(shardFailures, r); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an ideal (and paranoid) world we should have tests for each of these places where we deal with shard failures because these can happen at different points in the life of a sample
or sequence
query. But, because those scenarios are not easy to test, I am perfectly fine with the tests we have so far and I trust them they are good enough for this feature.
OpenPointInTimeRequest request = new OpenPointInTimeRequest(indices).indicesOptions(IndexResolver.FIELD_CAPS_INDICES_OPTIONS) | ||
.keepAlive(keepAlive); | ||
.keepAlive(keepAlive) | ||
.allowPartialSearchResults(allowPartialSearchResults); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting, I didn't expect for the open_pit to fail and report back as a partial shard result.
ShardSearchFailure[] failures = response.getShardFailures(); | ||
if (CollectionUtils.isEmpty(failures) == false) { | ||
if (CollectionUtils.isEmpty(failures) == false && allowPartialSearchResults == false) { | ||
failure = new EqlIllegalArgumentException(failures[0].reason(), failures[0].getCause()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can rewrite this to something like this:
if (allowPartialSearchResults == false) {
ShardSearchFailure[] failures = response.getShardFailures();
if (CollectionUtils.isEmpty(failures) == false) {
failure = new EqlIllegalArgumentException(failures[0].reason(), failures[0].getCause());
}
}
@@ -127,7 +133,10 @@ public TumblingWindow( | |||
List<SequenceCriterion> criteria, | |||
SequenceCriterion until, | |||
SequenceMatcher matcher, | |||
List<List<Attribute>> listOfKeys | |||
List<List<Attribute>> listOfKeys, | |||
boolean allowPartialSearchResults, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that allowPartialSequenceResults
depends on allowPartialSearchResults
being true
, I am surprised to see no checks here and allowPartialSequenceResults
just being used. Can you point me to where we make sure that allowPartialSequenceResults
is valid if allowPartialSearchResults
has the right value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no specific check, because in case of partial shard failures the query would fail much earlier. The final result will be an error 400.
@@ -135,6 +141,13 @@ public EqlSearchRequest(StreamInput in) throws IOException { | |||
if (in.getTransportVersion().onOrAfter(TransportVersions.V_8_7_0)) { | |||
maxSamplesPerKey = in.readInt(); | |||
} | |||
if (in.getTransportVersion().onOrAfter(TransportVersions.EQL_ALLOW_PARTIAL_SEARCH_RESULTS)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any tests for serialization to make sure in a mixed cluster tests everything is ok from this point of view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BWC serialization is unit tested in EqlSearchRequestTests
.
I'm adding some randomization to mixed-node tests
Allow queries to succeed if some shards are failing
💚 Backport successful
|
Allow queries to succeed if some shards are failing
Fixes: #114918
Allow EQL to execute queries also when some shards are unavailable.
This adds a new parameter (both available as a query parameter and in the request body) ,
allow_partial_search_results
(defaultfalse
, that can be configured as a cluster settingxpack.eql.default_allow_partial_results
), that enables this behavior.If shard failures happen during the query, the execution only takes into consideration the available shards, and returns an additional "shard_failures" value in the response
Sequence queries are particularly problematic, because their result does not depend on a single record, but by the correlation of multiple records, so the result of a query can be slightly incorrect in case of partial shards (consider the case where a missing event clause matches records that are in missing shards, it will return false positives).
For this reason, we also provide an additional parameter,
allow_partial_sequence_results
to fine-tune the behavior of sequence queries:allow_partial_sequence_results=false
(default): in case of missing shards, sequence queries always return empty results, to avoid mistakesallow_partial_sequence_results=true
: the sequences are calculated only on the available data, ie. on records present in available shards.If
allow_partial_search_results=false
(default), thenallow_partial_sequence_results
has no effect.TODO:
shard_failures
)search.default_allow_partial_results
)