Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL: add support for partial shard results #116388

Merged

Conversation

luigidellaquila
Copy link
Contributor

@luigidellaquila luigidellaquila commented Nov 7, 2024

Fixes: #114918

Allow EQL to execute queries also when some shards are unavailable.
This adds a new parameter (both available as a query parameter and in the request body) , allow_partial_search_results (default false, that can be configured as a cluster setting xpack.eql.default_allow_partial_results), that enables this behavior.

If shard failures happen during the query, the execution only takes into consideration the available shards, and returns an additional "shard_failures" value in the response

{
   "query": "sequence ...",
   "allow_partial_search_results": true
}
{
    "is_partial": false,
    "is_running": false,
    "took": ...,
    "timed_out": false,
    "hits": {
        ...
     },
    "shard_failures": [
        {
            "shard": ...,
            "index": ...,
            "node": ...,
            "reason": {
                "type": "no_shard_available_action_exception",
                "reason": null
            }
        }
    ]
}

Sequence queries are particularly problematic, because their result does not depend on a single record, but by the correlation of multiple records, so the result of a query can be slightly incorrect in case of partial shards (consider the case where a missing event clause matches records that are in missing shards, it will return false positives).
For this reason, we also provide an additional parameter, allow_partial_sequence_results to fine-tune the behavior of sequence queries:

  • allow_partial_sequence_results=false (default): in case of missing shards, sequence queries always return empty results, to avoid mistakes
  • allow_partial_sequence_results=true: the sequences are calculated only on the available data, ie. on records present in available shards.

If allow_partial_search_results=false (default), then allow_partial_sequence_results has no effect.

TODO:

  • Implement request parameter and query logic
  • Implement response (shard_failures)
  • Allow override for the default value (eg. with something similar to search.default_allow_partial_results)
  • Docs

@luigidellaquila luigidellaquila added >enhancement :Analytics/EQL EQL querying auto-backport Automatically create backport pull requests when merged labels Nov 11, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@luigidellaquila luigidellaquila marked this pull request as ready for review November 11, 2024 11:41
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 11, 2024
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @luigidellaquila. It looks promising from my point of view. On the technical side of things - a review from someone in the ES team who is more familiar with partial results would help greatly.

I would also look into some tests that use CCS. I recently learned that skip_unavailable setting for a CCS cluster is a flavor of "unavailable shards" situation.

Also, some tests with Security enabled would help as well.

If true (default), the sequences are calculated on the available shards
If false, in case of shard failures, the query does not return any results, but does not throw errors either
…h_results' into eql/allow_partial_search_results
@luigidellaquila
Copy link
Contributor Author

@elasticmachine update branch

@elasticmachine
Copy link
Collaborator

merge conflict between base and head

Copy link

@yctercero yctercero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thank you so much for working on this addition. I think this now sets us up well on our end to decide how we want to expose this functionality on our end.

@luigidellaquila
Copy link
Contributor Author

@yctercero @approksiu thanks for the off-line conversations and for confirming that the PR covers your needs.

@astefan @javanna please let me know if you have further remarks of if we can proceed.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a first round of comments. It looks good.

In PartialSearchResultsIT or in toml files or in yml files I think we can do better in terms of covering more scenarios (involving optional fields for example). TumblingWindow specifically and SampleIterator are complex enough to trigger an ActionListener to return in different use cases.

if (allowPartialResultsInBody) {
builder.field("allow_partial_search_results", String.valueOf(allowPartialSearchResults));
if (allowPartialSequenceResults != null) {
builder.field("allow_partial_sequence_results", String.valueOf(allowPartialSequenceResults));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if an user provides only allow_partial_sequece_results=true or uses allow_partial_search_results=false and allow_partial_sequece_results=true at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow_partial_sequece_results alone (or with allow_partial_search_results=false) has no effect.
There are randomized tests for this in PartialSearchResultsIT , but probably I can extend them a bit, randomly setting allow_partial_search_results=false

@@ -88,6 +88,53 @@ request that targets only `bar*` still returns an error.
+
Defaults to `true`.

`allow_partial_search_results`::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ES search API supports partial results for async search as well, if I read the documentation correctly. Is there something stopping us for doing the same with EQL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when _search docs mention partial results at the beginning of the page, they refer to something slightly different, ie. the first part of a response that is still being calculated.
Search results can also be partial because of missing shards, that is the same we have in EQL, regardless of the request being sync or async.

This said, we definitely need some tests for async EQL queries with allow_partial_search_results=true. I'm adding them.

builder.endObject();

Request request = new Request("POST", "/" + index + "/_eql/search");
Boolean ccsMinimizeRoundtrips = ccsMinimizeRoundtrips();
if (ccsMinimizeRoundtrips != null) {
request.addParameter("ccs_minimize_roundtrips", ccsMinimizeRoundtrips.toString());
}
if (allowPartialSearchResults != null) {
if (allowPartialResultsInBody == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allowPartialResultsInBody value seems to suggest that "partial results" settings can be used either as request parameter or request body parameter. I think we should test the case when the user provides:

  • request parameter values AND request body parameters (both of them or just one)
  • either one of them

In other words, randomize any combination of request parameters presence and values in the two places where we allow them.

@@ -27,9 +27,23 @@ public EqlMissingEventsSpecTestCase(
List<long[]> eventIds,
String[] joinKeys,
Integer size,
Integer maxSamplesPerKey
Integer maxSamplesPerKey,
Boolean allowPartialSearchResults,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if all the existent tests in EQL that have no idea about partial results (all the tests based on BaseEqlSpecTestCase before this PR) would completely randomize these two new settings. All of them should behave completely the same because we have no "broken" shard setup for them which means that irrespective of the values provided, the tests should pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some randomization already, but I'm making it more complete.

allow_partial_search_results = true
allow_partial_sequence_results = true
expected_event_ids = [1, 2]
expect_shard_failures = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add here more "failing shards" tests in other scenarios? (optional fields, samples, samples with optional fields, sequences with optional fields, sequences with missing events...)


public BasicQueryClient(EqlSession eqlSession) {
this.cfg = eqlSession.configuration();
this.client = eqlSession.client();
this.indices = cfg.indices();
this.fetchFields = cfg.fetchFields();
this.allowPartialSearchResults = eqlSession.configuration().allowPartialSearchResults();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.allowPartialSearchResults = eqlSession.configuration().allowPartialSearchResults();
this.allowPartialSearchResults = cfg.allowPartialSearchResults();

if (completed.isEmpty()) {
listener.onResponse(new EmptyPayload(Type.SEQUENCE, timeTook()));
if (completed.isEmpty() || (allowPartialSequenceResults == false && shardFailures.isEmpty() == false)) {
listener.onResponse(new EmptyPayload(Type.SEQUENCE, timeTook(), shardFailures.values().toArray(new ShardSearchFailure[0])));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you already know the size of the shardfailures, why not use it to its most optimal way:

Suggested change
listener.onResponse(new EmptyPayload(Type.SEQUENCE, timeTook(), shardFailures.values().toArray(new ShardSearchFailure[0])));
listener.onResponse(new EmptyPayload(Type.SEQUENCE, timeTook(), shardFailures.values().toArray(new ShardSearchFailure[shardFailures.size()])));

@@ -918,4 +948,13 @@ public KeyAndOrdinal next() {
};
};
}

private void addShardFailures(SearchResponse r) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen this exact method in SampleIterator as well. Maybe extract it somewhere and re-use it?

false,
Setting.Property.NodeScope,
Setting.Property.Dynamic
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @javanna here. If there is no need for this, let's not make it more complicated than it should be. If there will be a desire in the future for this, we can add it.

The basic principle of "it's easy to put stuff in, harder to take it out" (think about BWC) applies here.

Same principle for having this as both request parameter and request body parameter: it complicates the logic unnecessarily and confuses users with a lot of options to think and reason about. I realize that ES query DSL has this in the same way (both request parameter and request body parameter) and I am not against this that much, but unless there is an actual need and request I wouldn't confuse users.

null,
matcher,
Collections.emptyList(),
randomBoolean(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome 👍

@@ -41,6 +41,16 @@
"type": "time",
"description": "Update the time interval in which the results (partial or final) for this search will be available",
"default": "5d"
},
"allow_partial_search_results": {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefan to allow yaml tests to accept these as query parameters I had to change the API spec, and I noticed that it contains only a small subset of the allowed (and documented) parameters.
Should we update it, maybe in a followup PR?

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this LGTM. I've left some comments around testing, mostly. And also one related to TumblingWindow where the two settings related to sequences are not clear to me where they are properly set.

@@ -2432,7 +2432,7 @@ public static void afterClass() throws Exception {
/**
* After the cluster is stopped, there are a few netty threads that can linger, so we make sure we don't leak any tasks on them.
*/
static void awaitGlobalNettyThreadsFinish() throws Exception {
public static void awaitGlobalNettyThreadsFinish() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find an obvious reason for this change. Can you shed some light, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a leftover, I don't need it anymore. Reverting

builder.field("allow_partial_sequence_results", String.valueOf(allowPartialSequenceResults));
}
} else {
// these will be overwritten by the query params, that have higher priority than the body params
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is confusing. Maybe it would be more clear is you call a request param as "path param" and a request body param as "query param", like the documentation does.

builder.field("allow_partial_sequence_results", randomBoolean());
}
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add a comment here explaining the intent, ie spec tests that have no special setup for "partial results" support which should always pass should be ignorant of the presence of these params.

request.addParameter("allow_partial_sequence_results", String.valueOf(allowPartialSequenceResults));
}
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, regarding a comment explaining the intent of this else branch.

} else {
assertNull(shardFailures);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If expectShardFailures == null I think you should add assertNull(shardFailures).

@@ -147,6 +155,7 @@ private void advance(ActionListener<Payload> listener) {

private void queryForCompositeAggPage(ActionListener<Payload> listener, final SampleQueryRequest request) {
client.query(request, listener.delegateFailureAndWrap((delegate, r) -> {
addShardFailures(shardFailures, r);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an ideal (and paranoid) world we should have tests for each of these places where we deal with shard failures because these can happen at different points in the life of a sample or sequence query. But, because those scenarios are not easy to test, I am perfectly fine with the tests we have so far and I trust them they are good enough for this feature.

OpenPointInTimeRequest request = new OpenPointInTimeRequest(indices).indicesOptions(IndexResolver.FIELD_CAPS_INDICES_OPTIONS)
.keepAlive(keepAlive);
.keepAlive(keepAlive)
.allowPartialSearchResults(allowPartialSearchResults);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting, I didn't expect for the open_pit to fail and report back as a partial shard result.

Comment on lines 88 to 91
ShardSearchFailure[] failures = response.getShardFailures();
if (CollectionUtils.isEmpty(failures) == false) {
if (CollectionUtils.isEmpty(failures) == false && allowPartialSearchResults == false) {
failure = new EqlIllegalArgumentException(failures[0].reason(), failures[0].getCause());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can rewrite this to something like this:

                    if (allowPartialSearchResults == false) {
                        ShardSearchFailure[] failures = response.getShardFailures();
                        if (CollectionUtils.isEmpty(failures) == false) {
                            failure = new EqlIllegalArgumentException(failures[0].reason(), failures[0].getCause());
                        }
                    }

@@ -127,7 +133,10 @@ public TumblingWindow(
List<SequenceCriterion> criteria,
SequenceCriterion until,
SequenceMatcher matcher,
List<List<Attribute>> listOfKeys
List<List<Attribute>> listOfKeys,
boolean allowPartialSearchResults,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that allowPartialSequenceResults depends on allowPartialSearchResults being true, I am surprised to see no checks here and allowPartialSequenceResults just being used. Can you point me to where we make sure that allowPartialSequenceResults is valid if allowPartialSearchResults has the right value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no specific check, because in case of partial shard failures the query would fail much earlier. The final result will be an error 400.

@@ -135,6 +141,13 @@ public EqlSearchRequest(StreamInput in) throws IOException {
if (in.getTransportVersion().onOrAfter(TransportVersions.V_8_7_0)) {
maxSamplesPerKey = in.readInt();
}
if (in.getTransportVersion().onOrAfter(TransportVersions.EQL_ALLOW_PARTIAL_SEARCH_RESULTS)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any tests for serialization to make sure in a mixed cluster tests everything is ok from this point of view?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BWC serialization is unit tested in EqlSearchRequestTests.
I'm adding some randomization to mixed-node tests

@luigidellaquila luigidellaquila merged commit 1d2840e into elastic:main Dec 17, 2024
16 checks passed
luigidellaquila added a commit to luigidellaquila/elasticsearch that referenced this pull request Dec 17, 2024
Allow queries to succeed if some shards are failing
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/EQL EQL querying auto-backport Automatically create backport pull requests when merged >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EQL: add support for partial results
7 participants