EQL: consolidate response format #57036
Pinging @elastic/es-ql (:Query Languages/EQL) |
Tagging @colings86 and @tsg in particular |
After a first round of discussions, the following format is proposed. At a high level, the results are "merged" into a list, so all responses essentially return a list of results, which for sequences and joins will be lists of lists. To avoid ambiguity and indicate the type of query, an enum-like field is added so folks who are interested in the type of query can find out without having to filter the results. The proposed generic response is:
{
"took": 5,
"timed_out": false,
"hits": {
"total": {
"value": 100,
"relation": "eq"
},
"type": "event", // one of event, join, sequence, count
"results":
[{
"result" : [{ result1 }]
},
{
"result" : [{ result2 }]
}
]
}
}
To wit:
event query
{
...
"type": "event",
"results":
[{
"result": [{
"_index": "my_index",
"_id": "0",
"_sequence_id": 0,
"_source": {
"date": "2009-11-15T14:12:12",
"event": {
"type": "process"
}
}
}
]
},
{
"result": [{
...
}
]
}
]
}
join
{
...
"type": "join",
"results":
[{
"keys": ["a", "b"],
"result": [{
"_index": "my_index",
"_id": "0",
"_sequence_id": 0,
"_source": {
"date": "2009-11-15T14:12:12",
"event": {
"type": "process"
}
}
}, {
"_index": "my_index",
"_id": "1",
"_sequence_id": 1,
"_source": {
"date": "2009-11-15T14:13:13",
"event": {
"type": "process"
}
}
},
...
]
},
{
"result": [{
...
}
]
}
]
}
sequence
Essentially identical to join:
{
...
"type": "sequence",
"results":
[{
"keys": ["a", "b"],
"result": [{
"_index": "my_index",
"_id": "0",
"_sequence_id": 0,
"_source": {
"date": "2009-11-15T14:12:12",
"event": {
"type": "process"
}
}
}, {
"_index": "my_index",
"_id": "1",
"_sequence_id": 1,
"_source": {
"date": "2009-11-15T14:13:13",
"event": {
"type": "process"
}
}
},
...
]
},
{
"result": [{
...
}
]
}
]
}
count
Lastly:
{
...
"type": "count",
"results":
[{
"result": [{
"_count": 40,
"_keys": [...],
"_percent": 0.4223148165093,
"_values": [...]
}
]
}
]
}
One question is what does `total` mean: the number of total documents hit or the number of results? For a join, do we return the number of joins or the number of documents (or matches) across all results? |
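To make the two candidate meanings of the total concrete, here is a minimal Python sketch that computes both from a response shaped like the proposal above. The function names and the sample `hits` payload are assumptions made for illustration; they are not part of the proposal.

```python
def count_results(hits: dict) -> int:
    """Interpretation 1: the number of top-level entries (events, joins, sequences)."""
    return len(hits["results"])


def count_documents(hits: dict) -> int:
    """Interpretation 2: the number of documents across all entries."""
    return sum(len(entry["result"]) for entry in hits["results"])


# Hypothetical join response: two matches holding two and three documents.
hits = {
    "type": "join",
    "results": [
        {"keys": ["a", "b"], "result": [{"_id": "0"}, {"_id": "1"}]},
        {"keys": ["c", "d"], "result": [{"_id": "2"}, {"_id": "3"}, {"_id": "4"}]},
    ],
}

print(count_results(hits), count_documents(hits))
```

For an event query the two numbers coincide, since each entry wraps a single document; they only diverge for joins and sequences.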
Tagging @Mpdreamz, @cjcenizal, @stacey-gammon and @tsg to check for any red flags on the consumer side. |
Take my comments with a grain of salt, since I have never worked with EQL before... 😅 I like the solution we arrived at. If I were consuming these responses in JS, I'd be able to use the What's the intention behind wrapping each individual result object in an array for the Have you considered adding new information for surfacing the number of results, e.g. a field called |
@cjcenizal I think the worry is that embedding several specialized schemas within the response will be more cumbersome to consumers of the API. If we just have a list of results returned, you just have to know how to render "a result". It would also be more future proof, if we added new capabilities to EQL. Results with a single event can be treated as a special case but they don't have to. Sequences, joins, counts, and events could all be rendered with the same code. |
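As a rough illustration of the "same code" argument, the Python sketch below walks any response shaped like the proposed format with a single loop. The `render` function and its output format are hypothetical; the field names (`results`, `result`, `keys`, `_index`, `_id`, `_count`) are taken from the examples earlier in the thread.

```python
def render(hits: dict) -> list:
    """Render every entry of a consolidated response as plain text lines."""
    lines = []
    for entry in hits["results"]:
        keys = entry.get("keys")  # absent for plain event queries
        prefix = f"[{', '.join(map(str, keys))}] " if keys else ""
        for item in entry["result"]:
            if "_source" in item:
                # Document-based result: event, join, or sequence member.
                label = f'{item["_index"]}/{item["_id"]}'
            else:
                # Aggregation-style result such as count.
                label = f'count={item["_count"]}'
            lines.append(prefix + label)
    return lines
```

Note that the loop still branches on whether an item carries `_source`, so a fully uniform renderer depends on how closely the count results resemble documents.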
Thanks for the explanation @rw-access! |
This looks good.
I'm also wondering what something like this would look like: That will add a {
...
"type": str, // "count" | "sequence" | "join"
"results": [
{
"_count": null | int,
"_join_keys": null | [object, ...],
"_count_keys": null | [object, ...],
"_percent": null | float,
"_values": null | [object, ... ],
"_events": null | [
{"_source": ... },
{"_source": ... },
{"_source": ... },
{"_source": ... }
]
}, ...
]
} |
The issue was raised during our meeting yesterday. |
The convention has been to use Outside an actual user json document, using |
total hits in ES currently means all the documents that were matched and by default it is not exact. Do I understand correctly that you are proposing adding two counters:
What about aggregations like count? What should the totals be there?
I'm tempted to say yes because it seems to be already included in the other totals that we add.
I'm not familiar with |
Prefacing this by saying that my understanding of EQL is extremely limited. I am worried about shoehorning everything into As it is now conceptually there is:
Where
In types this would mean abstracting to interfaces and figuring out at run time what concrete implementation to deserialize to, but the user would still only be exposed to the interfaces and would need to do runtime inspection of these. To that end I'd very strongly prefer the initial The I can write a much larger reply on the implications to the exposed types to the user but tried to keep the initial reply to a bare minimum. Let me know if I need to expand my concerns. |
Thanks for the feedback @Mpdreamz.
Can you expand on that and how it is different from having dedicated json definitions for each result? Shouldn't the
A client can provide tighter constraints than the JSON one, which is fairly loose. Taking a step back: the objective is to formalize these into one response, which essentially boils down to everything returning a list of lists. An event query return That is to say, would you be able to have different |
As it stands this warrants the following in pseudo code:
type EqlSearchResponse<TSource> {
hits: Hits;
}
type Hits {
results: ResultBase[]
}
type ResultBase {}
type Event<TSource> inherits ResultBase {
results: EventResult<TSource>[]
}
type KeyedEvent<TSource> inherits Event<TSource> {
keys: string[]
}
type SyntheticEvent inherits ResultBase {
_count: int?; //etc
}
type EventResult<TSource> {
_index: string; //etc
_source: TSource;
}
We can only present the lowest common denominator back to the user. If it was modeled as separate properties:
type EqlSearchResponse<TSource> {
hits: Hits<TSource>;
}
type Hits<TSource> {
events: Event<TSource>[];
sequences: KeyedEvent<TSource>[];
counts: SyntheticEvent[];
}
type Event<TSource> {
results: EventResult<TSource>[]
}
type KeyedEvent<TSource> inherits Event<TSource> {
keys: string[]
}
type SyntheticEvent {
_count: int?; //etc
}
type EventResult<TSource> {
_index: string; //etc
_source: TSource;
}
Here the user can use It is entirely possible to read
type Hits<TSource> {
results: (EventSource<TSource> | KeyedEvent<TSource> | SyntheticEvent)[]
}
but this poses several problems:
|
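To illustrate the runtime inspection a union-typed `results` list pushes onto the consumer, here is a small Python sketch. The class names mirror the pseudo code above; the `parse_result` and `describe` helpers, and the rule used to tell the shapes apart, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Event:
    results: List[dict]


@dataclass
class KeyedEvent(Event):
    keys: List[str]


@dataclass
class SyntheticEvent:
    count: Optional[int]


Result = Union[Event, KeyedEvent, SyntheticEvent]


def parse_result(raw: dict) -> Result:
    """Pick a concrete type by inspecting the raw entry (assumed heuristic)."""
    items = raw["result"]
    if items and "_count" in items[0]:
        return SyntheticEvent(count=items[0]["_count"])
    if "keys" in raw:
        return KeyedEvent(results=items, keys=raw["keys"])
    return Event(results=items)


def describe(result: Result) -> str:
    """The consumer has to branch on the concrete type at run time."""
    if isinstance(result, SyntheticEvent):
        return "count"
    # KeyedEvent is checked before Event because it is a subclass of Event.
    if isinstance(result, KeyedEvent):
        return "keyed"
    return "event"
```

With separate properties (`events`, `sequences`, `counts`) the same information would be carried by the property name, so no `isinstance` checks would be needed on the consumer side.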
I had a discussion with @Mpdreamz which I'll try to summarize below - Martijn please check whether this is accurate:
specifying the type of result (
@Mpdreamz pointed out that being able to map constructs in the response to entities allows the generation of high-level clients driven by the API.
The aggregations in EQL return customized responses - currently these are exposed as
The other aggregation which is somewhat similar is It is likely that more aggregations will be added in the future - using a dedicated blog for them can be expensive from a backwards compatibility POV.
vs be explicit
Being loose is misleading since a new agg can return different fields than those expected while explicit can break existing clients by introducing an unknown element. |
A lot of this discussion has been about the strictness of a good type system and the role that plays in the parser, but I think that sidesteps the initial concern raised during the EAH talk: can (and should) we use a generic result format, so that we can be more future-proof to other response types that don't exist yet? Would it result in simpler or more complex parsing code? I still don't see the problem with having a generic "result" object, where you just return an array of results. I don't understand what we gain by having Sequence or Join objects vs just a generic result. I think constructing our interface via composition makes more sense than inheritance, since that's analogous to how we perform queries. The generic
class Result {
_count: int?;
_percent: float?;
_count_keys: object[]?;
_join_keys: object[]?;
_type: string; // "sequence"/ "join" / "event"
_events: Event[]?;
}
class Event {
_source: object;
// ...
} We could also consolidate the |
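A minimal Python sketch of the composition-style `Result` described in the pseudo code above, assuming absent fields simply deserialize to `None`. The `result_from_json` helper is hypothetical, and the JSON field names follow the pseudo code rather than any released API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    source: dict


@dataclass
class Result:
    type: str  # "sequence" / "join" / "event"
    count: Optional[int] = None
    percent: Optional[float] = None
    count_keys: Optional[list] = None
    join_keys: Optional[list] = None
    events: Optional[List[Event]] = None


def result_from_json(raw: dict) -> Result:
    """Build a Result from one raw entry; missing fields stay None."""
    events = raw.get("_events")
    return Result(
        type=raw["_type"],
        count=raw.get("_count"),
        percent=raw.get("_percent"),
        count_keys=raw.get("_count_keys"),
        join_keys=raw.get("_join_keys"),
        events=None if events is None else [Event(source=e["_source"]) for e in events],
    )
```

The trade-off is visible in the type itself: every field except `type` is optional, so the consumer learns which fields are populated only by checking them.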
If I understand your proposal correctly, you are suggesting to:
That to me is a step backwards since the response is even looser and reduces the data uniformity, that one entity to iterate on (the events) and encapsulation as the I think we should have as little variability as possible in the response hence the preference to push that into |
@costin that's a fair summation 👍 A property (P1) should hold a single object with fixed keys In this particular case |
One other adjustment that would be useful to have as an API user would be some representation of the field names for each join key in addition to the values. Currently if you do |
If I understand correctly, your preference would be for the current response, with dedicated formats per query and in case of Take the following sequence which declares a top-level join key (
This currently produces the following response structure:
Since the key names are the same for the sequence, it doesn't make sense to add them for each result, however we could add the option to describe the sequence.
That is add a separate field which would contain an array of arrays, that is for each declared query, the list of declared key names as declared in the query. |
I think that would work. Could we even take it a step further and separate out the top level keys from the query-level keys, like
The advantage here as an API user is I don't have to examine the key names to determine which ones are common across all sub-arrays in For context, the reason I'm particularly interested in the top level keys is that when generating an alert document based on a sequence match, I want to populate the alert with extra information about the sequence. If there's a specific field (or fields) that is the same in every event, it makes sense to include that field in the new generated alert document as well. For example I think |
I don't think so since top level keys are just syntactic sugar:
is the same as
Furthermore, top-level keys don't cover all cases. For example, say you have two keys, the first per-query, the second one shared. One cannot declare the latter as a top-level key since that would change the joining order (the match should occur on the first then second key, not vice versa):
is not the same as
Essentially, the only reliable way to find a common key across queries is to look at the key names and compare their value and position; the keys with the same name in the same position are the same since EQL does not support aliasing. To wit, using the example I gave in my previous post:
There are 2 queries, both with 2 keys with the first one in-common (
and
They are equivalent and both have a key in common though only one uses the top-level declaration. |
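The position-based comparison described above can be sketched in Python as follows. It assumes the response exposes the declared key names as an array of arrays (one list per query), as suggested earlier in the thread; the function name and the field names in the usage example are hypothetical.

```python
def common_keys(key_names_per_query: list) -> list:
    """Return the key names that appear at the same position in every query."""
    common = []
    # zip(*...) groups the names found at position 0, position 1, and so on.
    for names_at_position in zip(*key_names_per_query):
        first = names_at_position[0]
        if all(name == first for name in names_at_position):
            common.append(first)
    return common


# Hypothetical key names for a two-query sequence.
print(common_keys([["user.name", "process.pid"], ["user.name", "process.ppid"]]))
```

Because EQL does not support aliasing, comparing names is enough here; the same key name at a different position is deliberately not treated as common.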
The proposal in this ticket did not get enough support so I'll be closing the ticket without any changes to the existing request/response format. Thanks for the feedback everyone. |
A piece of feedback from the demo sessions on EQL was that having different response types for each query makes things difficult for consumers. That is, regardless of whether the consumer cares about the initial query, it has to have different types of parsing.
Please see #49634 on how we ended up with the current approach.
One thing to keep in mind is we should strive for extensibility - if we are to add other features, we should have space to evolve the response. With the current approach that would be both easy and hard - easy because we could add just another response, hard because existing clients will have issues adapting.
To reiterate, currently each query creates a slightly different response structure. All responses return events but sequences and joins wrap said events into another structure:
event where filter
sequence by ..
`join` queries are similar to the ones above, replacing `sequence` accordingly.
The other response which breaks the mold is `count` since it is not document based:
Here are a number of proposals on my end to get the discussion started:
Rename `join_keys` to `keys`.
Since we have both `sequence` and `join`, having `join` as a prefix is redundant and confusing. It also makes folks think of a `join` and all its baggage when typically we would expect sequences to be used.
Hence why I propose, and will use in the examples, `keys` instead of `join_keys`: same thing but shorter and more generic.
Rename `events` to something more generic.
EQL is all about events so using `events` in the response is a good choice. However it's limiting if the response should be more generic, such as returning a sequence or a join, since that's not just one event but rather multiple.
Maybe using `hit` (though that one is overloaded) or `match` or `result`?
Proposal
Try to combine all the responses into one, indicating the type of query executed either explicitly (through a separate field, for example) or implicitly based on the response (and the presence of certain fields).
So instead of having different top-level responses, the nesting is lower-level:
sequence by
If no key is specified the `keys` entry would be empty:
Do we want to differentiate between a join/sequence?
Joins are unordered sequences, even though the results are ordered based on the declaration order. Ideally, based on the response, we would know whether a join or a sequence query was asked; however, is that really needed?
If we do, we can potentially add an extra field such as `ordered` to differentiate between the two; however, that is clunky.
As an alternative we could add extra information about when each event matched so one could deduce based on the order whether it is a sequence (the order is ascending) or a join (there is no order though that's not always the case):
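As a sketch of how a consumer might apply that deduction, the Python snippet below checks whether a list of per-event match timestamps is ascending. How the timestamps would be exposed in the response is not specified here, so the function simply takes them as a list of ISO-8601 strings; the name `is_ascending` is hypothetical.

```python
from datetime import datetime


def is_ascending(match_times: list) -> bool:
    """Return True when every timestamp is greater than or equal to the previous one."""
    parsed = [datetime.fromisoformat(t) for t in match_times]
    return all(a <= b for a, b in zip(parsed, parsed[1:]))
```

A `False` result rules out a sequence, but `True` is inconclusive: as noted above, a join can happen to match its events in ascending order.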
Based on that approach, a simple query:
`event where condition`
would return:
Essentially there will be one entry per match since a sequence/join requires at least two queries. Notice that there are no `keys` entries, which might be confusing for folks expecting this entry; we could add it and have no keys for it.
`count` would take a similar approach, returning the results as a document inside a `match` clause; however there would be no `_source` to speak of but rather the predefined aggregation fields.
Thoughts?