
EQL: consolidate response format #57036

Closed

costin opened this issue May 21, 2020 · 23 comments

Labels: :Analytics/EQL EQL querying, >enhancement, Team:QL (Deprecated) Meta label for query languages team

costin (Member) commented May 21, 2020

A piece of feedback from the demo sessions on EQL was that having different response types for each query makes things difficult for consumers. That is, regardless of whether the consumer cares about the initial query, it has to implement a different type of parsing for each response.

Please see #49634 on how we ended up with the current approach.

One thing to keep in mind is we should strive for extensibility - if we are to add other features, we should have space to evolve the response. With the current approach that would be both easy and hard - easy because we could add just another response, hard because existing clients will have issues adapting.

To reiterate, currently each query creates a slightly different response structure. All responses return events but sequences and joins wrap said events into another structure:

event where filter

{
    "took" : 5,
    "timed_out" : false,
    "hits" : {
        "total" : {
            "value" : 100,
            "relation" : "eq"
        },
        "events" : [
            {
                "_index" : "my_index",
                "_id" : "0",
                "_sequence_id": 0,
                "_source" : {
                    "date" : "2009-11-15T14:12:12",
                    "event": {
                        "type": "process"
                    }
                }
            }
        ]
    }
}

sequence by ..

{
    "took" : 5,
    "timed_out" : false,
    "hits" : {
        "total" : {
        },
        "sequences" : [
            {
                "join_keys": [ "4021" ],
                "events": [ ... ]
            }
        ]
    }
}

join queries are similar to the ones above, replacing sequence accordingly.

The other response which breaks the mold is count, since it is not document based:

{
    "timed_out": false,
    "took": 5,
    "hits": {
        "total" : {
            "value" : 100,
            "relation" : "eq"
        },
        "counts": [
            {
                "_count": 40,
                "_keys": [...],
                "_percent": 0.4223148165093,
                "_values": [...]
            }
        ]
    }
} 
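
To illustrate the consumer-side cost, here is a minimal TypeScript sketch of the branching a client currently needs; the helper and type names are made up for illustration, not taken from any client library:

// Illustrative sketch only: with the current format a consumer has to branch on
// which field is present ("events", "sequences" or "counts") because each query
// type nests its documents differently.
interface EqlHit {
  _index: string;
  _id: string;
  _source: Record<string, unknown>;
}

function extractDocuments(hits: any): Array<Record<string, unknown>> {
  if (Array.isArray(hits.events)) {
    return hits.events.map((e: EqlHit) => e._source);
  }
  if (Array.isArray(hits.sequences)) {
    // sequences (and joins) wrap their events one level deeper
    return hits.sequences.flatMap((s: { events: EqlHit[] }) =>
      s.events.map((e) => e._source)
    );
  }
  if (Array.isArray(hits.counts)) {
    // counts are not document based; return the synthetic entries as-is
    return hits.counts;
  }
  return [];
}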

Here are a number of proposals on my end to get the discussion started:

  1. rename join_keys to keys.

Since we have both sequence and join, having join as a prefix is redundant and confusing. It also makes folks think of a join and all its baggage when typically we would expect sequences to be used.
Hence I propose, and will use in the examples, keys instead of join_keys - the same thing but shorter and more generic.

  2. consider moving away from events to something more generic.

EQL is all about events so using events in the response is a good choice. However, it's limiting if the response should be more generic, such as when returning a sequence or a join, since that's not just one event but rather multiple.
Maybe use hit (though that one is overloaded), match, or result?

Proposal

Try to combine all the responses into one, indicating the type of query executed either explicitly (through a separate field, for example) or implicitly based on the response (and the presence of certain fields).

So instead of having different top-level responses, the nesting is lower-level:

sequence by

{
    "took" : 5,
    "timed_out" : false,
    "hits" : {
        "total" : {
        },
        "result" : [
            {
                "keys": [ "key1" ],
                "match": [{ _source1 }, { _source2 }]
            },
            {}
        ]
    }
}

If no key is specified the keys entry would be empty:

 "result" : [
    {
        "keys": [],
        "match": [{ _source1 } , { _source2 } ]
    },

Do we want to differentiate between a join and a sequence?

Joins are unordered sequences - even though the results are ordered based on the declaration order. Ideally, based on the response, we would know whether a join or a sequence query was asked; however, is that really needed?
If we do, we could potentially add an extra field such as ordered to differentiate between the two; however, that is clunky.
As an alternative, we could add extra information about when each event matched, so one could deduce from the order whether it is a sequence (the order is ascending) or a join (there is no order, though that's not always the case):

"match": [{ pos: 0, _source1 } , {  pos: 1, _source2 } ]

Based on that approach, a simple query:

event where condition would return:

"result" : [
{
    "match": [{ _source1 }]
},
{
    "match": [{ _source2 }]
},

Essentially there will be one entry per match, since a sequence/join requires at least two queries. Notice that there is no keys entry, which might be confusing for folks expecting it; we could add it with an empty value.

count would take a similar approach, returning the results as a document inside a match clause; however, there would be no _source to speak of but rather the predefined aggregation fields.

Thoughts?

costin added the >enhancement, :Analytics/EQL EQL querying and team-discuss labels and removed needs:triage May 21, 2020
elasticmachine (Collaborator) commented

Pinging @elastic/es-ql (:Query Languages/EQL)

elasticmachine added the Team:QL (Deprecated) label May 21, 2020
costin (Member, Author) commented May 21, 2020

Tagging @colings86 and @tsg in particular

costin (Member, Author) commented May 27, 2020

After a first round of discussions, the following format is proposed. At a high level, the results are "merged" into a list, so all responses will essentially return a list of results, which for sequences and joins will be lists of lists.

To avoid ambiguity and indicate the type of query, an enum-like field is added so folks that are interested in the type of query can find out without having to filter the results.

The proposed generic response is:

{
 "took": 5,
 "timed_out": false,
 "hits": {
  "total": {
   "value": 100,
   "relation": "eq"
  },
  "type": "event", // one of event, join, sequence, count
  "results": [
   {
    "result": [{ result1 }]
   },
   {
    "result": [{ result2 }]
   }
  ]
 }
}

To wit:

event query

event where filter

{
 ...
 "type": "event",
 "results": [
  {
   "result": [{
     "_index": "my_index",
     "_id": "0",
     "_sequence_id": 0,
     "_source": {
      "date": "2009-11-15T14:12:12",
      "event": {
       "type": "process"
      }
     }
    }]
  },
  {
   "result": [{ ... }]
  }
 ]
}

join

join by ...
 [query1]
 [query2]

{
 ...
 "type": "join",
 "results": [
  {
   "keys": ["a", "b"],
   "result": [{
     "_index": "my_index",
     "_id": "0",
     "_sequence_id": 0,
     "_source": {
      "date": "2009-11-15T14:12:12",
      "event": {
       "type": "process"
      }
     }
    }, {
     "_index": "my_index",
     "_id": "1",
     "_sequence_id": 1,
     "_source": {
      "date": "2009-11-15T14:13:13",
      "event": {
       "type": "process"
      }
     }
    },
    ...
   ]
  },
  {
   "keys": [ ... ],
   "result": [{ ... }]
  }
 ]
}

sequence

Essentially identical to join bar the type value:

sequence by ...
 [query1]
 [query2]

{
 ...
 "type": "sequence",
 "results": [
  {
   "keys": ["a", "b"],
   "result": [{
     "_index": "my_index",
     "_id": "0",
     "_sequence_id": 0,
     "_source": {
      "date": "2009-11-15T14:12:12",
      "event": {
       "type": "process"
      }
     }
    }, {
     "_index": "my_index",
     "_id": "1",
     "_sequence_id": 1,
     "_source": {
      "date": "2009-11-15T14:13:13",
      "event": {
       "type": "process"
      }
     }
    },
    ...
   ]
  },
  {
   "keys": [ ... ],
   "result": [{ ... }]
  }
 ]
}

count

Lastly count:

query
| count

{
 ...
 "type": "count",
 "results": [
  {
   "result": [{
     "_count": 40,
     "_keys": [...],
     "_percent": 0.4223148165093,
     "_values": [...]
    }]
  }
 ]
}

One question is: what does total hits represent?

The number of total documents hit or the number of results? For a join, do we return the number of joins or the number of documents (or matches) across all results?

costin (Member, Author) commented May 27, 2020

Tagging @Mpdreamz, @cjcenizal, @stacey-gammon and @tsg to check for any red flags on the consumer side.

cjcenizal (Contributor) commented

Take my comments with a grain of salt, since I have never worked with EQL before... 😅

I like the solution we arrived at. If I were consuming these responses in JS, I'd be able to use the type field to determine which shape to expect for the results and how to interpret them. The simplification of join_keys -> keys also makes sense to me.
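
For instance, dispatching on the proposed type field from a TypeScript client might look like the sketch below; the interface and function names are invented for illustration:

// Sketch only: switch on the "type" discriminator of the proposed unified
// response; event/count entries hold a single result, sequence/join entries
// group several events under shared keys.
type EqlQueryType = "event" | "sequence" | "join" | "count";

interface UnifiedHits {
  type: EqlQueryType;
  results: Array<{ keys?: unknown[]; result: Array<Record<string, unknown>> }>;
}

function render(hits: UnifiedHits): void {
  switch (hits.type) {
    case "event":
    case "count":
      hits.results.forEach((r) => console.log(r.result[0]));
      break;
    case "sequence":
    case "join":
      hits.results.forEach((r) => console.log(r.keys, r.result.length));
      break;
  }
}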

What's the intention behind wrapping each individual result object in an array for the event and count responses? As a consumer, I think I'd find it cumbersome and confusing to have to extract the object at result[0] for these types. If the intention was simply consistency, then I think we should sacrifice consistency for ergonomics and remove the wrapping arrays. For clients, I expect it'd be trivial to change the way this shape is consumed in response to the type.

Have you considered adding new information for surfacing the number of results, e.g. a field called hits.total_results? This would allow you to retain the existing meaning of hits.total to indicate the total number of documents hit.

rw-access (Contributor) commented

@cjcenizal I think the worry is that embedding several specialized schemas within the response will be more cumbersome to consumers of the API. If we just have a list of results returned, you just have to know how to render "a result". It would also be more future-proof if we added new capabilities to EQL. Results with a single event can be treated as a special case, but they don't have to be. Sequences, joins, counts, and events could all be rendered with the same code.

cjcenizal (Contributor) commented

Thanks for the explanation @rw-access!

rw-access (Contributor) commented May 28, 2020

This looks good.
A few things that I would add or resolve:

  • consistency between keys and _keys - we should pick one. Should everything have a leading underscore? That seems consistent with other APIs.
  • total_results will be all of the results; total_events is the sum of all sequence lengths
  • should we remove the notion of total_hits when it's not relevant? It's nontrivial what it means for sequences/joins and it is harder to estimate
  • we should account for unique_count, which performs a unique while adding _count/_percent to each result

I'm also wondering what something like this would look like:
sequence [A] [B] [C] | unique_count events[0].process_path

That will add a _count to each result, but I don't think to each event within each result. Would you then want to capture both the count key and the join key?

{
  ...
  "type": str, // "count" | "sequence" | "join"
  "results": [
    {
      "_count": null | int,
      "_join_keys": null | [object, ...],
      "_count_keys": null | [object, ...],
      "_percent": null | float,
      "_values": null | [object, ...],
      "_events": null | [
          {"_source": ... },
          {"_source": ... },
          {"_source": ... },
          {"_source": ... }
       ]
    }, ...
  ]
}

costin (Member, Author) commented May 29, 2020

The issue has been raised during our meeting yesterday.
@jpountz pointed out that total hits might not be all that useful so another question for the group is whether we should still keep it or remove it.

costin (Member, Author) commented May 29, 2020

consistency between keys and _keys. we should pick one. Should everything have a leading underscore? That seems consistent with other APIs?

The convention has been to use _ as a namespace for internal fields within a document - that is, fields that are not set by the user but can potentially be seen by them.

Outside an actual user JSON document, using _ doesn't serve any function and hurts the readability of the response.

costin (Member, Author) commented May 29, 2020

total_results will be all of the results; total_events is the sum of all sequence lengths

total hits in ES currently means all the documents that were matched and by default it is not exact. Do I understand correctly that you are proposing adding two counters:

  • one for results - essentially all results (whether we're talking about events/sequences/joins)
  • one for events - all the matches (potentially within sequences/joins)

What about aggregations like count? What should the totals be there?

should we remove a notion of total_hits when it's not relevant? it's nontrivial what it means for sequences/joins and is harder to estimate

I'm tempted to say yes because it seems to be already included in the other totals that we add.

That will add a _count to each result. But I don't think to each event in each result. Would you then want to capture the count key and the join key?

I'm not familiar with unique_count - what does it return currently? Since it's an aggregation, I would have expected only the count to be returned, but it seems the unique values are returned as well.
Probably that means we'll have to come up with a dedicated document for it, potentially breaking out the count, key and other fields per entry...

Mpdreamz (Member) commented Jun 2, 2020

Prefacing this with that my understanding of EQL is extremely limited.

I am worried about shoehorning everything into results.

As it is now, conceptually there is:

results: EventResult[]

results: EventResultWithKeys[]

Where EventResultWithKeys has another result array with different "shapes":

join and sequence return the same hit shape, but count queries return values.

In types this would mean abstracting to interfaces and figuring out at run time what concrete implementation to deserialize to, but the user would still only be exposed to the interfaces and would need to do runtime inspection of these.

To that end I'd very strongly prefer the initial events, sequences and counts properties on hits. It would allow these three to evolve independently too.

The type hint would be a welcome addition to dispatch.

I can write a much larger reply on the implications to the exposed types to the user but tried to keep the initial reply to a bare minimum. Let me know if I need to expand my concerns.

costin (Member, Author) commented Jun 3, 2020

Thanks for the feedback @Mpdreamz.

In types this would mean abstracting to interfaces and figuring out at run time what concrete implementation to deserialize too but the user would still only be exposed to the interfaces and would need to do runtime inspection of these.

Can you expand on that and how it is different from having dedicated JSON definitions for each result? Shouldn't the presence of type make it possible to return the appropriate interface implementation?

As it is now conceptually there is:

results: EventResult[]

results: EventResultWithKeys[]

Where EventResultsWithKeys has another result array with different "shapes"

A client can provide tighter constraints than the JSON one, which is fairly loose.

Taking a step back:
An event query returns a list of events -> event[]
A sequence/join returns a list of lists of events with potential keys -> [<event[],keys>]
Some aggregations like count are out-of-band and are similar to event queries except their data is synthetic: they don't return a _source but rather dedicated fields.

The objective is to formalize these into one response, which essentially boils down to everything returning a list of lists.

An event query returns [event[1]], sequence/join return [<event[], keys>] and some aggs return [synthetic_event[1]].
There are different "shapes" for each event, but why is that an issue?
Different hierarchies would force the same outcome - the upside of the proposed approach is that iterating over the data (list of lists) is common across the different result types.

That is to say, would you be able to have different events/sequences and counts with this approach as well?

Mpdreamz (Member) commented Jun 3, 2020

As it stands this warrants the following in pseudo code:

type EqlSearchResponse<TSource> {
   hits: Hits;
}

type Hits {
    results: ResultBase[]
}

type ResultBase {}

type Event<TSource> inherits ResultBase {
     results: EventResult<TSource>[]
}

type KeyedEvent<TSource> inherits Event<TSource> {
    keys: string[]
}

type SyntheticEvent inherits ResultBase {
    _count: int?; //etc
}

type EventResult<TSource> {
    _index: string; //etc
    _source: TSource;
}

We can only present the lowest common denominator back to the user (ResultBase), which would require runtime downcasting to the proper subclass. The generic TSource is also not persisted down.

If it was modeled as separate properties:

type EqlSearchResponse<TSource> {
   hits: Hits<TSource>;
}

type Hits<TSource> {
    events: Event<TSource>[];
    sequences: KeyedEvent<TSource>[];
    counts: SyntheticEvent[];
}

type Event<TSource> {
     results: EventResult<TSource>[]
}

type KeyedEvent<TSource> inherits Event<TSource> {
    keys: string[]
}

type SyntheticEvent {
    _count: int?; //etc
}

type EventResult<TSource> {
    _index: string; //etc
    _source: TSource;
}

Here the user can use response.hits.type to choose the right property to iterate over, and that property can hold concrete types without doing runtime introspection on the actual types.
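
A minimal sketch of that dispatch (self-contained TypeScript with invented names, mirroring the pseudo types above):

// The "type" hint selects the concretely-typed property to iterate over,
// so no runtime downcasting of a common base type is needed.
interface HitsByType<TSource> {
  type: "event" | "sequence" | "join" | "count";
  events?: Array<{ _source: TSource }>;
  sequences?: Array<{ keys: string[]; events: Array<{ _source: TSource }> }>;
  counts?: Array<{ _count: number }>;
}

function iterate<TSource>(hits: HitsByType<TSource>): void {
  switch (hits.type) {
    case "event":
      hits.events?.forEach((e) => console.log(e._source));
      break;
    case "sequence":
    case "join":
      hits.sequences?.forEach((s) => console.log(s.keys, s.events.length));
      break;
    case "count":
      hits.counts?.forEach((c) => console.log(c._count));
      break;
  }
}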

It is entirely possible to read type and present back a discriminated union/algebraic type to the user:

type Hits<TSource> {
    results: (EventSource<TSource> | KeyedEvent<TSource> | SyntheticEvent)[]
}

but this poses several problems:

  • Not all ecosystems provide good OOTB support for those (I wish they all did)
    • Even in those that do, some are very weak, and for intersecting types KeyedEvent<TSource> might be treated as EventSource<TSource> if the inspection happens in the wrong order.
  • It dictates a model where there needs to be a stateful deserializer, which makes it hard to later introduce POJO-based generation of this part of the API.

costin (Member, Author) commented Jun 5, 2020

I had a discussion with @Mpdreamz which I'll try to summarize below - Martijn please check whether this is accurate:

  1. type

Specifying the type of result (type) looks useful for serialization and received positive feedback. Thus there's consensus on adding that to the response moving forward.
However, due to the JSON spec there are no guarantees on field ordering, so type by itself cannot help the serializer figure out the payload, which would have to be buffered in order to be properly interpreted.

  2. Desire to have 1:1 mapping

@Mpdreamz pointed out that being able to map constructs in the response to entities allows the generation of high-level clients driven by the API.
Having dedicated elements (sequence, join, etc.) makes the response type explicit and thus simplifies SerDes.

  3. the aggregation blocks are still in flux

The aggregations in EQL return customized responses - currently these are exposed as synthetic documents with a set of predefined fields. Take count:

"_count": 40,
"_keys": [...],
"_percent": 0.4223148165093,
"_values": [...]

The other aggregation which is somewhat similar is unique_count, a combination of count and unique - that is, it not only filters things (so duplicates are not returned) but also adds the group count to the result (in SQL that would be the equivalent of SELECT g, COUNT(*) FROM x GROUP BY g).

It is likely that more aggregations will be added in the future - using a dedicated block for them can be expensive from a backwards-compatibility POV.
In other words, we can either be loose and essentially put results in a bag:

"aggregation" : [{ ... whatever fields we want }]

vs be explicit

"agg1" : [{ "fieldA_agg1" : "", "fieldB_agg1" : "", ... }]

Being loose is misleading since a new agg can return different fields than those expected, while being explicit can break existing clients by introducing an unknown element.

rw-access (Contributor) commented Jun 8, 2020

A lot of this discussion has been about the strictness of a good type system and the role that plays in the parser, but I think that sidesteps the initial concern raised during the EAH talk: can (and should) we use a generic result format, so that we can be more future-proof to other response types that don't exist yet? Would it result in simpler or more complex parsing code?

I still don't see the problem with having a generic "result" object, where you just return an array of results. I don't understand what we gain by having Sequence or Join objects vs just a generic result.

I think constructing our interface via composition makes more sense than inheritance, since that's analogous to how we perform queries. The generic Result below has enough information to express all of these.

  • <anything> | count
  • <anything> | count X
  • sequence
  • join
  • event where ...
  • event where ... | unique_count X
  • sequence ... | unique_count X
  • join | unique_count X

class Result {
   _count: int?;
   _percent: float?;
   _count_keys: object[]?;
   _join_keys: object[]?;
   _type: string; // "sequence"/ "join" / "event"
   _events: Event[]?;
}

class Event {
  _source: object;
  // ...
}

We could also consolidate the _count, _percent and _count_keys fields separately if that makes more sense.
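
As an illustration, a hypothetical sequence ... | unique_count X result could populate that generic Result roughly like so (all field values below are made up):

// Hypothetical instance of the generic Result for a keyed sequence piped
// through unique_count; only the fields relevant to that query are set.
const result = {
  _type: "sequence",
  _join_keys: ["4021"],
  _count_keys: ["/bin/example"],
  _count: 40,
  _percent: 0.42,
  _events: [
    { _source: { /* first matching event */ } },
    { _source: { /* second matching event */ } },
  ],
};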

costin (Member, Author) commented Jun 11, 2020

If I understand your proposal correctly, you are suggesting to:

  • make the event list optional
  • promote non-source fields up top, such as join keys and any potential aggregation fields
  • have a type on each result

That to me is a step backwards since the response is even looser: it reduces the data uniformity (the one entity to iterate on, the events) and the encapsulation, as the Result can have any possible combination of _count, _join_keys, etc.

I think we should have as little variability as possible in the response, hence the preference to push that into Event (which already has _source open-ended). With your approach, Result starts having a variable form as well due to the large number of optional fields.
We end up with fairly different responses under the same name: with fields but no events, with events and no fields, or with both.
I find that hard to interpret versus there always being a list of events.

Mpdreamz (Member) commented Jun 17, 2020

@costin that's a fair summation 👍

A property (P1) should hold a single object with fixed keys A. If P1 sometimes returns B and C as well, it's more useful to introduce separate properties P2 and P3 for those. I am not the EQL domain expert, so I do not know if A, B and C can be condensed into a meaningful single D representation. I'll leave that in yours and @rw-access's (and everyone else's) capable hands 😄. I just want to close by saying that if D ends up being a bag that can have many different properties, it's not a good abstraction.

In this particular case D always has to be open (generic) because it holds a _source, for which the user needs to tell us the type. So having a D<SourceType> where SourceType is only exposed in some cases already seems like a smell to me.

marshallmain (Contributor) commented

One other adjustment that would be useful to have as an API user would be some representation of the field names for each join key in addition to the values. Currently if you do sequence by host.id then the host.id of the sequence comes back in the join_keys, but we have to either parse some of the original query or write extra logic to know which field the join was done on. I understand the general join case is more complex than sequence by since each sequence event could be using a different expression for the join but hopefully we can find a representation that provides additional details about the join in the response.

costin (Member, Author) commented Sep 8, 2020

I understand the general join case is more complex than sequence by since each sequence event could be using a different expression for the join but hopefully we can find a representation that provides additional details about the join in the response.

If I understand correctly, your preference would be for the current response, with dedicated formats per query, and in the case of sequence/join, to enhance the key support to include their expression names.
The main issue that I see is dealing with keys at the query level, as you pointed out.

Take the following sequence, which declares a top-level join key (t) and secondary, query-level keys (a) and (b).

  sequence by t
   [queryA] by a
   [queryB] by b

This currently produces the following response structure:

{
    "took" : 5,
    "timed_out" : false,
    "hits" : {
        "total" : {
        },
        "sequences" : [
            {
                "keys": [ "firstKey", "secondKey" ],
                "events": [ ... ]
            }
        ]
    }
}

Since the key names are the same across the sequence, it doesn't make sense to add them to each result; however, we could add an option to describe the sequence.

{
    "took" : 5,
    "timed_out" : false,
    "hits" : {
        "total" : {
        },
        "sequences" : [
            {
                "keyNames" : [ ["t", "a"], ["t", "b"] ],
                "keys": [ "firstKey", "secondKey" ],
                "events": [ ... ]
            }
        ]
    }
}

That is, add a separate field containing an array of arrays: for each declared query, the list of key names as declared in that query.
Would that work?

marshallmain (Contributor) commented

I think that would work. Could we even take it a step further and separate out the top level keys from the query-level keys, like

{
    "took" : 5,
    "timed_out" : false,
    "hits" : {
        "total" : {
        },
        "sequences" : [
            {
                "topLevelKeys": {"t": "firstKey"},
                "queryKeyNames" : [ ["a"], ["b"] ],
                "queryKeys": [ "secondKey" ],
                "events": [ ... ]
            }
        ]
    }
}

The advantage here as an API user is I don't have to examine the key names to determine which ones are common across all sub-arrays in keyNames.

For context, the reason I'm particularly interested in the top level keys is that when generating an alert document based on a sequence match, I want to populate the alert with extra information about the sequence. If there's a specific field (or fields) that is the same in every event, it makes sense to include that field in the new generated alert document as well. For example I think host.id would be a common field to use sequence by with

costin (Member, Author) commented Sep 8, 2020

Could we even take it a step further and separate out the top level keys from the query-level keys

I don't think so since top level keys are just syntactic sugar:

sequence by x
 [queryA]
 [queryB]

is the same as

sequence
 [queryA] by x
 [queryB] by x

Furthermore, top-level keys don't cover all cases. For example, say you have two keys, the first per-query, the second one shared. One cannot declare the latter as a top-level key since that would change the joining order (the match should occur on the first key then the second, not vice versa):

sequence 
 [queryA] by y, x
 [queryB] by z, x

is not the same as

sequence by x
 [queryA] by y
 [queryB] by z

Essentially, the only reliable way to find a common key across queries is to look at the key names and compare their value and position; the keys with the same name in the same position are the same since EQL does not support aliasing.

To wit, using the example I gave in my previous post:

 "keyNames" : [ ["t", "a"], ["t", "b"]],

There are 2 queries, each with 2 keys, the first of which (t) is common to both.
This is true for the following queries:

 sequence by t
  [queryA] by a
  [queryB] by b

and

 sequence
  [queryA] by t, a
  [queryB] by t, b

They are equivalent and both have a key in common though only one uses the top-level declaration.
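
A minimal sketch of that position-based comparison, assuming the keyNames field proposed above (the TypeScript helper name is made up):

// A key name is common to the whole sequence when it appears at the same
// position in every query's key list.
function commonKeyNames(keyNames: string[][]): string[] {
  if (keyNames.length === 0) {
    return [];
  }
  return keyNames[0].filter((name, pos) =>
    keyNames.every((queryKeys) => queryKeys[pos] === name)
  );
}

// commonKeyNames([["t", "a"], ["t", "b"]]) -> ["t"]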

costin (Member, Author) commented Sep 28, 2020

The proposal in this ticket did not get enough support, so I'll be closing the ticket without any changes to the existing request/response format.
Even the improvements discussed with @marshallmain turned out to be non-essential, as the information already exists and the sequence structure itself needs to be parsed anyway.

Thanks for the feedback everyone.

costin closed this as completed Sep 28, 2020