Create Synchronous EQL querying REST API #49634
Comments
Pinging @elastic/es-search (:Search/EQL) |
@wylieconlon @chrisdavies -- would be interested to hear your thoughts on the last section or something like this: Note that a row can belong in multiple sequences, hence the sequence_id column is an array type, not a string type. I feel like the first structure might be easier for suggesting visualizations? Especially if we want a generic data table layer like we were talking about, I think we probably don't want to do something like, if the table has a column named "sequenceIds", suggest this visualization. This would probably be a new visualization type as well... maybe something like a table where each row is a sequence but it has collapsible sub rows (kind of like that EAH demo from RBC)? |
For the JSON response of the API, I am leaning towards the response providing a flat list of events with a field for sequence_id. The reason for this is that it will be easily compatible with the current output of results from Endpoint, it fits into the way that SIEM is likely to want to display the results, and the data can be used to create the kinds of nested tables that @stacey-gammon mentions above since all the information will be present. As for the fact that an event can belong in multiple sequences, one solution to this would be to duplicate the event rather than make the sequence_id field multi-valued. @rw-access How does the current EQL implementation output results where the sequences overlap? Does it duplicate the events that are shared by multiple sequences in the results? |
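As a rough illustration of the duplication idea, the sketch below expands events that carry a multi-valued sequence id into one row per sequence. The field name `sequence_ids` and the helper `duplicate_per_sequence` are hypothetical and only serve the example; they are not part of any agreed response format.

```python
from typing import Any, Dict, List


def duplicate_per_sequence(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Emit one copy of each event per sequence it belongs to.

    Events without any sequence id are skipped in this sketch.
    """
    rows = []
    for event in events:
        for seq_id in event.get("sequence_ids", []):
            # Copy every field except the multi-valued one, then add a single-valued id.
            row = {k: v for k, v in event.items() if k != "sequence_ids"}
            row["sequence_id"] = seq_id
            rows.append(row)
    return rows


# Hypothetical input: event "b" is shared by sequences 1 and 2.
events = [
    {"id": "a", "sequence_ids": [1]},
    {"id": "b", "sequence_ids": [1, 2]},
    {"id": "c", "sequence_ids": [2]},
]
rows = duplicate_per_sequence(events)
```

With this shape every row belongs to exactly one sequence, at the cost of repeating the shared event.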
Sorry for jamming several replies in one comment. Buckle in.
What's the rationale for making this implicit vs explicit? One thing I observed in the POC is that the order of the join keys can impact total search time significantly, because of the search_after skipping. For instance, using endpoint_id as the last key took one query 40 minutes, but when I put it as the first key, it was ~15-20% complete after four hours. I eventually just canceled it. If we guess this ordering wrong for implicit join keys, we could degrade performance.
I still don't think we should only allow this to be a specific field, because it might not be flexible enough. If we do, I think
{
  "event_mapping": {
    "file": {
      "index": "endgame-file-*"
    },
    "process": {
      "index": ["endgame-*"],
      "filter": "event.category == 'process'"
    }
  }
}
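To make the mapping idea concrete, here is a minimal sketch of how such an `event_mapping` could be resolved into an index list and an optional filter for a given event type. The helper `resolve_event_type` is hypothetical, and the filter is kept as an opaque string since the thread does not say how it would be evaluated.

```python
from typing import Any, Dict, List, Optional, Tuple

# The mapping from the comment above, as a Python dict.
event_mapping: Dict[str, Dict[str, Any]] = {
    "file": {"index": "endgame-file-*"},
    "process": {"index": ["endgame-*"], "filter": "event.category == 'process'"},
}


def resolve_event_type(
    mapping: Dict[str, Dict[str, Any]], event_type: str
) -> Tuple[List[str], Optional[str]]:
    """Return (index patterns, filter) for an event type (hypothetical helper)."""
    entry = mapping[event_type]
    index = entry["index"]
    # "index" may be a single pattern or a list of patterns; normalise to a list.
    indices = [index] if isinstance(index, str) else list(index)
    return indices, entry.get("filter")


file_target = resolve_event_type(event_mapping, "file")
process_target = resolve_event_type(event_mapping, "process")
```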
Within the Endgame platform, EQL was a bit of an afterthought, so I had to flatten every array regardless, and added some fields so that arrays could be reassembled, with each event as a separate "card", similar to how Kibana presents data. I think we'll have to brainstorm more what the best representation is. This means that if an event is in two sequences (more below), then it is output twice.
This client still operates on the flattened results, with each event as a separate row. If you display columns across different data types, you end up with a sparse matrix.
Good question. Overlaps are technically possible but only for events that could satisfy multiple positions in a sequence. But you won't find one event in the same position for two separate sequences (with one undocumented exception). For reference, https://eql.readthedocs.io/en/latest/query-guide/implementation.html#sequences
For instance, this sequence will have no overlaps because there are no events that satisfy both positions:
sequence
  [file where true]
  [process where true]
But this sequence would link each process to its first child. Every process (minus the initial one at the start of the chain or ones without children) is both a parent and a child, so it'll be in two sequences:
sequence
  [process where true] by pid
  [process where true] by pid
If you have a lineage of A -> B -> C -> D, then you'll see sequences for (A, B), (B, C), (C, D). In my opinion, the clearest representation of sequences was the first picture that @stacey-gammon showed: If you went with the second view,
Then you end up with results (a, b), (b, d), (c, e), (d, f). That's tricky if you require each event to only be shown once. Also, note that the join key is different for each pair.
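The lineage case can be sketched in a few lines. This is only an illustration of why a process ends up in two sequences, not how EQL evaluates sequences. It assumes the second event is joined on its parent's pid; the `ppid` field and the helper `first_child_sequences` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical process events in timestamp order for the lineage A -> B -> C -> D.
processes: List[Dict[str, Any]] = [
    {"name": "A", "pid": 1, "ppid": 0},
    {"name": "B", "pid": 2, "ppid": 1},
    {"name": "C", "pid": 3, "ppid": 2},
    {"name": "D", "pid": 4, "ppid": 3},
]


def first_child_sequences(events: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Pair each process with the first later process that names it as parent."""
    seen: Dict[int, Dict[str, Any]] = {}  # processes observed so far, keyed by pid
    matched = set()                        # parents that already have a first child
    pairs = []
    for event in events:
        parent = seen.get(event["ppid"])
        if parent is not None and parent["pid"] not in matched:
            matched.add(parent["pid"])
            pairs.append((parent["name"], event["name"]))
        seen[event["pid"]] = event
    return pairs


sequences = first_child_sequences(processes)
# B and C each appear once as a child and once as a parent.
```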
|
The rationale is that we want the same rule as written to run against both Elasticsearch and Endpoint so we need some way to replicate the per endpoint querying in Elasticsearch without needing to change the rule that's run between the endpoint and Elasticsearch.
I think it will be flexible enough, especially when combined with #49713. The ability to create a constant field in the index will allow users to effectively do the reverse of your suggestion and will enable users to query over many indexes whilst still efficiently filtering on the event type. The problem with your suggestion is that it's another kind of mapping which needs to be stored and maintained, which will make the user's experience (especially whilst users are learning the feature) more complicated.
This will actually make the response much easier since each "row" will only belong to a single sequence |
There seems to be a preference for defining sequences in structure rather than a flat list of events with a sequence id. This was also shared by @tsg when I spoke to him the other day about how he would use EQL. @scunningham would Endpoint be OK with getting back structured sequences in the response from EQL in Elasticsearch (essentially option 1 or 2 below) or would it need a flat event structure like it currently has for the endpoints (essentially option 3 below)?
One thing to note is that EQL results are not always sequences; a result can be made up of 1 event or multiple events. To cope with this we can either define the response format as if every result has multiple events (so a result contains an array of events, and the array will only contain a single event if the query does not use sequence or join), or we can define the response so the result can contain different types of payload: event or sequence. The former has the advantage that clients have one kind of response to process and can process it in the same way every time.
To help us make progress, below are some examples of responses in the different forms.
Sequences as structure in the response
Option 1 - Same format for sequence and non-sequence results
Example 1 - non-sequence query
Query:
Response:
Example 2 - sequence query
Query:
Response:
Option 2 - Different format for sequence and non-sequence results
Example 1 - non-sequence query
Query:
Response:
Example 2 - sequence query
Query:
Response:
Option 3 - Results as flat list of events with sequence id
Example 1 - non-sequence query
Query:
Response:
Example 2 - sequence query
Query:
Response:
|
Another option that we discussed today:
Option 4 - Define results type at top level
Example 1 - non-sequence query
Query:
Response:
Example 2 - sequence query
Query:
Response:
|
Adding example of counts query response here based on conversation with @rw-access
|
To add a drive-by comment: I'm in favor of the former approach, unless there is a compelling reason why a single event returned would be truly a different type of thing here. The main reason is that it adds complexity to both server and client in terms of constructing and parsing the response, usually with limited benefit |
The reason, imo, to allow the client to differentiate between the two different types of responses (without having to parse the request) is that the user will likely want to view them differently. How will the user want to view a sequence query result vs the result of a "process where true" query? The same or different? I think the answer is differently. If the result of all non-sequence queries is a table, these queries can be used as a data source in Lens and they can view the results with all of the usual visualization types. Sequence queries are special in that you probably don't want to view the results in something like a bar chart, but more like a specialized nested table structure. |
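A client-side sketch of that distinction might look like the following. The response shape (a top-level `type` field with the values `events` and `sequences`) and the view names are assumptions made purely for illustration, since the thread's actual response examples are not reproduced here.

```python
from typing import Any, Dict


def choose_view(response: Dict[str, Any]) -> str:
    """Pick a rendering strategy from a top-level result type (hypothetical field)."""
    result_type = response["type"]
    if result_type == "events":
        # Plain event results can be treated as an ordinary table data source.
        return "table"
    if result_type == "sequences":
        # Sequence results want a specialised nested view, one row per sequence.
        return "nested_table"
    raise ValueError(f"unknown result type: {result_type!r}")


# Hypothetical responses, only to exercise the dispatch.
events_view = choose_view({"type": "events", "events": [{"id": 1}]})
sequences_view = choose_view(
    {"type": "sequences", "sequences": [{"events": [{"id": 1}, {"id": 2}]}]}
)
```

The point of the sketch is that the client never parses the query; it only inspects the response.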
We met to talk about this, particularly in the context of SIEM, and agreed that we will go with option 4 above for the response format. |
There are aspects to this which we need to work out, like pagination and responses for pipes, but they are tracked in separate issues so I'm closing this one |
The first mode of execution for EQL queries will be running ad hoc EQL queries against historical data (i.e. running the query over large amounts of data already stored in an index in a single run). For this issue we will make the API a synchronous request/response where the execution of the query will complete before returning the response. In a later issue we will address long running EQL queries and explore converting this to an asynchronous API.
Request
Parameters on the request should be (note that we can probably define sensible defaults for everything except the index and the rule):
null
(no query)50
?null
.@timestamp
event.type
null
so by default we would only use the join keys specified in the EQL rule. This option will be useful for the Endpoint use case since we need to be able to run the same rules on Elasticsearch as on the Endpoints but when querying the endpoints, each endpoint is considered individually so we will need some control outside of the rule to get the same behaviour in Elasticsearch.
Note: parameter names are not intended to be final suggestions.
Example minimal request:
Example request with all options:
Response
Although the response does not need to be tabular, it is much easier for UIs and users to consume the results if the response is easily converted to a table.
Information required in response:
Information required for each rule result:
Current format of results
For EQL queries without pipes, the results of an EQL query are always a flat list of events. This means that if the query is a sequence the ordering of the events in the results defines the sequence rather than the sequence being defined by structure. For example if the query was looking for file events followed by a process event the results would look like the following:
From the list above you can see that every 2 events make up an instance of the sequence we are looking for. The downside here is that the client needs to understand the query being run to be able to understand the results. We will probably need to support this style of results output in order to fit in with the way that the endpoint SMP Server currently uses EQL. Note that the SMP Server currently pushes the understanding of the sequences to the user (i.e. it shows the flat events output as returned) for cases where the query is defined by the user. For cases where the server itself defines the EQL (such as in the resolver view) the server has implicit knowledge of what it's asking for so knows how to interpret the results.
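The burden this puts on the client can be sketched as follows: to recover sequences from the flat list, the client has to know the sequence length from the query. The helper `split_into_sequences` and the event fields are hypothetical.

```python
from typing import Any, Dict, List


def split_into_sequences(
    events: List[Dict[str, Any]], sequence_length: int
) -> List[List[Dict[str, Any]]]:
    """Chunk a flat, ordered result list into sequences of a known length."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if len(events) % sequence_length != 0:
        raise ValueError("result list is not a whole number of sequences")
    return [
        events[i:i + sequence_length]
        for i in range(0, len(events), sequence_length)
    ]


# Hypothetical flat output for a "file then process" sequence query.
flat_results = [
    {"event_type": "file", "id": 1},
    {"event_type": "process", "id": 2},
    {"event_type": "file", "id": 3},
    {"event_type": "process", "id": 4},
]
# The value 2 comes from knowing the query has two sequence steps.
grouped = split_into_sequences(flat_results, 2)
```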
Alternatives to current result output
For clients like Kibana (and probably SIEM) it would be better if the client does not need to understand the query in order to interpret the results. The difference here compared with the SMP server is that in Kibana the user will define an arbitrary EQL query and expect Kibana to know how to render it in a way that makes sense. This means that Kibana should not have to understand the query (since we don’t want to have to add a query parser in Kibana as well as ES) but does need the results in a generic understandable form. If sequences are defined as structure Kibana can identify sequences without understanding the query itself (it just needs to understand it might get sequences back containing 1 or more events each). Another option would be to have a “sequence group id” field in the response for each event so events in the same sequence can be matched without having to have explicit response structure.
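A minimal sketch of the "sequence group id" alternative: with such a field on each event, a client can rebuild the sequences without knowing anything about the query. The field name `sequence_group_id` and the helper `group_by_sequence` are hypothetical.

```python
from typing import Any, Dict, List


def group_by_sequence(
    events: List[Dict[str, Any]], key: str = "sequence_group_id"
) -> List[List[Dict[str, Any]]]:
    """Group events sharing a sequence group id, keeping first-seen group order."""
    groups: Dict[Any, List[Dict[str, Any]]] = {}
    for event in events:
        groups.setdefault(event[key], []).append(event)
    # Dicts preserve insertion order, so groups come back in first-seen order.
    return list(groups.values())


# Hypothetical flat response where each event carries its group id.
flat_events = [
    {"id": "f1", "sequence_group_id": 10},
    {"id": "p1", "sequence_group_id": 10},
    {"id": "f2", "sequence_group_id": 11},
    {"id": "p2", "sequence_group_id": 11},
]
sequences = group_by_sequence(flat_events)
```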
The endgame CLI client also has the option to define --flat --columns, which pivots the result data into a table form with the specified columns. This may also be something we would want to support since it will put the results into a much more consumable form for clients like Kibana and is the kind of operation analysts will naturally reach for following the search anyway.
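The pivot can be approximated with a short sketch. This is not the endgame CLI's implementation; the helpers `get_path` and `to_table` and the sample events are hypothetical. It also shows the sparse-table effect mentioned earlier in the thread when columns span different event types.

```python
from typing import Any, Dict, List, Optional


def get_path(event: Dict[str, Any], dotted: str) -> Optional[Any]:
    """Look up a dotted field path in a nested event, or None if it is absent."""
    value: Any = event
    for part in dotted.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def to_table(events: List[Dict[str, Any]], columns: List[str]) -> List[List[Any]]:
    """Pivot events into rows containing only the requested columns."""
    return [[get_path(event, column) for column in columns] for event in events]


# Hypothetical events of two different types.
events = [
    {"event": {"type": "file"}, "file": {"path": "/tmp/x"}},
    {"event": {"type": "process"}, "process": {"pid": 42}},
]
table = to_table(events, ["event.type", "file.path", "process.pid"])
# Columns that do not apply to an event type come back as None.
```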