[Meta] Add ECS Dataset fields #845
This adds the stream fields, which are used for the new indexing strategy, to ECS: https://github.com/elastic/kibana/blob/master/docs/ingest_manager/index.asciidoc#indexing-strategy-1

The goal of having these in ECS is to allow any timeseries shipper to use these fields and get the benefit of the new indexing strategy.

Before we landed on `stream.*`, quite a few discussions happened about whether we could use `event.kind`, `event.dataset`, or `event.type`. With `event.kind` there are two main problems:

* It is not a constant_keyword, so this would be a breaking change.
* It already contains more values than we need.

The first problem also applies to `event.dataset`, even though the field has the same content. It also felt odd to have some of the fields under `event.*` and some under `stream.*`.

Another option discussed was to use `datastream.*`, based on the new Data Stream feature in Elasticsearch, as this is where all data ends up. But this would suggest the fields are a feature of Data Streams themselves, which they are not. So `stream` seemed to be the best fit.
With elastic#845, stream.dataset is introduced. Even though event.dataset is still heavily used, users should start switching over to stream.dataset. I expect event.dataset to be removed from ECS in the next major version. stream.dataset and event.dataset have the same content but are not of the same type: the first is constant_keyword, the other keyword. On the query side this should not make a difference, but I am not sure whether an alias can be used to point one at the other.
event.module was introduced together with event.dataset. With elastic#845 the new stream fields are introduced and module is not used anymore. When we introduced event.module, the assumption was that this would be the field used for querying, but it turned out that in most cases event.dataset is what is needed. I expect event.module to be removed from ECS in the next major version, but this does not mean it can't still be used, by Beats for example.
@webmat CI fails because there is no support for constant_keyword fields. Let me see if I can contribute that. |
Two related PRs to deprecate existing fields can be found here: |
event dataset and event module and event kind are "core" fields, what is happening? will there be a backwards compatibility field/value/thing added for people to automatically use/run to alleviate what this will cause? I think you all probably understand what this affects in SIEM, Kibana, and the entire communities ECS parsers (rocknsm, security onion, etc...) - who I don’t speak on behalf of any of them btw. |
As a replacement for event.[dataset/module] the stream fields (And what I've quickly read on the constant_keyword idea) seem like the kind of thing that:
Could event.dataset / event.module not be allowed to be either keywords or constant_keywords depending on implementation? |
I labeled this PR @neu5ron I hope my above comment also helps to answer the "what is happening" question. @neu5ron @defensivedepth Thanks for your responses. This is really helpful for me to better understand on how these fields are used at the moment and I think I underestimated how broadly they are used, which is great! I'm sorry that my proposal was probably a bit too direct. @defensivedepth What do the indices look like that you use to store your datasets? |
Thanks @ruflin!

constant_keyword

I see At one extreme we have the new indexing strategy you refer to (linked to in the field set description) that has one index per combination of type/dataset/namespace. But at the other extreme, a user could perfectly have a

The datatype could get a special mention for these fields here of course, because it'll be a significant performance improvement for folks that do adopt this indexing strategy.

deprecations

I think deprecations can and should be done when deemed necessary. I agree with the community that opening the PRs to deprecate these two event fields was perhaps a bit fast. Let's ease into the discussion and see if it's needed ;-)

I also think that the mention of

As a contrast, I think some deprecations we're considering are not as controversial, because they specifically fix issues that have confused users. Like #841 👀

I do think that the current fields to capture the raw "where it's from" (

issues I see with the proposal

Currently, the values for

In this proposal

About the idea of using

So here perhaps a solution is to replace

issues with the current fields

The fields as currently defined aren't perfect either:
But at this point I haven't seen any analysis to address this. |
Will you adjust this PR with the new proposal in elastic/package-registry#482? |
@webmat As an update, I started to push some changes to the yaml files but need to find some time to also answer all the above inputs. Will hopefully get to this soon. |
@webmat Here are finally the answers for your comments above:
In summary, the With the new indexing strategy we also enforce the indices to be ECS compatible (at least the basics). So having these fields here in ECS would make it possible for us to point users to it for details and also the other way around to point users to the new indexing strategy. |
I also updated the PR description to match the most recent changes in the PR. |
Just wondering if the above means: How difficult will it be for users to extend e.g. Cisco Firepower integrations - for either A or B (or another option if I missed something)? |
a) We probably must define the term "ECS compatible" here. We currently enforce a few fields like And as you know, with our stack "enforcing" is a tricky thing as users can change everything. Perhaps we should rephrase it to "out-of-the-box". Extending the integrations we ship is something we are working on, and it is a bit trickier than what I described above. If we never upgraded integrations, it would be simple: the user extends an integration and then has their own fields on top of ours. But if we upgrade an integration, we might also add new fields and a new ingest pipeline, in the worst case not compatible with what the user added. The mapping additions are the easiest part here, assuming component templates are used, but ingest pipelines might conflict. So there is no final answer on how exactly it is going to work, but we will find a way. |
[[ecs-dataset]]
=== Dataset Fields

The dataset fields are part of the new [indexing strategy](https://github.com/elastic/kibana/blob/master/docs/ingest_manager/index.asciidoc#indexing-strategy-1).
This link is broken.
Will update it but also need to find a more stable place. I wonder if I should link to a specific version instead.
Perhaps we could have this documentation page live right in ECS?
I think this would be a good place to describe the new indexing strategy. In addition to this, it could also address using `keyword` instead of `constant_keyword`, when users want to adopt these fields in indices that contain more than one data source.
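To make the `keyword` versus `constant_keyword` choice concrete, here is a minimal Python sketch that builds the mapping fragment for either case. The field names (`dataset.type`, `dataset.name`, `dataset.namespace`) and the helper `dataset_mappings` are assumptions for illustration only; the naming was still under discussion in this PR. The `value` parameter on `constant_keyword` is the standard Elasticsearch mapping parameter.

```python
from typing import Dict

# Hypothetical field names; the PR moved between stream.* and dataset.*.
DATASET_FIELDS = ("type", "name", "namespace")


def dataset_mappings(single_source: bool, values: Dict[str, str]) -> dict:
    """Build a mapping fragment for the three dataset fields.

    single_source=True  -> constant_keyword, one fixed value per index
    single_source=False -> keyword, value stored on every document
    """
    properties = {}
    for field in DATASET_FIELDS:
        if single_source:
            # The whole index holds one data source, so the value is constant.
            properties[field] = {"type": "constant_keyword", "value": values[field]}
        else:
            # Several data sources share the index, so values vary per document.
            properties[field] = {"type": "keyword"}
    return {"properties": {"dataset": {"properties": properties}}}
```

An index that follows the new indexing strategy would use `single_source=True`; a shared index, such as one holding many sources side by side, would fall back to `keyword` and keep the same field names.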
@ruflin I think one of the goals of adding the new Given that ECS categorization fields have been published and are implemented already by producers and consumers of ECS data, I think that the ECS implementation should govern the name choice of the new Specifically, I think that the proposed Here's a picture that i hope is worth 1000 words to help the discussion. And since That would lead us to have two sets of categorization fields:
Thoughts? |
the only requirement that `-` cannot be used in the name of a dataset
I think the list of disallowed characters should be longer: for example /
and
. I'm sure there's more: essentially all characters that aren't allowed in an index name.
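As a rough illustration of why the character list matters, the sketch below validates each part and joins them into an index name. The `{type}-{dataset}-{namespace}` layout and the helper names are assumptions made for this example; the forbidden set combines the characters Elasticsearch rejects in index names with `-`, which this layout reserves as the separator.

```python
# Characters Elasticsearch does not accept in index names, plus "-",
# which the assumed naming layout below uses as its separator.
FORBIDDEN_CHARS = set('\\/*?"<>| ,#:') | {"-"}


def validate_part(part: str) -> None:
    """Raise ValueError if a name part cannot be used in the index name."""
    if not part:
        raise ValueError("name part must not be empty")
    if part != part.lower():
        raise ValueError(f"{part!r}: index names must be lowercase")
    bad = sorted(set(part) & FORBIDDEN_CHARS)
    if bad:
        raise ValueError(f"{part!r}: disallowed characters {bad}")


def index_name(type_: str, dataset: str, namespace: str) -> str:
    """Join the parts as {type}-{dataset}-{namespace} (assumed layout)."""
    for part in (type_, dataset, namespace):
        validate_part(part)
    return f"{type_}-{dataset}-{namespace}"
```

With these assumptions, `index_name("logs", "nginx.access", "default")` yields `logs-nginx.access-default`, while a dataset such as `nginx-access` is rejected because the hyphen would make the three parts ambiguous.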
Too many indices: We are aware of a potential challenge... Zeek should have 40 datasets if all the dataset look different.
👍
the dataset.* fields are not a replacement for any of the existing fields.
Yes I think this is becoming clearer, thanks. The dataset fields are meant to describe "where it's from" in a better way than `event.module` and `event.dataset`, whereas the ECS categorization fields are meant to capture "what it is".
The point above could also help answer the array question for the new dataset fields. While two of ECS' categorization fields can be arrays (`event.category` and `event.type`), this probably doesn't apply to the dataset fields. Events that have enough information to fall in multiple categories on the "what it is" axis (e.g. it's a network flow + it's an authentication, a typical combo from firewalls) would still only come from one place (a firewall).
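The array point can be sketched as a small check over a flattened event: the categorization fields may carry lists, while the dataset fields are expected to hold a single value. The dataset field names and the helper `multi_valued_violations` are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical dataset field names; each is expected to hold one value.
SINGLE_VALUED = ("dataset.type", "dataset.name", "dataset.namespace")


def multi_valued_violations(event: dict) -> list:
    """Return the single-valued fields that carry a list in a flattened event."""
    return [
        field
        for field in SINGLE_VALUED
        if isinstance(event.get(field), (list, tuple))
    ]


# A firewall event: several categories on the "what it is" axis,
# but exactly one source on the "where it's from" axis.
firewall_event = {
    "event.category": ["network", "authentication"],
    "event.type": ["connection", "start"],
    "dataset.type": "logs",
    "dataset.name": "firewall.log",
    "dataset.namespace": "default",
}
```

Calling `multi_valued_violations(firewall_event)` returns an empty list, since only the categorization fields are arrays here.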
the new indexing strategy with these fields.

All three fields are `constant_keyword` fields.
footnote: >
The `footnote` attribute has never been used, after all. Although you can leave this here. We're working on supporting an additional plain asciidoc file per field set, to allow documenting field sets at length. When we have this, we'll review the `footnote` entries for content :-)
@MikePaquette Thanks for putting these details together. One of the reasons I initially picked I think moving forward there are two models:
Even if we go with option 2, I think it is good to keep a certain relation of these fields in mind. If we go with 1, I'm worried we dilute the categorization fields to also be used for other purposes. The main purpose for me of the |
I'm closing this PR as in the meantime, the fields we use in agent have been renamed to |
This adds the dataset fields, which are used for the new indexing strategy, to ECS. More discussion around these fields can be found here: elastic/package-registry#482

The goal of having these in ECS is to allow any timeseries shipper to use these fields and get the benefit of the new indexing strategy.

Before we landed on `dataset.*`, quite a few discussions happened about whether we could use `event.kind`, `event.dataset`, or `event.type`. With `event.kind` there are two main problems:

* It is not a constant_keyword, so this would be a breaking change.
* It already contains more values than we need.

The first problem also applies to `event.dataset`, even though the field has the same content. It also felt odd to have some of the fields under `event.*` and some under `dataset.*`.

Another option discussed was to use `datastream.*`, based on the new Data Stream feature in Elasticsearch, as this is where all data ends up. But this would suggest the fields are a feature of Data Streams themselves, which they are not. So `dataset` seemed to be the best fit.