Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Meta] Add ECS Dataset fields #845

Closed
wants to merge 5 commits into from
Closed

Conversation

ruflin
Copy link
Contributor

@ruflin ruflin commented May 13, 2020

This adds the dataset fields which are used for the new indexing strategy to ECS. More discussion around these fields can be found here: elastic/package-registry#482 To goal of having these in ECS is to allow any timeseries shipper to use these fields and get the benefit of the new indexing strategy.

Before we landed on dataset.* quite a few discussions happened if we could use event.kind, event.dataset, event.type. With event.kind there are two main problems:

  • It is not a constant_keyword so this would be a breaking change
  • It already contains more values in it than we need.

The first problem also applies to event.dataset even though the field has the same content. It felt also odd to have some of the fields under event.* and some under dataset.*.

An other option discussed was to use datastream.* based on the new Data Stream feature in Elasticsearch as this is where all data ends up. But this would indicate it is a feature of the Data Streams itself but it is not. So dataset seemed to be the best fit.

This adds the stream fields which are used for the new indexing strategy to ECS: https://github.com/elastic/kibana/blob/master/docs/ingest_manager/index.asciidoc#indexing-strategy-1 To goal of having these in ECS is to allow any timeseries shipper to use these fields and get the benefit of the new indexing strategy.

Before we landed on `stream.*` quite a few discussions happened if we could use `event.kind`, `event.dataset`, `event.type`. With `event.kind` there are two main problems:

* It is not a constant_keyword so this would be a breaking change
* It already contains more values in it then we need.

The first problem also applies to `event.dataset` even though the field has the same content. It felt also odd to have some of the fields under `event.*` and some under `stream.*`.

An other option discussed was to use `datastream.*` based on the new Data Stream feature in Elasticsearch as this is where all data ends up. But this would indicate it is a feature of the Data Streams itself but it is not. So `stream` seemed to be the best fit.
@ruflin ruflin requested a review from webmat May 13, 2020 07:20
@ruflin ruflin self-assigned this May 13, 2020
@ruflin
Copy link
Contributor Author

ruflin commented May 13, 2020

@webmat It delete the use case directory when I did run make clean. Not sure if this is expected, seems like you have a PR for that: #632

There is no unreleased section in the Changelog. Where should I add it?

ruflin added a commit to ruflin/ecs that referenced this pull request May 13, 2020
With elastic#845 stream.dataset is introduced. Even though event.dataset is still heavily used, users should start to switch over to using stream.dataset. I expect event.dataset to be remove from ECS in the next major version.

stream.dataset and event.dataset have the same content but are not of the same type. First one is constant_keyword, other one keyword. On the query side this should not make a difference but not sure if it is possible to use alias to reference one to an other.
@ruflin ruflin mentioned this pull request May 13, 2020
ruflin added a commit to ruflin/ecs that referenced this pull request May 13, 2020
event.module was introduced together with event.dataset. With elastic#845 the new stream fields are introduced and module is not used anymore.

When we introduce event.module the assumption was that this is the field used for quering but it turned out in most casese event.dataset is needed and required. I expect event.module to be removed from ECS in the next major version but this does not mean, it cant still be used by Beats for example.
@ruflin ruflin mentioned this pull request May 13, 2020
@ruflin
Copy link
Contributor Author

ruflin commented May 13, 2020

@webmat CI fails because there is no support for constant_keyword fields. Let me see if I can contribute that.

@ruflin
Copy link
Contributor Author

ruflin commented May 13, 2020

Two related PR's to deprecate existing fields can be found here:

@ruflin ruflin removed the request for review from amitaiporat May 13, 2020 07:44
schemas/stream.yml Outdated Show resolved Hide resolved
@neu5ron
Copy link

neu5ron commented May 15, 2020

event dataset and event module and event kind are "core" fields, what is happening?

will there be a backwards compatibility field/value/thing added for people to automatically use/run to alleviate what this will cause? I think you all probably understand what this affects in SIEM, Kibana, and the entire communities ECS parsers (rocknsm, security onion, etc...) - who I don’t speak on behalf of any of them btw.

@defensivedepth
Copy link

Hello there!

Which Elastic license are these new ECS fields under?

I'm also trying to wrap my mind around how we would migrate to this new structure for how we are currently using Modules & Dataset?

image

@dainperkins
Copy link
Contributor

As a replacement for event.[dataset/module] the stream fields (And what I've quickly read on the constant_keyword idea) seem like the kind of thing that:

  1. Goes against the ecs concept of describing events (in this case, describing the target indexing strategy)
  2. will require significant restructuring of, well, every ingest mechanism I can think of
  3. Will seriously increase the # of indexes (every dataset / module it seems would be its own index - e.g. [PA]*[traffic, threat, malware, a/v, flow, auth, vpn, etc.], 5x as many for cisco, 20 or so for zeek, and thats only 3 that come immediately to mind
  4. Deprecation of the original fields would effectively relegate ecs to commercial licenses only (for a complete implementation anyway)

Could event.dataset / event.module not be allowed to be either keywords or constant_keywords depending on implementation?

@ruflin ruflin added the discuss label May 19, 2020
@ruflin
Copy link
Contributor Author

ruflin commented May 19, 2020

I labeled this PR discuss and closed the deprecation PRs to make it more obvious, this PR is to start the discussion and is not meant to be merged 1:1 (or at all).

@neu5ron I hope my above comment also helps to answer the "what is happening" question.

@neu5ron @defensivedepth Thanks for your responses. This is really helpful for me to better understand on how these fields are used at the moment and I think I underestimated how broadly they are used, which is great! I'm sorry that my proposal was probably a bit too direct.

@defensivedepth What do the indices look like that you use to store your datasets?

@webmat
Copy link
Contributor

webmat commented May 20, 2020

Thanks @ruflin!

constant_keyword

I see constant_keyword fields like a performance optimization that's tied to a specific indexing strategy. Therefore perhaps it's not necessary to capture in a hard and fast way as constant_keyword at the ECS level.

At one extreme we have the new indexing strategy you refer to (linked to in the field set description) that has one index per combination of type/dataset/namespace. constant_keyword makes total sense there.

But at the other extreme, a user could perfectly have a logstash-* index template that defines the same 3 fields as keyword, and have their index be compatible with querying stream.* fields (just not as fast).

The datatype could get a special mention for these fields here of course, because it'll be a significant performance improvement for folks that do adopt this indexing strategy.

deprecations

I think deprecations can and should be done when deemed necessary. I agree with the community that opening the PRs to deprecate these two event fields was perhaps a bit fast. Let's ease into the discussion and see if it's needed ;-) I also think that the mention of event.kind in the PR body added to the confusion 😂 The 4 categorization fields are not going away :-)

As a contrast, I think some deprecations we're considering are not as controversial, because they specifically fix issues that have confused users. Like #841 👀

I do think that the current fields to capture the raw "where it's from" (event.module, event.dataset) have issues. But replacing them with these new fields that are laser focused on one precise use case, with no consideration for partner and user data sources is not the right approach.

issues I see with the proposal

Currently, the values for event.module and event.dataset are left to the implementation, with no restrictions from ECS' point of view. Filebeat and Metricbeat happen to use a "[module].[dataset]" convention to populate it. But other sources like Auditbeat and Winlogbeat don't follow this convention. I'm not sure what the upcoming Endpoint solution will have in these fields either. Please check in with them, and confirm whether adopting this strategy and these new conventions is on the roadmap.

In this proposal stream.dataset is now used as part of the index name (like the 2 other fields), which means these now have strict restrictions on which characters are acceptable in there. So this means stream.dataset is not a direct replacement for event.dataset. Users couldn't necessarily use an alias field to migrate. They'd need to ensure none of the values in there for their custom datasets contain / or spaces, for example.

About the idea of using *.dataset to create distinct indices, I think this will lead to too many indices for integrations like the Zeek module: https://github.com/elastic/beats/tree/master/x-pack/filebeat/module/zeek. This gets back to constant_keyword being an optimization: I think the Zeek module could very well index events in 2-3 different indices in this new indexing strategy. Perhaps one index for "notice" (alert) events, one for all the network protocol events, and one for stats. But I still think it's useful to have the fine grained event.dataset to capture the fine grained details to distinguish "zeek.smtp", "zeek.snmp", "zeek.ssl" and so on.

So here perhaps a solution is to replace stream.dataset with stream.name which would be meant to be strictly be the middle part of the index name, with no overloaded meaning. If for a simple integration like Apache HTTPD's you still need a distinct index for access logs and error logs, then in this case stream.name and event.dataset match 1:1. But in the case of more complex integrations like Zeek, you keep all of the variability in event.dataset, and perhaps you have only "zeek.notice" and "zeek.event" in the stream.name field, yielding two indices instead of 40+.

issues with the current fields

The fields as currently defined aren't perfect either:

  • event.module is not a great name, it sounds too Filebeat/Metricbeat specific. Not all pipelines or tools have "modules". But most have broad groupings or pipelines per vendor/event source.
  • event.* is a bit overloaded. If we could capture the raw "where it's from" via a different dataset, this may be a good clarification.
  • As we consider expanding the scope of ECS to capture documents that aren't time series (e.g. various inventories, end user documents), I think we'll also need to capture a raw "where it's from". In this case, stream.* is not appropriate (and to be fair, neither is event.*).
  • Right now we don't offer a clear way for community members such as partners to also fill in these "where it's from" fields without conflicting with existing modules.

But at this point I haven't seen any analysis to address this.

@webmat
Copy link
Contributor

webmat commented May 28, 2020

Will you adjust this PR with the new proposal in elastic/package-registry#482?

@ruflin
Copy link
Contributor Author

ruflin commented Jun 10, 2020

@webmat As an update, I started to push some changes to the yaml files but need to find some time to also answer all the above inputs. Will hopefully get to this soon.

@webmat webmat changed the title Add stream fields Add dataset fields Jun 11, 2020
@ruflin
Copy link
Contributor Author

ruflin commented Jun 22, 2020

@webmat Here are finally the answers for your comments above:

  • constant_keyword: If we mark the fields as keyword, we don't know which fields should / could be used as constant_keywords. Because of this, I would propose to specify them as constant_keyword by default but have it as a rule in ECS, that keyword can also be used. How this will work for the template generation, I don't know :-(
  • event.dataset vs module: The assumption here is not correct. event.dataset and dataset.name are equal. What you refer to as [module].[dataset] is [module].[metricset/fileset]. So event.dataset and dataset.name contain the same values.
  • Consideration of other datasets: The goal of the new indexing strategy is to make it available to everyone and any dataset definition from the community and users. It is today with the only requirement that - cannot be used in the name of a dataset. It will of course take some time to rollout this new strategy and get users on board with it. But I'm personally convinced long term most users should use it to also get all the benefits of it.
  • Endpoint: Endpoint will fully use the new indexing strategy
  • Too many indices: We are aware of a potential challenge here and work closely with the ES team to monitor this and solve it, so I'm not too worried. Zeek should have 40 datasets if all the dataset look different.

In summary, the dataset.* fields are not a replacement for any of the existing fields. There is a partial overlap we need to address in the future but I think this will come up with the more fundamental question around event.* you brought up.

With the new indexing strategy we also enforce the indices to be ECS compatible (at least the basics). So having these fields here in ECS would make it possible for us to point users to it for details and also the other way around to point users to the new indexing strategy.

@ruflin
Copy link
Contributor Author

ruflin commented Jun 22, 2020

I also update the PR description to match the most recent changes in the PR.

@dainperkins
Copy link
Contributor

@ruflin

With the new indexing strategy we also enforce the indices to be ECS compatible (at least the basics)

Just wondering if the above means:
a) All stream indices MUST be ECS compatible?
b) All stream indices will start as ECS compatible, but can be extended easily by users with custom requirements?

How difficult will it be for users to e.g. extend e.g. Cisco Firepower integrations - for either A or B (or another option if I missed something)?

@ruflin
Copy link
Contributor Author

ruflin commented Jun 23, 2020

a) We probably must define the term "ECS compatible" here. We currently enforce a few fields like @timestamp, ecs.version, host.ip to make sure for example someone doesn't use host as keyword.
b) yes. This becomes especially easy with the component template feature where they can add their own component with the fields needed.

And as you know, with our stack "enforcing" is a tricky thing as users can change everything. Perhaps we should rephrase it to "out-of-the-box".

Extending integrations we ship is something we work on and is a bit tricker then what I described above. If we would never upgrade integrations, it would be simple. The user extends and then has his own fields on top of ours. But if we upgrade an integration, we might also add new fields and new ingest pipeline, in the worst case not compatible with what the user added. The mapping additions are the easiest part here assuming component templates are used, but ingest pipelines might conflict. So no final answer on how it exactly going to work but we will find a way.

[[ecs-dataset]]
=== Dataset Fields

The dataset fields are part of the new [indexing strategy](https://github.com/elastic/kibana/blob/master/docs/ingest_manager/index.asciidoc#indexing-strategy-1).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link is broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update it but also need to find a more stable place. I wonder if I should link to a specific version instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could have this documentation page live right in ECS?

I think this would be a good place to describe the new indexing strategy. In addition to this, it could also address using keyword instead of constant_keyword, when users want to adopt these fields in indices that contain more than one data source.

@MikePaquette
Copy link
Contributor

@ruflin I think one of the goals of adding the new dataset.* fields is to provide alignment across the entire user experience from Elastic web site, to Elastic in-product experiences, to index patterns, to ECS.

Given that ECS categorization fields have been published and are implemented already by producers and consumers of ECS data, I think that the ECS implementation should govern the name choice of the new dataset.* fields. For example, the new dataset fields could/should augment the existing ECS categorization fields in a consistent way, by keeping the part of the name after the dot consistent.

Specifically, I think that the proposed dataset.type field would be better/less confusing as dataset.kind to be consistent with ECS field event.kind. And the dataset.type field could be redefined to be a subcategory similar to event.type.

Here's a picture that i hope is worth 1000 words to help the discussion.

image

And since event.kind does not currently have an allowed value of "logs", we could create a new allowed value for it.

That would lead us to have two sets of categorization fields:

  • event.kind + dataset.kind
  • event.dataset + dataset.name
  • event.category + dataset.category
  • event.type + dataset.type
  • none + dataset.namespace

Thoughts?

Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only requirement that - cannot be used in the name of a dataset

I think the list of disallowed characters should be longer: for example / and . I'm sure there's more: essentially all characters that aren't allowed in an index name.

Too many indices: We are aware of a potential challenge... Zeek should have 40 datasets if all the dataset look different.

👍

the dataset.* fields are not a replacement for any of the existing fields.

Yes I think this is becoming clearer, thanks. The dataset fields are meant to describe "where it's from" in a better way than event.module and event.dataset, whereas the ECS categorization fields are meant to capture "what it is".

The point above could also help answer the array question for the new dataset fields. While two of ECS' categorization fields can be arrays (event.category and event.type), this probably doesn't apply to the dataset fields. Events that have enough information to fall in multiple categories on the "what it is" axis (e.g. it's a network flow + it's an authentication, a typical combo from firewalls) would still only come from one place (a firewall).

the new indexing strategy with these fields.

All three fields are `constant_keyword` fields.
footnote: >
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The footnote attribute has never been used, after all. Although you can leave this here. We're working on supporting an additional plain asciidoc file per field set, to allow documenting field sets at length. When we have this, we'll review the footnote entries for content :-)

[[ecs-dataset]]
=== Dataset Fields

The dataset fields are part of the new [indexing strategy](https://github.com/elastic/kibana/blob/master/docs/ingest_manager/index.asciidoc#indexing-strategy-1).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could have this documentation page live right in ECS?

I think this would be a good place to describe the new indexing strategy. In addition to this, it could also address using keyword instead of constant_keyword, when users want to adopt these fields in indices that contain more than one data source.

@ruflin
Copy link
Contributor Author

ruflin commented Jul 3, 2020

@MikePaquette Thanks for putting these details together. One of the reasons I initially picked dataset.type is that I think it aligns well with some other *.type fields in ECS. We considered using event.kind initially but stayed away from it because as you mentioned, logs is missing and it is used as the foundation for categorisation which I don't think applies to dataset.type.

I think moving forward there are two models:

  • We align the fields as much as possible, the naming should be aligned
  • We treat them as independent and each focuses on its use case and different naming might even be a feature

Even if we go with option 2, I think it is good to keep a certain relation of these fields in mind. If we go with 1, I'm worried we dilute the categorization fields to also be used for other purposes.

The main purpose for me of the dataset.* fields is for the indexing strategy and not related to categorisation. Should these fields even be in ECS? This is a valid question. My answer would be yes as they play a pretty fundamental role in the Elastic Stack moving forward. I think one of the problems I caused myself in the beginning is that I attached the event.* and dataset.* to each other but they are not as they have different purposes?

@jamiehynds jamiehynds changed the title Add dataset fields [Meta] Add ECS Dataset fields Sep 3, 2020
@ruflin
Copy link
Contributor Author

ruflin commented Sep 15, 2020

I'm closing this PR as in the meantime, the fields we use in agent have been renamed to data_stream.*. We will open a separate PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants