[RFC] Data Source Categorization Fields #958

Closed
73 changes: 73 additions & 0 deletions rfcs/text/0000-data-source-categorization-fields.md
@@ -0,0 +1,73 @@
# 0000: Data Source Categorization Fields
Contributor:

In Ingest Management we went through many iterations on naming, and "data source" also has some history behind it. I'm wondering what exactly we categorize here. Is it the data itself, which is in data_streams? Do we categorize the data_streams? Or do we categorize the source the data is coming from?

Contributor Author:

Hi @ruflin - the intent here is to categorize the source the data is coming from.

<!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->

- Stage: **0 (strawperson)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
Contributor:

I suggest we retarget to stage 1, since there's been so much discussion already.

Member:

++

- Date: **August 26 2020** <!-- The ECS team sets this date at merge time. This is the date of the latest stage advancement. -->
Member:

Will update right before we merge to reflect current date.


<!--
As you work on your RFC, use the "Stage N" comments to guide you in what you should focus on, for the stage you're targeting.
Feel free to remove these comments as you go along.
-->

<!--
Stage 0: Provide a high level summary of the premise of these changes. Briefly describe the nature, purpose, and impact of the changes. ~2-5 sentences.
-->

Elastic currently supports ingestion of data from 180+ sources, and that number keeps growing. However, we do not have a coherent way to categorize these sources. This has resulted in a disconnect in how these sources are categorized across the Elastic website, in-product experiences, and ECS.
Contributor:

I'm wondering if the allowed values for data_stream.category and observer.type should be the same?

Contributor:

Good idea to bring this up.

I'm not sure I would go in this direction. I think we should establish a list of allowed values, and make sure sources and pipelines populate the field based on this predictable list. Otherwise we could get all sorts of arbitrary differences in capitalization and ways of writing things.
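As a purely illustrative sketch of that point (not part of the proposal itself), a source or pipeline could fold variant spellings and capitalizations onto one canonical value before indexing; the canonical forms below are hypothetical:

```python
# Hypothetical sketch: map variant spellings/capitalizations of a category
# onto one canonical value so every source and pipeline indexes the same thing.
CANONICAL = {
    "identity and access management": "iam",
    "iam": "iam",
    "ids": "ids_ips",
    "ids/ips": "ids_ips",
    "operating system": "operating_system",
    "web server": "web_server",
}

def normalize_category(raw: str) -> str:
    """Return the canonical form of a free-form category label."""
    key = raw.strip().lower()
    if key not in CANONICAL:
        raise ValueError(f"unrecognized data source category: {raw!r}")
    return CANONICAL[key]

assert normalize_category("Identity and access management") == "iam"
assert normalize_category("IDS/IPS") == "ids_ips"
```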


The fieldset we use to describe the data source is up for discussion; data_stream.category is one possibility. Here are the proposed allowed values (a short illustrative sketch follows the list):
Contributor:

I might have been the one suggesting data_stream.category as a possibility, a while ago.

But as the data_stream RFC is progressing, I no longer think this is the right approach.

I think the data_stream fields should be dedicated only to the indexing strategy itself, such as "how the index name is created".

I agree that a way of categorizing data sources is needed, but I think this should be a separate field, one that would also make sense in the 7.x monolithic indices. Having an out-of-place data_stream.category field there would not be appropriate.


- apm
Contributor:

Small thing: I suggest we standardize the capitalization and naming. For example, we have an event.category of "iam" but a proposed data_stream.category of "Identity and access management". Also, we have an example of "ids" for observer.type and a proposed data_stream.category of "IDS".

Contributor:

+1

- application
- audit
- CASB
- cloud
- collaboration
- Config Management
- containers
- CRM
- EDR
- email
- firewall
- Identity and access management
- IDS/IPS
- Operating System
- productivity
- proxy
- queue/message queue
- security
Contributor:

How wide/all-encompassing are these fields intended to be? It looks like a mixture of pretty narrow and pretty wide categories. For example, would all firewall, audit, edr, ids/ips, threat intelligence, and vulnerability scanner sources also be marked as security?

Similar thoughts with things like proxy, application, and cloud.

Contributor Author:

Good point. We included some generic categories to allow for searching/correlation across them, e.g. show me events across all my security data sources, cloud sources, etc. It could also open up the possibility of subcategories, e.g. AWS being cloud, but within AWS, CloudTrail could fall under security.

- storage
- threat intelligence
- ticketing
- VPN
- vulnerability scanner
- Web server
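To make the proposal concrete, here is a minimal, hypothetical sketch of a document tagged with these values. The field name data_source.category is a placeholder (the thread above questions whether data_stream.category is the right home), and treating it as an array is an assumption that would let one source carry both a broad and a narrow category, as in the AWS CloudTrail example discussed above:

```python
# Hypothetical example only: "data_source.category" is a placeholder field
# name, not an accepted ECS field; the values come from the proposed list.
cloudtrail_document = {
    "@timestamp": "2020-08-26T12:00:00.000Z",
    "event": {"kind": "event", "category": ["iam"]},
    "cloud": {"provider": "aws"},
    # AWS CloudTrail is a cloud source that could also fall under security.
    "data_source": {"category": ["cloud", "security"]},
}
```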

## Usage
Categorization fields in ECS can govern how we categorize these data sources, but only a limited set of event.category values is supported by the schema today. The event categorization fields are geared toward individual events and don't categorize the data source. Expanding the values we support allows us to align the user experience across ECS, Ingest Manager, and the Elastic website (elastic.co/integrations). Some additional context here: #845 (comment).
Member:

Suggested change
Categorization fields in ECS can govern how we categorize these data sources, but only a limited set of event.category values is supported by the schema today. The event categorization fields are geared toward individual events and don't categorize the data source. Expanding the values we support allows us to align the user experience across ECS, Ingest Manager, and the Elastic website (elastic.co/integrations). Some additional context here: #845 (comment).
Categorization fields in ECS can govern how we categorize these data sources, but only a limited set of event.category values is supported by the schema today. The event categorization fields are geared toward individual events and don't categorize the data source. Expanding the values we support allows us to align the user experience across ECS, Ingest Manager, and the Elastic website (elastic.co/integrations). Some additional context here: [#845 (comment)](https://github.com/elastic/ecs/pull/845#issuecomment-651414817).

Looks like the Markdown link got lost in the copy/paste.


These categories could also be used to categorize detection rules, mapping data sources to corresponding rules. This would improve our onboarding experience by suggesting detection rules to users based on the sources they are ingesting data from.
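As a rough sketch of the correlation and rule mapping this would enable (reusing the hypothetical data_source.category field from the sketch above), a query could select events from every source tagged with a given category, and a detection rule could declare which categories it needs:

```python
# Rough sketch, assuming the hypothetical data_source.category field above.
def events_from_category(category: str) -> dict:
    """Build an Elasticsearch-style query matching every source in a category."""
    return {"query": {"term": {"data_source.category": category}}}

print(events_from_category("security"))

# A rule tagged with the source categories it requires could be suggested
# only to users who are already ingesting data from matching sources.
hypothetical_rule = {
    "name": "Unusual AWS console login",
    "required_source_categories": ["cloud", "security"],
}
ingested_categories = {"cloud", "security", "proxy"}

if set(hypothetical_rule["required_source_categories"]) <= ingested_categories:
    print("suggest rule:", hypothetical_rule["name"])
```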


## People

The following are the people that consulted on the contents of this RFC.

* @jamiehynds | author
* @exekias | sponsor

## References

* https://github.com/elastic/ecs/issues/901
* https://github.com/elastic/ecs/pull/845
Contributor:

Could you add the link provided by @ruflin to the references, please?

Thanks for providing it, Nic 👍

However, let's make sure the link stands the test of time, and link via the latest tag rather than master:

Suggested change
* https://github.com/elastic/ecs/pull/845
* https://github.com/elastic/ecs/pull/845
* https://github.com/elastic/package-registry/blob/v0.12.1/util/package.go#L27


### RFC Pull Requests

<!-- An RFC should link to the PRs for each of its stage advancements. -->

* Stage 0: https://github.com/elastic/ecs/pull/958

<!--
* Stage 1: https://github.com/elastic/ecs/pull/NNN
...
-->