-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow generation of a subset of ECS and custom schema fields #737
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Recursive merging of two objects should have some tests 👀
I can definitely see the value in this so that you don't have to copy/paste a bunch of fields to describe a specific event (especially an alert since it's huge). I think there are a couple downsides though, the tooling doesn't support multiple nested fields so if you had a schema file like this: - name: endgame
description: >
The endgame.
title: Endgame
group: 2
short: short endgame.
fields:
- name: test
type: group
level: extended
description: >
The DNS packet identifier assigned by the program that generated the
query. The identifier is copied to the response.
fields:
- name: final
type: keyword
description: >
blah description. The generated documents won't have the The generated files will have a bunch of fields on them not probably aren't necessary for translation into C++ classes but that's probably not that big of a deal. |
@jonathan-buttner The other way to get around that issue is add a field called As for the generated files, if we want to generate files in the true nested format I think a good approach would be to implement another generator that renders the fields in the desired format. Ideally any new renderers would take the intermediate format as an input so we can move towards consolidating the nested and flat formats as well. |
Thinking about this more the subset format should allow fields to be defined as either optional or required to make a better specification for clients that will use the mapping. This will require modifying the subset format to expect "options" and "fields" keys in each field dictionary. In the future support for more options could be added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great addition! It implements one of the features we've been wanting to add as part of #497.
Thanks for submitting this :-)
Only thing I'd like to address is a small clarification of the changelog entry.
I also have questions to confirm usage, but we'll document this as part of another PR. Answering as a comment here is fine.
So other than these two small things, I'm good with this. But I notice you still have the "in progress" label applied. Are you good with merging this?
@@ -48,6 +55,8 @@ def argument_parser(): | |||
help='generate intermediary files only') | |||
parser.add_argument('--include', action='store', | |||
help='include user specified directory of custom field definitions') | |||
parser.add_argument('--subset', action='store', | |||
help='render a subset of the schema') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed I hadn't documented the usage of --include
yet, so I won't ask you to add documentation for --subset
, we'll do that as a separate PR. I opened #746 for this.
But just to make sure I understand the usage correctly, is it correct to say that given my custom fields in my-fields.yml
and my subset selectors at my-subset.yml
, the command would be the following?
scripts/generator.py --include my-fields.yml --subset my-subset.yml
Also, this supports passing multiple subsets as well? I'm not 100% sure about the Python args parser, how would this look at the command line?
Is it a repeatable argument (--subset my-subset.yml --subset my-other-subset.yml
) or an argument that takes multiple values (--subset my-subset.yml my-other-subset.yml
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the combination of --include and --subset that's the right command to use.
Good point, multiple value support is an important part that I missed. I just pushed another commit up that enables the second style you mentioned (--subset my-subset.yml my-other-subset.yml
). It can be combined with wildcards too, i.e. --subset dir1/*.yml dir2/*.yml
would get all the subsets from both dir1 and dir2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. And for the record, the same is true of --include
already
@webmat The last thing I want to update before removing the in progress label and merging is the subset yml format - right now I think it's a bit too simple and inflexible. I'm planning to add one level between each field and the field's children where we can potentially add options or other fields to describe expected usage later. The use case I have in mind for this is marking whether a particular field is required for a doc type or simply optional. I'd like to add the extra level now so that we can add features like that later without making breaking changes to the subset yml format. |
Thanks @marshallmain ❤️ For the record, after the latest changes, the "subset" yaml file becomes host:
fields: "*"
base:
fields:
"@timestamp":
fields: "*"
agent:
fields:
id:
fields: "*"
name:
fields: "*"
event:
fields: "*" This will be documented via #746 |
…#737) Usage: python scripts/generator.py --subset my-field-whitelist.yml
This PR extends prior work on the intermediate data format introduced in #722 and provides a way to specify a subset of the full set of ECS fields plus any fields defined by a custom schema and render only that subset to the generated files. This would be particularly useful when multiple types of documents will be stored in the same index: one monolithic set of fields can be defined using the official ECS fields in addition to a custom schema that encompass every possible field for the documents, and the --subset parameter can be used to define which fields exist in each type of document.
An example subset document is
This subset includes every field in
host
,@timestamp
frombase
,id
andname
fromagent
, and every field inevent
. The format supports an arbitrary level of nesting as well.Finally, multiple subsets can be provided at once and they will be automatically merged together to produce the set of fields which is the union of all the subsets. This creates the exact mapping that is needed to support the set of documents.