Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow generation of a subset of ECS and custom schema fields #737

Merged
merged 13 commits into from
Feb 13, 2020

Conversation

marshallmain
Copy link
Contributor

@marshallmain marshallmain commented Jan 30, 2020

This PR extends prior work on the intermediate data format introduced in #722 and provides a way to specify a subset of the full set of ECS fields plus any fields defined by a custom schema and render only that subset to the generated files. This would be particularly useful when multiple types of documents will be stored in the same index: one monolithic set of fields can be defined using the official ECS fields in addition to a custom schema that encompass every possible field for the documents, and the --subset parameter can be used to define which fields exist in each type of document.

An example subset document is

host:
  fields: "*"
base:
  fields:
    "@timestamp":
      fields: "*"
agent:
  fields:
    id:
      fields: "*"
    name:
      fields: "*"
event:
  fields: "*"

This subset includes every field in host, @timestamp from base, id and name from agent, and every field in event. The format supports an arbitrary level of nesting as well.

Finally, multiple subsets can be provided at once and they will be automatically merged together to produce the set of fields which is the union of all the subsets. This creates the exact mapping that is needed to support the set of documents.

Copy link

@alexk307 alexk307 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Recursive merging of two objects should have some tests 👀

@jonathan-buttner
Copy link
Contributor

I can definitely see the value in this so that you don't have to copy/paste a bunch of fields to describe a specific event (especially an alert since it's huge). I think there are a couple downsides though, the tooling doesn't support multiple nested fields so if you had a schema file like this:

- name: endgame
  description: >
    The endgame.
  title: Endgame
  group: 2
  short: short endgame.
  fields:
    - name: test
      type: group
      level: extended
      description: >
        The DNS packet identifier assigned by the program that generated the
        query. The identifier is copied to the response.
      fields:
        - name: final
          type: keyword
          description: >
            blah description.

The generated documents won't have the final field in them. To get around that you'd have to test it's own schema and put it in its own file. We could address that using some of the recursive changes in my PR as a follow up though.

The generated files will have a bunch of fields on them not probably aren't necessary for translation into C++ classes but that's probably not that big of a deal.

@marshallmain
Copy link
Contributor Author

@jonathan-buttner The other way to get around that issue is add a field called test.final, similar to https://github.com/elastic/ecs/blob/master/schemas/process.yml#L37. I would be in favor of adding support for the nested format in your example though since we saw that it's a very small change.

As for the generated files, if we want to generate files in the true nested format I think a good approach would be to implement another generator that renders the fields in the desired format. Ideally any new renderers would take the intermediate format as an input so we can move towards consolidating the nested and flat formats as well.

@marshallmain marshallmain requested a review from webmat January 31, 2020 17:20
@marshallmain
Copy link
Contributor Author

Thinking about this more the subset format should allow fields to be defined as either optional or required to make a better specification for clients that will use the mapping. This will require modifying the subset format to expect "options" and "fields" keys in each field dictionary. In the future support for more options could be added.

Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great addition! It implements one of the features we've been wanting to add as part of #497.

Thanks for submitting this :-)

Only thing I'd like to address is a small clarification of the changelog entry.

I also have questions to confirm usage, but we'll document this as part of another PR. Answering as a comment here is fine.

So other than these two small things, I'm good with this. But I notice you still have the "in progress" label applied. Are you good with merging this?

@@ -48,6 +55,8 @@ def argument_parser():
help='generate intermediary files only')
parser.add_argument('--include', action='store',
help='include user specified directory of custom field definitions')
parser.add_argument('--subset', action='store',
help='render a subset of the schema')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed I hadn't documented the usage of --include yet, so I won't ask you to add documentation for --subset, we'll do that as a separate PR. I opened #746 for this.

But just to make sure I understand the usage correctly, is it correct to say that given my custom fields in my-fields.yml and my subset selectors at my-subset.yml, the command would be the following?

scripts/generator.py --include my-fields.yml --subset my-subset.yml

Also, this supports passing multiple subsets as well? I'm not 100% sure about the Python args parser, how would this look at the command line?

Is it a repeatable argument (--subset my-subset.yml --subset my-other-subset.yml) or an argument that takes multiple values (--subset my-subset.yml my-other-subset.yml)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the combination of --include and --subset that's the right command to use.

Good point, multiple value support is an important part that I missed. I just pushed another commit up that enables the second style you mentioned (--subset my-subset.yml my-other-subset.yml). It can be combined with wildcards too, i.e. --subset dir1/*.yml dir2/*.yml would get all the subsets from both dir1 and dir2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. And for the record, the same is true of --include already

@marshallmain
Copy link
Contributor Author

@webmat The last thing I want to update before removing the in progress label and merging is the subset yml format - right now I think it's a bit too simple and inflexible. I'm planning to add one level between each field and the field's children where we can potentially add options or other fields to describe expected usage later. The use case I have in mind for this is marking whether a particular field is required for a doc type or simply optional. I'd like to add the extra level now so that we can add features like that later without making breaking changes to the subset yml format.

@webmat
Copy link
Contributor

webmat commented Feb 13, 2020

Thanks @marshallmain ❤️

For the record, after the latest changes, the "subset" yaml file becomes

host:
  fields: "*"
base:
  fields:
    "@timestamp":
      fields: "*"
agent:
  fields:
    id:
      fields: "*"
    name:
      fields: "*"
event:
  fields: "*"

This will be documented via #746

@webmat webmat merged commit fc7ab4e into elastic:master Feb 13, 2020
dcode pushed a commit to dcode/ecs that referenced this pull request Apr 15, 2020
…#737)

Usage:

python scripts/generator.py --subset my-field-whitelist.yml
@ebeahan ebeahan mentioned this pull request Jul 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants