Allow generation of a subset of ECS and custom schema fields #737

marshallmain · 2020-01-30T21:14:28Z

This PR extends prior work on the intermediate data format introduced in #722 and provides a way to specify a subset of the full set of ECS fields plus any fields defined by a custom schema and render only that subset to the generated files. This would be particularly useful when multiple types of documents will be stored in the same index: one monolithic set of fields can be defined using the official ECS fields in addition to a custom schema that encompass every possible field for the documents, and the --subset parameter can be used to define which fields exist in each type of document.

An example subset document is

host:
  fields: "*"
base:
  fields:
    "@timestamp":
      fields: "*"
agent:
  fields:
    id:
      fields: "*"
    name:
      fields: "*"
event:
  fields: "*"

This subset includes every field in host, @timestamp from base, id and name from agent, and every field in event. The format supports an arbitrary level of nesting as well.

Finally, multiple subsets can be provided at once and they will be automatically merged together to produce the set of fields which is the union of all the subsets. This creates the exact mapping that is needed to support the set of documents.

scripts/generator.py

alexk307

LGTM. Recursive merging of two objects should have some tests 👀

jonathan-buttner · 2020-01-30T22:26:43Z

I can definitely see the value in this so that you don't have to copy/paste a bunch of fields to describe a specific event (especially an alert since it's huge). I think there are a couple downsides though, the tooling doesn't support multiple nested fields so if you had a schema file like this:

- name: endgame
  description: >
    The endgame.
  title: Endgame
  group: 2
  short: short endgame.
  fields:
    - name: test
      type: group
      level: extended
      description: >
        The DNS packet identifier assigned by the program that generated the
        query. The identifier is copied to the response.
      fields:
        - name: final
          type: keyword
          description: >
            blah description.

The generated documents won't have the final field in them. To get around that you'd have to test it's own schema and put it in its own file. We could address that using some of the recursive changes in my PR as a follow up though.

The generated files will have a bunch of fields on them not probably aren't necessary for translation into C++ classes but that's probably not that big of a deal.

marshallmain · 2020-01-30T23:06:49Z

@jonathan-buttner The other way to get around that issue is add a field called test.final, similar to https://github.com/elastic/ecs/blob/master/schemas/process.yml#L37. I would be in favor of adding support for the nested format in your example though since we saw that it's a very small change.

As for the generated files, if we want to generate files in the true nested format I think a good approach would be to implement another generator that renders the fields in the desired format. Ideally any new renderers would take the intermediate format as an input so we can move towards consolidating the nested and flat formats as well.

marshallmain · 2020-02-03T15:40:41Z

Thinking about this more the subset format should allow fields to be defined as either optional or required to make a better specification for clients that will use the mapping. This will require modifying the subset format to expect "options" and "fields" keys in each field dictionary. In the future support for more options could be added.

webmat

This is a great addition! It implements one of the features we've been wanting to add as part of #497.

Thanks for submitting this :-)

Only thing I'd like to address is a small clarification of the changelog entry.

I also have questions to confirm usage, but we'll document this as part of another PR. Answering as a comment here is fine.

So other than these two small things, I'm good with this. But I notice you still have the "in progress" label applied. Are you good with merging this?

scripts/tests/test_ecs_helpers.py

CHANGELOG.next.md

webmat · 2020-02-13T15:28:54Z

scripts/generator.py

@@ -48,6 +55,8 @@ def argument_parser():
                        help='generate intermediary files only')
    parser.add_argument('--include', action='store',
                        help='include user specified directory of custom field definitions')
+    parser.add_argument('--subset', action='store',
+                        help='render a subset of the schema')


I noticed I hadn't documented the usage of --include yet, so I won't ask you to add documentation for --subset, we'll do that as a separate PR. I opened #746 for this.

But just to make sure I understand the usage correctly, is it correct to say that given my custom fields in my-fields.yml and my subset selectors at my-subset.yml, the command would be the following?

scripts/generator.py --include my-fields.yml --subset my-subset.yml

Also, this supports passing multiple subsets as well? I'm not 100% sure about the Python args parser, how would this look at the command line?

Is it a repeatable argument (--subset my-subset.yml --subset my-other-subset.yml) or an argument that takes multiple values (--subset my-subset.yml my-other-subset.yml)?

For the combination of --include and --subset that's the right command to use.

Good point, multiple value support is an important part that I missed. I just pushed another commit up that enables the second style you mentioned (--subset my-subset.yml my-other-subset.yml). It can be combined with wildcards too, i.e. --subset dir1/*.yml dir2/*.yml would get all the subsets from both dir1 and dir2.

Nice. And for the record, the same is true of --include already

Co-Authored-By: Mathieu Martin <[email protected]>

marshallmain · 2020-02-13T18:03:09Z

@webmat The last thing I want to update before removing the in progress label and merging is the subset yml format - right now I think it's a bit too simple and inflexible. I'm planning to add one level between each field and the field's children where we can potentially add options or other fields to describe expected usage later. The use case I have in mind for this is marking whether a particular field is required for a doc type or simply optional. I'd like to add the extra level now so that we can add features like that later without making breaking changes to the subset yml format.

webmat · 2020-02-13T19:19:17Z

Thanks @marshallmain ❤️

For the record, after the latest changes, the "subset" yaml file becomes

host:
  fields: "*"
base:
  fields:
    "@timestamp":
      fields: "*"
agent:
  fields:
    id:
      fields: "*"
    name:
      fields: "*"
event:
  fields: "*"

This will be documented via #746

…#737) Usage: python scripts/generator.py --subset my-field-whitelist.yml

marshallmain added 3 commits January 30, 2020 11:35

separate schema reading from schema merging

f6690b2

remove unnecessary check

1316189

add subset schema capability

c87510e

marshallmain commented Jan 30, 2020

View reviewed changes

scripts/generator.py Outdated Show resolved Hide resolved

marshallmain mentioned this pull request Jan 30, 2020

Add more support for custom nested style schemas #736

Closed

alexk307 approved these changes Jan 30, 2020

View reviewed changes

update tests

73ad57a

marshallmain added 2 commits January 30, 2020 18:09

linting stuff

7b017eb

add some tests

44145fc

marshallmain requested a review from webmat January 31, 2020 17:20

update changelog.next

06fe291

marshallmain added the in progress label Jan 31, 2020

use glob for subset filenames

666e866

This was referenced Feb 13, 2020

Document the usage of the ECS generator #746

Closed

Make ECS tooling friendly to generating custom templates based on ECS #497

Closed

webmat reviewed Feb 13, 2020

View reviewed changes

marshallmain and others added 4 commits February 13, 2020 12:13

support multiple values in subset parameter

489803c

Update CHANGELOG.next.md

23d5c37

Co-Authored-By: Mathieu Martin <[email protected]>

Merge branch 'master' into more-intermediate-usage

11c7274

linting strikes again

2334ee8

add extra level to subset schemas

60c1755

marshallmain removed the in progress label Feb 13, 2020

webmat merged commit fc7ab4e into elastic:master Feb 13, 2020

jonathan-buttner mentioned this pull request Feb 24, 2020

Add basic logs and metrics template elastic/package-registry#224

Merged

dcode pushed a commit to dcode/ecs that referenced this pull request Apr 15, 2020

Allow generation of a subset of ECS and custom schema fields (elastic…

fbc75d1

…#737) Usage: python scripts/generator.py --subset my-field-whitelist.yml

ebeahan mentioned this pull request Jul 9, 2020

Usage improvements #884

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow generation of a subset of ECS and custom schema fields #737

Allow generation of a subset of ECS and custom schema fields #737

marshallmain commented Jan 30, 2020 •

edited

Loading

alexk307 left a comment

jonathan-buttner commented Jan 30, 2020

marshallmain commented Jan 30, 2020

marshallmain commented Feb 3, 2020

webmat left a comment •

edited

Loading

webmat Feb 13, 2020

marshallmain Feb 13, 2020

webmat Feb 13, 2020

marshallmain commented Feb 13, 2020

webmat commented Feb 13, 2020

Allow generation of a subset of ECS and custom schema fields #737

Allow generation of a subset of ECS and custom schema fields #737

Conversation

marshallmain commented Jan 30, 2020 • edited Loading

alexk307 left a comment

Choose a reason for hiding this comment

jonathan-buttner commented Jan 30, 2020

marshallmain commented Jan 30, 2020

marshallmain commented Feb 3, 2020

webmat left a comment • edited Loading

Choose a reason for hiding this comment

webmat Feb 13, 2020

Choose a reason for hiding this comment

marshallmain Feb 13, 2020

Choose a reason for hiding this comment

webmat Feb 13, 2020

Choose a reason for hiding this comment

marshallmain commented Feb 13, 2020

webmat commented Feb 13, 2020

marshallmain commented Jan 30, 2020 •

edited

Loading

webmat left a comment •

edited

Loading