Document the usage of the ECS generator #746

Closed
webmat opened this issue Feb 13, 2020 · 7 comments
Labels
documentation ready Issues we'd like to address in the future.

Comments

@webmat
Contributor

webmat commented Feb 13, 2020

The ECS generator is accruing features that let users generate their artifacts based on their additional custom fields.

We should document the usage of

Since this is low level advanced stuff, I think we could document this in the github repo in generated/README.md, with a mention in the main readme as well.

@webmat
Contributor Author

webmat commented Mar 25, 2020

Until I have time to put together coherent docs, I guess we can still drop something I can point to here.

Beats users should disregard this. Beats includes hundreds of other field definitions that aren't in ECS. Follow Beats docs to add custom fields to Beats.

These tools are experimental and should be used for custom indices only.

Prior to using these tools, the user should check out the git branch for the ECS version they are targeting. E.g. for ECS 1.5.0:

git checkout 1.5

The ECS tooling is built for Python 3.6+.

Help

python scripts/generator.py --help

Output

Generate ECS artifacts in a different directory

python scripts/generator.py --out ../myproject/ecs/out/

ECS + Custom fields

Generate ECS artifacts based on ECS + my custom fields.

# one or more yml files in a directory
python scripts/generator.py --include ../myproject/ecs/custom-fields/

See schemas/README.md, or the YAML files in the schemas directory, for the format these YAML files should follow.

Pick a subset of ECS

If your index will never populate some of the ECS fields, no need to have these field defs in your mapping. You can trim it down by creating a YAML file that indicates which field sets, or specific fields to include.

python scripts/generator.py --subset ../myproject/ecs/subset.yml

The structure of this YAML file should be as follows:

base:
  fields: "*"
event:
  fields: "*"
host:
  fields:
    name:
      fields: "*"

The above will generate a template that contains the following, and nothing else:

  • All base fields
  • All event.* fields
  • Only host.name, out of the host.* field set.
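
As a rough illustration of the filtering described above, here is a minimal Python sketch. It is not the generator's implementation: `filter_fields`, the toy `schema` dict and the field names in it are assumptions made for this example, and the subset is written as the dict the YAML above would parse to.

```python
def filter_fields(fields, subset):
    """Keep only the entries of `fields` that `subset` asks for.

    `fields: "*"` keeps everything under that name; a nested mapping
    narrows the selection one level further down.
    """
    kept = {}
    for name, spec in subset.items():
        if name not in fields:
            continue  # the subset names something the schema does not have
        wanted = spec.get("fields", "*")
        if wanted == "*":
            kept[name] = fields[name]
        else:
            kept[name] = filter_fields(fields[name], wanted)
    return kept


# Toy stand-in for the schema (hypothetical, heavily simplified).
schema = {
    "base": {"@timestamp": {}, "message": {}},
    "event": {"kind": {}, "category": {}},
    "host": {"name": {}, "ip": {}},
    "user": {"name": {}},
}

# Parsed form of the subset YAML shown above.
subset = {
    "base": {"fields": "*"},
    "event": {"fields": "*"},
    "host": {"fields": {"name": {"fields": "*"}}},
}

result = filter_fields(schema, subset)
print(sorted(result))          # ['base', 'event', 'host']
print(sorted(result["host"]))  # ['name']
```

In this toy model `user` disappears entirely and `host` keeps only `name`, which mirrors the three bullets above.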

Note that if you use --subset and --include together, your subset file should list the custom fields you're importing via --include. Otherwise --subset will filter them out right away :-)

Complete example

To generate a template

  • that contains only the ECS fields as described in subset.yml
  • then add custom fields from acme.yml
  • and output all generated artifacts to ../myproject/ecs/out/

A user could run:

python scripts/generator.py \
  --include ../myproject/ecs/custom-fields/acme.yml \
  --subset ../myproject/ecs/subset.yml \
  --out ../myproject/ecs/out/
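
If the command above needs to be driven from a script, a small helper can assemble the same arguments. This is only a sketch: `build_generator_command` is a hypothetical helper, not part of the ECS tooling, and it only knows about the three flags used in this comment.

```python
import shlex


def build_generator_command(include=None, subset=None, out=None):
    """Assemble the argv list for scripts/generator.py from optional paths."""
    cmd = ["python", "scripts/generator.py"]
    for flag, value in (("--include", include), ("--subset", subset), ("--out", out)):
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = build_generator_command(
    include="../myproject/ecs/custom-fields/acme.yml",
    subset="../myproject/ecs/subset.yml",
    out="../myproject/ecs/out/",
)
command_line = " ".join(shlex.quote(part) for part in cmd)
print(command_line)

# To actually run it from the root of an ECS checkout, one could use:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems if any of the paths contain spaces.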

Caveats:

  • The Elasticsearch sample templates generated by this (and otherwise hosted at generated/elasticsearch) are not production ready. The user is still expected to adjust their index pattern, their index settings & so on.

@webmat
Contributor Author

webmat commented May 21, 2020

#856 lets the user override the Elasticsearch template settings as well.

@rgmz
Contributor

rgmz commented Jul 2, 2020

Until I have time to put together coherent docs, I guess we can still drop something I can point to here.

Thank you, @webmat, this covers a lot of the questions I had.

As a new user, my initial thought process was as follows (you've covered a lot of these).

  1. While reading the ECS documentation:
    Can I generate schemas? (Is it possible? If it's possible, is it an internal tool or something meant for general usage?)

  2. After discovering generated/README.md:

    Various kinds of files or programs can be generated directly based on ECS.

    It's possible to generate schemas (I think -- "various kinds of files or programs" is vague), but how do I do it?

  3. After discovering the schemas folder:
    Okay, so this is how I define a schema for generation. How do I actually generate it?

  4. After discovering scripts and searching through issues and commits:
    There's no README.md, however, it seems that I can generate schemas (and "other files and programs").

    Based on my prior experience with Python I know that I need to:

    • Create a virtual environment (python -m venv venv; source venv/bin/activate)
    • Install the dependencies (pip install -r requirements.txt)

    Running generator.py yields the following error:

    Traceback (most recent call last):
      File "generator.py", line 93, in <module>
        main()
      File "generator.py", line 22, in main
        ecs_version = read_version(args.ref)
      File "generator.py", line 88, in read_version
        with open('version', 'r') as infile:
    FileNotFoundError: [Errno 2] No such file or directory: 'version'
    
  5. After realizing that the script needs to be run from the root, and not scripts/, it works:

    Loading schemas from local files
    Running generator. ECS version 1.6.0-dev
    

    I didn't realize that this script had generated any new files until I checked the git status (maybe include a final print statement for that).

  6. How do I use the generator to include my own schema files?
    Looking through the generator.py#argument_parser provides some insight, however it's not clear what each does / expects:

    • --intermediate-only - What is an intermediate file?
    • --include - What type of argument does this take?
      • What is a custom field definition? (I later realized it was the schemas/)
      • Can I pass a glob / specific file / directory?
      • Can I specify this multiple times or do my schemas need to be in a flat directory?
    • --subset
      • What is a "subset"? How do I define one? (Had to search through issues/commits to find an example usage)
      • Do I need to specify a directory like --include or only a specific file?
      • What relationship does this argument have to --include?
    • --template-settings and --mapping-settings - What does the input look like for these? Is it also a YAML file, or an Elasticsearch template without any properties? Is it a top-level JSON?
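
The working-directory error in step 4 comes from the script opening a relative path named version. A wrapper could guard against this with a check like the one sketched below. The helper `looks_like_ecs_root` is hypothetical and relies only on the two paths visible in this thread (version and scripts/generator.py); the demo runs against a throwaway directory, not a real checkout.

```python
import tempfile
from pathlib import Path


def looks_like_ecs_root(path):
    """Return True if `path` holds the files the generator expects to find."""
    root = Path(path)
    return (root / "version").is_file() and (root / "scripts" / "generator.py").is_file()


# Demo: an empty directory fails the check, a minimal fake layout passes it.
with tempfile.TemporaryDirectory() as tmp:
    before = looks_like_ecs_root(tmp)
    (Path(tmp) / "version").write_text("1.6.0-dev\n")
    (Path(tmp) / "scripts").mkdir()
    (Path(tmp) / "scripts" / "generator.py").write_text("")
    after = looks_like_ecs_root(tmp)

print(before, after)  # False True
```

Calling such a check on the current directory before invoking the generator would turn the FileNotFoundError above into a clearer "run this from the repository root" message.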

@ebeahan
Member

ebeahan commented Jul 2, 2020

@rgmz thank you so much for taking the time to document your experience as a new user. This feedback and perspective is extremely valuable! I've tried to address each of your questions (focusing on the ones you didn't answer along the way), but if I've overlooked any questions or concerns, please let me know.

The contributors documentation does include a bit more detail of initial setup for someone looking to contribute changes to ECS and covers running routine tasks via make, but we understand this shouldn't replace the need for a getting started or quick-start guide.


I didn't realize that this script had generated any new files until I checked the git status (maybe include a final print statement for that).

great usability suggestion 👍


Looking through the generator.py#argument_parser provides some insight, however it's not clear what each does / expects:

Yes, agreed again. @webmat's usage notes here are great, but we need to add them to the repo's documentation rather than requiring someone to search the issue backlog 😄 . We can also add better descriptions to the arguments themselves, so they show up in the generator.py script's help output. Also, as cited, some additional options have been added recently (--template-settings and --mapping-settings) that could use better documentation + example usage.


--intermediate-only - What is an intermediate file?

The intermediate files are the generator's in-memory representation of the schema, written out as files. They let generators and tools outside ECS' own tooling load a fleshed-out yet simplified view of the schema.

This option instructs generator.py to generate only these files.


--include - What type of argument does this take?

This argument currently accepts a single directory or multiple whitespace-separated directories: scripts/generator.py --include _testing/schemas _testing/schemas_two. Note that artifacts generated with --include contain the published schema as well as the provided custom schema. This lets users bring their own custom fields in addition to the ECS core/extended field sets.


Can I pass a glob / specific file / directory

Based on some quick testing and a review of the implementation, passing a specific filename, or a wildcard meant to match filenames, does not work today. I'm planning to review and better document the supported options for this arg.

Passing a wildcard pattern for directories does work: scripts/generator.py --include _testing/schemas*


What is a "subset"? How do I define one? (Had to search through issues/commits to find an example usage)

Yes, another candidate for better documentation. 😄 Subset is intended for the user to provide a "subset" YAML file that limits which fields appear in the generated output.

The best current resources are again @webmat's comment above as well as here (sounds like you may have come across both already). I will call out some ongoing discussion in this PR, which would be a breaking change to the existing subset YAML format in exchange for some additional functionality and flexibility.


Do I need to specify a directory like --include or only a specific file?

Currently it looks like directories are not supported, but using a wildcard pattern does work:

scripts/generator.py --subset _testing/subsets/* --out ./_testing/generated

File vs. wildcard vs. directory behavior is inconsistent from one option to the next, which is something we should improve.


What relationship does this argument have to --include?

--include passes custom fields that are combined with the ECS fields. --subset can then be used to generate artifacts containing only the fields listed in the subset YAML file, whether those fields come from ECS or from the custom fields provided via --include.


--template-settings and --mapping-settings - What does the input look like for these? Is it also a YAML file, or an Elasticsearch template without any properties? Is it a top-level JSON?

These options update the default mapping settings and template settings of the generated Elasticsearch templates with the values passed in. Both options take JSON files.

example template.json:

{
    "index_patterns": ["ecs-*"],
    "order": 1,
    "settings": {
        "index": {
            "mapping": {
                "total_fields": {
                    "limit": 10000
                }
            },
            "refresh_interval": "10s"
        }
    },
    "mappings": {}
}

example mapping.json:

{
    "_meta": {
        "version": "1.5.0"
    },
    "date_detection": false,
    "dynamic_templates": [
        {
            "strings_as_keyword": {
                "mapping": {
                    "ignore_above": 1024,
                    "type": "keyword"
                },
                "match_mapping_type": "string"
            }
        }
    ],
    "properties": {}
}

Note that in template.json the mappings object is empty ({}), and likewise properties is empty in mapping.json. The tooling fills these values into the template after the initial template body is created.
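
To picture how the two files relate, here is a minimal Python sketch. It is an assumption about how the pieces nest (mapping settings under the template's mappings key, generated field definitions under properties), not a copy of what generator.py does. `assemble_template` and the single `host.name` property are made up for the example, and the settings dicts are trimmed versions of the files above.

```python
import copy
import json

# Trimmed versions of the two example files, as parsed JSON.
template_settings = {
    "index_patterns": ["ecs-*"],
    "order": 1,
    "settings": {"index": {"refresh_interval": "10s"}},
    "mappings": {},
}
mapping_settings = {
    "_meta": {"version": "1.5.0"},
    "date_detection": False,
    "properties": {},
}


def assemble_template(template_settings, mapping_settings, properties):
    """Nest the mapping settings and field definitions into the template body."""
    template = copy.deepcopy(template_settings)
    mappings = copy.deepcopy(mapping_settings)
    mappings["properties"] = properties  # fills the empty placeholder
    template["mappings"] = mappings      # fills the empty placeholder
    return template


# Illustrative field definitions; the real ones come from the schema files.
properties = {"host": {"properties": {"name": {"type": "keyword"}}}}

template = assemble_template(template_settings, mapping_settings, properties)
print(json.dumps(template, indent=2))
```

The inputs are deep-copied so that the placeholder files stay empty and can be reused for another run.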


We're actively working on improving the ECS "Getting Started" experience and documentation and are aware there are current gaps, so again these notes are really a tremendous help!!

@webmat
Contributor Author

webmat commented Jul 3, 2020

100% agree we need to add this to the repo, that's why I opened the issue ;-) And I want to second Eric and say thank you for sharing such detailed notes on your experience <3

You can disregard --intermediate-only; I added it a long time ago as a debugging aid. Its purpose was to generate only generated/ecs/ecs_{nested,flat}.yml and stop after that (no docs, no ES template, no csv).

To explain more clearly --subset, its purpose is to allow you to generate artifacts that contain only a subset of the fields. ECS has a lot of fields. If for example you're creating an index for web logs, you may in your subset file specify that you want the fields from http, user_agent, network, user, source, destination and nothing else. If a data source will never populate the other fields, you don't need them in your index.

Subset has a quirk though: let's say you --include your custom fields, you have to make sure --subset also says that you want them. ECS + custom fields are merged first, and the subset filtering happens at the very end.
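
The ordering quirk can be shown with a small Python sketch. This is a simplified model, not the generator's code: `select_fields` is hypothetical, fields are flat dotted names, the subset is reduced to a set of top-level field set names, and `acme.account.id` is an invented custom field.

```python
def select_fields(ecs_fields, custom_fields, subset=None):
    """Merge ECS and custom fields first, then apply the subset filter last."""
    merged = set(ecs_fields) | set(custom_fields)
    if subset is None:
        return merged
    # Keep a field only if its top-level field set is named in the subset.
    return {name for name in merged if name.split(".")[0] in subset}


ecs_fields = {"http.request.method", "user_agent.original", "host.name"}
custom_fields = {"acme.account.id"}  # hypothetical custom field

# The subset forgets to mention the custom field set: it gets filtered out.
without_acme = select_fields(ecs_fields, custom_fields, subset={"http", "user_agent"})

# Listing the custom field set in the subset keeps it.
with_acme = select_fields(ecs_fields, custom_fields, subset={"http", "user_agent", "acme"})

print(sorted(without_acme))
print(sorted(with_acme))
```

Because filtering runs after the merge, the custom fields are treated exactly like ECS fields: anything the subset does not name is dropped.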

You highlight a lot of other things we can improve, thank you! We'll use this in our next rounds of improvements to the getting started experience for implementers 👍

@webmat
Contributor Author

webmat commented Jul 3, 2020

The --subset feature should soon get a boost in functionality, at the cost of a few breaking changes on how the file is put together. See #873

@ebeahan ebeahan added the ready Issues we'd like to address in the future. label Jul 13, 2020
@ebeahan
Member

ebeahan commented Jul 13, 2020

Added usage documentation for the ECS generator in #884.

@ebeahan ebeahan closed this as completed Jul 13, 2020