Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge custom and core multi_fields array #982

Merged
merged 8 commits into from
Jan 6, 2021

Conversation

jonathan-buttner
Copy link
Contributor

@jonathan-buttner jonathan-buttner commented Sep 29, 2020

We'd like to introduce custom multi_fields definitions in the endpoint package's custom schema. An example of this is here:
https://github.com/elastic/endpoint-package/pull/79/files#diff-7f0ee89a2e91f4b29aa03f75b80a16acR22-R26

---
- name: file
  title: File
  group: 2
  short: Fields describing files.
  description: >
    Custom file
  fields:
    - name: path
      multi_fields:
        - name: caseless
          type: keyword
          normalizer: lowercase

Currently, the ECS scripts do not merge the multi_fields array but instead uses the custom schema's definition after merging the included files. Since the custom schema's definition overwrites the core schema's definition, the custom schema must include any multi_fields core elements in its definition otherwise they'll inadvertently be removed. The above example will result in the path.text field being removed: https://github.com/elastic/ecs/blob/master/schemas/file.yml#L62-L64

This PR adds functionality to merge the custom multi_fields array with the core one. The approach I took was to convert the list into a map so we can perform deduplication. The keys in the map come from the list entries (which are a map) name field. The included custom schema will override the core schema if it defines a multi_field entry with the same name field.

@jonathan-buttner jonathan-buttner marked this pull request as ready for review September 29, 2020 13:49
Copy link
Member

@ebeahan ebeahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonathan-buttner! Sorry for taking a bit for an initial review.

The use-case makes good sense, and I think this will be a good addition to the tooling. After testing out the changes, I did have a couple of notes.

scripts/schema/loader.py Outdated Show resolved Hide resolved
def dedup_and_merge_lists(list_a, list_b):
list_a_set = array_of_dicts_to_set(list_a)
list_b_set = array_of_dicts_to_set(list_b)
return set_of_sets_to_array(list_a_set | list_b_set)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor issue I stumbled across while testing this out. Not sure it would be a blocker to merging, but worth noting the behavior.

The union will remove exact duplicate items:

> list_a_set
{frozenset({('name', 'text'), ('type', 'text')})}

> list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}

> list_a_set | list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}

But if the sets are not exact duplicates, it could lead to duplicate field names:

> list_a_set
{frozenset({('type', 'text'), ('name', 'text')})}

> list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'keyword'), ('name', 'text')})}

> list_a_set | list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'text'), ('name', 'text')}), frozenset({('type', 'keyword'), ('name', 'text')})}

schema include file:

---
  - name: file
    title: File
    group: 2
    short: Fields describing files.
    description: >
      Custom file
    fields:
      - name: path
        multi_fields:
          - name: caseless
            type: keyword
            normalizer: lowercase
          - name: text  
            type: keyword <= I imagine this would only happen by accident 😃

Resulting intermediate state:

        multi_fields:
        - flat_name: file.path.caseless
          ignore_above: 1024
          name: caseless
          normalizer: lowercase
          type: keyword
        - flat_name: file.path.text
          ignore_above: 1024
          name: text
          type: keyword
        - flat_name: file.path.text
          name: text
          norms: false
          type: text

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good catch, what do we think the expected behavior should be in this scenario? I could put in a check to ensure that two of the same name fields don't exist in the resulting set and throw an error if they do? Or maybe just have core override?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should dedupe on name and take the most recent definition in the case of dupes (this would allow for overrides).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@webmat do you have any thoughts? I recall back in #864, logic was removed from the tooling to allow --include supplied custom fields to be more permissive:

This means the tooling must now accept included files as they are, with all of the power this entails.

Perhaps we simply make sure to note that users need to be aware of introducing such duplicates fields?

Copy link
Contributor

@webmat webmat Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @madirey. We should keep it simple and only ensure we have unique multi-field names.

The --include option is meant to override, so the ideal behaviour is for a custom multi-field definition to replace or be merged with an entry of the same name. I'm on the fence on whether to merge/replace an entry of the same name, though. Happy to be convinced either way.

But to take a concrete example, let's say someone has tuned a normalizer that works well for user agent strings, I want them to be able to replace the default user_agent.original.text multi-field with such a custom definition:

        multi_fields:
        - name: text
          norms: false
          type: text
          normalizer: ua_normalizer 

I think I have a preference with merging the pre-existing multi-field definitions of the same name, as this is more in line with how everything else is handled with custom fields. And it has the bonus of allowing a more terse custom definition:

        - name: text
          normalizer: ua_normalizer 

@madirey
Copy link
Contributor

madirey commented Oct 12, 2020

Thanks for doing this!

Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this, that's a good addition!

Side note: you're using this to add a .caseless multi-field, but with the coming of query param case_sensitive in 7.10, are you sure you need this multi-field?

In any case, this is a good addition, this will make adjustments to multi-fields much smoother.

def dedup_and_merge_lists(list_a, list_b):
list_a_set = array_of_dicts_to_set(list_a)
list_b_set = array_of_dicts_to_set(list_b)
return set_of_sets_to_array(list_a_set | list_b_set)
Copy link
Contributor

@webmat webmat Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @madirey. We should keep it simple and only ensure we have unique multi-field names.

The --include option is meant to override, so the ideal behaviour is for a custom multi-field definition to replace or be merged with an entry of the same name. I'm on the fence on whether to merge/replace an entry of the same name, though. Happy to be convinced either way.

But to take a concrete example, let's say someone has tuned a normalizer that works well for user agent strings, I want them to be able to replace the default user_agent.original.text multi-field with such a custom definition:

        multi_fields:
        - name: text
          norms: false
          type: text
          normalizer: ua_normalizer 

I think I have a preference with merging the pre-existing multi-field definitions of the same name, as this is more in line with how everything else is handled with custom fields. And it has the bonus of allowing a more terse custom definition:

        - name: text
          normalizer: ua_normalizer 

@ebeahan ebeahan added the ready Issues we'd like to address in the future. label Nov 3, 2020
@ebeahan ebeahan removed the ready Issues we'd like to address in the future. label Nov 17, 2020
@ebeahan
Copy link
Member

ebeahan commented Dec 1, 2020

@jonathan-buttner Is this still a need, or are you pursuing using the new case_sensitive option instead?

@jonathan-buttner
Copy link
Contributor Author

@jonathan-buttner Is this still a need, or are you pursuing using the new case_sensitive option instead?

Sorry completely dropped the ball on this one. I've been trying to get some features done for 7.11. I think it'd be nice to still have this. I probably won't get to in until after feature freeze for 7.11 though, if that's ok. I don't think it's super important but would be nice to have it.

@jonathan-buttner
Copy link
Contributor Author

@ebeahan I think this PR is in a better spot now haha. I updated the description as well but I took the approach deduping based on the name field in the multi_field list of dictionaries.

Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adjusting and overriding based on the multi-field name 👍

I have comments on how the tests are put together.

scripts/tests/unit/test_schema_loader.py Outdated Show resolved Hide resolved
scripts/tests/unit/test_schema_loader.py Outdated Show resolved Hide resolved
scripts/tests/unit/test_schema_loader.py Outdated Show resolved Hide resolved
Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@ebeahan ebeahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the one nit for the changelog, looks good! 👍

CHANGELOG.next.md Show resolved Hide resolved
@jonathan-buttner
Copy link
Contributor Author

Thanks Mat and Eric!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants