Merge custom and core multi_fields array #982

jonathan-buttner · 2020-09-29T13:46:34Z

We'd like to introduce custom multi_fields definitions in the endpoint package's custom schema. An example of this is here:
https://github.com/elastic/endpoint-package/pull/79/files#diff-7f0ee89a2e91f4b29aa03f75b80a16acR22-R26

---
- name: file
  title: File
  group: 2
  short: Fields describing files.
  description: >
    Custom file
  fields:
    - name: path
      multi_fields:
        - name: caseless
          type: keyword
          normalizer: lowercase

Currently, the ECS scripts do not merge the multi_fields array; instead, after merging the included files, they use the custom schema's definition. Since the custom schema's definition overwrites the core schema's definition, the custom schema must repeat every core multi_fields entry in its own definition, otherwise those entries are inadvertently removed. The above example will result in the path.text field being removed: https://github.com/elastic/ecs/blob/master/schemas/file.yml#L62-L64

This PR adds functionality to merge the custom multi_fields array with the core one. The approach I took was to convert the list into a map so we can perform deduplication. Each list entry is itself a map, and its name field is used as the key. The included custom schema will override the core schema if it defines a multi_fields entry with the same name.
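A minimal sketch of the name-keyed merge described above. The function name `merge_multi_fields` and the sample entries are hypothetical and only illustrate the idea; this is not the code from the PR.

```python
def merge_multi_fields(core, custom):
    """Merge two multi_fields lists, keyed on each entry's 'name'.

    Hypothetical helper: a custom entry replaces a core entry of the same name.
    """
    # Index the core entries by name (dicts keep insertion order).
    by_name = {entry['name']: entry for entry in core}
    # Custom entries are applied last, so they win on a name clash.
    for entry in custom:
        by_name[entry['name']] = entry
    return list(by_name.values())


core = [{'name': 'text', 'type': 'text'}]
custom = [{'name': 'caseless', 'type': 'keyword', 'normalizer': 'lowercase'}]
merged = merge_multi_fields(core, custom)
```

With these inputs the core `text` entry is kept and `caseless` is appended, so the custom schema no longer has to repeat the core entries.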

ebeahan

Thanks @jonathan-buttner! Sorry for taking a bit for an initial review.

The use-case makes good sense, and I think this will be a good addition to the tooling. After testing out the changes, I did have a couple of notes.

scripts/schema/loader.py

ebeahan · 2020-10-07T15:17:27Z

+def dedup_and_merge_lists(list_a, list_b):
+    list_a_set = array_of_dicts_to_set(list_a)
+    list_b_set = array_of_dicts_to_set(list_b)
+    return set_of_sets_to_array(list_a_set | list_b_set)


Minor issue I stumbled across while testing this out. Not sure it would be a blocker to merging, but worth noting the behavior.

The union will remove exact duplicate items:

> list_a_set
{frozenset({('name', 'text'), ('type', 'text')})}
> list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}
> list_a_set | list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}

But if the sets are not exact duplicates, it could lead to duplicate field names:

> list_a_set
{frozenset({('type', 'text'), ('name', 'text')})}
> list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'keyword'), ('name', 'text')})}
> list_a_set | list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'text'), ('name', 'text')}), frozenset({('type', 'keyword'), ('name', 'text')})}

schema include file:

---
- name: file
  title: File
  group: 2
  short: Fields describing files.
  description: >
    Custom file
  fields:
    - name: path
      multi_fields:
        - name: caseless
          type: keyword
          normalizer: lowercase
        - name: text
          type: keyword     <= I imagine this would only happen by accident 😃

Resulting intermediate state:

multi_fields:
- flat_name: file.path.caseless
  ignore_above: 1024
  name: caseless
  normalizer: lowercase
  type: keyword
- flat_name: file.path.text
  ignore_above: 1024
  name: text
  type: keyword
- flat_name: file.path.text
  name: text
  norms: false
  type: text
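The behavior above can be reproduced with a small sketch. The bodies of `array_of_dicts_to_set` and `set_of_sets_to_array` are not shown in the diff, so the versions here are assumptions chosen to match the frozenset output printed above.

```python
def array_of_dicts_to_set(array_vals):
    # Assumed body: each dict becomes a frozenset of its (key, value) pairs.
    return {frozenset(d.items()) for d in array_vals}


def set_of_sets_to_array(set_vals):
    # Assumed body: turn each frozenset of pairs back into a dict.
    return [dict(pairs) for pairs in set_vals]


def dedup_and_merge_lists(list_a, list_b):
    list_a_set = array_of_dicts_to_set(list_a)
    list_b_set = array_of_dicts_to_set(list_b)
    return set_of_sets_to_array(list_a_set | list_b_set)


core = [{'name': 'text', 'type': 'text'}]
custom = [
    {'name': 'caseless', 'type': 'keyword', 'normalizer': 'lowercase'},
    {'name': 'text', 'type': 'keyword'},  # same name, different type
]
merged = dedup_and_merge_lists(core, custom)
# Set order is arbitrary, so sort the names before inspecting them.
names = sorted(entry['name'] for entry in merged)
```

Because the union compares whole entries rather than names, `names` contains `text` twice: the two `text` entries differ in `type`, so neither is treated as a duplicate of the other.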

jonathan-buttner · 2020-10-12

Oh good catch. What do we think the expected behavior should be in this scenario? I could add a check that no two entries in the resulting set share the same name, and throw an error if they do. Or maybe just have core override?

madirey · 2020-10-12

IMO we should dedupe on name and take the most recent definition in the case of dupes (this would allow for overrides).

ebeahan · 2020-10-13

@webmat do you have any thoughts? I recall back in #864, logic was removed from the tooling to allow --include supplied custom fields to be more permissive:

> This means the tooling must now accept included files as they are, with all of the power this entails.

Perhaps we simply make sure to note that users need to be aware of introducing such duplicate fields?

webmat · 2020-11-02

I agree with @madirey. We should keep it simple and only ensure we have unique multi-field names.

The --include option is meant to override, so the ideal behaviour is for a custom multi-field definition to replace or be merged with an entry of the same name. I'm on the fence on whether to merge/replace an entry of the same name, though. Happy to be convinced either way.

But to take a concrete example, let's say someone has tuned a normalizer that works well for user agent strings, I want them to be able to replace the default user_agent.original.text multi-field with such a custom definition:

multi_fields:
  - name: text
    norms: false
    type: text
    normalizer: ua_normalizer

I think I have a preference for merging the pre-existing multi-field definitions of the same name, as this is more in line with how everything else is handled with custom fields. And it has the bonus of allowing a more terse custom definition:

- name: text
  normalizer: ua_normalizer
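A rough sketch of the merge variant preferred here, where a custom entry is layered on top of the core entry of the same name instead of replacing it. The helper name `merge_multi_fields_by_name` is hypothetical, and the core entry is assumed to look like the text multi-field shown earlier.

```python
def merge_multi_fields_by_name(core, custom):
    """Hypothetical helper: update core entries with same-name custom entries."""
    # Copy each core entry so the inputs are not mutated.
    merged = {entry['name']: dict(entry) for entry in core}
    for entry in custom:
        # Existing names are updated key by key; new names are added as-is.
        merged.setdefault(entry['name'], {}).update(entry)
    return list(merged.values())


core = [{'name': 'text', 'type': 'text', 'norms': False}]
custom = [{'name': 'text', 'normalizer': 'ua_normalizer'}]
result = merge_multi_fields_by_name(core, custom)
```

Here the terse custom definition only supplies `normalizer`; `type` and `norms` are carried over from the core entry. With a replace strategy, the custom entry would have to spell out all of them.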

madirey · 2020-10-12T18:01:56Z

Thanks for doing this!

webmat

Thanks for submitting this, that's a good addition!

Side note: you're using this to add a .caseless multi-field, but with the coming of query param case_sensitive in 7.10, are you sure you need this multi-field?

In any case, this is a good addition; it will make adjustments to multi-fields much smoother.

ebeahan · 2020-12-01T14:38:34Z

@jonathan-buttner Is this still a need, or are you pursuing using the new case_sensitive option instead?

jonathan-buttner · 2020-12-01T15:21:08Z

> @jonathan-buttner Is this still a need, or are you pursuing using the new case_sensitive option instead?

Sorry, I completely dropped the ball on this one. I've been trying to get some features done for 7.11. I think it'd still be nice to have this. I probably won't get to it until after feature freeze for 7.11 though, if that's OK. I don't think it's super important, but it would be nice to have.

jonathan-buttner · 2021-01-04T20:32:35Z

@ebeahan I think this PR is in a better spot now haha. I updated the description as well, but I took the approach of deduping based on the name field in the multi_fields list of dictionaries.

webmat

Thanks for adjusting and overriding based on the multi-field name 👍

I have comments on how the tests are put together.

scripts/tests/unit/test_schema_loader.py

webmat

LGTM

ebeahan

Other than the one nit for the changelog, looks good! 👍

CHANGELOG.next.md

jonathan-buttner · 2021-01-06T17:16:38Z

Thanks Mat and Eric!

jonathan-buttner added 4 commits September 28, 2020 17:59

Deduping multi_fields lists

eae6bf5

Adding tests

9898651

Updating changelog

f0746ae

Fixing typo

d1953cf

jonathan-buttner marked this pull request as ready for review September 29, 2020 13:49

jonathan-buttner added the enhancement, Feature: ECS, and review labels Oct 2, 2020

jonathan-buttner requested review from webmat and ebeahan October 2, 2020 12:50

ebeahan reviewed Oct 7, 2020

Addressing feedback about default value

2ca5584

webmat reviewed Nov 2, 2020

ebeahan added the ready label Nov 3, 2020

ebeahan removed the ready label Nov 17, 2020

jonathan-buttner added 2 commits January 4, 2021 14:02

Merge branch 'master' of github.com:elastic/ecs into merge-multi-fields

23cc12c

Deduping on the name field, allowing the included schema to overwrite…

cb0e5b9

… the core

webmat reviewed Jan 5, 2021

Addressing feedback for tests

6eb01c0

webmat approved these changes Jan 6, 2021

ebeahan approved these changes Jan 6, 2021

webmat merged commit 2365a0d into elastic:master Jan 6, 2021

webmat added needs_backport 1.8.0 labels Jan 6, 2021

webmat mentioned this pull request Jan 6, 2021

[1.x] Merge custom and core multi_fields arrays (#982) #1213

Merged

webmat mentioned this pull request Jan 6, 2021

[1.8] Merge custom and core multi_fields arrays (#982) #1214

Merged

jonathan-buttner deleted the merge-multi-fields branch January 6, 2021 17:16

webmat removed the needs_backport label Jan 6, 2021

ebeahan removed the review label Jan 6, 2021
