-
Notifications
You must be signed in to change notification settings - Fork 422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge custom and core multi_fields array #982
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jonathan-buttner! Sorry for taking a bit for an initial review.
The use-case makes good sense, and I think this will be a good addition to the tooling. After testing out the changes, I did have a couple of notes.
scripts/schema/loader.py
Outdated
def dedup_and_merge_lists(list_a, list_b): | ||
list_a_set = array_of_dicts_to_set(list_a) | ||
list_b_set = array_of_dicts_to_set(list_b) | ||
return set_of_sets_to_array(list_a_set | list_b_set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor issue I stumbled across while testing this out. Not sure it would be a blocker to merging, but worth noting the behavior.
The union will remove exact duplicate items:
> list_a_set
{frozenset({('name', 'text'), ('type', 'text')})}
> list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}
> list_a_set | list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}
But if the sets are not exact duplicates, it could lead to duplicate field names:
> list_a_set
{frozenset({('type', 'text'), ('name', 'text')})}
> list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'keyword'), ('name', 'text')})}
> list_a_set | list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'text'), ('name', 'text')}), frozenset({('type', 'keyword'), ('name', 'text')})}
schema include file:
---
- name: file
title: File
group: 2
short: Fields describing files.
description: >
Custom file
fields:
- name: path
multi_fields:
- name: caseless
type: keyword
normalizer: lowercase
- name: text
type: keyword <= I imagine this would only happen by accident 😃
Resulting intermediate state:
multi_fields:
- flat_name: file.path.caseless
ignore_above: 1024
name: caseless
normalizer: lowercase
type: keyword
- flat_name: file.path.text
ignore_above: 1024
name: text
type: keyword
- flat_name: file.path.text
name: text
norms: false
type: text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good catch, what do we think the expected behavior should be in this scenario? I could put in a check to ensure that two of the same name
fields don't exist in the resulting set and throw an error if they do? Or maybe just have core override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should dedupe on name
and take the most recent definition in the case of dupes (this would allow for overrides).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@webmat do you have any thoughts? I recall back in #864, logic was removed from the tooling to allow --include
supplied custom fields to be more permissive:
This means the tooling must now accept included files as they are, with all of the power this entails.
Perhaps we simply make sure to note that users need to be aware of introducing such duplicates fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @madirey. We should keep it simple and only ensure we have unique multi-field names.
The --include
option is meant to override, so the ideal behaviour is for a custom multi-field definition to replace or be merged with an entry of the same name. I'm on the fence on whether to merge/replace an entry of the same name, though. Happy to be convinced either way.
But to take a concrete example, let's say someone has tuned a normalizer that works well for user agent strings, I want them to be able to replace the default user_agent.original.text
multi-field with such a custom definition:
multi_fields:
- name: text
norms: false
type: text
normalizer: ua_normalizer
I think I have a preference with merging the pre-existing multi-field definitions of the same name, as this is more in line with how everything else is handled with custom fields. And it has the bonus of allowing a more terse custom definition:
- name: text
normalizer: ua_normalizer
Thanks for doing this! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this, that's a good addition!
Side note: you're using this to add a .caseless
multi-field, but with the coming of query param case_sensitive
in 7.10, are you sure you need this multi-field?
In any case, this is a good addition, this will make adjustments to multi-fields much smoother.
scripts/schema/loader.py
Outdated
def dedup_and_merge_lists(list_a, list_b): | ||
list_a_set = array_of_dicts_to_set(list_a) | ||
list_b_set = array_of_dicts_to_set(list_b) | ||
return set_of_sets_to_array(list_a_set | list_b_set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @madirey. We should keep it simple and only ensure we have unique multi-field names.
The --include
option is meant to override, so the ideal behaviour is for a custom multi-field definition to replace or be merged with an entry of the same name. I'm on the fence on whether to merge/replace an entry of the same name, though. Happy to be convinced either way.
But to take a concrete example, let's say someone has tuned a normalizer that works well for user agent strings, I want them to be able to replace the default user_agent.original.text
multi-field with such a custom definition:
multi_fields:
- name: text
norms: false
type: text
normalizer: ua_normalizer
I think I have a preference with merging the pre-existing multi-field definitions of the same name, as this is more in line with how everything else is handled with custom fields. And it has the bonus of allowing a more terse custom definition:
- name: text
normalizer: ua_normalizer
@jonathan-buttner Is this still a need, or are you pursuing using the new |
Sorry completely dropped the ball on this one. I've been trying to get some features done for 7.11. I think it'd be nice to still have this. I probably won't get to in until after feature freeze for 7.11 though, if that's ok. I don't think it's super important but would be nice to have it. |
@ebeahan I think this PR is in a better spot now haha. I updated the description as well but I took the approach deduping based on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adjusting and overriding based on the multi-field name 👍
I have comments on how the tests are put together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the one nit for the changelog, looks good! 👍
Thanks Mat and Eric! |
We'd like to introduce custom
multi_fields
definitions in the endpoint package's custom schema. An example of this is here:https://github.com/elastic/endpoint-package/pull/79/files#diff-7f0ee89a2e91f4b29aa03f75b80a16acR22-R26
Currently, the ECS scripts do not merge the
multi_fields
array but instead uses the custom schema's definition after merging the included files. Since the custom schema's definition overwrites the core schema's definition, the custom schema must include anymulti_fields
core elements in its definition otherwise they'll inadvertently be removed. The above example will result in thepath.text
field being removed: https://github.com/elastic/ecs/blob/master/schemas/file.yml#L62-L64This PR adds functionality to merge the custom
multi_fields
array with the core one. The approach I took was to convert the list into a map so we can perform deduplication. The keys in the map come from the list entries (which are a map)name
field. The included custom schema will override the core schema if it defines a multi_field entry with the samename
field.