Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reinterpret dots in field names as object structure #79922

Merged
merged 22 commits into from
Nov 23, 2021

Conversation

romseygeek
Copy link
Contributor

DocumentParser parses documents by following their object hierarchy, and
using a parallel hierarchy of ObjectMappers to work out how to map leaf fields.
Field names that contain dots complicate this, meaning that many methods
need to reverse-engineer the object hierarchy to check that the current parent
object mapper is the correct one; this is particularly complex when objects
are being created dynamically.

To simplify this logic, this commit introduces a DotExpandingXContentParser,
which will wrap another XContentParser and re-interpret a field name containing
dots as a series of objects. So for example, "foo.bar.baz":{ ... } is passed
to the DocumentParser as "foo":{"bar":{"baz":{...}}}. The central parsing
logic of DocumentParser detects when a fieldname has this form and temporarily
swaps out its context for a context with a wrapped parser.

@romseygeek romseygeek added :Search Foundations/Mapping Index mappings, including merging and defining field types >refactoring v8.1.0 labels Oct 27, 2021
@romseygeek romseygeek self-assigned this Oct 27, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Oct 27, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@romseygeek
Copy link
Contributor Author

After this has been merged, handling dots in field names becomes much simpler, as we just check to see if the current parent object has flatten:true before wrapping the xcontent parser.

assertEquals(
"Cannot add a value for field [field.bar] since one of the intermediate objects is mapped as a nested object: [field]",
e.getMessage()
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice upside of this change is that we can now handle dynamically mapped fields with dots that lead into a nested object.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XContentParser bits look sensible. Left a suggestion for an extra thing to test.

After looking through the DocumentParser bit it looks like I was misunderstanding what this was doing the last time I read it. I'm actually a big fan now. I think I'll need to give it another look on Monday though before I can decide if it's, like, right. I mean, if the tests pass that's a huge thing, because this is mostly about replacing tricky code with a sensible abstraction. But I'd like to give it a look with a more clearerer head.

assertEquals(XContentParser.Token.VALUE_STRING, parser.nextToken());
assertEquals("value2", parser.text());
assertEquals(XContentParser.Token.END_OBJECT, parser.nextToken());
assertNull(parser.nextToken());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to test this with copyCurrentStructure and XContentHelper.convertToMap - those might make the tests a bit more readable. And might find extra fun sneaky stuff.

@nik9000
Copy link
Member

nik9000 commented Nov 5, 2021

Monday

Er. I'm off on Monday. Wednesday. Or, someone else who understand document parsing can have it. I'm not picky.

@romseygeek
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-1
@elasticmachine run elasticsearch-ci/part-2

@romseygeek romseygeek merged commit 2e8a973 into elastic:master Nov 23, 2021
@romseygeek romseygeek deleted the mapper/fold-dots-into-parser branch November 23, 2021 15:52
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 24, 2021
* upstream/master: (29 commits)
  Fix typo (elastic#80925)
  Increase docker compose timeouts for CI builds
  TSDB: fix error without feature flag (elastic#80945)
  [DOCS] Relocate `index.mapping.dimension_fields.limit` setting docs (elastic#80964)
  Explicit write methods for always-missing values (elastic#80958)
  TSDB: move TimeSeriesModeIT to yaml tests (elastic#80933)
  [ML] Removing temporary debug (elastic#80956)
  Remove unused ConnectTransportException#node (elastic#80944)
  Reinterpret dots in field names as object structure (elastic#79922)
  Remove obsolete typed legacy index templates (elastic#80937)
  Remove unnecessary shuffle in unassigned shards allocation. (elastic#65172)
  TSDB: Tests for nanosecond timeprecision timestamp just beyond the limit (elastic#80932)
  Cleanup SLM History Item .equals (elastic#80938)
  Rework breaking changes for new structure (elastic#80907)
  [DOCS] Fix elasticsearch-reset-password typo (elastic#80919)
  [ML] No need to use parent task client when internal infer delegates (elastic#80905)
  Fix shadowed vars pt6 (elastic#80899)
  add ignore info (elastic#80924)
  Fix several potential circuit breaker leaks in Aggregators (elastic#79676)
  Extract more standard metadata from binary files (elastic#78754)
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 24, 2021
* upstream/master: (319 commits)
  Fix typo (elastic#80925)
  Increase docker compose timeouts for CI builds
  TSDB: fix error without feature flag (elastic#80945)
  [DOCS] Relocate `index.mapping.dimension_fields.limit` setting docs (elastic#80964)
  Explicit write methods for always-missing values (elastic#80958)
  TSDB: move TimeSeriesModeIT to yaml tests (elastic#80933)
  [ML] Removing temporary debug (elastic#80956)
  Remove unused ConnectTransportException#node (elastic#80944)
  Reinterpret dots in field names as object structure (elastic#79922)
  Remove obsolete typed legacy index templates (elastic#80937)
  Remove unnecessary shuffle in unassigned shards allocation. (elastic#65172)
  TSDB: Tests for nanosecond timeprecision timestamp just beyond the limit (elastic#80932)
  Cleanup SLM History Item .equals (elastic#80938)
  Rework breaking changes for new structure (elastic#80907)
  [DOCS] Fix elasticsearch-reset-password typo (elastic#80919)
  [ML] No need to use parent task client when internal infer delegates (elastic#80905)
  Fix shadowed vars pt6 (elastic#80899)
  add ignore info (elastic#80924)
  Fix several potential circuit breaker leaks in Aggregators (elastic#79676)
  Extract more standard metadata from binary files (elastic#78754)
  ...
@mauro-r-hema
Copy link

Hi, Will this be ported also to 7.x versions?
We are affected by #80584

javanna added a commit to javanna/elasticsearch that referenced this pull request Jan 31, 2022
With elastic#79922 we have introduced a parser that expands dots in fields names on the fly, so that the expansion no longer needs to be handled by consumers.

The currentName method of such parser was not returning all the time the expected name, compared to the corresponding parser that receives as input the document with expanded fields.

This commit expands testing and addresses the issues that were found.
javanna added a commit that referenced this pull request Jan 31, 2022
With #79922 we have introduced a parser that expands dots in fields names on the fly, so that the expansion no longer needs to be handled by consumers.

The currentName method of such parser was not returning all the time the expected name, compared to the corresponding parser that receives as input the document with expanded fields.

This commit expands testing and addresses the issues that were found.
javanna added a commit to javanna/elasticsearch that referenced this pull request Mar 16, 2022
With elastic#79922 we have introduced a parser that expands dots in fields names on the fly, so that the expansion no longer needs to be handled by consumers.

The token location exposed by such parser can be confusing to interpret: consumers are parsing the expanded version which requires jumping ahead reading tokens and exposing additional field names and start objects, while users have sent the unexpanded version and would like errors to refer to the original content.

This commit adds a test for this scenario and tweaks the DotExpandingXContentParser to cache the token location before jumping ahead to expand dots in field names.
javanna added a commit that referenced this pull request Mar 16, 2022
)

With #79922 we have introduced a parser that expands dots in fields names on the fly, so that the expansion no longer needs to be handled by consumers.

The token location exposed by such parser can be confusing to interpret: consumers are parsing the expanded version which requires jumping ahead reading tokens and exposing additional field names and start objects, while users have sent the unexpanded version and would like errors to refer to the original content.

This commit adds a test for this scenario and tweaks the DotExpandingXContentParser to cache the token location before jumping ahead to expand dots in field names.
javanna added a commit to javanna/elasticsearch that referenced this pull request Mar 16, 2022
…stic#84970)

With elastic#79922 we have introduced a parser that expands dots in fields names on the fly, so that the expansion no longer needs to be handled by consumers.

The token location exposed by such parser can be confusing to interpret: consumers are parsing the expanded version which requires jumping ahead reading tokens and exposing additional field names and start objects, while users have sent the unexpanded version and would like errors to refer to the original content.

This commit adds a test for this scenario and tweaks the DotExpandingXContentParser to cache the token location before jumping ahead to expand dots in field names.
javanna added a commit that referenced this pull request Mar 16, 2022
)

With #79922 we have introduced a parser that expands dots in fields names on the fly, so that the expansion no longer needs to be handled by consumers.

The token location exposed by such parser can be confusing to interpret: consumers are parsing the expanded version which requires jumping ahead reading tokens and exposing additional field names and start objects, while users have sent the unexpanded version and would like errors to refer to the original content.

This commit adds a test for this scenario and tweaks the DotExpandingXContentParser to cache the token location before jumping ahead to expand dots in field names.
javanna added a commit to javanna/elasticsearch that referenced this pull request Mar 17, 2022
Previously, when using dynamic: false, an array field with a dot in its name, whose suffix matched
a mapped field’s name, had its values merged with the mapped field unexpectedly. This has been fixed by elastic#79922

This commit adds a test for that scenario and verifies that the bug is fixed.

Closes elastic#65333
javanna added a commit that referenced this pull request Mar 18, 2022
Previously, when using dynamic: false, an array field with a dot in its name, whose suffix matched
a mapped field’s name, had its values merged with the mapped field unexpectedly. This has been fixed by #79922

This commit adds a test for that scenario and verifies that the bug is fixed.

Closes #65333
romseygeek added a commit that referenced this pull request Jun 8, 2022
We changed how copy_to is implemented in #79922, which moved
the handling of dots in field names into a specialised parser. Unfortunately,
while doing this we added a bug whereby every time a copy_to directive
is processed for a nested field, the nested field's include_in_parent logic
would be run, meaning that the parent would end up with multiple copies
of the nested child's fields.

This commit fixes this by only running include_in_parent when the parser
is not in a copy_to context. It also fixes another bug that meant the parent
document would contain multiple copies of the ID field.

Fixes #87036
romseygeek added a commit to romseygeek/elasticsearch that referenced this pull request Jun 8, 2022
We changed how copy_to is implemented in elastic#79922, which moved
the handling of dots in field names into a specialised parser. Unfortunately,
while doing this we added a bug whereby every time a copy_to directive
is processed for a nested field, the nested field's include_in_parent logic
would be run, meaning that the parent would end up with multiple copies
of the nested child's fields.

This commit fixes this by only running include_in_parent when the parser
is not in a copy_to context. It also fixes another bug that meant the parent
document would contain multiple copies of the ID field.

Fixes elastic#87036
elasticsearchmachine pushed a commit that referenced this pull request Jun 8, 2022
We changed how copy_to is implemented in #79922, which moved
the handling of dots in field names into a specialised parser. Unfortunately,
while doing this we added a bug whereby every time a copy_to directive
is processed for a nested field, the nested field's include_in_parent logic
would be run, meaning that the parent would end up with multiple copies
of the nested child's fields.

This commit fixes this by only running include_in_parent when the parser
is not in a copy_to context. It also fixes another bug that meant the parent
document would contain multiple copies of the ID field.

Fixes #87036
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>refactoring :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants