Reinterpret dots in field names as object structure #79922

romseygeek · 2021-10-27T14:51:35Z

DocumentParser parses documents by following their object hierarchy, and
using a parallel hierarchy of ObjectMappers to work out how to map leaf fields.
Field names that contain dots complicate this, meaning that many methods
need to reverse-engineer the object hierarchy to check that the current parent
object mapper is the correct one; this is particularly complex when objects
are being created dynamically.

To simplify this logic, this commit introduces a DotExpandingXContentParser,
which will wrap another XContentParser and re-interpret a field name containing
dots as a series of objects. So for example, "foo.bar.baz":{ ... } is passed
to the DocumentParser as "foo":{"bar":{"baz":{...}}}. The central parsing
logic of DocumentParser detects when a field name has this form and temporarily
swaps out its context for a context with a wrapped parser.
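
As a rough illustration of the reinterpretation described above, the sketch below expands dotted keys in an in-memory map. It is not the DotExpandingXContentParser itself, which wraps an XContentParser and works on a token stream; the class name `DotExpansionSketch` and the method `expandDots` are hypothetical, and the sketch assumes well-formed names (no leading, trailing, or doubled dots).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DotExpansionSketch {

    /**
     * Returns a copy of source in which every dotted key is replaced by nested maps.
     * A later entry overwrites an earlier one that resolves to the same path.
     * Hypothetical helper for illustration only; names are not validated.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> expandDots(Map<String, Object> source) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map) {
                // Expand dotted names inside nested objects as well.
                value = expandDots((Map<String, Object>) value);
            }
            String[] path = entry.getKey().split("\\.");
            Map<String, Object> current = result;
            // Walk (or create) one intermediate object per leading path segment.
            for (int i = 0; i < path.length - 1; i++) {
                Object child = current.get(path[i]);
                if (!(child instanceof Map)) {
                    child = new LinkedHashMap<String, Object>();
                    current.put(path[i], child);
                }
                current = (Map<String, Object>) child;
            }
            // The last segment is the leaf field name.
            current.put(path[path.length - 1], value);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("foo.bar.baz", Map.of("k", "v"));
        System.out.println(expandDots(doc)); // prints {foo={bar={baz={k=v}}}}
    }
}
```

A consumer that only ever sees the expanded form can follow the object hierarchy directly, which is the simplification this change is after.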

elasticmachine · 2021-10-27T14:51:40Z

Pinging @elastic/es-search (Team:Search)

romseygeek · 2021-10-27T14:53:22Z

After this has been merged, handling dots in field names becomes much simpler, as we just check to see if the current parent object has flatten:true before wrapping the xcontent parser.

romseygeek · 2021-11-02T15:01:07Z

server/src/test/java/org/elasticsearch/index/mapper/DocumentParserTests.java

-        assertEquals(
-            "Cannot add a value for field [field.bar] since one of the intermediate objects is mapped as a nested object: [field]",
-            e.getMessage()
-        );


A nice upside of this change is that we can now handle dynamically mapped fields with dots that lead into a nested object.

nik9000

XContentParser bits look sensible. Left a suggestion for an extra thing to test.

After looking through the DocumentParser bit it looks like I was misunderstanding what this was doing the last time I read it. I'm actually a big fan now. I think I'll need to give it another look on Monday though before I can decide if it's, like, right. I mean, if the tests pass that's a huge thing, because this is mostly about replacing tricky code with a sensible abstraction. But I'd like to give it a look with a more clearerer head.

nik9000 · 2021-11-05T17:53:29Z

libs/x-content/src/test/java/org/elasticsearch/xcontent/DotExpandingXContentParserTests.java

+        assertEquals(XContentParser.Token.VALUE_STRING, parser.nextToken());
+        assertEquals("value2", parser.text());
+        assertEquals(XContentParser.Token.END_OBJECT, parser.nextToken());
+        assertNull(parser.nextToken());


It might be nice to test this with copyCurrentStructure and XContentHelper.convertToMap - those might make the tests a bit more readable. And might find extra fun sneaky stuff.

nik9000 · 2021-11-05T17:59:28Z

Monday

Er. I'm off on Monday. Wednesday. Or, someone else who understands document parsing can have it. I'm not picky.

romseygeek · 2021-11-23T14:52:07Z

@elasticmachine run elasticsearch-ci/part-1
@elasticmachine run elasticsearch-ci/part-2

mauro-r-hema · 2021-11-29T13:21:34Z

Hi, will this also be ported to 7.x versions?
We are affected by #80584.

With elastic#79922 we introduced a parser that expands dots in field names on the fly, so that the expansion no longer needs to be handled by consumers. The currentName method of this parser did not always return the expected name, compared to the corresponding parser that receives the document with already-expanded fields as input. This commit expands testing and addresses the issues that were found.

With elastic#79922 we introduced a parser that expands dots in field names on the fly, so that the expansion no longer needs to be handled by consumers. The token location exposed by this parser can be confusing to interpret: consumers parse the expanded version, which requires jumping ahead, reading tokens, and exposing additional field names and start objects, while users have sent the unexpanded version and would like errors to refer to the original content. This commit adds a test for this scenario and tweaks the DotExpandingXContentParser to cache the token location before jumping ahead to expand dots in field names.
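
The idea of caching the location before reading ahead can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual DotExpandingXContentParser code: `Token`, `ListParser`, and `ExpandingParser` are hypothetical stand-ins, and the input is assumed to be a flat list that alternates field names and values.

```java
import java.util.List;

public class CachedLocationSketch {

    /** Hypothetical token carrying the position it had in the original, unexpanded input. */
    static final class Token {
        final String text;
        final int line;
        final int column;

        Token(String text, int line, int column) {
            this.text = text;
            this.line = line;
            this.column = column;
        }
    }

    /** Hypothetical delegate parser: walks a fixed token list and reports its current position. */
    static final class ListParser {
        private final List<Token> tokens;
        private int index = -1;

        ListParser(List<Token> tokens) {
            this.tokens = tokens;
        }

        Token next() {
            index++;
            return index < tokens.size() ? tokens.get(index) : null;
        }

        String location() {
            Token t = tokens.get(Math.min(Math.max(index, 0), tokens.size() - 1));
            return t.line + ":" + t.column;
        }
    }

    /** Emits one segment of a dotted name per call, remembering where the name really was. */
    static final class ExpandingParser {
        private final ListParser delegate;
        private String[] segments = new String[0];
        private int nextSegment = 0;
        private String cachedLocation;

        ExpandingParser(ListParser delegate) {
            this.delegate = delegate;
        }

        String nextFieldName() {
            if (nextSegment < segments.length) {
                return segments[nextSegment++]; // still expanding; location stays cached
            }
            Token name = delegate.next();
            if (name == null) {
                return null;
            }
            cachedLocation = delegate.location(); // cache BEFORE reading ahead
            delegate.next(); // read ahead: the delegate now points at the value
            segments = name.text.split("\\.");
            nextSegment = 0;
            return segments[nextSegment++];
        }

        /** Location of the original dotted field name, not of the read-ahead token. */
        String location() {
            return cachedLocation;
        }
    }

    public static void main(String[] args) {
        ListParser delegate = new ListParser(List.of(new Token("foo.bar", 1, 2), new Token("value", 1, 12)));
        ExpandingParser parser = new ExpandingParser(delegate);
        System.out.println(parser.nextFieldName() + " @ " + parser.location()); // prints foo @ 1:2
        System.out.println(parser.nextFieldName() + " @ " + parser.location()); // prints bar @ 1:2
        System.out.println("delegate @ " + delegate.location());                // prints delegate @ 1:12
    }
}
```

Without the cached value, an error raised while handling `foo` or `bar` would point at the read-ahead position rather than at the field name the user actually sent.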

Previously, when using dynamic: false, an array field with a dot in its name, whose suffix matched a mapped field’s name, had its values merged with the mapped field unexpectedly. This has been fixed by elastic#79922. This commit adds a test for that scenario and verifies that the bug is fixed. Closes elastic#65333

We changed how copy_to is implemented in #79922, which moved the handling of dots in field names into a specialised parser. Unfortunately, while doing this we added a bug whereby every time a copy_to directive is processed for a nested field, the nested field's include_in_parent logic would be run, meaning that the parent would end up with multiple copies of the nested child's fields. This commit fixes this by only running include_in_parent when the parser is not in a copy_to context. It also fixes another bug that meant the parent document would contain multiple copies of the ID field. Fixes #87036
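
The guard described above can be pictured with a small sketch. The names here (`Context`, `forCopyTo`, `finishNestedObject`) are hypothetical and do not reflect the real DocumentParser API; the sketch only shows why skipping include_in_parent in a copy_to context avoids duplicate fields in the parent.

```java
import java.util.ArrayList;
import java.util.List;

public class CopyToContextSketch {

    /** Hypothetical parse context: knows whether it was created to process a copy_to directive. */
    static final class Context {
        final boolean inCopyTo;
        final List<String> parentFields;

        Context(boolean inCopyTo, List<String> parentFields) {
            this.inCopyTo = inCopyTo;
            this.parentFields = parentFields;
        }

        /** Derives a context for copy_to processing that shares the same parent document. */
        Context forCopyTo() {
            return new Context(true, parentFields);
        }
    }

    /** Hypothetical hook that runs when a nested object with include_in_parent has been parsed. */
    static void finishNestedObject(Context context, List<String> nestedFields) {
        if (context.inCopyTo) {
            // The ordinary parse already copied these fields into the parent.
            return;
        }
        context.parentFields.addAll(nestedFields);
    }

    public static void main(String[] args) {
        List<String> parent = new ArrayList<>();
        Context normal = new Context(false, parent);
        List<String> nested = List.of("nested.field");

        finishNestedObject(normal, nested);             // ordinary parse: copies once
        finishNestedObject(normal.forCopyTo(), nested); // copy_to pass: skipped
        finishNestedObject(normal.forCopyTo(), nested); // second copy_to pass: skipped

        System.out.println(parent); // prints [nested.field]
    }
}
```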

Reinterpret dots in field names as object structure

2e96997

romseygeek added the :Search Foundations/Mapping, >refactoring, and v8.1.0 labels Oct 27, 2021

romseygeek requested review from nik9000, jtibshirani and jimczi October 27, 2021 14:51

romseygeek self-assigned this Oct 27, 2021

elasticmachine added the Team:Search label Oct 27, 2021

romseygeek added 3 commits November 2, 2021 14:02

Merge remote-tracking branch 'origin/master' into mapper/fold-dots-in…

2a00aa1

…to-parser

fixup

56aeafa

skip rest-api-bwc test because error message has changed

56b9315

romseygeek commented Nov 2, 2021

romseygeek added 9 commits November 3, 2021 09:57

adjust bwc test versions

7d7a2f7

Merge remote-tracking branch 'origin/master' into mapper/fold-dots-in…

16435ea

…to-parser

tify

b4c7413

Merge remote-tracking branch 'origin/master' into mapper/fold-dots-in…

118e3fb

…to-parser

copy_to should use the same path as standard parsing

a41f3b9

error message

94aae5c

Merge remote-tracking branch 'origin/master' into mapper/fold-dots-in…

cc29e95

…to-parser

precommit

ef84ad4

Merge remote-tracking branch 'origin/master' into mapper/fold-dots-in…

f134a7e

…to-parser

nik9000 reviewed Nov 5, 2021

Merge remote-tracking branch 'origin/master' into mapper/fold-dots-in…

3866763

…to-parser

romseygeek mentioned this pull request Nov 10, 2021

Different behaviour between Object and Dotted notations on ingestion #80584

Closed

romseygeek added 2 commits November 10, 2021 11:54

Merge remote-tracking branch 'origin/master' into mapper/fold-dots-in…

41fe0c6

…to-parser

Add test for flattened fields with dotted inputs

a8a00c0

romseygeek added 2 commits November 23, 2021 10:55

Merge remote-tracking branch 'origin/master' into mapper/fold-dots-in…

daf74fb

…to-parser

stack -> deque

1ebd2e6

romseygeek merged commit 2e8a973 into elastic:master Nov 23, 2021

romseygeek deleted the mapper/fold-dots-into-parser branch November 23, 2021 15:52

romseygeek mentioned this pull request Dec 17, 2021

Construct dynamic updates directly via object builders #81449

Merged

javanna mentioned this pull request Jan 31, 2022

Adjust DotExpandingXContentParser behaviour #83313

Merged

javanna mentioned this pull request Mar 15, 2022

DotExpandingXContentParser to expose the original token location #84970

Merged

This was referenced Mar 17, 2022

Add test for previously broken behaviour on dotted array field #85081

Merged

Dotted named field values can merge with non-dotted fields of the same name #65333

Closed

javanna mentioned this pull request Mar 23, 2022

Don't parse unmapped array field when dynamic is set to false #85082

Merged

javanna mentioned this pull request Apr 28, 2022

Add 'flatten' parameter to object mappers #78997

Closed

romseygeek mentioned this pull request May 25, 2022

Don't run include_in_parent when in copy_to context #87123

Merged
