Dotted field names that conflict with objects #63530
Pinging @elastic/es-search (:Search/Mapping) |
Is there an estimated release version for this? To choose the workaround with the right tradeoffs, for the time being, it would help us to know when we could expect this enhancement to land. See also elastic/apm#347 (comment). |
Just a thought. Is it possible to do something similar to Java auto-boxing here? For example, when encountering the definitions above,
ES detects the potential collision, wraps things into a "super type", and produces these internally:
and the translation between the external and internal representations gets encapsulated somewhere and hidden from the outside world? |
Adding a suffix if there is a conflict was the initial idea we had as a work-around. However, that assumes that metrics that conflict are always sent together, or in a specific order, and we can't guarantee that in the context of APM agents as there might be more than one agent sending such data. If we can guarantee that sending both |
Hmm, I see, that's a good callout. What if ES suffixes every field the user defines, like so:
This basically causes every user-defined field to be of object type, and thus order shouldn't matter? |
Yes, but in that case, unless there is a way to make this transparent to the user, they will have to use |
Yes, exactly. Just to be clear, what I meant above was that the user should use and see the representation in the existing format
but internally these representations could effectively be converted into the following, which is hidden from the user (hence the __xxx notation here)
I think since there's already some field-processing logic to handle the dotted field name to object conversion, ES can potentially piggyback on that to support this conversion as well? |
We discussed offline and agreed that we'd prefer to consider this case under the Although we've spotted some limitations that we'll need to solve before switching to this solution:
|
An update on our thinking here: I'm experimenting with adding a flag to object fields that says 'everything under this object uses a flattened representation'; so anything using dot notation ends up as a field containing a dot, and documents using object notation will cause an indexing exception. So for the initial example here, the mappings would look like this:
And we can take as input both of the following formats:
But the following would throw an exception:
Objects with |
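The behaviour described in this comment can be sketched with a toy model. This is only an illustration under stated assumptions, not Elasticsearch's implementation: the flag is represented by a hypothetical `flat_objects` set of object paths, the function name `leaf_fields` is made up for this sketch, the sample documents are assumed, and leaf values are assumed to be scalars.

```python
def leaf_fields(doc, flat_objects, prefix=""):
    """Return {full_field_name: value} for a document.

    flat_objects is a hypothetical stand-in for the proposed flag: a set of
    object paths under which only leaf fields (no further objects) may appear.
    """
    fields = {}
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            # An object strictly below a flagged object is rejected.
            for flat in flat_objects:
                if path.startswith(flat + "."):
                    raise ValueError(
                        f"[{path}] is an object, but [{flat}] only allows leaf fields"
                    )
            fields.update(leaf_fields(value, flat_objects, path + "."))
        else:
            fields[path] = value
    return fields


# Two assumed input formats that yield the same field names.
dotted = {"metric.value": 10, "metric.value.max": 42}
nested = {"metric": {"value": 10, "value.max": 42}}
print(leaf_fields(dotted, {"metric"}) == leaf_fields(nested, {"metric"}))  # True
```

In this model, a document such as `{"metric": {"value": {"max": 42}}}` raises an error, because `metric.value` would have to be an object below the flagged `metric` object.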
@romseygeek perhaps a stupid question: will it be possible to use this with dynamic mapping when there's no common prefix for the fields? In APM we map metrics as they're provided by applications (within limitations of field names of course). So we don't have anything to hang a |
From testing I did a while ago in a different context, it's not possible to have top-level flattened fields. Trying to use them in dynamic mappings leads to errors. That's indeed an issue for the metrics use case. |
@axw @felixbarny it will be possible to use this with dynamic mapping, yes, and you will also be able to set |
Here's a sample document, which includes a contrived metric with a dotted name,
{
  "@timestamp": "2021-10-18T08:33:40.086Z",
  "agent": {
    "name": "opentelemetry/go",
    "version": "1.0.0"
  },
  "ecs": {
    "version": "1.12.0"
  },
  "event": {
    "ingested": "2021-10-18T08:33:46.302676058Z"
  },
  "metricset.name": "app",
  "observer": {
    "ephemeral_id": "e3784820-0da5-404b-a637-f9cb6c179196",
    "hostname": "goat",
    "id": "96a5e65c-0ee5-486f-8662-9c7a18d9381a",
    "type": "apm-server",
    "version": "8.0.0",
    "version_major": 8
  },
  "processor": {
    "event": "metric",
    "name": "metric"
  },
  "service": {
    "language": {
      "name": "go"
    },
    "name": "unknown_service_systemtest_test"
  },
  "service.latency": {
    "counts": [1, 1, 1, 1],
    "values": [50.5, 550, 5500, 10000]
  }
}
Note that there's a For dynamically mapping histogram fields, we're using the named dynamic_templates feature that was introduced in 7.13. When APM Server receives a histogram-type metric, it adds something like this to the document:
"_metric_descriptions": {
  "service.latency": {
    "type": "histogram"
  }
}
And then we use an ingest processor to map that to Would we just add/update our dynamic templates to set |
Let me double check, but I think this will work if you add the following:
With this included, you'll get fields I'll add a specific test to the PR to make sure this works. |
The complication here is that we don't want What I would like is to end up with a document that looks like this:
{
  "service": {
    "language": {
      "name": "go"
    },
    "name": "unknown_service_systemtest_test"
  },
  "service.latency": {
    "counts": [1, 1, 1, 1],
    "values": [50.5, 550, 5500, 10000]
  }
}
How would we flatten only |
I think we need to distinguish between what the document 'looks like' in json format (ie, what will be returned if you ask for
and this:
You address the fields via queries in exactly the same way for both formats, the |
Hi @axw, the example that you've given above will actually work with current versions of elasticsearch, because
AIUI we have some metrics mappings which are already doing field name manipulation to convert dots to underscores. Do you have an example of that type of mapping that we could work with? |
Makes sense. I've been conflating the _source structure and field names.
I'm not too sure what you're referring to. The metrics in question are all dynamically mapped. The APM Java agent has configuration to de-dot metrics that are sent to APM Server, maybe that's what you've heard about? |
Once this problem is fixed in core Elasticsearch I foresee the next level of problem reports will be that the core solution doesn't work in anomaly detection jobs, transforms, and possibly other areas (alerts as data?). Please keep us in the loop when changes are made so that we can work out if and how corresponding downstream changes need to be made. /cc @elastic/ml-core |
Another potential use-case where it might be convenient to have this feature is to store OpenTelemetry (OTel) attributes. |
Stopping in to voice some concern around client level support for these sorts of features. ES-Hadoop currently does not support reading documents that have dots in field names. The Hadoop connector takes care of converting JSON documents into records for use in Hadoop and Spark. We read mappings at the start of a job to determine what the structure of the record should be. Documents that don't match this structure because they have dotted field names could be adapted to how the mappings are laid out, but it would change the original structure of the document. To complicate things further, nothing stops users from writing these documents back to the source index at the end of the job. This causes an unpleasant situation where an update with no changes to the data has rewritten the
The suggested solution for users currently is to make use of the Dot Expander Processor to normalize the JSON before it is ingested into Elasticsearch if there is any chance that data might be read by ES-Hadoop. If we move forward with adding support for something like this then I think we need to not only rethink our advice for Hadoop, but also discuss how important preserving the exact format of a document's
/cc @masseyke @jakelandis |
This PR adds support for a new mapping parameter to the configuration of the object mapper (root as well as individual fields), that makes it possible to store metrics data where it's common to have fields with dots in their names in the following format:
```
{
  "metrics.time" : 10,
  "metrics.time.min" : 1,
  "metrics.time.max" : 500
}
```
Instead of expanding dotted paths to their corresponding object structure, objects can be configured to preserve dots in field names, in which case they can only hold leaf sub-fields and no further objects. The mapping parameter is called subobjects and controls whether an object can hold other objects (defaults to true) or not. The following example shows how it can be configured in the mappings:
```
{
  "mappings" : {
    "properties" : {
      "metrics" : {
        "type" : "object",
        "subobjects" : false
      }
    }
  }
}
```
Closes #63530
Elasticsearch assumes that dots in field names are an object separator. This means that a document such as this one:
is actually indexed as if it was formatted like below:
And in the mappings, this translates into two object fields called metric and metric.value and a long field called metric.value.max.
This proves problematic when ingesting metrics that come from external systems such as Micrometer or OpenTelemetry, as it's not rare to have both metric.value and metric.value.max as metric names:
Such a document will always fail indexing because metric.value would need to be an object field because of metric.value.max and a long field at the same time, which is illegal.
Some workarounds have been developed, such as replacing dots with underscores, or adding suffixes, but this creates a bad user experience in Kibana as users are not seeing the field names that they expect.
We should look into ways to make this supported in Elasticsearch.
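The conflict described in this issue can be reproduced outside Elasticsearch with a small sketch. This is a simplified model of dot expansion, not the actual mapping code: the function name `expand_dots` is hypothetical, the sample values are made up, and the input is assumed to be a flat document with scalar values.

```python
def expand_dots(doc):
    """Expand dotted keys into nested dicts, as if dots were object separators.

    Raises ValueError when a name would have to be both a leaf field and an
    object, which is the clash between metric.value and metric.value.max.
    """
    root = {}
    for dotted_name, value in doc.items():
        *parents, leaf = dotted_name.split(".")
        node = root
        for depth, part in enumerate(parents):
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                path = ".".join(parents[: depth + 1])
                raise ValueError(f"[{path}] is a leaf field, cannot also be an object")
            node = child
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"[{dotted_name}] is an object, cannot also be a leaf field")
        node[leaf] = value
    return root


print(expand_dots({"metric.value.max": 42}))
# {'metric': {'value': {'max': 42}}}
```

Calling `expand_dots({"metric.value": 10, "metric.value.max": 42})` raises a `ValueError` in this model, and it does so regardless of the order of the two keys.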