Dotted field names that conflict with objects #63530

jpountz · 2020-10-12T07:26:15Z

Elasticsearch assumes that dots in field names are an object separator. This means that a document such as this one:

{
  "metric.value.max": 42
}

is actually indexed as if it were formatted like this:

{
  "metric": {
    "value": {
      "max": 42
    }
  }
}

And in the mappings, this translates into two object fields called metric and metric.value and a long field called metric.value.max.

This proves problematic when ingesting metrics that come from external systems such as Micrometer or OpenTelemetry, as it's not rare to have both metric.value and metric.value.max as metric names:

{
  "metric.value": 10,
  "metric.value.max": 42
}

Such a document will always fail indexing: metric.value would need to be an object field (because of metric.value.max) and a long field at the same time, which is illegal.
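To make the conflict concrete, here is a minimal Python sketch of the dot-expansion rule described above. It is not Elasticsearch's actual parser; `expand_dots` is a hypothetical helper, and it assumes a flat document whose values are scalars.

```python
def expand_dots(doc):
    """Expand dotted keys into nested dicts (illustrative only, not the real parser)."""
    root = {}
    for dotted_name, value in doc.items():
        *parents, leaf = dotted_name.split(".")
        node = root
        for i, part in enumerate(parents):
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # A parent segment was already stored as a leaf value.
                path = ".".join(parents[: i + 1])
                raise ValueError(f"[{path}] is a leaf and cannot also be an object")
            node = child
        if isinstance(node.get(leaf), dict):
            # The leaf name was already created as an object by a longer path.
            raise ValueError(f"[{dotted_name}] is an object and cannot also be a leaf")
        node[leaf] = value
    return root


print(expand_dots({"metric.value.max": 42}))

try:
    expand_dots({"metric.value": 10, "metric.value.max": 42})
except ValueError as err:
    print("rejected:", err)
```

The second call is rejected regardless of key order, which mirrors why the document above cannot be indexed.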

Some workarounds have been developed, such as replacing dots with underscores, or adding suffixes, but this creates a bad user experience in Kibana as users are not seeing the field names that they expect.

We should look into ways to make this supported in Elasticsearch.

elasticmachine · 2020-10-12T07:26:16Z

Pinging @elastic/es-search (:Search/Mapping)

felixbarny · 2020-10-14T07:22:38Z

> Some workarounds have been developed, such as replacing dots with underscores, or adding suffixes, but this creates a bad user experience in Kibana as users are not seeing the field names that they expect.

Is there an estimated release version for this? To choose the workaround with the right tradeoffs, for the time being, it would help us to know when we could expect this enhancement to land. See also elastic/apm#347 (comment).

zacharymorn · 2020-10-15T04:06:06Z

Just a thought. Is it possible to do something similar to Java auto-boxing here? For example, when encountering the definitions above

{
  "metric.value": 10,
  "metric.value.max": 42
}

ES detects the potential collision, wraps things into a "super type", and produces these internally:

{
  "metric.value.__long_value": 10,
  "metric.value.max": 42
}

and the translation between the external and internal representations gets encapsulated somewhere and hidden from the outside world?

SylvainJuge · 2020-10-15T07:08:11Z

Adding a suffix if there is a conflict was the initial idea we had as a work-around.

However, that assumes that metrics that conflict are always sent together, or in a specific order, and we can't guarantee that in the context of APM agents as there might be more than one agent sending such data.

If we can guarantee that sending both metric.value then metric.value.max independently and in any order produces the same result (at least to the end-user, internal storage may differ), that could work though.

zacharymorn · 2020-10-15T07:53:10Z

Hmm, I see, that's a good callout. What if ES suffixes every field the user defines, like so:

{
  "metric.value.__long_value": 10,
  "metric.value.max.__long_value": 42
}

This basically causes every user-defined field to be of object type, and thus order shouldn't matter?

SylvainJuge · 2020-10-15T11:38:57Z

Yes, but in that case, unless there is a way to make this transparent to the user, they will have to use __long_value to query their metrics.

zacharymorn · 2020-10-16T05:12:27Z

Yes exactly.

Just to be clear, what I meant above was that the user should use and see the representation in the existing format

{
  "metric.value": 10,
  "metric.value.max": 42
}

but internally these representations could effectively be converted into the following, which is hidden from the user (hence the __xxx notation here)

{
  "metric.value.__long_value": 10,
  "metric.value.max.__long_value": 42
}

I think since there's already some field processing logic to handle the dotted field name to object conversion, ES can potentially piggyback on that to support this conversion as well?
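A rough Python sketch of the translation being proposed here, purely for illustration. Nothing like this exists in Elasticsearch; `to_internal`, `to_external` and `has_object_leaf_conflict` are hypothetical helpers, and the sketch assumes every value is a long and that no user field already ends in the reserved suffix.

```python
SUFFIX = ".__long_value"  # hypothetical internal marker, taken from the comment above


def to_internal(doc):
    """Append the hidden suffix to every user-defined field name."""
    return {name + SUFFIX: value for name, value in doc.items()}


def to_external(internal_doc):
    """Strip the hidden suffix again so users only see the original names."""
    return {
        (name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name): value
        for name, value in internal_doc.items()
    }


def has_object_leaf_conflict(names):
    """True if one name would have to be both a leaf and the parent of another."""
    names = set(names)
    return any(other.startswith(name + ".") for name in names for other in names)


external = {"metric.value": 10, "metric.value.max": 42}
internal = to_internal(external)

print(has_object_leaf_conflict(external))  # the user-facing names collide
print(has_object_leaf_conflict(internal))  # the suffixed names do not
print(to_external(internal) == external)   # the round trip is lossless
```

Because every internal leaf ends in the suffix, no leaf name can be a parent of another field, so the order in which documents arrive stops mattering.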

jimczi · 2021-02-02T17:56:31Z

We discussed offline and agreed that we'd prefer to consider this case under the flattened field use case.
For solutions that cannot control the name or the shape of the fields, the flattened field is a simple and powerful choice.

Although we've spotted some limitations that we'll need to solve before switching to this solution:

We should ensure that flattened field accepts foo.value and foo.value.max explicitly.
We need to handle numerics, not only keyword.
Kibana needs to support flattened type.
We need a suggester for field names that are under a flattened field. They don't appear in the mapping so we need to help users when writing queries and aggregations. That should help for the integration in Kibana.

romseygeek · 2021-10-07T12:45:21Z

An update on our thinking here: I'm experimenting with adding a flag to object fields that says 'everything under this object uses a flattened representation'; so anything using dot notation ends up as a field containing a dot, and documents using object notation will cause an indexing exception.

So for the initial example here, the mappings would look like this:

{
  "properties" : {
    "metric" : {
      "type" : "object",
      "flattened" : true,
      "properties" : {
        "value" : { "type" : "long" },
        "value.max" : { "type" : "long" }
      }
    }
  }
}

And we can take as input both of the following formats:

{ "metric" : { "value" : 10, "value.max" : 15 } }
{ "metric.value" : 10, "metric.value.max" : 15 }

But the following would throw an exception:

{ "metric" : { "value" : { "max" : 15 } } }

Objects with flattened=true can only contain leaf fields in their properties section.
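The accept/reject rule above can be sketched in a few lines of Python. This only illustrates the described behaviour and is not the implementation in the PR; `leaf_fields` is a hypothetical helper that handles a single flattened object named `metric`.

```python
def leaf_fields(doc, flat_object="metric"):
    """Return the leaf field names a flattened object would produce (illustrative)."""
    fields = {}
    for key, value in doc.items():
        if key == flat_object and isinstance(value, dict):
            # Object notation one level deep: { "metric": { "value.max": 15 } }
            items = list(value.items())
        elif key.startswith(flat_object + "."):
            # Dot notation: { "metric.value.max": 15 }
            items = [(key[len(flat_object) + 1:], value)]
        else:
            raise ValueError(f"[{key}] is outside the flattened object in this sketch")
        for name, leaf in items:
            if isinstance(leaf, dict):
                # Nested objects are rejected: only leaf fields are allowed.
                raise ValueError(f"[{flat_object}.{name}] must be a leaf field")
            fields[f"{flat_object}.{name}"] = leaf
    return fields


print(leaf_fields({"metric": {"value": 10, "value.max": 15}}))
print(leaf_fields({"metric.value": 10, "metric.value.max": 15}))

try:
    leaf_fields({"metric": {"value": {"max": 15}}})
except ValueError as err:
    print("rejected:", err)
```

Both accepted inputs yield the same two leaf fields, `metric.value` and `metric.value.max`, while the nested form is rejected.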

axw · 2021-10-16T09:31:05Z

@romseygeek perhaps a stupid question: will it be possible to use this with dynamic mapping when there's no common prefix for the fields? In APM we map metrics as they're provided by applications (within limitations of field names of course). So we don't have anything to hang a "flattened": true off -- unless this will work with a dynamic template?

felixbarny · 2021-10-18T07:14:38Z

From testing I did a while ago in a different context, it's not possible to have top-level flattened fields. Trying to use them in dynamic mappings leads to errors. That's indeed an issue for the metrics use case.

romseygeek · 2021-10-18T08:05:19Z

@axw @felixbarny it will be possible to use this with dynamic mapping, yes, and you will also be able to set flatten:true on the root object so that everything is interpreted as a flat field. @felixbarny I think you're referring here to the flattened field type, which works slightly differently? It would be good to get some example inputs so that we can check things will work as needed.

axw · 2021-10-18T08:46:47Z

Here's a sample document, which includes a contrived metric with a dotted name, service.latency.

{
  "@timestamp": "2021-10-18T08:33:40.086Z",
  "agent": {
    "name": "opentelemetry/go",
    "version": "1.0.0"
  },
  "ecs": {
    "version": "1.12.0"
  },
  "event": {
    "ingested": "2021-10-18T08:33:46.302676058Z"
  },
  "metricset.name": "app",
  "observer": {
    "ephemeral_id": "e3784820-0da5-404b-a637-f9cb6c179196",
    "hostname": "goat",
    "id": "96a5e65c-0ee5-486f-8662-9c7a18d9381a",
    "type": "apm-server",
    "version": "8.0.0",
    "version_major": 8
  },
  "processor": {
    "event": "metric",
    "name": "metric"
  },
  "service": {
    "language": {
      "name": "go"
    },
    "name": "unknown_service_systemtest_test"
  },
  "service.latency": {
    "counts": [
      1,
      1,
      1,
      1
    ],
    "values": [
      50.5,
      550,
      5500,
      10000
    ]
  }
}

Note that there's a service field which should not have flattened keys -- only the specific service.latency metric field should have a flattened key. So setting flatten: true at the root might be a pain.

For dynamically mapping histogram fields, we're using the named dynamic_templates feature that was introduced in 7.13. When APM Server receives a histogram-type metric, it adds something like this to the document:

"_metric_descriptions": {
  "service.latency": {
    "type": "histogram"
  }
}

And then we use an ingest processor to map that to dynamic_templates.

Would we just add/update our dynamic templates to set flatten: true then?

romseygeek · 2021-10-18T09:04:05Z

Let me double check, but I think this will work if you add the following:

"_metric_service": {
  "service" : {
    "type" : "object",
    "flatten" : "true"
  }
}

With this included, you'll get fields service.name, service.language.name and service.latency.

I'll add a specific test to the PR to make sure this works.

axw · 2021-10-19T02:57:35Z

> With this included, you'll get fields service.name, service.language.name and service.latency.

The complication here is that we don't want service.name and service.language.name. Those are known, statically mapped fields, and should not be flattened. It's just service.latency that is dynamically mapped and should be flattened.

What I would like is to end up with a document that looks like this:

{
  "service": {
    "language": {
      "name": "go"
    },
    "name": "unknown_service_systemtest_test"
  },
  "service.latency": {
    "counts": [
      1,
      1,
      1,
      1
    ],
    "values": [
      50.5,
      550,
      5500,
      10000
    ]
  }
}

How would we flatten only service.latency?

romseygeek · 2021-10-19T08:05:00Z

I think we need to distinguish between what the document 'looks like' in json format (ie, what will be returned if you ask for _source) and how it is stored and queried internally. Functionally there's no difference between this:

{
  "service" : {
    "language" : {
      "name" : "go"
    },
    "name" : "unknown"
  }
}

and this:

{
  "service.language.name" : "go",
  "service.name" : "unknown"
}

You address the fields via queries in exactly the same way for both formats, the fields output will be identical, etc. The object structure is only present in source, it makes no difference at all for queries or aggregations.
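As a small illustration of that point, the Python sketch below derives field paths from both source shapes and shows that they are identical. `field_paths` is a hypothetical helper written for this example, not an Elasticsearch API.

```python
def field_paths(source, prefix=""):
    """Map a _source-like dict to {full.field.path: value} (illustrative only)."""
    fields = {}
    for key, value in source.items():
        path = prefix + key
        if isinstance(value, dict):
            # Object structure only contributes a path prefix.
            fields.update(field_paths(value, path + "."))
        else:
            fields[path] = value
    return fields


nested = {"service": {"language": {"name": "go"}, "name": "unknown"}}
dotted = {"service.language.name": "go", "service.name": "unknown"}

print(field_paths(nested))
print(field_paths(nested) == field_paths(dotted))
```

Both shapes resolve to the fields service.language.name and service.name, which is why queries and aggregations cannot tell them apart.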

romseygeek · 2021-11-03T09:54:08Z

Hi @axw, the example that you've given above will actually work with current versions of elasticsearch, because service is always interpreted as an object. The point of this issue is to deal with situations that look more like this:

{
  "service": "unknown_service_systemtest_test",
  "service.latency": {
    "counts": [
      1,
      1,
      1,
      1
    ],
    "values": [
      50.5,
      550,
      5500,
      10000
    ]
  }
}

AIUI we have some metrics mappings which are already doing filename manipulation to convert dots to underscores, do you have an example of those type of mappings that we could work with?

axw · 2021-11-15T02:09:06Z

> Hi @axw, the example that you've given above will actually work with current versions of elasticsearch, because service is always interpreted as an object. The point of this issue is to deal with situations that look more like this:

Makes sense. I've been conflating the _source structure and field names.

> AIUI we have some metrics mappings which are already doing filename manipulation to convert dots to underscores, do you have an example of those type of mappings that we could work with?

I'm not too sure what you're referring to. The metrics in question are all dynamically mapped. The APM Java agent has configuration to de-dot metrics that are sent to APM Server, maybe that's what you've heard about?

droberts195 · 2022-02-04T10:56:27Z

Once this problem is fixed in core Elasticsearch I foresee the next level of problem reports will be that the core solution doesn't work in anomaly detection jobs, transforms, and possibly other areas (alerts as data?). Please keep us in the loop when changes are made so that we can work out if and how corresponding downstream changes need to be made.

/cc @elastic/ml-core

SylvainJuge · 2022-02-11T14:09:12Z

Another potential use-case where it might be convenient to have this feature is to store OpenTelemetry (OTel) attributes.
They are just a map where keys are always dotted strings and values can be mapped to JSON equivalents (int, bool, string or array).

jbaiera · 2022-03-28T21:23:15Z

Stopping in to voice some concern around client level support for these sorts of features.

ES-Hadoop currently does not support reading documents that have dots in field names. The Hadoop connector takes care of converting JSON documents into records for use in Hadoop and Spark. We read mappings at the start of a job to determine what the structure of the record should be. Documents that don't match this structure because they have dotted field names could be adapted to how the mappings are laid out, but it would change the original structure of the document. To complicate things further, nothing stops users from writing these documents back to the source index at the end of the job. This causes an unpleasant situation where an update with no changes to the data has rewritten the _source field formatting.

The suggested solution for users currently is to make use of the Dot Expander Processor to normalize the JSON before it is ingested into Elasticsearch if there is any chance that data might be read by ES-Hadoop. If we move forward with adding support for something like this then I think we need to not only rethink our advice for Hadoop, but also discuss how important preserving the exact format of a document's _source is from a client code perspective. I'd be interested in hearing if any other clients run into this issue currently, but I doubt it's widespread since most clients don't directly handle deserializing _source into objects.

/cc @masseyke @jakelandis

This PR adds support for a new mapping parameter to the configuration of the object mapper (root as well as individual fields), that makes it possible to store metrics data where it's common to have fields with dots in their names in the following format:

```
{
  "metrics.time" : 10,
  "metrics.time.min" : 1,
  "metrics.time.max" : 500
}
```

Instead of expanding dotted paths to their corresponding object structure, objects can be configured to preserve dots in field names, in which case they can only hold leaf sub-fields and no further objects. The mapping parameter is called subobjects and controls whether an object can hold other objects (defaults to true) or not. The following example shows how it can be configured in the mappings:

```
{
  "mappings" : {
    "properties" : {
      "metrics" : {
        "type" : "object",
        "subobjects" : false
      }
    }
  }
}
```

Closes #63530

jpountz added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types labels Oct 12, 2020

elasticmachine added the Team:Search Meta label for search team label Oct 12, 2020

axw mentioned this issue Oct 13, 2020

Support metrics with dots in their names elastic/apm#347

Open

axw mentioned this issue Feb 3, 2021

Support for a fully numeric flattened field #61550

Open

cyrille-leclerc assigned jimczi Feb 11, 2021

ebeahan mentioned this issue May 3, 2021

[RFC] Label fields for additional types - Stage 0 elastic/ecs#1341

Closed

2 tasks

romseygeek mentioned this issue Oct 13, 2021

Add 'flatten' parameter to object mappers #78997

Closed

jimczi removed their assignment Dec 16, 2021

javanna changed the title ~~Dotted field names that conflict with objects~~ Support for dots in field names: Dotted field names that conflict with objects Mar 1, 2022

javanna changed the title ~~Support for dots in field names: Dotted field names that conflict with objects~~ Dotted field names that conflict with objects Mar 1, 2022

javanna self-assigned this Mar 1, 2022

Leaf-Lin mentioned this issue Mar 29, 2022

Documents with dotted field names incorrectly accepted for nested fields #85004

Open

javanna mentioned this issue Apr 28, 2022

Add support for dots in field names for metrics usecases #86166

Merged

javanna closed this as completed in #86166 May 17, 2022

kpollich mentioned this issue Jun 2, 2022

Add support for subobjects: false elastic/package-spec#349

Closed

eli-gc mentioned this issue Jan 30, 2024

Rejected by Elasticsearch [error type]: document_parsing_exception [reason]: '[1:660] failed to parse field [kubernetes.labels.app] of type [text] in document with id uken/fluent-plugin-elasticsearch#1041

Open

2 tasks

javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024
