Dotted field names that conflict with objects #63530

Closed
jpountz opened this issue Oct 12, 2020 · 21 comments · Fixed by #86166
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@jpountz
Contributor

jpountz commented Oct 12, 2020

Elasticsearch assumes that dots in field names are an object separator. This means that a document such as this one:

{
  "metric.value.max": 42
}

is actually indexed as if it were formatted as follows:

{
  "metric": {
    "value": {
      "max": 42
    }
  }
}

And in the mappings, this translates into two object fields called metric and metric.value and a long field called metric.value.max.
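
A minimal sketch of the expansion described above, in plain Python rather than Elasticsearch code. The function names `expand_dots` and `describe_fields` are hypothetical, and the sketch only handles flat input documents:

```python
def expand_dots(doc):
    """Expand dotted keys into nested dicts: {"a.b": 1} -> {"a": {"b": 1}}."""
    result = {}
    for key, value in doc.items():
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            # Every path segment before the last one becomes an object.
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


def describe_fields(nested, prefix=""):
    """List each field path as either an "object" or a "leaf"."""
    fields = {}
    for name, value in nested.items():
        path = prefix + name
        if isinstance(value, dict):
            fields[path] = "object"
            fields.update(describe_fields(value, path + "."))
        else:
            fields[path] = "leaf"
    return fields


expanded = expand_dots({"metric.value.max": 42})
print(expanded)
# -> {'metric': {'value': {'max': 42}}}
print(describe_fields(expanded))
# -> {'metric': 'object', 'metric.value': 'object', 'metric.value.max': 'leaf'}
```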

This proves problematic when ingesting metrics that come from external systems such as Micrometer or OpenTelemetry, as it's not rare to have both metric.value and metric.value.max as metric names:

{
  "metric.value": 10,
  "metric.value.max": 42
}

Such a document will always fail indexing because metric.value would need to be an object field because of metric.value.max and a long field at the same time, which is illegal.
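
The conflict can be reproduced with a small Python sketch that infers, for every path in a document, whether it must be an object or a leaf. This is an illustration of the reasoning only, not how Elasticsearch implements it; `infer_field_kinds` and `MappingConflict` are hypothetical names:

```python
class MappingConflict(Exception):
    """Raised when one path would need to be both an object and a leaf."""


def infer_field_kinds(doc):
    kinds = {}

    def record(path, kind):
        previous = kinds.setdefault(path, kind)
        if previous != kind:
            raise MappingConflict(f"{path} cannot be both {previous} and {kind}")

    for key in doc:
        parts = key.split(".")
        # Every proper prefix of a dotted name has to be an object...
        for i in range(1, len(parts)):
            record(".".join(parts[:i]), "object")
        # ...and the full name has to be a leaf holding the value.
        record(key, "leaf")
    return kinds


try:
    infer_field_kinds({"metric.value": 10, "metric.value.max": 42})
except MappingConflict as error:
    print(error)
    # -> metric.value cannot be both leaf and object
```

The same conflict is raised whichever of the two keys comes first.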

Some workarounds have been developed, such as replacing dots with underscores, or adding suffixes, but this creates a bad user experience in Kibana as users are not seeing the field names that they expect.

We should look into ways to make this supported in Elasticsearch.

@jpountz jpountz added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types labels Oct 12, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@felixbarny
Member

Some workarounds have been developed, such as replacing dots with underscores, or adding suffixes, but this creates a bad user experience in Kibana as users are not seeing the field names that they expect.

Is there an estimated release version for this? To choose the workaround with the right tradeoffs, for the time being, it would help us to know when we could expect this enhancement to land. See also elastic/apm#347 (comment).

@zacharymorn
Contributor

Just a thought. Is it possible to do something similar to Java auto-boxing here? For example, when encountering the definitions above

{
  "metric.value": 10,
  "metric.value.max": 42
}

ES detects the potential collision, wraps things into a "super type", and produces these internally:

{
  "metric.value.__long_value": 10,
  "metric.value.max": 42
}

and the translation between the external and internal representations gets encapsulated somewhere and hidden from the outside world?

@SylvainJuge
Member

Adding a suffix if there is a conflict was the initial idea we had as a work-around.

However, that assumes that metrics that conflict are always sent together, or in a specific order, and we can't guarantee that in the context of APM agents as there might be more than one agent sending such data.

If we can guarantee that sending both metric.value then metric.value.max independently and in any order produces the same result (at least to the end-user, internal storage may differ), that could work though.

@zacharymorn
Contributor

Hmm, I see, that's a good callout. What if ES suffixes every field the user defines, like so:

{
  "metric.value.__long_value": 10,
  "metric.value.max.__long_value": 42
}

This basically causes every user-defined field to be of object type, and thus order shouldn't matter?

@SylvainJuge
Member

Yes, but in that case, unless there is a way to make this transparent to the user, they will have to use __long_value to query their metrics.

@zacharymorn
Contributor

Yes exactly.

Just to be clear, what I meant above was that the user should use and see the representation in the existing format

{
  "metric.value": 10,
  "metric.value.max": 42
}

but internally these representations could effectively be converted into the following, which is hidden from the user (hence the __xxx notation here)

{
  "metric.value.__long_value": 10,
  "metric.value.max.__long_value": 42
}

I think since there's already some field-processing logic to handle the conversion from dotted field names to objects, ES can potentially piggyback on that to support this conversion as well?
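
To make the idea concrete, here is a rough Python sketch of such a translation layer. It is only an illustration of the proposal, not an existing Elasticsearch feature: `to_internal` and `to_external` are hypothetical names, and only the `__long_value` suffix from the example is handled.

```python
SUFFIX = ".__long_value"  # hidden suffix taken from the example above


def to_internal(doc):
    """Append the hidden suffix so that every user-visible name becomes an object."""
    return {name + SUFFIX: value for name, value in doc.items()}


def to_external(internal_doc):
    """Strip the hidden suffix again before showing fields to the user."""
    external = {}
    for name, value in internal_doc.items():
        if not name.endswith(SUFFIX):
            raise ValueError(f"unexpected internal field: {name}")
        external[name[: -len(SUFFIX)]] = value
    return external


# The two metrics arrive in separate documents, in either order.
first = to_internal({"metric.value": 10})
second = to_internal({"metric.value.max": 42})
merged = {**second, **first}

print(sorted(merged))
# -> ['metric.value.__long_value', 'metric.value.max.__long_value']
print(to_external(merged) == {"metric.value": 10, "metric.value.max": 42})
# -> True
```

Because every user-visible name is an object internally, no path is ever both a leaf and an object, regardless of arrival order.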

@jimczi
Contributor

jimczi commented Feb 2, 2021

We discussed offline and agreed that we'd prefer to consider this case under the flattened field use case.
For solutions that cannot control the name or the shape of the fields, the flattened field is a simple and powerful choice.

However, we've spotted some limitations that we'll need to solve before switching to this solution:

  1. We should ensure that flattened field accepts foo.value and foo.value.max explicitly.
  2. We need to handle numerics, not only keyword.
  3. Kibana needs to support flattened type.
  4. We need a suggester for field names that are under a flattened field. They don't appear in the mapping so we need to help users when writing queries and aggregations. That should help for the integration in Kibana.

@romseygeek
Contributor

An update on our thinking here: I'm experimenting with adding a flag to object fields that says 'everything under this object uses a flattened representation', so anything using dot notation ends up as a field containing a dot, and documents using object notation will cause an indexing exception.

So for the initial example here, the mappings would look like this:

{
  "properties" : {
    "metric" : {
      "type" : "object",
      "flattened" : true,
      "properties" : {
        "value" : { "type" : "long" },
        "value.max" : { "type" : "long" }
      }
    }
  }
}

And we can take as input both of the following formats:

{ "metric" : { "value" : 10, "value.max" : 15 } }
{ "metric.value" : 10, "metric.value.max" : 15 }

But the following would throw an exception:

{ "metric" : { "value" : { "max" : 15 } } }

Objects with flattened=true can only contain leaf fields in their properties section.
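
The accepted and rejected inputs above can be mimicked with a short Python sketch. It is a simplified model of the proposed behaviour, assuming the document only contains fields under the flagged object; `resolve_fields` is a hypothetical name, not part of Elasticsearch:

```python
def resolve_fields(doc, flagged="metric"):
    """Resolve leaf field paths for a document whose `flagged` object is flattened."""
    fields = {}
    for key, value in doc.items():
        if key == flagged and isinstance(value, dict):
            # Object notation is fine for the flagged object itself.
            candidates = {f"{flagged}.{name}": sub for name, sub in value.items()}
        else:
            candidates = {key: value}
        for path, leaf in candidates.items():
            if isinstance(leaf, dict):
                # Anything below the flagged object has to be a leaf.
                raise ValueError(f"{path}: objects are not allowed under '{flagged}'")
            fields[path] = leaf
    return fields


print(resolve_fields({"metric": {"value": 10, "value.max": 15}}))
# -> {'metric.value': 10, 'metric.value.max': 15}
print(resolve_fields({"metric.value": 10, "metric.value.max": 15}))
# -> {'metric.value': 10, 'metric.value.max': 15}

try:
    resolve_fields({"metric": {"value": {"max": 15}}})
except ValueError as error:
    print(error)
    # -> metric.value: objects are not allowed under 'metric'
```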

@axw
Member

axw commented Oct 16, 2021

@romseygeek perhaps a stupid question: will it be possible to use this with dynamic mapping when there's no common prefix for the fields? In APM we map metrics as they're provided by applications (within limitations of field names of course). So we don't have anything to hang a "flattened": true off -- unless this will work with a dynamic template?

@felixbarny
Member

From testing I did a while ago in a different context, it's not possible to have top-level flattened fields. Trying to use them in dynamic mappings leads to errors. That's indeed an issue for the metrics use case.

@romseygeek
Contributor

@axw @felixbarny it will be possible to use this with dynamic mapping, yes, and you will also be able to set flatten:true on the root object so that everything is interpreted as a flat field. @felixbarny I think you're referring here to the flattened field type, which works slightly differently? It would be good to get some example inputs so that we can check things will work as needed.

@axw
Member

axw commented Oct 18, 2021

Here's a sample document, which includes a contrived metric with a dotted name, service.latency.

{
  "@timestamp": "2021-10-18T08:33:40.086Z",
  "agent": {
    "name": "opentelemetry/go",
    "version": "1.0.0"
  },
  "ecs": {
    "version": "1.12.0"
  },
  "event": {
    "ingested": "2021-10-18T08:33:46.302676058Z"
  },
  "metricset.name": "app",
  "observer": {
    "ephemeral_id": "e3784820-0da5-404b-a637-f9cb6c179196",
    "hostname": "goat",
    "id": "96a5e65c-0ee5-486f-8662-9c7a18d9381a",
    "type": "apm-server",
    "version": "8.0.0",
    "version_major": 8
  },
  "processor": {
    "event": "metric",
    "name": "metric"
  },
  "service": {
    "language": {
      "name": "go"
    },
    "name": "unknown_service_systemtest_test"
  },
  "service.latency": {
    "counts": [
      1,
      1,
      1,
      1
    ],
    "values": [
      50.5,
      550,
      5500,
      10000
    ]
  }
}

Note that there's a service field which should not have flattened keys -- only the specific service.latency metric field should have a flattened key. So setting flatten: true at the root might be a pain.

For dynamically mapping histogram fields, we're using the named dynamic_templates feature that was introduced in 7.13. When APM Server receives a histogram-type metric, it adds something like this to the document:

"_metric_descriptions": {
  "service.latency": {
    "type": "histogram"
  }
}

And then we use an ingest processor to map that to dynamic_templates.
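
Roughly, that step turns the descriptions into a map from full field name to dynamic template name, which is the shape the `dynamic_templates` index/bulk parameter expects. Here is a Python sketch of the transformation only, not the actual ingest pipeline. It assumes each template is named after the metric type (`histogram` here), which the thread does not state; `split_metric_descriptions` is a hypothetical name:

```python
import copy


def split_metric_descriptions(doc):
    """Return the document without _metric_descriptions plus a field -> template map."""
    remaining = copy.deepcopy(doc)  # leave the caller's document untouched
    descriptions = remaining.pop("_metric_descriptions", {})
    # Assumption: the dynamic template is named after the metric type.
    dynamic_templates = {
        field: description["type"] for field, description in descriptions.items()
    }
    return remaining, dynamic_templates


doc = {
    "service.latency": {"counts": [1, 1, 1, 1], "values": [50.5, 550, 5500, 10000]},
    "_metric_descriptions": {"service.latency": {"type": "histogram"}},
}
remaining, dynamic_templates = split_metric_descriptions(doc)
print(dynamic_templates)
# -> {'service.latency': 'histogram'}
print("_metric_descriptions" in remaining)
# -> False
```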

Would we just add/update our dynamic templates to set flatten: true then?

@romseygeek
Contributor

Let me double check, but I think this will work if you add the following:

"_metric_service": {
  "service" : {
    "type" : "object",
    "flatten" : "true"
  }
}

With this included, you'll get fields service.name, service.language.name and service.latency.

I'll add a specific test to the PR to make sure this works.

@axw
Member

axw commented Oct 19, 2021

With this included, you'll get fields service.name, service.language.name and service.latency.

The complication here is that we don't want service.name and service.language.name. Those are known, statically mapped fields, and should not be flattened. It's just service.latency that is dynamically mapped and should be flattened.

What I would like is to end up with a document that looks like this:

{
  "service": {
    "language": {
      "name": "go"
    },
    "name": "unknown_service_systemtest_test"
  },
  "service.latency": {
    "counts": [
      1,
      1,
      1,
      1
    ],
    "values": [
      50.5,
      550,
      5500,
      10000
    ]
  }
}

How would we flatten only service.latency?

@romseygeek
Contributor

I think we need to distinguish between what the document 'looks like' in JSON format (i.e., what will be returned if you ask for _source) and how it is stored and queried internally. Functionally there's no difference between this:

{
  "service" : {
    "language" : {
      "name" : "go"
    },
    "name" : "unknown"
  }
}

and this:

{
  "service.language.name" : "go",
  "service.name" : "unknown"
}

You address the fields via queries in exactly the same way for both formats, the fields output will be identical, and so on. The object structure is only present in _source; it makes no difference at all for queries or aggregations.
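
As a quick illustration, flattening both source shapes into field paths gives the same result. This is a Python sketch of the equivalence only, with a hypothetical `field_paths` helper, not what Elasticsearch runs internally:

```python
def field_paths(source, prefix=""):
    """Flatten a _source-like dict into {full.field.path: value}."""
    paths = {}
    for key, value in source.items():
        path = prefix + key
        if isinstance(value, dict):
            # Descend into objects, joining names with dots.
            paths.update(field_paths(value, path + "."))
        else:
            paths[path] = value
    return paths


nested = {"service": {"language": {"name": "go"}, "name": "unknown"}}
dotted = {"service.language.name": "go", "service.name": "unknown"}

print(field_paths(nested))
# -> {'service.language.name': 'go', 'service.name': 'unknown'}
print(field_paths(nested) == field_paths(dotted))
# -> True
```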

@romseygeek
Contributor

Hi @axw, the example that you've given above will actually work with current versions of elasticsearch, because service is always interpreted as an object. The point of this issue is to deal with situations that look more like this:

{
  "service": "unknown_service_systemtest_test",
  "service.latency": {
    "counts": [
      1,
      1,
      1,
      1
    ],
    "values": [
      50.5,
      550,
      5500,
      10000
    ]
  }
}

AIUI we have some metrics mappings which are already doing field name manipulation to convert dots to underscores, do you have an example of those types of mappings that we could work with?

@axw
Member

axw commented Nov 15, 2021

Hi @axw, the example that you've given above will actually work with current versions of elasticsearch, because service is always interpreted as an object. The point of this issue is to deal with situations that look more like this:

Makes sense. I've been conflating the _source structure and field names.

AIUI we have some metrics mappings which are already doing field name manipulation to convert dots to underscores, do you have an example of those types of mappings that we could work with?

I'm not too sure what you're referring to. The metrics in question are all dynamically mapped. The APM Java agent has configuration to de-dot metrics that are sent to APM Server, maybe that's what you've heard about?

@jimczi jimczi removed their assignment Dec 16, 2021
@droberts195
Contributor

Once this problem is fixed in core Elasticsearch I foresee the next level of problem reports will be that the core solution doesn't work in anomaly detection jobs, transforms, and possibly other areas (alerts as data?). Please keep us in the loop when changes are made so that we can work out if and how corresponding downstream changes need to be made.

/cc @elastic/ml-core

@SylvainJuge
Member

Another potential use case where it might be convenient to have this feature is storing OpenTelemetry (OTel) attributes.
They are just a map where keys are always dotted strings and values can be mapped to JSON equivalents (int, bool, string or array).

@javanna javanna changed the title Dotted field names that conflict with objects Support for dots in field names: Dotted field names that conflict with objects Mar 1, 2022
@javanna javanna changed the title Support for dots in field names: Dotted field names that conflict with objects Dotted field names that conflict with objects Mar 1, 2022
@javanna javanna self-assigned this Mar 1, 2022
@jbaiera
Member

jbaiera commented Mar 28, 2022

Stopping in to voice some concern around client level support for these sorts of features.

ES-Hadoop currently does not support reading documents that have dots in field names. The Hadoop connector takes care of converting JSON documents into records for use in Hadoop and Spark. We read mappings at the start of a job to determine what the structure of the record should be. Documents that don't match this structure because they have dotted field names could be adapted to how the mappings are laid out, but it would change the original structure of the document. To complicate things further, nothing stops users from writing these documents back to the source index at the end of the job. This causes an unpleasant situation where an update with no changes to the data has rewritten the _source field formatting.

The suggested solution for users currently is to make use of the Dot Expander Processor to normalize the JSON before it is ingested into Elasticsearch if there is any chance that data might be read by ES-Hadoop. If we move forward with adding support for something like this then I think we need to not only rethink our advice for Hadoop, but also discuss how important preserving the exact format of a document's _source is from a client code perspective. I'd be interested in hearing if any other clients run into this issue currently, but I doubt it's widespread since most clients don't directly handle deserializing _source into objects.

/cc @masseyke @jakelandis

javanna added a commit that referenced this issue May 17, 2022
This PR adds support for a new mapping parameter to the configuration of the object mapper (root as well as individual fields), that makes it possible to store metrics data where it's common to have fields with dots in their names in the following format:

```
{
  "metrics.time" : 10,
  "metrics.time.min" : 1,
  "metrics.time.max" : 500
}
```

Instead of expanding dotted paths to their corresponding object structure, objects can be configured to preserve dots in field names, in which case they can only hold leaf sub-fields and no further objects.

The mapping parameter is called subobjects and controls whether an object can hold other objects (defaults to true) or not. The following example shows how it can be configured in the mappings:

```
{
  "mappings" : {
    "properties" : {
      "metrics" : {
        "type" : "object", 
        "subobjects" : false
      }
    }
  }
}
```

Closes #63530
@javanna javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024