Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Give doc-value-only mappings to numeric fields on metrics templates. #87100

Merged
merged 6 commits into from
Jun 14, 2022

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented May 25, 2022

It doesn't make much sense to filter samples depending on the values of some
metric fields, as these metrics mostly make sense in aggregate and over time.
This change proposes to give doc-value-only mappings to numeric fields in the
default metrics template. This helps reduce index-time CPU usage and disk usage
while not impacting the search-time experience.

Another improvement that this change brings is mapping floating-point numbers
as doubles instead of floats by default, since double precision is more
commonly used across other systems that are designed at handling metrics. I
could revert that part and switch back to floats if it proved controversial.

These doc-value-only dynamic templates might catch fields that are not metrics
at times, which is probably ok.

It doesn't make much sense to filter samples depending on the values of some
metric fields, as these metrics mostly make sense in aggregate and over time.
This change proposes to give doc-value-only mappings to numeric fields in the
default metrics template. This helps reduce index-time CPU usage and disk usage
while not impacting the search-time experience.

Another improvement that this change brings is mapping floating-point numbers
as doubles instead of floats by default, since double precision is more
commonly used across other systems that are designed at handling metrics. I
could revert that part and switch back to floats if it proved controversial.

These doc-value-only dynamic templates might catch fields that are not metrics
at times, which is probably ok.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @jpountz, I've created a changelog YAML for you.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from the ES side, but since these mappings came from the observability side I'd prefer someone from that team sign off before merging.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine by me too.

@ruflin
Copy link
Contributor

ruflin commented May 31, 2022

Change LGTM. On the double change, will this have any storage or performance impact for numbers with only 1 digit for example?

From an Observability perspective, by default this will not have an effect on all the data we load packages with templates for as these have higher priority. It will mainly affect the data without predefined mappings. If the above change is unexpected for a user, I assume our recommendation is to set a specific template in place for these data streams with higher priority to revert back to the old behaviour.

@axw @felixbarny I assume this will not really have an affect on APM?

@joshdover This change of defaults should also be applied to Fleet and the default mappings we load for each data stream. We should have again a discussion with @jpountz around reusing or not the "base" templates from Elasticsearch for these.

@felixbarny
Copy link
Member

+1 to index: false and rely only on doc values. This is how I used to map metrics indices in the past.

Last time I checked, which was quite a while ago, doubles took up more space than floats but if the industry standard is to use doubles that that's what we should do, too. But maybe compression has improved in the meantime in ES?

a discussion with @jpountz around reusing or not the "base" templates from Elasticsearch for these.

Is there any capability to "inherit" from the base template or will a higher priority template always override and not merge with other templates?

@ruflin
Copy link
Contributor

ruflin commented May 31, 2022

Is there any capability to "inherit" from the base template or will a higher priority template always override and not merge with other templates?

Yes. If we use component templates, I think the last (or first) component wins.

@axw
Copy link
Member

axw commented May 31, 2022

@axw @felixbarny I assume this will not really have an affect on APM?

APM has its own dynamic templates. We should probably follow suit though: elastic/apm-server#8261

@jpountz
Copy link
Contributor Author

jpountz commented May 31, 2022

On the double change, will this have any storage or performance impact for numbers with only 1 digit for example?

It will increase storage requirements indeed.

One special case that Elasticsearch/Lucene detect and that wouldn't become less space-efficient is when numbers are effectively all integers across a segment. In that case, the binary representation of the numbers likely has lots of trailing zeroes and Lucene manages to store them implicitly. However, if numbers have a decimal part then Lucene will likely spend 8 whole bytes on doc values for numbers mapped as doubles and 4 whole bytes for numbers mapped as floats. This is something that we discussed improving a bit for time series data, but that hasn't been done yet.

I made this change because my understanding is that double precision is pretty standard for metrics, but like I said I can revert this part if it's not as obvious as I thought, we can do it later.

If the above change is unexpected for a user, I assume our recommendation is to set a specific template in place for these data streams with higher priority to revert back to the old behaviour.

Yes. We have to be able to make improvements to the out-of-the-box experience. :)

This is how I used to map metrics indices in the past.

Oh, I like references to old blog posts. :)

We should have again a discussion with @jpountz around reusing or not the "base" templates from Elasticsearch for these.

It's a different topic, but I have a strong preference for not reusing the Elasticsearch base templates as they have a different lifecycle from integrations templates, so creating dependencies between them has a high risk of introducing breaks. I'd rather duplicate the relevant bits.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this change. The assumption is that at some point we likely will optimise double vs float usage with TSDB and on the reuse (or not) of component templates, we will have a separate discussion.

@joshdover
Copy link
Contributor

joshdover commented Jun 1, 2022

This change of defaults should also be applied to Fleet and the default mappings we load for each data stream.

@ruflin Which defaults are you referring to? Today Fleet only has two base component templates whcih contain mappings that are installed for all data streams. Neither of these contain any numeric fields:

{
  "_meta": {
    "managed_by": "fleet",
    "managed": true
  },
  "dynamic_templates": [
    {
      "strings_as_keyword": {
        "mapping": {
          "ignore_above": 1024,
          "type": "keyword"
        },
        "match_mapping_type": "string"
      }
    }
  ],
  "date_detection": false
}
{
  "properties": {
    "event": {
      "properties": {
        "agent_id_status": {
          "ignore_above": 1024,
          "type": "keyword"
        },
        "ingested": {
          "format": "strict_date_time_no_millis||strict_date_optional_time||epoch_millis",
          "type": "date"
        }
      }
    }
  }
}

@ruflin
Copy link
Contributor

ruflin commented Jun 1, 2022

@joshdover These are the ones I was referring to and we should likely add some numeric values. But lets split the discussion from this PR as there is more to it. I'll ping you.

@jpountz
Copy link
Contributor Author

jpountz commented Jun 8, 2022

I reverted the change from float to double, which I thought would better align with users' expectations, but maybe not from the perspective of disk usage until we have better compression for gauges whose value is rather stable over time. We can re-evaluate this later.

@jpountz
Copy link
Contributor Author

jpountz commented Jun 13, 2022

@elasticmachine test this please

@jpountz jpountz merged commit 2243f5e into elastic:master Jun 14, 2022
@jpountz jpountz deleted the doc_value_only_metrics branch June 14, 2022 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement Team:Clients Meta label for clients team Team:Data Management Meta label for data/management team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants