[Transform] continuous transform date_histogram group_by performance #54254

hendrikmuhs · 2020-03-26T09:23:53Z

Affected versions: < 7.7

Problem

Continuous transforms are optimized for use cases where sessions are grouped using terms. Grouping on date_histogram (for example, per-hour metrics) with large datasets suffers from rewriting all buckets at every checkpoint. This puts a lot of load on the cluster and might result in service degradation.

Mitigation

Rollup is optimized for this use case and provides aggregations on aggregations via rollup search. Please consider using rollup instead.

Transform will provide an optimization for grouping on date_histogram in version 7.7. Please consider upgrading to 7.7. (Note that you can use a separate cluster for transform, as transform supports CCS.)

For this optimization to kick in, the field you configure for sync must be the same field you configure for the date_histogram group_by. Using multiple group_by is still possible; the transform gets optimized for the group_by whose field matches the field used for time-based sync.
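As a rough illustration of the matching rule above, the following sketch checks a transform config (as a Python dict) and lists the date_histogram group_by entries whose field equals the sync field. The helper name `optimized_group_bys`, the index names, and the field names are made up for this example; only the general shape of the config (`sync.time.field`, `pivot.group_by`) follows the transform API. It is not how the transform itself decides, just a way to sanity-check a config.

```python
from typing import List


def optimized_group_bys(config: dict) -> List[str]:
    """Return names of date_histogram group_bys whose field equals the sync time field."""
    sync_field = config.get("sync", {}).get("time", {}).get("field")
    if sync_field is None:
        # Not a continuous transform with time-based sync: nothing to match.
        return []
    group_by = config.get("pivot", {}).get("group_by", {})
    return [
        name
        for name, spec in group_by.items()
        if spec.get("date_histogram", {}).get("field") == sync_field
    ]


# Hypothetical config: index and field names are assumptions for illustration.
example_config = {
    "source": {"index": "reviews"},
    "dest": {"index": "reviews_per_hour"},
    "frequency": "1m",
    "sync": {"time": {"field": "timestamp", "delay": "60s"}},
    "pivot": {
        "group_by": {
            "hour": {"date_histogram": {"field": "timestamp", "fixed_interval": "1h"}},
            "user": {"terms": {"field": "user_id"}},
        },
        "aggregations": {"avg_rating": {"avg": {"field": "rating"}}},
    },
}

print(optimized_group_bys(example_config))
```

Here only the "hour" group_by qualifies, because its field is the same as the sync field; the terms group_by is unaffected.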

If you cannot switch to rollup and upgrading to 7.7 is not possible, you can work around the problem by adding a query filter that excludes data you know is not required for updating the transform:

"range" : {
    "TIMESTAMP_FIELD" : {
        "gte" : "FILTER_VALUE"
    }
}

TIMESTAMP_FIELD should be the same field that you use for date_histogram and for sync.

FILTER_VALUE must keep at least the most recent delay + interval of data; everything older than that can be excluded. Also take bucket rounding into account.

For example, if you group every 5 minutes and your ingest delay is 1 minute, the query can filter out everything older than 6 minutes. Instead of computing an absolute value, you can use date math based on now.

Examples:

now-1h/h excludes everything older than 1 hour rounded down to the hour.
now-1d/d excludes everything older than 1 day rounded down to the day.

Note: This does not have to be exact; you can filter less. However, it is important to round down to the start of a bucket. Without rounding down, the transform overwrites older buckets with wrong or incomplete data.
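If you prefer an absolute FILTER_VALUE over date math, the cutoff can be computed client-side. The sketch below is one possible way to do it, assuming fixed-interval buckets that are aligned to the Unix epoch in UTC with no offset; calendar intervals, time zones, or offsets would need different rounding. The helper names `bucket_aligned_cutoff` and `range_filter` and the field name `timestamp` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def bucket_aligned_cutoff(now: datetime, interval: timedelta, delay: timedelta) -> datetime:
    """Go back delay + interval from now, then round down to the start of a bucket."""
    step = int(interval.total_seconds())
    target = int((now - delay - interval).timestamp())
    floored = target - target % step  # assumes epoch-aligned fixed-interval buckets
    return datetime.fromtimestamp(floored, tz=timezone.utc)


def range_filter(timestamp_field: str, cutoff: datetime) -> dict:
    """Build the range query from the workaround with an absolute gte value."""
    return {"range": {timestamp_field: {"gte": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")}}}


# 5-minute buckets, 1-minute ingest delay, as in the example above.
now = datetime(2020, 3, 26, 9, 23, 53, tzinfo=timezone.utc)
cutoff = bucket_aligned_cutoff(now, timedelta(minutes=5), timedelta(minutes=1))
print(range_filter("timestamp", cutoff))
# prints {'range': {'timestamp': {'gte': '2020-03-26T09:15:00Z'}}}
```

Going back 6 minutes from 09:23:53 gives 09:17:53, which rounds down to the bucket start 09:15:00. Rounding down means the filter keeps slightly more data than strictly necessary, which is the safe direction.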


elasticmachine · 2020-03-26T09:23:55Z

Pinging @elastic/ml-core (:ml/Transform)

hendrikmuhs · 2020-03-26T17:04:21Z

Runtime statistic example for optimization:

Dataset: user reviews, 5.3 million reviews
feed: 20 events/s
transform config:
frequency 10s
date_histogram 1m

Run	input documents	documents written	index time in ms	search time in ms
base (batch)	5261600	452	11	407
continuous before	7120905395	611556	22647	367183
continuous after	20843400	3157	11421	5345

Note that because the frequency of 10s was smaller than the bucket interval of 60s, the last bucket had to be rewritten multiple times. With a frequency closer to the bucket interval, the continuous transform would be closer to the batch transform.
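To put the table in perspective, the reduction factors between the "continuous before" and "continuous after" runs can be computed directly from the reported numbers. This is plain arithmetic on the figures above; the dictionary keys and the helper name `reduction_factors` are chosen for this example.

```python
# Figures copied from the "continuous before" and "continuous after" rows.
before = {
    "input_documents": 7120905395,
    "documents_written": 611556,
    "index_time_ms": 22647,
    "search_time_ms": 367183,
}
after = {
    "input_documents": 20843400,
    "documents_written": 3157,
    "index_time_ms": 11421,
    "search_time_ms": 5345,
}


def reduction_factors(before: dict, after: dict) -> dict:
    """Return before/after ratios per metric (higher means a larger reduction)."""
    return {key: before[key] / after[key] for key in before}


factors = reduction_factors(before, after)
for key, value in factors.items():
    print(f"{key}: {value:.1f}x")
```

For this dataset, the optimization reads roughly 340 times fewer input documents and spends close to 70 times less time searching, while index time drops by about half.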

optimize transform for group_by on date_histogram by injecting an additional range query. This limits the number of search and index requests and avoids unnecessary updates. Only recent buckets get re-written. fixes #54254

hendrikmuhs added >bug :ml/Transform Transform labels Mar 26, 2020

hendrikmuhs mentioned this issue Mar 26, 2020

[Transform] Transform optmize date histogram #54068

Merged

hendrikmuhs closed this as completed in #54068 Mar 26, 2020

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 3) elastic/elasticsearch-net#4534

Closed

lcawl mentioned this issue Apr 14, 2020

[DOCS] Removes transform performance note #55177

Merged

hendrikmuhs mentioned this issue Jul 8, 2020

[Transform] Performance: Unexpected long runtime for date_histogram group_by when using 2 different time fields #59061

Closed

hendrikmuhs mentioned this issue Oct 7, 2020

[Transform] improve continuous transform date_histogram group_by with ingest timestamps #63315

Merged
