[Transform] continuous transform date_histogram group_by performance #54254
Pinging @elastic/ml-core (:ml/Transform)
Runtime statistic example for optimization: Dataset: user reviews, 5.3 million reviews
Note that because the frequency |
optimize transform for group_by on date_histogram by injecting an additional range query. This limits the number of search and index requests and avoids unnecessary updates. Only recent buckets get re-written. fixes #54254
Affected versions: < 7.7
Problem
Continuous transforms are optimized for use cases where sessions are grouped using terms. Grouping on date_histogram (e.g. per-hour metrics) with large datasets suffers from rewriting all buckets at every checkpoint. This causes a lot of load on the cluster and might result in service degradation.
Mitigation
Rollup is optimized for this use case and provides, via rollup search, aggregations on aggregations. Please consider using rollup instead.
Transform will provide an optimization for grouping on date_histogram in version 7.7. Please consider upgrading to 7.7. (Note that you can use a separate cluster for transform, as transform supports CCS.)
For this optimization to kick in, the field you configure for sync must be the same field you configure for the date_histogram group_by. Using multiple group_by is still possible; the transform gets optimized for the group_by whose field matches the field for time-based sync.
If you cannot switch to rollup and upgrading to 7.7 is not possible, you can work around the problem by adding a query filter that filters out data you know is not required for updating the transform.
In this filter, TIMESTAMP_FIELD should be the same field that you use for date_histogram as well as sync.
FILTER_VALUE must keep at least the most recent delay + interval of data, that is, it may only exclude data older than that. Also take bucket rounding into account.
For example, if you group every 5 minutes and your ingest delay is 1 minute, the query should filter out only data older than 6 minutes. You can use date math relative to now to express the value.
Examples:
now-1h/h excludes everything older than 1 hour, rounded down to the hour
now-1d/d excludes everything older than 1 day, rounded down to the day
Note: This does not have to be exact; you can filter less. However, it is important to round down to the start of a bucket. Without rounding down, the transform will overwrite older buckets with wrong or incomplete data.
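The workaround above can be sketched in Python. This is a minimal illustration, not the query from the original report: the helper names `bucket_aligned_cutoff` and `build_filter_query` are hypothetical, the filter is assumed to be a standard range query placed under the transform's source query, and fixed-interval buckets are assumed to be aligned to the Unix epoch in UTC with no offset.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_aligned_cutoff(now: datetime, delay: timedelta, interval: timedelta) -> datetime:
    """Return the newest cutoff that is still safe to filter on.

    Steps back delay + interval from `now`, then rounds down to the start
    of a bucket (assumption: buckets are multiples of `interval` since the
    Unix epoch, UTC, no offset).
    """
    earliest_needed = now - delay - interval
    # timedelta // timedelta yields the number of whole buckets since the epoch
    whole_buckets = (earliest_needed - EPOCH) // interval
    return EPOCH + whole_buckets * interval


def build_filter_query(timestamp_field: str, filter_value: str) -> dict:
    """Build a range query that keeps documents at or after `filter_value`.

    `filter_value` can be a date math expression such as "now-1h/h".
    """
    return {"range": {timestamp_field: {"gte": filter_value}}}


if __name__ == "__main__":
    now = datetime(2020, 3, 26, 10, 17, 30, tzinfo=timezone.utc)
    # Group every 5 minutes with a 1 minute ingest delay, as in the example above
    cutoff = bucket_aligned_cutoff(now, timedelta(minutes=1), timedelta(minutes=5))
    print("newest safe cutoff:", cutoff.isoformat())

    # A coarser date math value filters less, but still starts on a bucket boundary
    print(build_filter_query("TIMESTAMP_FIELD", "now-1h/h"))
```

Because the query is stored in the transform configuration, a date math expression such as now-1h/h is usually the practical choice for FILTER_VALUE: it moves forward with each checkpoint, whereas a fixed timestamp would not. The cutoff helper only illustrates what rounding down to a bucket start means.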