-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(tag_cardinality_limit transform): mitigate performance bottleneck #19281
Conversation
Your preview site for the VRL Playground will be ready in a few minutes, please allow time for it to build. Heres your preview link: |
/ci-run-regression |
@@ -29,8 +63,7 @@ impl AcceptedTagValueSet { | |||
let storage = match &mode { | |||
Mode::Exact => TagValueSetStorage::Set(HashSet::with_capacity(value_limit)), | |||
Mode::Probabilistic(config) => { | |||
let num_bits = config.cache_size_per_key / 8; // Convert bytes to bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conversion was wrong and BloomFilter::with_size
takes the size in bytes, not bits, as an argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh.
Regression Detector ResultsRun ID: c95e23d0-49e2-4044-b555-06f889b77ec5 Explanation A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI". For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
Changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%:
Experiments that were not declared erratic but were detected as being so, coefficient of variation cutoff 0.10:
Fine details of change detection per experiment.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find!
@@ -29,8 +63,7 @@ impl AcceptedTagValueSet { | |||
let storage = match &mode { | |||
Mode::Exact => TagValueSetStorage::Set(HashSet::with_capacity(value_limit)), | |||
Mode::Probabilistic(config) => { | |||
let num_bits = config.cache_size_per_key / 8; // Convert bytes to bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh.
/// Count of items inserted into the bloom filter. | ||
/// We manually track this because `BloomFilter::count` has O(n) time complexity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also represents a different value. The bloom filter, if it is counting the bits in the filter, can only return the number of reduced items, which will frequently (usually?) be less than the number of actual items inserted in the filter.
@@ -59,3 +92,47 @@ impl AcceptedTagValueSet { | |||
}; | |||
} | |||
} | |||
|
|||
#[cfg(test)] | |||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm remembering this separately maintained count used to exist but then I dropped it in #17911 since I thought it was maintained separately only because the old bloom filter type didn't have a way to fetch the count.
@@ -86,7 +86,7 @@ const fn default_value_limit() -> usize { | |||
} | |||
|
|||
pub(crate) const fn default_cache_size() -> usize { | |||
5000 * 1024 // 5KB | |||
5 * 1024 // 5KB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a potential behavior change, in that it will cause more false positives for sites that have not explictly configured the cache size.
/// Count of items inserted into the bloom filter. | ||
/// We manually track this because `BloomFilter::count` has O(n) time complexity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also represents a different value. The bloom filter, if it is counting the bits in the filter, can only return the number of reduced items, which will frequently (usually?) be less than the number of actual items inserted in the filter.
Datadog ReportBranch report: ✅ |
Regression Detector ResultsRun ID: 248f4a82-ca98-49e0-9ed4-c72d63176611 Explanation A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI". For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
Changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%:
Experiments that were not declared erratic but were detected as being so, coefficient of variation cutoff 0.10:
Fine details of change detection per experiment.
|
Regression Detector ResultsRun ID: f687d12f-ea75-4dc8-b89c-1c92d7b5cfac Explanation A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI". For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
Changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%:
Experiments that were not declared erratic but were detected as being so, coefficient of variation cutoff 0.10:
Fine details of change detection per experiment.
|
Regression Detector ResultsRun ID: 9df2330e-d484-49f2-8f6f-ac8c16fabb60 Explanation A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI". For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
Changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%:
Experiments that were not declared erratic but were detected as being so, coefficient of variation cutoff 0.10:
Fine details of change detection per experiment.
|
Regression Detector ResultsRun ID: a9ace7b3-f1b2-47e9-bd95-462ace0cff85 Explanation A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI". For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
Changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%:
Experiments that were not declared erratic but were detected as being so, coefficient of variation cutoff 0.10:
Fine details of change detection per experiment.
|
Regression Detector ResultsRun ID: e618fae7-14d2-4e5f-abba-28ff37c68c13 Explanation A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI". For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
Changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%:
Experiments that were not declared erratic but were detected as being so, coefficient of variation cutoff 0.10:
Fine details of change detection per experiment.
|
This is blocked from merging due to a bug in the regression detector that is flagging positive changes as blocking. I'll merge manually. |
This avoids calling BloomFilter::count which has a time complexity of O(n) where n is the number of bits in the filter.
127ab11
to
42683ef
Compare
This avoids calling
BloomFilter::count
which has a time complexity ofO(n)
wheren
is the number of bits in the filter.