Introduce transaction histogram metrics #3485
Comments
Another TBD, requiring research: what's the aggregation/publishing interval for these metrics? Too long, and it'll create noticeable delays for stats showing up in Kibana. Too short, and we won't see storage or query improvements.
For RUM, we'll probably need to add some other user-agent fields over time to satisfy elastic/apm#198. Another option for these user-agent fields is to just use the original user-agent value in the grouping key, and continue to rely on ingest node to pull it apart in the resulting metrics docs. The only question I have about that is whether the original user-agent string would normally be too high-cardinality. High cardinality isn't a problem over time, only within the short term – the aggregation interval. After each aggregation interval we would reset and reuse space. We'll still need to deal with malicious high-cardinality values anyway, since RUM is unauthenticated. To deal with that we could have a fixed maximum set of aggregation buckets, and once they're full we would devolve to creating a single-value histogram metric document for each additional grouping key.
If we do end up parsing the user-agent in the server, here are a couple of reasonable-ish off-the-shelf options: uap-go would be more in line with the ingest-node processor, which uses the regexes.yaml from https://github.com/ua-parser/uap-core. However, the Go implementation appears to be a bit out of date, and looking at its API, I suspect it might have some performance issues. mssola/user_agent looks decent, but has hard-coded logic, unlike ingest-node. I'm not sure if we could get away with that.
@axw it would be ideal for us to have transaction.duration.histogram on regular transaction documents as well – that way we can run roughly the same queries on transactions and metrics.
FYI: we are trying very hard to use APM for our production services and replace our commercial monitoring tools. Disk space is one of the main concerns we have right now, as we get so much traffic that APM generates around 300-400GB on a good day. We are very excited for this feature.
The primary motivation behind this change is to lay the groundwork for merging shared (i.e. stream) and per-event metadata at decode time, rather than transformation time, which we'll need for #3485. We could merge metadata without these changes, but it would be more difficult and error-prone. Making these changes also provides some performance improvements – see below. Finally, there is also overlap between merging metadata and revising the decoders to enable memory reuse (#3551 (comment)). In theory this could be considered a breaking change, since an empty string coming from an agent would no longer be recorded in output documents. In practice, it does not make sense for any of the metadata fields to have empty string values. To omit empty strings everywhere, we would have to change the behaviour of utility.Set to not record them. Because I have only modified metadata types, and not all model types, I instead changed the metadata types' Fields methods to stop using utility.Set and implemented a limited version of #3565 which is more explicit about omitting empty strings. These changes yield a significant performance improvement in micro-benchmarks, both in decoding and transformation.
Decoding improvements can be attributed to fewer allocations, while transformation improvements can be attributed to:

- fewer allocations
  - no interface allocations, or unnecessary deep copying of maps, due to utility.Set
  - lazy map construction
- less reflection, due to not using utility.Set
- less pointer indirection

```
name                   old time/op    new time/op    delta
pkg: github.com/elastic/apm-server/model/metadata  goos: linux  goarch: amd64
MetadataSet/minimal-8    1.16µs ± 6%    0.38µs ±11%  -67.59%  (p=0.008 n=5+5)
MetadataSet/full-8       11.9µs ± 4%     5.3µs ± 6%  -55.53%  (p=0.008 n=5+5)
pkg: github.com/elastic/apm-server/model/modeldecoder  goos: linux  goarch: amd64
DecodeMetadata-8         9.70µs ± 1%    9.30µs ±17%     ~     (p=0.690 n=5+5)

name                   old alloc/op   new alloc/op   delta
pkg: github.com/elastic/apm-server/model/metadata  goos: linux  goarch: amd64
MetadataSet/minimal-8      896B ± 0%      368B ± 0%  -58.93%  (p=0.008 n=5+5)
MetadataSet/full-8       14.0kB ± 0%     6.2kB ± 0%  -55.36%  (p=0.008 n=5+5)
pkg: github.com/elastic/apm-server/model/modeldecoder  goos: linux  goarch: amd64
DecodeMetadata-8         1.31kB ± 0%    1.06kB ± 0%  -18.96%  (p=0.000 n=5+4)

name                   old allocs/op  new allocs/op  delta
pkg: github.com/elastic/apm-server/model/metadata  goos: linux  goarch: amd64
MetadataSet/minimal-8      10.0 ± 0%       4.0 ± 0%  -60.00%  (p=0.008 n=5+5)
MetadataSet/full-8          114 ± 0%        68 ± 0%  -40.35%  (p=0.008 n=5+5)
pkg: github.com/elastic/apm-server/model/modeldecoder  goos: linux  goarch: amd64
DecodeMetadata-8           61.0 ± 0%      28.0 ± 0%  -54.10%  (p=0.008 n=5+5)
```

Commits:

* model/modeldecoder: benchmark DecodeMetadata
* Benchmark recycled memory decoding
* model/modeldecoder: update decoding
* model/metadata: use non-pointer fields
* Adapt inputs to model changes
* model/metadata: benchmark Metadata.Set
* model: fix golint error (Id->ID)
We have issues for the remaining work.
Nothing else to do here for now, so I'll close this out and we can open more specific issues as needed.
Motivation / summary
Currently the way Elastic APM works is by recording every single transaction as an individual document - including "unsampled transactions". In order to compute statistics, we use Elasticsearch's aggregations framework to query over the documents in real time.
This is a simple approach, and permits aggregation over arbitrary dimensions (filtering criteria), but it also comes with downsides such as higher storage cost and poorer aggregation/query performance.
See also: elastic/apm#104.
We will introduce an option to record pre-aggregated transaction duration histograms, using the histogram field type introduced in Elasticsearch 7.6.
Approach
The APM Server will take responsibility for aggregating and producing these histogram metrics. We may later also support agents producing the metrics in order to avoid sending unsampled transactions to the APM Server, but this is initially out of scope. To support that we would have the agents set a flag on events indicating that they have already been aggregated into a histogram.
Enabling transaction histogram metrics will not initially cause unsampled transactions to be dropped, so there will be no storage reduction -- only improved query performance. We will introduce a separate option for dropping unsampled transactions, and in a future major version (e.g. 8.0) we may start dropping unsampled transactions by default.
The histogram field will be used to power most if not all aggregations used in the APM UI, when the search bar is not in use. When the search bar is in use, we will fall back to the existing approach of querying over the individual documents. For the case where the histogram fields are used, the histogram metric documents must also include the context fields used by the default filters in the APM UI.
To support identifying trace groups (i.e. root transaction groups), we will flag the documents relating to root transactions with transaction.root: true. Thus, the APM Server must record histograms for each observed combination of the following fields:
The histogram field itself will be called transaction.duration.histogram. The exact algorithm and parameters to be used are TBD, but following suit with Elasticsearch and using HDRHistogram is likely.
RUM-specific support
For RUM, the server will also need to perform GeoIP lookup and User-Agent parsing, and include their results to power RUM-specific visualisations. Specifically we would also need these fields:
These might be tackled in a second phase, which would require the RUM visualisations to continue using the existing aggregation approach in the meantime. However, it may be too difficult to switch the UI over to using the metrics before RUM support exists, so we should aim to implement this as soon as possible to avoid delaying the UI implementation.
ML Anomaly Detection support
If unsampled transactions were dropped, the existing ML Anomaly Detection jobs would no longer be accurate. We will need to update the jobs to use aggregations based on the histogram field: https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-aggregation.html
Support/ramifications for SIEM
SIEM currently displays APM transactions as "Events", with two visualisations:
Dropping unsampled transactions will naturally lead to the event list only listing events for sampled transactions. For the chart we could go one of two ways: either have it match up with the list (i.e. count only sampled transactions), or base the aggregation on the histogram field.
Proposal: keep it simple and continue to base SIEM events off transaction documents. This means that if unsampled transactions are dropped, they will not show up in SIEM's event counts or event list.
Configuration
For various reasons, this feature will be opt-in when we introduce it, and in a later major version (e.g. 8.0) we would enable it by default for the default distribution. The reasons for initially making it opt-in are:
As mentioned, we will provide separate configuration for dropping unsampled transaction documents. This is separately configurable in order to maintain the ability to use the search bar to search over both sampled and unsampled transactions, and to support the RUM-specific map visualisation.
The exact configuration names are TBD.