Introduce transaction histogram metrics #3485
Comments
Another TBD, requiring research: what's the aggregation/publishing interval for these metrics? Too long, and it'll create noticeable delays for stats showing up in Kibana. Too short, and we won't see storage or query improvements.
For RUM, we'll probably need to add some other user-agent fields over time to satisfy elastic/apm#198. Another option for these user-agent fields is to just use the original user-agent value in the grouping key, and continue to rely on ingest node to pull it apart in the resulting metrics docs. The only question I have about that is whether the original user-agent string would normally be too high-cardinality. High cardinality isn't a problem over time, only within the short term – the aggregation interval. After each aggregation interval we would reset and reuse space. We'll still need to deal with malicious high-cardinality values anyway, since RUM is unauthenticated. To deal with that we could have a fixed maximum set of aggregation buckets, and once they're full we would devolve to creating a single-value histogram metric document for each additional grouping key.
If we do end up parsing the user-agent in the server, here are a couple of reasonable-ish off-the-shelf options: uap-go would be more in line with the ingest-node processor, which uses the regexes.yaml from https://github.com/ua-parser/uap-core. However, the Go implementation appears to be a bit out of date, and looking at its API, I suspect it might have some performance issues. mssola/user_agent looks decent, but has hard-coded logic, unlike ingest-node. I'm not sure if we could get away with that.
@axw it would be ideal for us to have transaction.duration.histogram on regular transaction documents as well – that way we can run roughly the same queries on transactions and metrics.
FYI: we are trying very hard to use APM for our production services and replace our commercial monitoring tools. Disk space is one of the main concerns we have right now, as we get so much traffic that APM generates around 300-400GB on a good day. We are very excited for this feature.
The primary motivation behind this change is to lay the groundwork for merging shared (i.e. stream) and per-event metadata at decode time, rather than transformation time, which we'll need for #3485. We could merge metadata without these changes, but it would be more difficult and error-prone. Making these changes also provides some performance improvements – see below. Finally, there is also overlap between merging metadata and revising the decoders to enable memory reuse (#3551 (comment)). In theory this could be considered a breaking change, since an empty string coming from an agent would no longer be recorded in output documents. In practice, it does not make sense for any of the metadata fields to have empty string values. To omit empty strings everywhere, we would have to change the behaviour of utility.Set to not record them. Because I have only modified metadata types, and not all model types, I instead changed the metadata types' Fields methods to stop using utility.Set and implemented a limited version of #3565 which is more explicit about omitting empty strings. These changes yield a significant performance improvement in micro-benchmarks, both in decoding and transformation.
Decoding improvements can be attributed to fewer allocations, while transformation improvements can be attributed to:

- fewer allocations
  - no interface allocations, or unnecessary deep copying of maps, due to utility.Set
  - lazy map construction
- less reflection, due to not using utility.Set
- less pointer indirection

```
name                   old time/op    new time/op    delta
pkg: github.com/elastic/apm-server/model/metadata  goos: linux  goarch: amd64
MetadataSet/minimal-8    1.16µs ± 6%    0.38µs ±11%  -67.59%  (p=0.008 n=5+5)
MetadataSet/full-8       11.9µs ± 4%     5.3µs ± 6%  -55.53%  (p=0.008 n=5+5)
pkg: github.com/elastic/apm-server/model/modeldecoder  goos: linux  goarch: amd64
DecodeMetadata-8         9.70µs ± 1%    9.30µs ±17%     ~     (p=0.690 n=5+5)

name                   old alloc/op   new alloc/op   delta
pkg: github.com/elastic/apm-server/model/metadata  goos: linux  goarch: amd64
MetadataSet/minimal-8      896B ± 0%      368B ± 0%  -58.93%  (p=0.008 n=5+5)
MetadataSet/full-8       14.0kB ± 0%     6.2kB ± 0%  -55.36%  (p=0.008 n=5+5)
pkg: github.com/elastic/apm-server/model/modeldecoder  goos: linux  goarch: amd64
DecodeMetadata-8         1.31kB ± 0%    1.06kB ± 0%  -18.96%  (p=0.000 n=5+4)

name                   old allocs/op  new allocs/op  delta
pkg: github.com/elastic/apm-server/model/metadata  goos: linux  goarch: amd64
MetadataSet/minimal-8      10.0 ± 0%       4.0 ± 0%  -60.00%  (p=0.008 n=5+5)
MetadataSet/full-8          114 ± 0%        68 ± 0%  -40.35%  (p=0.008 n=5+5)
pkg: github.com/elastic/apm-server/model/modeldecoder  goos: linux  goarch: amd64
DecodeMetadata-8           61.0 ± 0%      28.0 ± 0%  -54.10%  (p=0.008 n=5+5)
```

Commits:

* model/modeldecoder: benchmark DecodeMetadata
* Benchmark recycled memory decoding
* model/modeldecoder: update decoding
* model/metadata: use non-pointer fields
* Adapt inputs to model changes
* model/metadata: benchmark Metadata.Set
* model: fix golint error (Id->ID)
We have issues for the remaining work.
Nothing else to do here for now, so I'll close this out and we can open more specific issues as needed.
Motivation / summary
Currently the way Elastic APM works is by recording every single transaction as an individual document - including "unsampled transactions". In order to compute statistics, we use Elasticsearch's aggregations framework to query over the documents in real time.
This is a simple approach, and permits aggregation over arbitrary dimensions (filtering criteria), but it also comes with downsides such as higher storage cost and poorer aggregation/query performance.
See also: elastic/apm#104.
We will introduce an option to record pre-aggregated transaction duration histograms, using the histogram field type introduced in Elasticsearch 7.6.
Approach
The APM Server will take responsibility for aggregating and producing these histogram metrics. We may later also support agents producing the metrics in order to avoid sending unsampled transactions to the APM Server, but this is initially out of scope. To support that we would have the agents set a flag on events indicating that they have already been aggregated into a histogram.
Enabling transaction histogram metrics will not initially cause unsampled transactions to be dropped, so there will be no storage reduction -- only improved query performance. We will introduce a separate option for dropping unsampled transactions, and in a future major version (e.g. 8.0) we may start dropping unsampled transactions by default.
The histogram field will be used to power most if not all aggregations used in the APM UI, when the search bar is not in use. When the search bar is in use, we will fall back to the existing approach of querying over the individual documents. For the case where the histogram fields are used, the histogram metric documents must also include the context fields used by the default filters in the APM UI.
To support identifying trace groups (i.e. root transaction groups), we will flag the documents relating to root transactions with transaction.root: true. Thus, the APM Server must record histograms for each observed combination of the following fields:
The histogram field itself will be called transaction.duration.histogram. The exact algorithm and parameters to be used are TBD, but following suit with Elasticsearch and using HDRHistogram is likely.
RUM-specific support
For RUM, the server will also need to perform GeoIP lookup and User-Agent parsing, and include their results to power RUM-specific visualisations. Specifically we would also need these fields:
These might be tackled in a second phase, which would require the RUM visualisations to continue using the existing aggregation approach in the meantime. However, it may be too difficult to switch the UI over to using the metrics before RUM support exists, so we should aim to implement this as soon as possible to avoid delaying the UI implementation.
ML Anomaly Detection support
If unsampled transactions were dropped, the existing ML Anomaly Detection jobs would no longer be accurate. We will need to update the jobs to use aggregations based on the histogram field: https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-aggregation.html
Support/ramifications for SIEM
SIEM currently displays APM transactions as "Events", with two visualisations:
Dropping unsampled transactions will naturally lead to the event list only listing events for sampled transactions. For the chart we could go one of two ways: either have it match up with the list (i.e. count only sampled transactions), or base the aggregation on the histogram field.
Proposal: keep it simple and continue to base SIEM events off transaction documents. This means that if unsampled transactions are dropped, they will not show up in SIEM's event counts or event list.
Configuration
For various reasons, this feature will be opt-in when we introduce it, and in a later major version (e.g. 8.0) we would enable it by default for the default distribution. The reasons for initially making it opt-in are:
As mentioned, we will provide separate configuration for dropping unsampled transaction documents. This is separately configurable in order to maintain the ability to use the search bar to search over both sampled and unsampled transactions, and to support the RUM-specific map visualisation.
The exact configuration names are TBD.