Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce transaction histogram metrics #3485

Closed
axw opened this issue Mar 15, 2020 · 6 comments
Closed

Introduce transaction histogram metrics #3485

axw opened this issue Mar 15, 2020 · 6 comments
Assignees
Milestone

Comments

@axw
Copy link
Member

axw commented Mar 15, 2020

Motivation / summary

Currently the way Elastic APM works is by recording every single transaction as an individual document - including "unsampled transactions". In order to compute statistics, we use Elasticsearch's aggregations framework to query over the documents in real time.

This is a simple approach, and permits aggregation over arbitrary dimensions (filtering criteria), but also comes with some downsides such as higher storage cost and aggregation/query performance.

See also: elastic/apm#104.

We will introduce an option to record pre-aggregated transaction duration histograms, using the histogram field type introduced in Elasticsearch 7.6.

Approach

The APM Server will take responsibility for aggregating and producing these histogram metrics. We may later also support agents producing the metrics in order to avoid sending unsampled transactions to the APM Server, but this is initially out of scope. To support that we would have the agents set a flag on events indicating that they have already been aggregated into a histogram.

Enabling transaction histogram metrics will not initially cause unsampled transactions to be dropped, so there will be no storage reduction -- only improved query performance. We will introduce a separate option for dropping unsampled transactions, and in a future major version (e.g. 8.0) we may start dropping unsampled transactions by default.

The histogram field will be used to power most if not all aggregations used in the APM UI, when the search bar is not in use. When the search bar is in use, we will fall back to the existing approach of querying over the individual documents. For the case where the histogram fields are used, the histogram metric documents must also include the context fields used by the default filters in the APM UI.

To support identifying trace groups (i.e. root transaction groups), we will flag the documents relating to root transactions with transaction.root: true.

Thus, the APM Server must record histograms for each observed combination of the following fields:

  • agent.name
  • service.name
  • service.version (for deployment annotations)
  • service.environment
  • transaction.name
  • transaction.type
  • transaction.result
  • transaction.root
  • host.hostname
  • container.id
  • kubernetes.pod.name

The histogram field itself will be called transaction.duration.histogram. The exact algorithm and parameters to be used is TBD, but following suit with Elasticsearch and using HDRHistogram is likely.

RUM-specific support

For RUM, the server will also need to perform GeoIP lookup and User-Agent parsing, and include their results to power RUM-specific visualisations. Specifically we would also need these fields:

  • client.geo.country_iso_code
  • user_agent.name

These might be tackled in a second phase, which would require the RUM visualisations to continue using the existing aggregation approach in the mean time. However, it may be too difficult to switch the UI over to using the metrics before RUM support exists, so we should aim to implement this as soon as possible to avoid delaying the UI implementation.

ML Anomaly Detection support

If one were to drop unsampled transactions, then the existing ML Anomaly Detection jobs would no longer be accurate. We will need to update the jobs to use aggregations based on the histogram field: https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-aggregation.html

Support/ramifications for SIEM

SIEM currently displays APM transactions as "Events", with two visualisations:

  • a bar chart of the number of number of events (presumably, a date_histogram)
  • a list of events/transactions

Dropping unsampled transactions will naturally lead to the event list only listing events for sampled transactions. For the chart we could go one of two ways: either have it match up with the list (i.e. count only sampled transactions), or base the aggregation on the histogram field.

Proposal: keep it simple and continue to base SIEM events off transaction documents. This means that if unsampled transactions are dropped, they will not show up in SIEM's event counts or event list.

Configuration

For various reasons, this feature will be opt-in when we introduce it, and in a later major version (e.g. 8.0) we would enable it by default for the default distribution. The reasons for initially making it opt-in are:

  • backwards compatibility for older versions of APM UI/Kibana
  • the histogram field type is available only under the Elastic license
  • using this approach effectively may require deployment changes, moving the APM Server to the edge machines to avoid high cardinality aggregation dimensions such as hostname, container name, etc.

As mentioned, we will provide separate configuration for dropping unsampled transaction documents. This is separately configurable in order to maintain the ability to use the search bar to search over both sampled and unsampled transactions, and to support the RUM-specific map visualisation.

The exact configuration names are TBD.

@axw axw added this to the 7.8 milestone Mar 15, 2020
@axw
Copy link
Member Author

axw commented Mar 19, 2020

Another TBD, requiring research: what's the aggregation/publishing interval for these metrics? Too long, and it'll create noticeable delays for stats showing up in Kibana. Too short, and we won't see storage or query improvements.

@axw
Copy link
Member Author

axw commented Mar 19, 2020

For RUM, we'll probably need to add some other user-agent fields over time to satisfy elastic/apm#198.

Another option for these user-agent fields is we just use the original user-agent value in the grouping key, and continue to rely on ingest node to pull it apart in the resulting metrics docs.

The only question I have about that is whether the original user-agent string would be too high-cardinality normally. High-cardinality isn't a problem over time, only within the short term – the aggregation interval. After aggregation intervals we would reset and reuse space.

We'll still need to deal with malicious high-cardinality values anyway, since RUM is unauthenticated. To deal with that we could have a fixed maximum set of aggregation buckets, and once they're full we would devolve to creating a single-value histogram metric document for each other grouping key.

@axw
Copy link
Member Author

axw commented Mar 19, 2020

If we do end up parsing the user-agent in the server, here's a couple of reasonable-ish off-the-shelf options:

uap-go would be more in line with the ingest-node processor, which uses the regexes.yaml from https://github.com/ua-parser/uap-core. However, the Go implementation appears to be a bit out of date, and looking at its API, I suspect might have some performance issues.

mssola/user_agent looks decent, but has hard coded logic, unlike ingest-node. I'm not sure if we could get away with that.

@dgieselaar
Copy link
Member

@axw it would be ideal for us as well to have transaction.duration.histogram on regular transaction documents as well - that way we can run roughly the same queries on transaction and metrics.

@balusarakesh
Copy link

balusarakesh commented Mar 19, 2020

FYI: we are trying very hard to use APM for our production services and replace the commercial monitoring tools. Disk space is one of the main concerns we have right now as we get so much traffic and APM generates around 300-400GB on a good day. We are so excited for this feature.

  • Thank you for all your work on APM

axw added a commit that referenced this issue Apr 6, 2020
The primary motivation behind this change is to lay the groundwork for merging shared (i.e. stream) and per-event metadata at decode time, rather than transformation time, which we'll need for #3485.

We could merge metadata without these changes, but it would be more difficult and error prone. Making these change also provide some performance improvements – see below. Finally, there is also overlap between merging metadata and revising the decoders to enable memory use (#3551 (comment)).

In theory this could be a considered a breaking change, due to the fact that an empty string coming from an agent would no longer be recorded in output documents. In practice, it does not make sense for any of the metadata fields to have empty string values.

Due to the use of empty strings, we would have to change the behaviour of utility.Set to not record empty strings. Because I have only modified metadata types, and not all model types, I instead changed the metadata types' Fields methods to stop using utility.Set and implemented a limited version of #3565 which is more explicit about omitting empty strings.

These changes yield a significant performance improvement in micro-benchmarks, both in decoding and transformation. Decoding improvements can be attributed to fewer allocations, while transformation improvements can be attributed to:

- fewer allocations
-- no interface allocations, or unnecessary deep copying of maps, due to utility.Set
-- lazy map construction
- less reflection, due to not using utility.Set
- less pointer indirection

name                   old time/op    new time/op    delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8    1.16µs ± 6%    0.38µs ±11%  -67.59%  (p=0.008 n=5+5)
MetadataSet/full-8       11.9µs ± 4%     5.3µs ± 6%  -55.53%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8         9.70µs ± 1%    9.30µs ±17%     ~     (p=0.690 n=5+5)

name                   old alloc/op   new alloc/op   delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8      896B ± 0%      368B ± 0%  -58.93%  (p=0.008 n=5+5)
MetadataSet/full-8       14.0kB ± 0%     6.2kB ± 0%  -55.36%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8         1.31kB ± 0%    1.06kB ± 0%  -18.96%  (p=0.000 n=5+4)

name                   old allocs/op  new allocs/op  delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8      10.0 ± 0%       4.0 ± 0%  -60.00%  (p=0.008 n=5+5)
MetadataSet/full-8          114 ± 0%        68 ± 0%  -40.35%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8           61.0 ± 0%      28.0 ± 0%  -54.10%  (p=0.008 n=5+5)

* model/modeldecoder: benchmark DecodeMetadata

* Benchmark recycled memory decoding

* model/modeldecoder: update decoding

* model/metadata: use non-pointer fields

* Adapt inputs to model changes

* model/metadata: benchmark Metadata.Set

* model: fix golint error (Id->ID)
@axw axw self-assigned this Apr 7, 2020
axw added a commit to axw/apm-server that referenced this issue Apr 8, 2020
The primary motivation behind this change is to lay the groundwork for merging shared (i.e. stream) and per-event metadata at decode time, rather than transformation time, which we'll need for elastic#3485.

We could merge metadata without these changes, but it would be more difficult and error prone. Making these change also provide some performance improvements – see below. Finally, there is also overlap between merging metadata and revising the decoders to enable memory use (elastic#3551 (comment)).

In theory this could be a considered a breaking change, due to the fact that an empty string coming from an agent would no longer be recorded in output documents. In practice, it does not make sense for any of the metadata fields to have empty string values.

Due to the use of empty strings, we would have to change the behaviour of utility.Set to not record empty strings. Because I have only modified metadata types, and not all model types, I instead changed the metadata types' Fields methods to stop using utility.Set and implemented a limited version of elastic#3565 which is more explicit about omitting empty strings.

These changes yield a significant performance improvement in micro-benchmarks, both in decoding and transformation. Decoding improvements can be attributed to fewer allocations, while transformation improvements can be attributed to:

- fewer allocations
-- no interface allocations, or unnecessary deep copying of maps, due to utility.Set
-- lazy map construction
- less reflection, due to not using utility.Set
- less pointer indirection

name                   old time/op    new time/op    delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8    1.16µs ± 6%    0.38µs ±11%  -67.59%  (p=0.008 n=5+5)
MetadataSet/full-8       11.9µs ± 4%     5.3µs ± 6%  -55.53%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8         9.70µs ± 1%    9.30µs ±17%     ~     (p=0.690 n=5+5)

name                   old alloc/op   new alloc/op   delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8      896B ± 0%      368B ± 0%  -58.93%  (p=0.008 n=5+5)
MetadataSet/full-8       14.0kB ± 0%     6.2kB ± 0%  -55.36%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8         1.31kB ± 0%    1.06kB ± 0%  -18.96%  (p=0.000 n=5+4)

name                   old allocs/op  new allocs/op  delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8      10.0 ± 0%       4.0 ± 0%  -60.00%  (p=0.008 n=5+5)
MetadataSet/full-8          114 ± 0%        68 ± 0%  -40.35%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8           61.0 ± 0%      28.0 ± 0%  -54.10%  (p=0.008 n=5+5)

* model/modeldecoder: benchmark DecodeMetadata

* Benchmark recycled memory decoding

* model/modeldecoder: update decoding

* model/metadata: use non-pointer fields

* Adapt inputs to model changes

* model/metadata: benchmark Metadata.Set

* model: fix golint error (Id->ID)
axw added a commit that referenced this issue Apr 8, 2020
The primary motivation behind this change is to lay the groundwork for merging shared (i.e. stream) and per-event metadata at decode time, rather than transformation time, which we'll need for #3485.

We could merge metadata without these changes, but it would be more difficult and error prone. Making these change also provide some performance improvements – see below. Finally, there is also overlap between merging metadata and revising the decoders to enable memory use (#3551 (comment)).

In theory this could be a considered a breaking change, due to the fact that an empty string coming from an agent would no longer be recorded in output documents. In practice, it does not make sense for any of the metadata fields to have empty string values.

Due to the use of empty strings, we would have to change the behaviour of utility.Set to not record empty strings. Because I have only modified metadata types, and not all model types, I instead changed the metadata types' Fields methods to stop using utility.Set and implemented a limited version of #3565 which is more explicit about omitting empty strings.

These changes yield a significant performance improvement in micro-benchmarks, both in decoding and transformation. Decoding improvements can be attributed to fewer allocations, while transformation improvements can be attributed to:

- fewer allocations
-- no interface allocations, or unnecessary deep copying of maps, due to utility.Set
-- lazy map construction
- less reflection, due to not using utility.Set
- less pointer indirection

name                   old time/op    new time/op    delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8    1.16µs ± 6%    0.38µs ±11%  -67.59%  (p=0.008 n=5+5)
MetadataSet/full-8       11.9µs ± 4%     5.3µs ± 6%  -55.53%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8         9.70µs ± 1%    9.30µs ±17%     ~     (p=0.690 n=5+5)

name                   old alloc/op   new alloc/op   delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8      896B ± 0%      368B ± 0%  -58.93%  (p=0.008 n=5+5)
MetadataSet/full-8       14.0kB ± 0%     6.2kB ± 0%  -55.36%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8         1.31kB ± 0%    1.06kB ± 0%  -18.96%  (p=0.000 n=5+4)

name                   old allocs/op  new allocs/op  delta
pkg:github.com/elastic/apm-server/model/metadata goos:linux goarch:amd64
MetadataSet/minimal-8      10.0 ± 0%       4.0 ± 0%  -60.00%  (p=0.008 n=5+5)
MetadataSet/full-8          114 ± 0%        68 ± 0%  -40.35%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/model/modeldecoder goos:linux goarch:amd64
DecodeMetadata-8           61.0 ± 0%      28.0 ± 0%  -54.10%  (p=0.008 n=5+5)

* model/modeldecoder: benchmark DecodeMetadata

* Benchmark recycled memory decoding

* model/modeldecoder: update decoding

* model/metadata: use non-pointer fields

* Adapt inputs to model changes

* model/metadata: benchmark Metadata.Set

* model: fix golint error (Id->ID)
@axw
Copy link
Member Author

axw commented Sep 9, 2020

We have issues for remaining work:

Nothing else to do here for now, so I'll close this out and we can option more specific issues as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants