Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extrapolate Jaeger transaction count from reported sampling rate #3722

Merged
merged 7 commits into from
Jul 1, 2020

Conversation

axw
Copy link
Member

@axw axw commented Apr 30, 2020

Motivation/summary

When Jaeger is configured to sample a percentage of traces, then the statistics reported in APM UI will be proportional to the sampling rate, and not the actual number of operations. Jaeger reports the sampling rate as a pair of tags (sampler.type and sampler.param) with each span. We can use these to extrapolate the number of transactions when performing aggregations.

Checklist

  • I have signed the Contributor License Agreement.
  • My code follows the style guidelines of this project (run make check-full for static code checks and linting)
  • I have rebased my changes on top of the latest master branch
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated CHANGELOG.asciidoc

How to test these changes

  1. Enable transaction aggregation (apm-server.aggregation.enabled=true)
  2. Run an app instrumented with Jaeger, configured with remote sampling
  3. Set a sampling rate of 0.1 for the service in APM UI
  4. Run 1000 operations
  5. Check there are 100 (i.e. 1000 * 0.1) transactions
  6. Check the sum of counts in transaction.duration.histogram fields is 1000 (e.g. using Histogram field type support for ValueCount and Avg aggregations elasticsearch#55933 when it lands)

Related issues

Closes #3011

@axw
Copy link
Member Author

axw commented Apr 30, 2020

Rounding error on this is going to be pretty bad for some sampling rates (e.g 0.3, 0.4), so it might be worth scaling all counts recorded in the in-memory histograms up when recording (e.g. multiply by 100), and then back down when creating metricset docs.

@codecov-io
Copy link

Codecov Report

Merging #3722 into master will decrease coverage by 0.05%.
The diff coverage is 96.66%.

@@            Coverage Diff             @@
##           master    #3722      +/-   ##
==========================================
- Coverage   80.35%   80.30%   -0.06%     
==========================================
  Files         131      131              
  Lines        6022     6042      +20     
==========================================
+ Hits         4839     4852      +13     
- Misses       1183     1190       +7     
Impacted Files Coverage Δ
model/transaction.go 88.09% <ø> (ø)
...ack/apm-server/aggregation/txmetrics/aggregator.go 93.50% <92.30%> (+0.16%) ⬆️
processor/otel/consumer.go 93.40% <100.00%> (+0.26%) ⬆️
kibana/connecting_client.go 64.51% <0.00%> (-8.07%) ⬇️
beater/jaeger/common.go 78.78% <0.00%> (-6.07%) ⬇️

Copy link
Contributor

@simitt simitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a fairly straight forward way to approximate the number of transactions. Not sure if it could lead to problematic edge cases, but I couldn't think of any that would be triggered by this handling. Would be great to see this getting in to provide better APM UI usability for Jaeger agent collected data.

@axw axw force-pushed the jaeger-histogram-aggregation branch from 69b8f86 to f266588 Compare June 25, 2020 07:25
@apmmachine
Copy link
Contributor

apmmachine commented Jun 25, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Pull request #3722 updated]

  • Start Time: 2020-07-01T06:39:20.946+0000

  • Duration: 47 min 7 sec

Test stats 🧪

Test Results
Failed 0
Passed 3217
Skipped 147
Total 3364

Steps errors

Expand to view the steps failures

  • Name: Compress

    • Description: tar --exclude=coverage-files.tgz -czf coverage-files.tgz coverage

    • Duration: 0 min 0 sec

    • Start Time: 2020-07-01T06:54:35.458+0000

    • log

  • Name: Compress

    • Description: tar --exclude=system-tests-linux-files.tgz -czf system-tests-linux-files.tgz system-tests

    • Duration: 0 min 0 sec

    • Start Time: 2020-07-01T07:05:41.752+0000

    • log

  • Name: Test Sync

    • Description: ./script/jenkins/sync.sh

    • Duration: 3 min 31 sec

    • Start Time: 2020-07-01T06:49:31.371+0000

    • log

@axw axw force-pushed the jaeger-histogram-aggregation branch from 6c7b5cf to e3aedd6 Compare June 25, 2020 09:20
@axw axw force-pushed the jaeger-histogram-aggregation branch from e3aedd6 to dcb130d Compare June 25, 2020 09:21
Once the UI side of this is in, the histograms will
be used for RPM graphs, which will take sampling into
account.
@axw axw marked this pull request as ready for review June 25, 2020 09:57
@axw axw requested review from simitt and bmorelli25 June 25, 2020 09:57
Copy link
Member

@bmorelli25 bmorelli25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs LGTM

@axw axw merged commit 76ac96e into elastic:master Jul 1, 2020
@axw axw deleted the jaeger-histogram-aggregation branch July 1, 2020 07:28
@axw axw added the v7.9.0 label Jul 1, 2020
axw added a commit to axw/apm-server that referenced this pull request Jul 1, 2020
…stic#3722)

* model: add Transaction.RepresentativeCount field

* jaeger: set Transaction.RepresentativeCount

Set Transaction.RepresentativeCount based on the
value of sampler.param if sampler.type=probabilistic.

* aggregation/txmetrics: use RepresentativeCount

* Update changelog

* docs: remove caveats about Jaeger sampling & RPMs

Once the UI side of this is in, the histograms will
be used for RPM graphs, which will take sampling into
account.

* Fix/add comments
axw added a commit that referenced this pull request Jul 2, 2020
…) (#3932)

* model: add Transaction.RepresentativeCount field

* jaeger: set Transaction.RepresentativeCount

Set Transaction.RepresentativeCount based on the
value of sampler.param if sampler.type=probabilistic.

* aggregation/txmetrics: use RepresentativeCount

* Update changelog

* docs: remove caveats about Jaeger sampling & RPMs

Once the UI side of this is in, the histograms will
be used for RPM graphs, which will take sampling into
account.

* Fix/add comments
jalvz added a commit to jalvz/apm-server that referenced this pull request Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Jaeger] How to support sampling
5 participants