Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTel Cassandra/Elasticsearch Exporter queue defaults #2533

Merged

Conversation

joe-elliott
Copy link
Member

Which problem is this PR solving?

While working on #2494 I discovered that the default otel config was pushing far less spans/second to my Scylla backend. After some research it was discovered that it's b/c the default otel collector config only has 10 workers while the default jaeger collector config has 50.

Short description of the changes

This PR uses the default collector queue settings for both the cassandra and elasticsearch exporters' queue settings. These are the two "production" exporters supported

This does bring up another concern that might make sense to discuss as part of this issue. By default the ingester pulls from the Kafka queue and immediately pushes its spans into the configured backend, but a no-queue configuration of the otel collector does not seem possible at the moment. The queue can technically be disabled but performance is abysmal. This requires more research.

There are certainly other ways to approach "fixing" this. Including leaving it alone and making the user change the configuration.

@joe-elliott joe-elliott requested a review from a team as a code owner October 2, 2020 20:36
@codecov
Copy link

codecov bot commented Oct 2, 2020

Codecov Report

Merging #2533 into master will increase coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2533      +/-   ##
==========================================
+ Coverage   95.32%   95.33%   +0.01%     
==========================================
  Files         208      208              
  Lines        9248     9248              
==========================================
+ Hits         8816     8817       +1     
+ Misses        355      354       -1     
  Partials       77       77              
Impacted Files Coverage Δ
cmd/query/app/static_handler.go 79.81% <0.00%> (-2.76%) ⬇️
...lugin/sampling/strategystore/adaptive/processor.go 100.00% <0.00%> (+0.92%) ⬆️
cmd/query/app/server.go 90.16% <0.00%> (+1.63%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d37ebc...b9a3ec1. Read the comment docs.

@yurishkuro
Copy link
Member

Worth noting that the size 50 is pulled completely out of thin air :-). There's probably some queueing theory approach one could use, but we haven't done any testing with that. It's actually a multi-variate optimization since while playing with worker count could be optimized for a single collector w.r.t. its CPU utilization, it's far less clear what impact this would have on the storage. For example, Cassandra connection supports up to 128 (if not more) parallel streams, which could suggest matching number of worker threads, but I am sure the C servers would not be happy with that many (times the number of collectors). Quite curious if someone has an idea what formula could be used here.

@joe-elliott
Copy link
Member Author

Worth noting that the size 50 is pulled completely out of thin air :-).

I am unsurprised that 50 isn't rigourously proven to give the best performance per collector :).

The main reason I think this PR is important is that it makes the otel collector/ingester layers perform roughly 1:1 with the current collector/ingester. Before they were doing half the work or less which may be a surprise to someone trying to swap to the new otel collector based pipeline.

@objectiser objectiser merged commit 1e5b7b8 into jaegertracing:master Oct 6, 2020
@objectiser
Copy link
Contributor

@joe-elliott Seems reasonable having the same defaults as currently used in Jaeger. Just to confirm, these defaults can still be overridden using the CLI flags?

@joe-elliott
Copy link
Member Author

@objectiser These defaults can be overridden using the OTEL config, but not the Jaeger CLI. Currently the Jaeger CLI options collector.num-workers and collector.queue-size are unsupported. This PR does not change that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants