Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry .NET Tracing AspNetCore instrumentation is inefficient #1745

Closed
dpk83 opened this issue Feb 22, 2023 · 24 comments
Closed

OpenTelemetry .NET Tracing AspNetCore instrumentation is inefficient #1745

dpk83 opened this issue Feb 22, 2023 · 24 comments
Labels
comp:instrumentation.aspnetcore Things related to OpenTelemetry.Instrumentation.AspNetCore enhancement New feature or request

Comments

@dpk83
Copy link

dpk83 commented Feb 22, 2023

Feature Request

It's more of an issue than a feature request. We care a lot about performance for our services and in our local benchmarking we found that AspNetCoreInstrumentation for tracing is extremely inefficient. In a simple AspNetCore application with an extremely simple rest endpoint enabling tracing reduces the service throughput by around 15% which is huge for serving telemetry. (Drops from 10K RPS to 8.6K RPS)

Is your feature request related to a problem?

Yes, telemetry collection should be high performance and shouldn't impact service throughput.

Describe the solution you'd like:

Improve the performance of OTel instrumentation so it's not impacting service throughput.

Describe alternatives you've considered.

A custom middleware which is only using resources for processing what's essential on top of what AspNetCore out of box tracing instrumentation already provides.

Additional Context

The code for application used to measure the throughput and the results can be found here https://github.com/dpk83/otelperfimpactanalysis/tree/main/TelemetryPerfImpactAnalysis

@dpk83 dpk83 added the enhancement New feature or request label Feb 22, 2023
@dpk83 dpk83 changed the title OpenTelemetry Tracing AspNetCore instrumentation is extremely inefficient OpenTelemetry .NET Tracing AspNetCore instrumentation is extremely inefficient Feb 22, 2023
@dpk83
Copy link
Author

dpk83 commented Feb 22, 2023

To add:

  • This benchmarking is done without any exporter (Adding exporter will likely reduce the throughput even further)
  • The requests also didn't have any baggage. I suspect requests with baggage will likely reduce the throughput even further.

@cijothomas
Copy link
Member

https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Instrumentation/AspNetCoreInstrumentationBenchmarks.cs

The benchmarks above show a totally different story. The proper way to test overall web app throughput reduction is likely by using bombardier similar tools which simulates real scenarios. Maybe add new perf tests using new approach first, and keep benchmark.net for micro optimizations/measurements only.

@martinjt
Copy link
Member

martinjt commented Mar 3, 2023

@dpk83 there's a couple of point here that I think are interesting, I've done very specific analysis on this along side the benchmarks in the repo.

Your analysis of the middleware you added seems to suggest that it's a like-for-like replacement, which it is not. There is much more done to create a full activity, including things like trace propagation and also filling out additional properties too. That's not to say that it isn't valid that there should be a smaller instrumentation for those that want it, just that it's not a like-for-like comparison.

Your middleware approach does nothing with the Activity if I'm looking at it correctly. This means that, ultimately, there is no "work" being done, and therefore all you're testing is that you can add a tag and that it doesn't affect the load. I would highly recommend showing this as a full implementation. Specifically, use .AddSource("'Microsoft.AspNet") with the sdk instead of your ActivityListener to get a closer comparison as that would show whether you're saying it's OpenTelemetry, or specifically the Instrumentation as you suggest.

Baggage will cause some additional load as it has to be copied and put onto the outbound requests. This is more to do with Propagation than it is to do with AspNetCore instrumentation.

Exporters, as long as they're using Batch, shouldn't affect inline performance, but will increase memory usage.

I think "extremely" in the title here is incendiary and not required. This is quite an aggressive post, I'm hoping that it's coming from a place of frustration and not a criticism. If there are some optimizations you can see to the way it works, then I'd recommend working with the team to get those optimizations made for the benefit of everyone.

Telemetry is always additional processing, that's something that everyone needs to understand, and therefore will always cause some degree of degradation in throughput/latency and/or increase in memory. There's a trade off here, do you fly blind without tracing so you can have a smaller amount of servers running your application? or take the hit of adding an additional server and/or scaling up so you can have the telemetry.

@dpk83
Copy link
Author

dpk83 commented Mar 7, 2023

There is much more done to create a full activity, including things like trace propagation and also filling out additional properties too. That's not to say that it isn't valid that there should be a smaller instrumentation for those that want it, just that it's not a like-for-like comparison.
Your middleware approach does nothing with the Activity if I'm looking at it correctly. This means that, ultimately, there is no "work" being done, and therefore all you're testing is that you can add a tag and that it doesn't affect the load. I would highly recommend showing this as a full implementation. Specifically, use .AddSource("'Microsoft.AspNet") with the sdk instead of your ActivityListener to get a closer comparison as that would show whether you're saying it's OpenTelemetry, or specifically the Instrumentation as you suggest.

In the current benchmarking none of the additional functionality beyond what's already provided by AspNetCore is used, so it is actually a like-for-like comparison for the required functionality and I don't expect OpenTelemetry to have additional overhead on top of it. I intentionally kept is simple because that's going to be the case for majority of the cases where baggage isn't modified or the tracing isn't going through multiple different vendor implementations. Also, leave the middleware aside, just comparing the throughput for binary scenarios i.e. tracing disabled vs tracing enabled itself is sufficient to the message I am trying to put forth.

Baggage will cause some additional load as it has to be copied and put onto the outbound requests. This is more to do with Propagation than it is to do with AspNetCore instrumentation.

Yeah, I intentionally left out baggage in the benchmarking because I wanted to make a point here for the very basic functionality where I don't expect any additional overhead on top of AspNetCore. Regarding baggage it will cause further reduction in throughput. The way currently baggage part is designed in the library can definitely be improved further. For our cases >95% of the cases will not modify the baggage as the request is passing through the system, so transforming the baggage header into the object and then retransforming it back to it's way out is clearly an unnecessary overhead. Such processing should be moved to more on-demand basis (i.e. on the first such action where it's required). I also believe there are additional opportunities to extract more performance out of it even for cases where baggage needs to be transformed and used.

I think "extremely" in the title here is incendiary and not required. This is quite an aggressive post,

That's fair, I will remove the extremely word. But to me telemetry should have minimal overhead and this is actually quite a lot of overhead.

Telemetry is always additional processing, that's something that everyone needs to understand, and therefore will always cause some degree of degradation in throughput/latency and/or increase in memory. There's a trade off here, do you fly blind without tracing so you can have a smaller amount of servers running your application? or take the hit of adding an additional server and/or scaling up so you can have the telemetry.

Yes it's additional processing but that additional processing should be minimal so my services can utilize the system's resources efficiently for the actual business logic. Why would one use OpenTelemetry .NET library if they are able to achieve the same result with way lower processing and cost implications by writing their own solution or some other solution? For smaller services or adhoc usage it's fine but for large complex services which are processing millions and billions of requests a day telemetry costs isn't insignificant and matters a lot.

@dpk83 dpk83 changed the title OpenTelemetry .NET Tracing AspNetCore instrumentation is extremely inefficient OpenTelemetry .NET Tracing AspNetCore instrumentation is inefficient Mar 7, 2023
@cijothomas
Copy link
Member

cijothomas commented Mar 22, 2023

Adding this issue to the Instrumentation-1.0.0 milestone - we cannot release 1.0 until the perf is acceptable (and prove that with actual numbers).

Few things:

  1. Decide how to measure performance for instrumentations.
  2. Decide the acceptable perf overhead. We could find the theoretical max (by doing a custom middleware populating activities with right tags etc.) and add OTel tax.
  3. Do the actual fixes to improve perf. @vishweshbankwar has done a lot in this area in the past, but pretty sure we can do more.
    • We iterate through activity tag (more than once) to find things. Might be avoidable.
    • The Propagator might be doing duplicate works.
      We need to identify where the perf cost is coming from. Are they from the DS callbacks fired by asp.net core itself? OR in our instrumentation code? Or OTel SDK (It can't be otel sdk, as benchmarks show SDK perf is very good).

@dpk83 would you have time to send a PR to this repo with perf test instructions using bombardier? Its okay to be manual at first, we can see how to automate it to orchestrate and run and show results, just like benchmark tool.

@CodeBlanch
Copy link
Member

@dpk83 There is a lot of work that has to be done in the AspNetCoreInstrumention to support OTel baggage because .NET doesn't do it.

The requests also didn't have any baggage. I suspect requests with baggage will likely reduce the throughput even further.

If you aren't using baggage, try doing this...

// When the default propagator is just w3c we respect the work AspNetCore does instead of calling the SDK logic.
// See: https://github.com/open-telemetry/opentelemetry-dotnet/blob/e9226cfadf3eb4b58fb8c087c61291a3569ccea3/src/OpenTelemetry.Instrumentation.AspNetCore/Implementation/HttpInListener.cs#L133
Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator());

That should make the OTel perf more in line with your middleware. It is still doing more work, so it will be more of a hit, but it should help a lot.

If you do want to use OTel baggage in your services, fix your middleware because it makes your tests invalid 😉

@dpk83
Copy link
Author

dpk83 commented Mar 22, 2023

If you aren't using baggage, try doing this...
// When the default propagator is just w3c we respect the work AspNetCore does instead of calling the SDK logic.
// See: https://github.com/open-telemetry/opentelemetry-dotnet/blob/e9226cfadf3eb4b58fb8c087c61291a3569ccea3/src/OpenTelemetry.Instrumentation.AspNetCore/Implementation/HttpInListener.cs#L133
Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator());

This is good to know but I think there are opportunities to improve the baggage handling as well.

  • First creating it on demand as opposed to always (i.e. create it only if an API that's interacting with baggage is invoked by some operation)
  • Same goes for the outgoing side where if the baggage was never used/modified that you can avoid all the creation/serialization stuff and just pass the header from incoming request as is.
  • Even when baggage is created briefly looking at the code it looks like there is room for improvements and treating baggage as a string might actually get you better perf, for majority of the audience (assuming even for the cases where baggage is modified the number of entries added/removed are low)

@CodeBlanch
Copy link
Member

Great ideas to improve baggage! Mind spinning up an issue to track that?

For the case here where baggage isn't being used (probably very common case) what might be more immediately beneficial is for OTel .NET to use the .NET propagation API (somehow) to avoid the duplicate extraction/injection and activity creation. IIRC that is one thing @vishweshbankwar was looking into.

@dpk83
Copy link
Author

dpk83 commented Mar 22, 2023

Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator());

@CodeBlanch I tried this and it's actually causing a different activity everytime. All the exported chained activities have different traceId.

@cijothomas
Copy link
Member

Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator());

@CodeBlanch I tried this and it's actually causing a different activity everytime. All the exported chained activities have different traceId.

Please open a new issue with this one, so we dont add noise to this particular issue to track performance improvements.

@dpk83
Copy link
Author

dpk83 commented Mar 23, 2023

@dpk83
Copy link
Author

dpk83 commented May 4, 2023

This is an important issue so checking for the status here. It is on track for getting fixed in the upcoming stable OpenTelemetry release, right?

@cijothomas
Copy link
Member

This is an important issue so checking for the status here. It is on track for getting fixed in the upcoming stable OpenTelemetry release, right?

If you are referring to the 1.5 Stable Release (that is the next stable), then no. This issue is not planned for that milestone.

This is something to be addressed definitely for the Stable release milestone. No firm ETA on that yet.

@dpk83
Copy link
Author

dpk83 commented May 5, 2023

@reyang This is an important ask and I was expecting this to be addressed in 1.5 stable release. We have been pushing services across our org to adopt distributed tracing and the perf penalty here is unacceptable. We definitely want to avoid creating our own solution here because OpenTelemetry didn't provide the performant solution.

@reyang
Copy link
Member

reyang commented May 5, 2023

@reyang This is an important ask and I was expecting this to be addressed in 1.5 stable release. We have been pushing services across our org to adopt distributed tracing and the perf penalty here is unacceptable. We definitely want to avoid creating our own solution here because OpenTelemetry didn't provide the performant solution.

Sorry I'm confused, I think we all agree that this is part of the stable release. OpenTelemetry .NET 1.5 milestone is not planning to release a stable version of OpenTelemetry.Instrumentation.AspNetCore. Where is the expectation (that 1.5 release will ship stable OpenTelemetry.Instrumentation.AspNetCore and perf improvement will happen right now) coming from?

@dpk83
Copy link
Author

dpk83 commented May 5, 2023

Oh so you are saying this will only be available when AspNetCore instrumentation becomes stable. What are the timelines for that? Based on the timelines we will have to evaluate whether we can wait or we need to switch gears

@reyang
Copy link
Member

reyang commented May 5, 2023

Oh so you are saying this will only be available when AspNetCore instrumentation becomes stable.

Where is this coming from? I didn't say that.

@dpk83
Copy link
Author

dpk83 commented May 6, 2023

Okay looks like there is some problem on my side and I am not able to understand the statements properly.

Let’s simplify things:
Could you please share the ETA when a stable release with the fix for this issue is available?

@reyang
Copy link
Member

reyang commented May 6, 2023

@cijothomas already replied #1745 that we don't have the ETA yet.

@vishweshbankwar
Copy link
Member

vishweshbankwar commented Dec 13, 2023

@dpk83 : Sharing an update here based on my investigation

Regarding the issue about custom Middleware being more efficient than instrumentation library, I think this is not accurate. The sample shared is not 1:1 comparison with the instrumentation library.

I ran some tests and as per my investigation, custom Middleware has slightly slower throughput compared to instrumentation library. Please see: open-telemetry/opentelemetry-dotnet#5163

That being said there are few areas of improvements we can further consider, some of them shared in this thread above as well

  • Allow selecting particular set of tags based on user requirements. Setting fewer tags on activity could help with perf.
  • Propagation: We are currently doing this twice due to difference in Propagator APIs between OTel and .NET. This is another area where we could save some time. We are looking to get this fixed when the instrumentation is supported natively in ASP.NET Core.
  • Look for opportunities to further reduce the allocation cost associated with tag values. We have done couple of improvements in the past months but there could be more.
  • Investigate if cost associated with activity creation can be reduced in System.Diagnostics.DiagnosticSource.
  • Sampling: This needs no change in instrumentation but is another way to reduce cost associated with activity creation/enrichment.

We plan to continue the discussions/efforts on improving perf and also work with .NET team for bringing support for trace instrumentation natively in .NET9.0. ref: dotnet/aspnetcore#52439.

@reyang reyang transferred this issue from open-telemetry/opentelemetry-dotnet May 13, 2024
@reyang reyang added the comp:instrumentation.aspnetcore Things related to OpenTelemetry.Instrumentation.AspNetCore label May 13, 2024
@martinjt
Copy link
Member

martinjt commented Jun 2, 2024

This is an ongoing task to optimize and therefore this isn't a tracking issue. If you have specific areas and suggestions to improve them, I'd suggest they're split out into separate issues for you to work.

@martinjt martinjt closed this as completed Jun 2, 2024
@cijothomas
Copy link
Member

@cijothomas cijothomas reopened this Aug 29, 2024
@cijothomas
Copy link
Member

dotnet/runtime#107333 Should fix the contention from runtime itself.

@cijothomas
Copy link
Member

Closing as dotnet/runtime#107333 is merged, and will be part of .NET 10.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:instrumentation.aspnetcore Things related to OpenTelemetry.Instrumentation.AspNetCore enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants