Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api] Fix memory leaks in TracerProvider.GetTracer API #4906

Merged
merged 13 commits into from
Oct 5, 2023

Conversation

CodeBlanch
Copy link
Member

@CodeBlanch CodeBlanch commented Sep 29, 2023

Changes

  • TracerProvider now maintains a cache of the Tracers it has issued. When disposed it will turn them into no-op instances and release their associated ActivitySources.

Details

Consider the following simple application:

using var tracerProvider = Sdk.CreateTracerProviderBuilder().Build();

for (int i = 0; i < 2_500_000; i++)
{
    var tracer = tracerProvider.GetTracer("MyTracer");

    if (i % 500_000 == 0)
    {
        GC.Collect();
    }
}

Running that we will see memory growing per iteration that is never released:

image

What's going on here?

Today we create a Tracer each time GetTracer is called which is handed its own ActivitySource. Creating spurious ActivitySources is dangerous because there is a static list of all active sources.

Tracer does NOT implement IDisposable so users aren't given a chance to do this correctly.

After the cache introduced on this PR the graph looks like this:

image

Merge requirement checklist

  • CONTRIBUTING guidelines followed (nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes

@codecov
Copy link

codecov bot commented Sep 29, 2023

Codecov Report

Merging #4906 (83e550b) into main (3e885c7) will increase coverage by 0.29%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4906      +/-   ##
==========================================
+ Coverage   83.21%   83.51%   +0.29%     
==========================================
  Files         295      295              
  Lines       12294    12324      +30     
==========================================
+ Hits        10231    10292      +61     
+ Misses       2063     2032      -31     
Flag Coverage Δ
unittests 83.51% <94.11%> (+0.29%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/OpenTelemetry.Api/Trace/Tracer.cs 92.68% <100.00%> (+0.18%) ⬆️
src/OpenTelemetry.Api/Trace/TracerProvider.cs 93.93% <93.33%> (-6.07%) ⬇️

... and 6 files with indirect coverage changes

@CodeBlanch CodeBlanch added the pkg:OpenTelemetry.Api Issues related to OpenTelemetry.Api NuGet package label Oct 2, 2023
@CodeBlanch CodeBlanch marked this pull request as ready for review October 3, 2023 23:27
@CodeBlanch CodeBlanch requested a review from a team October 3, 2023 23:27
{
if (this.tracers == null)
{
// Note: We check here for a race with Dispose and return a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we need to set this.tracers = null inside the same lock. Else we could still run into a situation where some thread calling Dispose sets this.tracers to null after this if check and before the new entry is added to the dictionary. We would want to return a no-op tracer in that case, but we would end up returning a valid tracer.

Copy link
Member Author

@CodeBlanch CodeBlanch Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked it a couple times. I think it is good! Could be I'm not seeing something though. Can you write out a flow for me that you think is flawed? Here are a couple flows I'm imagining.

Case where Dispose runs in the middle of the writer and gets the lock...

  • Writer thread reads the this.tracers on Line 58. It is valid so it begins its work.
  • Dispose thread sets this.tracers to null.
  • Dispose thread takes the lock.
  • Reader thread misses the cache and tries to take the lock. It has to wait.
  • Dispose thread finishes its clean up and releases the lock.
  • Writer thread gets the lock. Now it checks this.tracers == null. This will be true now and it will return a no-op instance.

Case where Dispose runs in the middle of the writer and waits on the lock...

  • Writer thread reads the this.tracers on Line 58. It is valid so it begins its work.
  • Reader thread misses the cache and takes the lock. Inside the lock it checks this.tracers == null which is false. It begins to do its work.
  • Dispose thread sets this.tracers to null.
  • Dispose thread tries to takes the lock. It has to wait.
  • Writer thread adds a new tracer to the cache and releases the lock. It doesn't care that this.tracers is now actually null because it is working on a local copy.
  • Dispose thread gets the lock and makes all the tracers in the cache no-ops including the one that was just added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For case 2,

Writer thread adds a new tracer to the cache and releases the lock. It doesn't care that this.tracers is now actually null because it is working on a local copy.

I think this is more of design choice. Yes, it doesn't care that this.tracers is now actually null but it could care about it 😄.

I was thinking we could offer a stronger guarantee that we would never return a Tracer when TracerProvider is disposed or being disposed. We could avoid this limbo state where the Dispose method may or may not have marked the newly returned Tracer no-op when its being used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged the PR because I think what's there will work well enough. I'll circle back to this comment when I have a sec to see if I can simplify it or clean it up in a way that doesn't introduce a bunch of contention.

Copy link
Contributor

@utpilla utpilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a non-blocking comment: #4906 (comment)

@CodeBlanch CodeBlanch merged commit d4d5122 into open-telemetry:main Oct 5, 2023
@CodeBlanch CodeBlanch deleted the api-gettracer-memoryleak branch October 5, 2023 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg:OpenTelemetry.Api Issues related to OpenTelemetry.Api NuGet package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants