
ddtrace/tracer: fix tracer.StartSpanFromContext race condition on opts arg. #1127

Merged 4 commits into v1 on Jan 20, 2022

Conversation

@evanj (Contributor) commented Jan 14, 2022

The opts input slice comes from the caller. If it has extra capacity,
the calls to append will use the slice that is passed in. If this
function is called from multiple goroutines, this causes a race,
where spans may not have the correct parents.

Add a test to demonstrate the error. The test fails reliably under the
race detector, and may also fail without it, depending on how the race
manifests.
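The failure mode described above can be sketched in isolation (a toy example with ints standing in for StartSpanOption; it is deterministic here because the two appends run sequentially, but the same backing-array aliasing is what makes the concurrent case a data race):

```go
package main

import "fmt"

func main() {
	// A caller-owned slice with spare capacity: len 1, cap 3.
	opts := make([]int, 1, 3)
	opts[0] = 10

	// Two appends to the same slice value. Because cap > len,
	// both write into the SAME backing array slot.
	a := append(opts, 1)
	b := append(opts, 2) // clobbers the element a just appended

	fmt.Println(a[1], b[1]) // 2 2: a[1] was overwritten by the second append
}
```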

@evanj evanj requested a review from a team January 14, 2022 20:53
@evanj force-pushed the evan.jones/startspan-race branch from d7eb0d2 to 4281663 on January 14, 2022 20:55
@felixge (Member) left a comment:

LGTM. Will also try to kick off CI now.

Comment on lines +42 to +45
// copy opts in case the caller reuses the slice in parallel
// we will add at least 1, at most 2 items
optsLocal := make([]StartSpanOption, len(opts), len(opts)+2)
copy(optsLocal, opts)
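The copy pattern quoted above can be exercised in isolation (a sketch with ints standing in for StartSpanOption): the local copy gets its own backing array, so subsequent appends can no longer write into the caller's slice.

```go
package main

import "fmt"

func main() {
	opts := make([]int, 1, 3) // caller-owned, with spare capacity
	opts[0] = 10

	// The pattern from the fix: copy before appending.
	local := make([]int, len(opts), len(opts)+2)
	copy(local, opts)
	local = append(local, 42) // writes into local's own array

	// The caller's backing array is untouched: the slot after
	// len(opts) still holds its zero value.
	fmt.Println(local[1], opts[:2][1]) // 42 0
}
```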
@felixge (Member) commented Jan 17, 2022:

My first instinct when I saw this was to worry about overhead. But on second thought, I think this improves the safety of this method AND reduces the overhead for the normal case where the opts slice doesn't have extra capacity.

The reason for this is that append() does an implicit alloc + copy anyway when opts doesn't have extra capacity. And this allocation is probably often oversized because the Go runtime doesn't know that we only need space for 2 more items in the slice.

So 👍 from me. Thank you for this thoughtful fix!

But somebody from @DataDog/tracing-go should also take a look before merging this.

A contributor commented:

I went through the same trip as you looking at this. IIUC, am I correct to say that append does an alloc + copy anyway, so this only makes that more explicit, and additionally safer?

Regardless of this assumption, can you (@evanj) please post the results of a simple micro-benchmark starting spans with a child so we can observe any differences?

@felixge (Member) commented Jan 17, 2022:

> am I correct to say that append does an alloc + copy anyway, so this only makes that more explicit, and additionally safer?

There are two cases:

  1. The normal case where len(opts) == cap(opts): In this case the append() previously had to alloc + copy anyway, but there was no safety issue. This case should be slightly optimized by this patch (because the alloc doesn't end up much larger than needed if len(opts) is big).
  2. The edge case where len(opts) < cap(opts): In this case the append() previously didn't have to alloc+copy, but it was unsafe if the method was called with the same opts slice concurrently. With this patch the safety problem is fixed, but the extra alloc+copy may cause slightly more overhead than before.
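The two cases can be checked directly (a sketch using ints, not code from the thread):

```go
package main

import "fmt"

func main() {
	// Case 1: len == cap. append must allocate a fresh backing
	// array, so the caller's slice is never written to.
	full := make([]int, 2, 2)
	grown := append(full, 3)
	grown[0] = 99
	fmt.Println(full[0]) // 0: full's array was not touched

	// Case 2: len < cap. append reuses the caller's backing array,
	// which is exactly the unsafe sharing the patch removes.
	spare := make([]int, 2, 4)
	alias := append(spare, 3)
	alias[0] = 99
	fmt.Println(spare[0]) // 99: the caller's slice was mutated
}
```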

> Regardless of this assumption, can you (@evanj) please post the results of a simple micro-benchmark starting spans with a child so we can observe any differences?

Since I had a suitable benchmark lying around for this, I went ahead and ran it. As expected, there is no noticeable change in performance:

name                                     old time/op           new time/op           delta
EndpointsAndHotspots/direct/hello-world           11.4µs ± 2%           11.4µs ± 3%   ~     (p=0.968 n=10+9)

For full details see: https://gist.github.com/felixge/8b2e45b4ec3b08bf9d2df0c4b26ce896

(If you're wondering why there is no decrease in overhead: I suspect you need to pass much larger []StartSpanOption slices into the method to see a difference one way or another)

A contributor replied:

I think it's safe to assume that 99% of common patterns aren't the situation described in point 2, so we're good. Thanks for posting the benchmark results.

@felixge (Member) commented Jan 17, 2022

Hm, not sure how to get CircleCI to run this. The branch doesn't show up in the UI. @knusbaum @gbbr any ideas?

@gbbr gbbr added this to the 1.36.0 milestone Jan 17, 2022
@evanj (Contributor, Author) commented Jan 19, 2022

Thanks for the detailed comments and for the benchmarks!

@gbbr gbbr merged commit 9bc920a into v1 Jan 20, 2022
@gbbr gbbr deleted the evan.jones/startspan-race branch January 20, 2022 10:37
@gbbr gbbr changed the title tracer.StartSpanFromContext: Fix a race if opts is reused ddtrace/tracer: fix tracer.StartSpanFromContext race condition on opts arg. Jan 20, 2022