Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor metric records logic #468

Merged
merged 9 commits into from
Feb 6, 2020
Merged

Conversation

bogdandrutu
Copy link
Member

Changes in the PR:

  • Cleanup the complicated logic of keeping track of active Records, by implementing a mechanism that guarantees that an entry can be referenced iff (if and only if) it is in the map as well, so records removed from the map will never see new updates.
  • Removes the need of two different lists (primary and reclaimed), also the logic of moving entries between the lists.
  • Removes the need of the epoch logic. A simple modified atomic boolean is enough.

This PR also:

  • Fixes race condition when a reclaimed entry is added back to the primary and then removes another entry from the map (same mapKey). This can happen when an entry is initially removed (some usages still happens to move it to reclaimed), a new entry for the same mapKey is added to the map, so when the initial entry is added back to the primary list will remove the newly added entry from the map. If it happens that the newly entry is bound then we will have "stale" records that will be used only for that handle.
  • Simplifies the code, there may be other edge-cases and hidden race conditions with a more complicated code.
  • Improves performance on the critical path, by not requiring cas when unbind (and removing extra atomic lods/stores - memory barriers), a cas is added on the collection side but that happens async.

Possible future improvements:

  • Understand better when to report changes for a Record: currently if a Bound is used the values are always reported even if no changes.
  • Use a per instrument map/list that improve contention on the map and list (even if they are "lock free" data structures, in a lot of places we do cas while succeed).
  • Consider to iterate over the lock free records list without always removing and adding back elements (2 cas operations).
  • For readability: Isolate the lock free list logic (by creating a proper list class, instead of having methods on the singlePtr as well as on the SDK + Record.

@jmacd
Copy link
Contributor

jmacd commented Feb 5, 2020

@krnowak please review :)

@jmacd
Copy link
Contributor

jmacd commented Feb 5, 2020

@paivagustavo please review :)

@jmacd
Copy link
Contributor

jmacd commented Feb 5, 2020

Looks nice. Got any new benchmark results? Thanks :)

@bogdandrutu
Copy link
Member Author

@jmacd please tell me how to run them

@jmacd
Copy link
Contributor

jmacd commented Feb 5, 2020

Should be just go test -bench=. in the package directory, before and after.

@bogdandrutu
Copy link
Member Author

Give me 1h to understand why the stress test blocked exactly where I pointed that it may block :))

@jmacd
Copy link
Contributor

jmacd commented Feb 5, 2020

Cool. This is now the second technique I've seen that uses an atomic integer plus one bit to accomplish lock freedom. Discussed here, and as seen in the Prometheus-Go client. Interestingly (or confusingly), the other technique uses the high bit while this technique uses the low bit.

@bogdandrutu
Copy link
Member Author

@jmacd done you can see the last commit if you are interested where the bug was :). Interestingly I couldn't reproduce on my local machine.

@jmacd
Copy link
Contributor

jmacd commented Feb 6, 2020

Cool. The stress test reproduced a number of bugs for me, during development, but it was always on CircleCI and never on my local machine. I think it has to do with limited CPU count?

@bogdandrutu
Copy link
Member Author

@jmacd I have some comments in the future work, please let me know if you want to create an issue for them, also left some todos related. Would like to know if you want to track these and how.

@jmacd jmacd added the area:metrics Part of OpenTelemetry Metrics label Feb 6, 2020
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
Copy link
Member

@krnowak krnowak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly typos, but the busy waiting during instrument binding makes me uneasy.

sdk/metric/alignment_test.go Outdated Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/sdk.go Outdated Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/sdk.go Outdated Show resolved Hide resolved
sdk/metric/sdk.go Show resolved Hide resolved
sdk/metric/sdk.go Show resolved Hide resolved
sdk/metric/refcount_mapped.go Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
@bogdandrutu bogdandrutu requested a review from krnowak February 6, 2020 17:42
Signed-off-by: Bogdan Cristian Drutu <[email protected]>
Signed-off-by: Bogdan Cristian Drutu <[email protected]>
Signed-off-by: Bogdan Cristian Drutu <[email protected]>
Signed-off-by: Bogdan Cristian Drutu <[email protected]>
Signed-off-by: Bogdan Cristian Drutu <[email protected]>
Signed-off-by: Bogdan Cristian Drutu <[email protected]>
Copy link
Member

@krnowak krnowak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found last small typo, otherwise looks good. Thanks!

sdk/metric/refcount_mapped.go Outdated Show resolved Hide resolved
sdk/metric/sdk.go Show resolved Hide resolved
Signed-off-by: Bogdan Cristian Drutu <[email protected]>
@jmacd jmacd merged commit 69df67d into open-telemetry:master Feb 6, 2020
@bogdandrutu bogdandrutu deleted the metric branch February 6, 2020 22:50
@jmacd jmacd mentioned this pull request Mar 26, 2020
@pellared pellared added this to the untracked milestone Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:metrics Part of OpenTelemetry Metrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants