
API usability improvements in sampling package #31940

Merged (84 commits, Apr 9, 2024)

Conversation

@jmacd (Contributor) commented Mar 25, 2024

Description:

Several usability issues were ironed out while working on #31894. This PR extracts the pkg/sampling changes from that PR.

Highlights:

Link to tracking Issue: #31918

Testing: New tests added for coverage.

Documentation: New comments to explain.

```go
//
// return fixedThreshold.ShouldSample(rnd)
// }
// func MakeDecision(tracestate string, tid TraceID) bool {
```

Note: the Go toolchain is reformatting this comment block for me. 🤷

Comment on lines +46 to +71
```go
// Calculate the amount of precision needed to encode the
// threshold with reasonable precision. Here, we count the
// number of leading `0` or `f` characters and automatically
// add precision to preserve relative error near the extremes.
//
// Frexp() normalizes both the fraction and one-minus the
// fraction, because more digits of precision are needed if
// either value is near zero. Frexp returns an exponent <= 0.
//
// If `exp <= -4`, there will be a leading hex `0` or `f`.
// For every multiple of -4, another leading `0` or `f`
// appears, so this raises precision accordingly.
_, expF := math.Frexp(fraction)
_, expR := math.Frexp(1 - fraction)
precision = min(NumHexDigits, max(precision+expF/-hexBits, precision+expR/-hexBits))

// Compute the threshold
scaled := uint64(math.Round(fraction * float64(MaxAdjustedCount)))
threshold := MaxAdjustedCount - scaled

// Round to the specified precision, if less than the maximum.
if shift := hexBits * (NumHexDigits - precision); shift != 0 {
	half := uint64(1) << (shift - 1)
	threshold += half
	threshold >>= shift
	threshold <<= shift
```
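For intuition, here is a self-contained sketch of the precision rule above. The constants and the helper name are local stand-ins, not the package's exported values; the point is that fractions near 0 or 1 gain roughly one hex digit of precision per factor of 16 of proximity to either extreme (requires Go 1.21+ for the built-in `min`/`max`).

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical constants mirroring the snippet above.
const (
	numHexDigits = 14 // a 56-bit threshold is 14 hex digits
	hexBits      = 4
)

// adjustedPrecision applies the Frexp-based rule: every factor-of-16
// proximity to 0 or 1 adds one hex digit of precision, capped at the
// maximum number of hex digits.
func adjustedPrecision(fraction float64, precision int) int {
	_, expF := math.Frexp(fraction)
	_, expR := math.Frexp(1 - fraction)
	return min(numHexDigits, max(precision+expF/-hexBits, precision+expR/-hexBits))
}

func main() {
	fmt.Println(adjustedPrecision(0.5, 3))   // mid-range: precision unchanged (3)
	fmt.Println(adjustedPrecision(0.001, 3)) // near zero: extra digits (5)
	fmt.Println(adjustedPrecision(0.999, 3)) // near one: extra digits (5)
}
```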

Note I removed ErrPrecisionUnderflow because it was not a meaningful error condition -- I would otherwise work around the problem. While investigating the head-sampler changes for OTel-Go I found a much nicer solution, copied here (with new testing).

@jmacd jmacd marked this pull request as ready for review March 25, 2024 17:31
@jmacd jmacd requested review from a team and mx-psi March 25, 2024 17:31
@jmacd (Contributor) commented Mar 25, 2024

Reviewers: the new API surface is exercised in #31946.

github-actions bot commented Apr 9, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@MovieStoreGuy (Contributor) commented:
Sorry @jmacd for the long wait.

Just rerunning the build and hopefully we should be all good to merge this in.

@MovieStoreGuy MovieStoreGuy merged commit beef35e into open-telemetry:main Apr 9, 2024
169 of 170 checks passed
@github-actions github-actions bot added this to the next release milestone Apr 9, 2024
jpkrohling pushed a commit that referenced this pull request May 15, 2024
…ation, prepare for OTEP 235 support (#31946)

**Description:**

Refactors the probabilistic sampling processor to prepare it for more
OTEP 235 support.

This clarifies existing inconsistencies between tracing and logging
samplers, see the updated README. The tracing priority mechanism applies
a 0% or 100% sampling override (e.g., "1" implies 100% sampling),
whereas the logging sampling priority mechanism supports
variable-probability override (e.g., "1" implies 1% sampling).
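The inconsistency can be made concrete with a pair of hypothetical helpers (illustrative only, not the processor's actual code): the same "sampling.priority" value of 1 yields 100% sampling for traces but 1% sampling for logs.

```go
package main

import "fmt"

// tracePriority: for traces, sampling.priority is a 0%-or-100%
// override; any nonzero priority forces sampling.
func tracePriority(priority int64) float64 {
	if priority > 0 {
		return 1.0
	}
	return 0.0
}

// logPriority: for logs, sampling.priority is itself a percentage.
func logPriority(priority int64) float64 {
	return float64(priority) / 100.0
}

func main() {
	fmt.Println(tracePriority(1)) // 1: traces interpret "1" as 100%
	fmt.Println(logPriority(1))   // 0.01: logs interpret "1" as 1%
}
```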

This pins down cases where no randomness is available, and organizes the
code to improve readability. A new type called `randomnessNamer` carries
the randomness information (from the sampling package) and the name of
the policy that derived it. When sampling priority causes the effective
sampling probability to change, the value "sampling.priority" replaces
the source of randomness, which is currently limited to "trace_id_hash"
or, for logs, the name of the randomness-source attribute.

While working on #31894, I discovered that some inputs fall through to
the hash function with zero bytes of input randomness. The hash
function, computed on an empty input (for logs) or on 16 bytes of zeros
(which OTel calls an invalid trace ID), produces a fixed value. So, for
example, when logs are sampled with no TraceID and no randomness
attribute value, such items pass the sampler if and only if the
configured probability is approximately 82.9% or above.
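The degenerate behavior can be illustrated with a toy model. The helper below is illustrative (the processor's real hash is seeded and more involved), but it shows the mechanism: hashing an empty input always yields the same constant, so every such item maps to one fixed "randomness" value and the decision degenerates into a single cutoff on the configured probability.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// toyRandomness mimics deriving 14 bits of "randomness" by hashing the
// item's trace ID bytes (a simplified stand-in for the processor's
// seeded FNV hash).
func toyRandomness(traceID []byte) uint32 {
	h := fnv.New32a()
	h.Write(traceID)
	return h.Sum32() & 0x3FFF // keep 14 bits
}

func main() {
	// With no trace ID, the hash input is empty, so every such item
	// maps to the same constant value.
	a := toyRandomness(nil)
	b := toyRandomness([]byte{})
	fmt.Println(a == b) // true: the decision is deterministic
	// All empty-input items are sampled iff the configured acceptance
	// threshold exceeds this single constant.
	fmt.Println(a)
}
```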

In the refactored code, an error is returned when there is no input
randomness. A new boolean configuration field determines the outcome
when there is an error extracting randomness from an item of telemetry.
By default, items of telemetry with errors will not pass through the
sampler. When `FailClosed` is set to false, items of telemetry with
errors will pass through the sampler.

The original hash function, which uses 14 bits of information, is
structured as an "acceptance threshold": the test for sampling yields a
positive decision when `Randomness < AcceptThreshold`. In the OTEP 235
scheme, thresholds are rejection thresholds, so this PR converts the
original 14-bit accept threshold into a 56-bit reject threshold, using
the Threshold and Randomness types from the sampling package. Reframed
this way, the subsequent PR (i.e., #31894) can seamlessly convey the
effective sampling probability using OTEP 235 semantic conventions.
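The accept-to-reject conversion can be sketched as follows. The constants and helper are illustrative rather than the processor's exact code, but the arithmetic shows why widening a 14-bit accept threshold to a 56-bit reject threshold preserves the sampling probability exactly:

```go
package main

import "fmt"

const (
	numHashBits      = 14
	numThresholdBits = 56
	maxAdjustedCount = uint64(1) << numThresholdBits
)

// rejectThreshold widens a 14-bit acceptance threshold into a 56-bit
// rejection threshold with the same sampling probability: the accepted
// fraction accept14/2^14 equals (2^56 - reject56)/2^56.
func rejectThreshold(accept14 uint64) uint64 {
	return ((uint64(1) << numHashBits) - accept14) << (numThresholdBits - numHashBits)
}

func main() {
	accept := uint64(1638) // ~10% of 2^14, an arbitrary example
	reject := rejectThreshold(accept)
	pAccept := float64(accept) / float64(uint64(1)<<numHashBits)
	pReject := float64(maxAdjustedCount-reject) / float64(maxAdjustedCount)
	fmt.Println(pAccept == pReject) // true: probability is unchanged
}
```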

Note, both traces and logs processors are now reduced to a function like
this:

```go
				return commonSamplingLogic(
					ctx,
					l,
					lsp.sampler,
					lsp.failClosed,
					lsp.sampler.randomnessFromLogRecord,
					lsp.priorityFunc,
					"logs sampler",
					lsp.logger,
				)
```

which is a generic function that handles the common per-item logic and
ends in a single metric event. This structure makes it clear how traces
and logs are currently processed differently, with different
prioritization schemes. It also makes it easy to introduce new sampler
modes, as shown in #31894. After this and #31940 merge, the changes in
#31894 will be relatively simple to review as the third part in the
series.

**Link to tracking Issue:**

Depends on #31940.
Part of #31918.

**Testing:** Added. Existing tests already cover the exact random
behavior of the current hashing mechanism. Even more testing will be
introduced with the last step of this series. Note that
#32360
is added ahead of this test to ensure refactoring does not change
results.

**Documentation:** Added.

---------

Co-authored-by: Kent Quirk <[email protected]>