Refactor the probabilistic sampler processor; add FailClosed configuration, prepare for OTEP 235 support (#31946)

**Description:**

Refactors the probabilistic sampling processor to prepare it for more
OTEP 235 support.

This clarifies existing inconsistencies between the tracing and logging
samplers; see the updated README. The tracing priority mechanism applies
a 0% or 100% sampling override (e.g., "1" implies 100% sampling),
whereas the logging sampling priority mechanism supports a
variable-probability override (e.g., "1" implies 1% sampling).

This pins down cases where no randomness is available, and organizes the
code to improve readability. A new type called `randomnessNamer` carries
the randomness information (from the sampling package) together with the
name of the policy that derived it. When sampling priority causes the
effective sampling probability to change, the value "sampling.priority"
replaces the source of randomness, which is otherwise limited to
"trace_id_hash" or, for logs, the name of the randomness-source
attribute.

While working on #31894, I discovered that some inputs fall through to
the hash function with zero bytes of input randomness. The hash
function, computed on an empty input (for logs) or on 16 bytes of zeros
(which OTel calls an invalid trace ID), produces a fixed random
value. So, for example, when logs are sampled and there is no TraceID
and no randomness attribute value, the result is sampled only when the
configured sampling percentage is approximately 82.9% or above.

In the refactored code, an error is returned when there is no input
randomness. A new boolean configuration field determines the outcome
when there is an error extracting randomness from an item of telemetry.
By default, items of telemetry with errors do not pass through the
sampler. When `FailClosed` is set to false, items of telemetry with
errors pass through the sampler.

The original hash function, which uses 14 bits of information, is
structured as an "acceptance threshold": the test for sampling
translates into a positive decision when `Randomness <
AcceptThreshold`. In the OTEP 235 scheme, thresholds are rejection
thresholds, so this PR converts the original 14-bit acceptance threshold
into a 56-bit rejection threshold, using the Threshold and Randomness
types from the sampling package. Reframed in this way, the subsequent PR
(i.e., #31894) can seamlessly convey the effective sampling probability
using OTEP 235 semantic conventions.
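
To illustrate the reframing, here is a small, purely arithmetic sketch
(hypothetical variable names; the real code uses the sampling package's
`Threshold` and `Randomness` types):

```go
// Illustrative arithmetic only: the same sampling percentage expressed as
// a 14-bit acceptance threshold and as a 56-bit rejection threshold.
package main

import "fmt"

func main() {
	pct := 15.3 // sampling percentage

	// Legacy scheme: sample when the 14-bit hash value is below the
	// acceptance threshold.
	accept := uint32(pct / 100 * (1 << 14))

	// OTEP 235 scheme: sample when the 56-bit randomness value is at or
	// above the rejection threshold.
	reject := uint64((1 - pct/100) * (1 << 56))

	fmt.Printf("accept < %d (of 2^14); reject >= %d (of 2^56)\n", accept, reject)
}
```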

Note that both the traces and logs processors are now reduced to a
function like this:

```go
return commonSamplingLogic(
	ctx,
	l,
	lsp.sampler,
	lsp.failClosed,
	lsp.sampler.randomnessFromLogRecord,
	lsp.priorityFunc,
	"logs sampler",
	lsp.logger,
)
```

which is a generic function that handles the common logic on a per-item
basis and ends in a single metric event. This structure makes it clear
how traces and logs are currently processed differently, with different
prioritization schemes. It also makes it easy to introduce new sampler
modes, as shown in #31894. After this PR and #31940 merge, the changes
in #31894 will be relatively simple to review as the third part in the
series.
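
For illustration, a hedged sketch of that shared per-item flow, using
stand-in types and names (not the PR's actual signatures):

```go
// Sketch of the shared flow: extract randomness, apply the
// fail-open/fail-closed policy on error, let a priority function
// override the rejection threshold, then decide OTEP 235-style.
package main

import (
	"errors"
	"fmt"
)

type randomness uint64 // stand-in for the sampling package's Randomness
type threshold uint64  // stand-in 56-bit rejection threshold

func samplingFlow[T any](
	item T,
	failClosed bool,
	randFunc func(T) (randomness, error),
	priorityFunc func(T, threshold) threshold,
	reject threshold,
) bool {
	rnd, err := randFunc(item)
	if err != nil {
		return !failClosed // no randomness: drop when failing closed
	}
	reject = priorityFunc(item, reject)
	return rnd >= randomness(reject) // sample when randomness >= rejection threshold
}

func main() {
	sampled := samplingFlow("log-record", true,
		func(string) (randomness, error) { return 0, errors.New("no randomness") },
		func(_ string, t threshold) threshold { return t },
		threshold(1)<<55, // 50% rejection threshold over a 56-bit space
	)
	fmt.Println("sampled:", sampled)
}
```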

**Link to tracking Issue:**

Depends on #31940.
Part of #31918.

**Testing:** Added. Existing tests already cover the exact random
behavior of the current hashing mechanism. More testing will be
introduced with the last step of this series. Note that #32360 was
added ahead of this change to ensure the refactoring does not change
results.

**Documentation:** Added.

---------

Co-authored-by: Kent Quirk <[email protected]>
jmacd and kentquirk authored May 15, 2024
1 parent 9e9f393 commit 4fa4603
Showing 22 changed files with 1,021 additions and 190 deletions.
27 changes: 27 additions & 0 deletions .chloggen/probabilisticsampler_failclosed.yaml
@@ -0,0 +1,27 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: probabilisticsamplerprocessor

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Adds the `FailClosed` flag to solidify current behavior when randomness source is missing.

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [31918]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: [user]
3 changes: 3 additions & 0 deletions cmd/configschema/go.mod
@@ -566,6 +566,7 @@ require (
github.com/open-telemetry/opentelemetry-collector-contrib/internal/kafka v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/internal/sharedcomponent v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/experimentalmetricmetadata v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/azure v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/winperfcounters v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/processor/probabilisticsamplerprocessor v0.100.0 // indirect
@@ -1226,3 +1227,5 @@ replace github.com/open-telemetry/opentelemetry-collector-contrib/connector/graf
replace github.com/open-telemetry/opentelemetry-collector-contrib/extension/sumologicextension => ../../extension/sumologicextension

replace github.com/open-telemetry/opentelemetry-collector-contrib/receiver/splunkenterprisereceiver => ../../receiver/splunkenterprisereceiver

replace github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling => ../../pkg/sampling
1 change: 1 addition & 0 deletions cmd/otelcontribcol/builder-config.yaml
@@ -472,3 +472,4 @@ replaces:
- github.com/open-telemetry/opentelemetry-collector-contrib/extension/opampcustommessages => ../../extension/opampcustommessages
- github.com/open-telemetry/opentelemetry-collector-contrib/confmap/provider/s3provider => ../../confmap/provider/s3provider
- github.com/open-telemetry/opentelemetry-collector-contrib/confmap/provider/secretsmanagerprovider => ../../confmap/provider/secretsmanagerprovider
- github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling => ../../pkg/sampling
3 changes: 3 additions & 0 deletions cmd/otelcontribcol/go.mod
@@ -631,6 +631,7 @@ require (
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/pdatautil v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/resourcetotelemetry v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/azure v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/jaeger v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/loki v0.100.0 // indirect
@@ -1293,3 +1294,5 @@ replace github.com/open-telemetry/opentelemetry-collector-contrib/extension/opam
replace github.com/open-telemetry/opentelemetry-collector-contrib/confmap/provider/s3provider => ../../confmap/provider/s3provider

replace github.com/open-telemetry/opentelemetry-collector-contrib/confmap/provider/secretsmanagerprovider => ../../confmap/provider/secretsmanagerprovider

replace github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling => ../../pkg/sampling
2 changes: 2 additions & 0 deletions connector/datadogconnector/go.mod
@@ -332,3 +332,5 @@ replace github.com/open-telemetry/opentelemetry-collector-contrib/extension/stor
replace github.com/openshift/api v3.9.0+incompatible => github.com/openshift/api v0.0.0-20180801171038-322a19404e37

replace github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor => ../../processor/transformprocessor

replace github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling => ../../pkg/sampling
3 changes: 3 additions & 0 deletions exporter/datadogexporter/go.mod
@@ -250,6 +250,7 @@ require (
github.com/open-telemetry/opentelemetry-collector-contrib/internal/filter v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/pdatautil v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometheus v0.100.0 // indirect
github.com/opencontainers/go-digest v1.0.0 // indirect
@@ -427,3 +428,5 @@ replace github.com/open-telemetry/opentelemetry-collector-contrib/processor/tail
replace github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage => ../../extension/storage

replace github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor => ../../processor/transformprocessor

replace github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling => ../../pkg/sampling
2 changes: 2 additions & 0 deletions exporter/datadogexporter/integrationtest/go.mod
@@ -341,3 +341,5 @@ replace github.com/open-telemetry/opentelemetry-collector-contrib/processor/prob
replace github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver => ../../../receiver/prometheusreceiver

replace github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor => ../../../processor/transformprocessor

replace github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling => ../../../pkg/sampling
3 changes: 3 additions & 0 deletions go.mod
@@ -578,6 +578,7 @@ require (
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/pdatautil v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/resourcetotelemetry v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/azure v0.100.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/jaeger v0.100.0 // indirect
@@ -1226,3 +1227,5 @@ replace github.com/open-telemetry/opentelemetry-collector-contrib/extension/enco
replace github.com/open-telemetry/opentelemetry-collector-contrib/extension/ackextension => ./extension/ackextension

replace github.com/open-telemetry/opentelemetry-collector-contrib/receiver/splunkenterprisereceiver => ./receiver/splunkenterprisereceiver

replace github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling => ./pkg/sampling
190 changes: 150 additions & 40 deletions processor/probabilisticsamplerprocessor/README.md
@@ -15,51 +15,159 @@
[contrib]: https://github.com/open-telemetry/opentelemetry-collector-releases/tree/main/distributions/otelcol-contrib
<!-- end autogenerated section -->

The probabilistic sampler processor supports several modes of sampling
for spans and log records. Sampling is performed on a per-request
basis, considering individual items statelessly. For whole trace
sampling, see
[tailsamplingprocessor](../tailsamplingprocessor/README.md).

For trace spans, this sampler supports probabilistic sampling based on
a configured sampling percentage applied to the TraceID. In addition,
the sampler recognizes a `sampling.priority` annotation, which can
force the sampler to apply 0% or 100% sampling.

For log records, this sampler can be configured to use the embedded
TraceID and follow the same logic as applied to spans. When the
TraceID is not defined, the sampler can be configured to apply hashing
to a selected log record attribute. This sampler also supports
sampling priority.

## Consistency guarantee

A consistent probability sampler is a Sampler that supports
independent sampling decisions for each span or log record in a group
(e.g. by TraceID), while maximizing the potential for completeness as
follows.

Consistent probability sampling requires that for any span in a given
trace, if a Sampler with lesser sampling probability selects the span
for sampling, then the span would also be selected by a Sampler
configured with greater sampling probability.
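
Concretely, a sketch of the standard construction (not necessarily this
component's exact code): if each item carries a unit-interval randomness
value R, and a sampler with probability p keeps the item when
R >= 1 - p, then for p1 <= p2, R >= 1 - p1 implies R >= 1 - p2, which
yields the guarantee.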

## Completeness property

A trace is complete when all of its members are sampled. A
"sub-trace" is complete when all of its descendants are sampled.

Ordinarily, Trace and Logging SDKs configure parent-based samplers
which decide to sample based on the Context, because it leads to
completeness.

When non-root spans or logs make independent sampling decisions
instead of using the parent-based approach (e.g., using the
`TraceIDRatioBased` sampler for a non-root span), incompleteness may
result. When spans and log records are independently sampled in a
processor, as by this component, the same potential for
incompleteness arises. The consistency guarantee helps minimize this
issue.

Consistent probability samplers can be safely used with a mixture of
probabilities and preserve sub-trace completeness, provided that child
spans and log records are sampled with probability greater than or
equal to the parent context.

Using 1%, 10% and 50% probabilities for example, in a consistent
probability scheme the 50% sampler must sample when the 10% sampler
does, and the 10% sampler must sample when the 1% sampler does. A
three-tier system could be configured with 1% sampling in the first
tier, 10% sampling in the second tier, and 50% sampling in the bottom
tier. In this configuration, 1% of traces will be complete, 10% of
traces will be sub-trace complete at the second tier, and 50% of
traces will be sub-trace complete at the third tier thanks to the
consistency property.

These guidelines should be considered when deploying multiple
collectors with different sampling probabilities in a system. For
example, a collector serving frontend servers can be configured with
smaller sampling probability than a collector serving backend servers,
without breaking sub-trace completeness.
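
As a sketch, the tiers above might be configured as two separate
collectors like this (a hypothetical deployment, one configuration per
tier):

```yaml
# Frontend-tier collector: smaller sampling probability.
processors:
  probabilistic_sampler:
    sampling_percentage: 10
---
# Backend-tier collector: larger sampling probability. With consistent
# sampling, every trace the frontend tier keeps is also kept here.
processors:
  probabilistic_sampler:
    sampling_percentage: 50
```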

## Sampling randomness

To achieve consistency, sampling randomness is taken from a
deterministic aspect of the input data. For traces pipelines, the
source of randomness is always the TraceID. For logs pipelines, the
source of randomness can be the TraceID or another log record
attribute, if configured.

For log records, the `attribute_source` and `from_attribute` fields
determine the source of randomness. When `attribute_source` is set
to `traceID`, the TraceID is used. When `attribute_source` is set to
`record`, or when the TraceID field is absent, the value of
`from_attribute` is taken as the source of randomness (if configured).
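
For example, the following sketch draws randomness from a hypothetical
`logID` attribute rather than the TraceID:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 15
    attribute_source: record
    from_attribute: logID
```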

## Sampling priority

The sampling priority mechanism is an override, which takes precedence
over the probabilistic decision in all modes.

🛑 Compatibility note: Logs and Traces have different behavior.

In traces pipelines, when the priority attribute has value 0, the
configured probability will be modified to 0% and the item will not
pass the sampler. When the priority attribute is non-zero, the
configured probability will be set to 100%. The sampling priority
attribute is not configurable, and is called `sampling.priority`.

In logs pipelines, when the priority attribute has value 0, the
configured probability will be modified to 0%, and the item will not
pass the sampler. Otherwise, the logs sampling priority attribute is
interpreted as a percentage, with values >= 100 equal to 100%
sampling. The logs sampling priority attribute is configured via
`sampling_priority`.
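
For example, with the following sketch of a logs configuration, a log
record whose `priority` attribute is 0 is never sampled, a value of 1
means 1% sampling, and values >= 100 mean 100% sampling:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10
    sampling_priority: priority
```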

## Sampling algorithm

### Hash seed

The hash seed method uses the FNV hash function applied to either a
Trace ID (spans, log records), or to the value of a specified
attribute (only logs). The hashed value, presumed to be random, is
compared against a threshold value that corresponds with the sampling
percentage.

This mode requires configuring the `hash_seed` field. This mode is
enabled when the `hash_seed` field is not zero, or when log records
are sampled with `attribute_source` set to `record`.

In order for hashing to be consistent, all collectors for a given tier
(e.g. behind the same load balancer) must have the same
`hash_seed`. It is also possible to leverage a different `hash_seed`
at different collector tiers to support additional sampling
requirements.

This mode uses 14 bits of sampling precision.
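
The following is a simplified sketch of that decision; the seed mixing
and byte order here are assumptions, not the processor's exact code:

```go
// Simplified sketch of the hash-seed decision: FNV-hash the randomness
// source, keep 14 bits, and compare against an acceptance threshold
// scaled from the sampling percentage.
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

func sampled(seed uint32, input []byte, pct float64) bool {
	h := fnv.New32a()
	var sb [4]byte
	binary.BigEndian.PutUint32(sb[:], seed)
	h.Write(sb[:]) // assumption: the seed prefixes the hashed bytes
	h.Write(input)
	bucket := h.Sum32() & 0x3FFF            // keep 14 bits of the hash
	accept := uint32(pct / 100 * (1 << 14)) // acceptance threshold
	return bucket < accept
}

func main() {
	traceID := make([]byte, 16) // 16 zero bytes: an invalid trace ID
	fmt.Println(sampled(22, traceID, 15.3))
}
```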

### Error handling

This processor considers it an error when the arriving data has no
randomness. This includes conditions where the TraceID field is
invalid (16 zero bytes) and where the log record attribute source has
zero bytes of information.

By default, when there are errors determining sampling-related
information from an item of telemetry, the data will be refused. This
behavior can be changed by setting the `fail_closed` property to
false, in which case erroneous data will pass through the processor.
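
For example, this minimal configuration passes items through when
randomness cannot be determined, instead of refusing them:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 15.3
    fail_closed: false
```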

## Configuration

The following configuration options can be modified:
- `sampling_percentage` (32-bit floating point, required): Percentage at which items are sampled; >= 100 samples all items, 0 rejects all items.
- `hash_seed` (32-bit unsigned integer, optional, default = 0): An integer used to compute the hash algorithm. Note that all collectors for a given tier (e.g. behind the same load balancer) should have the same hash_seed.
- `fail_closed` (boolean, optional, default = true): Whether to reject items with sampling-related errors.

```yaml
processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 15.3
```

### Logs-specific configuration


- `attribute_source` (string, optional, default = "traceID"): defines where to look for the attribute in from_attribute. The allowed values are `traceID` or `record`.
- `from_attribute` (string, optional, default = ""): The name of a log record attribute used for sampling purposes, such as a unique log record ID. The value of the attribute is only used if the trace ID is absent or if `attribute_source` is set to `record`.
- `sampling_priority` (string, optional, default = ""): The name of a log record attribute used to set a different sampling priority from the `sampling_percentage` setting. 0 means to never sample the log record, and >= 100 means to always sample the log record.

Examples:

Sample 15% of log records according to trace ID using the OpenTelemetry
specification.

```yaml
processors:
  probabilistic_sampler:
@@ -76,7 +76,7 @@ processors:
    from_attribute: logID # value is required if the source is not traceID
```

Give sampling priority to log records according to the attribute named
`priority`:

```yaml
processors:
@@ -85,6 +194,7 @@ processors:
    sampling_priority: priority
```

## Detailed examples

Refer to [config.yaml](./testdata/config.yaml) for detailed examples
on using the processor.
10 changes: 10 additions & 0 deletions processor/probabilisticsamplerprocessor/config.go
@@ -35,6 +35,16 @@ type Config struct {
// different sampling rates, configuring different seeds avoids that.
HashSeed uint32 `mapstructure:"hash_seed"`

// FailClosed indicates to not sample data (the processor will
// fail "closed") in case of error, such as failure to parse
// the tracestate field or missing the randomness attribute.
//
// By default, items of telemetry with errors are not sampled (the
// processor fails "closed"). Sampling priority-based decisions are
// made after FailClosed is processed, making it possible to sample
// despite errors using priority when FailClosed is false.
FailClosed bool `mapstructure:"fail_closed"`

// AttributeSource (logs only) defines where to look for the attribute in from_attribute. The allowed values are
// `traceID` or `record`. Default is `traceID`.
AttributeSource `mapstructure:"attribute_source"`
