Discuss post-trace (tail-based) sampling #425

yurishkuro · 2017-09-24T23:15:04Z

Booking this as a placeholder for discussions about implementing tail-based sampling. Copying from our original roadmap document:

Instead of blindly sampling a random % of all traces, we want to sample traces that are interesting, which can mean high latency, unexpected error codes, equal representation of different latency buckets, etc.

The proposed approach is to configure client libraries to emit all spans over UDP to the local jaeger-agent running on the host, irrespective of any sampling. The local jaeger-agents will store all spans in-memory and submit minimal statistics about the spans to the central jaeger-collectors, using consistent hashing, so that statistics about any given trace is assembled from multiple hosts on a single jaeger-collector. Once the trace is completed, the collector will decide if that trace is “interesting”, and if so it will pull the rest of the trace details (complete spans, with tags, events, and micro-logs) from the participating jaeger-agents and forward it to the storage. If the trace is deemed not interesting, the collector will inform jaeger-agents that the data can be discarded.

This approach will require higher resource usage. However, the mitigating factors are:

Span reporting is done over UDP via loopback interface to the agent on the same host.
- A possible further optimization, at the expense of client library complexity, is to keep the spans in memory of the business process, and only submit statistics to the agent. This will significantly reduce the overhead of serializing spans to Thrift or protobuf.
- [by @kwojcik] Another option that might be over-the-top is to use a shared memory segment to avoid going through the kernel all together (may not work with containers).
The spans are stored in memory for the duration of the trace, so the memory usage on the host will go up. However, the TTL of this caching is very short, on average equal to the average latency of the top level request. A hard upper bound, e.g. 5-10 seconds, can be placed on the cache and spans exceeding that time can be automatically flushed to the agent/collector, since they are most likely to be a part of “interesting” trace (we expect 5 sec response time is not the norm).
All spans, not just pre-sampled, need to be reported to the central collectors in order to make the “interesting trace” determination. To reduce the volume of traffic, only minimal required statistics about the span will be reported in the first pass. A certain amount of decision making can actually be placed onto jaeger-agent, who may keep stats on the latency buckets by service + endpoint, and force the sampling of the whole trace.
- This can be reduced even further if the logic for deciding what "interesting" traces are can be done at the span level and pushed down to the agents

Update 2020-02-05

We're currently working on a prototype of "aggregating collectors" that will support tail-based sampling. Note that OpenTelemetry Collector already implements a simple version of tail-based sampling, however it only works in a single-node mode, whereas the work we're doing allows to shard traces across multiple aggregators, which allows to scale to higher levels of traffic. @vprithvi is giving regular updates from this work in the Jaeger Project status calls.

jpkrohling · 2017-09-25T10:58:13Z

It seems that networking is the main driver for this decision to happen at the agent level (as the collector can scale horizontally), but would be nice to document the why behind this feature.

Span reporting is done over UDP via loopback interface to the agent on the same host

I think we shouldn't make this as a hard assumption. Rather, assume that the agent is "close enough", like in a sidecar when doing a container deployment (second container in a pod, in Kubernetes terms).

Other than that, looks sane to me :)

vprithvi · 2017-10-02T15:16:53Z

The proposed approach is to configure client libraries to emit all spans over UDP to the local jaeger-agent running on the host, irrespective of any sampling.

I propose some refinements. Instead of going all or nothing, I think we should use adaptive sampling to select a sample of spans to emit. Ideally, we can go all the way to 100pc, but having some head based sampling could provide an upper bound on the resources used by the instrumented process.

The local jaeger-agents will store all spans in-memory and submit minimal statistics about the spans to the central jaeger-collectors

Similar to @jpkrohling's comment, this makes the assumption that sending Spans could saturate links between jaeger-agent and jaeger-collector. This may not be true in many modern data centers. GCP, for instance, allows for 2Gbps of egress throughput per vCPU core.

In that case, we might be able to send spans to jaeger-collector and deal with provisioning memory there instead. This also works better in the case of functional architectures where one might not be able to easily run a sidecar.

Note that storing things in memory makes them susceptible to loss during instance failures; deploys should also wait for all spans to be drained from memory. However, this might be an acceptable trade off for observability data like spans.

I think we should identify which of the following is the bottleneck for most services to better our design:

Collecting data via client instrumentation (CPU time)
Reporting spans to jaeger agent (local UDP utilization)
Reporting spans to jaeger collector (network utilization)

yurishkuro · 2017-10-02T15:38:24Z

@vprithvi yes we need to profile all three.

I do not see tail-based sampling as a replacement for adaptive sampling, they can co-exist. Adaptive loop drives head-based sampling that ensures continuous probing of the architecture that is useful for capacity planning, blackbox text coverage, etc. Tail-based sampling is useful for capturing anomalies. We can run both of those at the same time.

otisg · 2018-04-12T18:36:22Z

Looks like X-Trace team at Brown is looking into this as well: https://groups.google.com/forum/?#!msg/distributed-tracing/fybf1cW04ZU/KhcF5NxTBwAJ

Sending all traces to Collector sounds scary me. We run a multi-tenant (tracing) SaaS and run Collector on our side, while our users run clients and agents on their infra. I would think/hope the post-trace sampling could be done close to the origin - in agents themselves.

yurishkuro · 2018-04-12T20:00:16Z

@otisg yes, the plan is to store the spans in memory on the agents or a central tier that is on-premises, close to the application itself.

maniaabdi · 2018-09-13T13:19:10Z

Do we have any implemented version of tail sampling with Jaeger?

isaachier · 2018-09-13T13:26:35Z

Not yet. On our to do list for sure.

rwkarg · 2018-10-16T17:54:33Z

The new Kafka backend would be interesting for this. All spans get reported, one job could aggregate metrics across all spans, another aggregate dependency metrics, another could sample windows and sample traces for ES/Cassandra storage (outliers plus some sample of "normal" calls)

yurishkuro · 2018-10-16T18:00:46Z

Kafka is ok post-collection / post-sampling. For our use case we cannot scale Kafka to collect all spans before sampling.

jewelzhu · 2019-01-15T15:34:31Z

We are very interested in this post-trace sampling feature, and I have a question: How to determine if a trace is finished? When a collector receives some spans of a trace including the root span, it cannot know if there are some FollowsFrom spans still not finished.

rstraszewski · 2019-02-26T18:30:16Z

We are also interested in this feature. Ability to always log error spans, and sample the rest, would be great. Is this feature still on your to-do list?

yurishkuro · 2019-02-26T21:15:18Z

It is, but very likely we'd simply use the implementation in opencensus collector that already supports it for single node.

jpkrohling · 2019-04-02T13:23:31Z

@yurishkuro should this be closed then?

yurishkuro · 2019-04-02T14:34:33Z

I wouldn't, until we have some concrete results / guidance.

jpkrohling · 2019-04-02T14:41:50Z

I wouldn't, until we have some concrete results / guidance.

What would be necessary to close this? Get a PoC using OC in the client doing tail-based sampling?

yurishkuro · 2019-04-02T15:41:56Z

Just looking at my internal document, these are the questions we want from the POC:

Measure the overhead of tracing clients on the applications, especially those with high QPS. This is an important technical limitation of the tail-based sampling.
Measure/quantify resource consumption in the agents and collectors. Extrapolate to Uber-scale traffic
- Investigate the impact of compression
Measure/quantify increase in network bandwidth usage
Experiment with sampling on histograms (latency buckets), and more advanced techniques
Validation
- Compute histograms from full corpus of observed traces and compare w metrics

Stono · 2019-06-07T12:03:27Z

Literally just sat in the office talking about our dilemma of wanting to capture all errors, but sample others. But we can't sample everything, as it's too much data.

vikrambe · 2020-02-25T17:47:18Z

We are also interested in this feature. Ability to always log error spans, and sample the rest, would be great. Is this feature still on your to-do list?

I think we should allow sampling based on SpanTag and attach a sample %. For example we would want to capture all the traces with error=true 100% and http.status_code 200 10%

Stono · 2020-02-25T18:16:54Z

Error wouldn’t work at a trace level in a microservice architecture as the spans before your error app will have already been reported to jäger based on their sample rate

william-tran · 2020-05-07T19:54:57Z

With tail sampling implemented in otel-collector, and jaeger kafka recevier and exporter just implemented in #2221 and #2135, would this work and be HA? jager-client -> otel-collector (http receiver, jaeger_kafka exporter) -> otel-collector (jaeger_kafka receiver, tail sampler). You're using kafka as the "traceID aware load balancer" because spans are written to the topic keyed by traceId.

yurishkuro · 2020-05-07T20:23:34Z

Kafka stores data on disk. If you can afford storing all trace data on disk before sampling, you can probably also afford storing all of it after sampling, making tail sampling unnecessary.

wyattanderson · 2020-05-07T20:36:42Z

Kafka stores data on disk. If you can afford storing all trace data on disk before sampling, you can probably also afford storing all of it after sampling, making tail sampling unnecessary.

You'd be storing all trace data on disk, sure, but for a much shorter period of time than you'd actually store traces in Elasticsearch or Cassandra, with a linear access pattern (suitable for magnetic storage), and very little processing overhead (i.e. no indexing), essentially only serving as an HA buffer in front of the collector. I'll admit, I don't have hyperscale experience and it's possible I'm missing something obvious, but I imagine there might be a number of scenarios where this approach would be valid.

yurishkuro · 2020-05-07T20:43:19Z

That's fair, short retention and raw bytes access pattern certainly make Kafka more forgiving of high data volumes. As for being an "HA buffer", or more precisely as a form of sharding, it's feasible, however with the default setup that ingesters have today, nothing prevents Kafka from rebalancing the consumers at any time, so it's not quite the stable sharding solution that you probably want. Another alternative is to manually partition traces across differently named topics, and setup ingesters to consume from different topics, to avoid rebalancing from the brokers.

JasonChen86899 · 2020-10-16T07:02:51Z

Update 2020-02-05

We're currently working on a prototype of "aggregating collectors" that will support tail-based sampling. Note that OpenTelemetry Collector already implements a simple version of tail-based sampling, however it only works in a single-node mode, whereas the work we're doing allows to shard traces across multiple aggregators, which allows to scale to higher levels of traffic. @vprithvi is giving regular updates from this work in the Jaeger Project status calls.

Are there any updates on "aggregating collectors" or just use OpenTelemetry Collector? I have just added this function based on master branch including client lib. I don’t know if I did repetitive work. Thx

jpkrohling · 2020-10-16T07:28:55Z

I'll let @vprithvi share an update on this side of the game, but from OpenTelemetry's side, we are very close to getting all the pieces for a scalable tail-based sampling. The final piece is probably this:

open-telemetry/opentelemetry-collector#1724

JasonChen86899 · 2020-10-16T09:41:47Z

open-telemetry/opentelemetry-collector#1724

Yeah, great! A good design for tail-based sampling. I designed another implementation without a loader balancer. Record upstream and downstream ips that bound with span tags

vprithvi · 2020-10-16T13:23:32Z

@JasonChen86899 @jpkrohling - unfortunately no new development has been happening on Jaeger at Uber since late March. We were able to get a proof of concept running before then which assumes that it is running on Uber infra, so it's unlikely to be open sourced soon.

JasonChen86899 · 2020-10-22T03:35:33Z

@vprithvi - have done some changes about tail-based.hope your suggestions
jaeger: https://github.com/JasonChen86899/jaeger/tree/tail_based_sampling
jaeger-client-go: https://github.com/JasonChen86899/jaeger-client-go/tree/tail_based_sampling

yurishkuro · 2020-10-22T03:46:40Z

@JasonChen86899 could you outline the changes?

JasonChen86899 · 2020-10-23T04:10:23Z

@JasonChen86899 could you outline the changes?

@yurishkuro @vprithvi

jaeger-client-go:
a. Add functions which handle with span tags with rpc error or other inner error or http status=5xx
b. Add functions which baggage upstream ip and downstream ip.
c. Update report span process. When span finished, check tags and baggages, then send to agent
jaeger:
a. Update agent process. When tags has special value, then push to collector otherwise put into buffer maintained few time(time wheel buffer)
b. Update collector process. When collector receive special spans, then got downstream ip form them for downstream spans, then loop query until trace end.

heruv1m · 2021-05-26T12:11:18Z

Is there any alive version of tail based sampling?

jpkrohling · 2021-05-26T12:20:49Z

No, unfortunately not as part of the Jaeger project.

jkowall · 2021-05-26T12:44:41Z

Otel tail-based sampling and filtering is working great these days thanks to all of your work on that @jpkrohling I know of many users who have implemented it so far.

jdomag · 2022-08-19T10:22:06Z

Is there any update on this or this idea is dead?

88vishnu · 2024-01-11T12:38:23Z

Is tail based sampling available in jaeger-collector?
If yes then please provide me list of configuration parameters.
Thanks

yurishkuro · 2024-01-11T18:04:23Z

Tail based sampling today is supported by opentelemetry collector. Once Jaeger-v2 is released the same plugin would be usable directly in Jaeger.

88vishnu · 2024-01-12T01:46:25Z

Thanks

yurishkuro · 2024-12-14T21:55:38Z

Tail based sampling is supported in Jaeger v2 via tail sampling processor.

yurishkuro added enhancement roadmap labels Nov 30, 2017

jpkrohling mentioned this issue Jun 25, 2018

Provide a way to sample entire trace which contains span with tag "error=true" jaegertracing/jaeger-client-java#465

Closed

yurishkuro mentioned this issue Oct 29, 2018

can we always sample slow requests? #1152

Closed

Dieterbe mentioned this issue Oct 30, 2018

slow query log grafana/metrictank#982

Closed

jpkrohling mentioned this issue Apr 4, 2019

Selectively send all long-running traces? #1456

Closed

beyang mentioned this issue Mar 18, 2020

jaeger integration substantially harms search performance sourcegraph/sourcegraph-public-snapshot#5363

Closed

yurishkuro mentioned this issue Oct 10, 2020

Tail_base_sampling sample mode for agent and collector #2549

Closed

eltalkarim mentioned this issue Oct 28, 2020

Start Sampling after a certain Threshold #2594

Open

Stono mentioned this issue Dec 7, 2020

Tracing Sample rate on a per-service basis istio/istio#12818

Open

yurishkuro removed the roadmap label Feb 3, 2024

yurishkuro closed this as completed Dec 14, 2024

Discuss post-trace (tail-based) sampling #425

Discuss post-trace (tail-based) sampling #425

Comments

yurishkuro commented Sep 24, 2017 • edited Loading

Update 2020-02-05

jpkrohling commented Sep 25, 2017

vprithvi commented Oct 2, 2017

yurishkuro commented Oct 2, 2017

otisg commented Apr 12, 2018

yurishkuro commented Apr 12, 2018

maniaabdi commented Sep 13, 2018

isaachier commented Sep 13, 2018

rwkarg commented Oct 16, 2018

yurishkuro commented Oct 16, 2018

jewelzhu commented Jan 15, 2019

rstraszewski commented Feb 26, 2019

yurishkuro commented Feb 26, 2019

jpkrohling commented Apr 2, 2019

yurishkuro commented Apr 2, 2019

jpkrohling commented Apr 2, 2019

yurishkuro commented Apr 2, 2019 • edited Loading

Stono commented Jun 7, 2019

vikrambe commented Feb 25, 2020

Stono commented Feb 25, 2020

william-tran commented May 7, 2020

yurishkuro commented May 7, 2020

wyattanderson commented May 7, 2020

yurishkuro commented May 7, 2020

JasonChen86899 commented Oct 16, 2020

Update 2020-02-05

jpkrohling commented Oct 16, 2020

JasonChen86899 commented Oct 16, 2020

vprithvi commented Oct 16, 2020 • edited Loading

JasonChen86899 commented Oct 22, 2020 • edited Loading

yurishkuro commented Oct 22, 2020

JasonChen86899 commented Oct 23, 2020 • edited Loading

heruv1m commented May 26, 2021

jpkrohling commented May 26, 2021

jkowall commented May 26, 2021

jdomag commented Aug 19, 2022

88vishnu commented Jan 11, 2024 • edited Loading

yurishkuro commented Jan 11, 2024

88vishnu commented Jan 12, 2024

yurishkuro commented Dec 14, 2024

yurishkuro commented Sep 24, 2017 •

edited

Loading

yurishkuro commented Apr 2, 2019 •

edited

Loading

vprithvi commented Oct 16, 2020 •

edited

Loading

JasonChen86899 commented Oct 22, 2020 •

edited

Loading

JasonChen86899 commented Oct 23, 2020 •

edited

Loading

88vishnu commented Jan 11, 2024 •

edited

Loading