-
Notifications
You must be signed in to change notification settings - Fork 50
proposal: Better default sampler #156
Comments
agree
|
/cc @bogdandrutu |
Changing the default is a breaking change which we may not be able to do that easily. I understand the problem and I think the current best solution is to add a qps based sampler and suggest that in our examples instead of always sample. What do you think? @adriancole @Ramonza |
Well we have a well-defined mechanism for making breaking API changes in Go. It seems like we should also have a way to make breaking non-API changes like this. I definitely agree that it needs to be done slowly over time and can't just be changed outright and that the examples are the first thing we should update. |
I like the idea but also think it should be done carefully (no surprises to existing users). Is there an explicit flag/mode that can enable this behavior for people who're trying to demo or debug? that way we don't surprise the existing users |
We are using adaptive sampling as a default in Application Insights. It adjusts sampling percentage so the rate of produced telemetry will be limited to 5 (default) per second. Since pricing is per volume of telemetry - this approach allows to estimate the cost of monitoring and keep it (virtually) constant on any high load including spikes. It also works great in dev environment. More dev notes here. Will this one be a better alternative? |
Is this the same as RateLimit sampling described here: https://github.com/census-instrumentation/opencensus-specs/blob/master/trace/Sampling.md I agree that something like this is a better default. |
@Ramonza idea of adaptive sampling is easier. Adaptive sampling is basically a wrapper on probability sampled that adjusts probability sampler config based on previous time interval. All decisions of adaptive sampler are delayed, but it make the spans easier to analyze. Each time frame sampling probability was constant and you can gain up numbers to estimate the real number of spans were seen in that interval. |
I think the default sampler of 1/10k is causing a number of problems.
First, users who want to initially start doing tracing want to be able to see traces without rolling it out to production. In this case, they usually set the sampler to AlwaysSample.
Then, sometimes they commit this and forget to change it leading to unintentionally starting traces for all requests. In a high-traffic system, this can be very costly.
My initial proposal for discussion would be that we change the default sampler so that it is suitable for both development and production. For example, we could rate limit sampling up to 1 per second and thereafter, sample 1 in 1000 (say) up to 5 per second.
This should allow people to see traces during development and testing (all requests will be traces if the service sees less than 1qps) and should still be fine in production since we limit ourselves to 5 per second.
The text was updated successfully, but these errors were encountered: