Allow to limit number of spans per trace for UI #2496
It'd be nice if we could paginate the response when there are more than, let's say, 10k spans. It might be hard, though, to figure out a nice way to display that.
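As a rough sketch of that pagination idea (illustrative only, not an existing Zipkin API; the class and method names are made up), the server could slice a large trace into fixed-size pages and let a client fetch them one at a time:

```java
import java.util.ArrayList;
import java.util.List;

class TracePaging {
  // Illustrative helper: split a large list of spans into pages of at most
  // pageSize elements, so a client could fetch e.g. 10k spans per request.
  static <T> List<List<T>> paginate(List<T> spans, int pageSize) {
    List<List<T>> pages = new ArrayList<>();
    for (int i = 0; i < spans.size(); i += pageSize) {
      pages.add(spans.subList(i, Math.min(i + pageSize, spans.size())));
    }
    return pages;
  }
}
```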
As these sorts of issues tend to be rehashes that recreate content in various places, can we move parts of the discussion to where some history already exists?
We don't have any issue at the moment on mitigating based on span count at ingress (collection). I think this issue has new discussion on that point, and it probably deserves some thought.
Some notes about the issue here.
I do think span-count pruning has been discussed in Brave, but there are concerns about how that affects things. Doing it on the collector side would involve state to track counts; that is possible in single-instance scenarios, or where there is shared state you can look up by trace ID. I can see why someone would wonder whether we could special-case this in the in-memory collector, as we already have collections keyed by trace ID.
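For illustration only (this is not how the Zipkin collector works today, and all names here are hypothetical), a single-instance counter keyed by trace ID might look like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of collector-side span-count pruning: count spans per
// trace ID and drop anything past a configured cap.
class SpanCountLimiter {
  final int maxSpansPerTrace;
  final Map<String, AtomicInteger> countsByTraceId = new ConcurrentHashMap<>();

  SpanCountLimiter(int maxSpansPerTrace) {
    this.maxSpansPerTrace = maxSpansPerTrace;
  }

  /** Returns true if the span should be kept, false once the trace hit the cap. */
  boolean accept(String traceId) {
    AtomicInteger count = countsByTraceId.computeIfAbsent(traceId, id -> new AtomicInteger());
    return count.incrementAndGet() <= maxSpansPerTrace;
  }
}
```

In a multi-instance deployment the counts would need to live in shared state looked up by trace ID, which is exactly the limitation mentioned above.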
@timothybasanov P.S. do you mind hopping on Gitter for the load-test related questions? I want to dig into them without wandering this issue too much: https://gitter.im/openzipkin/zipkin
We probably can close this ticket (and related ones) as duplicates of one big ticket. I think the other tickets are slightly different from this one: they are about a slow UI, while this one is about the UI silently crashing with exceptions (overflow and/or browser crash) and the server silently failing with exceptions (integer overflow). I don't expect Zipkin to work well with big traces; I just want it not to crash.
I'd prefer to have tickets for individually solvable things even if there is a big ticket with tickboxes, especially where existing tickets are present.
This will help reduce problems with the big-ticket approach, including:
* long, winding discussions (tickets don't thread as well as email, etc.)
* rehashing, in a new place, incomplete discussions that already exist
* reduced approachability of doing anything, as the work feels too big
In other words, please let's not close existing issues unless they are done or are dupes of older issues.
I am pretty sure there were other tickets that discussed what to do with large traces; the difference between 10k and 1M spans is a factor in that discussion. However, it is fine with me to repurpose this issue for traces too huge to load. Please don't close this, as I don't want to require people to look at two issues to get the discussion so far.
On this particular point of traces too big to load at all: there seem to be two ways about it, depending on whether we focus on after the fact (UI) or before (collection).
While the UI is the primary user of the API, it isn't the only user, so changing the API with special hooks is likely not worth doing. The API is also not well suited to retrofitting things like pagination quickly. This issue is about preventing crashes, so we can keep that in mind. There are three entry points that this limiting concerns.
I can think of some remedies, and we have to remember this is different from the other issue about traces that are too big, but not big enough to crash. I think the simple way out would be to refuse to load when the payload is larger than, say, 5 MiB by default, with a setting one can update. I prefer size-based limits, as traces can be huge for reasons other than span count, like too many tags or annotations. Another option that comes to mind is to look at what tools exist for incremental parsing of JSON. cc @openzipkin/ui
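A minimal sketch of that size-based refusal (the class, method, and 5 MiB default are illustrative, not an existing Zipkin setting): check the payload size up front and fail with a clear message instead of crashing while parsing:

```java
// Hypothetical guard, not Zipkin code: refuse to parse a trace payload larger
// than a configurable limit (defaulting to 5 MiB) instead of letting the
// browser or server fall over on it.
class PayloadGuard {
  static final long DEFAULT_MAX_BYTES = 5L * 1024 * 1024; // 5 MiB

  static void checkSize(long payloadBytes, long maxBytes) {
    if (payloadBytes > maxBytes) {
      throw new IllegalArgumentException(
          "Trace payload is " + payloadBytes + " bytes, which exceeds the configured limit of "
              + maxBytes + " bytes; refusing to load it");
    }
  }
}
```

Checking bytes rather than span count matches the point above: traces can be huge because of tags or annotations, not only because of how many spans they contain.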
Opened #2498 on the in-memory storage special-casing, though it won't work for other storage. The rationale is that it is meant for testing, so maybe we can improve the limiter there.
@jorgheymans suggested in #2554 what to do when said limit is reached.
FWIW, @bulicekj had the same idea at the last UI workshop, I think: some sort of progressive loading within a trace. There are some issues with that, as we do validation up front, etc., but anyway I think this is helpful: https://cwiki.apache.org/confluence/display/ZIPKIN/2019-04-17+UX+workshop+and+Lens+GA+planning+at+LINE+Fukuoka
Actually, I missed that #2411 is the better issue for the progressive loading... argh.
Feature:
Add a configuration parameter to limit the number of spans within a single trace.
It could potentially be applied to search results and/or in-memory storage.
Rationale:
Accidentally running a big fan-out job may create a trace with a million spans. This affects users in several ways:
* zipkin2.server.internal.ZipkinQueryApiV2#writeTraces has an integer overflow when the resulting JSON is more than 2 GB in size (a minimal illustration of that overflow follows)
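As a standalone illustration (not Zipkin code), a 32-bit counter silently wraps negative once an accumulated byte count passes Integer.MAX_VALUE (about 2 GiB):

```java
public class OverflowDemo {
  public static void main(String[] args) {
    int totalBytes = 0;
    for (int i = 0; i < 3; i++) {
      totalBytes += 1_000_000_000; // three ~1 GB chunks
    }
    // Prints a negative number: the 32-bit counter wrapped past Integer.MAX_VALUE.
    System.out.println(totalBytes);
  }
}
```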
Proposed solution: