Should lens fan-out queries to the "archive" zipkin #3023

codefromthecrypt · 2020-03-12T00:16:48Z

#3018 adds the archive button which allows someone to POST the current trace to an archive server. One idea around permalinks was to use a proxy behind the UI, to query multiple sources when you know the archive is zipkin. However, this is more infrastructure

@mchandramouli suggests:

Zipkin lens [can] search both live and archive cluster (beyond live cluster time range) so archived traces can appear in the same UI

#3018 (comment)

It is possible that the "archive" is write-only, for example a Zipkin clone or APM endpoint that doesn't support our api. However, an OPTIONS request could let us know that. It is unlikely we'll know what timerange is valid on the other source as the "archive" could be just another wider scoped service, and we don't even know the reason for archival is time related.

However, we can consider impact of doing this:

CORS - we have permissive cross-origin settings, so a default zipkin service will accept cross-origin requests.. lens could query two zipkin services
conflicts - it is possible two services has the same data, but we could prefer the archive on conflict
performance - the browser will get multiple queries back and that would double the bandwith on overlap. Some tuning or configuration might be needed to say when to use the archive.

On the latter, if we do end up with a standard tag, querying archive would be as easy as looking for the tag "zipkin.archive" or whatever it ends up named as. This would be a cheap way to reduce volatility on normal queries.

jcchavezs · 2020-03-12T00:30:33Z

I think there are many implications on doing a mixed search, one is performance, another one is sorting, counting, dedup, also lookback might not make sense for archived (maybe serviceName either since we won't retrieve serviceNames/spanNames from archive or will we?). I strongly believe if we aim to search archived in a zipkin (which can be detected with the `OPTIONS` call) the easiest is that we tell lens to call one API or another based on a URI (e.g /zipkin and /zipkin/archived). In such cases /zipkin/archived will work the same as /zipkin but different zipkin API and appending the tag query to the call.

…

On Thu, 12 Mar 2020, 01:17 Adrian Cole, ***@***.***> wrote: #3018 <#3018> adds the archive button which allows someone to POST the current trace to an archive server. One idea around permalinks was to use a proxy behind the UI, to query multiple sources when you know the archive is zipkin. However, this is more infrastructure @mchandramouli <https://github.com/mchandramouli> suggests: Zipkin lens [can] search both live and archive cluster (beyond live cluster time range) so archived traces can appear in the same UI #3018 (comment) <#3018 (comment)> It is possible that the "archive" is write-only, for example a Zipkin clone or APM endpoint that doesn't support our api. However, an OPTIONS request could let us know that. It is unlikely we'll know what timerange is valid on the other source as the "archive" could be just another wider scoped service, and we don't even know the reason for archival is time related. However, we can consider impact of doing this: - CORS - we have permissive cross-origin settings, so a default zipkin service will accept cross-origin requests.. lens could query two zipkin services - conflicts - it is possible two services has the same data, but we could prefer the archive on conflict - performance - the browser will get multiple queries back and that would double the bandwith on overlap. Some tuning or configuration might be needed to say when to use the archive. On the latter, if we do end up with a standard tag, querying archive would be as easy as looking for the tag "zipkin.archive" or whatever it ends up named as. This would be a cheap way to reduce volatility on normal queries. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3023>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAXOYAXLBNLSXWVAJYHOPOLRHASZNANCNFSM4LGBJYSQ> .

codefromthecrypt · 2020-03-12T00:47:54Z

Thanks, @jcchavezs I probably wasn't clear. what I meant was that lens in its api code could embed the logic, just calling two apis like mux does. Main difference in simplification is that we are talking about 1+1 apis, not 1+N which I agree is more complex.

Any logic about merging applies agree, but I don't think that and counts are complex when there are only 2 sources. Since only 2, conflict resolution logic to pick one trace instead of merging them. Also, if a problem that someone asked for 10 traces and somehow that period had overlap and some from archive, deciding which to throw out is something we can do predictably also (ex. the newest in that period, as we do in cassandra fan-outs in the backend). Concretely, if this is bound only to "archive" then it we get the luxury of simple conflict resolution.

lookback isn't an issue really, where it matters is set on the backend. the lookback in lens is just about the default duration of the trace query. For example, you can override this now with 1970 as we do in our testdata readme

I think what I caused most confusion with was by mentioning OPTIONS and not explaining why. What I meant was that we can use OPTIONS to try to detect if the archiveUrl is a normal zipkin server, and then no logic happens at all if we know it not to be. That was just a simplication.. again all UI logic is exactly the same except the api handler, it is only moving the code in lua into javascript, which would have exactly the same concerns.

These clarifications don't imply I personally think we should do this, just want to help ease this topic as I think path based would be even worse to do in our api as then it messes with internal routing logic, etc. easier for a proxy to just mount the other /zipkin under /zipkin-archive imho

In other words, if I have a vote, it is to do the least impact if we are to do something like this, and so far affecting the api javascript code only seems the least impact.

codefromthecrypt · 2020-03-12T00:59:21Z

Some real impacts of proxying regardless of lua or in-browser:

pollution of the service/span/auto-complete name list or a choice to prefer

probably one thing @jcchavezs implies is that if you merge names, maybe you don't want to use service-span names from a very long time ago. Note: this is the same issue on the archive itself, so effectively if you have a problem there, it escalates the issue here. This is because our names queries do not include a timerange.. they are independent of time.

For the "firehose" mode thing, where the source has TTL 1hr and archive has a short 3-7 day TTL. I do not expect this to be a concern. If people are using this to access traces from several years ago, if we did this, they shouldn't fan-out (ex not offer an archiveUrl property only archivePostUrl)

codefromthecrypt · 2020-03-12T01:29:59Z

an alternate to doing this is also not doing this and using existing tech.

We've had an environment property for a long time, to differentiate zipkins. Ex. one zipkin can set to "1hr" and another to "7days", they would have different URLS, possibly on the same host if using basePath and a dumb proxy. In any case this approach's main downside is that users have to be conscious of the URL in use. Luckily, #3018 solves this with the archiveUrl property, which is what you would paste into JIRA.

The other is transparently through proxying the api with mux, however that is additional infrastructure even if small to achieve.

My 2p: The only compelling reason to do this in-browser is in attempt to reduce infrastructure, while providing a natural permalink (subject to the TTL of the archive).

There could be sites that are unable to run a proxy, but we've not heard of them so far, especially in context of an unreleased UI feature for archive button. That in mind, this issue is probably fair game to let sit, allowing us to focus energy on non-speculative change.

mchandramouli · 2020-03-12T05:00:51Z

@adriancole @jcchavezs Maybe I should try to clarify what I meant by beyond live cluster time range in my original comment. Zipkin lens should fire a query to the archive server only for time range beyond the known ttl of the live cluster. i.e., if the ttl of live cluster is 24 hrs, UI should fire a query in archive for t - 24hrs and earlier. This will avoid dedup issues. Does it make sense?

codefromthecrypt · 2020-03-12T05:07:16Z

@mchandramouli I get you but it requires knowledge of this data pushed up into the UI, specific to the idea that this setup is commonly why the archive button appears. It would reduce dupes, but not prevent them, but yeah it is indeed a complexity worth mentioning. thanks!

mchandramouli · 2020-03-12T16:15:31Z

@adriancole I was wondering the same - we cannot keep UI clean and not be aware of backend internals (like ttl). Anyways, thought I will put it out on the table. I will watch this issue and see where we go with this.

jeqo · 2020-03-12T22:54:16Z

I'm initially inclined to this alternative:

an alternate to doing this is also not doing this and using existing tech.
We've had an environment property for a long time, to differentiate zipkins. Ex. one zipkin can set to "1hr" and another to "7days", they would have different URLS, possibly on the same host if using basePath and a dumb proxy.

-> 2 different instances (active and archive) with ${archivePostPath} feature to move traces between them.

environment would be flexible enough to be used to differentiate active and archive instances--even though I've to confess I haven't used this feature before.

I'd assume sites with this need will have an infrastructure big enough to justify running an archive instance in parallel.

Another alternative that is appealing to me is to have this functionality as part of the storage API. This will involve supporting multiple storages plugged to the same instance (e.g. traces->elasticsearch, dependencies->elasticsearch, archive->cassandra). Thinking out loud, in this scenario I'd envision Lens having a query flag to lookup from traces or archive endpoint. It could even show archive/favorite traces as top results if we support to return results from both endpoints.

Doing dual-read from Lens would be too complex imho--following your comments above. But other could weight this complexity better than me.

codefromthecrypt · 2020-06-22T02:11:56Z

I'm going to close this issue as it helps the team focus on issues to progress. We can re-open it if it becomes actionable with new information, more hands, more demand. Thanks folks!

codefromthecrypt added enhancement ui Zipkin UI labels Mar 12, 2020

codefromthecrypt mentioned this issue Mar 12, 2020

Add Archive Trace button #3018

Merged

jorgheymans mentioned this issue Apr 11, 2020

Add "save a trace" functionality. #1093

Closed

codefromthecrypt closed this as completed Jun 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should lens fan-out queries to the "archive" zipkin #3023

Should lens fan-out queries to the "archive" zipkin #3023

codefromthecrypt commented Mar 12, 2020

jcchavezs commented Mar 12, 2020 via email

codefromthecrypt commented Mar 12, 2020

codefromthecrypt commented Mar 12, 2020 •

edited

Loading

codefromthecrypt commented Mar 12, 2020

mchandramouli commented Mar 12, 2020

codefromthecrypt commented Mar 12, 2020

mchandramouli commented Mar 12, 2020

jeqo commented Mar 12, 2020

codefromthecrypt commented Jun 22, 2020

Should lens fan-out queries to the "archive" zipkin #3023

Should lens fan-out queries to the "archive" zipkin #3023

Comments

codefromthecrypt commented Mar 12, 2020

jcchavezs commented Mar 12, 2020 via email

codefromthecrypt commented Mar 12, 2020

codefromthecrypt commented Mar 12, 2020 • edited Loading

pollution of the service/span/auto-complete name list or a choice to prefer

codefromthecrypt commented Mar 12, 2020

mchandramouli commented Mar 12, 2020

codefromthecrypt commented Mar 12, 2020

mchandramouli commented Mar 12, 2020

jeqo commented Mar 12, 2020

codefromthecrypt commented Jun 22, 2020

codefromthecrypt commented Mar 12, 2020 •

edited

Loading