Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should lens fan-out queries to the "archive" zipkin #3023

Closed
codefromthecrypt opened this issue Mar 12, 2020 · 9 comments
Closed

Should lens fan-out queries to the "archive" zipkin #3023

codefromthecrypt opened this issue Mar 12, 2020 · 9 comments
Labels

Comments

@codefromthecrypt
Copy link
Member

#3018 adds the archive button which allows someone to POST the current trace to an archive server. One idea around permalinks was to use a proxy behind the UI, to query multiple sources when you know the archive is zipkin. However, this is more infrastructure

@mchandramouli suggests:

Zipkin lens [can] search both live and archive cluster (beyond live cluster time range) so archived traces can appear in the same UI

#3018 (comment)

It is possible that the "archive" is write-only, for example a Zipkin clone or APM endpoint that doesn't support our api. However, an OPTIONS request could let us know that. It is unlikely we'll know what timerange is valid on the other source as the "archive" could be just another wider scoped service, and we don't even know the reason for archival is time related.

However, we can consider impact of doing this:

  • CORS - we have permissive cross-origin settings, so a default zipkin service will accept cross-origin requests.. lens could query two zipkin services
  • conflicts - it is possible two services has the same data, but we could prefer the archive on conflict
  • performance - the browser will get multiple queries back and that would double the bandwith on overlap. Some tuning or configuration might be needed to say when to use the archive.

On the latter, if we do end up with a standard tag, querying archive would be as easy as looking for the tag "zipkin.archive" or whatever it ends up named as. This would be a cheap way to reduce volatility on normal queries.

@jcchavezs
Copy link
Contributor

jcchavezs commented Mar 12, 2020 via email

@codefromthecrypt
Copy link
Member Author

Thanks, @jcchavezs I probably wasn't clear. what I meant was that lens in its api code could embed the logic, just calling two apis like mux does. Main difference in simplification is that we are talking about 1+1 apis, not 1+N which I agree is more complex.

Any logic about merging applies agree, but I don't think that and counts are complex when there are only 2 sources. Since only 2, conflict resolution logic to pick one trace instead of merging them. Also, if a problem that someone asked for 10 traces and somehow that period had overlap and some from archive, deciding which to throw out is something we can do predictably also (ex. the newest in that period, as we do in cassandra fan-outs in the backend). Concretely, if this is bound only to "archive" then it we get the luxury of simple conflict resolution.

lookback isn't an issue really, where it matters is set on the backend. the lookback in lens is just about the default duration of the trace query. For example, you can override this now with 1970 as we do in our testdata readme

I think what I caused most confusion with was by mentioning OPTIONS and not explaining why. What I meant was that we can use OPTIONS to try to detect if the archiveUrl is a normal zipkin server, and then no logic happens at all if we know it not to be. That was just a simplication.. again all UI logic is exactly the same except the api handler, it is only moving the code in lua into javascript, which would have exactly the same concerns.

These clarifications don't imply I personally think we should do this, just want to help ease this topic as I think path based would be even worse to do in our api as then it messes with internal routing logic, etc. easier for a proxy to just mount the other /zipkin under /zipkin-archive imho

In other words, if I have a vote, it is to do the least impact if we are to do something like this, and so far affecting the api javascript code only seems the least impact.

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Mar 12, 2020

Some real impacts of proxying regardless of lua or in-browser:

pollution of the service/span/auto-complete name list or a choice to prefer

probably one thing @jcchavezs implies is that if you merge names, maybe you don't want to use service-span names from a very long time ago. Note: this is the same issue on the archive itself, so effectively if you have a problem there, it escalates the issue here. This is because our names queries do not include a timerange.. they are independent of time.

For the "firehose" mode thing, where the source has TTL 1hr and archive has a short 3-7 day TTL. I do not expect this to be a concern. If people are using this to access traces from several years ago, if we did this, they shouldn't fan-out (ex not offer an archiveUrl property only archivePostUrl)

@codefromthecrypt
Copy link
Member Author

an alternate to doing this is also not doing this and using existing tech.

We've had an environment property for a long time, to differentiate zipkins. Ex. one zipkin can set to "1hr" and another to "7days", they would have different URLS, possibly on the same host if using basePath and a dumb proxy. In any case this approach's main downside is that users have to be conscious of the URL in use. Luckily, #3018 solves this with the archiveUrl property, which is what you would paste into JIRA.

The other is transparently through proxying the api with mux, however that is additional infrastructure even if small to achieve.

My 2p: The only compelling reason to do this in-browser is in attempt to reduce infrastructure, while providing a natural permalink (subject to the TTL of the archive).

There could be sites that are unable to run a proxy, but we've not heard of them so far, especially in context of an unreleased UI feature for archive button. That in mind, this issue is probably fair game to let sit, allowing us to focus energy on non-speculative change.

@mchandramouli
Copy link

@adriancole @jcchavezs Maybe I should try to clarify what I meant by beyond live cluster time range in my original comment. Zipkin lens should fire a query to the archive server only for time range beyond the known ttl of the live cluster. i.e., if the ttl of live cluster is 24 hrs, UI should fire a query in archive for t - 24hrs and earlier. This will avoid dedup issues. Does it make sense?

@codefromthecrypt
Copy link
Member Author

@mchandramouli I get you but it requires knowledge of this data pushed up into the UI, specific to the idea that this setup is commonly why the archive button appears. It would reduce dupes, but not prevent them, but yeah it is indeed a complexity worth mentioning. thanks!

@mchandramouli
Copy link

@adriancole I was wondering the same - we cannot keep UI clean and not be aware of backend internals (like ttl). Anyways, thought I will put it out on the table. I will watch this issue and see where we go with this.

@jeqo
Copy link
Member

jeqo commented Mar 12, 2020

I'm initially inclined to this alternative:

an alternate to doing this is also not doing this and using existing tech.
We've had an environment property for a long time, to differentiate zipkins. Ex. one zipkin can set to "1hr" and another to "7days", they would have different URLS, possibly on the same host if using basePath and a dumb proxy.

-> 2 different instances (active and archive) with ${archivePostPath} feature to move traces between them.

environment would be flexible enough to be used to differentiate active and archive instances--even though I've to confess I haven't used this feature before.

I'd assume sites with this need will have an infrastructure big enough to justify running an archive instance in parallel.

Another alternative that is appealing to me is to have this functionality as part of the storage API. This will involve supporting multiple storages plugged to the same instance (e.g. traces->elasticsearch, dependencies->elasticsearch, archive->cassandra). Thinking out loud, in this scenario I'd envision Lens having a query flag to lookup from traces or archive endpoint. It could even show archive/favorite traces as top results if we support to return results from both endpoints.

Doing dual-read from Lens would be too complex imho--following your comments above. But other could weight this complexity better than me.

@codefromthecrypt
Copy link
Member Author

I'm going to close this issue as it helps the team focus on issues to progress. We can re-open it if it becomes actionable with new information, more hands, more demand. Thanks folks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants