[APM] Time shifting in DT #24072

Closed
sorenlouv opened this issue Oct 16, 2018 · 2 comments
Assignees
Labels
Team:APM All issues that need APM UI Team support v6.5.0

Comments

@sorenlouv
Member

sorenlouv commented Oct 16, 2018

Clock skew is a phenomenon that happens when servers in a distributed system do not follow the same clock time. This is a problem for distributed tracing, since the timestamp is critical in visualizing the relationships between transactions and spans across services.

Minimizing clock skew
Time for RUM agent events is inherently skewed, since the timestamp is recorded after the fact by the apm-server.

Clock skew for other agents can be mitigated, but not eliminated, by something like NTP. According to one source, a default NTP setup polls the NTP server at an interval of between 64 and 1024 seconds. NTP is not perfect and can have problems of its own, but it is probably good enough in most cases.

Questions:

  • Should agents and apm-server detect and warn if NTP is not enabled?
  • Should agents and apm-server make a sanity time check against [some magic server]?

Proposed solution
In the event of clock skew, there is not much the agents or apm-server can do. Instead I propose that the UI should ensure that the positioning of events at the very least doesn't conflict with the specified parent-child relationship. Meaning: an event should never start before the parent that initiated it did.

In the following example we have two services: opbeans-node and opbeans-node-api with three events:

  1. opbeans-node: initiates the trace (transaction)
  2. opbeans-node: makes an outgoing request to opbeans-node-api (span)
  3. opbeans-node-api: receives the request (transaction)

Here, the clock of opbeans-node-api is behind the clock of opbeans-node, so the transaction in opbeans-node-api appears to start before opbeans-node has made the request:
screen shot 2018-09-26 at 17 48 45

If we assume zero latency between the two services, we can adjust the mis-aligned transaction to start at the same time as its parent span:
screen shot 2018-09-26 at 17 48 20
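
A minimal sketch of that adjustment, assuming zero latency as described. The `TimelineItem` shape and the `alignToParent` helper are hypothetical names used for illustration only; they are not the actual APM schema or UI code:

```typescript
// Hypothetical minimal shape for an event on the timeline (not the real APM schema).
interface TimelineItem {
  id: string;
  parentId?: string;
  start: number; // microseconds, as reported by the service's own clock
  duration: number; // microseconds
}

// Returns a copy of `child` that starts no earlier than `parent`.
// Assumes zero latency, so the earliest plausible start is parent.start.
// The duration is left untouched; only the position on the timeline changes.
function alignToParent(parent: TimelineItem, child: TimelineItem): TimelineItem {
  const skew = Math.max(0, parent.start - child.start);
  return { ...child, start: child.start + skew };
}

// Span in opbeans-node, and a transaction in opbeans-node-api that appears to start too early.
const span: TimelineItem = { id: 'span', parentId: 'tx-node', start: 1000, duration: 500 };
const skewedTx: TimelineItem = { id: 'tx-api', parentId: 'span', start: 800, duration: 300 };

const alignedTx = alignToParent(span, skewedTx);
console.log(alignedTx.start); // 1000
```

A child that already starts at or after its parent is returned unchanged, so the adjustment only kicks in when the parent-child ordering is violated.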

Above is a very simple example, and a big trace might require all downstream children to be re-adjusted when their parent is adjusted. There are probably a lot of gotchas I haven't thought about, but I think the UI needs to solve these issues regardless.
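
One possible way to handle the downstream re-adjustment is sketched below. It is only an illustration under stated assumptions: the `TraceItem` shape and `adjustTrace` function are hypothetical, and it assumes that the shift applied to an item should be carried over to all of its descendants before they are checked against their own parent:

```typescript
// Hypothetical item shape, for illustration only.
interface TraceItem {
  id: string;
  parentId?: string;
  start: number; // microseconds
  duration: number; // microseconds
}

// Shifts items so that none starts before its parent. The offset applied to an
// item is inherited by its descendants before they are checked themselves.
function adjustTrace(items: TraceItem[]): TraceItem[] {
  const childrenByParent = new Map<string, TraceItem[]>();
  for (const item of items) {
    if (item.parentId !== undefined) {
      const siblings = childrenByParent.get(item.parentId) ?? [];
      siblings.push(item);
      childrenByParent.set(item.parentId, siblings);
    }
  }

  const adjusted: TraceItem[] = [];
  const visit = (item: TraceItem, inheritedOffset: number, parentStart?: number): void => {
    let start = item.start + inheritedOffset;
    if (parentStart !== undefined && start < parentStart) {
      start = parentStart; // zero-latency assumption: start together with the parent
    }
    adjusted.push({ ...item, start });
    for (const child of childrenByParent.get(item.id) ?? []) {
      visit(child, start - item.start, start);
    }
  };

  // Roots are items without a parent, or whose parent is missing from the trace.
  const ids = new Set(items.map((i) => i.id));
  for (const root of items.filter((i) => i.parentId === undefined || !ids.has(i.parentId))) {
    visit(root, 0);
  }
  return adjusted;
}

const result = adjustTrace([
  { id: 'tx-node', start: 0, duration: 1000 },
  { id: 'span', parentId: 'tx-node', start: 100, duration: 600 },
  { id: 'tx-api', parentId: 'span', start: 40, duration: 400 }, // appears to start before its parent
  { id: 'api-span', parentId: 'tx-api', start: 90, duration: 100 },
]);
// tx-api is moved to 100 and api-span follows it to 150 (still 50 after tx-api).
```

Carrying the offset downstream is a guess that descendants share the skewed clock of the item that was moved. That will not hold when a descendant runs on yet another service, which is probably one of the gotchas.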

@sorenlouv sorenlouv added Team:APM All issues that need APM UI Team support [zube]: Inbox labels Oct 16, 2018
@elasticmachine
Contributor

Pinging @elastic/apm-ui

@sorenlouv
Member Author

Original (correct) timeline:
screen shot 2018-10-25 at 14 42 17

A span is shifted to the right to simulate clock skew:
screen shot 2018-10-25 at 14 42 23

To correct the clock skew, the affected child spans are shifted to the right:
screen shot 2018-10-25 at 14 42 31

Issues:

  • Relative distance between child spans is not preserved
  • The simple time shifting doesn't take the end of the parent into account (currently children are shifted outside the end of the transaction)
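
A rough sketch of one way both issues could be approached, not a worked-out fix: shift all affected siblings by a single common offset so their relative distances stay intact, and pick that offset so the group also ends inside the parent when it fits. `Interval` and `shiftSiblings` are hypothetical names, and the sketch assumes the siblings share the same skewed clock:

```typescript
// Hypothetical shape, for illustration only.
interface Interval {
  id: string;
  start: number; // microseconds
  duration: number; // microseconds
}

// Shifts all children by one common offset, which preserves the relative
// distances between them.
function shiftSiblings(parent: Interval, children: Interval[]): Interval[] {
  if (children.length === 0) return [];

  const parentEnd = parent.start + parent.duration;
  const earliestStart = Math.min(...children.map((c) => c.start));
  const latestEnd = Math.max(...children.map((c) => c.start + c.duration));

  const minOffset = parent.start - earliestStart; // smallest shift keeping every start >= parent.start
  const maxOffset = parentEnd - latestEnd; // largest shift keeping every end <= parent end

  // Use the offset closest to zero within [minOffset, maxOffset]. If the group is
  // wider than the parent, fall back to minOffset so the earliest child is
  // aligned with the parent's start.
  const offset =
    minOffset > maxOffset ? minOffset : Math.min(Math.max(0, minOffset), maxOffset);

  return children.map((c) => ({ ...c, start: c.start + offset }));
}

const parentSpan: Interval = { id: 'parent', start: 100, duration: 500 };
const shifted = shiftSiblings(parentSpan, [
  { id: 'a', start: 50, duration: 100 },
  { id: 'b', start: 200, duration: 100 },
]);
// 'a' now starts at 100 and 'b' at 250; they are still 150 apart.
```

When the children together span more time than the parent, no single offset satisfies both constraints, so the sketch favours the parent-child start ordering and lets the group overflow the end.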
