Clock skew occurs when servers in a distributed system do not agree on the current time. This is a problem for distributed tracing, because timestamps are critical for visualizing the relationships between transactions and spans across services.
Minimizing clock skew
Time for RUM agent events is inherently skewed, since the timestamp is recorded after the fact by the apm-server.
Clock skew for other agents can be mitigated by something like NTP, but not eliminated. According to one source, a default NTP setup polls the NTP server at an interval between 64 and 1024 seconds. NTP is not perfect and can have problems of its own, but it is probably good enough in most cases.
Questions:
Should agents and apm-server detect and warn if NTP is not enabled?
Should agents and apm-server make a sanity time check against [some magic server]?
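For reference, a sanity check like this could use the same offset arithmetic NTP uses: the client records when it sends a request and when it receives the reply, the reference server records when it received the request and when it replied, and the offset is derived from those four timestamps. The sketch below is only an illustration of that arithmetic. The function names and the warning threshold are hypothetical, and which server to ask is left open, as in the question above.

```python
def estimate_clock_offset(t0: int, t1: int, t2: int, t3: int) -> float:
    """Estimate how far the server clock is ahead of the client clock.

    t0: request sent      (client clock)
    t1: request received  (server clock)
    t2: reply sent        (server clock)
    t3: reply received    (client clock)

    Assumes the network delay is the same in both directions.
    """
    return ((t1 - t0) + (t2 - t3)) / 2


def round_trip_delay(t0: int, t1: int, t2: int, t3: int) -> int:
    """Time spent on the network, excluding server processing time."""
    return (t3 - t0) - (t2 - t1)


def should_warn(offset_ms: float, threshold_ms: float = 1000.0) -> bool:
    """Hypothetical policy: warn when the absolute offset exceeds a threshold."""
    return abs(offset_ms) > threshold_ms
```

With timestamps in milliseconds, a server whose clock is five seconds ahead and a symmetric 50 ms one-way delay would give `estimate_clock_offset(1000, 6050, 6060, 1110) == 5000.0`. The estimate is only as good as the symmetric-delay assumption.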
Proposed solution
In the event of clock skew, there is not much the agents or apm-server can do. Instead I propose that the UI should ensure that the positioning of events at the very least doesn't conflict with the specified parent-child relationship. Meaning: an event should never start before the parent that initiated it did.
In the following example we have two services: opbeans-node and opbeans-node-api with three events:
opbeans-node: initiates the trace (transaction)
opbeans-node: makes an outgoing request to opbeans-node-api (span)
opbeans-node-api: receives the request (transaction)
In this trace, the clock of opbeans-node-api is behind the clock in opbeans-node, which causes the transaction in opbeans-node-api to appear to start before the request from opbeans-node has been made.
If we assume zero latency between the two services, we can adjust the misaligned transaction to start at the same time as its parent span.
The above is a very simple example; in a big trace, all downstream children might need to be re-adjusted when their parent is adjusted. There are probably a lot of gotchas I haven't thought about, but I think the UI needs to solve these issues regardless.
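To make the idea concrete, here is a minimal sketch of the adjustment, written in Python for brevity rather than in the UI's own language. It walks the trace from the root, moves any event that starts before its parent to the parent's start time (the zero-latency assumption), and carries that shift down to children reported by the same service, since they share the same clock. The `Event` type, its field names, and `correct_clock_skew` are made up for illustration and are not an existing API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    """A transaction or span in a trace (illustrative fields only)."""
    id: str
    parent_id: Optional[str]
    service: str
    start_us: int  # start timestamp as reported, in microseconds


def correct_clock_skew(events: List[Event]) -> Dict[str, int]:
    """Return an adjusted start time per event id so that no event
    starts before its parent. Assumes zero latency between services."""
    children: Dict[Optional[str], List[Event]] = {}
    for event in events:
        children.setdefault(event.parent_id, []).append(event)

    adjusted: Dict[str, int] = {}

    def visit(event: Event, parent: Optional[Event], parent_offset: int) -> None:
        # Events from the same service share a clock, so they inherit the
        # shift applied to their parent. A different service starts at 0.
        same_clock = parent is not None and parent.service == event.service
        offset = parent_offset if same_clock else 0
        start = event.start_us + offset
        if parent is not None and start < adjusted[parent.id]:
            # Still before the parent: move it to the parent's start.
            offset += adjusted[parent.id] - start
            start = adjusted[parent.id]
        adjusted[event.id] = start
        for child in children.get(event.id, []):
            visit(child, event, offset)

    for root in children.get(None, []):
        visit(root, None, 0)
    return adjusted


trace = [
    Event("t1", None, "opbeans-node", 1000),      # initiates the trace
    Event("s1", "t1", "opbeans-node", 1100),      # outgoing request
    Event("t2", "s1", "opbeans-node-api", 900),   # reported before s1
    Event("s2", "t2", "opbeans-node-api", 950),   # child of the skewed event
]
print(correct_clock_skew(trace))
```

Here `t2` is moved from 900 to 1100 to line up with `s1`, and `s2` is shifted by the same 200 microseconds because it comes from the same service. The sketch only corrects start times; it does not handle a child that ends after its parent, or events whose parent is missing from the trace.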