Lack of clarity on how the trace fields are supposed to be used #998

Open
nikclayton-dfinity opened this issue Oct 3, 2020 · 18 comments
Labels
ready Issues we'd like to address in the future.

Comments

@nikclayton-dfinity

Description of the issue:

The description of the tracing fields at https://www.elastic.co/guide/en/ecs/current/ecs-tracing.html is unclear on how they are supposed to be used.

Any additional context or examples:

The ordering of the fields on the page is alphabetical, but this presents them out of order with how they are supposed to be used.

I think the hierarchy is:

A trace contains one or more transactions, a transaction contains zero or more spans (cite for "zero or more" is https://www.elastic.co/guide/en/apm/get-started/current/transactions.html).

If this is correct then the documentation for a span, which starts "Unique identifier of the span within the scope of its trace" should probably be changed to "Unique identifier of the span within the scope of its transaction".

Linking to https://www.elastic.co/guide/en/apm/get-started/current/distributed-tracing.html from the descriptions would also be helpful.

As would an explicit description of the hierarchy in the introductory information in the page.

Uniqueness constraint is unclear

Does the "unique" in that sentence mean that span IDs should aim to be universally unique, or does it mean that a span ID only has to be unique within a single transaction, so:

  trace: {id: "123"},
  transaction: {id: "456"},
  span: {id: "authorization"},

and

  trace: {id: "123"},
  transaction: {id: "789"},
  span: {id: "authorization"},

would be OK (identical span IDs, but the transaction IDs are different) ?

How these work together in a tiered architecture is unclear

For example, suppose you have:

Client -> Load Balancer -> Authorization Server
              `----------> Application Server -> Database 1 -> Filesystem 1
                                    `----------> Database 2 -> Filesystem 2
  1. Client makes request which terminates at the Load Balancer.

  2. The Load Balancer will perform a lookup request to the authorization server, then forward the request to the application server (this is a slightly contrived example), so the load balancer has to:

a. Create a new trace id (since this is "a user request handled by multiple inter-connected services")

b. Contact the authorization server

  • Create a transaction ID ("the highest level of work measured within a service, such as a request to a server")
  • Create a span ID ("an operation within a transaction, such as a request to another service")
  3. The Authorization server receives the request.

What does it need to receive in order to effectively log an ECS event?

In particular, how do we correlate the event log the Authorization server is going to generate with the span for this request that the Load Balancer generated?

Does the span id that the Load Balancer generated become the trace id that the authorization server uses?

https://www.elastic.co/guide/en/apm/get-started/current/transaction-spans.html suggests that there are supposed to be transaction.id and parent.id attributes, but they're not present in the schema.

[I'm doing this in Rust, for which there isn't an APM agent yet. I want my app to emit logs that can be seamlessly ingested into Elastic with tracing support.]

@nikclayton-dfinity nikclayton-dfinity added the bug Something isn't working label Oct 3, 2020
@ebeahan ebeahan added review and removed review labels Oct 6, 2020
@ebeahan
Member

ebeahan commented Oct 7, 2020

Thanks @nikclayton-dfinity for the feedback. The detailed issue write-up is greatly appreciated.

@felixbarny @axw would either of you be able to help answer the APM-related questions concerning the tracing.* fields and their appropriate usage?

@ebeahan
Member

ebeahan commented Oct 7, 2020

The ordering of the fields on the page is alphabetical, but this presents them out of order with how they are supposed to be used.

Yes, the fields in each field set are sorted alphabetically, but I agree we can improve some of the supporting documentation to make usage clearer.

@axw
Member

axw commented Oct 8, 2020

@nikclayton-dfinity apologies for the lack of clarity. As you have discovered, we have only added a very small subset of tracing fields to ECS. We added these primarily to document how correlation across traces and logs should work. Excuses over, I'll try to clarify :)

I think the hierarchy is:

A trace contains one or more transactions, a transaction contains zero or more spans (cite for "zero or more" is https://www.elastic.co/guide/en/apm/get-started/current/transactions.html).

You understand correctly.

If this is correct then the documentation for a span, which starts "Unique identifier of the span within the scope of its trace" should probably be changed to "Unique identifier of the span within the scope of its transaction".

Although spans are logically scoped to transactions, span IDs must be unique within a trace. To uniquely identify any transaction or span, you must consider both trace.id and (transaction.id or span.id). A transaction can have a span as a parent.
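
A minimal sketch of this uniqueness rule, assuming events are plain dictionaries keyed by the dotted field names used in this thread. `event_key` is a hypothetical helper for illustration, not part of ECS or any APM agent:

```python
def event_key(event: dict) -> tuple:
    """Return the (trace.id, own id) pair that identifies one event.

    Hypothetical helper: a span event is keyed by its span.id, a
    transaction event (which has no span.id) by its transaction.id.
    """
    own_id = event.get("span.id") or event["transaction.id"]
    return (event["trace.id"], own_id)


# The same span.id value in two different traces gives distinct keys.
a = {"trace.id": "123", "transaction.id": "456", "span.id": "authorization"}
b = {"trace.id": "999", "transaction.id": "789", "span.id": "authorization"}
assert event_key(a) != event_key(b)

# The same span.id twice within ONE trace gives identical keys, so it breaks
# the rule even though the two transaction IDs differ.
c = {"trace.id": "123", "transaction.id": "789", "span.id": "authorization"}
assert event_key(a) == event_key(c)
```

Under this reading, the two example documents from the original question (same trace ID, different transaction IDs, identical span IDs) would collide.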

Does the span id that the Load Balancer generated become the trace id that the authorization server uses?

No, all events related to the same original client request will share the same trace ID. The span ID corresponding to the Load Balancer's outgoing request to the Authorization Server will be used as the parent ID for the transaction recorded by the Authorization Server.

I'll illustrate what the events would look like with your example:

Client -> Load Balancer -> Authorization Server
              `----------> Application Server -> Database 1 -> Filesystem 1
                                    `----------> Database 2 -> Filesystem 2

Following your description, it sounds like Client is not instrumented so I'll generally ignore it. I'll assume the following components are instrumented: Load Balancer, Authorization Server, Application Server.

  1. Load Balancer receives a request, and starts a transaction (T1). Because the Client is uninstrumented, this is the root transaction, and a new trace is started.
  2. Load Balancer makes a request to Authorization Server
    i. The outgoing request is recorded as a span (T1_S1), which is a child of T1. T1_S1's span and trace IDs are injected into the HTTP request headers sent to Authorization Server.
    ii. Authorization Server receives the request, and reports it as a transaction (T2), which is a child of T1_S1.
  3. Load Balancer forwards the client's request to Application Server
    i. The request is recorded as a span (T1_S2). T1_S2's span and trace IDs are injected into the HTTP request headers sent to Application Server.
    ii. Application Server receives the request, and reports it as a transaction (T3), which is a child of T1_S2.
  4. Application Server performs database queries to Database 1 and Database 2. These are recorded as additional spans (T3_S1, T3_S2) which are children of T3. Databases are typically not instrumented.

So we end up with the following:

  • T1: {trace.id: "trace_uuid", transaction.id: "t1"}
    • T1_S1: {trace.id: "trace_uuid", transaction.id: "t1", span.id: "t1_s1", parent.id: "t1"}
      • T2: {trace.id: "trace_uuid", transaction.id: "t2", parent.id: "t1_s1"}
    • T1_S2: {trace.id: "trace_uuid", transaction.id: "t1", span.id: "t1_s2", parent.id: "t1"}
      • T3: {trace.id: "trace_uuid", transaction.id: "t3", parent.id: "t1_s2"}
        • T3_S1: {trace.id: "trace_uuid", transaction.id: "t3", span.id: "t3_s1", parent.id: "t3"}
        • T3_S2: {trace.id: "trace_uuid", transaction.id: "t3", span.id: "t3_s2", parent.id: "t3"}
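
To show how the parent IDs encode that hierarchy, here is a rough sketch that rebuilds the indented tree from the flat list of events above. The `name` key and the helper functions are hypothetical additions for illustration, and the code assumes all events belong to a single trace:

```python
from collections import defaultdict

# The seven events listed above; "name" is a label added only for display.
events = [
    {"name": "T1", "trace.id": "trace_uuid", "transaction.id": "t1"},
    {"name": "T1_S1", "trace.id": "trace_uuid", "transaction.id": "t1",
     "span.id": "t1_s1", "parent.id": "t1"},
    {"name": "T2", "trace.id": "trace_uuid", "transaction.id": "t2",
     "parent.id": "t1_s1"},
    {"name": "T1_S2", "trace.id": "trace_uuid", "transaction.id": "t1",
     "span.id": "t1_s2", "parent.id": "t1"},
    {"name": "T3", "trace.id": "trace_uuid", "transaction.id": "t3",
     "parent.id": "t1_s2"},
    {"name": "T3_S1", "trace.id": "trace_uuid", "transaction.id": "t3",
     "span.id": "t3_s1", "parent.id": "t3"},
    {"name": "T3_S2", "trace.id": "trace_uuid", "transaction.id": "t3",
     "span.id": "t3_s2", "parent.id": "t3"},
]


def own_id(event: dict) -> str:
    """A span is identified by span.id, a transaction by transaction.id."""
    return event.get("span.id") or event["transaction.id"]


def render_tree(events: list) -> list:
    """Return one indented line per event, children nested under parents."""
    children = defaultdict(list)
    for event in events:
        # The root transaction has no parent.id, so it is grouped under None.
        children[event.get("parent.id")].append(event)

    lines = []

    def walk(parent_id, depth):
        for event in children.get(parent_id, []):
            lines.append("  " * depth + event["name"])
            walk(own_id(event), depth + 1)

    walk(None, 0)
    return lines


assert render_tree(events) == [
    "T1",
    "  T1_S1",
    "    T2",
    "  T1_S2",
    "    T3",
    "      T3_S1",
    "      T3_S2",
]
```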

So I'll try to summarise:

  • All events that were caused by the original request should share the same trace ID
  • An incoming request will be reported as a transaction.
    • If it is the first (client-facing) service then it will also start a new trace.
    • If the requestor is also instrumented, and there is a corresponding outgoing span, then the transaction's parent ID should be set to that span's ID.
  • An outgoing request will be reported as a span.
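
These rules can be sketched roughly as follows. The thread only says the IDs travel in "HTTP request headers"; this sketch assumes the W3C Trace Context `traceparent` layout (`version-traceid-parentid-flags`) and does no validation. The function names are hypothetical and do not come from any APM agent:

```python
import secrets


def new_id(nbytes: int) -> str:
    """Random lowercase hex ID (16 bytes for a trace, 8 for a span)."""
    return secrets.token_hex(nbytes)


def start_transaction(headers: dict) -> dict:
    """Record an incoming request as a transaction."""
    header = headers.get("traceparent")
    if header is None:
        # No instrumented caller: root transaction, so start a new trace.
        return {"trace.id": new_id(16), "transaction.id": new_id(8)}
    # Continue the caller's trace; the caller's span becomes the parent.
    _version, trace_id, parent_id, _flags = header.split("-")
    return {
        "trace.id": trace_id,
        "transaction.id": new_id(8),
        "parent.id": parent_id,
    }


def start_span(transaction: dict) -> dict:
    """Record an outgoing request as a span that is a child of `transaction`."""
    return {
        "trace.id": transaction["trace.id"],
        "transaction.id": transaction["transaction.id"],
        "span.id": new_id(8),
        "parent.id": transaction["transaction.id"],
    }


def inject(span: dict) -> dict:
    """Build the header that carries the trace ID and the calling span's ID."""
    return {"traceparent": f"00-{span['trace.id']}-{span['span.id']}-01"}


# Load Balancer: uninstrumented client, so T1 starts a new trace.
t1 = start_transaction({})
# Outgoing call to the Authorization Server is recorded as span T1_S1.
t1_s1 = start_span(t1)
# Authorization Server: T2 shares the trace and has T1_S1 as its parent.
t2 = start_transaction(inject(t1_s1))
assert t2["trace.id"] == t1["trace.id"]
assert t2["parent.id"] == t1_s1["span.id"]
```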

@ebeahan
Member

ebeahan commented Oct 22, 2020

Thanks @axw for jumping in and helping provide clarity!

https://www.elastic.co/guide/en/apm/get-started/current/transaction-spans.html suggests that there are supposed to be transaction.id and parent.id attributes, but they're not present in the schema.

transaction.id is already defined under the tracing fieldset, but you're correct that parent.id currently is not.

@axw - parent.id isn't currently one of the defined tracing fields. Is adding it into ECS something to consider?

@axw
Member

axw commented Oct 26, 2020

Initially at least the goal of the tracing section in ECS was to explain correlation across types of data (namely traces and logs). For correlating traces and logs you would be using trace.id, transaction.id, and span.id -- but parent.id is unlikely.

For a somewhat more comprehensive explanation of the fields, https://www.elastic.co/guide/en/apm/get-started/current/transaction-spans.html is the source of truth.

I'm not against adding more of the tracing fields to ECS, but I'd want to know what problem we're solving with that.

@webmat
Contributor

webmat commented Oct 29, 2020

Thanks for the detailed answer, @axw!

I'd want to know what problem we're solving with that

A good question. Actually I'd like to understand how users can leverage these fields in custom data sources. Since @felixbarny contributed the first two fields here, perhaps he can chime in as well.

Were these fields added purely for documentation purposes of important APM fields? Or is it possible e.g. for someone to carry on with tagging events with these identifiers in custom sources for them to bubble up in APM to help complete a trace?

If it's possible for users to do so, are there APM docs we could link to from here, to help explain the process and the possibilities?

@axw
Member

axw commented Oct 30, 2020

Were these fields added purely for documentation purposes of important APM fields? Or is it possible e.g. for someone to carry on with tagging events with these identifiers in custom sources for them to bubble up in APM to help complete a trace?

The fields we have here were added to explain how to enable trace/log correlation. i.e. if you want to correlate logs with traces, then you should include trace.id and transaction.id or span.id. For purposes of log correlation I don't think parent.id is useful.
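
As a rough illustration of that kind of log correlation without an agent, here is a sketch using Python's standard `logging` module. `TraceContextFilter`, `EcsJsonFormatter`, and `current_ids` are hypothetical names; a real application would read the IDs from whatever tracing context it has. Emitting flat dotted keys is a simplification, not a complete ECS log document:

```python
import json
import logging
from datetime import datetime, timezone


class TraceContextFilter(logging.Filter):
    """Attach the current trace IDs to every record (hypothetical helper)."""

    def __init__(self, get_context):
        super().__init__()
        self._get_context = get_context  # callable returning a dict of IDs

    def filter(self, record):
        record.trace_context = self._get_context()
        return True  # never drop the record, only annotate it


class EcsJsonFormatter(logging.Formatter):
    """Render each record as one JSON line that includes the tracing fields."""

    def format(self, record):
        doc = {
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "log.level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        doc.update(getattr(record, "trace_context", {}))
        return json.dumps(doc)


def current_ids() -> dict:
    # Assumption: fixed IDs stand in for a lookup of the active transaction.
    return {"trace.id": "trace_uuid", "transaction.id": "t3"}


handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter(current_ids))
handler.setFormatter(EcsJsonFormatter())

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("querying Database 1")
```

Each emitted line then carries trace.id and transaction.id alongside the message, which is what the trace/log pivot relies on.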

If it's possible for users to do so, are there APM docs we could link to from here, to help explain the process and the possibilities?

We document all fields in https://www.elastic.co/guide/en/apm/server/current/exported-fields.html, however the documentation isn't likely to be very helpful for implementing an agent. For non-agent custom data sources which produce Elasticsearch docs directly, we don't have a good reference guide; but these are also exceedingly rare.

For developing an agent for Elastic APM the fields aren't particularly relevant. What matters most is the protocol between the agent and APM Server. For developing agents we have https://github.com/elastic/apm/tree/master/specs/agents, which references https://www.elastic.co/guide/en/apm/server/current/intake-api.html for the protocol.

@ebeahan
Member

ebeahan commented Oct 30, 2020

The fields we have here were added to explain how to enable trace/log correlation. i.e. if you want to correlate logs with traces, then you should include trace.id and transaction.id or span.id.

If there's collective agreement, I'd like to adjust the current description of the tracing.* fieldset to better capture this intent.

@webmat
Contributor

webmat commented Oct 30, 2020

Thanks for the details @axw.

For non-agent custom data sources which produce Elasticsearch docs directly, we don't have a good reference guide; but these are also exceedingly rare.

This is very helpful context. If it's not a first class workflow, we can say so.

@axw
Member

axw commented Nov 2, 2020

If there's collective agreement, I'd like to adjust the current description of the tracing.* fieldset to better capture this intent.

@ebeahan sounds good. Do you have something in mind already? Maybe some kind of disclaimer at the top that this documentation is intended for log correlation, with pointers to the other docs for other use cases?

@webmat
Contributor

webmat commented Nov 2, 2020

We actually have some flexibility to document this in ECS, if needed.

Since #988, a given field set can have a whole free form documentation page that accompanies it. So we can do more than just a quick warning. We could go into some details in free form asciidoc, and yes of course defer and link to APM docs as well (we don't want to repeat everything).

If you want to see an example, this draft PR #1066 is the first to add such a docs page, in this case to the "user" fields. You can see the "usage" subsection in the sidebar, and there's also a call out to it at the top of the normal "user page".

Usage section in sidebar

Normal user page (PR)

Usage docs for user (PR)

@webmat
Contributor

webmat commented Nov 3, 2020

@axw Just went through this issue again, and I'd like to bring this to something actionable.

The main purpose of having these fields in ECS is to help folks correlate logs around an APM-instrumented app with the events generated by APM. The simple case here is simply tagging raw logs of the main app (e.g. customizing Rails logs) with these 3 fields.

  • If users have the ability to pass along these IDs to subsystems like database or another microservice, could / should they tag these logs the same way as well?
  • This is strictly for correlation and pivoting between APM and related logs via the Logs UI, correct? These logs won't appear inside APM in any way?
  • In order to instrument their logs, users will need to grab these IDs somewhere / somehow. Can you point me to docs for that? I assume one simply calls a method on the APM agent of their language of choice?

If users are looking to build an APM agent, that's a whole different endeavour, and we will 100% defer to the documentation you linked to above.

Should users feel free to use these fields when doing distributed tracing ad hoc, without Elastic APM?

@axw
Member

axw commented Nov 4, 2020

If users have the ability to pass along these IDs to subsystems like database or another microservice, could / should they tag these logs the same way as well?

You could, but the more common thing to do here would be to use distributed tracing to continue a trace in the downstream microservice/subsystem. Traces go between services, and correlation then happens between the traces and logs of the same service.

This is strictly for correlation and pivoting between APM and related logs via the Logs UI, correct?

Yes.

These logs won't appear inside APM in any way?

Not currently. In the future we intend to have an embedded Logs viewer right in the APM UI: elastic/kibana#79995

In order to instrument their logs, users will need to grab these IDs somewhere / somehow. Can you point me to docs for that? I assume one simply calls a method on the APM agent of their language of choice?

How it works depends on the agent. Each APM agent has its own "Log correlation" page.

The Java agent provides a config variable to inject the IDs, and then if you use an ECS logger they'll get added to your log records automatically: https://www.elastic.co/guide/en/apm/agent/java/current/log-correlation.html#log-correlation-enable

The Go agent provides integrations with several popular logging libraries to grab trace IDs from the current trace context: https://www.elastic.co/guide/en/apm/agent/go/current/supported-tech.html#supported-tech-logging

Should users feel free to use these fields when doing distributed tracing ad hoc, without Elastic APM?

There's no harm in that, but no great benefit either. It wouldn't be enough to have the traces show up in the APM app in Kibana.

@sgryphon

For purposes of log correlation I don't think parent.id is useful.

I'd like to see parent.id, or probably parentSpan.id could be a clearer name, added to ECS.

The values for trace.id, span.id, and parent.id allow, in decreasing order of importance, correlation of all log messages within one logical operation (trace.id), the log messages within one subsection of the operation (span.id), and show the hierarchical parent-child relationship between the subsections (parent.id).

Where you have another source of information recording the parent-child relationship, then the data contained in parent.id becomes redundant... in fact if you log multiple messages the values are redundant (a particular span.id always has the same parent.id).

However, it then means you need to merge in that other data set in order to have the needed information. Having the parent.id in each record means you can determine parent-child relationships without any other data, without full APM, with only a subset of records, etc. It allows logging to be stand alone.

The same is true of many other fields within ECS. For example, the information about operating system, server, user, etc. is repeated in every record, even though it stays the same within a session.

@ebeahan ebeahan added ready Issues we'd like to address in the future. and removed review bug Something isn't working labels Nov 17, 2020
@webmat webmat added the 1.9.0 label Nov 17, 2020
@ebeahan
Member

ebeahan commented Feb 1, 2021

Let's revive this conversation. 😄

@sgryphon has kindly opened #1128 to work on expanding the tracing.* docs as well as #1142, a proposal to add parent.id. I'd like to close out the discussion here with a decision on parent.id: do we see it as a possible addition to ECS at this time?

@axw @felixbarny any feedback on @sgryphon's latest thoughts around adding a parent.id field to the tracing fieldset?

@axw
Member

axw commented Feb 2, 2021

I have not changed my opinion since #998 (comment).

The tracing fields in ECS were never intended to be complete for describing or reconstructing a distributed trace -- they were intended for trace/log correlation. We can change that of course, but I don't see a compelling reason to do so. I would prefer to cover them in the Elastic APM docs.

@sgryphon

sgryphon commented Feb 3, 2021

@axw one scenario would be using Elasticsearch for logging without Elastic APM, including supporting just W3C Trace Context, which has slightly different semantics (no separate transaction).

Working on one of the clients/implementations (dotnet), it seems strange to map the trace and span parts, yet not map the parent. This is not using APM, just using the Trace Context support in .NET.

There is also clearly some support, with at least one other user @alankis commenting on the PR.

It is not the end of the world; in most cases all you need is traceid, and you can infer spanid/parentid from the source systems, topology knowledge, and temporal links, i.e. system B is only called from A, so we know A is the parent of B within a trace. The additional information from parent is only relevant in highly complex systems.

What a decision would determine is whether I add it to the client as "parent.id", for forward compatibility, or as "Parent.id", a custom field.

@axw
Member

axw commented Feb 3, 2021

@sgryphon Thanks for the clarification. Although I don't see parent.id changing any time soon, I'd prefer not to set it in stone by adding it to ECS just now. That's why I'd rather it was kept in the Elastic APM docs, as an implementation detail. Your earlier comment captures my sentiment:

... parent.id, or probably parentSpan.id could be a clearer name ...

i.e. I'm not a big fan of the current name. parent.id is a bit generic, and I'd prefer it was more obviously related to tracing.

I'd suggest adding this as a custom field which just happens to match what Elastic APM uses.


5 participants