Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: SDK Extension AWS Gem #386

Closed
wants to merge 3 commits into from

Conversation

dlahn
Copy link
Contributor

@dlahn dlahn commented Mar 17, 2023

This implements #33. This adds a gem opentelemetry-sdk-extension-aws, which allows for XRay ID generation. This lays the groundwork to add resource detectors for AWS resources.

This code is mostly moved from the opentelemetry-propagator-xray gem.

This implements open-telemetry#33. This adds a gem opentelemetry-sdk-extension-aws, which allows for XRay ID generation. This lays the groundwork to add resource detectors for AWS resources.

This code is mostly moved from the opentelemetry-propagator-xray gem.
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Mar 17, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: dlahn / name: Dave Lahn (a13e4cd)

@arielvalentin
Copy link
Collaborator

@dlahn Thank you for your submission! We have some logistical things we need to figure out before continuing.

  1. @open-telemetry/ruby-contrib-maintainers has limited capacity when it come to maintaining AWS related features so we ask for your patience as we will need time to learn more about XRay IDs to provide you with a thorough review.
  2. Since we do not have this AWS expertise, would you be willing to take on a role as a collaborator and maintain this component as well as perform code reviews as needed?
  3. Have you tested this component in a production setting? Would you be amenable to providing screen shots and documentation for opentelemetry.io to help others understand how to leverage this?
  4. It looks like the test suite and release are not incorporated into GitHub Actions. In order for us to accept this submission we will need you to update the configurations to ensure these changes are tested as part of CI and releasable.

Thank you!

Copy link
Contributor

@ahayworth ahayworth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 Thanks for this! A few small things before we smash the merge button:

  • I double-checked and yes indeed - this code is duplicated from the x-ray propagator gem. Is your intention to submit a follow-on PR to have the propagator depend on this gem and use the generator here? That would be my preference, so we don't have a duplicate implementation hanging around forever.
  • I left some minor nit-picks in the test file. Since we're touching it, I thought some spacing might be nice. 😄

On a broader note, I think moving the id generator to an extension for the reasons outlined in #33 is a good idea! But, it isn't technically a prerequisite for a resource detector (we could always throw it in with the other ones). I'm not sure if we really should keep the AWS resource detector here but the other detectors there, you know? We should do one or the other, and be consistent. That's probably a bigger discussion for @open-telemetry/ruby-contrib-maintainers and @open-telemetry/ruby-contrib-approvers sometime. We haven't really had many resource detectors until now, so I don't think it's really come up much.

Anyways - overall, this looks good to me! I appreciate you tackling this, and laying the groundwork for more things in the future! 😄 ❤️

(edit: whoops, @arielvalentin is right - this isn't getting tested. Please to add it to the configs. 😄 )

end
# Make sure it's valid
_(date).wont_be_nil
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
end
end

it 'must generate 16 in length' do
trace_id = id_generator.generate_trace_id
_(trace_id.length).must_equal(16)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
end
end

@ahayworth
Copy link
Contributor

I double-checked and yes indeed - this code is duplicated from the x-ray propagator gem. Is your intention to submit a follow-on PR to have the propagator depend on this gem and use the generator here? That would be my preference, so we don't have a duplicate implementation hanging around forever.

Actually @dlahn - now that I'm looking a little more - it's not really clear to me when one needs to use the id generator. What purpose does it have outside of the propagator? I'll be honest in that I have virtually no experience using XRay, so if you wouldn't mind I would certainly appreciate knowing more about when you'd want to use the id generator and not the propagator, or vice-versa...

@robbkidd
Copy link
Member

I tried finding the back story on this from the links in #33, but I'm still not clear on the problem being solved.

From that issue:

[...] the X-Ray ID Generator lives inside the opentelemetry-propagator-xray gem [...]
Lambda Instrumentations should not be taking the AWS X-Ray ID Generator as a dependency, since it is not required. [...]

I've followed the summary about propagation being mentioned at the API-level and pluggable ID generators being mentioned at the SDK-level. What operational problem in Lambda is caused by both propagation and ID generation being supplied by a single gem whose only runtime dependency is on the API library?

@dlahn
Copy link
Contributor Author

dlahn commented Mar 18, 2023

@dlahn Thank you for your submission! We have some logistical things we need to figure out before continuing.

  1. @open-telemetry/ruby-contrib-maintainers has limited capacity when it come to maintaining AWS related features so we ask for your patience as we will need time to learn more about XRay IDs to provide you with a thorough review.
  2. Since we do not have this AWS expertise, would you be willing to take on a role as a collaborator and maintain this component as well as perform code reviews as needed?
  3. Have you tested this component in a production setting? Would you be amenable to providing screen shots and documentation for opentelemetry.io to help others understand how to leverage this?
  4. It looks like the test suite and release are not incorporated into GitHub Actions. In order for us to accept this submission we will need you to update the configurations to ensure these changes are tested as part of CI and releasable.

Thank you!

Thanks for the reply!

  1. Of course!
  2. Happy to be involved.
  3. Not as of yet, however, this is mostly a refactor from the existing code in the XRay propagator as per X-Ray ID Generator should be split from X-Ray Propagator Package #33 and Add Missing AWS Resource Detectors #34.
  4. I will do this if it is determined it is best to go ahead with this PR.

@ahayworth
Copy link
Contributor

(The rack test that is failing is unrelated to this PR; see #387 and #388)

@dlahn
Copy link
Contributor Author

dlahn commented Mar 18, 2023

@arielvalentin @ahayworth @robbkidd From reading #33 and #34, it seemed like a decision was made to follow the SDK extension pattern as some other languages are doing, and have the ID generator and resource detectors in that AWS SDK Extension.

As @ahayworth brought up, I can't really think of a reason to generate XRay IDs if you aren't going to use the propagator, except that the AwsSdk instrumentation seems to implement its own methods for handling this, so you wouldn't need the XRay propagator. We don't actually use AWS XRay, but XRay IDs are a part of the opentelemetry spec for Lambda.

If this is the way forward, then this could eventually be removed from the XRay propagator gem.

However, if the consensus is that it is best to keep them together in the propagator, then perhaps we should close #33, and I am happy to start looking at #34 and have it live in resource_detectors.

Let me know your thoughts and we can go from there!

@robbkidd
Copy link
Member

robbkidd commented Mar 18, 2023

🤔 Here's what I'm thinking at the moment. I don't know if this is the way forward you mentioned or what consensus might be gathering around, but here goes …

AWS X-Ray Propagation & ID Generation

X-Ray propagation should (and currently does) live in a single, named gem. Propagation is spec'd at the API level, so this gem should (and currently does) only depend on the API. Because generating X-Ray IDs is something that only matters to propagation, the ID generator should (and currently does) live in the propagation gem.

The vanilla OTel Ruby SDK provides the mechanism—as required by spec and with the optional "allowing custom implementations of an interface"—for a tracer provider to accept an ID generator implementation for its tracers to use for generating trace and span IDs. The spec does not state that an ID generator implementation has any direct dependencies on an SDK, only that an ID generator might need to come in a particular shape (interface/duck type/whatevs) to be used by an SDK.

AWS Lambda Resource Detection

With Resources spec'd at the SDK-level, SDKs own the base implementation for populating resource attributes. So, resource detector libraries have runtime dependencies on an SDK.

Depend on API or Depend on SDK

With a goal of "just one library to support OTel use in AWS", combining X-Ray propagation and AWS infra/compute resource detection results in a library that depends on the SDK (which depends on the API).

With a goal of "minimize required runtime dependencies when using only some but not all discrete features of OTel in AWS", X-Ray propagation and resource detection for the various AWS infra/compute services would call for separate gems.

And there's a Why Not Both? option of separate propagation and resource detection gems and then a third library—empty of any implementation—that exists only to require them both (and any others that come along).

@blumamir
Copy link
Member

blumamir commented Mar 20, 2023

I was trying to understand the motivation for this PR and I guess there are some black spots in my understanding how x-ray works

X-Ray service

It seems to me there is the x-ray service that is able to ingest spans, for which your PR is referencing here. In this document, it looks like the x-ray receiver excepect for the format of the trace id to be

A trace_id consists of three numbers separated by hyphens. For example, 1-58406520-a006649127e371903a2de979. This includes:

Trace ID Format
The version number, for instance, 1.

The time of the original request, in Unix epoch time, in 8 hexadecimal digits. For example, 10:00AM December 2nd, 2016 PST in epoch time is 1480615200 seconds, or 58406520 in hexadecimal.

A 96-bit identifier for the trace, globally unique, in 24 hexadecimal digits.

This is regardless of the propagation. One could instrument a service and export traces to x-ray collector but not use any of AWS APIs thus no need for a propagator.

So I guess my questions are:

  • Is it mandatory for anyone using x-ray as a backend to generate trace id in the x-ray format?
  • How is it expected to work when the trace ids are generated in a different place which doesn't use x-ray trace id format, and propagated to a service via other propagators like w3c? The trace id is determined by the first span being created and it might decide to generate incompatible trace ids, right? for which downstream otel SDKs has no control either way.
  • Is it documented when and why exactly the x-ray id generator should be set up by users?
  • Since both x-ray and default id generators create 32 hex digits (16 bytes) of random trace id, It seems to not be breaking anything using it instead of the default one. My question is what will not work if the default generator is used and serialized into x-ray trace id (which does not honor the 4 bytes timestamp porotcol)? how will x-ray behave if the first 4 bytes are random numbers? If it's not compatible with x-ray, I assume that the user must configure the entire system to use the x-ray id generators (regardless of propagators)?

X-Ray Propagator

From the current code base, it seems to me that the x-ray propagator is not even using the id generator, they are just bundled in the same package since they are both in the x-ray scope, but do not have any dependencies or relations to one another.

It seems to me that the x-ray propagator has inferred assumptions that the trace id is 32 hex digits (16 bytes), which is compatible with default and x-ray id generators, so the propagator can be used with the default id generator at the protocol level at least. However, is it guaranteed by the SDK? could someone craft a custom id generator that will be serialized into invalid x-ray trace ids?

extract:

        # Convert an id from a hex encoded string to byte array. Assumes the input id has already been
        # validated to be 35 characters in length.
        def to_trace_id(hex_id)
          Array(hex_id[2..9] + hex_id[11..hex_id.length]).pack('H*')
        end

inject

          xray_trace_id = "1-#{ot_trace_id[0..7]}-#{ot_trace_id[8..ot_trace_id.length]}"
          parent_id = span_context.hex_span_id

          xray_value = "Root=#{xray_trace_id};Parent=#{parent_id};Sampled=#{sampling_state}"

          setter.set(carrier, XRAY_CONTEXT_KEY, xray_value)

And the same as the x-ray service - what will happen if a user propagates a trace id in the x-ray format, but for which the first 4 bytes are random and not timestamp? Will it be propagated "as-is" by AWS, or will it confuses it and cause something not to work correctly?

AWS Resource detector

I support moving the id generator to a separate gem, but I don't think the resource detector should live in this gem. In JS at least you can install individual detectors as you like regardless of any other SDK component, and I think they should not be bundled with id generators same as it might not makes sense to bind id generator and propagator.

A user can run a service in AWS and use any trace id format with or without x-ray propagation but still desire AWS resource attributes. A user can also choose to use x-ray formats and propagation but have no interest in the resource (or decide to drop it due to costs).

Not sure thought what is the conventions in ruby, since I see all the detectors are currently bundled into the same gem, in which case shouldn't AWS detector live there too?

Separate Packages

Given all the above, I think it semantically makes sense to separate the id generator from the propagator, as they might often be used together but not necessarily. If someone wants to use x-ray ids but has no need in the propagator (for example if exporting the spans to x-ray) it might feel incorrect to have a dependency on "opentelemetry-propagator-xray" just to use the id generator.

I have to admit the pieces of the puzzle still don't connect well in my head. Would love to understand more about how and when a user is expected to use these components to craft a working otel setup, but I support this PR regardless. I do think it will be valuable to document as much as we know in the README of the components, so users don't need to guess or research how and when to use them.

Sorry if some of my points are trivial and I am just missing something obvious.

@blumamir
Copy link
Member

  • I double-checked and yes indeed - this code is duplicated from the x-ray propagator gem. Is your intention to submit a follow-on PR to have the propagator depend on this gem and use the generator here? That would be my preference, so we don't have a duplicate implementation hanging around forever.

To me, it makes sense that the id generator implementation is removed from the propagator package, and consumed from its own gem if needed. The propagator does not use it anyhow.
But it could break users if they are currently requiring it from "opentelemetry-propagator-xray" and it will be moved

@dlahn
Copy link
Contributor Author

dlahn commented Mar 20, 2023

@blumamir

Is it mandatory for anyone using x-ray as a backend to generate trace id in the x-ray format?
Yes. https://docs.aws.amazon.com/xray/latest/devguide/xray-api-sendingdata.html#xray-api-traceids

How is it expected to work when the trace ids are generated in a different place which doesn't use x-ray trace id format, and propagated to a service via other propagators like w3c? The trace id is determined by the first span being created and it might decide to generate incompatible trace ids, right? for which downstream otel SDKs has no control either way.
That is a valid concern. You wouldn't want to use the ID generator in this case.

Is it documented when and why exactly the x-ray id generator should be set up by users?
If a consensus is reached how to proceed, I can add additional documentation.

From reading #33 and #34, it seemed like the SIG was leaning toward the SDK extension pattern as some other languages are doing (more about this in #33), and have the ID generator and resource detectors in that AWS SDK Extension.

Since the OpenTelemetry Specification for Lambda (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/instrumentation/aws-lambda.md#aws-x-ray-environment-span-link) wants XRay IDs used, there are certainly going to be cases where XRay IDs are used, but the XRay service is not. This is our case in production.

If the SDK Extension for AWS is not the way forward (I appreciate things may have changed), then I would like to proceed with #34 and have it live in the resource_detectors gem.

@blumamir
Copy link
Member

blumamir commented Mar 20, 2023

If the SDK Extension for AWS is not the way forward (I appreciate things may have changed), then I would like to proceed with #34 and have it live in the resource_detectors gem.

I am not involved with ruby codebase maintenance and was not following the original discussions, so I guess it's a decision for @open-telemetry/ruby-contrib-maintainers to take. My personal preference which follows the JS structure is to have a separate gem for each component, e.g. "opentelemetry-sdk-extension-aws", "opentelemetry-resource-detector-aws", and "opentelemetry-propagator-xray". This will allow users to pull in what they need, understand what they depend on, and have a consistent mental model with other SDKs and possibly inside the ruby otel ecosystem as well (e.g. propagators come from opentelemetry-propagator-foo, resource detectors comes from opentelemetry-resource-detector-foo and id generators comes from opentelemetry-sdk-extension-foo. Hosting id generator inside a propagator is not mentally easy to follow (though possible with small effort). It will also make it easy to extend in the future. e.g. if someone implements or require a custom resource detector / propagator / id generator which is not in contrib repo, they can implement it as a separate gem and consume it in a consistent manner like contrib components are consumed.

Since users consume "propagator" detectors based on a component:

spec.add_dependency "opentelemetry-propagator-xray", "1.2.3"
spec.add_dependency "opentelemetry-propagator-ottrace", "1.2.3"
spec.add_dependency "opentelemetry-propagator-b3", "1.2.3"
spec.add_dependency "opentelemetry-propagator-jaeger", "1.2.3"

I would expect the same pattern for other components as well, e.g.

spec.add_dependency "opentelemetry-resource-detector-aws", "1.2.3"
spec.add_dependency "opentelemetry-resource-detector-service", "1.2.3"
spec.add_dependency "opentelemetry-resource-detector-foo", "1.2.3"

spec.add_dependency "opentelemetry-sdk-extention-baz", "1.2.3"

But actually, my main focus (which might be unrelated directly to the changes you suggest here) is to understand in what cases should each component be used by end users. Nevertheless, I support the changes in this PR and also adding a resource detector for AWS in whatever place found fit for it.

To sum up what I understand currently (which is based on my research and not proven knowledge or background):

  • AWS resource detector can be used for any service running in AWS cloud regardless of x-ray
  • AWS propagator should be used only when injecting and extracting context when communicating with AWS services apis. This is expected to happen in aws-sdk instrumentation, and when manually interacting with the API, but should rarely be used by end users
  • x-ray id generator should be used across the system if a span from this trace is expected to reach x-ray service. This can happen when exporting from a service to x-ray backend, or when AWS services themselves are instrumented and report to x-ray with this trace id

I can imagine codebases that may need any subset of the above components, thus I support serving them from different gems.

@robbkidd
Copy link
Member

robbkidd commented Mar 20, 2023

I don't yet understand what concrete problem is solved by moving the X-Ray trace-and-span ID generator to a separate gem. More gems are more maintenance. The more dependency compatibilities that need to be negotiated during dep solving, the more complicated testing and usage becomes.

I have re-read #33 and the supporting statements for a standalone X-Ray ID generator there are:

[X-Ray prop and ID gen] is a problem for projects that want to install either the Propagator or the ID Generator, but not the other.

Why do projects want to install one without the other?

Lambda Instrumentations should not be taking the AWS X-Ray ID Generator a dependency, since it is not required.

I disagree with this conclusion, but that could be because I am mistakenly thinking that X-Ray tracing is more like OpenTelemetry tracing than it really is.

  • An instrumented Lambda service is taking a dependency on X-Ray propagation.
  • Full support for propagating any trace format requires the ability to generate the format's trace ID properly for when a service starts a new trace.
  • An instrumented Lambda service requires an ID generator if any behavior within the service starts a new trace. This might be rare, but I do not believe it is never.

The more I think about this situation, the more I think that all Lambda services participating in X-Ray traces ought to have an X-Ray ID generator set in their OpenTelemetry configuration—not automatically by the library, but conveyed through documentation of an OTel config for Lambda where the user sets the recommended ID generator—so that when the service developers come to a point where they add behavior that they believe is best represented by a new trace—maybe with a span_link back to a span in the original trace—the instrumentation configuration already supports their case as opposed to the developers needing to remember to add the ID generator to the configuration and add the X-Ray ID generator gem.

@blumamir
Copy link
Member

Let's imagine a ruby service that exports to x-ray. According to the docs, it must use x-ray id generator, but not necessarily the propagator (only if it communicates with AWS API which is not always the case).

This service will need to depend on opentelemetry-propagator-xray but it does not need the propagator, only the id generator. The person who is setting up otel will look for the id generator and will have to find it in the propagator gem which is not immediate.

Someone else might see this dependency and assumes x-ray propagation is taken place which is misleading.

Having said that, I might be a bit biased since I work with the JS otel components and find it intuitive and consistent. I don't have any ruby experience and am not familiar with the conventions and practices. Your argument for complexity sounds reasonable. In this case, did it make sense to separate propagators into individual gems and not having one gem opentelemetry-propagators for all of them? I tend to favor consistency but that's just my personal opinion.

@robbkidd
Copy link
Member

This service will need to depend on opentelemetry-propagator-xray but it does not need the propagator, only the id generator. The person who is setting up otel will look for the id generator and will have to find it in the propagator gem which is not immediate.

OK! I see the separation in this direction. Some services need the ID generator, but not the propagator.
So one of the problems that prop/id-gen separation could address is "discoverability of the X-Ray ID generator".

My read of #33 has it proposing the separation from the other direction: a service using the propagator, but not the ID generator. The bulk of my opinion against separation so far is that a propagation implementation requires an ID generator to support starting new traces. If separated, I'm currently thinking propagation would depend on the standalone ID generator.

@blumamir
Copy link
Member

blumamir commented Mar 20, 2023

My read of #33 has it proposing the separation from the other direction: a service using the propagator, but not the ID generator.

I think it is very possible. One can run a ruby service that propagates context across AWS services, but reports to some other open-telemetry compatible vendor that can receive any trace id. In this case, there is no need to consume the x-ray id generator as a dependency or in SDK setup code (though I don't think it will do any harm aside from unneeded complicated setup code). An end user will only need the propagator (most of the time as a dependency of the aws-sdk instrumentation and not directly).

I even think that most codebases just need to register aws-sdk instrumentation and that's it, the propagation should be handled by instrumentation alone (unless they export spans to x-ray collector in which case they also need to register the id generator to SDK with or without aws-sdk instrumentation / x-ray propagator)

My motto is to assume as less as possible about how end users will want to use components, and just expose them in an easy-to-follow pattern and let users install and use them as they see fit for their setup and needs.

@fbogsany
Copy link
Contributor

TL;DR: I think that if you're exporting to X-Ray, then you have to generate X-Ray IDs, and if you're using AWS services and you want spans from them as well (I believe some generate server-side spans), then you have to propagate in the X-Ray format. If you don't care for AWS service spans and you don't export to X-Ray, then you don't need X-Ray ID generation or propagation. Based on this, I think X-Ray propagation is only needed if you're exporting to X-Ray, and therefore it depends on X-Ray ID generation, but X-Ray ID generation can be used in isolation (if you're exporting to X-Ray but not using AWS services).

The primary unknown for me is: do AWS services provide an option to export spans from that service to some other vendor?

I may contradict myself below - the rest of this was 🤔 while 👆 is the output.

Enumerating use cases:

runs in AWS uses AWS services exports to X-Ray need X-Ray prop need X-Ray ids
yes yes yes yes yes
yes yes no yes maybe
yes no yes no yes
no yes yes yes yes
yes no no no no
no yes no yes maybe
no no yes no yes
no no no no no

I think the "maybes" are probably "no". It is certainly true that if you are exporting to X-Ray, you have to use X-Ray ID generation. X-Ray uses the timestamp portion to drop spans older than 10 minutes (which implies you can't export a trace longer than 10 minutes without losing some spans, but I digress). I believe some AWS services will export spans to X-Ray, but I don't think they do the same timestamp check - X-Ray will simply drop spans with old timestamps - so if you're using AWS services but don't care for their spans (i.e. you're purely a client and your requests don't transit an AWS service, and you're not using X-Ray) then that's equivalent the "don't use AWS services" case.

@robbkidd
Copy link
Member

... the propagation should be handled by instrumentation alone ...

Instrumentation that propagates all defer to the propagators present and configured. This is mostly in the instrumentation for the various HTTP client libraries and request handlers, HTTP being the most well-trod medium to propagate over between different services. So, while instrumentation has the appearance of handling propagation, they are really delegating the details of format to a propagator.

@blumamir
Copy link
Member

I am not sure about it, but I think that AWS can be configured to propagate the context without creating spans (and exporting them to x-ray by default), in which case "maybe"s might be a no?

This is the case where a propagator is needed while the id generator is not.

I know AWS had a few issues around this feature (where it was not properly honored and broke a trace since the AWS internal span was missing). Not sure if they are fixed by now or not.

The primary unknown for me is: do AWS services provide an option to export spans from that service to some other vendor?

last time I checked on oct 2022 the answer was:

Exporting spans from X-Ray in OTLP format is a huge ask we have but it is quite an undertaking. Definitely on the roadmap, but no firm ETA from our end

@blumamir
Copy link
Member

... the propagation should be handled by instrumentation alone ...

Instrumentation that propagates all defer to the propagators present and configured. This is mostly in the instrumentation for the various HTTP client libraries and request handlers, HTTP being the most well-trod medium to propagate over between different services. So, while instrumentation has the appearance of handling propagation, they are really delegating the details of format to a propagator.

But the x-ray propagator is no good for any other injection other than AWS services, right? There is no need to inject X-Amzn-Trace-Id key into kafka headers or try extract it from sidekiq. Users might expect aws propagation to be part of aws-sdk instrumentation regardless of them installing propagator separately. This is what I've seen other SIGs doing, for example: java aws-sdk instrumentation has direct dependency on "io.opentelemetry.contrib:opentelemetry-aws-xray-propagator".

@dlahn
Copy link
Contributor Author

dlahn commented Mar 20, 2023

Users might expect aws propagation to be part of aws-sdk instrumentation regardless of them installing propagator separately.

This is the case for us. The XRay propagator isn't strictly necessary, as the AwsSdk instrumentation provides these helpers.

@blumamir
Copy link
Member

blumamir commented Mar 20, 2023

I think X-Ray propagation is only needed if you're exporting to X-Ray

@fbogsany I don't agree with that. If you send a message to SQS which AWS forward to SNS, then in order to have e2e observability across this call you must inject x-ray headers with x-ray propagator. Other headers like b3/w3c are not guaranteed to be propagated via AWS services. So a user that needs a complete trace that involves AWS intermediary must use x-ray propagator, but there is no obligation to export any of this to x-ray backend.

edit:
What I mean is that the producers and consumers are connected to each other, with no instrumentation of the intermediary, but the context still needs to be propagated correctly by the intermediary and that is only guaranteed with X-Amzn-Trace-Id

@fbogsany
Copy link
Contributor

Exporting spans from X-Ray in OTLP format is a huge ask we have but it is quite an undertaking. Definitely on the roadmap, but no firm ETA from our end

This isn't the same thing as the AWS service allows exporting to another vendor. If something has to be ingested first to X-Ray and then exported, then it is subject to the same restrictions as "exporting to X-Ray", and needs to have X-Ray conforming trace IDs (with the encoded timestamp within the past 10 minutes).

@blumamir
Copy link
Member

Exporting spans from X-Ray in OTLP format is a huge ask we have but it is quite an undertaking. Definitely on the roadmap, but no firm ETA from our end

This isn't the same thing as the AWS service allows exporting to another vendor. If something has to be ingested first to X-Ray and then exported, then it is subject to the same restrictions as "exporting to X-Ray", and needs to have X-Ray conforming trace IDs (with the encoded timestamp within the past 10 minutes).

The original question was:

Do you know when it will be possible for users to export these SQS spans from x-ray to other otlp compatible endpoint like aspecto collector?

I can think of many ways to implement this feature and encourage anyone with motivation to ping AWS again and help push this forward

@fbogsany
Copy link
Contributor

I think X-Ray propagation is only needed if you're exporting to X-Ray

@fbogsany I don't agree with that. If you send a message to SQS which AWS forward to SNS, then in order to have e2e observability across this call you must inject x-ray headers with x-ray propagator. Other headers like b3/w3c are not guaranteed to be propagated via AWS services. So a user that needs a complete trace that involves AWS intermediary must use x-ray propagator, but there is no obligation to export any of this to x-ray backend.

edit: What I mean is that the producers and consumers are connected to each other, with no instrumentation of the intermediary, but the context still needs to be propagated correctly by the intermediary and that is only guaranteed with X-Amzn-Trace-Id

That's fair. I alluded to that case (Lambda may be similar) with

i.e. you're purely a client and your requests don't transit an AWS service, and you're not using X-Ray

An interesting question, though, is whether X-Amzn-Trace-Id propagation is "transparent" in this case. I.e. if you propagate context in that format and SQS/SNS are traced / send traces to X-Ray and you yourself are not using X-Ray, will you then have broken traces (because you're using a different backend from what SQS/SNS are using)? Background context: we experienced exactly this with Google and GCLB some years back - if you used the then-current Google trace context propagation, didn't use Stackdriver Trace, but did have requests transiting GCLB, then you'd end up with broken traces because GCLB would generate a span for the LB transit, and update the trace context header, but GCLB would export its spans to Stackdriver and our spans would go somewhere else.

So, if you send a message to SQS which AWS forwards to SNS and you inject the X-Amzn-Trace-ID is that header propagated untouched or is it modified by the AWS intermediaries?

@blumamir
Copy link
Member

To my understanding which might be not up to date, x-ray is the only place where these AWS internal service spans are currently being sent to with the limitations it brings. Another option at the moment is to configure AWS not to instrument their services in which case the context is just propagated directly as is where AWS implemented it (not all of the services).

The last time I touched this kind of setup was around 2 years ago so things might have changed but this was the last thing I recall. Would be best to try it hands-on, dig through documentations and code or reach out to an AWS engineer that can clarify that. I don't have the capacity but would love to know the best practices from AWS for using these components in otel setups.

@fbogsany
Copy link
Contributor

Stepping back a bit here, our general philosophy is that vendor-specific extensions should be provided by vendors in their own repos. Unless the spec explicitly states otherwise, things like X-Ray ID generation, propagation, and export are the responsibility of vendors and should not be maintained by the OpenTelemetry Ruby (contrib) maintainers/approvers, or live in our repos.

Given that, the packaging of these features is up to AWS, and if you are a customer of AWS, you can try to influence their decisions to meet your needs.

@blumamir
Copy link
Member

An interesting question, though, is whether X-Amzn-Trace-Id propagation is "transparent" in this case. I.e. if you propagate context in that format and SQS/SNS are traced / send traces to X-Ray and you yourself are not using X-Ray, will you then have broken traces (because you're using a different backend from what SQS/SNS are using)? Background context: we experienced exactly this with Google and GCLB some years back - if you used the then-current Google trace context propagation, didn't use Stackdriver Trace, but did have requests transiting GCLB, then you'd end up with broken traces because GCLB would generate a span for the LB transit, and update the trace context header, but GCLB would export its spans to Stackdriver and our spans would go somewhere else.

That is exactly what we saw as well. That was the motivation to propagate SQS context with MessageAttributes in aws-sdk instrumentation ruby and js instead of relying on AWS headers which are sensitive to bugs and configuration issues with intermediaries. see this issue for example. There were other bugs as well causing a issues with context propagation

So, if you send a message to SQS which AWS forwards to SNS and you inject the X-Amzn-Trace-ID is that header propagated untouched or is it modified by the AWS intermediaries?

This question should better be answered by AWS people. To my understanding (but I have to admit it was long time ago), it should be propagated in mainstream cases, but that depends on how they implemented and interpreted context propagation (not as easy as it sounds). seems there are some more details here for SQS, and I guess other services have different levels of support for context propagation.

@robertlaurin
Copy link
Contributor

robertlaurin commented Mar 20, 2023

Stepping back a bit here, our general philosophy is that vendor-specific extensions should be provided by vendors in their own repos.

This is inline with discussions we had with @robbkidd during one of the recent SIG calls about what packages were appropriate for pushing into contrib. Certain things make a bit more sense in a Honeycomb repo, and some of the more vendor agnostic things they do would be happily accepted into our contrib repo.

@blumamir
Copy link
Member

Stepping back a bit here, our general philosophy is that vendor-specific extensions should be provided by vendors in their own repos. Unless the spec explicitly states otherwise, things like X-Ray ID generation, propagation, and export are the responsibility of vendors and should not be maintained by the OpenTelemetry Ruby (contrib) maintainers/approvers, or live in our repos.

Given that, the packaging of these features is up to AWS, and if you are a customer of AWS, you can try to influence their decisions to meet your needs.

In js and other SIGs, the vendor components are hosted in contrib repos and are being maintained by AWS engineers (and the contrib maintainers if they are up to it). I think there are prons and cons to each option. If ruby prefers to have these vendor components in a separate repo, that would require someone from AWS to step up and do the work. Are you familiar with the relevant persona to reach out to?

@fbogsany
Copy link
Contributor

Are you familiar with the relevant persona to reach out to?

No, I'm not. I'm not an AWS customer. When we find someone, we should (IMO) figure out how to transfer ownership of the X-Ray propagator and AWS SDK instrumentation gems to them. We also need to figure out how to break up the resource detectors gem to extract vendor-specific resource detection.

https://github.com/alolita is on the Governance Committee for OpenTelemetry and used to represent AWS, but she moved to Apple a year ago. AFAIK AWS no longer has representation on the Technical Committee or Governance Committee.

@blumamir
Copy link
Member

Are you familiar with the relevant persona to reach out to?

No, I'm not. I'm not an AWS customer. When we find someone, we should (IMO) figure out how to transfer ownership of the X-Ray propagator and AWS SDK instrumentation gems to them. We also need to figure out how to break up the resource detectors gem to extract vendor-specific resource detection.

https://github.com/alolita is on the Governance Committee for OpenTelemetry and used to represent AWS, but she moved to Apple a year ago. AFAIK AWS no longer has representation on the Technical Committee or Governance Committee.

Back to the original issue, until this process is started with aws, my personal opinion is to move forward with the current PR that extracts the id generator into its own gem for the reasons above, but the extracted id generation code in x-ray propagator should either be dependent or removed (if not a breaking change). I would love to see our findings on when to use each component in the READMEs to guide users according to our best knowledge so far. And support aws resource detector being added as it's own gem or in the existing general resource detectors gem.

That is my 2 cents as a ruby newbie. Will let the maintainers take it from here :)

@dlahn
Copy link
Contributor Author

dlahn commented Mar 23, 2023

As per discussions in Slack, this is being closed in favour of keeping the ID generator and propagator together. an AWS resource detector would live in the resource_detectors gem for now.

@dlahn dlahn closed this Mar 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants