Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add instrumentation for AWS Lambda Service - Implementation (Part 2/2) #777

Conversation

NathanielRN
Copy link
Contributor

Description

Follow up to #739. In this PR, we add the Python code to actually implement instrumentation.

This PR adds Instrumentation for Lambda functions on the AWS Lambda service. Specifically, when you use AWS Lambda to call your def lambda_handler(lambda_event, lambda_context) on AWS, you can use this instrumentation package and its scripts to instrument that invocation with OpenTelemetry!

Please read the description in #739 to learn more.

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

The unit tests cover the manual instrumentation scenarios of

Does This PR Require a Core Repo Change?

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@NathanielRN NathanielRN changed the title Add instrumentation for AWS Lambda Service - Implementation Add instrumentation for AWS Lambda Service - Implementation (Part 2/2) Oct 25, 2021
@NathanielRN NathanielRN force-pushed the fill-in-aws_lambda-instrumentation-files branch from 16ac59f to ce122d7 Compare October 25, 2021 20:59
@NathanielRN NathanielRN force-pushed the fill-in-aws_lambda-instrumentation-files branch from ce122d7 to c79c369 Compare October 25, 2021 22:23
@NathanielRN NathanielRN force-pushed the fill-in-aws_lambda-instrumentation-files branch 4 times, most recently from c079d96 to 06c3b96 Compare October 26, 2021 15:43
Comment on lines 188 to 198
with tracer.start_as_current_span(
name=orig_handler_name,
context=parent_context,
kind=SpanKind.SERVER,
) as span:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #739 (comment) @owais said:

What if the lambda was triggered by SQS/SNS/Kinesis/S3/etc? It would be a consumer span in that case.

The spec does say the following:

For all events, a span with kind SERVER MUST be created corresponding to the function invocation unless stated otherwise below.

It didn't seem too controversial for the original PR that added it to the specs, comments like open-telemetry/opentelemetry-specification#1442 (comment) were just making sure downstream span were labeled as CONSUMER.

I think the intention is that Lambda spans should always be treated as a "Server" which handles requests even if it is triggered from an SQS publish. The concept of linking traces between SQS and Lambda is a very difficult problem we have been trying to solve at AWS as noted in aws/aws-xray-sdk-go#218 and several other linked issues.

That's why I believe the spec has said a blanket statement that Lambda spans should always be SERVER spans.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be I need to read it in more detail but it says "unless stated otherwise below." and then in SQS section, it says,

The span kind for both types of SQS spans MUST be CONSUMER.

I'll review this and the spec again in detail tomorrow. If you get other approvals by then please go ahead with the merge as we can update this later if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right it does say that...

Hm interesting, reading that sentence you pointed out I guess you could connect the main paragraph:

... SQS ... triggers a Lambda function. ... We consider processing both of a batch (1) and of each individual message (2). The function invocation span MUST correspond to the SQS event, which is the batch of messages. ** For each message, an additional span SHOULD be created to correspond with the handling of the SQS message. **

to this sentence:

The span kind for both types of SQS spans MUST be CONSUMER.

and understand the following:

The spec says the instrumentation has to have the following spans:

  • The overall Lambda span
  • The SQS span
    • Batch messages
    • Indidivdual messages

So I guess the question is does both types of SQS spans refer to the Lambda function span triggered for batch/individual SQS messages or does it refer to the SQS span created in the user code to handle the batch/individual SQS spans.

The "For all events" in "For all events, a span with kind SERVER MUST be created corresponding to the function invocation unless stated otherwise below." is what makes me think it is the latter. But yeah like you said there is the unless stated otherwise below.

But yes the goal of this PR is just to bring over the code from opentelemetry-lambda so hopefully we can make improvements as we go and while it's still in a beta stage 🙂

Please let me know if this makes sense to you! Also if @anuraaga has any input that would be super helpful! 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I understood from it was that ONLY two spans should be created SQS batches and both should be of type CONSUMER. I think in case of SQS there shouldn't be any SERVER span especially if the Lambda is triggered periodically or by SQS itself. Overall it feels weird to have a PRODUCER <> SERVER span pair and then have a CONSUMER span as a child of the server span. I think a lot of APM backends will find that strange. Let's wait for @anuraaga to chime in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if the spec wording isn't too clear - indeed the intent is when handling SQS, for there to only be 1-2 (one of the spans generally needs to be created by the user) spans of type CONSUMER and no spans of type SERVER. There are some examples at the bottom which may also clear things up, though I can't say the ascii art I wrote renders very well...

Copy link
Contributor Author

@NathanielRN NathanielRN Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @anuraaga that makes a lot of sense! I couldn't find any similar implementations in the other languages on the opentelemetry-lambda repo so I went ahead with this idea for determining if the Lambda was triggered from SQS:

Based on the triggering Lambda from SQS docs, the Lambda event will have the key "Records" if triggered from SQS. So in 8bf9147 I could check for it like this:

# See more:
# https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
if (
    lambda_event
    and isinstance(lambda_event, dict)
    and "Records" in lambda_event
):
    span_kind = SpanKind.CONSUMER
else:
    span_kind = SpanKind.SERVER

I think we could also merge without this, and the suggestion is new, but shouldn't be a harm to try it out 🙂

@owais Please let me know if you think this is a good solution!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think checking for looking for events['Records"][0]["event_source"] in ["aws:sqs"] is probably good enough but let's open an issue for this and solve it later in order to not hold this PR any longer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks for that, I think that's a great suggestion! I changed it to this:

# See more:
# https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
try:
    if lambda_event["Records"][0]["eventSource"] == "aws:sqs":
        span_kind = SpanKind.CONSUMER
except Exception:  # pylint: disable=broad-except
    span_kind = SpanKind.SERVER

Copy link
Contributor

@anuraaga anuraaga Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java is lucky to be able to just check instanceof on the actual types of arguments. This sort of attribute matching approach would be needed for nodejs though, where we haven't implemented the SQS handling yet, so there is indeed precedence for starting with only HTTP support and following up with SQS ;) But looks like you have a reasonable solution here.


tracer = get_tracer(__name__, __version__, tracer_provider)

with tracer.start_as_current_span(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #739 (review) @owais said:

Q: How would this work if the lambda uses a framework like flask or falcon? Wouldn't that result in two SERVER spans?

I think this is related to #777 (comment) below.

I think it will return 2 SERVER spans but that's what we expect? AWS Lambda is a short lived easy SERVER right? So it would have as much sense as running a flask app within another flask app? 😅 Please correct me if I'm wrong but that's my understanding, in which case 2 SERVER span would be expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that would be the ideal outcome but this is not a problem specific to lambda as we have the same issue with uwsig/gunicorn + flask/django/etc. We can solve it separately.

@NathanielRN NathanielRN requested a review from owais October 26, 2021 16:29
@NathanielRN NathanielRN marked this pull request as ready for review October 26, 2021 16:29
@NathanielRN NathanielRN requested a review from a team October 26, 2021 16:29
@NathanielRN
Copy link
Contributor Author

@owais I finished moving the comments from #739 here. Please let me know if you have any questions! 🙂

@NathanielRN NathanielRN force-pushed the fill-in-aws_lambda-instrumentation-files branch 3 times, most recently from a182761 to 8bf9147 Compare October 27, 2021 15:38
# See more:
# https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
try:
if lambda_event["Records"][0]["eventSource"] == "aws:sqs":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we code this defensively or catch only the expected exceptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm sure we can do that! Although it's such simple code I can't think of any other exception that should occur?

Either way I updated it to look like this based on my tests of what could go wrong 🙂 Please let me know what you think!

# See more:
# https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
try:
    if lambda_event["Records"][0]["eventSource"] == "aws:sqs":
        span_kind = SpanKind.CONSUMER
except (IndexError, KeyError, TypeError):
    span_kind = SpanKind.SERVER

@NathanielRN NathanielRN requested a review from owais October 27, 2021 17:52
@NathanielRN NathanielRN force-pushed the fill-in-aws_lambda-instrumentation-files branch from c221f49 to 5bf66d2 Compare October 28, 2021 19:27
@NathanielRN NathanielRN force-pushed the fill-in-aws_lambda-instrumentation-files branch from 5bf66d2 to f84c263 Compare October 28, 2021 21:11
@owais owais merged commit 671aea3 into open-telemetry:main Nov 1, 2021
@NathanielRN NathanielRN deleted the fill-in-aws_lambda-instrumentation-files branch November 1, 2021 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants