Docs: explain how to cleanup with middy early return #1410

Closed
quantumew opened this issue Apr 13, 2023 · 15 comments · Fixed by #1473

@quantumew

quantumew commented Apr 13, 2023

Expected Behaviour

Instructions on how to avoid this segment leak scenario.

Current Behaviour

When middy returns early, the segment is not closed (middy does not run the after middleware); eventually this results in a RangeError from the incrementCounter call due to how many open segments exist.

Code snippet

import middy from '@middy/core';
import { Tracer, captureLambdaHandler } from '@aws-lambda-powertools/tracer';

const tracer = new Tracer({ serviceName: 'xxx' })

...

export const handler = middy(lambdaHandler).use(captureLambdaHandler(tracer)).use(cacheMiddleware);

Steps to Reproduce

  1. Create a lambda with provisioned concurrency
  2. Hook up the captureLambdaHandler middleware
  3. Create middleware that runs after the capture middleware and returns early in its before phase
  4. Put this lambda under high load (this issue keeps arising 40 mins into our load tests)

Possible Solution

Before returning early in middy, I grab the current segment and close it. That does not seem to solve the problem, surprisingly. Do I also have to close all potential subsegments? Is the handler segment not the issue, but actually the parent Lambda segment?

AWS Lambda Powertools for TypeScript version

latest

AWS Lambda function runtime

16.x

Packaging format used

npm

Execution logs

2023-04-10T14:16:56.326Z	d94677bc-a252-4830-a678-2c5ece7e0aff	ERROR	Invoke Error
{
  "errorType": "RangeError",
  "errorMessage": "Maximum call stack size exceeded",
  "stack": [
    "RangeError: Maximum call stack size exceeded",
    "    at Ae.incrementCounter (/var/task/index.js:25:9545)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)",
    "    at Ae.incrementCounter (/var/task/index.js:25:9621)"
  ]
}
@quantumew quantumew added the triage and bug labels Apr 13, 2023
@dreamorosi dreamorosi changed the title Segment leak with middy early return Bug: Segment leak with middy early return Apr 13, 2023
@dreamorosi
Contributor

Hi @quantumew, thank you for taking the time to open this issue.

Based on the info you have provided above I don't have enough info to reproduce the issue. Also, to better understand the issue: how did you pinpoint it to either Middy or the Tracer?

The log that you have provided contains only a method called incrementCounter, which is not present in our codebase nor in Middy's (based on this search), so I'm not sure what I should be looking for.

When middy returns early, the segment is not closed (middy does not run after middleware)

Are you referring to this feature of Middy? Does your second custom middleware return early like this?
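
For reference, a minimal sketch of the kind of early return I mean (the middleware name and cache are made up for illustration):

import middy from '@middy/core';

const cache = new Map<string, string>();

// Returning a non-undefined value from a `before` middleware makes Middy skip
// the handler and any remaining middleware, and use that value as the response
export const earlyReturnMiddleware = (): middy.MiddlewareObj => ({
  before: (request) => {
    const key = (request.event as { id: string }).id;
    if (cache.has(key)) {
      return cache.get(key);
    }
  },
});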

Before returning early in middy, I grab the current segment and close it. That does not seem to solve the problem, surprisingly. Do I also have to close all potential subsegments? Is the handler segment not the issue, but actually the parent Lambda segment?

Let me break this down:

Before returning early in middy, I grab the current segment and close it. That does not seem to solve the problem, surprisingly.

That's a sensible solution. Any chance you could share the bit of code that you use to do this operation? I'd like to see how you're closing the segment.

Do I also have to close all potential subsegments?

Not necessarily. Once you close the parent, all its children should also be flushed.

Is the handler segment not the issue, but actually the parent Lambda segment?

I'm not 100% sure, but I doubt that. The function cannot mutate this segment; it's handled by the X-Ray SDK and refreshed at each invocation based on the invocation ID/Trace ID present in the environment.


I'm not 100% sure I can pinpoint the problem with the current info, but what I can say is that, based on the current implementation of the captureLambdaHandler middleware, both the handler and root segments should be garbage collected at each invocation, given that we reassign them at the beginning of each request.

Basically, even if you didn't close any of the segments in your early return, the next function invocation should reassign the values of the segments and overwrite their previous values. In this case you might see trace data loss or incorrect segment durations, but not a stack overflow.
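
To illustrate the reassignment (a simplified sketch based on the description above, not the actual middleware source):

import { Tracer } from '@aws-lambda-powertools/tracer';

const tracer = new Tracer();

const before = async (): Promise<void> => {
  // Grab the facade segment created by the Lambda service for this invocation
  const lambdaSegment = tracer.getSegment();
  // Create a fresh handler subsegment for this invocation and make it the
  // current segment, replacing whatever was set during the previous invocation
  const handlerSegment = lambdaSegment.addNewSubsegment(`## ${process.env._HANDLER}`);
  tracer.setSegment(handlerSegment);
};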

@dreamorosi dreamorosi added the tracer and need-response labels and removed the triage label Apr 13, 2023
@dreamorosi dreamorosi self-assigned this Apr 13, 2023
@quantumew
Author

Thank you for such a thorough response.

Here is how I attempt to close the segment:

if (tracer && tracer.isTracingEnabled()) {
  try {
    // Close lambda handler segment since we are circuit breaking here
    tracer.getSegment().close();
  } catch (err) {
    log.warn('Could not close the current segment before returning early', { err });
  }
}

The increment call is coming from the X-Ray SDK.

Based on that stack trace, it looks like the problem is deeply nested segments, eventually causing a stack overflow.

I came to this conclusion purely through inference. The Tracer is created on Lambda cold start. I assumed that when you called getSegment here, it returned the previous invocation's handler segment, since it was never closed because we returned early, resulting in handler segments nested inside handler segments.

It just occurred to me that I am not doing this step. Could that be the problem? I am closing it but it is still the current segment?
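
In case it helps, here is roughly what I understand the missing step to be (untested sketch, using the same tracer instance as in the snippet above):

import type { Subsegment } from 'aws-xray-sdk-core';

// Close the handler segment, then restore its parent (the facade segment
// created by the Lambda service) as the current segment
const handlerSegment = tracer.getSegment() as Subsegment;
handlerSegment.close();
tracer.setSegment(handlerSegment.parent);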

@dreamorosi
Contributor

I assumed that when you called getSegment here, it returned the previous invocation's handler segment, since it was never closed because we returned early.

That's a fair assumption and one that, right off the bat, I'm also not able to rule out.

Now that I have a clearer idea of what your code is doing I can do more tests and give you a more precise answer.

It just occurred to me that I am not doing this step. Could that be the problem? I am closing it but it is still the current segment?

It could. The issue here is that you won't have access to that segment since it's not stored anywhere in the Tracer class.

I need to spend some time thinking about those two points before weighing in, but I see that there's a chance we might not be taking into account this edge case of users returning early in one of the downstream middlewares.

I'll spend some time looking at this tomorrow and get back to you here.

Thanks for the patience and for reporting this interesting case!

@dreamorosi
Contributor

If the above is correct, I'm thinking this might be a good use case for using Middy's internal storage and allowing downstream middleware to bail out early while still cleaning up traces, metrics, and other Powertools-related actions.
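
Purely to illustrate the idea (none of these names are a real API, it's just a sketch of how Middy's internal storage could be used): an upstream Powertools middleware could stash a cleanup function in request.internal, and any downstream middleware that wants to bail out early could call it before returning.

import middy from '@middy/core';

// Hypothetical upstream middleware: registers a cleanup function
export const powertoolsSetupMiddleware = (): middy.MiddlewareObj => ({
  before: (request) => {
    request.internal.cleanup = async (): Promise<void> => {
      // close segments, flush metrics, etc.
    };
  },
});

// Hypothetical downstream middleware that returns early: runs the cleanup first
export const cacheHitMiddleware = (): middy.MiddlewareObj => ({
  before: async (request) => {
    const cleanup = request.internal.cleanup as (() => Promise<void>) | undefined;
    await cleanup?.();

    return 'cached response';
  },
});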

@dreamorosi
Contributor

I've been talking with @willfarrell (maintainer of Middy) and another option that came up would be to place any middleware that you know might return early first in the middleware chain:

const handler = middy()
  .use(myCacheMiddleware)
  .use(captureLambdaHandler(tracer))

This way, if the first middleware returns early, the captureLambdaHandler middleware won't enter its before phase and won't leave a dangling segment.

If this is the problem, this should fix your issue. However, I'll still continue investigating to see if we can do something for this edge case, as I would prefer not to rely on ordering.

@quantumew
Author

quantumew commented Apr 13, 2023

Definitely, great suggestion; it is one of our workarounds on an endpoint where disabling tracing was not an option. However, we then never get any trace data for our cache hits. My hope is to have the best of both worlds. Although, you could argue that a combination of ordering plus manual instrumentation would suffice here. Perhaps that is the tactical path I will pursue for now.

@willfarrell

There are undocumented features in @middy/core that I think can help here. There is a requestEnd hook you can pass in. I've put it on my list to write some docs on how these work.

Example use: https://middy.js.org/docs/best-practices/profiling
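
Rough shape of how the requestEnd hook could be used here (just a sketch; the cleanup body is a placeholder):

import middy from '@middy/core';

const lambdaHandler = async (): Promise<string> => 'some response';

const cleanupPlugin = {
  // requestEnd runs once the request has been fully processed, including
  // when a middleware returned early
  requestEnd: async (): Promise<void> => {
    // e.g. close anything that an early return left open
  },
};

export const handler = middy(lambdaHandler, cleanupPlugin);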

@dreamorosi
Contributor

Sorry, I know I said I'd look at it tomorrow but I was so curious I couldn't resist looking into it 🙃

I've been doing some tests and I was able to reproduce the behavior you have described and also find a potential solution. Allow me to elaborate.

This is my function handler:

import middy from "@middy/core";
import { Tracer, captureLambdaHandler } from "@aws-lambda-powertools/tracer";
import { cacheMiddleware } from "./middlewares";

const tracer = new Tracer();
// Dummy cache store
const storage = new Map<string, string>();

export const handler = middy(async () => {
  // This should be logged only in case of a cache miss
  console.log("handler run");

  return "some response";
})
  .use(captureLambdaHandler(tracer))
  .use(cacheMiddleware({ storage }));

It's fairly trivial, but it helps us model the issue. The important details here are:

  • I'm importing a cacheMiddleware (which we'll look at in a second)
  • I am using Tracer and I'm also creating a dummy cache store (a Map)
  • My handler has a log that I expect to see only when the cache misses
  • The handler uses both captureLambdaHandler and my other middleware (in this order)

Now, like you suggested earlier, I am returning early when I encounter the same id in the event and I have it in the cache:

import { Tracer } from "@aws-lambda-powertools/tracer";
import type { Subsegment } from "aws-xray-sdk-core";
import middy from "@middy/core";

const tracer = new Tracer();

type myEvent = {
  id: string;
};

const cacheMiddleware = ({
  storage,
}: {
  storage: Map<string, string>;
}): middy.MiddlewareObj => {
  let cacheKey: string;

  const before = (request: middy.Request): string | void => {
    cacheKey = (request.event as myEvent).id;
    console.log("cacheKey", cacheKey);
    // Cache hit
    if (storage.has(cacheKey)) {
      console.log("cache hit on", cacheKey);
      // Return from cache
      return storage.get(cacheKey);
    }
  };

  const after = (request: middy.Request): void => {
    // Save response for potential later usage
    storage.set(cacheKey, request.response as string);
  };

  return {
    before,
    after,
  };
};

export { cacheMiddleware };

Note on reinstantiating Tracer

The two modules don't share the same instance of the Tracer. This is not a problem in this specific case, as the segment is stored in the context provided by the X-Ray SDK; however, I'd recommend instantiating it in a shared module like powertools.ts and importing it from both files:

import { Tracer } from '@aws-lambda-powertools/tracer';

const tracer = new Tracer();

export {
  tracer,
};

This will allow you to also share configs and so on, so I would always recommend it.
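
Both the handler module and the middleware module would then import the same instance:

import { tracer } from './powertools';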

Note
I'm running a patched version of the X-Ray SDK that emits debug logs whenever a segment is sent to X-Ray - you won't see the same logs if you run the same code.

Anyway, invoking this function 4 times, with id in this order: 1, 2, 1 (cache hit expected), 3, generates these logs:

2023-04-14T00:17:12.825+02:00	INIT_START Runtime Version: nodejs:18.v5	Runtime Version ARN: arn:aws:lambda:eu-west-1::runtime:c869d752e4ae21a3945cfcb3c1ff2beb1f160d7bcec3b0a8ef7caceae73c055f
2023-04-14T00:17:13.097+02:00	START RequestId: a227980e-2636-4b93-a146-14e28cfd433f Version: $LATEST
2023-04-14T00:17:13.100+02:00	2023-04-13T22:17:13.100Z	a227980e-2636-4b93-a146-14e28cfd433f	INFO	cacheKey 1
2023-04-14T00:17:13.133+02:00	2023-04-13T22:17:13.132Z	a227980e-2636-4b93-a146-14e28cfd433f	INFO	handler run
2023-04-14T00:17:13.274+02:00	2023-04-13T22:17:13.274Z	a227980e-2636-4b93-a146-14e28cfd433f	DEBUG	Subsegment sent: {"trace_id:"1-64387f68-41aa499836aa999a0206f6c4","id":"6d83baa571e06f42"}
2023-04-14T00:17:13.274+02:00	2023-04-13T22:17:13.274Z	a227980e-2636-4b93-a146-14e28cfd433f	DEBUG	UDP message sent: {"id":"6d83baa571e06f42","name":"## index.handler","start_time":1681424233.099,"annotations":{"ColdStart":true,"Service":"service_undefined"},"metadata":{"service_undefined":{"index.handler response":"some response"}},"end_time":1681424233.154,"type":"subsegment","parent_id":"41810eee6f6ba62a","trace_id":"1-64387f68-41aa499836aa999a0206f6c4"}
2023-04-14T00:17:13.292+02:00	END RequestId: a227980e-2636-4b93-a146-14e28cfd433f
2023-04-14T00:17:13.293+02:00	REPORT RequestId: a227980e-2636-4b93-a146-14e28cfd433f	Duration: 178.47 ms	Billed Duration: 179 ms	Memory Size: 128 MB	Max Memory Used: 71 MB	Init Duration: 270.67 ms	
                             	XRAY TraceId: 1-64387f68-41aa499836aa999a0206f6c4	SegmentId: 7ea8f8c73cac7ce8	Sampled: true	
2023-04-14T00:17:16.184+02:00	START RequestId: ef4a9331-86da-45b9-a2d3-0ab60d9d133c Version: $LATEST
2023-04-14T00:17:16.185+02:00	2023-04-13T22:17:16.185Z	ef4a9331-86da-45b9-a2d3-0ab60d9d133c	INFO	cacheKey 2
2023-04-14T00:17:16.185+02:00	2023-04-13T22:17:16.185Z	ef4a9331-86da-45b9-a2d3-0ab60d9d133c	INFO	handler run
2023-04-14T00:17:16.192+02:00	2023-04-13T22:17:16.192Z	ef4a9331-86da-45b9-a2d3-0ab60d9d133c	DEBUG	Subsegment sent: {"trace_id:"1-64387f6c-0b2913961dc2ee30714fdc4d","id":"2f553eb71d5b86d4"}
2023-04-14T00:17:16.192+02:00	2023-04-13T22:17:16.192Z	ef4a9331-86da-45b9-a2d3-0ab60d9d133c	DEBUG	UDP message sent: {"id":"2f553eb71d5b86d4","name":"## index.handler","start_time":1681424236.185,"annotations":{"ColdStart":false,"Service":"service_undefined"},"metadata":{"service_undefined":{"index.handler response":"some response"}},"end_time":1681424236.186,"type":"subsegment","parent_id":"1b626776719e976c","trace_id":"1-64387f6c-0b2913961dc2ee30714fdc4d"}
2023-04-14T00:17:16.194+02:00	END RequestId: ef4a9331-86da-45b9-a2d3-0ab60d9d133c
2023-04-14T00:17:16.194+02:00	REPORT RequestId: ef4a9331-86da-45b9-a2d3-0ab60d9d133c	Duration: 8.93 ms	Billed Duration: 9 ms	Memory Size: 128 MB	Max Memory Used: 71 MB	
                             	XRAY TraceId: 1-64387f6c-0b2913961dc2ee30714fdc4d	SegmentId: 39922fd50ead9f6d	Sampled: true	
2023-04-14T00:17:19.082+02:00	START RequestId: 0fa0c270-74b1-47c0-b2fc-2afa0183d110 Version: $LATEST
2023-04-14T00:17:19.083+02:00	2023-04-13T22:17:19.083Z	0fa0c270-74b1-47c0-b2fc-2afa0183d110	INFO	cacheKey 1
2023-04-14T00:17:19.083+02:00	2023-04-13T22:17:19.083Z	0fa0c270-74b1-47c0-b2fc-2afa0183d110	INFO	cache hit on 1
2023-04-14T00:17:19.092+02:00	END RequestId: 0fa0c270-74b1-47c0-b2fc-2afa0183d110
2023-04-14T00:17:19.092+02:00	REPORT RequestId: 0fa0c270-74b1-47c0-b2fc-2afa0183d110	Duration: 2.08 ms	Billed Duration: 3 ms	Memory Size: 128 MB	Max Memory Used: 71 MB	
                             	XRAY TraceId: 1-64387f6e-38dbe23056f63fa14ad819b1	SegmentId: 75a6dd267bc8d122	Sampled: true	
2023-04-14T00:17:22.322+02:00	START RequestId: 61f8f1a8-824a-4e37-ac85-36c21dd607ef Version: $LATEST
2023-04-14T00:17:22.452+02:00	2023-04-13T22:17:22.452Z	61f8f1a8-824a-4e37-ac85-36c21dd607ef	INFO	cacheKey 3
2023-04-14T00:17:22.453+02:00	2023-04-13T22:17:22.452Z	61f8f1a8-824a-4e37-ac85-36c21dd607ef	INFO	handler run
2023-04-14T00:17:22.473+02:00	2023-04-13T22:17:22.472Z	61f8f1a8-824a-4e37-ac85-36c21dd607ef	DEBUG	Subsegment sent: {"trace_id:"1-64387f6e-38dbe23056f63fa14ad819b1","id":"f681fb20be6fa41e"}
2023-04-14T00:17:22.473+02:00	2023-04-13T22:17:22.473Z	61f8f1a8-824a-4e37-ac85-36c21dd607ef	DEBUG	UDP message sent: {"id":"f681fb20be6fa41e","name":"## index.handler","start_time":1681424242.452,"annotations":{"ColdStart":false,"Service":"service_undefined"},"metadata":{"service_undefined":{"index.handler response":"some response"}},"end_time":1681424242.453,"type":"subsegment","parent_id":"b1a6ccebf9a712d8","trace_id":"1-64387f6e-38dbe23056f63fa14ad819b1"}
2023-04-14T00:17:22.474+02:00	END RequestId: 61f8f1a8-824a-4e37-ac85-36c21dd607ef
2023-04-14T00:17:22.474+02:00	REPORT RequestId: 61f8f1a8-824a-4e37-ac85-36c21dd607ef	Duration: 152.04 ms	Billed Duration: 153 ms	Memory Size: 128 MB	Max Memory Used: 72 MB	
                             	XRAY TraceId: 1-64387f72-339d9ca93562977a43d0288a	SegmentId: 57426b78052e7953	Sampled: true	

This shows two interesting things:

  • the 3rd run (RequestId: 0fa0c270-74b1-47c0-b2fc-2afa0183d110 & TraceId: 1-64387f6e-38dbe23056f63fa14ad819b1) - which is a cache hit - is not sending any segment to X-Ray even though it was sampled
  • the 4th run (RequestId: 61f8f1a8-824a-4e37-ac85-36c21dd607ef & TraceId: 1-64387f72-339d9ca93562977a43d0288a) is sending a segment over, but, as you can see from its content, it is in fact sending the segment from the previous invocation

Now, let's modify the cacheMiddleware, and specifically the before method:

const before = (request: middy.Request): string | void => {
    cacheKey = (request.event as myEvent).id;
    // Cache hit
    if (storage.has(cacheKey)) {
+      // Get current segment, which if present should always be the main handler segment (`## index.handler`), so we can typecast
+      const myLastSegment = tracer.getSegment() as Subsegment | undefined;
+      if (myLastSegment) {
+        // You can add annotations or metadata here
+        myLastSegment.addAnnotation("CacheHit", true);
+        // Close it
+        myLastSegment.close();
+        // Then finally set back to the `facade` segment (the one created by the Lambda service) - which is accessible at the .parent property
+        tracer.setSegment(myLastSegment.parent);
+      }

       // Return from cache
      return storage.get(cacheKey);
    }
  };

and then ran the function again with the same event order as before: 1, 2, 1 (cache hit), 3, and got the following logs:

2023-04-14T00:03:02.681+02:00	START RequestId: 0bcbbdac-9083-4d1d-9374-37fed8a99126 Version: $LATEST
2023-04-14T00:03:02.696+02:00	2023-04-13T22:03:02.696Z	0bcbbdac-9083-4d1d-9374-37fed8a99126	INFO	cacheKey 1
2023-04-14T00:03:02.716+02:00	2023-04-13T22:03:02.716Z	0bcbbdac-9083-4d1d-9374-37fed8a99126	INFO	handler run
2023-04-14T00:03:02.877+02:00	2023-04-13T22:03:02.877Z	0bcbbdac-9083-4d1d-9374-37fed8a99126	DEBUG	Subsegment sent: {"trace_id:"1-64387c16-069c0abf0e8aab7334f136c2","id":"3d20f50fa7445b50"}
2023-04-14T00:03:02.877+02:00	2023-04-13T22:03:02.877Z	0bcbbdac-9083-4d1d-9374-37fed8a99126	DEBUG	UDP message sent: {"id":"3d20f50fa7445b50","name":"## index.handler","start_time":1681423382.683,"annotations":{"ColdStart":true,"Service":"service_undefined"},"metadata":{"service_undefined":{"index.handler response":"some response"}},"end_time":1681423382.719,"type":"subsegment","parent_id":"22ea02d8751dab92","trace_id":"1-64387c16-069c0abf0e8aab7334f136c2"}
2023-04-14T00:03:02.898+02:00	END RequestId: 0bcbbdac-9083-4d1d-9374-37fed8a99126
2023-04-14T00:03:02.898+02:00	REPORT RequestId: 0bcbbdac-9083-4d1d-9374-37fed8a99126	Duration: 214.97 ms	Billed Duration: 215 ms	Memory Size: 128 MB	Max Memory Used: 71 MB	Init Duration: 254.88 ms	
                             	XRAY TraceId: 1-64387c16-069c0abf0e8aab7334f136c2	SegmentId: 0ad95f532492336f	Sampled: true	
2023-04-14T00:03:05.279+02:00	START RequestId: a8bc902f-3132-41d1-975b-4913131169fc Version: $LATEST
2023-04-14T00:03:05.280+02:00	2023-04-13T22:03:05.280Z	a8bc902f-3132-41d1-975b-4913131169fc	INFO	cacheKey 2
2023-04-14T00:03:05.280+02:00	2023-04-13T22:03:05.280Z	a8bc902f-3132-41d1-975b-4913131169fc	INFO	handler run
2023-04-14T00:03:05.296+02:00	2023-04-13T22:03:05.296Z	a8bc902f-3132-41d1-975b-4913131169fc	DEBUG	Subsegment sent: {"trace_id:"1-64387c19-1ddffc1931fcdbed0f4e1565","id":"ef7585b4e04e81b7"}
2023-04-14T00:03:05.296+02:00	2023-04-13T22:03:05.296Z	a8bc902f-3132-41d1-975b-4913131169fc	DEBUG	UDP message sent: {"id":"ef7585b4e04e81b7","name":"## index.handler","start_time":1681423385.28,"annotations":{"ColdStart":false,"Service":"service_undefined"},"metadata":{"service_undefined":{"index.handler response":"some response"}},"end_time":1681423385.281,"type":"subsegment","parent_id":"bfb07a7fce3fa757","trace_id":"1-64387c19-1ddffc1931fcdbed0f4e1565"}
2023-04-14T00:03:05.298+02:00	END RequestId: a8bc902f-3132-41d1-975b-4913131169fc
2023-04-14T00:03:05.298+02:00	REPORT RequestId: a8bc902f-3132-41d1-975b-4913131169fc	Duration: 18.05 ms	Billed Duration: 19 ms	Memory Size: 128 MB	Max Memory Used: 71 MB	
                             	XRAY TraceId: 1-64387c19-1ddffc1931fcdbed0f4e1565	SegmentId: 49844485080aaa91	Sampled: true	
2023-04-14T00:03:08.462+02:00	START RequestId: 5e232d31-afb0-45eb-a4df-8361b8bf8cdc Version: $LATEST
2023-04-14T00:03:08.463+02:00	2023-04-13T22:03:08.463Z	5e232d31-afb0-45eb-a4df-8361b8bf8cdc	INFO	cacheKey 1
2023-04-14T00:03:08.464+02:00	2023-04-13T22:03:08.464Z	5e232d31-afb0-45eb-a4df-8361b8bf8cdc	INFO	cache hit on 1
2023-04-14T00:03:08.466+02:00	2023-04-13T22:03:08.465Z	5e232d31-afb0-45eb-a4df-8361b8bf8cdc	DEBUG	Subsegment sent: {"trace_id:"1-64387c1c-456a4cd763c61e726e5da722","id":"acaf78d6d0d7638a"}
2023-04-14T00:03:08.466+02:00	2023-04-13T22:03:08.465Z	5e232d31-afb0-45eb-a4df-8361b8bf8cdc	DEBUG	UDP message sent: {"id":"acaf78d6d0d7638a","name":"## index.handler","start_time":1681423388.463,"annotations":{"ColdStart":false,"Service":"service_undefined","CacheHit":true},"end_time":1681423388.464,"type":"subsegment","parent_id":"d865d82bb659b853","trace_id":"1-64387c1c-456a4cd763c61e726e5da722"}
2023-04-14T00:03:08.477+02:00	END RequestId: 5e232d31-afb0-45eb-a4df-8361b8bf8cdc
2023-04-14T00:03:08.477+02:00	REPORT RequestId: 5e232d31-afb0-45eb-a4df-8361b8bf8cdc	Duration: 14.24 ms	Billed Duration: 15 ms	Memory Size: 128 MB	Max Memory Used: 71 MB	
                             	XRAY TraceId: 1-64387c1c-456a4cd763c61e726e5da722	SegmentId: 4bb109a441a7f6b1	Sampled: true	
2023-04-14T00:03:12.800+02:00	START RequestId: 5aeb291d-b94b-4baf-bd08-297b03454825 Version: $LATEST
2023-04-14T00:03:12.917+02:00	2023-04-13T22:03:12.917Z	5aeb291d-b94b-4baf-bd08-297b03454825	INFO	cacheKey 3
2023-04-14T00:03:12.917+02:00	2023-04-13T22:03:12.917Z	5aeb291d-b94b-4baf-bd08-297b03454825	INFO	handler run
2023-04-14T00:03:12.936+02:00	2023-04-13T22:03:12.936Z	5aeb291d-b94b-4baf-bd08-297b03454825	DEBUG	Subsegment sent: {"trace_id:"1-64387c20-3895fc156abdf11d03e71b06","id":"52962f26a5392b53"}
2023-04-14T00:03:13.006+02:00	2023-04-13T22:03:13.005Z	5aeb291d-b94b-4baf-bd08-297b03454825	DEBUG	UDP message sent: {"id":"52962f26a5392b53","name":"## index.handler","start_time":1681423392.917,"annotations":{"ColdStart":false,"Service":"service_undefined"},"metadata":{"service_undefined":{"index.handler response":"some response"}},"end_time":1681423392.917,"type":"subsegment","parent_id":"40890a0a796976db","trace_id":"1-64387c20-3895fc156abdf11d03e71b06"}
2023-04-14T00:03:13.017+02:00	END RequestId: 5aeb291d-b94b-4baf-bd08-297b03454825
2023-04-14T00:03:13.017+02:00	REPORT RequestId: 5aeb291d-b94b-4baf-bd08-297b03454825	Duration: 216.32 ms	Billed Duration: 217 ms	Memory Size: 128 MB	Max Memory Used: 72 MB	
                             	XRAY TraceId: 1-64387c20-3895fc156abdf11d03e71b06	SegmentId: 20a6f98a0d43bf97	Sampled: true	

With the updated middleware we can notice that:

  • the segment is sent successfully in all invocations
  • each invocation is sending the correct segment (the TraceId value at the end of the logs matches what's in the segment)
  • in the case of a cache hit, it also includes the CacheHit annotation, which I can now use to filter my traces.

I can see the same trace data in the CloudWatch console:
(screenshot of the corresponding trace in the CloudWatch console)


Based on the above I think we have narrowed down the cause of the issue, which is, as you initially suggested, related to the fact that the segments are not being properly closed & restored in the event of an early return in a downstream middleware.

Based on the above, I also don't think this should be classified as a bug in the proper sense of the term: while I agree that this behavior is not documented anywhere, the library is working as intended.

With this in mind, I'd like to change the labeling & type of the issue to a documentation improvement and work on a PR that adds a section to the docs explaining how to handle this type of case. Additionally, I'd like to add this to the Powertools section of the Middy docs.

@dreamorosi dreamorosi added the discussing, documentation, and confirmed labels and removed the need-response, bug, and discussing labels Apr 13, 2023
@dreamorosi dreamorosi moved this from Pending review to Backlog in AWS Lambda Powertools for TypeScript Apr 13, 2023
@dreamorosi dreamorosi changed the title Bug: Segment leak with middy early return Docs: explain how to cleanup with middy early return Apr 13, 2023
@willfarrell

I put some simple docs together for the supported hooks.

https://middy.js.org/docs/intro/hooks/

@quantumew
Author

@dreamorosi wow, that is exactly what I was missing! Much appreciated. Also, I agree with your assessment about the classification of this issue.

@willfarrell TIL, thank you, that'll be super useful.

@dreamorosi
Contributor

Hi @quantumew, I've been noodling on this a bit more and I think that there's some room for improvement beyond just documenting. I have come up with a proposed solution and I'd like to hear your opinion as a potential user, if you don't mind.

The problem I see with just documenting is that not everyone who uses the early return feature in Middy might be aware of the unintended consequences of skipping the post-request actions related to Powertools, so it's unlikely they'd think of looking for this in the docs unless something breaks (as it did for you).

Additionally, as shown by the discussion above, the solution is not immediately obvious and, in a way, unnecessarily exposes the internals of our captureLambdaHandler middleware. In the future we might modify its implementation and add/remove setup & teardown actions, and this would require our users to keep up with these changes and modify their custom middleware accordingly.

To avoid all this, and to provide a simpler/clearer DX, I'm thinking of exposing a function aimed at developers writing middleware that might return early.

As a developer, you'd simply have to import the function and call it. Powertools would handle all cleanup operations and stay up to date with the rest of the utilities:

+ import { getPowertoolsCloseFunction } from '@aws-lambda-powertools/commons';

const before = (request: middy.Request): string | void => {
    cacheKey = (request.event as myEvent).id;
    // Cache hit
    if (storage.has(cacheKey)) {
-      // Get current segment, which if present should always be the main handler segment (`## index.handler`), so we can typecast
-      const myLastSegment = tracer.getSegment() as Subsegment | undefined;
-      if (myLastSegment) {
-        // You can add annotations or metadata here
-        myLastSegment.addAnnotation("CacheHit", true);
-        // Close it
-        myLastSegment.close();
-        // Then finally set back to the `facade` segment (the one created by the Lambda service) - which is accessible at the .parent property
-        tracer.setSegment(myLastSegment.parent);
-      }
+      const powertoolsCloseFn = getPowertoolsCloseFunction(request);
+      powertoolsCloseFn?.({
+        tracer: {
+          annotations: [{ key: "CacheHit", value: true }],
+        },
+      });

      // Return from cache
      return storage.get(cacheKey);
    }
  };

Note that the naming and signature are not final; however, the idea is that in your middleware you import the function factory and then, before returning in your code, you get the function and call it.

When calling the function factory, you need to pass it the request. This helps the factory determine which other Powertools middleware you are using and automatically perform any kind of cleanup required.

When calling the function, you can also optionally pass extra arguments to customize the cleanup actions. For instance, in the case of Tracer, you might want to set annotations or metadata on your segment before closing it. This type of signature also allows for future Powertools middleware you might decide to adopt 👀.

What do you think? Would you find this useful over the previous solution? Would really appreciate your opinion on this.

@quantumew
Author

@dreamorosi I dig it. I agree, understanding how to properly close and reset context is a bit esoteric, so hiding that behind a helper is a great idea.

One question though: what if the middleware this exposes used the aforementioned hooks instead, specifically the requestEnd hook, to close the segment? If I understand it correctly, that would ensure the cleanup runs whether the request ends via after, onError, or an early return, covering all scenarios.

@dreamorosi dreamorosi moved this from Backlog to Working on it in AWS Lambda Powertools for TypeScript May 2, 2023
@dreamorosi
Contributor

Apologies for the delayed response.

The method you described, with requestEnd, involves using a different Middy feature. Both this method and the one I suggested would work and would solve the issue at hand.

With the hook method: 1/ the library author has to offer the plugin, and 2/ users need to remember to import and use it in any of their functions that use middleware that might return early, i.e.:

import middy from '@middy/core';
import { powertoolsCleanupPlugin } from '@aws-lambda-powertools/commons';
import { Tracer, captureLambdaHandler } from '@aws-lambda-powertools/tracer';
import { cacheMiddleware } from './cacheMiddleware'; // see implementation above

const tracer = new Tracer();

export const handler = middy(powertoolsCleanupPlugin()) // this needs to be used every time
  .use(captureLambdaHandler(tracer)) // this is the middleware that we need to cleanup
  .use(cacheMiddleware) // this is the middleware that someone wrote, and that could return early
  .handler(() => {}) // this is the function handler

With the method that I suggested earlier, instead, the implementation for end users and people using both Powertools and custom middlewares would look like this:

import middy from '@middy/core';
import { Tracer, captureLambdaHandler } from '@aws-lambda-powertools/tracer';
import { cacheMiddleware } from './cacheMiddleware'; // see implementation above

const tracer = new Tracer();

export const handler = middy()
  .use(captureLambdaHandler(tracer)) // this is the middleware that we need to cleanup
  .use(cacheMiddleware) // this is the middleware that someone wrote, and that could return early
  .handler(() => {}) // this is the function handler

The difference is that end users don't have to know/care that some middleware needs to be cleaned up, because this responsibility is shifted to the middleware author (see here).

@dreamorosi dreamorosi linked a pull request May 24, 2023 that will close this issue
@github-project-automation github-project-automation bot moved this from Working on it to Coming soon in AWS Lambda Powertools for TypeScript May 24, 2023
@github-actions
Contributor

⚠️ COMMENT VISIBILITY WARNING ⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@github-actions github-actions bot added the pending-release label and removed the confirmed label May 24, 2023
@github-actions
Contributor

github-actions bot commented Jun 9, 2023

This is now released under version v1.9.0!

@dreamorosi dreamorosi added the confirmed label and removed the pending-release label Jun 9, 2023
@dreamorosi dreamorosi moved this from Coming soon to Shipped in AWS Lambda Powertools for TypeScript Jun 9, 2023