aws-cdk: Asset publishing fails every second time #24298

Closed
abrose opened this issue Feb 23, 2023 · 13 comments
Labels
@aws-cdk/aws-lambda, bug

Comments


abrose commented Feb 23, 2023

Describe the bug

Hello,

We are using CDK with SST as a higher-level wrapper. One of our SST apps consists of seven stacks; together they build up a pretty huge, parallelized step function which calls many Lambda functions, which themselves do different machine learning tasks. Now the deployment of one or several of the stacks has suddenly started failing. And this is the strange part: when the deployment is executed again, it works. Next time, it fails. Then it works again.

Expected Behavior

Previously everything worked perfectly. We already suspected that the issue might be the number of resources of the stack, so we split some of the stacks to several smaller stacks but this also didn't help.

Current Behavior

Here is the screenshot of the error (private parts blacked out):
[Screenshot: cdk_build_error]

Unfortunately the error message is not helpful at all. At the moment I have no clue what the reason for the failures is or how to fix it. I tried deleting the .build folder after each build, but it doesn't help. I'm using SST 1.18.4 and CDK 2.50.0 on Node 16.13.1. I tried switching to Node 18, but that didn't help either.

Reproduction Steps

Simply running npx sst deploy triggers the CDK build and the build fails every second time.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.50.0

Framework Version

No response

Node.js Version

16.13.1

OS

Ubuntu 20.04 (WSL2@Win11)

Language

Typescript, Python

Language Version

Python 3.9

Other information

Any hint or suggestion would be much appreciated, since we need to meet a deadline and this issue is a hard blocker for us right now.

Regards,
Alfred

abrose added the bug and needs-triage labels Feb 23, 2023
github-actions bot added the @aws-cdk/aws-lambda label Feb 23, 2023

joni3k commented Feb 23, 2023

We've noticed the same issue with one of our CDK apps, which deploys a large number of Lambdas.

Contributor

mrgrain commented Feb 23, 2023

Can you run your cdk commands with --debug and -vvv for additional output?
(But I don't know if/how you can achieve this with SST)

pahud added the investigating label and removed the needs-triage label Feb 23, 2023
Author

abrose commented Feb 23, 2023

I can run sst deploy --verbose which also enables the CDK verbose mode, but I'm not sure if the debug mode is also enabled. Anyway, this is the output using --verbose, there is slightly more output, but nothing really new.
[Screenshot: cdk_build_error2]

Looking at the output I had a quite stupid thought - could it possibly be the case that due to some async stuff the publishing process starts before the asset zip is finished? Apparently there is a matching zip file but it still has the suffix "_tmp". Maybe the large amount of resources in the stack slows down the zip creation to a point where this happens? Then the next run has a "warm" cache and the zip file is created fast enough?

Author

abrose commented Feb 23, 2023

Here's another bit of information which might help - I tried to deploy changes only in one substack (with sst one can target substacks individually) and this worked perfectly.

Contributor

mrgrain commented Feb 23, 2023

It's an interesting thought. Something like that might well be the case.

Looking at the code, though, the promise appears to be awaited correctly:

await writeZipFile(directory, temporaryOutputFile);
await moveIntoPlace(temporaryOutputFile, outputFile, logger);

Would you be able to post the verbose output for one of the successful second runs?
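
For readers unfamiliar with the pattern in the two quoted lines, here is a minimal sketch of "write to a temporary file, then move it into place". It is not the cdk-assets implementation: `writeFileAtomically` and the temporary naming scheme are hypothetical, and the sketch writes an arbitrary buffer rather than a zip. The point is only that the final path never holds a half-written file, provided each writer uses its own temporary name.

```typescript
import { promises as fs } from "node:fs";
import { randomBytes } from "node:crypto";

// Hypothetical helper (not from cdk-assets): write `contents` so that the
// final path only ever appears once the data has been written in full.
async function writeFileAtomically(
  outputFile: string,
  contents: Buffer | string,
): Promise<void> {
  // Assumption: a per-call random suffix, so two concurrent writers targeting
  // the same output never share a temporary file. The temp file lives next to
  // the output so that rename() stays on one filesystem.
  const suffix = randomBytes(6).toString("hex");
  const temporaryOutputFile = `${outputFile}.${suffix}._tmp`;

  try {
    await fs.writeFile(temporaryOutputFile, contents);
    // On the same filesystem, rename replaces the destination in one step.
    await fs.rename(temporaryOutputFile, outputFile);
  } catch (err) {
    // Best-effort cleanup of the temporary file before re-throwing.
    await fs.rm(temporaryOutputFile, { force: true });
    throw err;
  }
}
```

Whether a shared temporary path played any role in this issue is not established in this thread; the sketch only shows why the naming of the temporary file matters once several builds run at the same time.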

Contributor

mrgrain commented Feb 23, 2023

Also, we had this PR merged recently, #24026, which fixes an issue that's at least very similar to yours: #23290

That fix is only available from v2.64.0 though
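
A quick way to check whether an installed CDK version is at or above that threshold is a plain numeric comparison of the version parts. The sketch below is illustrative only: `compareVersions` and `hasAssetFix` are hypothetical helpers, and the comparison assumes simple `major.minor.patch` strings without pre-release tags.

```typescript
// Hypothetical helper: compare two "major.minor.patch" strings numerically.
// Returns -1, 0, or 1. Pre-release tags are not handled (assumption).
function compareVersions(a: string, b: string): number {
  const partsA = a.split(".").map(Number);
  const partsB = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (partsA[i] ?? 0) - (partsB[i] ?? 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

// v2.64.0 is the first version mentioned in this thread as containing the fix.
const FIRST_FIXED_VERSION = "2.64.0";

function hasAssetFix(installedVersion: string): boolean {
  return compareVersions(installedVersion, FIRST_FIXED_VERSION) >= 0;
}

console.log(hasAssetFix("2.50.0")); // false
console.log(hasAssetFix("2.64.0")); // true
```

A numeric comparison matters here because a string comparison would order "2.9.0" after "2.64.0".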

Author

abrose commented Feb 23, 2023

Unfortunately I can't figure out how to "force" SST to use a newer CDK build. I created a post in SST's discord for this issue and hope that there is a simple way to test this out.
Regarding the verbose output of a successful (mostly cached) run: at first glance I could not identify any interesting parts. The entire log is almost 500 KB and there is too much to black out. Can I send it directly to you somehow? It's a client's project and the stack names could give away some sensitive bits of information about the product we're building.

Contributor

mrgrain commented Feb 23, 2023

> Unfortunately I can't figure out how to "force" SST to use a newer CDK build. I created a post in SST's discord for this issue and hope that there is a simple way to test this out.

Yeah that's something you will have to take up with SST unfortunately. :(

> Regarding the verbose output of a successful (mostly cached) run - on the first glance I could not identify any interesting parts.

I basically want to confirm what happens on the second run. Is the previously failing asset in question (identifiable by the hash) loaded from cache, or does building suddenly work? If it is loaded from cache, there will be a message in gray prefixed with cached; otherwise you will see the build: Zip.... from above.

> The entire log is almost 500kb large and there is too much to black out - can I send it directly to you somehow? It's a client's project and the stack names could give away some sensible bits of information about the product we're building.

Not sure what your requirements are. My DMs are open on cdk.dev though. I'm @mrgrain there as well.
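
With a log of roughly 500 KB, the check described above (was the asset with a given hash served from cache or rebuilt?) is easier to script than to eyeball. The following is a rough sketch under stated assumptions: `classifyAsset` is a hypothetical helper, and the matching on the substrings `cached` and `build:` is inferred from the description in this comment, not from a documented log format.

```typescript
type AssetOutcome = "cached" | "built" | "unknown";

// Hypothetical helper: look only at log lines that mention the asset hash
// and report whether any of them indicates a cache hit or a fresh build.
// Assumption: cache hits contain "cached" and builds contain "build:".
function classifyAsset(logLines: string[], assetHash: string): AssetOutcome {
  const relevant = logLines.filter((line) => line.includes(assetHash));
  if (relevant.some((line) => line.includes("cached"))) return "cached";
  if (relevant.some((line) => line.includes("build:"))) return "built";
  // The hash is absent, or it appears only in lines with neither marker.
  return "unknown";
}
```

To use it, read the verbose output into an array of lines and pass the hash from the failing run. An "unknown" result means the hash shows up without either marker, or not at all.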

Contributor

mrgrain commented Feb 23, 2023

Based on the other issue, it might also help to disable concurrency. That would make everything slower though.
But you know, just to confirm if it's the same problem.

Author

abrose commented Feb 23, 2023

The failing hash from the earlier run appears multiple times in the following successful run, but there is no "cached" string or anything else indicating that it was loaded from cache. Here is an example of one of the occurrences:
[Screenshot: cdk_build_error3]

There is also nothing being rebuilt as far as I can see.

Regarding disabling concurrency - I've already tried to disable SST's concurrency by setting SST_BUILD_CONCURRENCY=1 but this didn't solve the issue.

Author

abrose commented Feb 24, 2023

So, we found a way to force SST to use the newer CDK (by using package.json overrides), and yes, the problem is gone. Now we need to find out if and how fast SST will update its dependency, and whether we continue to use it or decide to use CDK directly.
Thank you for your support!
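
The thread does not show the overrides that were actually used. As an illustration only, here is a sketch of adding an npm `overrides` entry to a package.json object. `withCdkOverrides` is a hypothetical helper, and the package names in the usage line (`aws-cdk`, `aws-cdk-lib`) are an assumption about which dependencies would need pinning, not something confirmed above.

```typescript
// Simplified shape of package.json: only the flat form of "overrides" is
// modeled here (npm also allows nested override objects).
type PackageJson = {
  overrides?: Record<string, string>;
  [key: string]: unknown;
};

// Hypothetical helper: return a copy of `pkg` in which every listed package
// is forced to `version` through the npm "overrides" field.
function withCdkOverrides(
  pkg: PackageJson,
  packages: string[],
  version: string,
): PackageJson {
  // Keep any overrides that already exist and add or replace ours.
  const overrides: Record<string, string> = { ...(pkg.overrides ?? {}) };
  for (const name of packages) {
    overrides[name] = version;
  }
  return { ...pkg, overrides };
}

// Assumed package names and version; adjust to whatever the wrapper pulls in.
const patched = withCdkOverrides({ name: "my-sst-app" }, ["aws-cdk", "aws-cdk-lib"], "2.64.0");
console.log(JSON.stringify(patched, null, 2));
```

After editing package.json this way, dependencies need to be reinstalled for the override to take effect. Note that a wrapper such as SST may not have been tested against the forced version.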

Contributor

mrgrain commented Feb 24, 2023

That's excellent news. Will be good to know what SST comes back with, but overrides might be a workable solution either way.

Closing this issue here.

mrgrain closed this as completed Feb 24, 2023
mrgrain removed the investigating label Feb 24, 2023