[CLI] cdk deploy keeps calling describeStackEvents() for ~10 minutes after stack update #9470

kennu · 2020-08-05T19:32:32Z

AWS CDK cdk deploy keeps calling describeStackEvents() for about 10 minutes after the CloudFormation stack has been successfully updated. It seems that CDK iterates through the entire event history of the stack in 5 second intervals. This makes the cdk deploy command take a very long time.

During the delay AWS CDK appears to be stuck in the "creating CloudFormation changeset" phase. In reality it has already created the changeset and applied it to the CloudFormation stack.

Reproduction Steps

You might need a CloudFormation stack that has a lot of event history to notice that AWS CDK iterates through all of them, causing the long delay.

You can see the describeStackEvents() iteration in debug output by using cdk deploy -v -v -v. It will keep showing entries like this (with different NextToken as it progresses):

[AWS cloudformation 200 5.499s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'f73Uze7bI/CdCn4qrXZ31hZlkhi5jbWWog...'
})

What did you expect to happen?

cdk deploy should finish in about 3 minutes.

What actually happened?

cdk deploy takes 14 minutes to complete.

Environment

CLI Version : 1.56.0
Framework Version: 1.56.0
Node.js Version: v12.18.2
OS : Linux (Ubuntu in Docker)
Language (Version): TypeScript 3.9.7

Other

This is 🐛 Bug Report

The text was updated successfully, but these errors were encountered:

kennu · 2020-08-05T20:11:17Z

Here is the sequence of debug output that occurs at the time the stack update finishes. After this the describeStackEvents() go on.

Stack webcat-registrar-dev has an ongoing operation in progress and is not stable (UPDATE_IN_PROGRESS)
[AWS cloudformation 200 6.336s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})
[AWS cloudformation 200 6.115s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})
[AWS cloudformation 200 5.294s 0 retries] describeStacks({ StackName: 'webcat-registrar-dev' })
Stack webcat-registrar-dev has an ongoing operation in progress and is not stable (UPDATE_IN_PROGRESS)
[AWS cloudformation 200 6.107s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})
[AWS cloudformation 200 5.29s 0 retries] describeStacks({ StackName: 'webcat-registrar-dev' })
Stack webcat-registrar-dev has an ongoing operation in progress and is not stable (UPDATE_COMPLETE_CLEANUP_IN_PROGRESS)
[AWS cloudformation 200 6.279s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})
[AWS cloudformation 200 6.309s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})
[AWS cloudformation 200 5.264s 0 retries] describeStacks({ StackName: 'webcat-registrar-dev' })
[AWS cloudformation 200 5.892s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})
[AWS cloudformation 200 5.994s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})
[AWS cloudformation 200 6.231s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})
[AWS cloudformation 200 6.36s 0 retries] describeStackEvents({
  StackName: 'webcat-registrar-dev',
  NextToken: 'xxx'
})

shivlaks · 2020-08-06T04:07:58Z

@kennu what's the stack status? does it end with UPDATE_COMPLETE_CLEANUP_IN_PROGRESS or does it transition to UPDATE_COMPLATE and then the cdk continues polling the stack?

at a cursory glance, it looks like the deploy status is not being classified as terminal so the CLI continues polling the stack until a terminal status is reached or it times out and gives up.

kennu · 2020-08-06T13:48:28Z

I think the stack final status is always UPDATE_COMPLETE. At least it was the last few times when I deployed and took the timings for this issue. Everything looks normal in AWS Console.

kennu · 2020-08-06T13:50:58Z

To my understanding, this is the recursive step that keeps calling describeStackEvents() forever:

aws-cdk/packages/aws-cdk/lib/api/util/cloudformation/stack-activity-monitor.ts

Line 197 in 16d2a48

    
           // only read next page if all the events we read are new events. otherwise, we can rest.

I'm not quite sure about the logic there, but it seems like it just keeps iterating more and more events in my stack(s).

rix0rrr · 2020-08-13T14:33:11Z

If it's true that the paging token is not properly retained between polls, that would be pretty bad.

kennu · 2020-08-13T16:55:58Z

I think that the token is retained, but I just have a lot of events in that stack. So it takes a long time to iterate through all of them.

kennu · 2020-08-13T18:10:53Z

I double checked by adding a console.log() of every stack event that AWS CDK scans through. It really iterates all the stack events from today all the way back to 2019-04-07 when the stack was created.

Unless I misunderstand something, the whole logic of StackActivityMonitor.readEvents() is flawed. It will always scan through the entire stack event history, which keeps growing forever and making deployments slower and slower.

This is the last entry scanned before cdk deploy completes after 15 minutes of iterating events:

{
  StackId: 'arn:aws:cloudformation:eu-west-1:xxx:stack/webcat-registrar-dev/xxx',
  EventId: 'xxx',
  StackName: 'webcat-registrar-dev',
  LogicalResourceId: 'webcat-registrar-dev',
  PhysicalResourceId: 'arn:aws:cloudformation:eu-west-1:xxx:stack/webcat-registrar-dev/xxx',
  ResourceType: 'AWS::CloudFormation::Stack',
  Timestamp: 2019-04-07T00:12:50.615Z,
  ResourceStatus: 'REVIEW_IN_PROGRESS',
  ResourceStatusReason: 'User Initiated'
}
Stack webcat-registrar-dev has completed updating

rix0rrr · 2020-08-18T09:27:20Z

Yep. Turns out we are accidentally quadratic!

The stack monitor used to read all stack events on every tick (every 1-5 seconds). That's right, every 5 seconds it would page through all events that had ever happened to the stack, only to get to the last page to see if there were new events. We were quadratic in our `DescribeStackEvents` API call. This behavior has probably contributed to our users getting throttled by CloudFormation in the past, and only really shows pathological behavior on stacks with a long history (which we ourselves don't routinely manage using `cdk deploy`), which is why it had eluded detection for so long. The fix is to remember the token of the last page between calls, so that we don't have to read through all pages just to get to the last one. Fixes #9470.

…9795) The stack monitor used to read all stack events on every tick (every 1-5 seconds). That's right, every 5 seconds it would page through all events that had ever happened to the stack, only to get to the last page to see if there were new events. We were quadratic in our `DescribeStackEvents` API call. This behavior has probably contributed to our users getting throttled by CloudFormation in the past, and only really shows pathological behavior on stacks with a long history (which we ourselves don't routinely manage using `cdk deploy`), which is why it had eluded detection for so long. The fix is to remember the token of the last page between calls, so that we don't have to read through all pages just to get to the last one. Fixes #9470. ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

…ws#9795) The stack monitor used to read all stack events on every tick (every 1-5 seconds). That's right, every 5 seconds it would page through all events that had ever happened to the stack, only to get to the last page to see if there were new events. We were quadratic in our `DescribeStackEvents` API call. This behavior has probably contributed to our users getting throttled by CloudFormation in the past, and only really shows pathological behavior on stacks with a long history (which we ourselves don't routinely manage using `cdk deploy`), which is why it had eluded detection for so long. The fix is to remember the token of the last page between calls, so that we don't have to read through all pages just to get to the last one. Fixes aws#9470. ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

In #9795, I was under the impression that the `DescribeStackEvents` would list events in chronological order. This was a misunderstanding resulting from not reading the documentation-- in fact, the events are returned in reverse chronological order. The paging logic was therefore wrong and the events would be fed into the activity printer in the wrong order, leading to incorrect progress reporting. This is a critical bug and must be fixed immediately! While I was in there, fixed two other paper cuts around the printer at the same time: - Events occurring near the start of the monitor would be missed. This was because time-based filtering would be employed based on the client machine's local clock, which might be out of sync with the remote clock. Use the server's clock instead, by using the `CreationTime` of the ChangeSet as the start time. - Events occurring near the end of the monitor would be missed. This was because `DescribeStackStatus` might return a completion state before the monitor had polled the events that led up to the completion. Fix by doing a final poll after the stop call has been received, so we are guaranteed to have seen all events. The actual claim in that PR was incorrect. We were *not* quadratic (we wouldn't list all pages on every tick), we were just more linear than necessary (we *would* list all pages on the *first* tick... which is still a lot). The current PR properly fixes the issue by stopping paging not just if we've already seen the events, but also if the events are starting to fall outside the time window we are interested in. This *actually* fixes #9470. ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

kennu added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 5, 2020

SomayaB changed the title ~~[aws-cdk] cdk deploy keeps calling describeStackEvents() for ~10 minutes after stack update~~ [CLI] cdk deploy keeps calling describeStackEvents() for ~10 minutes after stack update Aug 5, 2020

github-actions bot added the package/tools Related to AWS CDK Tools or CLI label Aug 5, 2020

github-actions bot assigned shivlaks Aug 5, 2020

shivlaks added p1 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Aug 6, 2020

SomayaB removed the needs-triage This issue or PR still needs to be triaged. label Aug 6, 2020

github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Aug 7, 2020

rix0rrr added this to the [CLI] Soon milestone Aug 13, 2020

rix0rrr mentioned this issue Aug 18, 2020

fix(cli): stack monitor reads complete stack history every 5 seconds #9795

Merged

mergify bot closed this as completed in #9795 Aug 18, 2020

rix0rrr mentioned this issue Aug 19, 2020

chore(cli): fix bugs introduced in stack monitor #9830

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CLI] cdk deploy keeps calling describeStackEvents() for ~10 minutes after stack update #9470

[CLI] cdk deploy keeps calling describeStackEvents() for ~10 minutes after stack update #9470

kennu commented Aug 5, 2020

kennu commented Aug 5, 2020

shivlaks commented Aug 6, 2020

kennu commented Aug 6, 2020

kennu commented Aug 6, 2020

rix0rrr commented Aug 13, 2020

kennu commented Aug 13, 2020

kennu commented Aug 13, 2020

rix0rrr commented Aug 18, 2020

[CLI] cdk deploy keeps calling describeStackEvents() for ~10 minutes after stack update #9470

[CLI] cdk deploy keeps calling describeStackEvents() for ~10 minutes after stack update #9470

Comments

kennu commented Aug 5, 2020

Reproduction Steps

What did you expect to happen?

What actually happened?

Environment

Other

kennu commented Aug 5, 2020

shivlaks commented Aug 6, 2020

kennu commented Aug 6, 2020

kennu commented Aug 6, 2020

rix0rrr commented Aug 13, 2020

kennu commented Aug 13, 2020

kennu commented Aug 13, 2020

rix0rrr commented Aug 18, 2020