Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aws-cdk-lib: Custom resource provider can't exceed 1 hour #24974

Closed
jamiewhit opened this issue Apr 6, 2023 · 12 comments · Fixed by #27533
Closed

aws-cdk-lib: Custom resource provider can't exceed 1 hour #24974

jamiewhit opened this issue Apr 6, 2023 · 12 comments · Fixed by #27533
Assignees
Labels
@aws-cdk/custom-resources Related to AWS CDK Custom Resources aws-cdk-lib Related to the aws-cdk-lib package bug This issue is a bug. documentation This is a problem with documentation. p1

Comments

@jamiewhit
Copy link

jamiewhit commented Apr 6, 2023

Describe the bug

A custom resource provider can be configured to have a totalTimeout of up to 2 hours according to this documentation. However, in practice the custom resource times out in CloudFormation after 1 hour with this message:

CloudFormation did not receive a response from your Custom Resource

Expected Behavior

I expect CloudFormation to allow up to 2 hours for the custom resource to be created.

Current Behavior

The step function which is created by the provider framework is successfully configured to run up to 2 hours, but CloudFormation fails after 1 hour. Even though CloudFormation fails to create/update the resource, the step function keeps running until the resource is successfully provisioned before the 2 hour timeout.

Reproduction Steps

I've created a custom resource provider which runs asynchronously. I've followed this documentation to set up the provider with an isComplete handler which polls the status of my custom resource: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.custom_resources-readme.html#asynchronous-providers-iscomplete.

 const provider = new Provider(this, 'DeploymentProvider', {
    onEventHandler: onEvent,
    isCompleteHandler: isComplete,
    totalTimeout: Duration.hours(2),
    queryInterval: Duration.seconds(30)
});

Possible Solution

Either update the custom resource provider to support a max totalTimeout of 1 hour or clarify the intention of the timeout value, if it's not meant to represent the timeout supported by CloudFormation.

Additional Information/Context

No response

CDK CLI Version

2.72.0

Framework Version

No response

Node.js Version

v14.20.0

OS

macOS 12.6.1

Language

Typescript

Language Version

No response

Other information

No response

@jamiewhit jamiewhit added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 6, 2023
@github-actions github-actions bot added the aws-cdk-lib Related to the aws-cdk-lib package label Apr 6, 2023
@pahud pahud added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Apr 7, 2023
@pahud pahud self-assigned this Apr 7, 2023
@pahud pahud removed the needs-triage This issue or PR still needs to be triaged. label Apr 7, 2023
@pahud
Copy link
Contributor

pahud commented Apr 7, 2023

According to this blog post, I am afraid the default custom resource creation timeout is 1 hour, to extend the timeout, we will need to create a wait condition handler and allow custom resource handler callback the wait condition handler but I did find any relevant implementation in CDK.

I'll discuss with the team internally.

@pahud pahud added p2 effort/medium Medium work item – several days of effort labels Apr 7, 2023
@peterwoodworth
Copy link
Contributor

@pahud the totalTimeout prop ends up getting used in the WaitConditionHandler stepfunction. We use the totalTimeout to calculate the maxAttempts based on the interval count.

const definition = Stack.of(this).toJsonString({
StartAt: 'framework-isComplete-task',
States: {
'framework-isComplete-task': {
End: true,
Retry: [{
ErrorEquals: ['States.ALL'],
IntervalSeconds: props.interval.toSeconds(),
MaxAttempts: props.maxAttempts,
BackoffRate: props.backoffRate,
}],

@pahud
Copy link
Contributor

pahud commented Apr 7, 2023

Thank you @peterwoodworth. Correct me if I was wrong but from what I've observed in the source code and document, the default timeout of the CfnCustomResource is 1 hour according to this document and the custom resource provider is essentially a provider of CfnCustomResource and is responsible to cfn-response to CFN within this limitation.

Now, in this framework, if isCompleteHandler is provided, a waiter state machine will be created and kicked off with the onEvent handler as its entrypoint execution followed by isCompleteHandler with backoff retry invocation. This means:

  1. The onEvent handler will be invoked synchronously with await call and is subject to its 15min timeout, in this case if your onEvent handler can successfully return within 15min, everything is good.
  2. When both onEvent handler and isComplete handler are provided, the framework will kick off the waiter state machine immediately after the onEvent response. By default, the state machine start execution invocation is async. This means it's state machine's responsibility to invoke isCompleteHandler to send the cfn response within the 1 hour.

Now, let's look at the totalTimeout property mentioned above. According to this, it's essentially designed for isCompleteHandler and is calculated here in calculagteRetryPolicy().

With that being said, I think props.totalTimeout is designed only to calculate the maxAttempts and the whole custom resource provider is still subject to and capped by the default 1 hour limitation of the CfnCustomResource timeout mentioned in the document, which I don't think the waiter state machine can break, even not the whole provider framework.

I feel the CfnWaitCondition and CfnWaitConditionHandle might be required as described in this blog post and this github sample code snippet or completely auto created by the provider framework when props.totalTimeout is specified more than 1 hour. This might be a working option to break the 1 hour limitation.

But I could still be wrong and miss something in the source code. Let me know if I missed anything.

@pahud
Copy link
Contributor

pahud commented Apr 11, 2023

I can confirm the 2 hours of totalTimeout does not work.

export class Demotack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const handlerCommonConfig = {
      runtime: lambda.Runtime.PYTHON_3_9,
      code: lambda.Code.fromAsset(path.join(__dirname, '../lambda.d')),
    }

    const onEventHandler = new lambda.Function(this, 'OnEvent', {
      ...handlerCommonConfig,
      handler: 'index.on_event',
    })

    const isCompleteHandler = new lambda.Function(this, 'IsComplete', {
      ...handlerCommonConfig,
      handler: 'index.is_complete',
    })

    const provider = new cr.Provider(this, 'Provider', {
      onEventHandler,
      isCompleteHandler,
      queryInterval: Duration.minutes(15),
      totalTimeout: Duration.hours(2),
      logRetention: log.RetentionDays.ONE_DAY,
    }) 

    new CustomResource(this, 'CR', {
      serviceToken: provider.serviceToken,
      properties: {
        startedAt: Math.floor(Date.now() / 1000),
      }
    })
  }
}

index.py

import time


def on_event(event, context):
  print(event)
  request_type = event['RequestType']
  if request_type == 'Create': return on_create(event)
  if request_type == 'Update': return on_update(event)
  if request_type == 'Delete': return on_delete(event)
  raise Exception("Invalid request type: %s" % request_type)

def on_create(event):
  props = event["ResourceProperties"]
  print("create new resource with props %s" % props)
  return {}

def on_update(event):
  physical_id = event["PhysicalResourceId"]
  props = event["ResourceProperties"]
  print("update resource %s with props %s" % (physical_id, props))
  # ...

def on_delete(event):
  physical_id = event["PhysicalResourceId"]
  print("delete resource %s" % physical_id)
  # ...
  
def is_complete(event, context):
  print(event)
  resource_props = event["ResourceProperties"]
  started_at = int(resource_props["startedAt"])
  now_ts = int(time.time())
  delta = now_ts - started_at
  print(f'now: {now_ts} started: {started_at} delta: {delta}')
  # 2 hours + 5min
  is_ready = delta > 60*60*2+300

  return { 'IsComplete': is_ready }

stack started creation at 8:38am
image

CR started creation at 8:39am
image

CR creation failed at 9:40am
image

@otaviomacedo otaviomacedo added the @aws-cdk/custom-resources Related to AWS CDK Custom Resources label Apr 27, 2023
@BwL1289
Copy link

BwL1289 commented May 28, 2023

@pahud If CR Provider timeout can't exceed 1 hour, is it possible to implement a workaround by deploying my own state machine to take a wait condition?

For context, I am using CDK Pipelines and one of the stages takes ~6-7 hours. After it completes, I need to start the next stage of the pipeline.

Based on this blog post and this article it should be possible, but would appreciate feedback.

Since this is P2, not sure if when this may be implemented and if it will extend to the full 12 hour Sfn limit?

Appreciate the help and guidance.

@BwL1289
Copy link

BwL1289 commented Jun 19, 2023

For anyone reading this, the above is possible. It's a bit non-trivial, but I was able to implement it to bypass the 1 hour timeout. Excited for CDK to support this out of the box.

@AaronZyLee
Copy link

Is there a plan for the fix? The wait condition seems reasonable for over 1 hour resources. I am currently developing a new construct depending on the framework and would like to see the fix is prioritized by team.

@peterwoodworth peterwoodworth added p1 and removed p2 investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Sep 11, 2023
@pahud pahud removed their assignment Oct 11, 2023
@pahud
Copy link
Contributor

pahud commented Oct 12, 2023

@BwL1289

If CR Provider timeout can't exceed 1 hour, is it possible to implement a workaround by deploying my own state machine to take a wait condition?

Yes, that is exactly what I was thinking but unfortunately I didn't make it.

I noticed this issue has been prioritized as p1 and the team will be looking at it shortly. Meanwhile, we appreciate and welcome any PR to address this issue from the community as well. I guess having a custom wait condition could be the way to go.

@BwL1289
Copy link

BwL1289 commented Oct 12, 2023

Thanks @pahud.

I'm not sure I have time to submit a PR, but happy to provide you with my pythonic solution for this.

However, a question I do have that is somewhat related.

With this workaround implemented, if I have an on_update event, will previous values be stored in OldResourceProperties so that I can compare newer values to older ones (like the CR docs state)?

@evgenyka
Copy link
Contributor

We are going to fix the documentation to align with AWS CloudFormation 1 hour timeout. For the long running deployments please see https://aws.amazon.com/blogs/devops/implementing-long-running-deployments-with-aws-cloudformation-custom-resources-using-aws-step-functions/

@evgenyka evgenyka added documentation This is a problem with documentation. and removed effort/medium Medium work item – several days of effort labels Oct 16, 2023
@BwL1289
Copy link

BwL1289 commented Oct 16, 2023

@evgenyka to be clear, is CDK not going to support this for now?

@mergify mergify bot closed this as completed in #27533 Oct 16, 2023
mergify bot pushed a commit that referenced this issue Oct 16, 2023
…27533)

This change fixes the documentation for `ProviderProps.totalTimeout` to specify the correct value of 1 hour.

Closes #24974.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@aws-cdk/custom-resources Related to AWS CDK Custom Resources aws-cdk-lib Related to the aws-cdk-lib package bug This issue is a bug. documentation This is a problem with documentation. p1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants