Add AWS x-ray support for API Gateway #5692

Merged
merged 24 commits into from
Apr 15, 2019

@softprops softprops commented Jan 13, 2019

What did you implement:

Closes #5564

This change adds support for declaratively enabling active tracing for API Gateway.

How did you implement it:

If tracing is enabled we'll create a separate Stage resource. This feature is opt-in through explicit tracing configuration. When using a Stage resource we'll remove the StageName from the Deployment resource.

We can extend the Stage resource configurability later on via a dedicated serverless.yml section in provider.
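The approach described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual compile step: the function name `applyTracingStage`, the `ApiGatewayStage` and `ApiGatewayRestApi` logical IDs, and the shape of the options object are all assumptions made for the example.

```javascript
// Hypothetical helper: names and logical IDs are illustrative only.
function applyTracingStage(template, options) {
  const { stage, restApiLogicalId, deploymentLogicalId, tracingEnabled } = options;

  // Opt-in: without tracing, the stage stays implicit on the Deployment.
  if (!tracingEnabled) return template;

  const resources = template.Resources;

  // The stage name moves from the Deployment to the dedicated Stage resource.
  delete resources[deploymentLogicalId].Properties.StageName;

  resources.ApiGatewayStage = {
    Type: 'AWS::ApiGateway::Stage',
    Properties: {
      DeploymentId: { Ref: deploymentLogicalId },
      RestApiId: { Ref: restApiLogicalId },
      StageName: stage,
      TracingEnabled: true,
    },
  };
  return template;
}

// Example input: a compiled template with the stage defined inline.
const exampleTemplate = {
  Resources: {
    ApiGatewayDeployment1547359246920: {
      Type: 'AWS::ApiGateway::Deployment',
      Properties: { RestApiId: { Ref: 'ApiGatewayRestApi' }, StageName: 'dev' },
    },
  },
};

const result = applyTracingStage(exampleTemplate, {
  stage: 'dev',
  restApiLogicalId: 'ApiGatewayRestApi',
  deploymentLogicalId: 'ApiGatewayDeployment1547359246920',
  tracingEnabled: true,
});
```

With `tracingEnabled: false` the template is returned unchanged, which is what keeps the feature opt-in.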

How can we verify it:

To enable this you should only have to add provider.tracing.apiGateway:

provider:
  name: aws
  tracing:
    apiGateway: true

Todos:

  • Write tests
  • Write documentation
  • Fix linting errors
  • Make sure code coverage hasn't dropped
  • Provide verification config / commands / resources
  • Enable "Allow edits from maintainers" for this PR
  • Update the messages below

Is this ready for review?: YES
Is it a breaking change?: NO

@softprops softprops changed the title add support for apigw xray traces by way of stageDescription add support for apigw xray active tracing by way of stageDescription Jan 13, 2019
@softprops
Contributor Author

softprops commented Jan 13, 2019

While the unit tests are passing, an integration test I'm trying is not.

serverless.yml

service: gw-tracing
provider:
  name: aws
  apiGateway:
    stageDescription:
      tracingEnabled: true
functions:
  hello:
    runtime: python3.6
    handler: hello.handler
    events:
      - http: any /

hello.py

def handler(event, ctx):
  return {
    "statusCode": 200
  }

package.json

{
  "dependencies": {
    "serverless": "softprops/serverless#deployment-stage-description"
  }
}
$ npx serverless deploy
...

  Serverless Error ---------------------------------------

  An error occurred: ApiGatewayDeployment1547359246920 - StageDescription cannot be specified when stage referenced by StageName already exists.

I saw a discussion about a similar issue here. I'll let you know what I find.

if (!_.isBoolean(tracingEnabled)) {
  throw new Error('REST API stage description tracingEnabled must be a boolean');
}
return {
Contributor Author


I'm letting tracingEnabled be the only supported (and documented) field associated with stage descriptions for now, but having this method in place should make it straightforward to add more supported fields in the future. Something I've not yet figured out is whether there is a better place to communicate structural validation errors before serverless deploy is invoked.
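One way such a method could stay easy to extend is a table of per-field validators. The sketch below is only an illustration of that idea: `FIELD_VALIDATORS` and `validateStageDescription` are hypothetical names, and the "unsupported field" error is an assumption rather than behavior taken from this PR.

```javascript
// Hypothetical table: one entry per supported stageDescription field.
const FIELD_VALIDATORS = {
  tracingEnabled: { check: (value) => typeof value === 'boolean', expected: 'a boolean' },
};

function validateStageDescription(stageDescription = {}) {
  for (const [field, value] of Object.entries(stageDescription)) {
    const validator = FIELD_VALIDATORS[field];
    if (!validator) {
      // Assumption: unknown fields are rejected rather than silently ignored.
      throw new Error(`REST API stage description field ${field} is not supported`);
    }
    if (!validator.check(value)) {
      throw new Error(`REST API stage description ${field} must be ${validator.expected}`);
    }
  }
  return stageDescription;
}

const validated = validateStageDescription({ tracingEnabled: true });
```

Supporting another field would then mean adding one entry to the table rather than another `if` branch.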

@softprops
Contributor Author

It seems the CloudFormation error only happens after the first deploy. I've seen this error referenced in at least one other serverless plugin.

@softprops
Contributor Author

I'm starting to get the impression that StageDescription may indeed not be supported for updates to a stage, despite working on the first deploy. Another option may be for serverless to somehow manage a Stage resource independently: https://docs.amazonaws.cn/en_us/AWSCloudFormation/latest/UserGuide/aws-resource-apigateway-stage.html

@eahefnawy
Contributor

@softprops thanks a lot for this PR, and especially for the super helpful comments 🙌 . I'm currently reviewing it.

Is it good to go from your side? or are there any other changes or issues you need to address? Just wanna know the status of this PR.

@softprops
Contributor Author

@eahefnawy thanks for posting back.

I think this is going to have to be halted until I figure out the last issue I commented on. Help on that would be welcome. While the code checks out according to the CloudFormation docs, the integration test I posted does not. I need to dig a few levels deeper into how CloudFormation is actually supposed to behave in this case.

The first deploy works (with active tracing enabled), but any deploy afterwards fails with the error

 An error occurred: ApiGatewayDeployment{timestamp} - StageDescription cannot be specified when stage referenced by StageName already exists.

I've read through and tried to grok the following similar cases

mapbox/deprecated-hookshot#9 (comment)

https://github.com/jacob-meacham/serverless-plugin-bind-deployment-id#known-issues

https://forums.aws.amazon.com/thread.jspa?threadID=230706

This kind of makes me think that in order to get this to work I'd need to change some fundamentals of how serverless manages API Gateway deployment stages, potentially as a first-class CloudFormation resource.

I'd appreciate any help I can get knowledge wise on the history of the current setup for apigw deployments. Maybe that was tried before but has since switched to its current state.

The example I provided above should be a good minimal reproducible case for a local test if you have time

@eahefnawy
Contributor

@softprops hmmm yeah I see what you're saying. We are currently adding the stage implicitly via the Deployment resource. I'm hesitant to add a new CF resource for each stage as it'd increase the CF resource count and get us even closer to the 200 resources limit, which is already an issue.

Don't wanna solve a problem by introducing another problem. Fixing this specific problem for a subset of our users who want active tracing might introduce a new resource limit problem for all framework users. On the other hand, it's just a single resource for each stage, which doesn't seem like a lot. hmmm.

Would it be possible to add this stage resource only if active tracing is enabled in sls.yml? If not, then we'd revert to the implicit stage definition in the Deployment resource?

I'm gonna raise this up internally with the team, and let you know 😊

@softprops
Contributor Author

I can try that route. Another way to look at this is that figuring it out would solve multiple problems. Enabling tracing is one problem solved by adding the stage description; here's a list of all the other things this would enable: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-apigateway-stage.html

@nanoflat

This will be really awesome to have

@jplock

jplock commented Jan 29, 2019

Having an explicit StageDescription would help a lot, but it would probably require people to tear down their existing stacks and redeploy to make use of it.

@eahefnawy
Contributor

@softprops we've discussed it internally, and we'd like to merge it in only if you could add the new Stage resource for people activating active tracing in the yml file, otherwise it'd work the same way for the rest of the users.

As @jplock pointed out, this could be a breaking change for some users. Also, if someone is already at the 200 resource limit, they wouldn't be able to make another deployment with this core change.

@softprops
Contributor Author

softprops commented Jan 31, 2019

@eahefnawy I still haven't found a way to work around updates. This only seems to work the first time you deploy. Any time you deploy after that, you'll get the "StageDescription cannot be specified when stage referenced by StageName already exists" error. I'm still not sure what approach would be best to handle that.

You can reproduce the problem with the minimal example posted above

@AdrieanKhisbe

I was working on adding the possibility to configure a Stage with some AccessLogSetting when I came across this issue.
I was developing a plugin, serverless-apigateway-access-logs, that dynamically extends the AWS::ApiGateway::Deployment with a StageDescription. Which works... only the first time.

An error occurred: ApiGatewayDeployment1549288734560 - StageDescription cannot be specified when stage referenced by StageName already exists.

I think the best thing to do would indeed be, as @eahefnawy pointed out, to have an explicit AWS::ApiGateway::Stage in serverless that we could configure.

I'll be happy to give a hand if needed, as this is a current issue we are trying to solve @CoorpAcademy

@softprops
Contributor Author

@AdrieanKhisbe I made that suggestion, then @eahefnawy added that we may need to add the resource conditionally, only when tracing is activated, to avoid making this a breaking change. With your use case we'd also have to add another conditional for users who opt into configuring the Stage's AccessLogSetting.

I'll see if I can get the conditional based on X-Ray tracing done soon, but I'd want to see how well that works before adding further conditional rules for adding a Stage resource.

@AdrieanKhisbe

AdrieanKhisbe commented Feb 5, 2019

@softprops I see.

I was just wondering if the explicit AWS::ApiGateway::Stage in serverless should be configured via an option that would be activated by plugins.
This would keep the serverless core simpler, and that way API Gateway X-Ray tracing could belong to a plugin instead.

What do you think about this @softprops and @eahefnawy ?

@softprops
Contributor Author

I think the hesitation to make the Stage resource the default is based on the CloudFormation limit of 200 resources. Some users already complain that Serverless's implicit stack should contain fewer resources, or let them opt out of some, to leave more room for application resources.

@softprops
Contributor Author

I did a little research on the differences in what can be described by Deployment.StageDescription vs. Stage resources.

Stage Description

{
  "AccessLogSetting" : AccessLogSetting,
  "CacheClusterEnabled" : Boolean,
  "CacheClusterSize" : String,
  "CacheDataEncrypted" : Boolean,
  "CacheTtlInSeconds" : Integer,
  "CachingEnabled" : Boolean,
  "CanarySetting" : CanarySetting,
  "ClientCertificateId" : String,
  "DataTraceEnabled" : Boolean,
  "Description" : String,
  "DocumentationVersion" : String,
  "LoggingLevel" : String,
  "MethodSettings" : [ MethodSetting, ... ],
  "MetricsEnabled" : Boolean,
  "Tags" : [ Resource Tag, ... ],
  "ThrottlingBurstLimit" : Integer,
  "ThrottlingRateLimit" : Number,
  "TracingEnabled" : Boolean,
  "Variables" : { String:String, ... }
}

Stage

{
    "AccessLogSetting" : AccessLogSetting,
    "CacheClusterEnabled" : Boolean,
    "CacheClusterSize" : String,
    "CanarySetting" : CanarySetting,
    "ClientCertificateId" : String,
    "DeploymentId" : String,
    "Description" : String,
    "DocumentationVersion" : String,
    "MethodSettings" : [ MethodSetting, ... ],
    "RestApiId" : String,
    "StageName" : String,
    "Tags" : [ Resource Tag, ... ],
    "TracingEnabled" : Boolean,
    "Variables" : { String:String, ... }
  }

Interestingly, there are more properties in StageDescription than in Stage itself,
namely CacheDataEncrypted, CacheTtlInSeconds, CachingEnabled, DataTraceEnabled, LoggingLevel, MetricsEnabled, ThrottlingBurstLimit, ThrottlingRateLimit.

It seems all of these fields are also properties on the elements of Stage.MethodSettings (MethodSetting).

This raises the question of whether it would be misleading to expose stage settings on provider.apiGateway.stageDescription when the CloudFormation resource serverless would generate, an actual Stage, cannot represent a direct mapping.

Let me know if this is thinking too far ahead, i.e. supporting stage properties other than tracingEnabled.
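To make the mismatch concrete, here is a rough sketch of what translating a StageDescription-shaped object into Stage properties could look like. It is an illustration under stated assumptions, not proposed framework code: the function name is hypothetical, and applying the extra fields to every method through a wildcard MethodSetting (`HttpMethod: '*'`, `ResourcePath: '/*'`) is one possible choice among several.

```javascript
// Fields listed above that exist on StageDescription but not directly on Stage.
const METHOD_LEVEL_FIELDS = [
  'CacheDataEncrypted', 'CacheTtlInSeconds', 'CachingEnabled', 'DataTraceEnabled',
  'LoggingLevel', 'MetricsEnabled', 'ThrottlingBurstLimit', 'ThrottlingRateLimit',
];

// Hypothetical translation from a StageDescription to Stage properties.
function stageDescriptionToStageProperties(description) {
  const stageProps = {};
  const methodSetting = {};

  for (const [key, value] of Object.entries(description)) {
    if (METHOD_LEVEL_FIELDS.includes(key)) {
      methodSetting[key] = value; // only expressible through MethodSettings
    } else {
      stageProps[key] = value; // shared by both shapes, copied as-is
    }
  }

  if (Object.keys(methodSetting).length > 0) {
    // Assumption: the stage-wide value becomes a wildcard method setting.
    stageProps.MethodSettings = [
      ...(stageProps.MethodSettings || []),
      { HttpMethod: '*', ResourcePath: '/*', ...methodSetting },
    ];
  }
  return stageProps;
}

const converted = stageDescriptionToStageProperties({
  TracingEnabled: true,
  MetricsEnabled: true,
  LoggingLevel: 'INFO',
});
```

The translation is not a one-to-one mapping, which is the concern raised above: a user writing a flat stageDescription would get a differently shaped resource.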

@pmuens
Contributor

pmuens commented Apr 8, 2019

Thanks for commenting @orwell1984 👍

"Hi @pmuens, thanks for your work on this. I was wondering if it's possible to adopt this feature in "create-only" mode, so to speak? The whole CF conundrum is caused by updating existing lambda, right?"

Yes, that sounds like a good plan and seems to be the last resort for the time being to get this merged...

@pmuens
Contributor

pmuens commented Apr 10, 2019

I looked into the CloudFormation problems with the current implementation again and it seems like there's no way to resolve them in a non-disruptive fashion.

Given that, I've updated the docs and added a comment about the current limitations and the workaround of removing and re-deploying the API Gateway.

Using API Gateway X-Ray Tracing is opt-in, so users should know what to expect. Since it's opt-in we should move forward and merge this into master.

@dschep @eahefnawy can you take a final look into this and merge if appropriate? Thanks!

@pmuens
Contributor

pmuens commented Apr 10, 2019

I've added some logic which inspects the CloudFormation error message and adds some information on how to solve the problem. This way the user directly sees in the CLI that a remove and re-deployment is necessary.
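A minimal sketch of that kind of error inspection is shown below. It assumes the message to match is the StageDescription error quoted earlier in this thread; the text matched by the merged code may differ, and `explainDeployError` and the hint wording are hypothetical.

```javascript
// Assumption: this is the CloudFormation message fragment to look for.
const STAGE_EXISTS_FRAGMENT =
  'StageDescription cannot be specified when stage referenced by StageName already exists';

// Hypothetical hint text, paraphrasing the documented workaround.
const REDEPLOY_HINT =
  'NOTE: Enabling API Gateway X-Ray Tracing for existing deployments ' +
  'requires a remove and re-deploy of your API Gateway.';

function explainDeployError(error) {
  const message = String(error && error.message);
  if (!message.includes(STAGE_EXISTS_FRAGMENT)) {
    return error; // unrelated errors pass through untouched
  }
  // Keep the original message so the CloudFormation detail is not lost.
  return new Error(`${message}\n\n${REDEPLOY_HINT}`);
}

const explained = explainDeployError(
  new Error(`An error occurred: ApiGatewayDeployment1547359246920 - ${STAGE_EXISTS_FRAGMENT}.`)
);
```

Matching on message text is brittle, since the wording is controlled by AWS and can change without notice.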

@pmuens
Contributor

pmuens commented Apr 11, 2019

Here's another update after testing several edge cases with current implementation today:

During testing I experienced an odd behavior. If one enables tracing, deploys and afterwards disables tracing and deploys again, the stage disappears from the AWS API Gateway console. It can still be found in the compiled CloudFormation template and you can still access the deployed API, but the stage seems to be inaccessible on the AWS console. I'm not sure if this is a problem on AWS end or if it has something to do with our compiled CloudFormation template.

My hunch is that it's because we're going back from a dedicated AWS::ApiGateway::Stage resource to an inline StageName definition in the AWS::ApiGateway::Deployment resource. The odd thing is that we're not seeing any error thrown by AWS.

I'm putting this PR on-hold again since it's not clear if this is a breaking change or if there's anything we can do to work around this issue.

@pmuens
Contributor

pmuens commented Apr 12, 2019

Knock, knock, it's me again...

Just looked into this again. Apparently there's no way to fix this (as we already know). Because of this I implemented a check which will diff the old CloudFormation template with the new one and print helpful error messages showing that upgrades require a remove and re-deploy and that downgrades might result in unexpected behavior (the user can use --force to proceed anyway).

This logic will automatically apply to all future PRs which will also switch to use the new Stage resource.
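The idea of the template diff can be sketched as follows. This is a simplified illustration, not the merged implementation: the function names are hypothetical, resources are located by type rather than by logical ID, and treating --force as applying only to the downgrade case is an assumption.

```javascript
function findByType(resources, type) {
  return Object.values(resources).filter((resource) => resource.Type === type);
}

// Returns 'upgrade', 'downgrade' or null for a pair of compiled templates.
function classifyStageChange(oldResources, newResources) {
  const oldInlineStage = findByType(oldResources, 'AWS::ApiGateway::Deployment')
    .some((deployment) => Boolean(deployment.Properties && deployment.Properties.StageName));
  const oldHasStage = findByType(oldResources, 'AWS::ApiGateway::Stage').length > 0;
  const newHasStage = findByType(newResources, 'AWS::ApiGateway::Stage').length > 0;

  if (oldInlineStage && newHasStage) return 'upgrade';
  if (oldHasStage && !newHasStage) return 'downgrade';
  return null;
}

function checkStageChange(oldResources, newResources, { force = false } = {}) {
  const change = classifyStageChange(oldResources, newResources);
  if (change === 'upgrade') {
    throw new Error('Enabling API Gateway X-Ray Tracing for existing deployments ' +
      'requires a remove and re-deploy of your API Gateway.');
  }
  if (change === 'downgrade' && !force) {
    throw new Error('Disabling API Gateway X-Ray Tracing for existing deployments ' +
      'might result in unexpected behavior. Use the --force option to proceed.');
  }
  return change;
}

// Example templates: inline stage on the Deployment vs. a dedicated Stage.
const inlineTemplate = {
  Deployment: { Type: 'AWS::ApiGateway::Deployment', Properties: { StageName: 'dev' } },
};
const stageTemplate = {
  Deployment: { Type: 'AWS::ApiGateway::Deployment', Properties: {} },
  Stage: { Type: 'AWS::ApiGateway::Stage', Properties: { StageName: 'dev' } },
};
```

A type-based check like this cannot tell a framework-generated Stage from one the user defined under resources, so it would need extra care with hand-written stage resources.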

Contributor

@pmuens pmuens left a comment


Just tested this again and it works fine. LGTM :shipit:

@pmuens pmuens merged commit 90a7adf into serverless:master Apr 15, 2019
@pmuens pmuens added this to the 1.41.0 milestone Apr 15, 2019
@jonathanalberghini

And even after deploying once, do I always need to use the --force option?
provider:
  name: aws
  tracing:
    apiGateway: false

Otherwise I get this error:
Serverless Error ---------------------------------------

NOTE: Disabling API Gateway X-Ray Tracing for existing deployments might result in unexpected behavior.
We recommend to remove and re-deploy your API Gateway. Use the --force option if you want to proceed with the deployment.

Please refer to our documentation for more information.

@gootdude

gootdude commented Apr 22, 2019

This feature has introduced a production-blocking bug as well: we can no longer deploy using serverless 1.41.0 because the checkForBreakingChanges.js logic always errors and states that we must remove our existing API Gateway. Even after removing the API Gateway, subsequent deploys fail because of this new check.

          // 1. if the user wants to upgrade to use the new AWS::APIGateway::Stage resource but
          // the old state still uses the stage defined on the AWS::ApiGateway::Deployment resource
          if (oldResources[oldDeploymentLogicalId] && oldResources[oldDeploymentLogicalId].Properties.StageName && newResources[stageLogicalId]) { // eslint-disable-line max-len
            const msg = [
              'NOTE: Enabling API Gateway X-Ray Tracing for existing ',
              'deployments requires a remove and re-deploy of your API Gateway. ',
              '\n\n  ',
              'Please refer to our documentation for more information.',
            ].join('');
            throw new this.serverless.classes.Error(msg);
          }

I have provided the relevant portion of my serverless.yml, including the CloudFormation resources, below. We define these to enable detailed access logging in addition to X-Ray tracing, and we set the deployment ID using the serverless-plugin-bind-deployment-id plugin.

resources:
  Resources:
    ABCApiGateway:
      Type: AWS::ApiGateway::RestApi
      Properties:
        Name: ${self:provider.stage}-ABC
    ApiGatewayStage:
      Type: 'AWS::ApiGateway::Stage'
      Properties:
        DeploymentId:
          Ref: __deployment__
        RestApiId:
          Ref: ABCApiGateway
        StageName: ${self:provider.stage}
        TracingEnabled: true
        MethodSettings:
          - HttpMethod: "*"
            ResourcePath: "/*"
            MetricsEnabled: true
            DataTraceEnabled: false
        AccessLogSetting:
          Format: '{"apiId":"$context.apiId","stage":"$context.stage","resourcePath":"$context.resourcePath","requestId":"$context.requestId","awsEndpointRequestId":"$context.awsEndpointRequestId","xrayTraceId":"$context.xrayTraceId","requestTime":"$context.requestTime","requestTimeEpoch":$context.requestTimeEpoch,"httpMethod":"$context.httpMethod","status":"$context.status","path":"$context.path"}'
          DestinationArn:
                Fn::GetAtt:
                  - ABCAccessLogGroup
                  - Arn

@exoego
Contributor

exoego commented Apr 23, 2019

@jonathannaguin
@gootdude
Could you open a new issue with detailed steps and configurations to reproduce?
Thanks in advance !!

@pmuens
Contributor

pmuens commented Apr 23, 2019

"This is a stupid way to implement."

@jonathanalberghini while I agree that this is annoying I have to say that we only had the best intentions when implementing this. The main reason was to provide some safeguards when people want to upgrade and use AWS X-Ray.

Saying that something is "a stupid way to implement" is very harsh and insulting. That's not how you should talk to people on the internet.


@exoego thanks for jumping in 👍

@gootdude thanks for providing in-depth information. I just prepared a PR where I remove the check for breaking changes. Will be out soon and after that I'll publish a patch release.

@pmuens
Contributor

pmuens commented Apr 23, 2019

Quick update that we've just published https://github.com/serverless/serverless/releases/tag/v1.41.1 which should fix the problem.

Successfully merging this pull request may close these issues.

Add support for AWS x-ray on AWS API Gateway