Regression: Upgrading to version 1.13.0 Increased our CPU usage by almost 10x #3358

Closed
RicardoNalesAmato opened this issue Apr 20, 2021 · 18 comments
Assignees
Labels
t:bug Something isn't working

Comments

@RicardoNalesAmato

RicardoNalesAmato commented Apr 20, 2021

Describe the bug
Going from version 1.12.2 to 1.13.0 increased our CPU usage (per pod) by almost 10x (from around 100-150m to 1500m)

To Reproduce
Steps to reproduce the behavior:

  1. Have around 15 Mappings/Hosts in 2 different namespaces (our Mappings use regex).
  2. Upgrade from version 1.12.x to 1.13.0

Expected behavior
CPU usage should stay roughly the same as on 1.12.x; instead, it goes up dramatically.

Versions (please complete the following information):

  • Ambassador: 1.13.0
  • Kubernetes environment: Self-managed Kubernetes Cluster
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-21T20:23:45Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.9", GitCommit:"9dd794e454ac32d97cde41ae10be801ae98f75df", GitTreeState:"clean", BuildDate:"2021-03-18T01:00:06Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
@rhs
Contributor

rhs commented Apr 20, 2021

@RicardoNalesAmato Would it be possible for you to post the (possibly redacted) manifests of the ambassador resources in your cluster as well as the ambassador deployment itself? (That would greatly expedite our efforts to reproduce the issue.)

@RicardoNalesAmato
Author

RicardoNalesAmato commented Apr 21, 2021

@rhs Absolutely!

To reproduce the issue, we simply swap the image in the Deployment below to 1.13.0:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    app.kubernetes.io/instance: ambassador-internal-default
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ambassador-internal
    app.kubernetes.io/part-of: ambassador-internal-default
    helm.sh/chart: ambassador-internal-6.5.21
    product: aes
  name: ambassador-internal-default
  namespace: ambassador-internal
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: ambassador-internal-default
      app.kubernetes.io/name: ambassador-internal
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum/config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
      creationTimestamp: null
      labels:
        app: ambassador-internal
        app.kubernetes.io/instance: ambassador-internal-default
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: ambassador-internal
        app.kubernetes.io/part-of: ambassador-internal-default
        product: aes
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - ambassador-internal
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: AMBASSADOR_NAMESPACE
          value: ambassador-internal
        - name: AMBASSADOR_DRAIN_TIME
          value: "180"
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "true"
        - name: AMBASSADOR_ID
          value: internal
        image: eu.gcr.io/images/ambassador:1.12.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1
        name: ambassador-internal
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        - containerPort: 8443
          name: https
          protocol: TCP
        - containerPort: 8877
          name: admin
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 200m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp/ambassador-pod-info
          name: ambassador-pod-info
          readOnly: true
      dnsConfig:
        options:
        - name: ndots
          value: "2"
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 8888
      serviceAccount: ambassador-internal-default
      serviceAccountName: ambassador-internal-default
      terminationGracePeriodSeconds: 30
      volumes:
      - downwardAPI:
          defaultMode: 420
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels
            path: labels
        name: ambassador-pod-info

We have 31 Hosts across 3 namespaces (15 / 15 / 1). They all look like this (only the hostname changes):

apiVersion: getambassador.io/v2
kind: Host
metadata:
  name: myservice.com
spec:
  ambassador_id: internal
  hostname: myservice.com
  requestPolicy:
    insecure:
      action: Route
      additionalPort: 8080

We have 7 TLSContexts across 3 namespaces (3 / 3 / 1). They all look like this:

apiVersion: getambassador.io/v2
kind: TLSContext
metadata:
  name: myservice-tls
spec:
  ambassador_id: internal
  hosts:
  - myservice.com
  secret: myservice-tls-secret

And we have 31 Mappings across 3 namespaces (15 / 15 / 1). Again, they all look pretty much the same:

apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: myservice-mapping
spec:
  ambassador_id: internal
  host: myservice.*
  host_regex: true
  load_balancer:
    policy: least_request
  prefix: /
  resolver: endpoint
  service: myservice:8281
  timeout_ms: 0

Regarding the mappings, some do not use the endpoint resolver (and thus no advanced load balancing is used either) and use the default timeout_ms.
We use the .* regex since some of our old services still add the port to the host header.

Thanks for looking into this!

@vichaos

vichaos commented Apr 21, 2021

Could you compare the generated Envoy config (http://127.0.0.1:8001/config_dump) between those two versions?

Also, could you try adding strip_matching_host_port: true in the Ambassador Module instead of using host_regex?

They added this feature here: https://github.com/datawire/ambassador/blob/master/CHANGELOG.md#1120-march-08-2021
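
For illustration, a minimal sketch of such a Module is below; the metadata name and namespace are assumptions based on the resources posted above, not taken from this thread:

apiVersion: getambassador.io/v2
kind: Module
metadata:
  name: ambassador
  namespace: ambassador-internal
spec:
  ambassador_id: internal
  config:
    # Strip a ":port" suffix from the incoming Host header before matching,
    # which may remove the need for host_regex on every Mapping.
    strip_matching_host_port: true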

@RicardoNalesAmato
Author

RicardoNalesAmato commented Apr 21, 2021

@vichaos we tried using that feature, but the port needs to match the one Envoy is listening on. So if the port in the request is 80, it doesn't match the 8080 we are currently using (the default, for security reasons), and it doesn't work.

We are thinking about changing the ports used by Envoy (to 80 and 443) so we can use this feature.

@RicardoNalesAmato
Author

I’ll post the dump as soon as possible.

@esmet
Contributor

esmet commented Apr 21, 2021

@RicardoNalesAmato What does your Ambassador module config look like? Were you using prune_unreachable_routes by any chance? Feel free to reach out in the Ambassador Slack too.

@RicardoNalesAmato
Author

RicardoNalesAmato commented Apr 21, 2021

@esmet our module is using the following options:

spec:
    ambassador_id: internal
    config:
      add_linkerd_headers: true
      envoy_log_type: json
      suppress_envoy_headers: true
      use_proxy_proto: true
      xff_num_trusted_hops: 1
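
(For reference, that fragment corresponds to a complete Module resource roughly like the sketch below; the metadata name and namespace are assumptions, not taken from this thread.)

apiVersion: getambassador.io/v2
kind: Module
metadata:
  name: ambassador
  namespace: ambassador-internal
spec:
  ambassador_id: internal
  config:
    add_linkerd_headers: true
    envoy_log_type: json
    suppress_envoy_headers: true
    use_proxy_proto: true
    xff_num_trusted_hops: 1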

@esmet
Contributor

esmet commented Apr 21, 2021

Good to know. Are you able to tell which process inside the container is using the most CPU? I'm curious whether it's Envoy or the Ambassador configuration-processing bits.

@RicardoNalesAmato
Author

[Screenshot: Screen Shot 2021-04-21 at 12 09 09 PM]

@esmet these are the processes that are using the most CPU. And here is the difference between 1.12.2 and 1.13.0 (only the image was changed in one of our testing environments):
[Screenshot: Screen Shot 2021-04-21 at 12 10 53 PM]

@rhs
Contributor

rhs commented Apr 21, 2021

@RicardoNalesAmato It's a bit of a shot in the dark, but I have a candidate fix. Could you try running the following image and see if it helps any: datawiredev/ambassador:6eda18fc337d

@RicardoNalesAmato
Author

@rhs absolutely, let me give it a try and get back to you

@RicardoNalesAmato
Author

@rhs your new version fixed it!

On the left you can see when I deployed 1.13.0; I then reverted back to 1.12.2. The latest deployment is the image you provided me :)
[Screenshot: Screen Shot 2021-04-21 at 1 05 35 PM]

@RicardoNalesAmato
Author

The CPU usage went from 400m+ to ~40m.

[Screenshot: Screen Shot 2021-04-21 at 1 08 15 PM]

@rhs
Contributor

rhs commented Apr 21, 2021

@RicardoNalesAmato Thanks, I will make sure that fix makes it into an upcoming release!

FYI, if you can avoid using regular expressions in the host field of your Mappings, there is an Ambassador Module option named prune_unreachable_routes that, if enabled, may result in significantly more efficient Envoy configurations for your situation.
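
For illustration, a rough sketch of what that could look like, assuming the hostnames no longer need the .* regex; the Module name is an assumption, and the Mapping mirrors the example posted earlier in this thread:

apiVersion: getambassador.io/v2
kind: Module
metadata:
  name: ambassador
spec:
  ambassador_id: internal
  config:
    # Lets routes whose Hosts can never match be pruned from the Envoy config;
    # this only helps when Mappings use literal hostnames rather than regexes.
    prune_unreachable_routes: true
---
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: myservice-mapping
spec:
  ambassador_id: internal
  # Literal hostname instead of host: myservice.* with host_regex: true
  host: myservice.com
  prefix: /
  service: myservice:8281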

@rhs rhs changed the title Upgrading to version 1.13.0 Increased our CPU usage by almost 10x Regression: Upgrading to version 1.13.0 Increased our CPU usage by almost 10x Apr 21, 2021
@khussey khussey added the t:bug Something isn't working label Apr 21, 2021
@khussey khussey added this to the 2021 Cycle 3 Cool-down milestone Apr 21, 2021
@RicardoNalesAmato
Author

@rhs @esmet @vichaos thanks a lot for the help!

@khussey
Contributor

khussey commented Apr 22, 2021

This has been fixed in 1.13.1, which is now available.

@khussey khussey closed this as completed Apr 22, 2021
@blyles

blyles commented Apr 24, 2021

We're still seeing this in 1.13.1; reverting to 1.12.2 is currently underway.

@rhs
Contributor

rhs commented Apr 24, 2021

@vaustral Can you post the (possibly redacted) resources in your cluster so we can try to reproduce?
