Regression: Upgrading to version 1.13.0 Increased our CPU usage by almost 10x #3358

Closed
RicardoNalesAmato opened this issue Apr 20, 2021 · 18 comments
Assignees
Labels
t:bug Something isn't working

Comments

@RicardoNalesAmato

RicardoNalesAmato commented Apr 20, 2021

Describe the bug
Going from version 1.12.2 to 1.13.0 increased our CPU usage (per pod) by almost 10x (from around 100-150m to 1500m)

To Reproduce
Steps to reproduce the behavior:

  1. Have around 15 Mappings/Hosts in 2 different namespaces (our Mappings use regex).
  2. Upgrade from version 1.12.x to 1.13.0

Expected behavior
CPU usage should stay roughly the same as on 1.12.x; instead, it goes up dramatically.

Versions (please complete the following information):

  • Ambassador: 1.13.0
  • Kubernetes environment: Self-managed Kubernetes Cluster
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-21T20:23:45Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.9", GitCommit:"9dd794e454ac32d97cde41ae10be801ae98f75df", GitTreeState:"clean", BuildDate:"2021-03-18T01:00:06Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
@rhs
Contributor

rhs commented Apr 20, 2021

@RicardoNalesAmato Would it be possible for you to post the (possibly redacted) manifests of the ambassador resources in your cluster as well as the ambassador deployment itself? (That would greatly expedite our efforts to reproduce the issue.)

@RicardoNalesAmato
Author

RicardoNalesAmato commented Apr 21, 2021

@rhs Absolutely!

To reproduce the issue, we simply swap the image in the Deployment below to 1.13.0:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    app.kubernetes.io/instance: ambassador-internal-default
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ambassador-internal
    app.kubernetes.io/part-of: ambassador-internal-default
    helm.sh/chart: ambassador-internal-6.5.21
    product: aes
  name: ambassador-internal-default
  namespace: ambassador-internal
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: ambassador-internal-default
      app.kubernetes.io/name: ambassador-internal
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum/config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
      creationTimestamp: null
      labels:
        app: ambassador-internal
        app.kubernetes.io/instance: ambassador-internal-default
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: ambassador-internal
        app.kubernetes.io/part-of: ambassador-internal-default
        product: aes
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - ambassador-internal
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: AMBASSADOR_NAMESPACE
          value: ambassador-internal
        - name: AMBASSADOR_DRAIN_TIME
          value: "180"
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "true"
        - name: AMBASSADOR_ID
          value: internal
        image: eu.gcr.io/images/ambassador:1.12.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1
        name: ambassador-internal
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        - containerPort: 8443
          name: https
          protocol: TCP
        - containerPort: 8877
          name: admin
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 200m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp/ambassador-pod-info
          name: ambassador-pod-info
          readOnly: true
      dnsConfig:
        options:
        - name: ndots
          value: "2"
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 8888
      serviceAccount: ambassador-internal-default
      serviceAccountName: ambassador-internal-default
      terminationGracePeriodSeconds: 30
      volumes:
      - downwardAPI:
          defaultMode: 420
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels
            path: labels
        name: ambassador-pod-info

We have 31 Hosts across 3 namespaces (15 / 15 / 1). They all look like this (only the hostname changes):

apiVersion: getambassador.io/v2
kind: Host
metadata:
  name: myservice.com
spec:
  ambassador_id: internal
  hostname: myservice.com
  requestPolicy:
    insecure:
      action: Route
      additionalPort: 8080

We have 7 TLSContexts across 3 namespaces (3 / 3 / 1). They all look like this:

apiVersion: getambassador.io/v2
kind: TLSContext
metadata:
  name: myservice-tls
spec:
  ambassador_id: internal
  hosts:
  - myservice.com
  secret: myservice-tls-secret

And we have 31 Mappings across 3 namespaces (15 / 15 / 1). Again, they all look pretty much the same:

apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: myservice-mapping
spec:
  ambassador_id: internal
  host: myservice.*
  host_regex: true
  load_balancer:
    policy: least_request
  prefix: /
  resolver: endpoint
  service: myservice:8281
  timeout_ms: 0

Regarding the mappings, some do not use the endpoint resolver (and thus no advanced load balancing is used either) and use the default timeout_ms.
We use the .* regex since some of our old services still add the port to the host header.

Thanks for looking into this!

@vichaos

vichaos commented Apr 21, 2021

Could you compare the generated Envoy config (http://127.0.0.1:8001/config_dump) between those two versions?

Also, could you try adding strip_matching_host_port: true in the Ambassador Module instead of using host_regex?

They added this feature here: https://github.com/datawire/ambassador/blob/master/CHANGELOG.md#1120-march-08-2021
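
For illustration, a minimal sketch of such a Module is below; the metadata name and namespace are assumptions based on the resources posted above, not taken from this thread:

apiVersion: getambassador.io/v2
kind: Module
metadata:
  name: ambassador
  namespace: ambassador-internal
spec:
  ambassador_id: internal
  config:
    # Strip a ":port" suffix from the incoming Host header before matching,
    # which may remove the need for host_regex on every Mapping.
    strip_matching_host_port: true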

@RicardoNalesAmato
Author

RicardoNalesAmato commented Apr 21, 2021

@vichaos we tried using that feature, but the port needs to match the one Envoy is listening on. So if the port in the request is 80, it doesn't match the 8080 we are currently using (the default, for security reasons), and it doesn't work.

We are thinking about changing the ports used by Envoy (to 80 and 443) so we can use this feature.

@RicardoNalesAmato
Author

I’ll post the dump as soon as possible.

@esmet
Contributor

esmet commented Apr 21, 2021

@RicardoNalesAmato What does your Ambassador module config look like? Were you using prune_unreachable_routes by any chance? Feel free to reach out in the Ambassador Slack too.

@RicardoNalesAmato
Author

RicardoNalesAmato commented Apr 21, 2021

@esmet our module is using the following options:

spec:
    ambassador_id: internal
    config:
      add_linkerd_headers: true
      envoy_log_type: json
      suppress_envoy_headers: true
      use_proxy_proto: true
      xff_num_trusted_hops: 1
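
(For reference, that fragment corresponds to a complete Module resource roughly like the sketch below; the metadata name and namespace are assumptions, not taken from this thread.)

apiVersion: getambassador.io/v2
kind: Module
metadata:
  name: ambassador
  namespace: ambassador-internal
spec:
  ambassador_id: internal
  config:
    add_linkerd_headers: true
    envoy_log_type: json
    suppress_envoy_headers: true
    use_proxy_proto: true
    xff_num_trusted_hops: 1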

@esmet
Contributor

esmet commented Apr 21, 2021

Good to know. Are you able to tell which process inside the container is using the most CPU? I'm curious whether it's Envoy or the Ambassador configuration-processing bits.

@RicardoNalesAmato
Author

[Screenshot: Screen Shot 2021-04-21 at 12 09 09 PM]

@esmet these are the processes that are using the most CPU. And here is the difference between 1.12.2 and 1.13.0 (only the image was changed in one of our testing environments):
[Screenshot: Screen Shot 2021-04-21 at 12 10 53 PM]

@rhs
Contributor

rhs commented Apr 21, 2021

@RicardoNalesAmato It's a bit of a shot in the dark, but I have a candidate fix. Could you try running the following image and see if it helps any: datawiredev/ambassador:6eda18fc337d

@RicardoNalesAmato
Author

@rhs absolutely, let me give it a try and get back to you

@RicardoNalesAmato
Author

@rhs your new version fixed it!

On the left you can see when I deployed 1.13.0; I then reverted back to 1.12.2. The latest deployment is the image you provided me :)
[Screenshot: Screen Shot 2021-04-21 at 1 05 35 PM]

@RicardoNalesAmato
Author

The CPU usage went from 400m+ to ~40m.

[Screenshot: Screen Shot 2021-04-21 at 1 08 15 PM]

@rhs
Contributor

rhs commented Apr 21, 2021

@RicardoNalesAmato Thanks, I will make sure that fix makes it into an upcoming release!

FYI, if you can avoid using regular expressions in the host field of your Mappings, there is an Ambassador Module option named prune_unreachable_routes that, if enabled, may result in significantly more efficient Envoy configurations for your situation.
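
For illustration, a rough sketch of what that could look like, assuming the hostnames no longer need the .* regex; the Module name is an assumption, and the Mapping mirrors the example posted earlier in this thread:

apiVersion: getambassador.io/v2
kind: Module
metadata:
  name: ambassador
spec:
  ambassador_id: internal
  config:
    # Lets routes whose Hosts can never match be pruned from the Envoy config;
    # this only helps when Mappings use literal hostnames rather than regexes.
    prune_unreachable_routes: true
---
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: myservice-mapping
spec:
  ambassador_id: internal
  # Literal hostname instead of host: myservice.* with host_regex: true
  host: myservice.com
  prefix: /
  service: myservice:8281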

@rhs rhs changed the title Upgrading to version 1.13.0 Increased our CPU usage by almost 10x Regression: Upgrading to version 1.13.0 Increased our CPU usage by almost 10x Apr 21, 2021
@khussey khussey added the t:bug Something isn't working label Apr 21, 2021
@khussey khussey added this to the 2021 Cycle 3 Cool-down milestone Apr 21, 2021
@RicardoNalesAmato
Author

@rhs @esmet @vichaos thanks a lot for the help!

@khussey
Contributor

khussey commented Apr 22, 2021

This has been fixed in 1.13.1, which is now available.

@khussey khussey closed this as completed Apr 22, 2021
@blyles

blyles commented Apr 24, 2021

We're still seeing this in 1.13.1; reverting to 1.12.2 is currently underway.

@rhs
Contributor

rhs commented Apr 24, 2021

@vaustral Can you post the (possibly redacted) resources in your cluster so we can try to reproduce?
