kapp and kube API Server calls limits #627

Closed
revolunet opened this issue Oct 18, 2022 · 13 comments
Labels

  • carvel triage: This issue has not yet been reviewed for validity
  • discussion: This issue is not a bug or feature and a conversation is needed to find an appropriate resolution
  • helping with an issue: Debugging happening to identify the problem

Comments

@revolunet

revolunet commented Oct 18, 2022

Hello,

I'm benchmarking some kapp deploy commands on a big manifest file with 6 containers and some wait-rules, without kapp-controller, and I'm getting 403 errors from the API server when I run multiple concurrent kapp deploy commands. These 403s seem to make kapp stop with:

kapp: Error: waiting on reconcile job/job-template-kapp1-1-32ylei-db-hasura-create-secret-672rpn (batch/v1) namespace: fabrique-ci:
  Errored:
    Listing schema.GroupVersionResource{Group:"", Version:"v1", Resource:"pods"}, namespaced: true:
        Fetching all namespaces: an error on the server ("error trying to reach service: dial tcp 10.0.0.1:443: connect: connection refused") has prevented the request from succeeding (get namespaces)

I've run various tests, setting kapp-api-qps to 10 and kapp-api-burst to 10, and I'm out of ideas, so I'd like to share this with you; maybe you'll have some 😉
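As background on what QPS and burst settings of this kind control: Kubernetes clients typically throttle themselves with a token bucket that holds at most `burst` tokens and refills at `qps` tokens per second. The sketch below is a toy simulation of that idea, not kapp's or client-go's actual code, and the names `TokenBucket` and `requests_allowed` are made up for illustration. It also hints at why such limits may not help with concurrent deploys: the limiter lives inside each process, so three independent kapp processes can together send roughly three times the configured rate.

```python
class TokenBucket:
    """Toy client-side limiter: refills `qps` tokens per second, holds at most `burst`."""

    def __init__(self, qps, burst):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full, so an initial burst is allowed
        self.last = 0.0

    def try_acquire(self, now):
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(float(self.burst), self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def requests_allowed(qps, burst, attempts_per_second, seconds):
    """Count how many requests get through when a client hammers the limiter."""
    bucket = TokenBucket(qps, burst)
    allowed = 0
    for second in range(seconds):
        for _ in range(attempts_per_second):
            if bucket.try_acquire(float(second)):
                allowed += 1
    return allowed


# One process with qps=10 and burst=10, attempting 100 requests at t=0, 1 and 2.
one_process = requests_allowed(qps=10, burst=10, attempts_per_second=100, seconds=3)
# Three independent processes each have their own bucket.
three_processes = 3 * one_process
print(one_process, three_processes)  # prints: 30 90
```

Under these assumptions a single process is held to 30 requests in the simulated window, while three parallel processes send 90 in total, so the server-side load still scales with the number of concurrent deploys.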

It looks like most of the 403s are related to cluster-wide API calls (namespaces, pods...).

Has anyone experienced this kind of behaviour? We're using AKS with Rancher.

Some numbers for multiple concurrent deploys (3) with the manifests below (stripped):

In this graph you can see the API server's responses to kapp:

  • green: 200 or 201
  • blue: 404
  • red: >201 and !=404 (mostly 403)

[Screenshot 2022-10-19 at 01 47 08]

Sample errors:

[Screenshot 2022-10-19 at 01 49 22]

Sample manifests:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: app
    application: template
  name: app
  namespace: template-kapp1
  annotations:
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.app: kontinuous/app.template-kapp1
    kapp.k14s.io/change-rule.build-app: upsert after upserting kontinuous/build-app.template-kapp1
    kapp.k14s.io/change-rule.keycloakx: upsert after upserting kontinuous/keycloakx.template-kapp1
    kapp.k14s.io/change-rule.hasura: upsert after upserting kontinuous/hasura.template-kapp1
spec:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: hasura
    application: template
  name: hasura
  namespace: template-kapp1
  annotations:
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.hasura: kontinuous/hasura.template-kapp1
    kapp.k14s.io/change-rule.build-hasura: upsert after upserting kontinuous/build-hasura.template-kapp1
    kapp.k14s.io/change-rule.db-hasura: upsert after upserting kontinuous/db-hasura.template-kapp1
    kapp.k14s.io/change-rule.keycloakx: upsert after upserting kontinuous/keycloakx.template-kapp1
spec:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: maildev
    application: template
  name: maildev
  namespace: template-kapp1
  annotations:
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.maildev: kontinuous/maildev.template-kapp1
spec:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: metabase
    application: template
  name: metabase
  namespace: template-kapp1
  annotations:
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.metabase: kontinuous/metabase.template-kapp1
    kapp.k14s.io/change-rule.db-metabase: upsert after upserting kontinuous/db-metabase.template-kapp1
spec:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: pgweb
    application: template
  name: pgweb
  namespace: template-kapp1
  annotations:
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.pgweb: kontinuous/pgweb.template-kapp1
    kapp.k14s.io/change-rule.db-hasura: upsert after upserting kontinuous/db-hasura.template-kapp1
spec:
   
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: keycloakx
  annotations:
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.keycloakx: kontinuous/keycloakx.template-kapp1
    kapp.k14s.io/change-rule.db-keycloak: upsert after upserting kontinuous/db-keycloak.template-kapp1
  namespace: template-kapp1
spec:
   
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-build-app-kaniko-3zekn9
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.build-app: kontinuous/build-app.template-kapp1
    kapp.k14s.io/change-group.build-app.kaniko: kontinuous/build-app.kaniko.template-kapp1
    kapp.k14s.io/change-group.build-app..kaniko: kontinuous/build-app..kaniko.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1
spec:
 
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-build-hasura-kaniko-3d6853
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.build-hasura: kontinuous/build-hasura.template-kapp1
    kapp.k14s.io/change-group.build-hasura.kaniko: kontinuous/build-hasura.kaniko.template-kapp1
    kapp.k14s.io/change-group.build-hasura..kaniko: kontinuous/build-hasura..kaniko.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1
spec:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-db-hasura-create-db-1dtpbq
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.db-hasura: kontinuous/db-hasura.template-kapp1
    kapp.k14s.io/change-group.db-hasura.create-db: kontinuous/db-hasura.create-db.template-kapp1
    kapp.k14s.io/change-group.db-hasura..create-db: kontinuous/db-hasura..create-db.template-kapp1
    kapp.k14s.io/change-rule.db-hasura..create-secret: upsert after upserting kontinuous/db-hasura..create-secret.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1
spec:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-db-hasura-create-secret-672rpn
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.db-hasura: kontinuous/db-hasura.template-kapp1
    kapp.k14s.io/change-group.db-hasura.create-secret: kontinuous/db-hasura.create-secret.template-kapp1
    kapp.k14s.io/change-group.db-hasura..create-secret: kontinuous/db-hasura..create-secret.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1

spec:
   
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-db-keycloak-create-db-3dxq1g
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.db-keycloak: kontinuous/db-keycloak.template-kapp1
    kapp.k14s.io/change-group.db-keycloak.create-db: kontinuous/db-keycloak.create-db.template-kapp1
    kapp.k14s.io/change-group.db-keycloak..create-db: kontinuous/db-keycloak..create-db.template-kapp1
    kapp.k14s.io/change-rule.db-keycloak..create-secret: >-
      upsert after upserting
      kontinuous/db-keycloak..create-secret.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1

spec:
   
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-db-keycloak-create-secret-39r2rj
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.db-keycloak: kontinuous/db-keycloak.template-kapp1
    kapp.k14s.io/change-group.db-keycloak.create-secret: kontinuous/db-keycloak.create-secret.template-kapp1
    kapp.k14s.io/change-group.db-keycloak..create-secret: kontinuous/db-keycloak..create-secret.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1
spec:
 
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-db-metabase-create-db-xgit30
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.db-metabase: kontinuous/db-metabase.template-kapp1
    kapp.k14s.io/change-group.db-metabase.create-db: kontinuous/db-metabase.create-db.template-kapp1
    kapp.k14s.io/change-group.db-metabase..create-db: kontinuous/db-metabase..create-db.template-kapp1
    kapp.k14s.io/change-rule.db-metabase..create-secret: >-
      upsert after upserting
      kontinuous/db-metabase..create-secret.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1
spec:
  
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-db-metabase-create-secret-2bu7x4
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.db-metabase: kontinuous/db-metabase.template-kapp1
    kapp.k14s.io/change-group.db-metabase.create-secret: kontinuous/db-metabase.create-secret.template-kapp1
    kapp.k14s.io/change-group.db-metabase..create-secret: kontinuous/db-metabase..create-secret.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1
  
spec:
 
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-seed-hasura-import-secret-3h5a4u
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.seed-hasura: kontinuous/seed-hasura.template-kapp1
    kapp.k14s.io/change-group.seed-hasura.import-secret: kontinuous/seed-hasura.import-secret.template-kapp1
    kapp.k14s.io/change-group.seed-hasura..import-secret: kontinuous/seed-hasura..import-secret.template-kapp1
    kapp.k14s.io/change-rule.hasura: upsert after upserting kontinuous/hasura.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1
spec:
 
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-template-kapp1-seed-hasura-seed-db-59hfdf
  namespace: fabrique-ci
  annotations:
    kapp.k14s.io/nonce: ""
    kapp.k14s.io/update-strategy: fallback-on-replace
    kapp.k14s.io/change-group: kontinuous/template-kapp1
    kapp.k14s.io/change-group.seed-hasura: kontinuous/seed-hasura.template-kapp1
    kapp.k14s.io/change-group.seed-hasura.seed-db: kontinuous/seed-hasura.seed-db.template-kapp1
    kapp.k14s.io/change-group.seed-hasura..seed-db: kontinuous/seed-hasura..seed-db.template-kapp1
    kapp.k14s.io/change-rule.seed-hasura..import-secret: >-
      upsert after upserting
      kontinuous/seed-hasura..import-secret.template-kapp1
    kapp.k14s.io/change-rule.hasura: upsert after upserting kontinuous/hasura.template-kapp1
    kapp.k14s.io/disable-original: ""
    kapp.k14s.io/create-strategy: fallback-on-update
    kapp.k14s.io/change-group.jobs: kontinuous/jobs.template-kapp1
spec:
 
@revolunet revolunet added the carvel triage This issue has not yet been reviewed for validity label Oct 18, 2022
@carvel-bot carvel-bot moved this to To Triage in Carvel Oct 18, 2022
@renuy renuy added helping with an issue Debugging happening to identify the problem discussion This issue is not a bug or feature and a conversation is needed to find an appropriate resolution and removed carvel triage This issue has not yet been reviewed for validity labels Oct 19, 2022
@renuy renuy moved this from To Triage to In Progress in Carvel Oct 19, 2022
@praveenrewar
Member

Hi @revolunet! My guess is that the server has a backlog of pending requests and is therefore refusing the TCP connection.
You can try increasing --wait-check-interval to a higher value (say 5 seconds or 20 seconds) and see if that helps. Increasing it to 5 seconds would cut the number of API calls made during the waiting stage to roughly 1/5th.
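The 1/5th estimate can be checked with a back-of-the-envelope model. This is a rough sketch that assumes a 1-second baseline interval (which is what the estimate implies; the real default may differ) and one status check per resource per interval, which simplifies whatever kapp actually does while waiting. The function name is hypothetical, and the resource count of 16 is simply the number of objects in the stripped manifests above.

```python
import math


def wait_stage_checks(wait_seconds, check_interval_seconds, resources=1):
    """Rough count of status checks: one per resource per polling interval."""
    return resources * math.ceil(wait_seconds / check_interval_seconds)


# Assumed scenario: 16 resources that take 60 seconds to reconcile.
baseline = wait_stage_checks(wait_seconds=60, check_interval_seconds=1, resources=16)
slower = wait_stage_checks(wait_seconds=60, check_interval_seconds=5, resources=16)
print(baseline, slower, slower / baseline)
```

In this model the polling load drops from 960 checks to 192, a factor of five, but the requests made before the waiting stage are unaffected.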

@revolunet
Author

Thanks @praveenrewar! I'll try these options and report back.

@revolunet
Author

While it reduces the load in the long run, it doesn't prevent the 403s.

It looks like a lot of requests are made at the start of kapp deploy, and they don't take the qps/burst/wait options into account.

[Screenshot 2022-10-19 at 09 49 52]

@praveenrewar
Member

I see. Would you be able to share any such error? If it's happening in the apply stage, then you could also try increasing --apply-check-interval and decreasing --apply-concurrency (default is 5). But I suspect it's happening well before that. You can also try reducing --existing-non-labeled-resources-check-concurrency to a smaller number like 5 (default is 100).
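To make the suggestions so far concrete, here is a small sketch that assembles a kapp deploy command line with the throttling flags named in this thread. The helper name, the app name, the manifest path, and every value are illustrative assumptions, not recommendations; the function only builds the argument list and does not run anything.

```python
def kapp_deploy_argv(app, manifest, wait_check_interval="5s", apply_check_interval="5s",
                     apply_concurrency=2, non_labeled_check_concurrency=5):
    """Build (but do not run) a kapp deploy command line with throttling flags."""
    return [
        "kapp", "deploy", "-a", app, "-f", manifest,
        "--wait-check-interval", wait_check_interval,
        "--apply-check-interval", apply_check_interval,
        "--apply-concurrency", str(apply_concurrency),
        "--existing-non-labeled-resources-check-concurrency",
        str(non_labeled_check_concurrency),
    ]


# Hypothetical app name and manifest file.
argv = kapp_deploy_argv("template-kapp1", "manifests.yaml")
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the values have been tuned for the cluster in question.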

@revolunet
Author

revolunet commented Oct 19, 2022

Hmm, thanks. I've tried many combinations without luck. It looks like something is wrong in our cluster: it fails as soon as we launch multiple parallel kapp deploys. Investigating...

@praveenrewar
Member

Would you be able to share a few things that might help us improve kapp performance? (We are already working on a couple of things in #599.)

  • cluster configuration (RAM, CPU, number of nodes, etc.)
  • permissions available to the user/SA being used
  • a couple of examples of the 403 errors you are seeing
  • number of kapp apps you are trying to deploy concurrently

@revolunet
Author

Hi,

Our cluster is 6 × (6 CPU + 32 GB).

I tried with a more privileged account and got no 403s, but still got 499 or 500 responses from the API server, which make kapp stop.
So yes, the 403s are due to a limited service account that cannot query cluster-wide.

We use Rancher and suspect it is our bottleneck here. We'll test directly against the API server to see if it gets better.

It works well with one kapp deploy, sometimes with two simultaneous ones, but not with more, using the manifests above.

Some log examples:

2022-10-19T09:59:58+00:00 499 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-2/configmaps kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:58+00:00 499 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-2/configmaps kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:58+00:00 499 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-2/configmaps kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:58+00:00 409 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-2/serviceaccounts kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:50+00:00 499 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-1/configmaps kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:50+00:00 409 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-1/serviceaccounts kapp/v0.0.0 (linux/amd64) kubernetes/$Format
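A quick way to summarise access-log samples like these is to tally status codes per resource. The sketch below assumes the whitespace-separated layout seen above (timestamp, status, path, user agent) and takes the last path segment as the resource name; the `tally` helper is hypothetical.

```python
from collections import Counter

# The six sample lines from the comment above.
LOG = """\
2022-10-19T09:59:58+00:00 499 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-2/configmaps kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:58+00:00 499 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-2/configmaps kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:58+00:00 499 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-2/configmaps kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:58+00:00 409 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-2/serviceaccounts kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:50+00:00 499 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-1/configmaps kapp/v0.0.0 (linux/amd64) kubernetes/$Format
2022-10-19T09:59:50+00:00 409 /k8s/clusters/c-gjtkk/api/v1/namespaces/template-kapp1-8-1/serviceaccounts kapp/v0.0.0 (linux/amd64) kubernetes/$Format
"""


def tally(log_text):
    """Count (status, resource) pairs; the resource is the last path segment."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        status = int(parts[1])
        resource = parts[2].rsplit("/", 1)[-1]
        counts[(status, resource)] += 1
    return counts


for (status, resource), count in sorted(tally(LOG).items()):
    print(status, resource, count)
```

For this sample it reports four 499 responses on configmaps and two 409 responses on serviceaccounts. Note that 499 is a non-standard code that nginx uses when the client closes the connection before a response is sent, which would suggest these lines come from a proxy in front of the API server rather than from the API server itself.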

@praveenrewar
Member

> I tried with a more privileged account and got no 403s, but still got 499 or 500 responses from the API server, which make kapp stop.
> So yes, the 403s are due to a limited service account that cannot query cluster-wide.

I see, but a forbidden error usually causes kapp to stop, so I am wondering what caused this many API calls then?

> We use Rancher and suspect it is our bottleneck here. We'll test directly against the API server to see if it gets better.

That is a possibility, because based on the cluster configuration, it should be able to handle this many requests.

@praveenrewar
Member

Hi @revolunet! Were you able to find the root cause of the failures? Let me know if you need any help or if you would like to share some information that could help improve kapp performance.

@revolunet
Author

revolunet commented Oct 28, 2022

Hello @praveenrewar,

After many tests, we've confirmed that it comes from the Rancher API. For some reason it throws "connection refused" under load, and we're unable to find the root cause or more logs.

The good news is that kapp works flawlessly when talking directly to the kube API server!

I think this issue can be closed.

@praveenrewar
Member

Thank you for the update @revolunet. Closing the issue for now, but feel free to reopen it if you find something we can improve on.

Repository owner moved this from In Progress to Closed in Carvel Oct 28, 2022
@revolunet
Author

Maybe kapp could have a better retry mechanism on API errors so that it could also work with flaky clusters.

Thanks for your support!

@github-actions github-actions bot added the carvel triage This issue has not yet been reviewed for validity label Oct 28, 2022
@praveenrewar
Member

We do have a set of retryable errors, but currently retrying doesn't happen in the waiting stage. We are tracking that over here. Hopefully we will find a suitable solution soon.
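For illustration, a retry mechanism of the kind discussed here is often built as exponential backoff with jitter around a set of retryable errors. The sketch below is a generic pattern, not kapp's implementation: the `ApiError` type, the `call_with_retry` helper, and the choice of retryable status codes are all assumptions made for the example.

```python
import random
import time

# Assumption: which statuses count as transient. kapp's own list may differ.
RETRYABLE_STATUS = {429, 499, 500, 502, 503, 504}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


def call_with_retry(fn, attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, rng=random.random):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except ApiError as err:
            is_last_attempt = attempt == attempts - 1
            if err.status not in RETRYABLE_STATUS or is_last_attempt:
                raise  # permanent errors (e.g. 403) and exhausted retries propagate
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: sleep between 50% and 100% of the computed delay.
            sleep(delay * (0.5 + rng() / 2))
```

The jitter matters in the scenario from this thread: several parallel deploys that fail at the same moment would otherwise all retry at the same moment too. A 403 is deliberately left out of the retryable set, since a permissions error will not fix itself on a second attempt.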
