Wait for a CRD type to deploy before deploying resources that use the type #1117

Closed
ringerc opened this issue Sep 20, 2021 · 12 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@ringerc

ringerc commented Sep 20, 2021

Presently, when kubectl is used to apply a large manifest that defines new custom resource definitions (CRDs) as well as resources that use the new kinds, a race condition can cause the deployment to fail. Assuming you're using kubectl apply -f - with an external kustomize, you might see an error like:

unable to recognize "STDIN": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1"

(the exact resource "kind" and API "version" will vary depending on what you're deploying).

This appears to be a race between the cluster establishing the new CRD types and kubectl sending requests that use those types, but there's no indication of that in the command's output. It's confusing for new users, and it's a hassle operationally since deployments fail and then succeed when retried. This is something the kubectl tool could help users with.
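
The race window closes once the CRD reports the Established condition; one way to check that by hand (using the Alertmanager CRD name from the kube-prometheus example below) is:

kubectl get crd alertmanagers.monitoring.coreos.com \
  -o jsonpath='{.status.conditions[?(@.type=="Established")].status}'

This prints True once the API server is serving the new kind.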

The --server-side option does not help, as the same race occurs there, and --wait=true only affects resource removal, not creation.

This can often be reproduced with a kind cluster, though it varies since it's a race. For example:

kind create cluster
git clone -b v0.8.0 https://github.com/prometheus-operator/kube-prometheus
kubectl apply -k kube-prometheus/

... which will often fail with:

daemonset.apps/node-exporter created
unable to recognize "kube-prometheus/": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "Prometheus" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"

but when the same command is repeated, it will succeed:

daemonset.apps/node-exporter unchanged
alertmanager.monitoring.coreos.com/main created
prometheus.monitoring.coreos.com/k8s created
prometheusrule.monitoring.coreos.com/alertmanager-main-rules created
prometheusrule.monitoring.coreos.com/kube-prometheus-rules created
...

There doesn't seem to be any (obvious) kubectl flag to impose a delay between requests, wait for a new resource to become visible before continuing, or retry a request if it fails because of a server-side error indicating something was missing.
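
Absent such a flag, the closest user-side approximation seems to be wrapping the apply in a retry loop; a rough sketch (the retry count and delay are arbitrary):

# Retry the apply a few times; later attempts succeed once the CRDs are established.
for attempt in 1 2 3 4 5; do
  kubectl apply -k kube-prometheus/ && break
  sleep 5
done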

The error message is confusing for new users and definitely does not help. A wording change and some context would help a lot. I raised that separately: #1118

@ringerc ringerc added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 20, 2021
@k8s-ci-robot
Contributor

@ringerc: This issue is currently awaiting triage.

SIG CLI takes a lead on issue triage for this repo, but any Kubernetes member can accept issues by applying the triage/accepted label.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 20, 2021
@ringerc
Author

ringerc commented Sep 20, 2021

See related issue discussing retries: kubernetes/kubernetes#5762 (comment)

The server does not issue HTTP status code 429 here, presumably because it doesn't know there's a pending deployment that will create this resource.

@ringerc
Author

ringerc commented Sep 20, 2021

See also the related issue in kube-prometheus, but note that this is far from specific to kube-prometheus; it can affect anything where a race exists between resource creation and resource use:

prometheus-operator/prometheus-operator#1866

@ringerc ringerc changed the title Wait for a type to deploy before deploying resources that use the type Wait for a CRD type to deploy before deploying resources that use the type Sep 20, 2021
@ringerc
Author

ringerc commented Sep 20, 2021

A workaround is to use kfilt to deploy the CRDs first, then wait for them to become visible, then deploy the rest:

kustomize build somedir | kfilt -i kind=CustomResourceDefinition | kubectl apply -f -
kustomize build somedir | kfilt -i kind=CustomResourceDefinition | kubectl wait --for condition=established --timeout=60s -f -
kustomize build somedir | kubectl apply -f - 
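
A variant that avoids kfilt, at the cost of one expected partial failure on the first pass, is to apply everything, wait for all CRDs to become established, then apply again (roughly):

kustomize build somedir | kubectl apply -f - || true   # first pass; custom resource instances may fail
kubectl wait --for condition=established --timeout=60s crd --all
kustomize build somedir | kubectl apply -f -           # second pass picks up the rest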

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2021
@eddiezane
Member

Wait for a CRD type to deploy before deploying resources that use the type

This isn't something that we would implement for kubectl, but we will investigate which part of apply can provide a better error. We'll handle that in #1118.

/close

@k8s-ci-robot
Contributor

@eddiezane: Closing this issue.

In response to this:

Wait for a CRD type to deploy before deploying resources that use the type

This isn't something that we would implement for kubectl, but we will investigate which part of apply can provide a better error. We'll handle that in #1118.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eddiezane
Member

@KnVerey do you know if Kustomize has any hooks authors can use similar to how Helm handles CRD installation separately?

@KnVerey
Contributor

KnVerey commented Jan 20, 2022

@KnVerey do you know if Kustomize has any hooks authors can use similar to how Helm handles CRD installation separately?

It does not. Kustomize is purely a client-side tool, so it has no deploy-related features.

@ringerc
Author

ringerc commented Jan 21, 2022

That's unfortunate, because it means essentially every user who wants robust deployments has to implement repetitive logic like:

kfilt -i kind=CustomResourceDefinition myconfig.yaml | kubectl apply -f -
kfilt -i kind=CustomResourceDefinition myconfig.yaml | kubectl wait --for condition=established --timeout=60s -f -
kfilt -i kind=Namespace myconfig.yaml | kubectl apply -f -
kfilt -i kind=Namespace myconfig.yaml | kubectl wait --for condition=established --timeout=60s -f -
kubectl apply -f myconfig.yaml

It's reasonable to expect the tool that generates the manifests fed to kubectl to order them correctly (CRDs, then namespaces, then other resources) so that dependencies come first.

But it's a pity that it's seemingly not practical for kubectl to ensure the requests apply correctly.

If waiting isn't viable, what about --retry-delay '1s' and --max-retry-count 5 options for retrying individual requests?
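
In the meantime, that repetitive logic can at least be wrapped once per project; a rough sketch, assuming kfilt reads the manifest on stdin (apply_ordered is just an illustrative name):

apply_ordered() {
  # Apply CRDs first, wait for them to be established, then apply the full manifest.
  local manifest="$1"
  kfilt -i kind=CustomResourceDefinition < "$manifest" | kubectl apply -f -
  kfilt -i kind=CustomResourceDefinition < "$manifest" | kubectl wait --for condition=established --timeout=60s -f -
  kubectl apply -f "$manifest"
}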

@tlandschoff-scale

Thank you so much for the kfilt example above. I am running into this in another way.

My Vagrant provisioning step fails because k3s is installed but has not yet finished setting up Traefik, so applying my resources sometimes fails with:

default: error: unable to recognize "/vagrant/traefik_certificate.yaml": no matches for kind "TLSStore" in version "traefik.containo.us/v1alpha1"

I can now fix this with:

$ kubectl wait --for condition=established crd tlsstores.traefik.containo.us

Thanks! I agree that a more generic solution in kubectl would be great.
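
For the provisioning script, the wait and the apply can be combined, with a timeout so the step fails fast instead of hanging; roughly (the 120s value is arbitrary):

kubectl wait --for condition=established --timeout=120s crd tlsstores.traefik.containo.us
kubectl apply -f /vagrant/traefik_certificate.yaml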
