Kops --wait #1517

Closed
ajohnstone opened this issue Jan 17, 2017 · 24 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

@ajohnstone
Contributor

ajohnstone commented Jan 17, 2017

Executing both kops create cluster ... and kops update cluster took 2.57 minutes for a new cluster. However, the Kubernetes cluster is not ready for use at this point.

E.g.

$ kubectl get pods
Unable to connect to the server: EOF

$ kops-1.5.0-alpha3 validate  cluster
Using cluster from kubectl context: xxxxxx.photobox.com

Validating cluster xxxxxx.photobox.com


Cannot get nodes for "xxxxxx.photobox.com": Get https://api.xxxxxx.photobox.com/api/v1/nodes: EOF

$ kops-1.5.0-alpha3 validate  cluster

Using cluster from kubectl context: xxxxxx.photobox.com

Validating cluster xxxxxx.photobox.com


Cannot get nodes for "xxxxxx.photobox.com": Get https://api.xxxxxx.photobox.com/api/v1/nodes: EOF

$ kops-1.5.0-alpha3 validate  cluster
Using cluster from kubectl context: xxxxxx.photobox.com

Validating cluster xxxxxx.photobox.com

INSTANCE GROUPS
NAME			ROLE	MACHINETYPE	MIN	MAX	SUBNETS
bastions		Bastion	t2.micro	1	1	utility-eu-west-1b,utility-eu-west-1c,utility-eu-west-1a
master-eu-west-1a	Master	m3.large	1	1	eu-west-1a
master-eu-west-1b	Master	m3.large	1	1	eu-west-1b
master-eu-west-1c	Master	m3.large	1	1	eu-west-1c
nodes			Node	m3.large	5	5	eu-west-1b,eu-west-1c,eu-west-1a

NODE STATUS
NAME						ROLE	READY
ip-10-0-102-114.eu-west-1.compute.internal	master	True
ip-10-0-104-134.eu-west-1.compute.internal	node	False
ip-10-0-120-191.eu-west-1.compute.internal	node	True
ip-10-0-47-148.eu-west-1.compute.internal	node	False
ip-10-0-58-87.eu-west-1.compute.internal	master	True
ip-10-0-78-49.eu-west-1.compute.internal	node	False
ip-10-0-80-192.eu-west-1.compute.internal	master	True
ip-10-0-84-31.eu-west-1.compute.internal	node	False

Validation Failed
Master(s) Not Ready 0 out of 3.
Node(s) Not Ready   4 out of 5.

Your cluster xxxxxx.photobox.com is NOT ready.

It would be ideal to have kops with a --wait option for the cluster to be ready.
There is a similar ticket #139.

There are a few example scripts to wait for a cluster to be initialised; however, it would be ideal to have this as part of kops.

Also to note, @chrislovecnm: kops validate cluster doesn't do anything with kubectl get cs; maybe it would be ideal to add that into kops validate cluster?

# Point kubectl at the new cluster.
kubectl config use-context ${CLUSTER_NAME}

# Wait until at least 4 component statuses report Healthy.
echo -n "Waiting for cluster components to become ready."
until [ "$(kubectl get cs 2> /dev/null | grep -c Healthy)" -ge 4 ]
do
  echo -n "."
  sleep 1
done
echo "ok"

# Wait until the number of Ready non-master nodes matches the MIN size
# of the "nodes" instance group (grep -w avoids counting NotReady nodes).
echo -n "Waiting for minimum nodes to become ready."
min_nodes=$(kops get ig nodes --name ${CLUSTER_NAME} | grep nodes | awk '{print $4}')
until [ "$(kubectl get nodes 2> /dev/null | grep -v master | grep -cw Ready)" == "$min_nodes" ]
do
  echo -n "."
  sleep 1
done
echo "ok"

@shadoi

or

# Retry until validation succeeds (note: no timeout).
while true; do
    kops validate cluster && break || sleep 30
done
@chrislovecnm
Contributor

I am not a huge fan of putting in a wait, but I will let @kris-nova and @justinsb weigh in as well.

@ajohnstone: adding an issue for adding a kubectl get cs equivalent to kops validate.

@justinsb
Member

What is the objection to a wait command, @chrislovecnm?

@justinsb justinsb added this to the 1.5.1 milestone Jan 19, 2017
@krisnova
Contributor

krisnova commented Jan 19, 2017

I am all for the --wait flag!

Concerns/Questions

What do we think about --validate?

My other concern would be a line from the story here:

It would be ideal to have kops with a --wait option for the cluster to be ready.

With, of course, the magic phrase being "cluster to be ready". As long as we can agree on what that means (and probably more importantly what it doesn't mean) and clearly stick to that, I am good.

Should we use kops validate logic here?
Verify the API is listening?
/healthz?
etc.

Validation Timeout (T1 -> Failure)

So I think we have 2 spans of time to track. This one would of course be the time kops waits, from the instant we would usually exit until we finally give up waiting for success and error out.

Minimum Valid Span (T2 -> Success)

The other span we would need to track would be the time from the instant we first receive a valid cluster until some arbitrary span has elapsed (while still receiving a valid result).

So basically - it seems simple to have kops hang, but we would need to clarify a few things and set clear expectations around what kops will and won't offer/promise the user.
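
For illustration only, the two spans could be expressed as a small polling loop. The sketch below assumes kops validate cluster is the readiness signal (the validate logic / API check / healthz question above is left open), and the T1, T2, and poll-interval values are placeholders, not proposed defaults:

#!/usr/bin/env bash
# Sketch only: fail after T1 seconds overall, succeed once the cluster has
# validated continuously for T2 seconds.
T1=900        # illustrative overall timeout (seconds)
T2=60         # illustrative minimum continuous-valid span (seconds)
interval=10   # illustrative poll interval (seconds)
elapsed=0
valid_for=0

while [ "$elapsed" -lt "$T1" ]; do
  if kops validate cluster > /dev/null 2>&1; then
    valid_for=$((valid_for + interval))
    # Only declare success after T2 seconds of uninterrupted validity.
    [ "$valid_for" -ge "$T2" ] && { echo "Cluster ready."; exit 0; }
  else
    valid_for=0   # any failed validation resets the T2 window
  fi
  sleep "$interval"
  elapsed=$((elapsed + interval))
done

echo "Gave up waiting after ${T1}s." >&2
exit 1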

@chrislovecnm
Contributor

I'd like to keep this async or a single call. Implementing something I can do with a while loop in bash just seems like overkill. But that is just me. 🤷‍♂️

I can see that there is value to the user. I find having the validate wait, with a configurable number of loops, more palatable. Putting a wait into create just seems like overkill. But eh ... what do our users need :)

@kenden
Contributor

kenden commented Sep 25, 2017

@chrislovecnm I have that while loop in bash inside a makefile:

# Poll `kops validate cluster` every 30 seconds, for up to 15 minutes.
# Note: recipe lines must be tab-indented, and [[ ]] assumes a bash SHELL.
.PHONY: wait_for_cluster_ready
wait_for_cluster_ready:
  @max_wait=900; \
  while [[ $${max_wait} -gt 0 ]]; do \
    ${KOPS} validate cluster ${CLUSTER_NAME} --state ${KOPS_STATE} && break || sleep 30; \
    max_wait=$$((max_wait - 30)); \
    echo "Waited 30 seconds. Will keep waiting for up to $${max_wait} more seconds."; \
  done; \
  if [[ $${max_wait} -le 0 ]]; then \
    echo "Timeout: cluster did not validate after 15 minutes"; \
    exit 1; \
  fi

But I think it would look much cleaner with just a --wait on the create/update commands.
There are similar tickets for helm to make the user's life easier, BTW:
helm/helm#2114
helm/helm#1805

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 10, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@jbialy

jbialy commented Apr 16, 2018

/reopen

@k8s-ci-robot
Contributor

@jbialy: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@techdragon

@ajohnstone do you want to reopen this?

@justinsb justinsb reopened this Feb 13, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jbialy

jbialy commented Mar 20, 2019

Could this be reopened once again? I think it'd be a very useful feature to have!

@bryan-rhm

I'm waiting for this feature too; this should be reopened.

@TomBloo

TomBloo commented Dec 12, 2019

Re-open!

/reopen

@rifelpet rifelpet reopened this Dec 12, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SCLogo

SCLogo commented Feb 13, 2020

No updates on this?

@agilgur5
Contributor

Would adding --wait to kops validate cluster be more acceptable than adding it to kops create cluster? I.e., kops validate cluster --wait would loop until the cluster passes validation.

Right now I've just added this script/function to do so:

kops_validate_loop() {
  kops_exit_code=1

  # `kops validate cluster` should exit with code 0 to be successful
  while [[ $kops_exit_code -ne 0 ]]; do
    # subshell to workaround set -e
    kops_exit_code=$(kops validate cluster \
      --name=$CONTEXT \
      --state=$KOPS_STATE_STORE \
      > /dev/null; echo $?)

    if [[ $kops_exit_code -ne 0 ]]; then
      # same as kops
      echo "Cluster did not pass validation, will try again in 30s..."
      sleep 30
    fi
  done
}

(note that $CONTEXT and $KOPS_STATE_STORE must be set beforehand if you're planning to use that)
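
For example, with those variables set (the values below are placeholders) and the function sourced into the current shell, the flow would look roughly like:

# Placeholder values; substitute your own cluster name and state-store bucket.
export CONTEXT="example.k8s.local"
export KOPS_STATE_STORE="s3://example-kops-state"

kops update cluster --name="$CONTEXT" --yes
kops_validate_loop   # blocks until `kops validate cluster` exits 0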

@olemarkus
Member

Doesn't kops validate cluster have a --wait flag now?
kops validate cluster --wait 60s
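
For instance, a deploy script could lean on the built-in flag rather than a hand-rolled loop; the duration and variable below are illustrative:

# Exits non-zero if the cluster does not pass validation within the window.
kops update cluster --name="$CLUSTER_NAME" --yes
kops validate cluster --name="$CLUSTER_NAME" --wait 10m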

@agilgur5
Contributor

agilgur5 commented Jun 10, 2020

@olemarkus indeed it does; I did not know that, and this issue was the place I found when searching, which doesn't mention it. Thanks for pointing that out! So I guess #7371 fixes this issue, more or less.

I will say that the docs are a bit confusing, because kops validate says something almost identical to kops validate cluster, but the flags are different.

@olemarkus
Member

kops validate alone isn't a valid command though, so the flags are only the global flags plus the help flag. But I'll see if I can make the examples a bit clearer.

@agilgur5
Contributor

agilgur5 commented Jun 11, 2020

Thanks @olemarkus, the changes in #9333 would've definitely made me notice that. Yeah, I recognize validate is not a valid command, but it previously gave details about validate cluster, so I either didn't realize there was a different doc or didn't notice that the validate cluster doc had a tiny difference. Now they're a good bit different from each other, so I think that's easier to tell. This issue is still first in a Google search, but hopefully there's now a handful of ways to discover that flag and not miss it 🙂
