
change gossip dns conn limit by ENV #5077

Merged (2 commits) on Jul 20, 2018

Conversation

@yancl (Contributor) commented Apr 28, 2018

So that we can run more than 64 nodes :)

@k8s-ci-robot added the needs-ok-to-test, size/M, and cncf-cla: yes labels on Apr 28, 2018
@yancl (Contributor, Author) commented Apr 28, 2018

/assign @andrewsykim

@chrislovecnm (Contributor) left a comment

@justinsb why do we even have this?? How about we bump to 1000?

    if gossipDnsConnLimit != "" {
        limit, err := strconv.Atoi(gossipDnsConnLimit)
        if err != nil {
            return nil, fmt.Errorf("cannot parse env GOSSIP_DNS_CONN_LIMIT value: %v", gossipDnsConnLimit)

Review comment (Member):

also add err to message

Reply from @yancl (Contributor, Author):

ok, thanks.
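
For illustration only, a minimal standalone sketch of that suggestion (parseConnLimit is a made-up helper name, not necessarily the exact code that landed in the PR); the point is simply that err is included in the formatted message:

    package main

    import (
        "fmt"
        "strconv"
    )

    // parseConnLimit is a hypothetical helper, used only to illustrate the
    // review suggestion above: include the underlying err in the message.
    func parseConnLimit(gossipDnsConnLimit string) (int, error) {
        limit, err := strconv.Atoi(gossipDnsConnLimit)
        if err != nil {
            return 0, fmt.Errorf("cannot parse env GOSSIP_DNS_CONN_LIMIT value %q: %v", gossipDnsConnLimit, err)
        }
        return limit, nil
    }

    func main() {
        if _, err := parseConnLimit("not-a-number"); err != nil {
            fmt.Println(err)
        }
    }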

        if err != nil {
            return nil, fmt.Errorf("cannot parse env GOSSIP_DNS_CONN_LIMIT value: %v", gossipDnsConnLimit)
        }
        connLimit = limit

Review comment (Member):

This line can just be above:

    connLimit, err := strconv.Atoi(gossipDnsConnLimit)

Reply from @yancl (Contributor, Author):

If we put it all in one statement, := declares a new variable in the inner scope, and the build fails with:

    protokube/pkg/gossip/mesh/gossip.go:47:3: connLimit declared and not used

so I split it :)
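
A standalone sketch of the scoping issue being discussed (assumed code, not the actual protokube/pkg/gossip/mesh/gossip.go; connLimitFromEnv is a made-up name): using := inside the if-block would declare a new connLimit that shadows the outer one, which is what triggers the "declared and not used" error, so the parsed value is assigned to a separate limit variable first:

    package main

    import (
        "fmt"
        "os"
        "strconv"
    )

    // connLimitFromEnv is a made-up helper illustrating the pattern discussed
    // above: read GOSSIP_DNS_CONN_LIMIT, fall back to a default, and assign to
    // the outer connLimit instead of shadowing it with ":=".
    func connLimitFromEnv(defaultLimit int) (int, error) {
        connLimit := defaultLimit
        if v := os.Getenv("GOSSIP_DNS_CONN_LIMIT"); v != "" {
            // connLimit, err := strconv.Atoi(v) // would declare a new inner-scope connLimit
            limit, err := strconv.Atoi(v)
            if err != nil {
                return 0, fmt.Errorf("cannot parse env GOSSIP_DNS_CONN_LIMIT value %q: %v", v, err)
            }
            connLimit = limit
        }
        return connLimit, nil
    }

    func main() {
        limit, err := connLimitFromEnv(64)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Println("gossip connection limit:", limit)
    }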

@yancl (Contributor, Author) commented Apr 30, 2018

Yes, I did consider making a PR that just changes the default conn limit from 64 to some larger number, for example 300, and I'd prefer that :)

I think the problem with running a lot of nodes using gossip DNS may be related to this issue.

@chrislovecnm (Contributor) commented:

@yancl thanks for your patience. I pinged @justinsb on this. It would be good to find out why we limited it

@justinsb (Member) commented:

1000 would still be a limit :-)

I thought it was number of peers, I didn't realize it was anything more than that. Do we know that this limit is in fact a problem? I'd be inclined to just bump it to 200 - I think that's what weave did.

@chrislovecnm (Contributor) commented:

@yancl what have you tested?

@yancl (Contributor, Author) commented May 24, 2018

@chrislovecnm sorry for the late response :)

Yes, we have set up a k8s cluster with gossip DNS enabled, but as we start more nodes than the limit (64), the new nodes don't show up in kubectl get nodes, and we find ->[xxx] error during connection attempt:xxx in the syslog. The code is here.

So I think the total number of nodes in a k8s cluster can't exceed ConnLimit + 1 when using gossip mesh. We can check this with the following steps:

  • Suppose ConnLimit is 1.
  • We first start 2 nodes; they mesh well.
  • Then we start a 3rd node; it can't connect to either of the 2 nodes above, according to the code here.
  • If the 3rd node cannot make a connection to either of those 2 nodes, it cannot get information from them and update the apiserver and etcd server addresses in /etc/hosts, so the node will not join the cluster.

Is that correct?

@justinsb Yes, 1000 would still be a limit; we can bump it to 200. 64 is too small for a medium-sized k8s cluster :)

thanks.
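
A toy simulation of the reasoning above (this is not weave's actual connection logic, just an illustration of the ConnLimit + 1 bound): once ConnLimit + 1 nodes form a full clique, every member is already at its limit, so an additional node has nowhere to attach:

    package main

    import "fmt"

    // Toy model: each node accepts at most connLimit connections. Build a full
    // clique of connLimit+1 nodes, then check whether one more node could
    // still connect to any member of that clique.
    func canJoin(connLimit int) bool {
        clique := connLimit + 1
        conns := make([]int, clique)

        // Every pair inside the clique is connected.
        for i := 0; i < clique; i++ {
            for j := i + 1; j < clique; j++ {
                conns[i]++
                conns[j]++
            }
        }

        // The new node needs at least one member with spare capacity.
        for _, c := range conns {
            if c < connLimit {
                return true
            }
        }
        return false
    }

    func main() {
        for _, limit := range []int{1, 64} {
            fmt.Printf("ConnLimit=%d: node %d can join the existing clique: %v\n",
                limit, limit+2, canJoin(limit))
        }
    }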

@justinsb added this to the 1.10 milestone on Jun 2, 2018
@levpaul commented Jul 4, 2018

I ran into this issue today! 65 nodes was the maximum! Is there any way we can bypass this at the moment? How would one pass this arg through to new nodes too?

@ghost commented Jul 4, 2018

When I built this PR myself, k8s recognized more than 64 nodes.
I put the following setting on the master nodes:

/etc/environment
GOSSIP_DNS_CONN_LIMIT=100

Then I executed the following commands:

$ cd /var/cache/kubernetes-install
$ ./nodeup --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8

@idealhack (Member) commented:

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test label on Jul 4, 2018
@jaredallard (Contributor) commented:

LGTM, this has impacted us as well. Little insane that this is an issue :(

@justinsb (Member) commented:

So I don't think this should be an env var that has to be set. I'm going to merge this PR, but I'm then going to send a PR that tries to set it automatically. I'd like to at least remove the need to set the env var every time you call kops.

My understanding is that 64 nodes shouldn't be a hard limit, but rather each node will limit itself to 64 peers. As long as we don't form two disjoint segments, we should be able to go to much bigger sizes.

But I'll test it out!

/approve
/lgtm

@k8s-ci-robot added the lgtm label on Jul 20, 2018
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb, yancl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jul 20, 2018
@justinsb (Member) commented:

/retest

@k8s-ci-robot merged commit 2dbb6e8 into kubernetes:master on Jul 20, 2018
justinsb added a commit to justinsb/kops that referenced this pull request Jul 21, 2018
This simply turns off gossip connection limits, so we shouldn't ever have to manually configure them.

Follow on to kubernetes#5077
@justinsb (Member) commented Jul 21, 2018

I haven't been able to reproduce the problem, but I agree that it should happen! But #5486 should just remove the limit entirely.

Edit: I was able to reproduce it. I started at 60 nodes + 1 master and then went to 80 nodes. I think it happens if you have 64 members that form a connected clique and refuse to admit anyone else.

justinsb added a commit to justinsb/kops that referenced this pull request Jul 22, 2018
This simply turns off gossip connection limits, so we shouldn't ever have to manually configure them.

Follow on to kubernetes#5077
@shrinandj (Contributor) commented:

I understand this issue may be long forgotten. But it might help others who land here... hence this question.

If the connection limit has been removed entirely, the original problem this PR (or having the GOSSIP_DNS_CONN_LIMIT env variable) was trying to fix is also fixed. Explicitly setting GOSSIP_DNS_CONN_LIMIT to some value less than the total number of nodes could lead to problems (of running into a clique), right?
