-
Notifications
You must be signed in to change notification settings - Fork 6.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retries ssh connection for Gather node certs #10515
Retries ssh connection for Gather node certs #10515
Conversation
This allows this task to work with a forks count > 10 and the default configuration of sshd, which is to limit sessions to 10. (see MaxSessions in sshd_config). Since this is a delegate_to task, it connects to the same host (first etcd) for each node in the cluster, thus easily going above 10. Raising the ssh connection attempts allow for more robustness, without decreasing the forks count or serialising the tasks, which could slow the task (or the playbook as a whole, if decreasing forks).
Hi @VannTen. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@@ -14,6 +14,8 @@ | |||
- "{{ my_etcd_node_certs }}" | |||
|
|||
- name: Gen_certs | Gather node certs | |||
vars: | |||
ansible_ssh_retries: 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about setting a ansible_ssh_retries only here. I think if there is something to change it should probably be done globally, possibly in many tasks or even maybe having a special dummy task early to "warmup" the ssh connection (as they should be kept aftewards).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I only put it here because from a quick grep, it looks like this is the only task with the following:
delegate_to
- no
run_once
- acts on all nodes (or at least most of them, not only etcd/control plane)
Which produces the high numbers of ssh connections.
I'm not sure a "warmup" task would do anything, because this is more about sessions than connections really. Basically MaxSessions in sshd applies to multiplexing ("Specifies the maximum number of open shell, login or subsystem (e.g. sftp) sessions permitted per network connection").
Doing that change globally would not be a good idea, I think, because it would prevent "failing fast" on the whole playbook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and this makes me wonder if that problems with MaxSessions would also happens if that task was not using shell
. I'm not sure how that's working precisely in ansible, delegate_to task might be coalesced somehow when not using the shell module... 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok I see, then all good on my side for this patch. However Increasing MaxSession might be something that we could do in Kubespray if you are operating on a high enough number of nodes somehow (we could do that only on the first etcd nodes + first control plane nodes?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, yeah, we could. That would require to restart sshd though, and I found it intrusive than necessary for that one task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iirc it's not super intrusive to do that, for instance it somehow preserve the current connection and you don't need to reconnect and so on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(that's probably something to deal with possibly in another patch in any case)
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Thanks @VannTen /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: MrFreezeex, VannTen, yankay The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This allows this task to work with a forks count > 10 and the default configuration of sshd, which is to limit sessions to 10. (see MaxSessions in sshd_config). Since this is a delegate_to task, it connects to the same host (first etcd) for each node in the cluster, thus easily going above 10. Raising the ssh connection attempts allow for more robustness, without decreasing the forks count or serialising the tasks, which could slow the task (or the playbook as a whole, if decreasing forks).
This allows this task to work with a forks count > 10 and the default configuration of sshd, which is to limit sessions to 10. (see MaxSessions in sshd_config). Since this is a delegate_to task, it connects to the same host (first etcd) for each node in the cluster, thus easily going above 10. Raising the ssh connection attempts allow for more robustness, without decreasing the forks count or serialising the tasks, which could slow the task (or the playbook as a whole, if decreasing forks).
What type of PR is this?
/kind bug
What this PR does / why we need it:
This allows the 'Gen_certs | Gather node certs' tasks to work with a forks count > 10 and the default
configuration of sshd, which is to limit multiplexed sessions to 10. (see MaxSessions in sshd_config).
Since this is a delegate_to task, it connects to the same host (first
etcd) for each node in the cluster, thus easily going above 10.
Raising the ssh connection attempts allow for more robustness, without
decreasing the forks count or serialising the tasks, which could slow
the task (or the playbook as a whole, if decreasing forks).
Which issue(s) this PR fixes:
Fixes #10514
Special notes for your reviewer:
The possible issue I see is if the user has set a ansible_ssh_retries higher that 10 (for whatever reasons).
In this case we will lower this, which could have consequences.
However, I think that is an unlikely scenario, though possible.
Does this PR introduce a user-facing change?: