Retries ssh connection for Gather node certs #10515

VannTen · 2023-10-11T09:54:53Z

What type of PR is this?
/kind bug

What this PR does / why we need it:

This allows the 'Gen_certs | Gather node certs' task to work with a forks count > 10 and the default
configuration of sshd, which limits multiplexed sessions to 10 per connection (see MaxSessions in sshd_config).

Since this is a delegate_to task, it connects to the same host (the first
etcd node) once for each node in the cluster, thus easily going above 10 sessions.

Raising the number of ssh connection attempts allows for more robustness without
decreasing the forks count or serialising the task, either of which could slow
things down (the task itself if serialised, the playbook as a whole if forks is decreased).
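
As a rough illustration of the pattern described above, here is a minimal sketch of a task that runs once per host but is always delegated to the same machine. The task name, the placeholder command, and the `etcd` group name are assumptions made for the example and are not copied from the Kubespray role; only `ansible_ssh_retries: 10` comes from this PR.

```yaml
# Illustrative sketch only; not the actual Kubespray task.
- name: Example | Run a command on the first etcd node for every host
  ansible.builtin.command: /bin/true            # hypothetical placeholder command
  delegate_to: "{{ groups['etcd'] | first }}"   # assumed inventory group name
  changed_when: false
  # No run_once: with forks > 10, more than 10 sessions can be requested
  # on the single multiplexed connection to the delegated host at once.
  vars:
    ansible_ssh_retries: 10   # retry the ssh connection instead of failing immediately
  # Alternative this PR avoids: adding `throttle: 10` to the task would cap
  # concurrency, at the cost of partly serialising it.
```

Setting the variable at task level keeps the extra retries scoped to this one task, so the rest of the playbook keeps its usual connection behaviour.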

Which issue(s) this PR fixes:
Fixes #10514

Special notes for your reviewer:

The possible issue I see is if the user has set an ansible_ssh_retries higher than 10 (for whatever reason).
In this case we would lower it, which could have consequences.
However, I think that scenario is unlikely, though possible.

Does this PR introduce a user-facing change?:

NONE


k8s-ci-robot · 2023-10-11T09:55:03Z

Hi @VannTen. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MrFreezeex · 2023-10-11T17:45:43Z

roles/etcd/tasks/gen_nodes_certs_script.yml

@@ -14,6 +14,8 @@
    - "{{ my_etcd_node_certs }}"

 - name: Gen_certs | Gather node certs
+  vars:
+    ansible_ssh_retries: 10


Not sure about setting ansible_ssh_retries only here. I think if there is something to change, it should probably be done globally, possibly in many tasks, or maybe even by having a special dummy task early on to "warm up" the ssh connections (as they should be kept open afterwards).

Well, I only put it here because, from a quick grep, this looks like the only task that combines the following:

- delegate_to
- no run_once
- acts on all nodes (or at least most of them, not only etcd/control plane)

That combination is what produces the high number of ssh connections.
I'm not sure a "warmup" task would do anything, because this is really more about sessions than connections. MaxSessions in sshd applies to multiplexing ("Specifies the maximum number of open shell, login or subsystem (e.g. sftp) sessions permitted per network connection").

Doing that change globally would not be a good idea, I think, because it would prevent "failing fast" on the whole playbook.
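
For comparison, the global variant being argued against here would look roughly like the sketch below. The file path is a hypothetical example; `ansible_ssh_retries` is the same variable the patch sets at task level.

```yaml
# group_vars/all/ssh.yml (hypothetical file name, for illustration only)
# Applies to every task on every host, so a genuinely unreachable host
# would be retried up to 10 times before the failure is reported.
ansible_ssh_retries: 10
```

This is why the task-level `vars:` form was preferred in this thread: it leaves "fail fast" behaviour intact everywhere else.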

... and this makes me wonder whether the MaxSessions problem would also happen if that task was not using shell. I'm not sure precisely how that works in Ansible; delegate_to tasks might be coalesced somehow when not using the shell module... 🤔

Ah OK, I see; then all good on my side for this patch. However, increasing MaxSessions might be something we could do in Kubespray when operating on a high enough number of nodes (we could do that only on the first etcd nodes + first control plane nodes?).

Well, yeah, we could. That would require restarting sshd though, and I found that more intrusive than necessary for that one task.

IIRC it's not super intrusive to do that; for instance, it preserves the current connections, so you don't need to reconnect and so on.

(that's probably something to deal with possibly in another patch in any case)
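
A possible shape for that follow-up, as a hedged sketch rather than anything proposed in this PR: a small play that raises MaxSessions on the first etcd node and reloads sshd. The `etcd` group name, the `sshd_max_sessions` variable and its value, the config path, and the `sshd` service name are all assumptions (on Debian-family systems the service is typically named `ssh`).

```yaml
# Illustrative sketch only; names and values below are assumptions.
- name: Raise sshd MaxSessions on the first etcd node
  hosts: etcd[0]              # assumed inventory group name
  become: true
  vars:
    sshd_max_sessions: 50     # hypothetical value, size it to your forks count
  tasks:
    - name: Set MaxSessions in sshd_config
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?\s*MaxSessions\s'
        line: "MaxSessions {{ sshd_max_sessions }}"
        validate: /usr/sbin/sshd -t -f %s   # refuse to write a config sshd cannot parse
      notify: Reload sshd
  handlers:
    - name: Reload sshd
      ansible.builtin.service:
        name: sshd            # assumed service name
        state: reloaded
```

A reload is used instead of a restart to limit disruption. Connections that are already established, including a persistent multiplexed connection opened by Ansible, will probably keep the old limit until they are reopened.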

MrFreezeex · 2023-10-11T17:46:13Z

/ok-to-test

MrFreezeex

/lgtm

yankay · 2023-10-19T02:56:21Z

Thanks @VannTen

/approve

k8s-ci-robot · 2023-10-19T02:56:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MrFreezeex, VannTen, yankay

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [yankay]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 11, 2023

k8s-ci-robot requested review from cristicalin and EppO October 11, 2023 09:55

k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 11, 2023

MrFreezeex reviewed Oct 11, 2023


k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 11, 2023

MrFreezeex approved these changes Oct 11, 2023


k8s-ci-robot assigned MrFreezeex Oct 11, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2023

k8s-ci-robot merged commit 0b2e5b2 into kubernetes-sigs:master Oct 19, 2023
59 checks passed

VannTen mentioned this pull request Nov 28, 2023

REQUEST: New membership for VannTen kubernetes/org#4607

Closed

9 tasks
