Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retries ssh connection for Gather node certs #10515

Merged

Conversation

VannTen
Copy link
Contributor

@VannTen VannTen commented Oct 11, 2023

What type of PR is this?
/kind bug

What this PR does / why we need it:

This allows the 'Gen_certs | Gather node certs' tasks to work with a forks count > 10 and the default
configuration of sshd, which is to limit multiplexed sessions to 10. (see MaxSessions in sshd_config).

Since this is a delegate_to task, it connects to the same host (first
etcd) for each node in the cluster, thus easily going above 10.

Raising the ssh connection attempts allow for more robustness, without
decreasing the forks count or serialising the tasks, which could slow
the task (or the playbook as a whole, if decreasing forks).

Which issue(s) this PR fixes:
Fixes #10514

Special notes for your reviewer:

The possible issue I see is if the user has set a ansible_ssh_retries higher that 10 (for whatever reasons).
In this case we will lower this, which could have consequences.
However, I think that is an unlikely scenario, though possible.

Does this PR introduce a user-facing change?:

NONE

This allows this task to work with a forks count > 10 and the default
configuration of sshd, which is to limit sessions to 10. (see
MaxSessions in sshd_config).

Since this is a delegate_to task, it connects to the same host (first
etcd) for each node in the cluster, thus easily going above 10.

Raising the ssh connection attempts allow for more robustness, without
decreasing the forks count or serialising the tasks, which could slow
the task (or the playbook as a whole, if decreasing forks).
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 11, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @VannTen. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 11, 2023
@@ -14,6 +14,8 @@
- "{{ my_etcd_node_certs }}"

- name: Gen_certs | Gather node certs
vars:
ansible_ssh_retries: 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about setting a ansible_ssh_retries only here. I think if there is something to change it should probably be done globally, possibly in many tasks or even maybe having a special dummy task early to "warmup" the ssh connection (as they should be kept aftewards).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I only put it here because from a quick grep, it looks like this is the only task with the following:

  • delegate_to
  • no run_once
  • acts on all nodes (or at least most of them, not only etcd/control plane)

Which produces the high numbers of ssh connections.
I'm not sure a "warmup" task would do anything, because this is more about sessions than connections really. Basically MaxSessions in sshd applies to multiplexing ("Specifies the maximum number of open shell, login or subsystem (e.g. sftp) sessions permitted per network connection").

Doing that change globally would not be a good idea, I think, because it would prevent "failing fast" on the whole playbook.

Copy link
Contributor Author

@VannTen VannTen Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and this makes me wonder if that problems with MaxSessions would also happens if that task was not using shell. I'm not sure how that's working precisely in ansible, delegate_to task might be coalesced somehow when not using the shell module... 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok I see, then all good on my side for this patch. However Increasing MaxSession might be something that we could do in Kubespray if you are operating on a high enough number of nodes somehow (we could do that only on the first etcd nodes + first control plane nodes?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yeah, we could. That would require to restart sshd though, and I found it intrusive than necessary for that one task.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iirc it's not super intrusive to do that, for instance it somehow preserve the current connection and you don't need to reconnect and so on

Copy link
Member

@MrFreezeex MrFreezeex Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(that's probably something to deal with possibly in another patch in any case)

@MrFreezeex
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 11, 2023
Copy link
Member

@MrFreezeex MrFreezeex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2023
@yankay
Copy link
Member

yankay commented Oct 19, 2023

Thanks @VannTen

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MrFreezeex, VannTen, yankay

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2023
@k8s-ci-robot k8s-ci-robot merged commit 0b2e5b2 into kubernetes-sigs:master Oct 19, 2023
59 checks passed
guytet pushed a commit to guytet/kubespray that referenced this pull request Oct 30, 2023
This allows this task to work with a forks count > 10 and the default
configuration of sshd, which is to limit sessions to 10. (see
MaxSessions in sshd_config).

Since this is a delegate_to task, it connects to the same host (first
etcd) for each node in the cluster, thus easily going above 10.

Raising the ssh connection attempts allow for more robustness, without
decreasing the forks count or serialising the tasks, which could slow
the task (or the playbook as a whole, if decreasing forks).
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024
This allows this task to work with a forks count > 10 and the default
configuration of sshd, which is to limit sessions to 10. (see
MaxSessions in sshd_config).

Since this is a delegate_to task, it connects to the same host (first
etcd) for each node in the cluster, thus easily going above 10.

Raising the ssh connection attempts allow for more robustness, without
decreasing the forks count or serialising the tasks, which could slow
the task (or the playbook as a whole, if decreasing forks).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

delegate_to interaction with sshd_config: MaxSessions
4 participants