
KEP-3866 kube-proxy nftables to beta #4663

Merged
merged 1 commit into kubernetes:master on Jun 11, 2024

Conversation

danwinship
Contributor

  • One-line PR description: move KEP-3866 kube-proxy nftables to beta
  • Other comments:

Started this to try and nail down what was left to be done. Still some updates pending.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 24, 2024
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 24, 2024
job, but this is because it doesn't run with `minSyncPeriod: 10s` like
the `iptables` job does, and so it syncs rule changes more often.
(However, the fact that it's able to do so without the cluster falling over
is a strong indication that it _is_ more efficient.)
Member

we have some proof that it handles large scale more efficiently: https://gist.github.com/aojea/f9ca1a51e2afd03621744c95bfdab5b8

@danwinship danwinship mentioned this pull request May 25, 2024
@wojtek-t wojtek-t self-assigned this May 27, 2024
@@ -4,3 +4,5 @@
kep-number: 3866
alpha:
approver: "@wojtek-t"
beta:
Member

@danwinship - are you planning to get that into 1.31?

Contributor Author

yes... was hoping to get the mode-switching e2e job implemented, but it's not yet. But we're otherwise mostly OK on the graduation criteria. (I need to push a small update...)

@danwinship danwinship marked this pull request as ready for review June 7, 2024 13:20
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 7, 2024
@danwinship danwinship force-pushed the nftables-proxy-to-beta branch 2 times, most recently from 8c2633b to 04efa74 Compare June 7, 2024 15:33
@BenTheElder BenTheElder self-assigned this Jun 7, 2024
@BenTheElder left a comment (Member)

LGTM for Beta
[PRR Shadow]

`InClusterNetworkLatency`, but no one is really looking at the
results yet and they may need work before they are usable.
We have an [nftables scalability job]. Initial performance is fine; we
have not done a lot of further testing/improvement yet.
Member

But we do have evidence that it is more performant and should scale fine? I suspect this undersells it a bit :-)

Member

For GA we should probably take a look at how the 5,000 node jobs do.

@danwinship (Contributor Author) Jun 7, 2024

We do have some numbers but I don't have them handy and wanted to push this sooner rather than later...

For GA we should probably take a look at how the 5,000 node jobs do.

Kube-proxy performance does not directly scale in any way with node count. If the 5000 node job creates more services than the 100 node job, then that affects the metrics, but you don't actually need a large cluster to test large numbers of services; from kube-proxy's perspective, having 10,000 Services that all point to the same 3 Pods is exactly the same as having 10,000 Services that point to 30,000 unique pods.

@npinaeva has been doing some tests of that sort, and at the moment nftables kube-proxy's performance with many many services is much closer to old pre-partial-sync iptables kube-proxy's performance than current iptables kube-proxy's performance, but well, we were waiting to have better perf numbers before we got too heavily into optimization.
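To make the point above concrete, here is a hypothetical sketch (not from the PR; the script name, Service names, and selector are all invented for illustration) of generating many Services that all point at the same small backend set, which exercises kube-proxy's rule programming without a large cluster:

```shell
#!/bin/sh
# Hypothetical helper: emit N Service manifests that all select the same
# backend pods. From kube-proxy's perspective, syncing rules for these is
# equivalent to N Services with unique backends.
gen_services() {
  n=$1
  i=1
  while [ "$i" -le "$n" ]; do
    cat <<EOF
---
apiVersion: v1
kind: Service
metadata:
  name: scale-test-$i
spec:
  selector:
    app: scale-test-backend   # every Service points at the same pods
  ports:
  - port: 80
EOF
    i=$((i + 1))
  done
}

# e.g. gen_services 10000 | kubectl apply -f -
gen_services 3
```

The manifests are plain YAML on stdout, so the same generator can feed `kubectl apply` or be saved for repeatable perf runs.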

Member

What dimension of performance are we talking about here? Network programming latency is the main pain point today; dataplane performance (the linear-search evaluation) was always the main complaint, but it seems that is more a synthetic problem.

Contributor Author

She was testing dataplane performance but ran into the programming latency performance problem while trying to test that. It's not clear yet exactly what aspect of the test data was making it slow, other than that the slowness was in the actual reload operation in the kernel, not in the userspace parsing, etc. (In theory, nft should not be slower than iptables, given that modern iptables is just translating to nftables underneath. Though it's possible that, e.g., adding an element to a set is unexpectedly slow, which is something that would affect our nftables performance but not our iptables-via-iptables-nft performance.)

For dataplane performance there wasn't any observable improvement at the size she was testing, which is kind of expected because we hadn't ever had any complaints at the sizes kube-proxy could already support well. (In particular, pre-MinimizeIPTablesRestore, it was basically impossible to have so many services that the linear search in the nat table would affect dataplane performance.)

In the filter table, the limits are much smaller, and we've hit them twice (kubernetes/kubernetes#56164 / kubernetes/kubernetes#95252), but it's not a "things gradually get slower" problem, it's an "everything works fine until suddenly the entire network is completely broken" problem. We should try to reproduce that and confirm that nftables doesn't hit the same problem.
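For context on why nftables is expected to behave differently here: it can express this kind of filtering as a single verdict-map lookup instead of one linear rule per service. An illustrative ruleset sketch (this is not kube-proxy's actual ruleset; the table, map, and element values are invented):

```
# Illustrative only: one vmap lookup matches any number of services,
# so rule-evaluation cost does not grow linearly with service count
# the way a per-service rule list in the iptables filter table does.
table ip example-proxy {
	map no-endpoint-services {
		type ipv4_addr . inet_proto . inet_service : verdict
		elements = { 10.96.0.10 . tcp . 80 : drop,
			     10.96.0.11 . tcp . 443 : drop }
	}
	chain filter {
		type filter hook forward priority 0; policy accept;
		ip daddr . meta l4proto . th dport vmap @no-endpoint-services
	}
}
```

Reproducing the two linked failure modes against a ruleset of this shape would confirm (or refute) that expectation.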

users to be aware of whether they are depending on features that work
differently in the `nftables` backend, to help users decide whether
they can migrate to `nftables`, and whether they need any non-standard
configuration in order to do so.
Member

We should outline this somewhere, perhaps a feature blogpost?

Contributor Author

Yes, blog post sounds good

We do not currently have any apples-to-apples comparisons; the
`nftables` perf job uses more CPU than the corresponding `iptables`
job, but this is because it doesn't run with `minSyncPeriod: 10s` like
the `iptables` job does, and so it syncs rule changes more often.
Member

But we could test this with the same config? Why not?

Contributor Author

Because the config used by the iptables test is objectively incorrect. 😝

I suppose I should try to get that fixed
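For reference, the knob under discussion lives in the kube-proxy configuration file. A sketch of the relevant fragment (values are illustrative, not the perf jobs' actual configs):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "nftables"
nftables:
  # A larger minSyncPeriod batches rule changes and lowers CPU use,
  # at the cost of reacting more slowly to endpoint churn; setting it
  # to 10s here would match what the iptables perf job runs with.
  minSyncPeriod: 10s
```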

Contributor Author

@thockin
thockin commented Jun 7, 2024

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2024
@wojtek-t left a comment (Member)

@danwinship - just some minor comments from me from the PRR perspective. PTAL

@@ -1536,6 +1496,10 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
- We have at least the start of a plan for the next steps (changing
the default mode, deprecating the old backends, etc).

- No UNRESOLVED sections in the KEP. (In particular, we have figured
Member

About Beta graduation criteria:

The nftables mode has seen at least a bit of real-world usage.

Do we have that? Can you elaborate a bit on that in the KEP?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->
The new backend is not 100% compatible with the `iptables` backend.
Member

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

I'm fine with the test not being executed now, but can you briefly describe the test that will be run prior beta graduation? [and update the details based on results later (probably after freeze)]?

Contributor Author

this was answered in the "feature enablement/disablement" section but I'll mention it here too

@@ -1720,15 +1679,15 @@ question.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

TBD.
`sync_proxy_rules_nftables_sync_failures_total` indicates the number
Member

please fill in the SLO question above.

I would be fine with just saying that for now we're sticking to the network programming latency (https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md) and we will reevaluate that before GA.
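As a sketch of how an operator might watch the failure-count SLI quoted above (the `kubeproxy_` prefix on the exported metric name is an assumption here, not stated in the KEP text):

```promql
# Illustrative PromQL: fire if any nftables rule sync has failed
# in the last 5 minutes on any node.
rate(kubeproxy_sync_proxy_rules_nftables_sync_failures_total[5m]) > 0
```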

@danwinship danwinship force-pushed the nftables-proxy-to-beta branch from 04efa74 to c386ddb Compare June 11, 2024 11:10
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 11, 2024
@wojtek-t
Member

/lgtm
/approve PRR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 11, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2024
@k8s-ci-robot k8s-ci-robot merged commit c7c9de8 into kubernetes:master Jun 11, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jun 11, 2024
@danwinship danwinship deleted the nftables-proxy-to-beta branch June 12, 2024 06:57
@brandond

brandond commented Nov 7, 2024

@danwinship @wojtek-t shouldn't the new mode be added to the CLI flag help text now that it's on by default?
https://github.com/kubernetes/kubernetes/blob/v1.31.0/cmd/kube-proxy/app/options.go#L126

Since it's not mentioned in the CLI flag help, there is no coverage of nftables as a valid proxy-mode on Linux in the docs at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/

@danwinship
Contributor Author

yup, filed kubernetes/kubernetes#128698
