KEP-3866 kube-proxy nftables to beta #4663
Conversation
Skipping CI for Draft Pull Request.
job, but this is because it doesn't run with `minSyncPeriod: 10s` like
the `iptables` job does, and so it syncs rule changes more often.
(However, the fact that it's able to do so without the cluster falling over
is a strong indication that it _is_ more efficient.)
we have some proof that it handles large scale more efficiently: https://gist.github.com/aojea/f9ca1a51e2afd03621744c95bfdab5b8
@@ -4,3 +4,5 @@
kep-number: 3866
alpha:
  approver: "@wojtek-t"
beta:
@danwinship - are you planning to get that into 1.31?
yes... was hoping to get the mode-switching e2e job implemented, but it's not yet. But we're otherwise mostly OK on the graduation criteria. (I need to push a small update...)
LGTM for Beta
[PRR Shadow]
`InClusterNetworkLatency`, but no one is really looking at the
results yet and they may need work before they are usable.
We have an [nftables scalability job]. Initial performance is fine; we
have not done a lot of further testing/improvement yet.
But we do have evidence that it is more performant and should scale fine? I suspect this undersells it a bit :-)
For GA we should probably take a look at how the 5,000 node jobs do.
We do have some numbers but I don't have them handy and wanted to push this sooner rather than later...
> For GA we should probably take a look at how the 5,000 node jobs do.
Kube-proxy performance does not directly scale in any way with node count. If the 5000 node job creates more services than the 100 node job, then that affects the metrics, but you don't actually need a large cluster to test large numbers of services; from kube-proxy's perspective, having 10,000 Services that all point to the same 3 Pods is exactly the same as having 10,000 Services that point to 30,000 unique pods.
@npinaeva has been doing some tests of that sort, and at the moment nftables kube-proxy's performance with many many services is much closer to old pre-partial-sync iptables kube-proxy's performance than current iptables kube-proxy's performance, but well, we were waiting to have better perf numbers before we got too heavily into optimization.
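To make that concrete, here is a rough sketch of that kind of load test (the `shared-backend` Deployment, names, and counts are illustrative assumptions, not taken from this PR): many Services that all select the same small set of pods, which from kube-proxy's perspective looks the same as many Services with unique endpoints.

```sh
# Sketch only: create many Services that all select the same backend pods.
# No large cluster is required; kube-proxy's work scales with Service and
# endpoint counts, not node count.
for i in $(seq 1 10000); do
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: svc-${i}
spec:
  selector:
    app: shared-backend   # all Services point at the same few pods
  ports:
  - port: 80
    targetPort: 8080
EOF
done
```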
What dimension of performance are we talking about here? Network programming latency is the main pain point today; dataplane performance (the linear-search rule evaluation) was always the main complaint, but it seems that is more of a synthetic problem.
She was testing dataplane performance but ran into the programming latency performance problem while trying to test that. It's not clear yet exactly what aspect of the test data was making it slow, other than that the slowness was in the actual reload operation in the kernel, not in the userspace parsing, etc. (In theory, nft should not be slower than iptables, given that modern iptables is just translating to nftables underneath. Though it's possible that, e.g., adding an element to a set is unexpectedly slow, which is something that would affect our nftables performance but not our iptables-via-iptables-nft performance.)
For dataplane performance there wasn't any observable improvement at the size she was testing, which is kind of expected because we hadn't ever had any complaints at the sizes kube-proxy could already support well. (In particular, pre-`MinimizeIPTablesRestore`, it was basically impossible to have so many services that the linear search in the `nat` table would affect dataplane performance.)
In the `filter` table, the limits are much smaller, and we've hit them twice (kubernetes/kubernetes#56164 / kubernetes/kubernetes#95252), but it's not a "things gradually get slower" problem, it's an "everything works fine until suddenly the entire network is completely broken" problem. We should try to reproduce that and confirm that nftables doesn't hit the same problem.
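As a hedged illustration of the suspected slow path: the nftables backend updates service and endpoint IPs via set elements, so if adding elements were unexpectedly slow in the kernel it would hurt the `nftables` mode without affecting iptables-nft. The table and set names below are invented for illustration, not kube-proxy's actual ones.

```sh
# Illustrative only: the kind of incremental set update being discussed.
nft add table ip example
nft add set ip example cluster_ips '{ type ipv4_addr; }'
nft add element ip example cluster_ips '{ 10.96.0.10, 10.96.0.11 }'
```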
users to be aware of whether they are depending on features that work
differently in the `nftables` backend, to help users decide whether
they can migrate to `nftables`, and whether they need any non-standard
configuration in order to do so.
We should outline this somewhere, perhaps a feature blogpost?
Yes, blog post sounds good
We do not currently have any apples-to-apples comparisons; the
`nftables` perf job uses more CPU than the corresponding `iptables`
job, but this is because it doesn't run with `minSyncPeriod: 10s` like
the `iptables` job does, and so it syncs rule changes more often.
But we could test this with the same config? Why not?
Because the config used by the iptables test is objectively incorrect. 😝
I suppose I should try to get that fixed
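For reference, a minimal sketch of the configuration difference under discussion (values are illustrative, not the actual CI job configs; field names follow `KubeProxyConfiguration` v1alpha1 as I understand it):

```sh
# Sketch only: the iptables perf job batches rule syncs via minSyncPeriod,
# while the nftables job currently runs without it; aligning them would make
# the CPU comparison apples-to-apples.
cat <<'EOF' > kube-proxy-iptables.yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"
iptables:
  minSyncPeriod: 10s
EOF

cat <<'EOF' > kube-proxy-nftables.yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "nftables"
nftables:
  minSyncPeriod: 10s   # hypothetical: matched to the iptables job's setting
EOF
```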
/lgtm
@danwinship - just some minor comments from me from the PRR perspective. PTAL
@@ -1536,6 +1496,10 @@ We expect no non-infra related flakes in the last month as a GA graduation criteria
- We have at least the start of a plan for the next steps (changing
  the default mode, deprecating the old backends, etc).

- No UNRESOLVED sections in the KEP. (In particular, we have figured
About Beta graduation criteria:
> The nftables mode has seen at least a bit of real-world usage.
Do we have that? Can you elaborate a bit on that in the KEP?
<!--
Even if applying deprecation policies, they may still surprise some users.
-->
The new backend is not 100% compatible with the `iptables` backend.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
I'm fine with the test not being executed now, but can you briefly describe the test that will be run prior to beta graduation? [And update the details based on the results later (probably after freeze).]
this was answered in the "feature enablement/disablement" section but I'll mention it here too
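Not from the KEP text itself, but roughly what such a mode-switch (upgrade -> downgrade -> upgrade) check could look like operationally; the ConfigMap layout and exact values below are assumptions:

```sh
# Sketch only: flip kube-proxy between backends and verify Service
# connectivity at each step (iptables -> nftables -> iptables -> nftables).
kubectl -n kube-system get configmap kube-proxy -o yaml \
  | sed 's/mode: "iptables"/mode: "nftables"/' \
  | kubectl apply -f -
kubectl -n kube-system rollout restart daemonset kube-proxy
kubectl -n kube-system rollout status daemonset kube-proxy
# ...run connectivity checks, then repeat with the modes reversed.
```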
@@ -1720,15 +1679,15 @@ question.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- TBD.
+ `sync_proxy_rules_nftables_sync_failures_total` indicates the number
please fill in the SLO question above.
[I would be fine with just saying that for now we're sticking to the network programming latency (https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md) and we will reevaluate that before GA.]
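For illustration, one way an operator could look at those SLIs directly (10249 is kube-proxy's default metrics port; apart from the failure counter quoted above, the metric name is an assumption based on kube-proxy's existing network-programming-latency histogram):

```sh
# Sketch: scrape kube-proxy's metrics endpoint on a node and check the
# nftables sync failure counter plus the network programming latency
# histogram that the linked SLO is based on.
curl -s http://localhost:10249/metrics \
  | grep -E 'sync_proxy_rules_nftables_sync_failures_total|network_programming_duration_seconds'
```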
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: danwinship, thockin, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
@danwinship @wojtek-t shouldn't the new mode be added to the CLI flag help text now that it's on by default? Since it's not mentioned in the CLI flag help, there is no coverage of
yup, filed kubernetes/kubernetes#128698
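For context, the flag in question is `--proxy-mode`; the missing help-text entry is what kubernetes/kubernetes#128698 tracks. A sketch of selecting the backend via the flag rather than the config file:

```sh
# Sketch only: run kube-proxy with the nftables backend selected on the
# command line (most deployments set this via the config file instead).
kube-proxy --proxy-mode=nftables --v=2
```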
Started this to try and nail down what was left to be done. Still some updates pending.