Initial draft of nftables support #883

mheon · 2024-01-04T19:12:29Z

This implementation lacks isolation support at present (that'll be a follow-on PR) and has only very minimal testing at the moment (including absolutely zero testing for IPv6), but it does handle all core tasks, including forwarding and all types of port forwarding.

Fixes #816

packit-as-a-service · 2024-01-04T21:54:52Z

Ephemeral COPR build failed. @containers/packit-build please check.

mheon · 2024-01-05T14:26:58Z

Stripping WIP. Passes testing I've thrown at it. Now needs integration tests.

mheon · 2024-01-08T15:32:12Z

Integration tests added.

Caveat: I don't have local access to IPv6 here, so I cannot test IPv6 forwarding and port forwarding.

mheon · 2024-01-08T16:38:10Z

@baude @Luap99 I'm still working on the tests a bit but the code is ready for review

Luap99

Some initial look, not a full review.

FYI this seems to bloat the binary by about 1.5 MB, to about 21.5 MB total. Not that this is a blocker or anything, I was just curious. I think we could spend some time debloating this at some point.

Luap99 · 2024-01-08T16:51:51Z

src/firewall/nft.rs

+                    stmt::Statement::Match(stmt::Match {
+                        left: expr::Expression::BinaryOperation(expr::BinaryOperation::AND(
+                            Box::new(expr::Expression::Named(expr::NamedExpression::Meta(
+                                expr::Meta {
+                                    key: expr::MetaKey::Mark,
+                                },
+                            ))),
+                            Box::new(expr::Expression::Number(MASK)),
+                        )),
+                        right: expr::Expression::Number(MASK),
+                        op: stmt::Operator::EQ,
+                    }),
+                    stmt::Statement::Masquerade(None),


WTF is this syntax, this looks so hard to use.

mheon · 2024-01-08

It is atrocious. But somehow better to use than the string API.

jwhb · 2024-02-06

@mheon @Luap99

Hi, I'm the nftables-rs maintainer.

I agree with your concerns regarding the usage of the nftables-rs crate. I'm convinced that using the nftables JSON API is the way to go considering portability and maintainability.

Consider this usage example from nftnl-rs (native binding to libnftnl): https://github.com/mullvad/nftnl-rs/blob/main/nftnl/examples/add-rules.rs#L147-L153

Likewise we could add this usability layer to nftables-rs through macros. WDYT?

I'm very interested in your opinion.

Luap99 · 2024-02-06

I think using macros that add better syntax sounds great; having something closer to the real nft syntax would be ideal.

mheon · 2024-02-06

I am a massive fan of this idea. The existing syntax can be very verbose, and I ended up with a lot of convenience functions to remove that boilerplate; adding macros on the library side would make using the library much more convenient.

I also strongly agree with the JSON API bit. My original attempt at this was using the Mullvad library, but it did not support everything we needed, and the description of that library as being based off reverse-engineered usage of libnftnl inside the nft binary was... concerning.

jwhb · 2024-02-06

Thanks for your feedback.

@mheon: I went through the very same process with nftnl-rs, which didn't have the features I needed for another project. This is the reason I created nftables-rs.

I'll look into implementing nft-like syntax through macros.
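
For illustration only, the kind of convenience function mentioned in this thread might look like the minimal sketch below. The `Expr` and `Stmt` enums are stand-ins that merely mimic the nesting in the diff above; they are not the real nftables-rs types, and `match_mark` and `masquerade_if_marked` are hypothetical names.

```rust
// Stand-in types that only mimic the nesting seen in the diff above.
// They are NOT the real nftables-rs definitions.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Number(u32),
    MetaMark,
    And(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Stmt {
    Match { left: Expr, right: Expr },
    Masquerade,
}

/// Hypothetical helper: builds the equivalent of `meta mark & mask == mask`
/// so callers do not have to spell out the nested boxes every time.
fn match_mark(mask: u32) -> Stmt {
    Stmt::Match {
        left: Expr::And(Box::new(Expr::MetaMark), Box::new(Expr::Number(mask))),
        right: Expr::Number(mask),
    }
}

/// Hypothetical helper: the two statements from the snippet above,
/// "if the mark bit is set, masquerade".
fn masquerade_if_marked(mask: u32) -> Vec<Stmt> {
    vec![match_mark(mask), Stmt::Masquerade]
}
```

With helpers like these, a call site shrinks to something like `masquerade_if_marked(MASK)`; a macro layer in the library would aim at the same effect with syntax closer to nft's own.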

Luap99 · 2024-01-08T17:04:43Z

src/firewall/nft.rs

+}
+
+/// Convert a subnet into a chain name.
+fn get_subnet_chain_name(subnet: IpNet, dnat: bool) -> String {


I am not a fan of using the subnet as the chain name. What is wrong with the existing hash used in iptables?
We could also just use the network ID I think?
Is there a reason to have one chain per subnet vs. one chain per network? One per subnet looks incorrect to me: because nft supports both IPv4 and IPv6 at the same time, I would expect one chain that matches both the v4 and v6 subnets. More chains should also work, but doesn't this just make things more complicated?

mheon · 2024-01-08

I like using subnets because it reads easier in nftables output (very obvious what each chain does) and is completely unambiguous (no chance of a hash collision).

I could definitely combine v4 and v6 subnets into a single chain, but I think separating them is a bit more logical. It doesn't really need more rules, either (we need the same number of rules to actually forward traffic, and separate rules to forward traffic to the chain for v4 and v6 subnets).

Luap99 · 2024-01-08

Yeah, I need to dump an nft list ruleset output to see what it actually looks like to better judge what I prefer.

Luap99 · 2024-01-10

So I really do not like the name like that. My main issue is that with the default podman network create we just iterate through the subnets to assign them, so it is normal to have 10.89.0.0/24, 10.89.1.0/24, 10.89.2.0/24 and so on. This makes it hard for me to follow the chains, as only a single digit changes in the name. I have no problem with having the subnet as part of the name, but I would prefer if, let's say, 12 chars of the network ID were added as a suffix to make them more distinct to my eyes.

I know we don't really have to look at this often, but when we get bug reports we have to, and I think a more distinct name will make it easier to find problems (e.g. missing chains/rules).

mheon · 2024-01-10

Done. Only 8 characters though (I don't want things to get ridiculously long with IPv6 addresses)
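
As a rough sketch of the naming scheme discussed in this thread (subnet plus a short network-ID fragment), something like the following could work. The function name, its parameters, the position of the ID within the name, and the character mapping are all assumptions for illustration; only the `nv_..._nm16_dnat` style and the 8-character limit come from the conversation.

```rust
use std::net::IpAddr;

/// Hypothetical helper: build a chain name from a subnet and a short
/// fragment of the network ID. The exact layout is an assumption.
fn subnet_chain_name(addr: IpAddr, prefix_len: u8, network_id: &str, dnat: bool) -> String {
    // Mirror the `10_88_0_0` style seen in the ruleset output by mapping
    // the '.' and ':' separators to '_' (the ':' mapping is an assumption).
    let addr_part: String = addr
        .to_string()
        .chars()
        .map(|c| if c == '.' || c == ':' { '_' } else { c })
        .collect();

    // Keep at most 8 characters of the network ID so that names built
    // from IPv6 subnets do not get excessively long.
    let id_part: String = network_id.chars().take(8).collect();

    let suffix = if dnat { "_dnat" } else { "" };
    format!("nv_{id_part}_{addr_part}_nm{prefix_len}{suffix}")
}
```

Under these assumptions, the subnets 10.89.0.0/24 and 10.89.1.0/24 on different networks would differ in eight characters of the name rather than a single digit.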

mheon · 2024-01-08T20:47:34Z

@Luap99 One last failing test, and it's a firewalld reload one. Any idea what might be going on?

Luap99 · 2024-01-09T10:50:59Z

Looks like you do not call into firewalld again to set the subnet as trusted. (Do we even have to do that when using nftables?)
The VM is using f39, which has firewalld 2.0, which no longer flushes our custom fw rules by default. Looking at your code, you check that the nft rule exists and then skip (continue) the subnet loop, so the firewalld call is never made and the subnet is not trusted.

But because "nftables - port forwarding ipv4 - tcp with firewalld reload" passes, I wonder if we even need to be trusted with nftables?

mheon · 2024-01-09T13:09:05Z

The firewalld folks said it was safer to do so. I'll move the firewalld make-trusted call to the top of the loop.
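
A minimal sketch of that reordering, assuming a loop shaped roughly like the one described above. The function and its three closures are hypothetical stand-ins for the firewalld "make trusted" call, the "does our nft rule already exist" check, and the rule creation; none of them are real netavark APIs.

```rust
/// Hypothetical stand-in for the per-subnet setup loop.
fn setup_subnets<T, E, A>(
    subnets: &[&str],
    mut make_trusted: T,
    rule_exists: E,
    mut add_rules: A,
) where
    T: FnMut(&str),
    E: Fn(&str) -> bool,
    A: FnMut(&str),
{
    for &subnet in subnets {
        // Runs first, so the early `continue` below can no longer skip it.
        make_trusted(subnet);

        if rule_exists(subnet) {
            // The nft rules are still present (for example because the
            // reload did not flush them), so there is nothing to add.
            continue;
        }
        add_rules(subnet);
    }
}
```

The point of the ordering is that the trusted-zone call becomes unconditional per subnet, while rule creation stays idempotent.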

mheon · 2024-01-09T13:11:06Z

Alright, that should get tests passing. This is ready for review.

Luap99 · 2024-01-10T15:52:07Z

I noticed these errors in the journal when testing this on my host:

Jan 10 16:44:31 pholzing-fedora /usr/bin/podman[42596]: time="2024-01-10T16:44:31+01:00" level=info msg="netavark: [DEBUG netavark::firewall::firewalld] Removing firewalld rules for IPs 10.88.0.0/16\n"
Jan 10 16:44:31 pholzing-fedora firewalld[2202]: ERROR: UNKNOWN_ERROR: rule ref count bug, missing ref count: rule_key '{"%%POLICY_SORT_KEY%%": [0, 1, "trusted", "10.88.0.0/16", "", 0, 1, "trusted", "10.88.0.0/16", "", 0, -1], "chain": "filter_FORWARD_POLICIES", "expr": [{"match": {"left": {"payload": {"field": "saddr", "protocol": "ip"}}, "op": "==", "right": {"prefix": {"addr": "10.88.0.0", "len": 16}}}}, {"match": {"left": {"payload": {"field": "daddr", "protocol": "ip"}}, "op": "==", "right": {"prefix": {"addr": "10.88.0.0", "len": 16}}}}, {"jump": {"target": "filter_FWD_policy_netavark_portfwd"}}], "family": "inet", "table": "firewalld"}'
Jan 10 16:44:31 pholzing-fedora firewalld[2202]: ERROR: COMMAND_FAILED: UNKNOWN_ERROR: rule ref count bug, missing ref count: rule_key '{"%%POLICY_SORT_KEY%%": [0, 1, "trusted", "10.88.0.0/16", "", 0, 1, "trusted", "10.88.0.0/16", "", 0, -1], "chain": "filter_FORWARD_POLICIES", "expr": [{"match": {"left": {"payload": {"field": "saddr", "protocol": "ip"}}, "op": "==", "right": {"prefix": {"addr": "10.88.0.0", "len": 16}}}}, {"match": {"left": {"payload": {"field": "daddr", "protocol": "ip"}}, "op": "==", "right": {"prefix": {"addr": "10.88.0.0", "len": 16}}}}, {"jump": {"target": "filter_FWD_policy_netavark_portfwd"}}], "family": "inet", "table": "firewalld"}'
Jan 10 16:44:31 pholzing-fedora /usr/bin/podman[42596]: time="2024-01-10T16:44:31+01:00" level=info msg="netavark: [WARN  netavark::firewall::firewalld] Error removing subnet 10.88.0.0/16 from firewalld trusted zone: org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: UNKNOWN_ERROR: rule ref count bug, missing ref count: rule_key '{\"%%POLICY_SORT_KEY%%\": [0, 1, \"trusted\", \"10.88.0.0/16\", \"\", 0, 1, \"trusted\", \"10.88.0.0/16\", \"\", 0, -1], \"chain\": \"filter_FORWARD_POLICIES\", \"expr\": [{\"match\": {\"left\": {\"payload\": {\"field\": \"saddr\", \"protocol\": \"ip\"}}, \"op\": \"==\", \"right\": {\"prefix\": {\"addr\": \"10.88.0.0\", \"len\": 16}}}}, {\"match\": {\"left\": {\"payload\": {\"field\": \"daddr\", \"protocol\": \"ip\"}}, \"op\": \"==\", \"right\": {\"prefix\": {\"addr\": \"10.88.0.0\", \"len\": 16}}}}, {\"jump\": {\"target\": \"filter_FWD_policy_netavark_portfwd\"}}], \"family\": \"inet\", \"table\": \"firewalld\"}'\n"

I don't think they are related to this PR but I have to retest with the main/fedora version. Do you see this as well?

mheon · 2024-01-10T16:43:08Z

Nope. Don't see anything like that in journal (F38, fully updated, Podman from main, Netavark from my branch)

Luap99 · 2024-01-10T16:45:25Z

f39 ships with firewalld 2.0 and f38 with 1.X I think? So likely a firewalld change then. Anyway we can debug this if you get around to the firewalld stuff, doesn't seem related to this at all as I can reproduce using the fedora version with iptables.

Luap99 · 2024-01-10T16:49:28Z

One question though:

	chain NETAVARK-HOSTPORT-DNAT {
		tcp dport 80 jump nv_10_88_0_0_nm16_dnat
		tcp dport 81 jump nv_10_88_0_0_nm16_dnat
		tcp dport 82 jump nv_10_88_0_0_nm16_dnat
	}

	chain NETAVARK-HOSTPORT-SETMARK {
		meta mark set meta mark | 0x00002000
	}

	chain nv_10_88_0_0_nm16_dnat {
		ip saddr 10.88.0.0/16 tcp dport 80 jump NETAVARK-HOSTPORT-SETMARK
		ip saddr 127.0.0.1 tcp dport 80 jump NETAVARK-HOSTPORT-SETMARK
		tcp dport 80 dnat ip to 10.88.0.6:80
		ip saddr 10.88.0.0/16 tcp dport 81 jump NETAVARK-HOSTPORT-SETMARK
		ip saddr 127.0.0.1 tcp dport 81 jump NETAVARK-HOSTPORT-SETMARK
		tcp dport 81 dnat ip to 10.88.0.7:80
		ip saddr 10.88.0.0/16 tcp dport 82 jump NETAVARK-HOSTPORT-SETMARK
		ip saddr 127.0.0.1 tcp dport 82 jump NETAVARK-HOSTPORT-SETMARK
		tcp dport 82 dnat ip to 10.88.0.8:80
	}

What is the point of the dport match in NETAVARK-HOSTPORT-DNAT? The ..._dnat chain has to match the dport again anyway so just a direct jump should be easier?

mheon · 2024-01-10T16:49:32Z

Comments addressed (added network ID to subnet chain names). Should be ready for merge.

I'm on the firewalld stuff now, so clearly I need to do a dist-upgrade to f39 and see how badly we're broken (and what fixing it will mean - I really don't want to have to conventionalize on firewalld version)

mheon · 2024-01-10T16:52:22Z

I think that jump is mostly to keep the NETAVARK-HOSTPORT-DNAT chain small. We could eliminate the nv_10_88_0_0_nm16_dnat chain entirely and stuff everything into NETAVARK-HOSTPORT-DNAT, but that would get long and messy (especially if one container decides to do something like a 1000-port range forward). And so long as we have per-network DNAT chains, we might as well only jump to them if we actually have to.
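
To make the size concern concrete, here is a small sketch that renders the per-network chain rules for one forwarded range as nft-style text, following the three-rules-per-port pattern in the listing above. The function name and signature are hypothetical, it only handles TCP, and it assumes the range does not overflow `u16`.

```rust
/// Hypothetical sketch: render the per-network DNAT chain rules for one
/// forwarded TCP port range, one group of three rules per port.
/// Assumes `host_start + count` and `container_start + count` fit in u16.
fn dnat_rules(
    subnet: &str,
    container_ip: &str,
    host_start: u16,
    container_start: u16,
    count: u16,
) -> Vec<String> {
    let mut rules = Vec::new();
    for offset in 0..count {
        let host_port = host_start + offset;
        let container_port = container_start + offset;
        // Mark hairpin traffic from the container subnet and from localhost.
        rules.push(format!(
            "ip saddr {subnet} tcp dport {host_port} jump NETAVARK-HOSTPORT-SETMARK"
        ));
        rules.push(format!(
            "ip saddr 127.0.0.1 tcp dport {host_port} jump NETAVARK-HOSTPORT-SETMARK"
        ));
        // The actual destination NAT for this single port.
        rules.push(format!(
            "tcp dport {host_port} dnat ip to {container_ip}:{container_port}"
        ));
    }
    rules
}
```

With this layout a 1000-port range yields 3000 rules, which is why keeping them in a per-network chain rather than in the shared dispatch chain helps readability.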

Luap99 · 2024-01-10T17:21:33Z

Ok, I assumed a port range would actually result in only one rule, not one per port, but it seems such a feature (https://bugzilla.netfilter.org/show_bug.cgi?id=1501) is not yet implemented or is very new. Not sure what the status is (finding out whether patches on the ML are merged is always such a chore).

mheon · 2024-01-10T17:27:33Z

If it is merged, it's not in the JSON schema yet, so we can't make use of it. It's unfortunate because it's the only place ranges don't work.

RE: Firewalld - The release notes say 2.0 has no breaking changes. I'm looking deeper.

Luap99 · 2024-01-10T17:28:32Z

src/network/internal_types.rs

@@ -35,6 +37,8 @@ pub struct TearDownNetwork {
 pub struct PortForwardConfigGeneric<Ports, IpAddresses> {
    /// id of container
    pub container_id: String,
+    /// id of the network
+    pub id: String,


Ah, I just remembered the JSON parser is very strict and fields are required by default, so the netavark firewalld reload service would fail to parse the config files created with the old version. Not a huge deal, but I think just adding #[serde(default)] to both fields might solve it. The string would be empty, but given that the iptables code does not use it, that would be fine.

Also for consistency I think network_id is a better name
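
A minimal sketch of that suggestion, assuming the `serde` (with the derive feature) and `serde_json` crates. `PortForwardConfigSketch` is a simplified, hypothetical stand-in for the struct in the diff above, not the real definition.

```rust
use serde::Deserialize;

/// Simplified, hypothetical stand-in for the struct in the diff above.
#[derive(Debug, Deserialize)]
pub struct PortForwardConfigSketch {
    /// id of container
    pub container_id: String,
    /// id of the network; falls back to an empty string when the field
    /// is missing, e.g. in config files written by an older version
    #[serde(default)]
    pub network_id: String,
}

/// Parse a stored config. Without `#[serde(default)]`, JSON that lacks
/// `network_id` would be rejected with a "missing field" error.
pub fn parse_config(json: &str) -> Result<PortForwardConfigSketch, serde_json::Error> {
    serde_json::from_str(json)
}
```

With the attribute in place, an old-style document such as `{"container_id": "abc"}` should parse with `network_id` set to `""`.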

Adds an nftables firewall backend and tests for said backend. Implements basic forwarding, port forwarding, and teardown for all relevant rules. Heavily based on our existing iptables driver but with a number of improvements (we live in a dedicated table, so this should play much more nicely with other tools using the firewall; IPv4 and IPv6 share a table and almost all code; and rule structure is a bit simpler because we do have our own table and don't have to worry about cluttering up the FORWARD chain, we'll be the only ones using it).

This implementation presently does not support isolation; that will be added in a follow-on.

Fixes containers#816

Signed-off-by: Matthew Heon <[email protected]>

Luap99

LGTM assuming tests pass, nice work. I checked the simple bridge example with #884 and see a 4-5x speedup with nftables compared to iptables.

/lgtm
/hold

openshift-ci · 2024-01-10T18:00:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99,mheon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mheon · 2024-01-10T18:28:51Z

/hold cancel

openshift-ci bot added do-not-merge/work-in-progress approved labels Jan 4, 2024

mheon force-pushed the nftables branch 3 times, most recently from 70de160 to 521b5b3 Compare January 4, 2024 21:46

mheon force-pushed the nftables branch 2 times, most recently from 102cdb5 to 4f3bd79 Compare January 5, 2024 02:18

This was referenced Jan 5, 2024

podman suport, using podmans docker compatibility mode nearly works but crashes Overview and Containers pages lisaac/luci-app-dockerman#181

Open

Podman doesn't set up firewall for use for 22.03.5 openwrt/packages#22255

Open

mheon force-pushed the nftables branch from 4f3bd79 to cd511cd Compare January 5, 2024 14:25

mheon changed the title ~~[WIP] Initial draft of nftables support~~ Initial draft of nftables support Jan 5, 2024

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 5, 2024

mheon force-pushed the nftables branch 3 times, most recently from 601c63a to a9ff1ce Compare January 8, 2024 15:29

mheon force-pushed the nftables branch 2 times, most recently from b6d9c7d to c19a317 Compare January 8, 2024 16:37

Luap99 reviewed Jan 8, 2024

mheon force-pushed the nftables branch 3 times, most recently from 11e4f4c to e9ae31c Compare January 8, 2024 20:05

mheon force-pushed the nftables branch from e9ae31c to d7a9fd2 Compare January 9, 2024 13:10

mheon force-pushed the nftables branch 3 times, most recently from 5753bdf to 77f8df8 Compare January 9, 2024 15:57

mheon force-pushed the nftables branch from 77f8df8 to dac8186 Compare January 10, 2024 16:07

mheon force-pushed the nftables branch from dac8186 to d7802d0 Compare January 10, 2024 16:46

mheon force-pushed the nftables branch 2 times, most recently from be3f504 to 3a067df Compare January 10, 2024 17:18

Luap99 reviewed Jan 10, 2024

mheon force-pushed the nftables branch 2 times, most recently from c8c548a to 2a722d4 Compare January 10, 2024 17:40

mheon force-pushed the nftables branch from 2a722d4 to db43e53 Compare January 10, 2024 17:53

Luap99 approved these changes Jan 10, 2024

openshift-ci bot assigned Luap99 Jan 10, 2024

openshift-ci bot added do-not-merge/hold lgtm labels Jan 10, 2024

openshift-ci bot removed the do-not-merge/hold label Jan 10, 2024

openshift-merge-bot bot merged commit 5366e8f into containers:main Jan 10, 2024
24 checks passed

jwhb mentioned this pull request Mar 17, 2024

Improve API usability nftables-rs/nftables-rs#21

Open

westurner mentioned this pull request Apr 11, 2024

[feature request] nftables support moby/moby#26824

Open
